Detailed explanation of HBase basic principle


Introduction to HBase

HBase is a distributed, column-oriented, open-source database built on top of HDFS. The name HBase comes from "Hadoop database." HBase's computing and storage capacity depend on the Hadoop cluster.

HBase sits between NoSQL and RDBMS: data can be retrieved only by row key or by a row-key range, and only single-row transactions are supported (complex operations such as multi-table joins can be realized with Hive on top of HBase).

The main characteristics of HBase are:

  1. Large: a single table can have billions of rows and millions of columns.
  2. Column-oriented: storage and access control are organized by column family, and column families are retrieved independently.
  3. Sparse: null columns occupy no storage space, so tables can be designed to be very sparse.

Basic principles of HBase

System architecture

(Figure: HBase system architecture)

The components shown in this figure are explained below.


Client

  1. The client contains the interfaces to access HBase, and it maintains some caches (such as region location information) to speed up access to HBase.


Zookeeper

HBase can use either its built-in zookeeper or an external zookeeper. In production environments, an external zookeeper is generally used so that it can be managed uniformly.

Zookeeper's responsibilities in HBase are:

  1. Ensure that there is only one active master in the cluster at any time.
  2. Store the address entry of all regions.
  3. Monitor the status of region servers in real time, and notify the master immediately when region servers go online or offline.


HMaster

The HMaster's responsibilities are:

  1. Assign regions to region servers.
  2. Balance the load across region servers.
  3. Detect failed region servers and reassign their regions.
  4. Perform garbage collection on HDFS.
  5. Process schema update requests.

HRegion Server

  1. An HRegion server maintains the regions assigned to it by the HMaster and handles IO requests to those regions.
  2. An HRegion server is responsible for splitting regions that grow too large during operation.

As the figure shows, the HMaster does not participate in the process of a client accessing data in HBase (address lookups go to zookeeper and the HRegion server; data reads and writes go to the HRegion server).

The HMaster only maintains metadata about tables and HRegions, so its load is very low.

Table data model of HBase

(Figure: HBase table data model)

Row key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

  1. Access through a single row key
  2. Access through a row-key range
  3. Full table scan

A row key can be any string (the maximum length is 64 KB; in practice, row keys are usually 10-100 bytes). Inside HBase, the row key is stored as a byte array.

HBase sorts the data in a table by row key (in lexicographic order).

Data is stored sorted by the byte order of the row key. When designing keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).

Note:
the lexicographic ordering of integers is
1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ... To preserve the natural ordering of integers, row keys must be left-padded with zeros.
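The effect of lexicographic ordering, and the zero-padding fix, can be sketched in plain Python:

```python
# Lexicographic (byte-wise) ordering, as HBase sorts row keys.
keys = [str(i) for i in range(1, 22)]
print(sorted(keys)[:6])  # ['1', '10', '11', '12', '13', '14'] -- '2' sorts after '19'

# Left-padding with zeros restores the natural integer order.
padded = [str(i).zfill(4) for i in range(1, 22)]
assert sorted(padded) == padded  # already in natural order
print(padded[:3])
```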

A read or write of a single row is an atomic operation (no matter how many columns are read or written at once). This design decision makes it easy for users to reason about program behavior when the same row is updated concurrently.

Column family

Every column in an HBase table belongs to a column family. A column family is part of a table's schema (individual columns are not) and must be defined before the table is used.

Column names are prefixed with the column family name. For example, courses:history and courses:math both belong to the courses column family.

**Access control and disk and memory usage statistics are performed at the column-family level.
The more column families there are, the more files must participate in IO and seeks when a row is read. Therefore, do not define more column families than necessary.**


Column

A concrete column belongs to a column family, similar to a column defined in a MySQL table.


Timestamp

In HBase, the storage unit determined by row and column is called a cell. Each cell holds multiple versions of the same data, and versions are indexed by timestamp. The timestamp type is a 64-bit integer. A timestamp can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds; it can also be assigned explicitly by the client. If an application wants to avoid data version conflicts, it must generate its own unique timestamps. Within each cell, the different versions of the data are sorted in reverse chronological order, so the latest data comes first.

To avoid the management burden (both storage and indexing) caused by keeping too many versions, HBase provides two version-cleanup policies:

  1. Keep only the last n versions of the data.
  2. Keep only recent versions (set a time-to-live, TTL, for the data).

Both policies can be set per column family.
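The version behavior above can be sketched as a toy model (an illustration only, not the HBase implementation; `max_versions` stands in for the per-column-family setting):

```python
class Cell:
    """Toy model of HBase cell versioning: newest timestamp first,
    keep at most max_versions (a per-column-family setting)."""
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        # reverse chronological order, as HBase stores cell versions
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]  # drop oldest beyond the limit

    def get(self):
        """Latest version, like a plain read of the cell."""
        return self.versions[0][1] if self.versions else None

cell = Cell(max_versions=2)
for ts, v in [(100, "a"), (300, "c"), (200, "b")]:
    cell.put(ts, v)
print(cell.get())      # c  (timestamp 300 is newest)
print(cell.versions)   # [(300, 'c'), (200, 'b')] -- version 100 was evicted
```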


Cell

A cell is the unit uniquely determined by {row key, column (= <family> + <label>), version}.
Data in a cell has no type; everything is stored as bytes.

Version num

The version number of a piece of data; each datum can have multiple versions. By default the version number is the system timestamp, of type long.

Physical storage

1. Overall structure

(Figure: HBase overall storage structure)

  1. All rows in a table are arranged in the lexicographic order of their row keys.
  2. A table is split into multiple hregions along the row direction.
  3. Hregions are split by size (the default threshold is 10 GB). Each table starts with a single hregion; as data is inserted, the hregion grows, and when it passes the threshold it is split into two new hregions of equal size. As the number of rows grows, so does the number of hregions.
  4. The hregion is the smallest unit of distributed storage and load balancing in HBase. "Smallest unit" means that different hregions can be placed on different hregion servers, but a single hregion is never split across servers.
  5. Although the hregion is the smallest unit of load balancing, it is not the smallest unit of physical storage.

In fact, an hregion consists of one or more stores, and each store holds one column family.
Each store consists of one memstore and zero or more storefiles, as shown in the figure above.

2. Storefile and hfile structure

Storefile is saved on HDFS in hfile format.

The format of hfile is as follows:

(Figure: hfile format)

First of all, an hfile is of variable length; only two parts have a fixed length: the trailer and File Info. As the figure shows, the trailer contains pointers to the starting points of the other data blocks.

File Info records meta information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, etc.

The data index and meta index blocks record the starting point of each data block and meta block.

The data block is the basic unit of HBase I/O. To improve efficiency, the hregionserver has an LRU-based block cache mechanism. The size of each data block can be specified by a parameter when creating a table: large blocks favor sequential scans, while small blocks favor random lookups. Apart from the magic number at its beginning, a data block is a concatenation of KeyValue pairs. The magic content is a random number used to detect data corruption.

Each KeyValue pair in an hfile is a simple byte array, but this byte array contains many fields in a fixed structure. Let's look at the structure in detail.

(Figure: KeyValue structure inside an hfile)

It starts with two fixed-length values giving the length of the key and the length of the value. Then comes the key, which starts with a fixed-length value giving the row-key length, followed by the row key; then a fixed-length value giving the family length, followed by the family; then the qualifier; and finally two fixed-length values for the timestamp and the key type (Put/Delete). The value part has no such complex structure: it is pure binary data.
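The layout can be illustrated by packing and unpacking one KeyValue with Python's struct module. The field widths used here (4-byte key/value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) follow the description above; treat them as illustrative rather than authoritative:

```python
import struct

def pack_keyvalue(row: bytes, family: bytes, qualifier: bytes,
                  timestamp: int, key_type: int, value: bytes) -> bytes:
    """Serialize one KeyValue roughly as laid out in an hfile data block."""
    key = (struct.pack(">H", len(row)) + row          # 2-byte row length + row
           + struct.pack(">B", len(family)) + family  # 1-byte family length + family
           + qualifier                                # qualifier (no length prefix)
           + struct.pack(">q", timestamp)             # 8-byte timestamp
           + struct.pack(">B", key_type))             # 1-byte key type (e.g. Put)
    # two fixed-length prefixes: key length and value length
    return struct.pack(">II", len(key), len(value)) + key + value

kv = pack_keyvalue(b"row1", b"courses", b"history", 1700000000000, 4, b"A+")
key_len, value_len = struct.unpack_from(">II", kv, 0)
print(key_len, value_len)          # lengths recovered from the fixed-length header
assert kv[8 + key_len:] == b"A+"   # the value is pure binary data at the end
```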

An hfile is divided into six sections:

  1. Data block section – stores the data in the table; may be compressed.
  2. Meta block section (optional) – stores user-defined key-value pairs; may be compressed.
  3. File Info section – meta information about the hfile; not compressed. Users can also add their own meta information here.
  4. Data block index section – the index of the data blocks. The key of each index entry is the key of the first record in the indexed block.
  5. Meta block index section (optional) – the index of the meta blocks.
  6. Trailer section – fixed length, storing the offset of every other section. When an hfile is read, the trailer is read first; it gives the starting position of each section (each section's magic number is used as a sanity check). The data block index is then loaded into memory, so retrieving a key does not require scanning the whole hfile: the block containing the key is located in memory, the whole block is read with a single disk IO, and the key is then found inside the block. The data block index is evicted under an LRU mechanism.

Hfile's data blocks and meta blocks are usually stored compressed, which greatly reduces network IO and disk IO at the cost of CPU time for compression and decompression.
At present, hfile supports two compression codecs: gzip and LZO.
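The trailer-then-index lookup can be sketched with a sorted block index and a binary search (a conceptual model, not HBase code):

```python
import bisect

# Block index: the first key of each data block, kept sorted
# (keys within an hfile are sorted).
block_first_keys = [b"aardvark", b"hippo", b"otter", b"walrus"]
blocks = [  # each block is a sorted run of (key, value) pairs
    [(b"aardvark", b"1"), (b"cat", b"2")],
    [(b"hippo", b"3"), (b"lion", b"4")],
    [(b"otter", b"5"), (b"seal", b"6")],
    [(b"walrus", b"7"), (b"zebra", b"8")],
]

def get(key: bytes):
    # Find the last block whose first key <= key: only one block is read.
    i = bisect.bisect_right(block_first_keys, key) - 1
    if i < 0:
        return None
    return dict(blocks[i]).get(key)  # one "disk IO", then search inside the block

print(get(b"lion"))   # found in block 1 with a single block read
print(get(b"yak"))    # absent keys still cost only one block read
```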

3. Memstore and storefile

**An hregion consists of multiple stores, and each store contains all the data of one column family.
A store comprises one memstore located in memory and storefiles located on disk.**

When the amount of data in the memstore reaches a certain threshold, the hregionserver starts a flush process that writes the data to a storefile. Each flush produces a separate storefile.

When the size of the storefiles exceeds a certain threshold, the current hregion is split into two, and the hmaster assigns the two halves to hregion servers to achieve load balancing.

When a client reads data, it looks in the memstore first, and falls back to the storefiles if the data is not found there.
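This memstore-first read path can be sketched as a toy store (illustrative only):

```python
class Store:
    """Toy store: one in-memory memstore plus a list of flushed storefiles
    (each an immutable dict), newest flush first."""
    def __init__(self):
        self.memstore = {}
        self.storefiles = []  # newest flush first

    def put(self, key, value):
        self.memstore[key] = value

    def flush(self):
        if self.memstore:
            self.storefiles.insert(0, dict(self.memstore))  # storefiles are read-only
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:        # check memory first
            return self.memstore[key]
        for sf in self.storefiles:      # then storefiles, newest first
            if key in sf:
                return sf[key]
        return None

s = Store()
s.put("r1", "old")
s.flush()
s.put("r1", "new")       # the newer value is still in the memstore
print(s.get("r1"))       # new
s.flush()
print(s.get("r1"))       # new -- the newest storefile wins
```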

4. HLog(WAL log)

WAL means write-ahead log, similar to the binlog in MySQL; it is used for disaster recovery. All changes to the data are recorded in the HLog; if data in memory is lost, it can be recovered from the log.

Each region server maintains a single HLog rather than one per region, so log entries from different regions (and different tables) are mixed together. The purpose is to keep appending to a single file: compared with writing many files concurrently, this reduces disk seeks and therefore improves write performance for tables. The drawback is that if a region server goes offline, the log on that server must be split per region and distributed to other region servers before its regions can be recovered there.

The HLog file is an ordinary Hadoop sequence file:

  1. The key of the HLog sequence file is an HLogKey object. The HLogKey records the provenance of the written data: besides the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number persisted to the file system.
  2. The value of the HLog sequence file is an HBase KeyValue object, corresponding to the KeyValue in an hfile; see the description above.
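A minimal write-ahead-log model, illustrative only (the entry fields loosely mirror the HLogKey described above; the JSON-lines file format is an assumption made for the sketch, not HBase's sequence-file format):

```python
import json, os, tempfile

class WAL:
    """Toy write-ahead log: one append-only file per 'region server';
    entries carry table, region, and sequence-number fields."""
    def __init__(self, path):
        self.path = path
        self.seq = 0

    def append(self, table, region, row, value):
        self.seq += 1
        entry = {"table": table, "region": region, "seq": self.seq,
                 "row": row, "value": value}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")  # durable before the memstore write

    def replay(self, region):
        """Recover one region's edits after a crash (the 'log split' step
        filters the shared log down to a single region's entries)."""
        recovered = {}
        with open(self.path) as f:
            for line in f:
                e = json.loads(line)
                if e["region"] == region:
                    recovered[e["row"]] = e["value"]  # later seq overwrites earlier
        return recovered

path = os.path.join(tempfile.mkdtemp(), "hlog")
wal = WAL(path)
wal.append("t1", "region-a", "r1", "v1")
wal.append("t1", "region-b", "r9", "x")
wal.append("t1", "region-a", "r1", "v2")
print(wal.replay("region-a"))   # {'r1': 'v2'}
```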

Reading and writing process

1. Read request process:

The hregion servers store both the meta table and user table data. To access table data, the client first asks zookeeper for the location of the meta table, that is, which hregion server the meta table is stored on.

Then, using the hregion server address just obtained, the client accesses the hregion server holding the meta table, reads it, and obtains the metadata stored in it.

The client accesses the corresponding hregion server through the information stored in the metadata, and then scans the memstore and storefile of the hregion server to query the data.

Finally, the hregion server returns the data to the client.

View meta table information

hbase(main):011:0> scan 'hbase:meta'

2. Write request process:

For a write, the client likewise accesses zookeeper first to locate the meta table and obtain its metadata.

It then determines the hregion and hregion server corresponding to the data to be written.

The client sends a write request to the hregion server, and the hregion server receives the request and responds.

The data is first written to the HLog to prevent data loss.

Then the data is written to the memstore in memory.

If both the HLog and the memstore are written successfully, the write succeeds.

If the memstore reaches the threshold, the data in the memstore will be flushed into the storefile.

As more and more storefiles accumulate, a compaction is triggered that merges the many storefiles into one large storefile.

As the storefiles grow, the region grows; when it reaches the threshold, a split is triggered, dividing the region into two.
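The flush, compact, and split cascade above can be simulated with toy thresholds (the real defaults discussed in this document are 128 MB per memstore flush, 3 storefiles per compaction, and 10 GB per split; the entry counts below are stand-ins):

```python
FLUSH_SIZE = 3         # entries per memstore before flush (stand-in for 128 MB)
COMPACT_THRESHOLD = 3  # storefiles per store before compaction
SPLIT_SIZE = 10        # total entries per region before split (stand-in for 10 GB)

class Region:
    def __init__(self):
        self.memstore, self.storefiles = {}, []

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= FLUSH_SIZE:           # flush: memstore -> storefile
            self.storefiles.append(dict(self.memstore))
            self.memstore = {}
        if len(self.storefiles) >= COMPACT_THRESHOLD:  # compact: merge storefiles
            merged = {}
            for sf in self.storefiles:
                merged.update(sf)
            self.storefiles = [merged]

    def size(self):
        return sum(len(sf) for sf in self.storefiles) + len(self.memstore)

    def maybe_split(self):
        """Split when the region passes the size threshold."""
        if self.size() >= SPLIT_SIZE:
            rows = sorted(set().union(self.memstore, *self.storefiles))
            mid = rows[len(rows) // 2]
            return ("split at", mid)
        return None

r = Region()
for i in range(12):
    r.put(f"row{i:02d}", i)
print(r.maybe_split())   # the region has passed SPLIT_SIZE and splits at its midpoint
```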

Detailed description:

HBase uses memstore and storefile to store updates to tables.
When data is updated, it is first written to the log (WAL) and to memory (the memstore); the data in the memstore is kept sorted. When the memstore accumulates to a certain threshold, a new memstore is created and the old one is added to a flush queue, from which a separate thread flushes it to disk as a storefile. At the same time, the system records a redo point in zookeeper, indicating that changes before this point have been persisted.
If the system fails unexpectedly, data in memory (the memstore) may be lost; the log (WAL) is then used to recover the data written after the last checkpoint.

Storefiles are read-only and can never be modified once created, so an HBase update is really a continual append operation. When the number of storefiles in a store reaches a certain threshold, a compaction (minor_compact or major_compact) merges the modifications to the same key into one larger storefile; when the size of a storefile reaches a certain threshold, it is split into two storefiles of equal size.

Because table updates are continually appended, a compaction must access all the storefiles and the memstore in a store and merge them by row key. Since storefiles and memstores are both sorted, and storefiles carry in-memory indexes, the merge process is fairly fast.
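Because every input is already sorted by row key, compaction is essentially a k-way merge. Sketched with Python's heapq.merge (the newest-wins rule for duplicate keys is simplified here by tagging each input run with its age):

```python
import heapq

# Sorted runs: the memstore plus two storefiles, each sorted by row key.
# Lower age = newer data, which must win for duplicate keys.
memstore   = [("r1", "v1-new"), ("r4", "v4")]
storefile1 = [("r1", "v1-old"), ("r2", "v2")]
storefile2 = [("r3", "v3")]

runs = [memstore, storefile1, storefile2]
tagged = [[(key, age, val) for key, val in run] for age, run in enumerate(runs)]

merged = []
for key, age, val in heapq.merge(*tagged):  # k-way merge over sorted inputs
    if merged and merged[-1][0] == key:
        continue  # a newer version of this key was already emitted
    merged.append((key, val))

print(merged)  # [('r1', 'v1-new'), ('r2', 'v2'), ('r3', 'v3'), ('r4', 'v4')]
```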

Hregion management

Hregion allocation

At any time, an hregion can be assigned to only one hregion server. The hmaster records which hregion servers are currently available, which hregions are assigned to which hregion servers, and which hregions are unassigned. When a new hregion needs to be allocated and an hregion server has capacity, the hmaster sends a load request to that server, assigning the hregion to it. Once the hregion server receives the request, it begins serving the hregion.

Hregion server goes online

Hmaster uses zookeeper to track hregion server status. When an hregion server starts, it will first create its own znode in the server directory of zookeeper. Because hmaster subscribes to the change messages in the server directory, hmaster can get real-time notification from zookeeper when files in the server directory are added or deleted. So once the hregion server goes online, hmaster can get the message immediately.

Hregion server offline

When an hregion server goes offline, its session with zookeeper is disconnected, and zookeeper automatically releases the exclusive lock on the file representing that server. The hmaster then knows that one of two things has happened:

  1. The network between hregion server and zookeeper is disconnected.
  2. Hregion server is down.

In either case, the hregion server can no longer provide services for its hregion. At this time, hmaster will delete the znode data representing the hregion server in the server directory, and assign the hregion of the hregion server to other living nodes.

Working mechanism of hmaster

Master online

The master starts with the following steps:

  1. Obtain from zookeeper the unique lock representing the active master, preventing other hmasters from becoming master.
  2. Scan the server parent node on zookeeper to get the list of currently available hregion servers.
  3. Communicate with each hregion server to learn the current mapping between hregions and hregion servers.
  4. Scan the META region, compute the currently unassigned hregions, and put them into the list of hregions to be assigned.

Master offline

Because the hmaster only maintains metadata for tables and regions and does not participate in table data IO, an hmaster going offline only freezes metadata modifications (tables cannot be created or deleted, table schemas cannot be modified, hregions cannot be load-balanced, hregion online/offline events cannot be processed, and hregions cannot be merged; the one exception is hregion splitting, which proceeds normally because only the hregion server participates). Reading and writing of table data continue normally. Therefore, a short hmaster outage has no effect on the HBase cluster.

From the online process, we can see that all the information stored in hmaster is redundant information (which can be collected or calculated from other parts of the system)

Therefore, a typical HBase cluster always has one hmaster providing services, and one or more standby hmasters waiting for the opportunity to take over.

Three important mechanisms of HBase

1. Flush mechanism

1. (hbase.regionserver.global.memstore.size) Default: 40% of heap size
The size limit of a region server's global memstore. Exceeding it triggers a flush to disk. The default is 40% of the heap size, and a region-server-level flush blocks client reads and writes.

2. (hbase.hregion.memstore.flush.size) Default: 128 MB
When the memstore of a single region exceeds this size, the entire hregion's memstores are flushed.

3. (hbase.regionserver.optionalcacheflushinterval) Default: 1 hour
The longest time an edit can live in memory before it is automatically flushed.

4. (hbase.regionserver.global.memstore.size.lower.limit) Default: heap size × 0.4 × 0.95
Sometimes the cluster's write load is very high and writes consistently outpace flushes. In that case we want the memstore not to exceed a safety limit, so writes are blocked until the memstore shrinks back to a "manageable" size. The default is heap size × 0.4 × 0.95: when a region-server-level flush is triggered, client writes are blocked until the total region-server memstore size drops below heap size × 0.4 × 0.95.

5. (hbase.hregion.preclose.flush.size) Default: 5 MB
When a region is being closed and its memstore is larger than this value, a "pre-flush" is run first to clear the memstore, and then the region is taken offline. While a region is offline no writes are possible; if its memstore were large, the final flush would take a long time. The pre-flush empties most of the memstore while the region is still online, so the flush performed by the final close operation is very fast.

6. (hbase.hstore.compactionThreshold) Default: 3
The number of hfiles allowed in a single store before compaction. Each flush of a column family's memstore produces one hfile; by default, once a store holds more than three hfiles, they are merged and rewritten into a single new file. The larger this number, the less often compaction is triggered, but the longer each compaction takes.

2. Compact mechanism

Merge small storefiles into large hfiles.
Clean up expired data, including deleted data.
Trim the versions of each datum down to the configured number.

3. Split mechanism

When an hregion reaches the size threshold, the oversized hregion is split into two.
By default, an hregion is split when it reaches 10 GB.
