HBase learning 1 (basic introduction)

Time:2021-11-20

1、 What is HBase?

HBase is a distributed, column-oriented, open source database. The technology comes from the Google paper “Bigtable: A Distributed Storage System for Structured Data” by Fay Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a subproject of Apache Hadoop and is now a top-level Apache project. HBase differs from a general relational database in two ways: it is suited to storing unstructured data, and it is column based rather than row based.

2、 What are the characteristics of HBase?

  1. Large: a table can have hundreds of millions of rows and millions of columns.
  2. Column oriented: storage and permission control are organized by column family, and each column family can be retrieved independently.
  3. Sparse: empty (null) columns occupy no storage space, so a table can be designed to be very sparse.
  4. Schemaless: each row has a sortable primary key and any number of columns. Columns can be added dynamically as needed, and different rows in the same table can have different columns.
  5. Multiple versions of data: each cell can hold multiple versions of its data. By default, the version number is assigned automatically and is the timestamp at which the cell was inserted.
  6. Single data type: all data in HBase is stored as untyped byte arrays (often described as strings without a type).
  7. Expiration support: HBase supports TTL (time to live). Once the user sets an expiration time, data older than the TTL is cleared automatically by the system.
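The TTL rule in item 7 can be illustrated with a minimal Python sketch. It is not HBase code: the function name `visible_cells` and the `(timestamp, value)` cell layout are assumptions made for this example, and it only models the rule "data older than the TTL is no longer returned".

```python
def visible_cells(cells, ttl_seconds, now):
    """Keep only cells younger than the TTL.

    cells: list of (timestamp_seconds, value) pairs (hypothetical layout).
    """
    return [(ts, value) for ts, value in cells if now - ts < ttl_seconds]


# Three versions written at different times; "now" is fixed for the example.
cells = [(100, b"old"), (950, b"recent"), (990, b"newest")]
alive = visible_cells(cells, ttl_seconds=60, now=1000)
```

With a TTL of 60 seconds, only the two cells written within the last minute remain visible.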

3、 HBase data model?

HBase stores data in the form of tables. A table consists of rows and columns, and the columns are grouped into column families, as shown in the following figure.

[Figure: HBase data model]

  • Table: HBase organizes data into tables. Note that a table name must be a legal name for use in a file path, because an HBase table is mapped to files on HDFS.
  • Row: each row in a table represents one data object and is uniquely identified by a row key. The row key has no specific data type and is stored as binary bytes.
  • Column family: column families must be defined in advance when the HBase table is created, and every column in the table belongs to a column family. Once a column family is defined it cannot easily be changed, because it affects the physical storage structure of HBase. However, the column qualifiers inside a column family, and their values, can be added or deleted dynamically. Every row in a table has the same column families, but rows do not need to have the same column qualifiers and values within them, so the table structure is sparse.
  • Column qualifier: the data inside a column family is addressed by column qualifier. The concept of a “column” need not be taken too literally here; it can also be understood as a key-value pair in which the column qualifier is the key. The column qualifier also has no specific data type and is stored as binary bytes.
  • Cell: a row key, a column family and a column qualifier together identify a cell. The data stored in a cell is called cell data. Cells and cell data have no specific data type and are stored as binary bytes.
  • Timestamp: by default, each piece of cell data is written with a timestamp that identifies its version. When cell data is read without a timestamp, the latest version is returned. When cell data is written without a timestamp, the current time is used. HBase maintains the number of versions kept separately for each column family. Older HBase releases retained three versions by default; since HBase 0.96 the default is one version, and the number is configurable per column family.
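To make these concepts concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: the class name `SketchTable`, its methods and the `max_versions` parameter are all assumptions made for illustration. It models a table as a map from row key to (column family, column qualifier) to a list of timestamped versions.

```python
import time
from collections import defaultdict


class SketchTable:
    """Toy in-memory model of an HBase table (hypothetical, not the real API)."""

    def __init__(self, families, max_versions):
        self.families = set(families)      # column families are fixed up front
        self.max_versions = max_versions   # versions kept per cell
        # row key -> (family, qualifier) -> [(timestamp, value), ...], newest first
        self.rows = defaultdict(dict)

    def put(self, row, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        if ts is None:
            ts = int(time.time() * 1000)   # default version: current time in ms
        versions = self.rows[row].setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]   # drop versions beyond the limit

    def get(self, row, family, qualifier, ts=None):
        versions = self.rows.get(row, {}).get((family, qualifier), [])
        for version_ts, value in versions:
            if ts is None or version_ts == ts:
                return value               # newest first, so this is the latest
        return None


table = SketchTable(families=["info"], max_versions=3)
table.put(b"user1", "info", b"name", b"Ann", ts=1)
table.put(b"user1", "info", b"name", b"Anna", ts=2)
table.put(b"user2", "info", b"email", b"bob@example.com", ts=1)

latest = table.get(b"user1", "info", b"name")          # newest version
first = table.get(b"user1", "info", b"name", ts=1)     # explicit version
missing = table.get(b"user2", "info", b"name")         # sparse: no such cell
```

Note that `user1` and `user2` share the column family `info` but hold different qualifiers, and that the missing cell simply does not exist rather than being stored as a null.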

RowKey

A rowkey can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes). In HBase, the rowkey is stored as a byte array.

Designing the rowkey is one of the most important steps when using HBase. The following guidelines can help:

  1. Based on the characteristics of the business scenario, choose appropriate fields for the rowkey and order them by query frequency.
  2. The rowkey should spread data across the whole cluster as evenly as possible, to balance load and avoid hotspots.
  3. The rowkey should be as short as possible.
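One common way to follow these guidelines is to prefix the rowkey with a short hash-derived "salt". The Python sketch below is only an illustration: the function `salted_rowkey`, the field names `user_id` and `timestamp`, the bucket count and the `|` separator are assumptions, not part of HBase.

```python
import hashlib


def salted_rowkey(user_id: str, timestamp: int, buckets: int = 16) -> bytes:
    """Build a rowkey of the form <salt>|<user_id>|<timestamp>.

    The salt depends only on user_id, so all rows of one user stay
    adjacent, while different users are spread over `buckets` key ranges.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    salt = digest[0] % buckets
    # Zero-padded fields keep byte order consistent with numeric order.
    return f"{salt:02d}|{user_id}|{timestamp:013d}".encode("utf-8")


key_a = salted_rowkey("user42", 1637366400000)
key_b = salted_rowkey("user42", 1637366460000)
```

The most frequently queried field (`user_id` here) comes first after the salt, and the salt costs only two bytes, so the key stays short. The trade-off is that a scan across all users has to visit every salt bucket.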

As in other NoSQL databases, the rowkey is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

  1. Access by a single rowkey.
  2. Range scan: set the startRow and stopRow parameters of a scan to match a range of rowkeys.
  3. Full table scan: scan every row in the table directly.
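The three access paths can be sketched in Python over a dictionary of rows kept in rowkey order. This is a toy model, not the HBase client: the functions `get` and `scan` and the sample data are assumptions. The range scan treats the start row as inclusive and the stop row as exclusive.

```python
from bisect import bisect_left

# Hypothetical sample rows; HBase keeps rows sorted by rowkey bytes.
rows = {
    b"row-01": b"a",
    b"row-02": b"b",
    b"row-03": b"c",
    b"row-04": b"d",
    b"row-05": b"e",
}
sorted_keys = sorted(rows)


def get(row_key):
    """1. Access by a single rowkey."""
    return rows.get(row_key)


def scan(start_row=None, stop_row=None):
    """2. Range scan over [start_row, stop_row); 3. full scan if both are None."""
    lo = 0 if start_row is None else bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    return [(key, rows[key]) for key in sorted_keys[lo:hi]]


single = get(b"row-03")
ranged = scan(b"row-02", b"row-04")
everything = scan()
```

Because rows are sorted by rowkey, the range scan only needs to locate the two boundaries, whereas the full scan touches every row. This is why the rowkey design above matters so much for query performance.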

Physical storage model

In physical storage, HBase splits a table by row into multiple HRegions, and the HRegions are distributed across different HRegionServers.


Each HRegion consists of multiple Stores. Each Store holds one column family and consists of one MemStore and zero or more StoreFiles.
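The relationship between a Store, its MemStore and its StoreFiles can be sketched in Python. This is a simplified model under stated assumptions: the class `SketchStore` and the `flush_threshold` parameter are hypothetical, and the flush-on-threshold behavior is a simplification that the text above does not describe in detail.

```python
class SketchStore:
    """Toy Store for one column family: one MemStore plus 0..n StoreFiles."""

    def __init__(self, flush_threshold):
        self.memstore = {}        # recent writes, held in memory
        self.storefiles = []      # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted, immutable StoreFile."""
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Check the MemStore first, then StoreFiles from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            for stored_key, value in storefile:
                if stored_key == key:
                    return value
        return None


store = SketchStore(flush_threshold=2)
store.put(b"row-01", b"v1")
store.put(b"row-02", b"v1")      # reaches the threshold, triggers a flush
store.put(b"row-01", b"v2")      # newer value stays in the MemStore
```

A freshly created Store has zero StoreFiles; they only appear once the MemStore has been flushed.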


4、 HBase architecture?

Components in HBase include the Client, ZooKeeper, HMaster, HRegionServer, HRegion, Store, MemStore, StoreFile, HFile, HLog, etc.


Each table in HBase is divided by row key range into multiple sub-tables (HRegions). When an HRegion exceeds a certain threshold, it is split into two. This splitting is managed by the HRegionServer, and the assignment of HRegions is managed by the HMaster.

HMaster

  1. Assigns HRegions to HRegionServers.
  2. Balances load across HRegionServers.
  3. Detects failed HRegionServers and reassigns the HRegions that were on them.
  4. Reclaims garbage files on HDFS.
  5. Processes schema update requests.

The HMaster maintains only the metadata of HRegions, while table metadata is saved in ZooKeeper, so the load on the HMaster is very low.
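The first three HMaster duties (assignment, load balancing and failure handling) can be illustrated with a toy Python sketch. It is not HBase's actual balancer: the "least loaded server" policy, the function names `assign` and `handle_failure`, and the server and region names are all assumptions.

```python
def assign(region, servers):
    """Assign a region to the server that currently holds the fewest regions.

    servers: dict mapping server name -> list of region names.
    """
    target = min(servers, key=lambda name: (len(servers[name]), name))
    servers[target].append(region)
    return target


def handle_failure(failed_server, servers):
    """Remove a failed server and reassign the regions it was hosting."""
    orphaned = servers.pop(failed_server)
    for region in orphaned:
        assign(region, servers)
    return orphaned


servers = {"rs1": [], "rs2": [], "rs3": []}
for i in range(6):
    assign(f"region-{i}", servers)

moved = handle_failure("rs2", servers)
```

After the failure, every region is still hosted somewhere and the surviving servers carry an equal share.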

HRegionServer

  1. Maintains the HRegions assigned to it by the HMaster and handles the I/O requests for those HRegions (a client does not need the HMaster to take part when it accesses data in HBase).
  2. Splits HRegions that grow too large during operation.

HRegion

A table is split by row into multiple HRegions. The HRegion is the smallest unit of distributed storage and load balancing in HBase: different HRegions can be placed on different HRegionServers, but a single HRegion is never split across multiple HRegionServers.

HRegions are split by size. Initially, each table has only one HRegion. As data is inserted, the HRegion grows; when one of its column families reaches a certain threshold, the HRegion is split into two new HRegions.
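The split behavior can be sketched in Python as follows. This is a simplified model: the class `SketchRegion`, the helper `put`, the byte-count size measure, the choice of the middle key as the split point and the threshold value are all assumptions made for illustration.

```python
class SketchRegion:
    """Toy region covering the rowkey range [start_key, end_key)."""

    def __init__(self, start_key, end_key, rows=None):
        self.start_key = start_key   # inclusive; b"" means start of table
        self.end_key = end_key       # exclusive; b"" means end of table
        self.rows = dict(rows or {})

    def contains(self, key):
        return key >= self.start_key and (self.end_key == b"" or key < self.end_key)

    def size(self):
        return sum(len(k) + len(v) for k, v in self.rows.items())

    def split(self):
        """Split at the middle rowkey into two adjacent regions."""
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left = SketchRegion(self.start_key, mid,
                            {k: self.rows[k] for k in keys if k < mid})
        right = SketchRegion(mid, self.end_key,
                             {k: self.rows[k] for k in keys if k >= mid})
        return [left, right]


def put(regions, key, value, split_threshold):
    """Write into the region that owns the key; split it if it grew too large."""
    for i, region in enumerate(regions):
        if region.contains(key):
            region.rows[key] = value
            if region.size() > split_threshold and len(region.rows) > 1:
                regions[i:i + 1] = region.split()
            return
    raise KeyError(key)


# A new table starts with a single region covering the whole key space.
regions = [SketchRegion(b"", b"")]
for i in range(8):
    put(regions, f"row-{i:02d}".encode(), b"x" * 10, split_threshold=40)
```

After the inserts, the table consists of several regions whose key ranges are adjacent and together still cover the whole key space, and each row lives in exactly one region.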

Zookeeper

  1. Ensures that only one HMaster is active in the cluster at any time, avoiding a single point of failure for the HMaster.
  2. Stores the addressing entry point for all HRegions.
  3. Monitors HRegionServers going online and offline in real time, and notifies the HMaster in real time.
  4. Stores the schema and table metadata of HBase.

HBase depends on ZooKeeper. By default, HBase manages the ZooKeeper instance itself (starting and stopping it). The HMaster and the HRegionServers register with ZooKeeper at startup, which lets the HMaster track the health of each HRegionServer at any time.

Client

The HBase client communicates with the HMaster and the HRegionServers through RPC. For management operations, the client talks to the HMaster; for data reads and writes, it talks to the HRegionServers.

How does the client find them? Because the address of the meta table and the address of the HMaster are stored in ZooKeeper, the HBase client first looks them up in ZooKeeper.

After accessing ZooKeeper, the client can obtain the HRegionServer address for a given row from the meta table.
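The addressing steps can be sketched in Python with plain dictionaries and lists standing in for ZooKeeper and the meta table. Everything here is hypothetical: the host names, the key `meta-region-server`, the region boundaries and the function `locate_region_server` are assumptions for illustration, and no real RPC or ZooKeeper call is made.

```python
from bisect import bisect_right

# Stand-in for ZooKeeper: records which server hosts the meta table.
zookeeper = {"meta-region-server": "rs1.example.com:16020"}

# Stand-in for the meta table: (region start key, hosting server),
# sorted by start key. Each region runs up to the next start key.
meta_table = [
    (b"", "rs1.example.com:16020"),
    (b"row-40", "rs2.example.com:16020"),
    (b"row-80", "rs3.example.com:16020"),
]


def locate_region_server(row_key):
    """Return (server hosting meta, server hosting the region for row_key)."""
    # Step 1: ask ZooKeeper where the meta table lives.
    meta_server = zookeeper["meta-region-server"]
    # Step 2: in the meta table, find the last region whose start key
    # is less than or equal to the row key.
    start_keys = [start for start, _ in meta_table]
    index = bisect_right(start_keys, row_key) - 1
    # Step 3: the client would now send its read or write to this server.
    return meta_server, meta_table[index][1]


meta_host, data_host = locate_region_server(b"row-55")
```

The HMaster does not appear anywhere in this lookup, which matches the earlier point that clients read and write data without the HMaster's participation.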