HBase learning II (best practices)


1、 Rowkey optimization

The rowkey is the primary key of a row, and HBase stores rows sorted by rowkey in dictionary (lexicographic) order. Rowkey design is therefore very important: it directly affects the query efficiency of your application layer.

Regularized rowkey

The fields used in a rowkey sometimes differ in length, such as user_id. Regularizing the rowkey avoids inconsistent rowkey lengths, which would otherwise cause each request to return a different amount of data. One way to do this is to map the combined rowkey to a fixed-length hash value.
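As a minimal sketch of mapping a combined rowkey to an equal-length hash value, the Python snippet below hashes the joined fields with MD5, which always yields 16 bytes. The function name fixed_length_rowkey and the NUL separator are assumptions made for illustration, not part of any HBase API.

```python
import hashlib

def fixed_length_rowkey(*fields: str) -> bytes:
    """Map variable-length fields to a 16-byte rowkey via MD5."""
    # Join with a separator assumed never to occur inside a field,
    # so ("ab", "c") and ("a", "bc") hash to different keys.
    joined = "\x00".join(fields).encode("utf-8")
    return hashlib.md5(joined).digest()

short_key = fixed_length_rowkey("u7", "order")
long_key = fixed_length_rowkey("user_1234567890", "order")
print(len(short_key), len(long_key))  # both keys are 16 bytes long
```

Note that a hashed rowkey no longer sorts in the order of the original fields, so this suits point lookups better than range scans.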

Encoded rowkey

If the rowkey is stored as a string, such as a date in the format yyyy-MM-dd HH:mm:ss, it wastes a lot of storage space. The string can instead be encoded as a number, and the encoded number stored in the rowkey.
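To illustrate the saving, here is a small Python sketch that encodes a date string as an 8-byte big-endian integer (seconds since the epoch). The helper name encode_timestamp and the choice of UTC are assumptions. Big-endian encoding is used so that, for non-negative values, byte order matches numeric order.

```python
import struct
from datetime import datetime, timezone

def encode_timestamp(text: str) -> bytes:
    """Encode 'yyyy-MM-dd HH:mm:ss' (assumed to be UTC) as 8 big-endian bytes."""
    dt = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return struct.pack(">q", int(dt.timestamp()))

as_text = "2021-03-04 05:06:07"
encoded = encode_timestamp(as_text)
print(len(as_text.encode("utf-8")), len(encoded))  # 19 bytes as text, 8 bytes encoded
```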

High-cardinality dimension first

If the rowkey consists of multiple fields, put the high-cardinality dimension first, that is, a field with more than ten million distinct values. Putting user_id first, for example, lets that field act as a strong filter and greatly reduces the query scope.

Salting

If the first part of the combined rowkey is a timestamp, then because HBase sorts by rowkey, adjacent data is likely to be stored on the same HRegionServer. Since the most recent data is accessed most frequently, one HRegionServer ends up handling an excessive share of the requests, which creates a hotspot. A feasible scheme is to prefix the rowkey with a random number. This spreads the data evenly, but it complicates reads, because the reader no longer knows which prefix a row was written with. Schemes similar to salting include hashing the rowkey and reversing the rowkey (so that the frequently changing part comes first).
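The trade-off of a random prefix can be sketched in a few lines of Python. NUM_BUCKETS, salted_rowkey, and candidate_rowkeys are hypothetical names, and four buckets is an arbitrary choice. The point is that a writer picks one prefix at random, while a reader has to try every possible prefix.

```python
import random
from typing import List

NUM_BUCKETS = 4  # hypothetical bucket count, e.g. one per pre-split region

def salted_rowkey(rowkey: str, rng: random.Random) -> str:
    """Writer side: prefix the rowkey with a randomly chosen bucket number."""
    return f"{rng.randrange(NUM_BUCKETS)}-{rowkey}"

def candidate_rowkeys(rowkey: str) -> List[str]:
    """Reader side: the salt cannot be recomputed, so every bucket must be tried."""
    return [f"{bucket}-{rowkey}" for bucket in range(NUM_BUCKETS)]

rng = random.Random(0)
written = salted_rowkey("20210304050607", rng)
print(written in candidate_rowkeys("20210304050607"))  # True, but only after NUM_BUCKETS lookups
```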

2、 Hotspot issues

Data hotspots: a hotspot occurs when a large number of clients directly access one or a few nodes of the cluster (the access may be reads, writes, or other operations). The heavy traffic can push the single machine hosting the hot HRegion beyond its capacity, degrading performance or even making the HRegion unavailable. It also affects the other HRegions on the same HRegionServer, because the host cannot serve their requests, while the rest of the cluster's resources go to waste. A good data access pattern makes full and balanced use of the cluster.

You can address HRegion hotspots in the following ways:

  1. Reversing: reverse a fixed-length rowkey before storing it, so that the frequently changing part of the rowkey comes first and the rowkeys are effectively randomized. For example, reversing mobile phone numbers avoids the hotspots caused by numbers that start with a fixed prefix such as 138 or 139.

  2. Pre-splitting / salting: salting adds a prefix of random characters to each rowkey, so that the data is scattered across multiple HRegions and the load on the HRegions is balanced. For example, take an HBase table with four HRegions (bounded by [ , a), [a, b), [b, c), [c, )). The rowkeys before salting are abc001, abc002, and abc003. We prefix them with a, b, and c respectively, so the rowkeys after salting are a-abc001, b-abc002, and c-abc003. Before salting, all three rowkeys fall in the second HRegion; after salting, they are distributed across three HRegions. Theoretically, throughput after salting should be three times what it was before.

  3. Hashing or mod: the advantage of a hash over a random salt prefix is that a given row always gets the same prefix. This spreads the load across HRegions while keeping reads predictable. A deterministic hash (such as using the first four characters of the MD5 digest as the prefix) lets the client rebuild the complete rowkey and fetch the desired row directly with a get operation. If the rowkey is numeric, you can also consider a mod (modulo) prefix.
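The reversing, hashing, and mod approaches from the list above can be sketched as follows. All three helper names are hypothetical. The four-character MD5 prefix follows the example in the text, while the "-" separator and the bucket count of 4 are assumptions.

```python
import hashlib

def reversed_rowkey(rowkey: str) -> str:
    """Reverse a fixed-length key so the fast-changing tail comes first."""
    return rowkey[::-1]

def hash_prefixed_rowkey(rowkey: str, prefix_len: int = 4) -> str:
    """Prefix with the first hex characters of the MD5 digest; readers can recompute it."""
    prefix = hashlib.md5(rowkey.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}-{rowkey}"

def mod_prefixed_rowkey(numeric_id: int, buckets: int = 4) -> str:
    """For numeric keys, use the remainder as a deterministic bucket prefix."""
    return f"{numeric_id % buckets}-{numeric_id}"

print(reversed_rowkey("13812345678"))  # 87654321831
print(mod_prefixed_rowkey(10007))      # 3-10007
# The same input always yields the same prefixed key, so a single get suffices.
print(hash_prefixed_rowkey("abc001") == hash_prefixed_rowkey("abc001"))  # True
```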

3、 HBase three-dimensional ordering

HFile is the storage format of KeyValue data in HBase. As the HBase physical data model shows, HBase is column (family) oriented storage. Each cell is uniquely identified by {row key, column (= <family> + <label>), version} and is stored as one KeyValue: the key is {row key, column (= <family> + <label>), version}, and the value is the value in the cell.

The three dimensions of HBase's three-dimensional ordered storage are the rowkey (row primary key), the column key (<family> + <label>), and the timestamp (or version number).

Rowkey: when querying by rowkey range, we generally know the start rowkey. If a scan specifies only a start rowkey, say D, it returns every row whose rowkey is greater than or equal to D, even though we only want the rows whose rowkeys start with D. To restrict the result, we must also set an end rowkey. The end rowkey begins the same way as the start rowkey, with the rest chosen according to your rowkey layout; generally it is the start rowkey with its last position incremented by one.
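A sketch of deriving the end rowkey, using an in-memory sorted list in Python as a stand-in for the table. stop_row_for_prefix and scan are hypothetical helpers, not HBase calls. The scan treats the start key as inclusive and the end key as exclusive, and keys are compared byte by byte.

```python
from typing import List, Optional

def stop_row_for_prefix(prefix: bytes) -> Optional[bytes]:
    """Smallest key greater than every key starting with prefix (None = no upper bound)."""
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xff bytes cannot be incremented
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def scan(sorted_keys: List[bytes], start: bytes, stop: Optional[bytes]) -> List[bytes]:
    """Return keys in [start, stop), mimicking a range scan over sorted rowkeys."""
    return [k for k in sorted_keys if k >= start and (stop is None or k < stop)]

keys = sorted([b"C9", b"D", b"D001", b"D999", b"E000"])
print(scan(keys, b"D", None))                       # [b'D', b'D001', b'D999', b'E000']
print(scan(keys, b"D", stop_row_for_prefix(b"D")))  # [b'D', b'D001', b'D999']
```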

Column key: the column key is the second dimension. After rows are sorted by rowkey in dictionary order, cells with the same rowkey are sorted by column key, also in dictionary order. We can take advantage of this when designing a table. For example, if an inbox sometimes needs to be sorted by topic, we can use the topic as the column key, that is, design the column as “columnfamily + topic”.

Timestamp: timestamp is the third dimension, which is sorted in descending order, that is, the latest data is at the top.
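The three-dimensional ordering can be imitated with an ordinary sort: ascending by rowkey, then ascending by column key, then descending by timestamp. The sample cells below are invented for illustration, and plain ASCII strings stand in for the byte arrays HBase actually compares.

```python
# Each cell: (rowkey, column key, timestamp, value). Sample data is made up.
cells = [
    ("user1", "cf:topic_b", 1, "mail 3"),
    ("user1", "cf:topic_a", 1, "mail 1 (old)"),
    ("user0", "cf:topic_z", 5, "mail 0"),
    ("user1", "cf:topic_a", 2, "mail 1 (new)"),
]

def sort_key(cell):
    row, column, timestamp, _value = cell
    # Negating the timestamp puts the newest version first.
    return (row, column, -timestamp)

ordered = sorted(cells, key=sort_key)
for cell in ordered:
    print(cell)
# Values in order: mail 0, mail 1 (new), mail 1 (old), mail 3
```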

4、 Write optimization

  • Whether puts can be submitted in batches: using the batch put interface reduces the number of RPC requests between the client and the HRegionServer and improves write performance.
  • Whether puts can be submitted asynchronously: if the business can tolerate losing a small amount of data under abnormal circumstances, requests can also be submitted in asynchronous batches. The submission happens in two stages: after the user submits a write request, the data is written to the client cache and success is returned to the user; when the client cache reaches its threshold (2 MB by default), the cached data is submitted to the HRegionServer in a batch. Note that cached data may be lost if the client fails before the batch is sent. Usage: setAutoFlush(false).
  • Whether write requests are unbalanced: an unbalanced load lowers system concurrency and may also put a high load on some nodes, which then affects other services. A distributed system is particularly sensitive to one overloaded node, which can slow down the whole cluster. This is because many businesses submit read and write requests in batches; if some of the requests land on the overloaded node and cannot be answered in time, the whole batch times out.
  • Whether the written KeyValue data is too large: the size of a KeyValue has a great impact on write performance. When write performance is poor, check whether overly large KeyValues are the cause.
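The two-stage asynchronous submission described above can be modeled with a small client-side buffer. This is a simulation in plain Python, not HBase client code: BufferedWriter and send_batch are hypothetical names, and the 100-byte threshold in the demo is deliberately tiny so that the flushes are visible (the text cites 2 MB as the default).

```python
class BufferedWriter:
    """Hypothetical client-side write buffer that flushes once a byte threshold is reached."""

    def __init__(self, send_batch, threshold_bytes=2 * 1024 * 1024):
        self.send_batch = send_batch          # stands in for one batched RPC
        self.threshold_bytes = threshold_bytes
        self.buffer = []
        self.buffered_bytes = 0

    def put(self, rowkey: bytes, value: bytes) -> None:
        # Stage 1: the write only lands in the local buffer and returns immediately.
        self.buffer.append((rowkey, value))
        self.buffered_bytes += len(rowkey) + len(value)
        if self.buffered_bytes >= self.threshold_bytes:
            self.flush()

    def flush(self) -> None:
        # Stage 2: everything buffered so far goes out as one batch.
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []
            self.buffered_bytes = 0

rpc_batches = []
writer = BufferedWriter(rpc_batches.append, threshold_bytes=100)
for i in range(10):
    writer.put(f"row{i:03d}".encode(), b"x" * 24)  # 6 + 24 = 30 bytes per put
writer.flush()  # anything still buffered would be lost if the client died before this
print([len(batch) for batch in rpc_batches])  # [4, 4, 2]
```

Ten puts result in three batched sends instead of ten individual ones. The final explicit flush marks the window in which buffered data can be lost.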

5、 Query optimization

  • Whether gets can be submitted in batches: batching reduces the number of RPC requests between the client and the HRegionServer and improves query performance.
  • Whether the cache setting for large scans is reasonable: a scan may need to return a large amount of data. The client initiates the request, and the server returns the data to the client in multiple batches. This design avoids transferring too much data at once, which would put heavy pressure on both the server and the client. Each batch is loaded into a client-side cache, whose default size is 100 rows. A large scan that fetches a great deal of data may therefore issue hundreds or even tens of thousands of RPC requests. In that case, consider increasing the cache size appropriately.
  • Whether the request specifies a column family or column name: HBase is a column-family database. The data of one column family is stored together, and different column families are stored separately. To reduce I/O, specify the column family or column name in the request.
  • Disable caching when offline computing jobs access HBase: offline access to HBase is usually a one-time read, so there is no need to keep the data it reads in the BlockCache. It is recommended to disable block caching for such reads. Usage: setCacheBlocks(false) on the Scan.
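The effect of the scan cache size on the number of round trips is simple arithmetic, sketched below. The function name scan_rpc_count is hypothetical, and the estimate counts only the data-fetching requests, ignoring any calls needed to open or close the scanner.

```python
import math

def scan_rpc_count(total_rows: int, caching: int) -> int:
    """Rough number of fetch requests needed to return total_rows, caching rows per request."""
    return math.ceil(total_rows / caching)

print(scan_rpc_count(1_000_000, 100))   # 10000 requests with the default of 100 rows
print(scan_rpc_count(1_000_000, 1000))  # 1000 requests after raising the cache size
```

A larger cache means fewer round trips, but more memory held per batch on both the client and the server, so the value is worth tuning rather than maximizing.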