HBase reading notes – Overview

Time:2022-3-13

BigTable: built inside Google to store massive amounts of structured data and to read and write it efficiently.
HBase: a sparse, distributed, persistent, multidimensional, sorted map.

1. Logical view

  1. Table: a table contains multiple rows of data.
  2. Row: a row consists of a unique rowkey, multiple columns, and their corresponding values.
  3. Column: unlike a column in a relational database, a column in HBase is identified by a column family plus a qualifier.
  4. Timestamp: when a cell is written to HBase, it is assigned a timestamp by default, which serves as the version of the cell.
  5. Cell: a five-tuple (row, column, timestamp, type, value), where type is the operation type, such as put or delete, and timestamp is the version of the cell.

The logical view of HBase is easy to understand. Two points deserve attention: HBase introduces the concept of a column family, and the columns under a column family can be added dynamically; in addition, HBase uses timestamps to support multiple versions of the data.

As a map, HBase is not a simple one. It carries several modifiers: sparse, distributed, persistent, multidimensional, and sorted.

Unlike an ordinary map, the key of the map in HBase is a composite key made up of rowkey, column family, qualifier, type, and timestamp. The value is the value of the cell:

<[rowkey,column family,qualifier,type,timestamp],cell value>
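The composite-key idea can be sketched as a small sorted map. This is only a toy model under stated assumptions, not HBase code: the class name SortedCellMap and its methods are hypothetical, deletes are not modeled, and the timestamp is negated so that newer versions of a column sort first.

```python
import bisect


class SortedCellMap:
    """Toy model of the logical view: a sorted map from composite key to cell value."""

    def __init__(self):
        self._keys = []    # composite keys, kept in sorted order
        self._values = {}  # composite key -> cell value

    @staticmethod
    def _key(row, family, qualifier, timestamp, op_type="Put"):
        # Negating the timestamp makes newer versions sort before older ones.
        return (row, family, qualifier, -timestamp, op_type)

    def put(self, row, family, qualifier, timestamp, value):
        key = self._key(row, family, qualifier, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, family, qualifier):
        """Return the newest value of one column, or None if no cell exists."""
        prefix = (row, family, qualifier)
        # A 3-tuple prefix sorts before every 5-tuple key that starts with it.
        i = bisect.bisect_left(self._keys, prefix)
        if i < len(self._keys) and self._keys[i][:3] == prefix:
            return self._values[self._keys[i]]
        return None


cells = SortedCellMap()
cells.put("row1", "cf", "name", 1, "alice")
cells.put("row1", "cf", "name", 2, "alicia")   # newer version of the same column
cells.put("row2", "cf", "age", 1, "30")

newest_name = cells.get("row1", "cf", "name")  # newest version wins
missing = cells.get("row2", "cf", "name")      # sparse: an absent cell stores nothing
```

Because the keys are sorted, all versions of one column sit next to each other, and an empty column simply has no entry at all, which is what "sparse" means here.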

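2. Physical storage

Column family storage: conceptually, column family storage sits between row storage and column storage. If a table has a single column family that holds all of its columns, it behaves like row storage; if every column family holds exactly one column, it behaves like column storage.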
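A rough way to picture this layout is to group cells by column family, with each family kept in its own store. The sketch below is a simplification for illustration only; the function name layout_by_family and the sample data are hypothetical.

```python
from collections import defaultdict


def layout_by_family(cells):
    """Group (row, family, qualifier, value) cells into one sorted store per family."""
    stores = defaultdict(list)
    for row, family, qualifier, value in cells:
        stores[family].append((row, qualifier, value))
    # Inside one family, cells are kept together and sorted by row, then qualifier.
    return {family: sorted(kvs) for family, kvs in stores.items()}


sample_cells = [
    ("r1", "info", "name", "alice"),
    ("r1", "stats", "visits", "7"),
    ("r2", "info", "name", "bob"),
]
layout = layout_by_family(sample_cells)
```

Reading only the "info" family touches only the "info" store, which is the column-store side of the design; within that store, the columns of a row stay together, which is the row-store side.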

3. Architecture

  • Master-slave architecture


The Master manages the cluster, the RegionServers are responsible for reading and writing data, and HDFS stores the data.

  • HBase client
    a. The HBase client offers a shell command-line interface, the native Java API, Thrift / REST APIs, and a MapReduce programming interface.
    b. Before accessing a row, the client first uses the metadata table to locate the RegionServer that hosts the target data, and then sends the request to that RegionServer. The metadata is also cached locally on the client to speed up subsequent requests. If a RegionServer goes down or the cluster rebalances load so that regions migrate, the client must fetch the latest metadata again and refresh its local cache.
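The lookup-and-cache behaviour described in (b) can be sketched as follows. This is a minimal illustration, not the real client: RegionLocator, fetch_meta, and invalidate are hypothetical names, regions are identified only by their start key, and the first region is assumed to start at the empty string.

```python
import bisect


class RegionLocator:
    """Toy client-side cache of region locations read from a metadata table."""

    def __init__(self, fetch_meta):
        self._fetch_meta = fetch_meta  # callable returning [(start_key, server), ...]
        self._cache = None             # sorted copy of the metadata, or None if stale
        self.meta_lookups = 0          # how many times the metadata table was read

    def locate(self, rowkey):
        if self._cache is None:
            self._cache = sorted(self._fetch_meta())
            self.meta_lookups += 1
        start_keys = [start for start, _ in self._cache]
        # The hosting region is the last one whose start key is <= rowkey.
        i = bisect.bisect_right(start_keys, rowkey) - 1
        return self._cache[i][1]

    def invalidate(self):
        """Drop cached locations, e.g. after a request fails because a region moved."""
        self._cache = None


meta_table = {"": "rs1", "m": "rs2"}  # start key -> RegionServer (sample data)
locator = RegionLocator(lambda: list(meta_table.items()))

first = locator.locate("apple")    # served by the region starting at ""
second = locator.locate("zebra")   # served by the region starting at "m"
lookups_before_move = locator.meta_lookups

meta_table["m"] = "rs3"            # the region migrates to another server
locator.invalidate()               # the client discards its stale cache
after_move = locator.locate("zebra")
```

Only the first request and the request after the invalidation read the metadata table; every other lookup is answered from the local cache.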

  • ZooKeeper
    a. Provides high availability for the Master
    b. Manages core system metadata
    c. Participates in RegionServer failure recovery
    d. Implements distributed table locks

  • Master
    The Master is mainly responsible for the management tasks of the HBase system:
    a. Handles users' administrative requests, including creating tables, modifying tables, permission operations, splitting tables, merging regions, and compaction.
    b. Manages all RegionServers in the cluster, including load balancing, failure recovery, and migration of regions between RegionServers.
    c. Cleans up expired logs and files. At regular intervals the Master checks whether the HLog and HFile files in HDFS have expired, and deletes those that have.

  • RegionServer
    The RegionServer mainly serves users' I/O requests. It is the core module of HBase and is composed of the WAL (HLog), the BlockCache, and multiple regions.

a. WAL (HLog)
First, it provides data reliability. A random write to HBase does not go straight to an HFile data file; it is first written to an in-memory cache and later flushed to disk asynchronously. To keep the cached data from being lost, each write is appended sequentially to the HLog before it enters the cache, so the data can be recovered by replaying the log if the cache is lost.
Second, it enables master-slave replication between HBase clusters: the slave cluster replays the HLog entries pushed by the master cluster.
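The reliability role of the log can be sketched as a write path in which every write is appended to the log before it reaches the in-memory cache. The code below is a simplified model under stated assumptions: ToyRegionServer and its attributes are hypothetical, plain Python lists and dicts stand in for the HLog, the write cache, and HFiles, and the log is simply cleared after a flush.

```python
class ToyRegionServer:
    """Toy write path: log first, then cache, with an explicit flush to files."""

    def __init__(self):
        self.wal = []        # append-only log, assumed to survive a crash
        self.memstore = {}   # in-memory write cache, lost on a crash
        self.hfiles = []     # flushed files, assumed to survive a crash

    def put(self, key, value):
        self.wal.append((key, value))  # 1. sequential append to the log
        self.memstore[key] = value     # 2. then update the cache

    def flush(self):
        """Write the cache out as one sorted file and drop the covered log entries."""
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}
            self.wal.clear()

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            if key in hfile:
                return hfile[key]
        return None

    @classmethod
    def recover(cls, wal, hfiles):
        """Rebuild a server after a crash by replaying the surviving log."""
        server = cls()
        server.hfiles = list(hfiles)
        for key, value in wal:
            server.put(key, value)
        return server


rs = ToyRegionServer()
rs.put("a", "1")
rs.flush()        # "a" is now in a file
rs.put("b", "2")  # "b" exists only in the log and the cache

# Simulated crash: the cache is gone, the log and the files survive.
recovered = ToyRegionServer.recover(list(rs.wal), rs.hfiles)
```

After recovery, "a" is read from the flushed file and "b" is restored from the replayed log, even though the cache that held "b" was lost.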

b. BlockCache: the read cache of HBase
The BlockCache caches blocks. A block is 64 KB by default and consists of physically adjacent KV data. The BlockCache exploits both spatial locality (data near a recently read KV is likely to be read soon) and temporal locality (data that was read once is likely to be read again).
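The two kinds of locality can be illustrated with a small least-recently-used cache of blocks. This is a deliberately plain LRU sketch and an assumption on my part; it is not the eviction policy HBase actually ships, and ToyBlockCache and load_block are hypothetical names.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy read cache that keeps whole blocks and evicts the least recently used."""

    def __init__(self, capacity_blocks, load_block):
        self.capacity = capacity_blocks
        self._load_block = load_block  # callable: block_id -> block contents
        self._blocks = OrderedDict()   # block_id -> block, oldest access first
        self.hits = 0
        self.misses = 0

    def get_block(self, block_id):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # temporal locality: keep hot blocks
            self.hits += 1
            return self._blocks[block_id]
        self.misses += 1
        block = self._load_block(block_id)      # a whole block of adjacent KVs is loaded
        self._blocks[block_id] = block
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)    # evict the least recently used block
        return block


cache = ToyBlockCache(capacity_blocks=2, load_block=lambda block_id: f"block-{block_id}")
for block_id in [0, 0, 1, 2, 0]:
    cache.get_block(block_id)
```

In this run the second read of block 0 is a hit; reading blocks 1 and 2 then pushes block 0 out of the two-block cache, so the final read of block 0 is a miss again. Spatial locality comes from caching a whole block at a time: once one KV has been read, its neighbours in the same block are already in memory.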

c. Region: a shard of a data table. Each RegionServer hosts multiple regions.

4. Advantages and disadvantages of HBase

1. Advantages:

  • Huge capacity: a single HBase table can hold hundreds of billions of rows and millions of columns, and the data volume can reach the TB or even PB level.
  • Good scalability: an HBase cluster is easy to expand, both by adding data storage nodes and by adding read-write service nodes.
  • Sparsity: HBase supports highly sparse data; a large number of column values may be empty, and empty values occupy no storage space.
  • High performance: HBase is currently best suited to OLTP scenarios. It has strong write performance, and it also performs well for random single-point reads and small-range scans.
  • Multi-version: a KV can retain multiple versions at the same time, and users can read either the latest version or a historical version as needed.
  • Expiration support: HBase supports TTL expiration. Users only need to set the expiration time, and data older than the TTL is cleaned up automatically; there is no need to write a program to delete it manually.
  • Native Hadoop support
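The multi-version and TTL features in the list above can be pictured with a small filter over the versions of one cell. This is an illustrative sketch only: visible_versions and its parameters are hypothetical, and timestamps and TTL are assumed to share the same unit.

```python
def visible_versions(versions, now, max_versions=1, ttl=None):
    """Return the (timestamp, value) pairs a read would see, newest first.

    versions: list of (timestamp, value) pairs for one cell
    ttl: maximum age of a version, or None to disable expiration
    """
    live = [(ts, value) for ts, value in versions if ttl is None or now - ts <= ttl]
    live.sort(key=lambda pair: pair[0], reverse=True)
    return live[:max_versions]


history = [(100, "v1"), (200, "v2"), (300, "v3")]

latest_only = visible_versions(history, now=350)
last_two = visible_versions(history, now=350, max_versions=2)
within_ttl = visible_versions(history, now=350, max_versions=3, ttl=100)
```

With these sample values, the default read returns only the version written at 300, asking for two versions also returns the one written at 200, and a TTL of 100 hides every version older than 250.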

2. Disadvantages:

HBase does not suit every application scenario:

  • HBase itself does not support complex aggregation operations (such as join and group by).
  • HBase itself does not implement secondary indexes, so it does not support lookups by secondary index.
  • HBase does not natively support global cross-row transactions; it supports only a single-row transaction model.