1、 What is HBase?
HBase is a distributed, column-oriented, open-source database built on HDFS. Its design comes from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al., whose core idea is to keep data in very large, sparse tables. Just as Bigtable builds on the distributed storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a subproject of Apache Hadoop and is now a top-level Apache project. Unlike a typical relational database, HBase is well suited to storing unstructured and semi-structured data, and it organizes storage by column family rather than by row.
2、 Architecture of HBase
HBase follows a master-slave architecture built from three types of servers: the HMaster, the region servers, and ZooKeeper.
- The HMaster is responsible for region assignment and for creating and deleting tables. Specifically, its responsibilities include:
  - Coordinating the work of the region servers.
  - Assigning regions when the cluster starts, and reassigning them for failure recovery or load balancing.
  - Monitoring the status of every region server in the cluster.
  - Administering the database by providing an interface to create, delete, or update tables.
- The region servers are responsible for reading and writing data. Clients access data by communicating with the region servers directly.
Specifically, each region server hosts a number of regions. An HBase table is split horizontally into regions by rowkey range: a region contains every row whose rowkey falls between the region's start key (inclusive) and end key (exclusive). The default maximum size of a region is 1 GB in the HBase version described here; the limit is configurable, and newer releases default to 10 GB. The cluster node that serves a set of regions is called a region server, and each region server can serve roughly 1,000 regions.
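The mapping from rowkeys to regions, and from regions to region servers, can be sketched as follows. This is a minimal in-memory illustration in Python, not HBase code: the names `Region`, `build_regions`, `find_region` and `assign_round_robin`, the split points and the server names are all assumptions made for the example, and the round-robin policy merely stands in for whatever balancing the HMaster actually applies.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start_key: str  # inclusive; "" means "from the beginning of the table"
    end_key: str    # exclusive; "" means "to the end of the table"


def build_regions(split_points):
    """Turn split points into contiguous, non-overlapping regions."""
    bounds = [""] + sorted(split_points) + [""]
    return [Region(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


def find_region(regions, row_key):
    """Return the region whose [start_key, end_key) range contains row_key."""
    # Python compares str by code point; HBase compares raw bytes.
    starts = [r.start_key for r in regions]
    return regions[bisect_right(starts, row_key) - 1]


def assign_round_robin(regions, servers):
    """Hypothetical assignment policy: deal regions out to servers in turn."""
    return {region: servers[i % len(servers)] for i, region in enumerate(regions)}


regions = build_regions(["g", "n", "t"])             # four regions
assignment = assign_round_robin(regions, ["rs1", "rs2"])
region = find_region(regions, "hbase")               # falls in ["g", "n")
print(region, "->", assignment[region])
```

Because the regions are contiguous and sorted by start key, a binary search over the start keys is enough to locate the region for any rowkey.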
- ZooKeeper maintains the state of the cluster (which servers are online, synchronization between servers, master election, and so on). Its responsibilities in HBase include:
  - Tracking which HBase servers are alive and reachable.
  - Providing notification of server failure or downtime.
  - Using a consensus algorithm to keep shared state synchronized between servers.
  - Electing the active master.
Note that, for reliable consensus and smooth master election, the ZooKeeper ensemble should contain an odd number of servers, such as three or five.
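The odd-number advice follows from majority voting, which a short calculation makes concrete. The sketch below assumes only that the ensemble stays available while a strict majority of its servers is up; the function names are illustrative.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in range(3, 7):
    print(n, "servers: quorum", quorum_size(n),
          "tolerates", tolerated_failures(n), "failure(s)")
```

Going from three servers to four raises the quorum from two to three but still tolerates only one failure, so the extra server adds cost without adding fault tolerance. That is why ensembles of three or five are typical.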
3、 Table structure of HBase
Each row has a rowkey that uniquely identifies and locates it, and rows are stored sorted in lexicographic (dictionary) order of their rowkeys. Take an employee table as an example: ImployeeBasicInfoCLF and DetailInfoCLF are two column families, and each column family contains several columns. The employee basic information column family holds name and age; the detailed information column family holds salary and role.
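Because rowkeys are compared as raw bytes rather than as numbers, the choice of key format affects the order in which rows are stored. A small sketch, using Python's byte-string sorting as a stand-in for that ordering; the key values are made up for illustration:

```python
# Unpadded numeric suffixes: "row10" sorts before "row2".
unpadded_sorted = sorted([b"row1", b"row2", b"row10", b"row20"])
print(unpadded_sorted)

# Zero-padding the number makes byte order match numeric order.
padded_sorted = sorted(b"row%04d" % n for n in (20, 2, 10, 1))
print(padded_sorted)
```

Fixed-width, zero-padded keys are one common way to make scans return rows in the numeric order a reader would expect.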
HBase data model
- Namespace: a namespace is a logical grouping of tables, comparable to a database in a relational database system. Namespaces provide better resource and data isolation in multi-tenant scenarios.
- Table: corresponds to a table in a relational database. HBase organizes data in units of tables, and a table consists of multiple rows.
- Row: a row consists of a rowkey and one or more column families. The rowkey uniquely identifies the row.
- Column family: each row is made up of several column families, and each column family can contain multiple columns. ImployeeBasicInfoCLF and DetailInfoCLF are two column families. A column family groups columns that have something in common. Note: physically, the data of one column family is stored together.
- Column qualifier: a column is uniquely specified by its column family and column qualifier. For example, name and age above are column qualifiers in the ImployeeBasicInfoCLF column family.
- Cell: a cell is located by rowkey, column family and column qualifier. Each cell stores a value together with a version number.
- Timestamp: a cell can hold several versions of its value, each tagged with a timestamp. The versions are sorted in reverse chronological order, so the newest one comes first.
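Taken together, these terms describe a multi-level map. The sketch below models one table as nested Python dictionaries, from rowkey to column family to column qualifier to timestamp to value. It only illustrates the logical model, not HBase's storage format or client API: the helper names `put` and `get`, the rowkey and the cell values are invented, and the column family and qualifier names follow the employee example above.

```python
from collections import defaultdict

# table[rowkey][family][qualifier][timestamp] = value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(rowkey, family, qualifier, timestamp, value):
    """Store one version of a cell."""
    table[rowkey][family][qualifier][timestamp] = value


def get(rowkey, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table[rowkey][family][qualifier]
    return sorted(versions.items(), reverse=True)[:max_versions]


put("emp001", "ImployeeBasicInfoCLF", "name", 1, "Alice")
put("emp001", "ImployeeBasicInfoCLF", "age", 1, 30)
put("emp001", "DetailInfoCLF", "salary", 1, 5000)
put("emp001", "DetailInfoCLF", "salary", 2, 5500)  # newer version, same cell

latest = get("emp001", "DetailInfoCLF", "salary")
history = get("emp001", "DetailInfoCLF", "salary", max_versions=2)
print(latest)   # only the newest version
print(history)  # both versions, newest first
```

Writing to the same rowkey, column family and qualifier with a later timestamp does not overwrite the earlier value; it adds a version, and a read returns the newest one unless more are requested.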