HBase knowledge points (I): preliminary knowledge and extensions


By studytime


The three papers published by Google (GFS, MapReduce, and BigTable), often called its "troika", are regarded as the sign that computer science entered the era of big data.

The early Hadoop developers implemented only the Hadoop file system and Hadoop MapReduce, but not BigTable. As a result, a BigTable counterpart was missing from the Hadoop big data ecosystem for quite some time.

It was not until Powerset launched the HBase project that an open source version of BigTable was actually realized. Powerset was a well-known start-up in its day, working on the next generation of search engines: natural language search. It released its official product in 2008, but the results were not satisfactory, and it was later acquired by Microsoft as Microsoft moved into the search engine field. While developing its semantic search engine, however, Powerset needed a system similar to Google BigTable, and the HBase it developed became a great contribution to the whole open source big data community.

Brief introduction

HBase is the open source implementation of Google BigTable and an important member of the Apache Hadoop big data ecosystem. It is a distributed, column-oriented storage system built on HDFS. Logically, HBase organizes data by table, row, and column. It is a distributed, sparse, persistent, multidimensional sorted table.

HBase and BigTable

HBase relies on HDFS for underlying data storage, and BigTable relies on Google GFS for data storage.
HBase relies on Hadoop MapReduce for data computation, and BigTable relies on Google MapReduce for data computation.
HBase relies on ZooKeeper for service coordination, and BigTable relies on Google Chubby for service coordination.

Examples of application scenarios

Web page library (360 Search)
Commodity library (Taobao search, historical bill query)
Transaction information (Taobao Data Cube)
Cloud storage service (Xiaomi)
Monitoring information (OpenTSDB)

HBase data model

The HBase data model has two parts: the logical data model and physical data storage.
Logical data model: the model of the database as the user sees it, which is directly related to HBase data modeling.
Physical data storage: the physical data model is oriented to the computer's physical representation and describes how HBase organizes data on storage media (including memory and disk).

Logical data model (HBase table structure logic)

1. Basic overview
Similar to the logical concepts of database and table in a relational database, HBase has namespace and table, respectively. A namespace contains a set of tables.

Two default namespaces are built into HBase:

- hbase: holds the system's built-in tables, including the namespace and meta tables.
- default: holds every table that a user creates without specifying a namespace.
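
As a small illustration of the namespace rule above, the following Python sketch resolves a table name written in the `namespace:table` form (as in `hbase:meta`) and falls back to the default namespace when none is given. The helper `split_table_name` is a hypothetical name made up for this example, not part of any HBase client API.

```python
from typing import Tuple

DEFAULT_NAMESPACE = "default"  # namespace used when the name carries none


def split_table_name(name: str) -> Tuple[str, str]:
    """Split 'namespace:table' into (namespace, table).

    Hypothetical helper for illustration only; a bare table name is
    assumed to live in the default namespace.
    """
    if ":" in name:
        namespace, table = name.split(":", 1)
        return namespace, table
    return DEFAULT_NAMESPACE, name


print(split_table_name("hbase:meta"))  # ('hbase', 'meta')
print(split_table_name("orders"))      # ('default', 'orders')
```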

An HBase table consists of a series of rows. Each row has one rowkey and a number of column families, and each column family can contain an unlimited number of columns.

2. Terms and concepts
rowkey: Each row in an HBase table is uniquely identified by its rowkey, which is similar to the primary key in a relational database. Every row has a rowkey, and it is the index used to locate and modify that row's data. Within a table, rowkeys are globally ordered. A rowkey has no data type and is stored as a byte array (byte[]).

Given the way HBase serves queries, the rowkey has a great impact on query performance, so rowkey design is particularly important. A good design has to serve single-row lookups by rowkey while also working well for range scans over rowkeys.
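
To make the ordering concern concrete, here is a minimal Python sketch, assuming rowkeys are compared as raw bytes. The rowkeys and the `scan` helper are made up for illustration; the helper only mimics a range scan with an inclusive start and an exclusive stop over an in-memory sorted list and is not the HBase client API.

```python
import bisect

# Hypothetical rowkeys, kept as raw bytes because a rowkey has no data type.
rowkeys = sorted([b"user10", b"user2", b"user1", b"user03"])
print(rowkeys)  # [b'user03', b'user1', b'user10', b'user2']


def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop) from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Byte order is not numeric order: b"user10" sorts before b"user2".
print(scan(rowkeys, b"user1", b"user2"))  # [b'user1', b'user10']

# Zero-padding the numeric part makes byte order match numeric order.
padded = sorted(f"user{n:04d}".encode() for n in (10, 2, 1))
print(padded)  # [b'user0001', b'user0002', b'user0010']
```

The same sorted structure serves both access patterns: a single-row lookup is a binary search for one key, and a range scan is a contiguous slice.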

column family: A column family consists of multiple columns. All rows in a table share the same column families. Column families are part of the schema and must be specified when the table is defined. Each column family can contain any number of dynamic columns. The column family is the basic unit of access control. The data of one column family is physically stored together in the same files.

column qualifier: The column inside a column family. Each column in HBase is addressed as column family:column qualifier. The column qualifier is not part of the schema and can be specified dynamically, so each row can have different qualifiers. Like the rowkey, the column qualifier has no data type and is stored as a byte array (byte[]).

cell: A cell is uniquely located by rowkey, column family, and column qualifier. A cell stores multiple versions of its value. By default, the version number of each value is its write timestamp. The data in a cell also has no type and is stored as a byte array.
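
The addressing scheme described above can be sketched as a nested map in Python. This is only an in-memory model for illustration, with made-up function names (`new_table`, `put`, `get_cell`) and made-up data; it is not how HBase stores data internally and not its client API.

```python
from collections import defaultdict


def new_table():
    # rowkey -> column family -> column qualifier -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, rowkey, family, qualifier, timestamp, value):
    """Write one version of one cell."""
    table[rowkey][family][qualifier][timestamp] = value


def get_cell(table, rowkey, family, qualifier):
    """Return every version of a cell as {timestamp: value}, or {} if absent."""
    return dict(table.get(rowkey, {}).get(family, {}).get(qualifier, {}))


t = new_table()
put(t, b"row1", b"info", b"name", 1, b"alice")
put(t, b"row1", b"info", b"email", 1, b"a@example.com")
put(t, b"row2", b"info", b"name", 2, b"bob")  # row2 has no email column

print(get_cell(t, b"row1", b"info", b"name"))   # {1: b'alice'}
print(get_cell(t, b"row2", b"info", b"email"))  # {}
```

Because qualifiers are created on write, the two rows hold different sets of columns and the missing cell takes no space, which is what makes the table sparse.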

timestamp: The data inside a cell is multi-versioned, and HBase uses the write timestamp as the version number by default. Users can also set version numbers according to business requirements. The default maximum was 3 versions in older releases (before HBase 0.96) and is 1 in later releases. If no version is specified when reading, the latest version is returned. If the number of stored versions exceeds the configured maximum, the oldest versions are cleaned up automatically.
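
The versioning rules above can be sketched with a small Python class. `Cell` and its methods are hypothetical names for this example, and the sketch trims old versions immediately on write for simplicity, whereas a real HBase deployment is assumed to defer the physical cleanup.

```python
import time


class Cell:
    """Toy model of a multi-version cell (illustration only)."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.versions = {}  # timestamp (version number) -> value

    def put(self, value, timestamp=None):
        # Default version number: the write time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Keep only the newest max_versions entries.
        for ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[ts]

    def get(self, timestamp=None):
        """Return the requested version, or the latest if none is given."""
        if not self.versions:
            return None
        if timestamp is None:
            return self.versions[max(self.versions)]
        return self.versions.get(timestamp)


cell = Cell(max_versions=3)
for ts in (1, 2, 3, 4):
    cell.put(f"v{ts}".encode(), timestamp=ts)

print(cell.get())             # b'v4'
print(cell.get(2))            # b'v2'
print(sorted(cell.versions))  # [2, 3, 4]
```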

3. Model features

  • Strong scalability: supports billions of rows, millions of columns, and hundreds of thousands of versions
  • Can store very sparse data
  • Supports point queries, fetching a single row by its primary key (rowkey)
  • Supports scans, quickly fetching the rows in a given rowkey range and efficiently fetching the data of selected columns
  • Supported data type: byte[] (all data is stored as byte arrays at the underlying layer)
  • Mainly used to store loosely structured data, both structured and semi-structured

Physical data storage

HBase is a column-family-oriented storage engine that stores data in units of column families. Within each column family, data is stored as key-value pairs.
Within a table, data is sorted in ascending order of rowkey, the columns within a row are sorted in ascending order of column qualifier, and the versions within a cell are sorted in descending order of version number.
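
The sort order just described can be sketched as a Python sort key. The tuple layout `(rowkey, family, qualifier, timestamp, value)` and the sample data are assumptions made for this example; placing the column family between rowkey and qualifier is also an assumption of the sketch, since the text specifies only the other three levels.

```python
def sort_key(kv):
    """Order by rowkey ascending, family ascending, qualifier ascending,
    then timestamp descending (newest version first)."""
    rowkey, family, qualifier, timestamp, _value = kv
    return (rowkey, family, qualifier, -timestamp)


# Hypothetical key-value entries in arbitrary write order.
kvs = [
    (b"row2", b"info", b"name", 5, b"bob"),
    (b"row1", b"info", b"name", 1, b"al"),
    (b"row1", b"info", b"email", 3, b"a@example.com"),
    (b"row1", b"info", b"name", 7, b"alice"),
]

ordered = sorted(kvs, key=sort_key)
for rowkey, family, qualifier, timestamp, value in ordered:
    print(rowkey, family, qualifier, timestamp, value)
```

With this order, the newest version of a cell is the first entry a reader meets for that cell, so returning the latest version does not require looking past it.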