Distributed architecture is imperative
In traditional database deployments, high-end external storage was usually chosen as the database's main storage in order to keep data safe and performance high, while local disks were regarded as unreliable, low-performance devices. This view stems mainly from past limits in local disk manufacturing technology, which left local disks far behind high-end storage in both stability and performance.
The rapid development of Internet applications
In the past, the IT industry was far less vigorous than it is today. Most people used desktop devices at home or in the office to look up information and browse news from around the world. In that era, for Internet companies and for traditional enterprises that depend heavily on their IT departments, such as banks and telecom operators, a minicomputer and a good relational database were enough to handle the data flowing between the systems inside the enterprise.
But over the last 10 years, or even less, almost everyone has come to own one or more smart mobile devices. These devices not only make phone calls and send messages, but also browse the Internet, play games, shop, and chat, much like a laptop. Developers keep discovering new intelligent applications and business scenarios, and most consumers are embracing them.
The cost of hardware is greatly reduced
A minicomputer plus a good relational database is increasingly unable to meet IT departments' needs for processing rapidly expanding data. An important reason is that traditional database technology cannot easily and quickly spread massive data across multiple servers for computation; it can only improve system performance by scaling the hardware vertically.
After 10 years of development, the disk technology that enterprises once distrusted has made great progress. From SATA to SAS disks, and from mechanical disks to SSDs, the stability and read/write performance of local disks have improved greatly, and enterprises are gradually using them in important production environments.
Distributed big data technology innovation
Alongside the development of disk technology, various big data technologies have emerged. First came three famous papers from Google, followed by NoSQL and NewSQL databases with various technical characteristics. Their implementations are built on inexpensive x86 servers with local disks as the main hardware.
To cope with rapidly growing enterprise data volumes and customers' ever-higher demands on IT system response time, these NoSQL and NewSQL databases all include distributed data storage among their technical characteristics.
Introduction to database distribution principles
Distributing data in hash mode
In hash mode, data is distributed as follows: when creating a collection, the user specifies hash as the collection's splitting mode and explicitly specifies which field of the record serves as the shardingkey.
When an application sends a request to write a record, the record and the request first go to a coord node of the database. The coord node decides which data partition group the record should go to based on the collection's splitting settings (for example shardingkey = ID, shardingtype = hash) and the hash value of the record's shardingkey. Once the data partition group receives the write request and the data, it calls the corresponding method in the database to persist the record to disk and updates the index data of the collection in that data partition group.
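The hash write path can be sketched in a few lines of Python. This is only an illustrative model, not SequoiaDB code: the `Coord` class, the group names, and the use of MD5 are all assumptions made for the example, since the text does not say which hash function the database uses.

```python
import hashlib

GROUPS = ["group1", "group2", "group3"]  # hypothetical data partition groups


def stable_hash(value):
    """Hash a sharding key value to an integer (MD5 is an arbitrary stand-in)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Coord:
    """Toy coord node that routes each record to one data partition group."""

    def __init__(self, sharding_key, groups=GROUPS):
        self.sharding_key = sharding_key
        self.groups = list(groups)
        # In-memory lists and dicts stand in for on-disk records and indexes.
        self.records = {g: [] for g in self.groups}
        self.index = {g: {} for g in self.groups}

    def route(self, record):
        """Choose the group from the hash of the record's sharding key."""
        bucket = stable_hash(record[self.sharding_key]) % len(self.groups)
        return self.groups[bucket]

    def insert(self, record):
        """Route the record, 'persist' it, and update that group's index."""
        group = self.route(record)
        self.records[group].append(record)
        key = record[self.sharding_key]
        self.index[group][key] = len(self.records[group]) - 1
        return group
```

Calling `Coord("id").insert({"id": 7, "name": "a"})` returns the name of the group that received the record. Because the choice depends only on the sharding key value, two records with the same `id` always land in the same group.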
Distributing data in range mode
In range mode, data is distributed as follows: when creating a collection, the user specifies range as the collection's partition mode and explicitly specifies which field of the record serves as the shardingkey.
When an application sends a request to write a record, the record and the request first go to a coord node of the database. The coord node uses the collection's splitting settings (for example shardingkey = ID, shardingtype = range) and the value of the record's shardingkey to determine which range the record belongs to, and then sends the record to the corresponding data partition group. Once the data partition group receives the write request and the data, it calls the corresponding method in the database to persist the record to disk and updates the index data of the collection in that data partition group.
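A minimal sketch of the range lookup, under assumed boundaries: the bounds, group names, and the `range_route` helper are hypothetical and only illustrate how a sharding key value maps to the group that owns its range.

```python
import bisect

# Hypothetical layout: group1 owns [0, 1000), group2 owns [1000, 2000),
# group3 owns [2000, 3000). Lower bounds are inclusive, upper bounds exclusive.
RANGE_BOUNDS = [0, 1000, 2000, 3000]
RANGE_GROUPS = ["group1", "group2", "group3"]


def range_route(record, sharding_key="id"):
    """Return the data partition group whose range contains the key value."""
    value = record[sharding_key]
    if not RANGE_BOUNDS[0] <= value < RANGE_BOUNDS[-1]:
        raise ValueError(f"{sharding_key}={value} is outside every range")
    # bisect_right finds the first bound greater than value; the range that
    # contains value starts one position before it.
    position = bisect.bisect_right(RANGE_BOUNDS, value) - 1
    return RANGE_GROUPS[position]
```

Unlike hash mode, neighbouring key values stay together here: every `id` from 1000 to 1999 maps to `group2`, which suits range queries but can concentrate load if writes cluster in one range.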
Distributing data in partition mode
Beyond the simple data splitting offered by other NoSQL databases, the SequoiaDB database also provides a “primary and sub table” function. It is similar to the partitioning feature of some relational databases: a logical overall view is created in the database, and multiple partitions are then mounted on that view according to ranges of a chosen field.
With the “primary and sub table” distribution mode, users can split data more precisely and in finer detail according to their needs. Data such as time series or hot and cold data can be split naturally, which helps make full use of hardware resources.
In SequoiaDB, the main collection corresponds to the overall view in a relational database, and a sub collection corresponds to a partition. What is particular to the SequoiaDB database is that a sub collection is actually an ordinary collection, while the main collection exists only in the catalog node of the database and writes no data to any data partition group.
In other words, the main collection is a logical view that keeps only data range information in the catalog node, while a sub collection is an ordinary collection created in the database. It is called a sub collection only when it is used together with a main collection.
Normally, a collection that a user creates is assigned to a data partition group at random. Using the primary and sub table function for distributed storage therefore takes a little care: when creating a collection, the user can explicitly specify which data partition group it is assigned to, so that the sub collections do not all cluster on one data partition group.
As the last step, the user defines the range for each sub collection and attaches each sub collection to the main collection.
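The relationship between a main collection and its sub collections can be modelled as follows. This is a hedged sketch rather than the SequoiaDB API: the `Collection` and `MainCollection` classes, the `attach` method, and the field name `created` are all invented for illustration. The point it shows is that the main collection stores only range metadata, while records live in ordinary collections placed in explicitly chosen groups.

```python
from datetime import date


class Collection:
    """An ordinary collection stored in one explicitly chosen group."""

    def __init__(self, name, group):
        self.name = name
        self.group = group
        self.records = []


class MainCollection:
    """A logical view: it holds range metadata only, never records."""

    def __init__(self, name, sharding_key):
        self.name = name
        self.sharding_key = sharding_key
        self.attached = []  # list of (low, up, sub collection)

    def attach(self, sub, low, up):
        """Mount a sub collection for key values in [low, up)."""
        for other_low, other_up, _ in self.attached:
            if low < other_up and other_low < up:
                raise ValueError("range overlaps an attached sub collection")
        self.attached.append((low, up, sub))

    def insert(self, record):
        """Forward the record to the sub collection that covers its key."""
        value = record[self.sharding_key]
        for low, up, sub in self.attached:
            if low <= value < up:
                sub.records.append(record)
                return sub
        raise ValueError("no sub collection covers this value")


# Usage: one sub collection per year, each placed in a different group.
orders = MainCollection("orders", sharding_key="created")
orders.attach(Collection("orders_2023", "group1"), date(2023, 1, 1), date(2024, 1, 1))
orders.attach(Collection("orders_2024", "group2"), date(2024, 1, 1), date(2025, 1, 1))
```

Placing each yearly sub collection in a different group is what keeps the sub collections from clustering on one data partition group.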
In the SequoiaDB database, the most complex and most efficient method is multi-dimensional partitioning. As the name implies, it means applying more than one partitioning method to a collection at the same time.
As shown in Figure 4, multidimensional partitioning is actually a way of combining primary and sub tables with hash partitions.
At present, multi-dimensional partitioning in SequoiaDB can partition a collection on two fields. Typically, users split on a time field at the primary and sub table level and then hash-split each sub collection on its ID field.
The main advantage of multi-dimensional partitioning is that it balances a massive data set at a finer granularity, which improves database performance.
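The two-level routing can be sketched as a range lookup on the time field followed by a hash on the ID field. Everything named here is an assumption for illustration: the yearly ranges, sub collection names, group names, field names, and the MD5-based hash are not taken from SequoiaDB.

```python
import hashlib
from datetime import date

# Hypothetical layout: each sub collection covers one year of the time field
# and is itself hash-split on the id field across its own groups.
SUB_COLLECTIONS = [
    (date(2023, 1, 1), date(2024, 1, 1), "orders_2023", ["group1", "group2"]),
    (date(2024, 1, 1), date(2025, 1, 1), "orders_2024", ["group3", "group4"]),
]


def stable_hash(value):
    """Hash a key value to an integer (MD5 is an arbitrary stand-in)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def route(record, time_key="created", id_key="id"):
    """Return (sub collection name, data partition group) for a record."""
    for low, up, sub_name, groups in SUB_COLLECTIONS:
        # First dimension: the time range selects the sub collection.
        if low <= record[time_key] < up:
            # Second dimension: the id hash selects a group inside it.
            group = groups[stable_hash(record[id_key]) % len(groups)]
            return sub_name, group
    raise ValueError("no sub collection covers this time value")
```

With this layout, one year's records are no longer confined to a single group: they are spread by `id` over every group that backs that year's sub collection, which is the finer granularity described above.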
When using multi-dimensional partitioning, users can also combine it with the database's domain function to manage the partitioning work better.
A domain is a logical grouping of multiple data partition groups in the database. Users can specify which domain a collection space belongs to when creating the collection space. Any collection then created in that collection space with shardingtype = hash is automatically distributed, by the hash value of its shardingkey, across the data partition groups that belong to the domain.
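As a rough illustration of how a domain narrows the set of candidate groups, the sketch below binds a collection space to a domain and hashes records only across that domain's groups. The `Domain` and `CollectionSpace` classes, their methods, and the MD5-based hash are hypothetical stand-ins, not SequoiaDB interfaces.

```python
import hashlib


def stable_hash(value):
    """Hash a key value to an integer (MD5 is an arbitrary stand-in)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Domain:
    """A named, logical set of data partition groups."""

    def __init__(self, name, groups):
        self.name = name
        self.groups = list(groups)


class CollectionSpace:
    """A collection space bound to one domain when it is created."""

    def __init__(self, name, domain):
        self.name = name
        self.domain = domain

    def route(self, record, sharding_key):
        """Hash-distribute a record across the domain's groups only."""
        groups = self.domain.groups
        return groups[stable_hash(record[sharding_key]) % len(groups)]


# Usage: the cluster may have more groups, but this collection space
# only ever writes to the two groups in its domain.
hot_domain = Domain("hot_data", ["group1", "group2"])
sales = CollectionSpace("sales", hot_domain)
target = sales.route({"id": 42}, sharding_key="id")
```

The design choice this models is that group selection is configured once, at the domain level, so collections created in the collection space do not need to list their target groups individually.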
*Distributed architecture is the trend for big data, and even for all Internet applications, in the future. For storing and managing massive data, a distributed architecture has natural core advantages such as high performance and high availability.
SequoiaDB provides a variety of data distribution and partitioning functions, which offer multiple choices of storage architecture and meet the needs of more big data application scenarios. Our data partitioning functions such as “primary sub table” and “multi-dimensional partition” also greatly improve system performance while meeting business requirements.*