Overview of database system design


There are only two kinds of developers in the world: those who use database systems and those who build them.

Data is the most important asset of a system, and most systems exist to manage data. An application models the real world through its data model and changes the state of that model by operating on objects or data structures with algorithms. Data is ultimately organized in operating system files, and we organize, query, search, and process it through a database system.

This article discusses the history of databases, how they are classified, common database architectures, and common database concepts and techniques. Along the way it looks at the implementation principles of different databases to show how databases are actually built.

This article is divided into five chapters.

  • Exploring the origin of ancient times: From the birth, development, current state, and prospects of the database, we can understand why databases exist and the historical and practical reasons behind their design.
  • A hundred schools of thought contend: This chapter surveys different database system implementations under different classification schemes. It broadens our view and can serve as a reference for technology selection (the choice of the underlying database system matters a great deal to the architecture of the whole system).
  • Connecting the preceding with the following: This is the middle chapter of the article. The first two chapters are driven by interest and theory, and this chapter summarizes them. With that knowledge we can choose a database system that fits the needs of a project. Readers who want to know more about underlying storage can also pick the database type they are interested in, find a corresponding implementation, and move on to the next step of learning. The chapters after it cover more specific technical points.
  • Unity of knowledge and action: This chapter explains how databases are implemented, analyzes some database architectures, distributed-system problems and their solutions, and dissects common technical points of specific databases.

Depending on your interests, take what you need: skip the parts that do not interest you and read the parts you care about.

1、 Exploring the origin of ancient times

Those who doubt the present should examine the past; those who do not know what is coming should look at what has gone before. -- Guanzi

Database management systems let people organize, store, and retrieve data on computers. In the early days, computer input and output relied on punched cards, which offered a fast way to enter and retrieve data. Databases have played a very important role in the more recent development of computing. The first computer programs, developed in the early 1950s, focused almost entirely on coding languages and algorithms. At that time computers were basically large calculators, and data (names, phone numbers) was considered a leftover of processing information. Once computers became commercial products, data began to receive more and more attention.


Digression (going back through time): when I want to understand something, I like to go back to its source and explore it from its starting point in time or from its logical foundations. The logical origin of a thing is often pure and simple; it only grows complex as it develops and expands over time. So it is often easier to understand something from its beginning. For example, to read the source code of a system, we can start with its 1.0.0 version and with the problems the system originally set out to solve.

Computer databases began in the 1960s. During this decade there were two popular data models: a network model called CODASYL and a hierarchical model called IMS. One database system that proved commercially successful was SABRE, which IBM used to help American Airlines manage its reservation data.

In 1970, E. F. Codd published the influential paper "A Relational Model of Data for Large Shared Data Banks", and his ideas changed how people thought about databases. In his model, the schema, or logical organization, of the database is decoupled from physical storage, and this became a standard principle of database systems. Two major prototypes followed: Ingres, developed at UC Berkeley (UCB), and System R, developed at IBM. Ingres used a query language called QUEL and led to systems such as Ingres Corp, MS SQL Server, Sybase, PACE, and Britton-Lee. System R, on the other hand, used the SEQUEL query language, which contributed to the development of SQL/DS, DB2, Allbase, Oracle, and Non-Stop SQL. Relational database management system (RDBMS) became a recognized term.

In 1976, P. Chen proposed a new database model called Entity-Relationship (ER). This model lets designers focus on how the data is used rather than on the logical table structure. In the 1980s, Structured Query Language (SQL) became the standard query language.

An RDBMS is an effective way to store and process structured data. However, with the rapid growth of the Internet, "unstructured" data (video, photos, music, and so on) became far more common. Unstructured data is both non-relational and schema-free, and relational database management systems were not designed to handle it. NoSQL emerged in the 21st century as a response to the Internet and to the demand for higher speed and for processing unstructured data. Because of their speed and flexibility, NoSQL databases are preferable to relational databases in some use cases. The NoSQL model is non-relational and uses a "distributed" database system. Such systems are fast, organize data in an ad hoc way, and process large volumes of data of different types. Generally speaking, NoSQL has the following advantages over an RDBMS:

  • Higher scalability
  • Distributed computing system
  • Low cost
  • Flexible schema
  • Can handle unstructured and semi-structured data
  • No complex relationships

Although the history of databases spans only about half a century, it has produced a number of excellent database systems. System R, PostgreSQL, MySQL, DB2, Oracle, MongoDB, HBase, Neo4j, Elasticsearch, and others have all played an important role in the development of software.


2、 A hundred schools of thought contend

Spring has come. Let a hundred kinds of flowers bloom; do not let only a few bloom while others are kept from blooming. This is what is called letting a hundred flowers bloom. -- Mao Zedong

There are by now a great many data systems in the industry. The DB-Engines website lists hundreds of database systems with different capabilities, and it roughly divides them into the categories shown in its figure:

[Figure: database system categories, from the DB-Engines website]

In "Database Software Market: The Long-Awaited Shake-up", William Blair gives a detailed classification of database systems along two dimensions: relational versus non-relational, and operational versus analytical.

[Figure: William Blair's classification of database systems]

In the figure above, the vertical axis separates relational databases (RDBMS) from non-relational databases, and the horizontal axis separates operational (OLTP) from analytical (OLAP) systems.

"Non-relational" is a loose category defined mainly by contrast with the traditional relational model: any system whose model does not match the traditional relational one is classified as non-relational.

Non-relational (NoSQL) systems can be further divided into key-value stores, column stores, document stores, graph databases, and so on.

  • Document stores: MongoDB, Elasticsearch, Amazon DocumentDB, Azure Cosmos DB, etc.
  • Key-value stores: Redis Labs, Oracle Berkeley DB, Amazon DynamoDB, Aerospike, LevelDB, etc.
  • Graph databases: Neo4j, etc.
  • Time-series databases: InfluxDB, Timescale, etc.
  • Wide-column stores: DataStax, Cassandra, Apache HBase, Bigtable, etc.


Relational model

The relational model is the database model that most developers meet first and use most. It is based on set theory and is the most classic database model. A relational database models data as two-dimensional tables of rows and columns. It fits scenarios where the data model is known in advance and relatively stable, but queries need to be flexible: you store the data in rows and columns and combine it as needed at query time. It is a poor fit for data with many hierarchical levels and many interrelated records, which tends to increase query complexity and reduce query performance.

Relational databases are used for most business data processing, mostly transaction processing (ERP systems, bank transactions, airline booking, sales systems, financial management systems, and so on) and batch processing (customer invoices, payroll, reports, and so on).

Since the 1970s, the relational database has stood the test of time. Its simple data model and classic SQL queries underpin most of today's Internet systems, such as online forums, social networks, and e-commerce. Behind all kinds of systems there is a robust relational database.

Besides the commercial databases Oracle and SQL Server, the most widely used relational database is MySQL. I also particularly like and admire PostgreSQL, billed as the most powerful open source database in the world.

The world of analysis

Online analytical processing (OLAP) is defined in contrast to traditional online transaction processing (OLTP). OLTP is the main application of traditional relational databases and focuses on basic, day-to-day interactive transactions such as bank transactions. OLAP is the main application of data warehouse systems: it supports complex analytical operations, focuses on analysis and decision support, and presents query results in an intuitive form. OLAP tools let users analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: roll-up, drill-down, and slicing and dicing. Roll-up aggregates data that can be accumulated and computed along one or more dimensions.
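The roll-up and slice operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SALES` rows, the dimension names, and the helper functions `aggregate` and `slice_rows` are all hypothetical and do not correspond to any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, amount).
SALES = [
    (2023, "north", "tea", 10),
    (2023, "north", "coffee", 20),
    (2023, "south", "tea", 5),
    (2024, "north", "tea", 7),
    (2024, "south", "coffee", 8),
]

DIMENSIONS = ("year", "region", "product")


def aggregate(rows, keep):
    """Sum the amount, grouping by the dimensions listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


# Detailed view by year and region; drilling down is the reverse of rolling up.
by_year_region = aggregate(SALES, ("year", "region"))
# Roll-up: drop the region dimension and keep only the year.
by_year = aggregate(SALES, ("year",))
# Slice on region == "north", then aggregate by product.
north_by_product = aggregate(slice_rows(SALES, "region", "north"), ("product",))
```

Rolling up from `by_year_region` to `by_year` simply removes a grouping dimension, which is why OLAP engines can often answer coarser queries from pre-aggregated finer ones.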

OLAP suits large data volumes with infrequent updates; it typically aggregates statistics over large amounts of data. OLTP suits small data volumes with frequent updates.

OLAP is mainly used in business intelligence, risk control analysis, intelligent reporting, and similar business scenarios.

Analysis and transactions are two different worlds. When analysis needs are modest, many teams run analysis directly on the transactional business database, but this only works when the data volume is small, the analysis requirements change little, and the analysis is shallow. Real data analysis scenarios usually use a separate data warehouse. Without affecting the business database, data is extracted from it in real time or in periodic batches, cleaned and transformed into an analysis-friendly schema, and then loaded into the data warehouse. This process of importing data into the warehouse is called Extract-Transform-Load (ETL).


There is no sharp boundary between OLTP and OLAP, but some typical characteristics of each are as follows:

Property              | OLTP                                                  | OLAP
----------------------|-------------------------------------------------------|------------------------------------------------------------
User                  | Operators, junior management                          | Decision makers, senior managers
Function              | Daily operations                                      | Analysis and decision-making
DB design             | Application-oriented                                  | Subject-oriented
Data                  | Current, new, detailed, two-dimensional, discrete     | Historical, aggregated, multidimensional, integrated, unified
Access                | Reads and writes tens to hundreds of records          | Reads millions of records
Read characteristics  | Returns a small amount of data by key                 | Aggregates over large amounts of data
Write characteristics | Random access, low latency                            | Batch or data stream
DB size               | 100 MB to GB                                          | 100 GB to TB
Time requirement      | Real time                                             | Not strict
Main application      | Database                                              | Data warehouse

There are many excellent open source OLAP systems in the industry, such as:

  • Druid: an open source distributed system developed by Metamarkets for real-time processing of big data. It is now an Apache open source project.
  • Kylin: Apache Kylin™ is an open source, distributed analytical data warehouse that provides a SQL query interface and multidimensional analysis (OLAP) on Hadoop/Spark for very large datasets. It was originally developed by eBay and contributed to the open source community. It can query huge tables with sub-second latency.
  • Presto: an open source distributed SQL query engine for running interactive analytics on petabyte-scale data.
  • ClickHouse: a column-oriented OLAP system developed by Yandex, known as the "Russian Google".

Column storage

Traditional OLTP databases usually use row storage. All the columns of a record are laid out in order to form a row, and data is stored row by row. With a B+ tree or SSTable as the index, the row for a given primary key can be found quickly.

[Figure: row-oriented storage layout]

Row storage suits OLTP scenarios. Most OLTP operations work on entities, that is, they create, delete, update, and query individual records. Placing the data of a row in physically adjacent positions therefore makes these operations convenient and easier to optimize.

In OLAP scenarios, operating on a single record is rare. OLAP analysis usually targets large datasets and groups, filters, and aggregates specific columns over them. Column storage therefore places the values of each column physically adjacent to one another.

[Figure: column-oriented storage layout]

With this layout, to analyze or aggregate a particular column you only need to locate the file or data blocks that hold that column. For example, to compute the average age in the figure above, you only need to read the age column. A row-oriented storage engine, by contrast, still has to load all rows from disk into memory, parse them, and filter out those that do not meet the criteria, which can take a long time.
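The difference between the two layouts can be sketched in memory with plain Python structures. The records and field names below are hypothetical, and real engines work on disk pages rather than lists, so treat this as an illustration of the access pattern only.

```python
# Hypothetical records; the field names are illustrative only.
ROWS = [
    {"id": 1, "name": "Ann", "age": 30},
    {"id": 2, "name": "Bob", "age": 40},
    {"id": 3, "name": "Cid", "age": 50},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def average_age_row_store(rows):
    # Every full record is visited even though only "age" is needed.
    return sum(row["age"] for row in rows) / len(rows)


def average_age_column_store(columns):
    # Only the "age" column is touched; "id" and "name" are never read.
    ages = columns["age"]
    return sum(ages) / len(ages)


COLUMNS = to_columns(ROWS)
row_result = average_age_row_store(ROWS)
column_result = average_age_column_store(COLUMNS)
```

Both functions return the same average; the column version simply reads far less data when a table has many columns.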

Column-based storage naturally has the following advantages:

  • Automatic indexing

    Because data is stored by column, each column is itself roughly equivalent to an index. Operations that would otherwise need an index do not require an additional data structure built for that column.

  • Good for data compression

    There are two reasons. First, most columns contain many repeated values. Taking the data above as an example, the same author publishes multiple blog posts, so the number of distinct values in the author column must be smaller than the number of posts, and the column does not need to store one full value per post. Second, all values in a column share the same data type, which helps optimize data structure padding and compression; for numeric columns, more effective compression algorithms can be applied.
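Two common ways to exploit repeated values in a column are dictionary encoding and run-length encoding. The sketch below is a minimal illustration with a hypothetical `authors` column; the function names are made up for this example, and production column stores use more elaborate encodings.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []  # code -> original value
    codes = {}       # original value -> code
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(codes[value])
    return dictionary, encoded


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical author column: few distinct values, many repeats.
authors = ["alice", "alice", "alice", "bob", "bob", "alice"]
dictionary, codes = dictionary_encode(authors)
runs = run_length_encode(codes)
```

Here six strings shrink to a two-entry dictionary plus three runs, and decoding the runs and looking up the codes restores the original column.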

The idea of columnar storage has existed for a long time; it has become popular in recent years because it matches the needs of the times. Many excellent columnar databases have emerged, and even many formerly row-oriented systems have gained column storage capabilities.

  • HBase: a distributed, column-oriented open source database. The technology comes from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Unlike a typical relational database, HBase is suited to storing unstructured data, and it is column-based rather than row-based.
  • Cassandra: originally developed at Facebook to improve search performance over simply formatted data in its inbox (messaging) system. It combines Google Bigtable's data model with the fully distributed architecture of Amazon Dynamo. After Facebook open-sourced Cassandra, its good scalability led many well-known websites to adopt it, and it became a popular distributed structured data store.
  • Most of the OLAP databases mentioned in the previous section are also column-oriented, for example Druid and ClickHouse.

Search is no longer arcane

Full-text search used to be a profound technology, and the algorithms and techniques behind web-scale search engines such as Google are still hard to replicate. But today the search feature of apps and websites large and small can easily be built on a powerful open source system: Elasticsearch, a distributed real-time full-text search database based on Lucene.

In his London apartment, Shay Banon was busy looking for a job while his wife studied cooking at Le Cordon Bleu. In his spare time, he began writing a search engine to help his wife manage her growing collection of recipes.

His first iteration was called Compass. The second was Elasticsearch (based on Apache Lucene). He released Elasticsearch to the public as an open source product and created the Elasticsearch IRC channel, and then waited for users to appear.

The public response was very strong, and users took to the software naturally. As usage grew rapidly, the software developed its own community and attracted attention, notably from Steven Schuurman, Uri Boness, and Simon Willnauer. The four of them eventually founded a search company.

A search tool that a programmer built to help his wife manage recipes eventually became a powerful full-text search database. It seems that one of the great sources of inspiration for programmers is still being "object-oriented" (a pun: in Chinese, the word for "object" also means one's partner).


Full-text search extracts part of the information in unstructured data and reorganizes it so that it has some structure, then searches that structured data to make lookups relatively fast. The information extracted and reorganized in this way is called the index. The index maps terms to documents, so the corresponding documents can be retrieved through it; this mapping from terms to documents is called an inverted index. The process of building an index first and then searching the index is called full-text search.
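The build-then-search process can be sketched with a tiny inverted index in Python. This is an assumption-laden illustration: the tokenizer is a simple lowercase split, the documents are hypothetical, and engines such as Lucene add analysis, scoring, and compressed posting lists on top of this idea.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a deliberately naive analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents keyed by id.
DOCS = {
    1: "Tomato soup recipe",
    2: "Chicken soup with noodles",
    3: "Tomato salad",
}
INDEX = build_inverted_index(DOCS)
```

A query such as `search(INDEX, "tomato soup")` intersects the posting sets of both terms, so it never has to scan the document texts themselves.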

Any discussion of full-text search has to mention Lucene, an open source full-text search engine toolkit from Apache. It provides a complete query engine and indexing engine, plus part of a text analysis engine.

Elasticsearch is a distributed, open source full-text search database based on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Elasticsearch is developed in Java and was released as open source under the Apache license. It is a popular enterprise search engine, designed for cloud computing, that delivers real-time search and is stable, reliable, fast, and easy to install and use. Behind the search feature of many systems there is in fact an Elasticsearch service. Elasticsearch is also often used for log search and data analysis.

K-V cache overlord

In a computer system, disk and network are the slowest parts. The most important thing in a system is its data, and that data ultimately lives on disk. The tension between slow disk reads and writes and the high concurrency demanded of the system is therefore the main problem a system has to solve. To put it bluntly, almost all system optimization is about easing this tension.

To improve response speed and concurrency, one of the most common techniques is caching. The access speeds of CPU, memory, disk, and network differ by orders of magnitude, and caching is the most common way to bridge those gaps. Between the CPU and memory sits the CPU cache; memory and disk also have their own caches.


In distributed systems, a distributed cache is often used to relieve the pressure of database access.
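One common way to put a cache in front of a database is the cache-aside pattern: read from the cache first, and only on a miss load from the database and remember the result for a while. The sketch below is an in-process illustration under assumptions: the class `CacheAside`, the loader `slow_db_lookup`, and the injected fake clock are all hypothetical, and a real deployment would typically use a shared cache rather than a local dict.

```python
import time


class CacheAside:
    """Read-through helper: check the cache, fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the database is not touched
        value = self._load(key)  # miss or expired entry: go to the database
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the database row is updated."""
        self._entries.pop(key, None)


# Hypothetical "database" and a counter of how often it is actually read.
DB = {"user:1": "Ann"}
db_reads = []


def slow_db_lookup(key):
    db_reads.append(key)
    return DB[key]


now = [0.0]  # fake clock so that expiry can be shown without sleeping
cache = CacheAside(slow_db_lookup, ttl_seconds=10, clock=lambda: now[0])
first = cache.get("user:1")   # miss: loads from the database
second = cache.get("user:1")  # hit: served from the cache
now[0] = 11.0                 # advance the fake clock past the TTL
third = cache.get("user:1")   # expired: loads from the database again
```

The expiration time bounds how stale a cached value can get, which is the usual trade-off between database load and freshness.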

Redis is a high-performance key-value database. It supports a relatively rich set of value types, including string, list (linked list), set, zset (sorted set), and hash. Redis supports cache expiration, atomic operations, data persistence, and cluster mode.

  • K-V cache: K-V data is stored in Redis to speed up data access and reduce the load on the database. This is a common system optimization strategy.
  • Distributed lock: a distributed lock is a globally shared critical resource; holding that resource exclusively gives the effect of a global lock. Any globally shared resource can be used to implement a distributed lock, even MySQL or a distributed file system. Redis-based distributed locks are a common implementation.
  • PubSub: publish/subscribe messaging is not really the job of a distributed cache, but Redis implements it, and it works well in simple publish/subscribe scenarios.
  • Bloom filter: the presence of a key is indicated by bits set to 0 or 1, and a whole set of data is represented by a bit set; this is the basis of a simple Bloom filter. Compared with storing a Boolean per key in a hash-like structure, a Bloom filter saves a great deal of space. Redis offers a Bloom filter implementation. Bloom filters are often used for true/false membership checks over large amounts of data, such as whether a cache entry exists or whether a user among many has a given permission.
  • HyperLogLog: used to estimate cardinality quickly. Cardinality is the number of distinct elements (similar to SQL COUNT DISTINCT).
  • Tools: some useful tools from the Java stack are JetCache, an annotation-based caching framework open-sourced by Alibaba, and Redisson, a powerful Redis client for Java.
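To make the Bloom filter idea from the list above concrete, here is a minimal in-memory sketch. It is not how Redis implements the feature: the bit-array size, the number of hash functions, and the use of seeded SHA-256 digests are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


seen = BloomFilter()
seen.add("user:1")
seen.add("user:2")
```

A key that was added always reports `True`, while an absent key usually reports `False`; the rare false positive is the price paid for storing only bits instead of the keys themselves.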

Small and fine

Most database systems we use follow the client-server model: the database service runs on a server as a resident process, and applications access it over TCP/IP. There are also embedded databases, which run on the local machine. An embedded database is embedded in the application, starts together with it, and stores its data on the local disk. Such databases are lightweight, generally use little memory, and have compact code.

  • SQLite: ACID-compliant, implements most of the SQL standard, and can be accessed through JDBC.
  • H2: a relational database written in Java that can be embedded in Java applications or run as a separate database server. It is commonly used as the embedded database in Spring Boot.
  • Berkeley DB: an efficient embedded database library and key-value database.
  • LevelDB: a persistent, single-machine K-V database open-sourced by Google, with high random-write and sequential read/write performance. LevelDB uses an LSM (log-structured merge) strategy. RocksDB, developed by Facebook on top of LevelDB, is also a high-performance embedded key-value storage engine. LevelDB and RocksDB are often used as storage engines. For example, early versions of InfluxDB, a powerful time-series database, used LevelDB as the underlying storage engine; RocksDB is a storage engine for checkpoints in the stream processing framework Flink; and the well-known distributed actor framework Akka also uses RocksDB as its default checkpoint storage. Because of their strong sequential read/write performance, they are also often used as storage engines for a WAL (write-ahead log).
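The LSM (log-structured merge) strategy mentioned for LevelDB can be illustrated with a toy in-memory version: writes go to a memtable, full memtables are flushed as immutable sorted segments, reads check the newest data first, and compaction merges segments. The class `TinyLSM` and its parameters are hypothetical; real engines add a write-ahead log, on-disk files, levels, and tombstones for deletes.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store kept entirely in memory."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new immutable sorted segment."""
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        """Merge all segments into one, keeping only the newest value per key."""
        merged = {}
        for segment in self.segments:  # oldest first, so newer values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable full: flushed as the first segment
db.put("a", 3)
db.put("c", 4)  # flushed as the second segment, which shadows the old "a"
db.put("d", 5)  # stays in the memtable
```

Writes never modify existing segments, which is what turns random writes into sequential ones; the cost is that reads may have to consult several segments until compaction merges them.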

Besides being used on small devices such as mobile clients, these small and refined embedded databases are also often used as the storage engine of self-developed database systems. Such systems take an embedded database as the storage engine and add their own specialized functionality on top. For example, they may add distribution to build a distributed storage system on LevelDB; store data in a special model, as a time-series database does; or log operations locally and retry submission to implement an eventually consistent distributed transaction solution.

3、 Connecting the preceding with the following

In the first two chapters we looked at the history of database systems, the ways they can be classified, and many database systems aimed at different scenarios. This gives us a foundation for choosing a database system. So how should we choose a suitable storage solution?

  1. Selection is driven by requirements. First define the usage scenario clearly, then choose a storage solution that fits it.
  2. No investigation, no right to speak. Evaluating solutions is itself a process of investigation: you need to understand the basic characteristics of the different databases before you can choose an appropriate one.

Basic scenario

These scenarios mirror the classification of database systems in the previous chapter. In fact, that classification arose because systems were designed for different scenarios: different implementations were optimized for different scenarios and gradually evolved into specialized models for them.

  • Conventional (relational): databases such as MySQL are the most common storage choice for transactional systems. They satisfy ACID, are easy to use, and support reads and writes at the scale of tens of millions of records.
  • Analytical: suitable for BI, data reporting, data monitoring, and other data service systems.
  • Document: suitable for highly variable data models. When you do not know in advance what your data will look like, a document database is a good choice; it also suits point queries and queries over denormalized (redundant) collections.
  • Graph: a special and relatively new type of database that focuses on the relationships between data rather than on the data values themselves. It suits recommendation engines, access control, and geographic data.
  • Time-series: widely used for data analysis, time-series display, and monitoring. It suits querying, filtering, combining, and aggregating large volumes of time-based data.
  • K-V: suitable for caching and for displaying data in fixed views. Data must be combined and stored according to how it will be queried, so that a query can fetch it by key.

Reading and writing

  • Are transactional writes required?
  • Are reads and writes sequential or random?
  • Are queries mostly point queries or analytical queries over large datasets?
  • Do the data structure or query patterns change a lot?

Data volume

Consider not only the current amount of data but also how fast it grows, and therefore both the order of magnitude the database can carry and its ability to scale horizontally.

Data usage

Temporary data and important business data can be stored with solutions that have different emphases. Consistency requirements also affect the choice of storage system, as do transactional requirements and how long the data must be retained.


Reliability

Data reliability is the ability to keep data available. Reliability and cost are two sides of a trade-off, and different availability requirements call for different storage architectures.


Scalability

Scalability shows up both in how data usage can grow and in how the system itself can be extended.


Maintainability

  • Operability: make it easy for the operations team to keep the system running smoothly.
  • Simplicity: reduce the complexity of the system so that new engineers can understand it easily.
  • Evolvability: make it easy for later engineers to improve the system and adapt it to atypical scenarios as requirements change. This is also known as extensibility, modifiability, or plasticity.

Learning about underlying data storage not only helps us design a good storage architecture, it also teaches us many low-level techniques that pure business development seldom touches. Understanding and mastering the underlying technology in turn lets us understand our whole business system better and make sound choices when optimizing it. It can also help us implement systems of our own.

The well-designed distributed architectures of open source database systems, their excellent network communication, their powerful memory and disk access optimizations, and their many classic data structures and algorithms are all worth studying.

4、 Unity of knowledge and action

Knowledge is the intent of action, and action is the work of knowledge; knowledge is the beginning of action, and action is the completion of knowledge. -- Wang Yangming

This chapter will briefly explain some common technical points of database system.

System architecture

Master-slave architecture is arguably the most commonly used data storage architecture. Relational databases such as MySQL, PostgreSQL, and Oracle, NoSQL systems such as MongoDB, and message queues such as Kafka and RabbitMQ all use it.


In such a system, the master handles writes, and each slave stays consistent with the master by replicating its data. Both master and slaves can serve reads. Master-slave architecture addresses high availability of data (the slaves hold copies of the data) and also increases read concurrency (multiple slaves serve read requests at the same time).
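The write and read paths described above can be sketched with a small in-memory model. Everything here is hypothetical (the `Node` and `ReplicatedStore` classes, log-shipping as the replication mechanism, round-robin read routing); it only illustrates that writes go to the master, slaves catch up by replaying the master's log, and a read served by a lagging slave can be stale.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0  # number of log entries this node has applied


class ReplicatedStore:
    """One writable master plus read-only slaves that replay the master's log."""

    def __init__(self, replica_count=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(replica_count)]
        self.log = []  # ordered list of (key, value) writes
        self._next_reader = 0

    def write(self, key, value):
        # All writes go through the master and are recorded in its log.
        self.log.append((key, value))
        self.master.data[key] = value
        self.master.applied = len(self.log)

    def replicate(self):
        # Each slave applies the log entries it has not seen yet.
        for slave in self.slaves:
            for key, value in self.log[slave.applied:]:
                slave.data[key] = value
            slave.applied = len(self.log)

    def read(self, key):
        # Reads are spread round-robin across the master and all slaves.
        nodes = [self.master] + self.slaves
        node = nodes[self._next_reader % len(nodes)]
        self._next_reader += 1
        return node.name, node.data.get(key)


store = ReplicatedStore(replica_count=2)
store.write("x", 1)
r1 = store.read("x")  # served by the master
r2 = store.read("x")  # served by slave-0 before replication: stale
store.replicate()
r3 = store.read("x")  # served by slave-1 after replication
```

The stale read in `r2` is the replication lag that asynchronous master-slave setups have to live with, and it is the starting point for the consistency discussion that follows.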

In a master-slave architecture, if the single master fails, the database system becomes unavailable for writes. In this case a master-master architecture can be used, with multiple master nodes in the system at the same time. However, the masters do not all accept writes simultaneously: at any moment only one master is writable, and the others act as standbys. When the writable master becomes unavailable, a standby is promoted to be the master and takes over writes. A standby master can still serve reads. This architecture only solves the availability problem of a single master; it does not solve the problem of excessive load on a single master. Only one master accepts writes at a time in order to keep written data consistent.

Data consistency

A copy of the same data stored on a different node is called a replica. As soon as a system holds multiple replicas of data, it has a data consistency problem. Keeping multiple replicas consistent has always been the biggest challenge for distributed systems. Multi-node data synchronization generally relies on replication: either slave nodes copy data from the master, or multiple nodes replicate to each other. Whichever approach is used, some inconsistency is unavoidable.

Data consistency can be divided into eventual consistency and strong consistency. A strong consistency model lets you move a single-node application to a cluster of distributed nodes without introducing errors. Strong consistency is usually achieved by sacrificing availability: a write fails if consistency across all replicas cannot be guaranteed. In an eventual consistency model, once updates stop, all replicas become consistent after some unspecified period of time. Before that point the replicas may be inconsistent in some cases, but the data eventually converges; that is, all replicas are expected to end up with the same value.

During convergence, many problems must be solved to guarantee that the data ends up consistent, for example ordering (timing) between systems, atomic commit, and consensus.

CAP theorem

Theorem: a distributed system cannot satisfy the three basic requirements of consistency, availability, and partition tolerance at the same time.

  • Consistency: all nodes see the same data at the same time
  • Availability: the failure of some nodes does not prevent the surviving nodes from continuing to operate
  • Partition tolerance: the system continues to operate even when messages are lost because of network or node failures

Combining these three properties in pairs gives the following three types of systems:

  • CA: full strict quorum protocols, for example two-phase commit (2PC)
  • CP: majority quorum protocols, such as Paxos and Raft
  • AP: protocols that use conflict resolution, for example Dynamo and gossip-based systems

Both CA and CP systems provide strong consistency. The difference is that a CA system cannot tolerate node failure, whereas a CP system with 2F + 1 nodes can tolerate the failure of F nodes.
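
The relationship between cluster size and tolerated failures in a majority-quorum system can be checked with a little arithmetic. The sketch below is only an illustration; the function names `majority` and `tolerated_failures` are made up for this example and do not come from any particular system.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of nodes that may fail while a majority can still be formed."""
    return n - majority(n)


for n in (3, 5, 7):
    print(f"nodes={n} majority={majority(n)} tolerated_failures={tolerated_failures(n)}")
```

For a cluster of 2F + 1 nodes the majority is F + 1, which leaves exactly F nodes that can be lost.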


As mentioned above, replicas only guarantee the availability of data. To improve read and write capacity for large data sets, we can split the data into different partitions and process them separately. This is sharding.

Sharding means dividing a data set into small independent data sets, which reduces the pressure that data growth puts on a single node. Sharding has the following benefits:

  • Improved performance: the amount of data in each partition is limited, which reduces the data pressure on each node
  • Improved availability: partitions are independent, so a failure in one partition does not affect the others, and the system can keep running with failed nodes

Partitioning naturally brings some problems. The first thing to consider is how to partition:

  • By key range: divide the data into ranges according to the key, and write the data of the same range to the same node. For example, if user IDs are distributed over [1, 1000000] and the data must be spread across 10 nodes, it can be divided into ten ranges of 100,000 IDs each.
  • By key hash: compute the partition number with a hash algorithm, and write the data to the partition with that number.
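
Both strategies can be sketched in a few lines. The constants mirror the example above (IDs 1 to 1,000,000 over 10 nodes). The function names and the choice of SHA-256 as the hash are assumptions made for illustration, not taken from any specific database.

```python
import hashlib

NUM_NODES = 10
MAX_ID = 1_000_000
RANGE_SIZE = MAX_ID // NUM_NODES  # 100,000 IDs per node


def range_partition(user_id: int) -> int:
    """Map an ID in [1, MAX_ID] to a node index by key range."""
    if not 1 <= user_id <= MAX_ID:
        raise ValueError("user_id out of range")
    return (user_id - 1) // RANGE_SIZE


def hash_partition(key: str, num_partitions: int = NUM_NODES) -> int:
    """Map a key to a partition number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


print(range_partition(1), range_partition(100_001), range_partition(1_000_000))
print(hash_partition("user-42"))
```

Range partitioning keeps neighbouring keys together, which helps range scans. Hash partitioning spreads keys more evenly but scatters adjacent keys across nodes.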

Data partitioning brings load skew and hotspot problems. Because the data is unpredictable, the partitions computed from the keys may concentrate in a few ranges, so that some nodes hold noticeably more data than others. Data concentrated on one node in this way is a data hotspot. Because of hotspots, the load of the whole system tilts toward these nodes and becomes unbalanced between partitions, which is the load skew problem.

Decentralization: Dynamo

Dynamo is a distributed storage system built by Amazon. Amazon published the paper "Dynamo: Amazon’s Highly Available Key-value Store", which explains the Dynamo architecture and has made Dynamo a reference architecture for many data storage systems.

Dynamo combines several well-known techniques to achieve scalability and high availability:

  • Data is partitioned and replicated using consistent hashing
  • Consistency is facilitated by object versioning
  • Consistency among replicas is maintained by a quorum-like technique and a decentralized replica synchronization protocol
  • Failure detection and membership management are distributed and based on a gossip protocol
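
The first technique, consistent hashing, can be sketched as follows. This is a minimal illustration and not Dynamo's actual implementation: the class name `ConsistentHashRing`, the use of SHA-256, and the default of 100 virtual nodes per physical node are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Place a string on the ring as a 64-bit integer position."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._tokens = []   # sorted positions on the ring
        self._owner = {}    # position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several virtual positions on the ring.
        for i in range(self.vnodes):
            token = _hash(f"{node}#{i}")
            self._owner[token] = node
            bisect.insort(self._tokens, token)

    def remove_node(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._tokens:
            raise LookupError("ring is empty")
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect_right(self._tokens, _hash(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))
```

When a node is added, only the keys that fall just before its positions on the ring move to it. All other keys stay where they were, which is what allows nodes to be added incrementally.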

Dynamo is a completely decentralized system.

Adding storage nodes to Dynamo, or removing them, does not require manual partitioning or redistribution to rebalance the data between nodes.

Dynamo adopts eventual consistency.

The architecture of a production-grade storage system is complex. In addition to the component that actually persists data, the system needs scalable and robust solutions for load balancing, membership, failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management.

The following table summarizes the technologies used by dynamo and the benefits of each.

Data partitioning
  • Technology: consistent hashing
  • Benefits: incremental scalability
High availability for writes
  • Technology: vector clocks with reconciliation during reads (conflicts are resolved at read time)
  • Benefits: version size is decoupled from update rates
Handling temporary failures
  • Technology: sloppy quorum and hinted handoff
  • Benefits: high availability and durability are still provided when some replicas are unavailable
Recovery from permanent failures
  • Technology: anti-entropy based on Merkle trees
  • Benefits: replicas with diverging versions are synchronized in the background
Membership management and failure detection
  • Technology: gossip-based membership protocol and failure detection
  • Benefits: preserves the symmetry of the architecture and avoids a centralized registry for membership and node status information
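
To make the vector clock entry more concrete, here is a minimal sketch of how two versions of an object can be compared. The representation (a dict from node name to counter) and the function names are assumptions chosen for illustration; the paper describes the idea, not this code.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock in which `node` has recorded one more update."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every update that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent updates: must be reconciled at read time


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: element-wise maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


v1 = increment({}, "sx")    # written through node sx
v2 = increment(v1, "sx")    # overwritten through sx
v3 = increment(v2, "sy")    # updated through sy
v4 = increment(v2, "sz")    # concurrently updated through sz
print(compare(v2, v1), "|", compare(v3, v4), "|", merge(v3, v4))
```

Versions `v3` and `v4` both descend from `v2` but not from each other, so a reader that sees both has to reconcile them and write back a version carrying the merged clock.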

The distributed database Cassandra is a typical Dynamo-style implementation.

Master-based architecture: BigTable

BigTable is a distributed database system developed by Google. It is a typical master-based architecture.

BigTable is mainly composed of three components:

  1. A client library, linked to each client
  2. A master server
  3. Multiple tablet servers

The master is responsible for:

  1. Assigning tablets to tablet servers
  2. Detecting the addition and expiration of tablet servers
  3. Balancing the load across tablet servers
  4. Garbage collection (GC)
  5. Handling schema changes, such as the creation of tables and column families

The BigTable master is responsible only for metadata management. Each tablet server handles the reads and writes of the tablets it manages. Clients only synchronize metadata with the master; data travels directly between the client and the tablet servers without passing through the master node. As a result, the load on the master node in BigTable is very low.

In master-based architectures, the responsibilities of the master vary. In one design, the master only coordinates: it manages metadata and the data nodes, and the client obtains metadata from it and then communicates directly with the corresponding data nodes.

In another design, the client does not communicate with the data nodes directly but only with the master. The master holds more of the relevant metadata and forwards each request to the corresponding data node.

The coordinator-worker architecture is used by many distributed databases. Interested readers can look at the author's earlier article on the architecture design of Druid.


A database index improves the efficiency of data retrieval. The data records of a database system are stored on disk. Without an index, retrieving a record means scanning all the data segments on disk. This O(N) access cost and full disk scan is unusable in a real database system. To improve retrieval, the database system builds an index structure over the data records on disk. The index is either kept in memory or stored on disk in blocks (so that only a few disk reads are needed to load it into memory). A lookup finds the key in the in-memory index, obtains the data ID or disk location, and then reads the record from disk.

Here the index does two things:

  • It turns a large amount of disk retrieval into memory retrieval
  • It uses a suitable data structure to make memory retrieval efficient, replacing the inefficient O(N) scan

Hash index

A hash index is a hash table, a key-value structure similar to Java's HashMap. Suppose we maintain a HashMap index in memory, where the key is the key of the data and the value is the offset at which the data is stored on disk.

  • To read data, first get the disk offset of the data from the in-memory map, then read the data from disk at that offset.
  • To write data, first append the data to the file on disk, then update the in-memory HashMap index.
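
The read and write paths above can be sketched with an append-only file and a Python dict standing in for the HashMap. The class name `HashIndexStore` and the tab-separated record format are assumptions for this example, and keys and values are assumed not to contain tabs or newlines.

```python
import os
import tempfile


class HashIndexStore:
    """Append-only data file plus an in-memory dict of key -> byte offset."""

    def __init__(self, path: str):
        self._index = {}
        self._file = open(path, "a+b")

    def put(self, key: str, value: str) -> None:
        record = f"{key}\t{value}\n".encode("utf-8")
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()      # the record will start at the end of the file
        self._file.write(record)        # 1. append the data to disk
        self._file.flush()
        self._index[key] = offset       # 2. update the in-memory index

    def get(self, key: str):
        offset = self._index.get(key)   # 1. look up the disk offset in memory
        if offset is None:
            return None
        self._file.seek(offset)         # 2. read the record at that offset
        line = self._file.readline().decode("utf-8").rstrip("\n")
        return line.split("\t", 1)[1]

    def close(self) -> None:
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "data.log")
store = HashIndexStore(path)
store.put("user:1", "alice")
store.put("user:2", "bob")
store.put("user:1", "carol")  # the old record stays on disk; the index points to the new one
print(store.get("user:1"), store.get("user:2"), store.get("user:3"))
```

Updates never rewrite old records, so a real system would also need to compact the file periodically. That part is left out of this sketch.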

A hash index sounds almost too simple, but it is a workable indexing method. It is simple and efficient: lookups are O(1), and updates are cheap as well. However, it has obvious shortcomings:

  • The whole hash table must fit in memory, which becomes unbearable for large amounts of data.
  • Only exact-match queries are supported.
  • Range queries cannot be implemented.

B-tree index

B-tree indexes were first introduced in 1970 and have stood the test of time. Today they are still the standard index implementation in almost all relational databases, and many non-relational databases use them as well.

To understand B-tree indexes, start with the binary search tree. A binary search tree is a special binary tree that satisfies the following conditions:

  • Every key in the left subtree is smaller than the key of the parent node
  • Every key in the right subtree is larger than the key of the parent node

[Figure: binary search tree]

The figure above shows a binary search tree. Suppose I want to find the key 208:

  • Start with the root node, which is 136. Comparing 208 > 136, the next step is to find the right subtree of the root node
  • 398 > 208, continue to search the left subtree of 398
  • 250 > 208, continue to search the left subtree of 250
  • 200 < 208, continue searching for the right subtree of 200.
  • The right subtree of 200 does not exist, so there is no 208 in the data, and the search is finished

Let’s look again at 40:

  • From the root node 136, 136 > 40, continue to search the left subtree
  • 80 > 40, continue to search the left subtree
  • 40 = 40, so the node exists. The data ID is read from the node, and the corresponding data can then be fetched by that data ID

In an index structure, each node contains a key and a data pointer (or data ID, disk offset, and so on).

The time complexity of a lookup in a balanced binary search tree is O(log(N)), which is a good result.
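
The two lookups walked through above can be reproduced with a small binary search tree. The tree below contains only the keys named in the walkthrough (136, 80, 398, 40, 250, 200), so it is a reconstruction for illustration and may be smaller than the tree in the original figure. The `row-…` data IDs are made up.

```python
class Node:
    def __init__(self, key, data_id):
        self.key = key
        self.data_id = data_id   # pointer to the record (data ID, disk offset, ...)
        self.left = None
        self.right = None


def insert(root, key, data_id):
    if root is None:
        return Node(key, data_id)
    if key < root.key:
        root.left = insert(root.left, key, data_id)
    elif key > root.key:
        root.right = insert(root.right, key, data_id)
    else:
        root.data_id = data_id
    return root


def search(root, key):
    """Return (data_id, visited_keys); data_id is None if the key is absent."""
    visited = []
    node = root
    while node is not None:
        visited.append(node.key)
        if key == node.key:
            return node.data_id, visited
        node = node.left if key < node.key else node.right
    return None, visited


root = None
for k in (136, 80, 398, 40, 250, 200):
    root = insert(root, k, data_id=f"row-{k}")

print(search(root, 208))
print(search(root, 40))
```

Each comparison discards one subtree, so the number of visited nodes is bounded by the height of the tree.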

A binary search tree can still only look up specific values efficiently. If I need a range query, that is, all the data between two numbers, I have to traverse every node in the tree to check whether it falls in the range. In this case, the time complexity degrades to O(N). The data structure therefore needs to be improved, and most modern databases use an improved form of search tree: the B+ tree.

[Figure: B+ tree]

Compared with the binary search tree, the B+ tree has the following features:

  • Index information (the information that points to the associated table data) is stored only in leaf nodes
  • The remaining nodes are used only to locate the correct leaf node (the leaf nodes contain all the keys)

In a B+ tree, a key can therefore appear in two nodes: the intermediate nodes only help locate the correct leaf node, and only the leaf node holds the information about the associated data.

Let’s find the keys in the range [40, 100] in the B+ tree above:

  • As in the binary search tree, first search for the node with key 40 (or, if 40 does not exist, the node closest to 40)
  • Then follow the linked list of leaf nodes until a key greater than 100 is reached

Assuming the tree contains N nodes and M leaf nodes are visited during the scan, the time complexity of this search is O(log(N) + M). Compared with the O(N) range search in the binary search tree, this has the following advantages:

  • There is no need to read the whole tree, which reduces the number of disk reads (index data is generally stored on disk by page)
  • In most cases M (roughly the size of the requested range) is far smaller than the total data volume N, so O(log(N) + M) is usually much lower than O(N)
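
The cost of a range query can be illustrated without building a full B+ tree. In the sketch below a sorted Python list stands in for the linked leaf level, and `bisect` stands in for the descent through the intermediate nodes. Both are simplifying assumptions, and the keys are example values.

```python
import bisect


def range_scan(sorted_keys, low, high):
    """Return all keys k with low <= k <= high from a sorted sequence."""
    # O(log N): locate the first key >= low, like descending to a leaf.
    i = bisect.bisect_left(sorted_keys, low)
    result = []
    # O(M): walk forward through the ordered keys until one exceeds `high`.
    while i < len(sorted_keys) and sorted_keys[i] <= high:
        result.append(sorted_keys[i])
        i += 1
    return result


leaf_keys = [40, 80, 136, 200, 250, 398]
print(range_scan(leaf_keys, 40, 100))
print(range_scan(leaf_keys, 41, 100))
```

The second call shows the case where the lower bound itself is missing: the scan starts from the closest following key.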

Everything has two sides. The retrieval advantage of the B+ tree index comes at a cost, which shows up mainly when data is inserted or deleted: the tree must keep its keys sorted and its leaf nodes linked, so every change involves maintenance work beyond a simple write.

The B+ tree index has good retrieval performance. To reduce the number of disk accesses, most index systems keep the B+ tree to only 3 or 4 levels, so the number of nodes a B+ tree index can hold is limited. When nodes are updated, the B+ tree must keep itself sorted and balanced, which costs extra work. The time complexity of insertion and deletion in a B+ tree is O(log(N)). This is why it is not recommended to add an index to every field of a table. Instead, consider the actual access patterns and index only the fields that need it. Otherwise, too many indexes hurt the performance of insert, update, and delete operations on the table.


The B+ tree is a page-based index engine. The data it indexes is stored out of order, and the idea of the index is to maintain, in memory, a mapping from each key to the location of its data on disk, while keeping that in-memory structure ordered. There is another kind of storage engine that is file based: it divides the data into file segments and keeps the data within each on-disk segment sorted. Such an engine does not need to keep an ordered table of all the data in memory; a sparse index in memory is enough. A lookup in the memory index does not return a specific record but a segment of ordered data, which is then read into memory and searched in order for the specific record. (How the data is kept in order at write time is covered below.)

LSM (log-structured merge-tree) is such an index structure. The architecture of LSM is as follows:

[Figure: LSM architecture]

SSTable: the on-disk file of an LSM tree is called an SSTable (Sorted String Table). As the name suggests, the data that LSM stores in these disk files is sorted by key. This solves the problem mentioned above, where the index for a large amount of data no longer fits in memory. Because the disk files are ordered, the in-memory index can be a sparse index: the data is logically divided into multiple blocks, the sparse index records only the offset of each block, and an individual record is found by scanning its block. This greatly reduces the number of index entries.

Memtable: the in-memory structure of an LSM tree is called the Memtable. The Memtable is an ordered structure, which can be implemented as a tree or as a skip list. When LSM writes data, it only needs to write to the Memtable. When the Memtable reaches a certain size, it is flushed to disk asynchronously, producing one of the SSTables described above.

Immutable Memtable: while data is being flushed from the in-memory Memtable to an SSTable, LSM avoids the performance problems of read-write locks by keeping the data being flushed as an Immutable Memtable in memory. As the name suggests, this structure cannot be modified, and new data is written only to the Memtable. The Immutable Memtable is read by the flushing thread, and query requests can also access it. If the requested data is still in memory, there is no need to go to disk, which improves query efficiency.

WAL: for the write-ahead log, please refer to my previous article "what is wal you often hear about?". In LSM, to prevent data loss caused by a crash before the data has been flushed to disk, every write goes to the WAL first and only later reaches an SSTable. When the system restarts, LSM replays the WAL to recover the data that had not yet been written to an SSTable. When an SSTable is written, LSM cleans up the WAL entries that are no longer needed so that the WAL does not grow without bound.

How LSM writes data:

  1. Write to the WAL
  2. Write to the Memtable
  3. When the Memtable reaches the threshold, turn it into an Immutable Memtable
  4. Flush it to disk asynchronously

How LSM deletes data: to keep disk writes sequential, LSM does not delete data in place. Instead, it writes a delete marker indicating that the data has been deleted. The data is physically removed only when it is compacted.

How LSM reads data: LSM reads from the Memtable, the Immutable Memtable, and the SSTables in turn, until the data is found or every level has reported that it has no such data. When the data does not exist, every level of files therefore has to be read in turn. LSM can use a Bloom filter to determine whether a key may exist and so avoid pointless file scans.
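
The write, delete, and read paths described above can be put together in a toy model. Everything here lives in memory and the flush is synchronous, so this is only a sketch of the control flow. The class name `TinyLSM`, the `TOMBSTONE` marker, and the tiny Memtable limit are assumptions for illustration; a real engine writes the WAL and SSTables to disk and flushes in the background.

```python
import bisect

TOMBSTONE = object()  # delete marker written in place of a value


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.wal = []          # stand-in for the on-disk write-ahead log
        self.memtable = {}     # a real Memtable is a sorted tree or skip list
        self.immutable = None  # frozen Memtable waiting to be flushed
        self.sstables = []     # newest first; each is (sorted_keys, values)

    def put(self, key, value):
        self.wal.append((key, value))                 # 1. write to the WAL
        self.memtable[key] = value                    # 2. write to the Memtable
        if len(self.memtable) >= self.memtable_limit:
            self.immutable, self.memtable = self.memtable, {}  # 3. freeze it
            self.flush()                              # 4. flush (synchronous here)

    def delete(self, key):
        self.put(key, TOMBSTONE)                      # deletes are just writes

    def flush(self):
        if self.immutable is None:
            return
        keys = sorted(self.immutable)
        self.sstables.insert(0, (keys, [self.immutable[k] for k in keys]))
        self.immutable = None
        self.wal.clear()                              # flushed data no longer needs the WAL

    @staticmethod
    def _visible(value):
        return None if value is TOMBSTONE else value

    def get(self, key):
        for table in (self.memtable, self.immutable or {}):
            if key in table:
                return self._visible(table[key])
        for keys, values in self.sstables:            # newest SSTable first
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return self._visible(values[i])
        return None

    def compact(self):
        merged = {}
        for keys, values in reversed(self.sstables):  # oldest first, newer wins
            merged.update(zip(keys, values))
        live = sorted(k for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [(live, [merged[k] for k in live])]


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # reaches the limit and is flushed to the first SSTable
db.put("a", 3)      # newer value lives in the Memtable
db.delete("b")      # tombstone; reaches the limit and is flushed again
print(db.get("a"), db.get("b"), len(db.sstables))
db.compact()
print(db.sstables)
```

Before compaction the deleted key still occupies space in both SSTables. Only the compaction step drops it and the stale value of `a`.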

Dense index and sparse index: a dense index holds one index entry for every record. A sparse index generally indexes only data blocks or files, skipping over the individual records in between. Sparse indexes therefore take less space than dense indexes.


Data compression has a very noticeable effect on the I/O performance of a database system: it reduces disk space usage, reduces bandwidth usage, and increases throughput. Database systems apply compression in data storage, index storage, data conversion, data backup, and network communication. When compression is introduced into a real-time database, the algorithm must provide a high compression ratio to achieve dense storage, and it must be fast enough to support real-time recording and querying.

The compression process generally consists of two independent parts, modeling and coding. Modeling defines the characteristics of the different symbols in the input stream. The model stores information about how often each symbol appears in the data, that is, the symbol probabilities. Coding is the second part of the compression process. It creates a set of codes for the symbols according to the probabilities provided by the model, producing a compressed version of the data: frequent symbols are given shorter code words and rare symbols longer ones. The uniformity of the data affects the compression ratio of most compression algorithms but has no effect on compression speed. To achieve better results, compression algorithms are therefore designed for particular kinds of data, and different algorithms perform differently on data of different types, orders of magnitude, and combinations. For this reason, most database systems that support compression offer several algorithms and let users choose according to their own data.
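
Huffman coding is one classic instance of this modeling-plus-coding split, and it is used here only as an illustration; the text above does not name a specific algorithm. In the sketch, counting symbol frequencies is the model, and building the code table is the coding step. The function name `huffman_codes` is made up for this example.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a prefix code in which frequent symbols get shorter code words."""
    freq = Counter(text)                      # modeling: symbol frequencies
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:                      # coding: merge the two rarest groups
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = "aaaaaaaabbbc"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes, len(encoded), "bits instead of", 8 * len(text))
```

With eight `a`, three `b`, and one `c`, the symbol `a` receives a one-bit code while the rarer symbols receive two bits each.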

Compression algorithms can be divided into the following two categories:

  • Lossy compression: lossy compression reconstructs only an approximation of the original data, so the data read back after compression is incomplete. This method is usually used for audio, video, and other streaming media files.
  • Lossless compression: lossless compression does not change the original values of the compressed data. It is usually used for text, numbers, and similar data.

What should be considered in compression

  • Size: the size of the compressed file, that is, the compression ratio. The purpose of compression is to reduce data size, so the compression ratio of the algorithm is the primary consideration.
  • Speed: compression speed affects the efficiency of reading and writing data, which is particularly important for real-time systems. Speed and size trade off against each other, so the specific scenario must be considered carefully.
  • Resources: compression saves disk space and bandwidth but increases CPU and memory usage while compressing. The resource consumption of compression therefore also needs to be considered.

Some common compression algorithms or tools (gzip, bzip2, LZMA, XZ, lz4, LZO) are listed and compared below.

Test conditions:

  • Intel Core i5 CPU 750 at 2.67GHz
  • 8GB of DDR3 memory
  • tmpfs as ram disk
  • Linux kernel 3.3.2, gentoo amd64
  • CFLAGS: -pipe -O2 -g -floop-block -floop-interchange -fgraphite
  • bzip2-1.0.6-r3, xz-utils-5.0.3, gzip-1.4

Comparison results of file compression (original data: 445M):

[Figure: compression ratio comparison]

[Figure: compression time comparison]

Each major database system will more or less use compression technology to reduce data storage space and improve system performance. The following are some compression technologies used in database systems:

  • Google uses Snappy in BigTable and MapReduce to compress stored data and network transmissions.
  • SQL Server uses the XPRESS algorithm to compress backup data.
  • Oracle uses its own Oracle Advanced Compression to compress data.
  • MySQL uses the LZ77 algorithm to compress InnoDB tables.
  • Kafka supports the gzip, Snappy, and lz4 algorithms, and includes optimizations for lz4.
  • Druid uses lz4 to compress data.

Numerical compression: delta of delta

Numerical compression is often used to compress the numeric columns of columnar storage. As mentioned earlier, columnar storage places the data of each column in adjacent locations, a layout that lends itself to compression. Let’s look at the delta compression technique used in many columnar stores.

As shown in the figure, assume there are six original values (73, 300, 302, 332, 343, 372). Uncompressed, each value takes 4 bytes, so the six values occupy 6 * 4 = 24 bytes in total. The delta compression algorithm does not store the original values. It first fixes a base number (usually the first value) and then stores each later value as the difference from the value before it. As shown in the second line of the figure, the resulting data set is (73, 227, 2, 30, 11, 29). Because the largest difference is 227, one byte is enough to represent each value, so every value that used to need 4 bytes now needs only 1 byte. One additional byte is needed to store the metadata that describes the differences. The figure also stores the data in blocks, and the final number of bytes required is 7. That is less than a third of the original 24 bytes.

In fact, the figure above illustrates the principle that Lucene uses in Elasticsearch.
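
Delta encoding and decoding of the six example values can be sketched as follows. The function names are made up for this illustration. The sketch stops at computing the differences and leaves out the block layout and the bit packing of the result.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


original = [73, 300, 302, 332, 343, 372]
deltas = delta_encode(original)
print(deltas, "largest value to store:", max(deltas))
print(delta_decode(deltas) == original)
```

Every stored number now fits in a single byte, whereas the largest original value (372) does not.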

Delta compression suits numeric data and is most effective on large data sets whose values lie close together. If the data set is small and sparse, and the largest difference between values approaches the largest value the data type can represent, compression brings little benefit.

Reading and writing

A data storage system deals constantly with disks and networks, so storage systems pursue optimizations in this area relentlessly, for example asynchronous IO, buffered batch reads and writes, append-only writes, reading and writing data by disk page, read-ahead, and memory-mapped files.


Asynchronous IO is the counterpart of synchronous IO. With synchronous IO, each IO operation must finish before the next one can start, so under a large number of concurrent requests IO efficiency drops sharply. Using asynchronous IO for disk and network access improves efficiency noticeably.

MySQL’s InnoDB also uses AIO to improve efficiency. Before InnoDB 1.1.x, AIO was simulated by code inside the InnoDB storage engine. Starting from InnoDB 1.1.x, kernel-level AIO support is provided, called native AIO. In the InnoDB storage engine, read-ahead is done entirely through AIO, and so is the flushing of dirty pages, that is, the write operations to disk.

In Kafka, the broker persists data to disk using Java NIO, Java’s non-blocking IO API. Java NIO improves the concurrent performance of data writes.


Buffer technology is used to coordinate data transmission between devices with different throughput rates.

Buffering should be used when the rate at which data arrives does not match the rate at which it leaves. A buffer is like a reservoir. If too much water arrives from upstream and the downstream cannot drain it in time, the reservoir acts as a buffer: the water waits in the reservoir for a while and is released once the downstream is able to drain again.

Combining buffering with batch sending improves the write rate to the network and to disk. Before data is written to the network or disk, it is placed in a buffer pool. When the buffered data reaches a certain amount, or the buffering time limit is exceeded, the data is sent as a batch. This reduces the number of requests and also reduces the bandwidth and disk overhead caused by the extra data attached to each request.
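
A minimal sketch of such a buffer is shown below. The class name `BatchBuffer` and its parameters are assumptions for illustration. To stay short, the time limit is checked only when a new record arrives; a production client would typically also flush from a background timer.

```python
import time


class BatchBuffer:
    """Collect records and hand them to `sink` in batches."""

    def __init__(self, sink, max_items=100, max_wait_seconds=1.0, clock=time.monotonic):
        self._sink = sink                  # callable that receives a list of records
        self._max_items = max_items
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._items = []
        self._first_added = None

    def add(self, record):
        if not self._items:
            self._first_added = self._clock()
        self._items.append(record)
        too_many = len(self._items) >= self._max_items
        too_old = self._clock() - self._first_added >= self._max_wait
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._items:
            batch, self._items = self._items, []
            self._sink(batch)              # one request carries the whole batch


batches = []
buf = BatchBuffer(batches.append, max_items=3, max_wait_seconds=60)
for i in range(7):
    buf.add(i)
buf.flush()                                # send whatever is left
print(batches)
```

Seven records result in three calls to the sink instead of seven, which is the saving that batching is meant to provide.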

In MySQL, InnoDB uses buffers in several places to improve write performance. For example, insert buffering merges multiple insert requests into one operation, which turns what would have been non-sequential writes into relatively sequential ones and so improves write efficiency. Data can also be written in units of physical disk pages, which makes full use of the disk’s write characteristics.

The clients of Elasticsearch and Kafka both buffer and batch their writes to reduce write concurrency.


In the disk read-write optimization, we can often see the following technologies:

  • Read and write data by disk page: the unit of disk reads and writes is the page. To reduce the frequency of disk access, database systems usually read and write data one page at a time.
  • Read-ahead: some database systems assume that once the user has accessed part of the data, the data stored next to it is likely to be accessed soon, so they read several pages in advance.
  • Disk memory mapping (mmap): a file on disk is mapped into the virtual address space of a process. When reading and writing data, mmap lets the process work directly on the pages in the page cache, which reduces the number of data copies between kernel space and user space and makes their interaction efficient. Because pages are loaded on demand, it can also ease the pressure of memory shortage.
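
Memory mapping can be tried out with Python's standard `mmap` module. The sketch below creates a scratch file of two pages, writes into it through the mapping, and reads the bytes back through ordinary file IO. The file name and contents are arbitrary choices for this example.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pages.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE * 2)       # two pages of zeros

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:   # length 0 maps the whole file
        mapped[0:5] = b"hello"                 # write into the first page
        start = mmap.PAGESIZE
        mapped[start:start + 5] = b"world"     # write into the second page
        mapped.flush()                         # push the modified pages to the file

with open(path, "rb") as f:
    data = f.read()

print(mmap.PAGESIZE, data[:5], data[mmap.PAGESIZE:mmap.PAGESIZE + 5])
```

The writes look like plain memory assignments; no explicit `write` call copies the data from a user-space buffer into the kernel.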

This article has given a brief introduction to a range of technologies. Each of them could be explained in much more depth. If you are interested, please keep an eye on our later articles.

References:

《Designing Data-Intensive Applications》
《Database Software Market: The Long-Awaited Shake-up》
《Distributed systems for fun and profit》
《How does a relational database work》
《Seven Databases in Seven Weeks》
《MySQL Technical Internals: InnoDB Storage Engine》
《Database System Concepts》
《Dynamo: Amazon’s Highly Available Key-value Store》
