MySQL architecture design: how to make rational use of the third party cache solution?


The biggest advantage of using a more mature third-party solution is that it can not only save its own R & D costs, but also find more document information on the Internet. It is very helpful to help us solve some daily problems.

At present, there are two popular third-party cache solutions: object-based distributed memory cache software memcached and embedded database programming library Berkeley DB. Next, I will make an analysis and architecture discussion on these two solutions.

1、 Distributed memory cache software memcached
I believe that for many readers, memcached is not too strange. Its popularity is not much worse than mysql. Memcached is so popular because of the following reasons:

The communication protocol is simple and the API interface is clear;
Efficient cache algorithm, event processing mechanism based on libevent, excellent performance;
Object oriented features are very friendly to application developers;
All data are stored in the memory, data access is efficient;
Software open source, based on BSD open source protocol;
For the details of memcached itself, I won’t cover too much here. After all, this is not the focus of this article. Now let’s focus on how memcached can help us improve the scalability of your data service (it may not be appropriate to reuse the database itself here).

To better integrate memcached into the system architecture, we must first make an accurate positioning of memcached in the application system. Is it just a cache tool to improve the performance of data service, or is it better integrated with MySQL database to become a more efficient and ideal data service layer.

① As a cache tool to improve system performance
If we only use memcached to improve system performance, as a cache software, we need more applications to maintain the synchronous update of data in memcached and database. At this time, memcached can be understood as a more front-end cache layer than MySQL database.

If we take memcached as a data cache service of the application system, we basically don’t need to do any transformation for the MySQL database, just maintain and update the cache through the application itself. The biggest advantage of this method is that the database related architecture can not be changed at all, but there is also a drawback, that is, if there are more data objects that need cache, the amount of code that the application needs to increase will increase a lot, and the system complexity and maintenance cost will also rise sharply.

The following is an architecture diagram of using memcached as a simple cache service layer.

As can be seen from the figure, all data will be written to MySQL master, including the insert when the data is first written, as well as the update and delete of existing data. However, for existing data, you need to delete the data in memcached while updating or deleting the data in MySQL to ensure the consistency of the overall data. All read requests will be sent to memcached first. If the data is read, it will be returned directly. If the data is not read, it will be read from MySQL slaves and written to memcached for cache.

Generally speaking, this method is more suitable for the environment where there are few types of objects to cache and the amount of data to cache is relatively large. It is a fast and effective solution for performance problems. Since this architecture has little to do with MySQL database itself, there are not many technical details involved here.

② Data service layer integrated with MySQL
In addition to using memcached as a tool to quickly improve efficiency, we can also use it to improve the scalability of the data service layer, integrate it with our database as a whole, or as a buffer of the database.

Let’s first take a look at how to integrate memcached and MySQL databases as a whole to provide external services. Generally speaking, we have two ways to integrate memcached and MySQL databases as a whole to provide data services. One is to directly use the memory capacity of memcached as the secondary cache of MySQL database to improve the cache size of MySQL server. The other is to communicate with memcached through the UDF of Mysql to maintain and update the data in memcached, while the application side reads the data directly through memcached.

For the first method, it is mainly used in the scenario where the business requirements are very special, it is difficult to segment data, and it is difficult to transform the application to use the cache outside the database.

Of course, we can’t do this under normal circumstances. At present, we have to rely on external forces. The open source project waffle grid is the external force we need to rely on.

Waffle grid is an idea that several foreign DBAs came up with after work: since the low cost of PC server attracts us so much, and its scale up ability is difficult to make a big breakthrough, why not use memcached, which is very popular now, to break through the memory limit of a single PC server? Driven by this idea, several young men started the open source project of waffle grid. Taking advantage of the open source features of MySQL and memcached, and combining the simple characteristics of memcached communication protocol, memcached was successfully implemented as an external “secondary cache” of MySQL host. Currently, it only supports the buffer pool for InnoDB.

The implementation principle of waffle grid is not complicated. What it does is that when InnoDB is in the local buffer pool (let’s call it local buffer pool), before reading data from the disk data file, it attempts to read the corresponding cache data from memcached (let’s call it remote buffer) through the communication API interface of memcached, Only when there is no required data in the remote buffer, InnoDB will access the disk file to read the data. Moreover, only the data in LRU list in InnoDB buffer pool will be sent to remote buffer pool. Once the data is modified, InnoDB will move it to flush list, and waffle grid will clear the data in flush list from remote buffer pool. Therefore, it can be said that there will never be differential pages in the remote buffer pool, which also ensures that the problem of data loss will not occur when the remote buffer pool fails. The following figure shows the architecture of waffle grid project

As shown in the frame diagram, we first apply waffle grid patch on the MySQL database side to communicate with other memcached servers through other connections. In order to ensure the performance of network communication, the private network with high bandwidth is used between MySQL and memcached.

In addition, the database is no longer distinguished between master and slave in the architecture diagram here. It doesn’t mean that it can’t be distinguished. It’s just a schematic diagram. In the actual application process, most of the time, you only need to apply waffle grid on the slave. The master itself does not need such a large amount of memory.

After reading the implementation principle of waffle grid, some readers may have some doubts. Isn’t this a direct impact on the performance of all queries that need to generate physical reads? All operations of reading remote buffer need to be obtained through the network. Is its performance high enough? In this regard, I also use the author’s measured data of waffle to contact your doubts

I don’t think we need to worry too much about the performance of this set of test data obtained from dbt2. As for whether waffle grid is suitable for your application scenario, it can only be evaluated by readers themselves.

Next, let’s introduce another integration method of memcached and MYSQL, that is, through the UDF function provided by mysql, we can write corresponding programs to realize the data communication update operation between MySQL and memcached.

Unlike waffle grid, the data in memcached is not completely controlled and maintained by mysql, but is maintained by the application and MySQL together. Every time an application reads data from memcached, if it finds that it can’t find the data it needs, it reads the data from the database and writes the read data to memcached. MySQL controls the failure cleaning of data in memcached. Every time data in the database is updated or deleted, MySQL calls the memcached API through the UDF written by the user to notify memcached that some data has failed and delete the data.

Based on the above implementation principle, we can design a data service layer architecture as follows:

As shown in the figure, the biggest difference between this architecture and the above one is that the data of memcached is maintained and updated by MySQL database instead of by application. First, the data is written to the MySQL database by the application. At this time, the related UDF written by the user on MySQL will be triggered. Then, the related communication interface of memcached will be called through the UDF to write the data to memcached. When the data in MySQL is updated or deleted, the relevant UDF in MySQL will also update or delete the data in memcached. Of course, we can also make MySQL do less, just when the data is updated or deleted, we can delete the data in memcached through UDF, and the writing work is done by the application just like the previous architecture.

Because of the characteristics of object-based data access and data retrieval through hash in memcached, we need to set a key to identify all data stored in memcached, and all data access operations are carried out through this key. That is to say, if you can’t read the result set containing multiple data by one (or more) keyword condition just like the query statement in mysql, it is only applicable to the data reading method of obtaining single data by a unique key.

2、 Berkeley DB, an embedded database programming library
To tell you the truth, database programming library is a bit awkward, but I really can’t find any other suitable term to call Berkeley DB, so let’s use the more common term on the Internet.

What memcached implements is memory cache. If we don’t have such high performance requirements and budget, we can also choose database cache software like Berkeley DB. Many readers may wonder why we need to use a “database” like Berkeley DB for our MySQL database? In fact, Berkeley DB was also one of the storage engines of MySQL before, but I don’t know why it was removed from the supported storage engine by MySQL in the later period. The reason why we use the database cache like Berkeley DB at the same time is that we can give full play to their respective advantages. While using the traditional general database, we can also use Berkeley DB’s efficient key value storage method as a performance supplement for efficient data retrieval, In order to get better scalability of data service layer and higher overall performance.

Berkeley DB’s own architecture can be divided into five functional modules, which are relatively independent in the whole system, and can be set to use or disable one (or several) module, so it is more appropriate to call it five subsystems. The five subsystems and their basic introduction are as follows:

data access
Data access subsystem is mainly responsible for the most important and basic work of data storage and retrieval. Berkeley DB also supports the following four data storage methods: hash, B-tree, fixed length and dynamic length. In fact, these four ways correspond to the four actual formats of data file storage. The data storage subsystem can be used alone and must be turned on.
transaction management
The transaction management subsystem provides complete acid transaction attributes for data processing services with transaction requirements. When opening the transaction management subsystem, we need to open the most basic data access subsystem, at least the lock management subsystem and the log system to help achieve the consistency and integrity of transactions.
Lock management
Lock management system is to ensure the consistency of data and provide shared data control function. It supports row level and page level locking mechanism, and provides services for transaction management subsystem.
Shared memory
I think you should basically know what the shared memory subsystem does when you see its name. It is used to manage and maintain shared cache and buffer, and provide data cache service to improve the performance of the system.
Log system
The log system mainly serves the transaction management system. In order to ensure the consistency of transactions, Berkeley DB also adopts the strategy of writing log first and then writing data, which is generally used and shut down at the same time with the transaction management system.
Based on the characteristics of Berkeley DB, it is difficult for us to combine it with MySQL database as closely as memcached. Data maintenance and update operations mainly need to be completed through the application. Generally speaking, the main reason to use Berkeley DB as well as MySQL is to improve the performance and scalability of the system. Therefore, most of the time, we mainly use the two data storage formats of hash and B-tree, especially the hash format, which is the most widely used, because this method is also the most efficient.

In the application, each timegameFor data requests, the preset key is used to search Berkeley DB. If there is data, the obtained data is returned. If the bit retrieves the data, it is read in the database again. Then the read data is stored in Berkeley DB according to the preset key, and then returned to the client. When the data is modified, the application must delete the data in Berkeley DB after modifying the data in MySQL. Of course, you can directly modify the data in Berkeley DB if you like, but this may introduce more data consistency risks and increase the system complexity.

In principle, there is little difference between using Berkeley DB and using memcached as a pure cache. Why don’t we use memcached? In fact, there are two main reasons. One is that memcached uses pure memory to store data, while Berkeley DB can use physical disks. There is a big difference in cost between the two. Another reason is that Berkeley DB can support not only the hash storage format used by memcached, but also other storage formats, such as B-tree.

Since the basic principle of using memcached is not very different from that of memcached, we won’t draw a diagram here.