How to ensure double write consistency between cache and database?

Time:2021-2-26


A distributed cache is an essential component of many distributed applications. Using one, however, usually means storing data in both the cache and the database and writing to both (double write). Whenever you double write, you face a data consistency problem. So how do you solve it?

Cache Aside Pattern

The classic read/write pattern for cache plus database is the Cache Aside Pattern.
On a read, check the cache first. If the data is not in the cache, read it from the database, put it into the cache, and return the response.

On an update, first update the database, and then delete the cache entry.
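
The read and update paths described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the two in-memory maps stand in for a real database and a real distributed cache, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CacheAsideSketch {
    // Assumption: in-memory stand-ins for the real database and cache.
    static final Map<String, String> database = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Read path: try the cache, fall back to the database, then fill the cache.
    static String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = database.get(key);
        if (fromDb != null) {
            cache.put(key, fromDb);
        }
        return fromDb;
    }

    // Update path: write the database first, then delete the cache entry.
    static void update(String key, String value) {
        database.put(key, value);
        cache.remove(key);
    }

    public static void main(String[] args) {
        update("product:1", "stock=10");
        System.out.println(read("product:1")); // cache miss, loaded from the database
        update("product:1", "stock=9");
        System.out.println(read("product:1")); // entry was deleted, so it is reloaded
    }
}
```

Note that `update` never writes the new value into the cache; the next `read` is what repopulates it.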

Why delete cache instead of update cache?

The reason is simple. In more complex caching scenarios, the cached value is often not just a value taken directly from the database.

For example, when one field of a table is updated, computing the new cached value may require querying two other tables and combining the results.

In addition, updating the cache can be expensive. Must the cache be updated every time the database is modified? In some scenarios, yes, but not when the cached value is costly to compute. If the tables behind a cached value are modified frequently, the cache is updated just as frequently. The question is whether that cache entry is actually read that often.

For example, if a table field behind a cached value is modified 20 or 100 times in one minute, the cache is updated 20 or 100 times. If that entry is read only once in that minute, most of those updates produce cold data that nobody reads. If you simply delete the cache entry instead, it is recomputed at most once in that minute, when it is next read, and the overhead drops sharply. The cached value is computed only when it is actually needed.
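
To make the difference concrete, here is a small sketch that counts how often the expensive computation runs for 100 writes followed by one read. The `recompute` method is a hypothetical stand-in for whatever multi-table calculation produces the cached value; the cache key and the doubling formula are arbitrary choices for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class LazyInvalidationSketch {
    static final Map<String, Integer> cache = new HashMap<>();
    static int sourceValue = 0;    // stands in for the underlying tables
    static int recomputations = 0; // how many times the expensive work ran

    // Hypothetical expensive derivation of the cached value.
    static int recompute() {
        recomputations++;
        return sourceValue * 2;
    }

    // Strategy A: recompute and update the cache on every write.
    static void writeAndUpdateCache(int newValue) {
        sourceValue = newValue;
        cache.put("report", recompute());
    }

    // Strategy B: only delete the cache entry on a write.
    static void writeAndDeleteCache(int newValue) {
        sourceValue = newValue;
        cache.remove("report");
    }

    // Read: recompute only if the entry is missing.
    static int read() {
        Integer cached = cache.get("report");
        if (cached == null) {
            cached = recompute();
            cache.put("report", cached);
        }
        return cached;
    }

    static void reset() {
        cache.clear();
        sourceValue = 0;
        recomputations = 0;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 100; i++) writeAndUpdateCache(i);
        read();
        System.out.println("update on write: " + recomputations); // prints 100
        reset();
        for (int i = 1; i <= 100; i++) writeAndDeleteCache(i);
        read();
        System.out.println("delete on write: " + recomputations); // prints 1
    }
}
```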

Deleting the cache rather than updating it is a form of lazy computation: do not redo a complex calculation on every change regardless of whether the result will be used; recompute it only when it is needed. MyBatis and Hibernate apply the same idea with lazy loading. Suppose a department object holds a list of employees. There is no need to load all 1000 employees every time you query a department, because in 80% of cases you only need the department's own information. Query the department first, and only when you actually access its employees are the 1000 employees queried from the database.

The basic cache inconsistency problem and its solution

Problem: you modify the database first and then delete the cache. If deleting the cache fails, the database holds the new data while the cache holds the old data, so the two are inconsistent.
Solution: delete the cache first, and then modify the database. If the database modification fails, the database still holds the old data and the cache is empty, so there is no inconsistency. The next read misses the cache, reads the old data from the database, and writes it back to the cache.

Analysis of a more complex data inconsistency

Suppose the data changes: the cache is deleted first, and the database is about to be modified, but the modification has not completed yet. At that moment a read request arrives, finds the cache empty, queries the database, gets the old value from before the modification, and puts it into the cache. Then the update finishes modifying the database.

Now the data in the database differs from the data in the cache.
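
The interleaving is easier to see when replayed step by step. The sketch below runs the writer's and the reader's steps on a single thread in exactly the problematic order, so the outcome is deterministic; the maps are in-memory stand-ins, and the key and values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class StaleCacheInterleaving {
    // Replays the race in a fixed order and reports the final state.
    static String replay() {
        Map<String, String> database = new HashMap<>();
        Map<String, String> cache = new HashMap<>();
        database.put("k", "old");
        cache.put("k", "old");

        // Writer, step 1: delete the cache entry.
        cache.remove("k");

        // Reader: cache miss, so it reads the database, which is still old.
        String seenByReader = database.get("k");

        // Reader: puts what it saw into the cache.
        cache.put("k", seenByReader);

        // Writer, step 2: the database update finally completes.
        database.put("k", "new");

        return "db=" + database.get("k") + ", cache=" + cache.get("k");
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "db=new, cache=old"
    }
}
```

The cache keeps the old value until something deletes or expires that entry.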

Why does this problem appear under high concurrency with hundreds of millions of requests?

This problem can only occur when the same piece of data is read and written concurrently. If your concurrency is very low, and especially if read concurrency is very low (say 10,000 visits per day), the inconsistency described above will occur only rarely. But if daily traffic is in the hundreds of millions and there are tens of thousands of concurrent reads per second, then whenever update requests arrive in a given second, the database and cache inconsistency described above can occur.

The solution is as follows:

When updating data, route the operation by the data's unique identifier to a queue inside the JVM. When a read finds that the data is not in the cache, it issues a "re-read the data and update the cache" operation, which is routed by the same unique identifier to the same in-JVM queue.

Each queue has one worker thread, which takes operations from the queue and executes them serially, one at a time. Consider a data change operation that deletes the cache and then updates the database, but has not finished the update. If a read request arrives and finds the cache empty, it puts a cache update request into the queue, where it waits behind the pending write, and the read request then waits synchronously for the cache update to complete.

There is an optimization here. Queuing several cache update requests for the same data back to back is pointless, so they can be filtered: if the queue already contains a cache update request for that data, do not add another one. Just wait for the existing request to complete.
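
One possible way to implement this filtering is to track which keys already have a cache update waiting in the queue. The following is a minimal sketch, not the original author's code: the class, the `pendingRefresh` set, and the method names are all hypothetical, and the queue here holds only cache update requests to keep the example short.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class RefreshDeduplicationSketch {
    // Keys that already have a cache update request waiting in the queue.
    static final Set<String> pendingRefresh = ConcurrentHashMap.newKeySet();
    static final Queue<String> queue = new ConcurrentLinkedQueue<>();

    // Returns true if a new request was enqueued, false if one was already waiting.
    static boolean enqueueRefresh(String key) {
        if (pendingRefresh.add(key)) { // add() is atomic and returns false for duplicates
            queue.add(key);
            return true;
        }
        return false;
    }

    // Called by the worker thread when it picks up the next cache update.
    // The key is unmarked so that later cache misses can enqueue again.
    static String takeNextRefresh() {
        String key = queue.poll();
        if (key != null) {
            pendingRefresh.remove(key);
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(enqueueRefresh("product:1")); // true
        System.out.println(enqueueRefresh("product:1")); // false, already queued
        System.out.println(queue.size());                // 1
    }
}
```

A reader whose request was filtered out simply waits for the cache to be filled by the request that is already queued.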

Once the queue's worker thread has finished the preceding database modification, it executes the next operation, the cache update, which reads the latest value from the database and writes it to the cache.

If the read request, polling within its waiting time limit, obtains the value from the cache, it returns that value directly. If the request has waited longer than the limit, it reads the current (possibly old) value directly from the database this time.
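
The whole scheme can be sketched as follows. This is a simplified illustration under several assumptions: single-thread executors play the role of "one queue plus one worker thread", the maps stand in for the real database and cache, the queue count of 4 is arbitrary, and the reader waits on a future with a timeout instead of polling the cache. All names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class KeyedQueueSketch {
    static final int QUEUE_COUNT = 4; // illustrative value
    static final Map<String, String> database = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();
    static final ExecutorService[] workers = new ExecutorService[QUEUE_COUNT];

    static {
        for (int i = 0; i < QUEUE_COUNT; i++) {
            // One queue with exactly one (daemon) worker thread.
            workers[i] = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            });
        }
    }

    // Operations on the same key always land in the same queue.
    static ExecutorService queueFor(String key) {
        return workers[Math.floorMod(key.hashCode(), QUEUE_COUNT)];
    }

    // Write: delete the cache, then update the database, serialized per key.
    static CompletableFuture<Void> update(String key, String value) {
        return CompletableFuture.runAsync(() -> {
            cache.remove(key);
            database.put(key, value);
        }, queueFor(key));
    }

    // Read: on a miss, queue a cache update behind any pending write and wait.
    static String read(String key, long timeoutMillis) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        CompletableFuture<String> refresh = CompletableFuture.supplyAsync(() -> {
            String value = database.get(key);
            if (value != null) {
                cache.put(key, value);
            }
            return value;
        }, queueFor(key));
        try {
            return refresh.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return database.get(key); // waited too long: read the database directly
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        update("product:1", "stock=10").join();
        System.out.println(read("product:1", 200)); // prints "stock=10"
    }
}
```

Because the cache update is queued behind the write for the same key, it cannot read the database before that write has finished, which is what removes the stale-cache interleaving.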

In a high-concurrency scenario, this solution requires attention to the following problems:

1. Read requests blocking for too long

Because read requests are handled slightly asynchronously, pay close attention to read timeouts. Every read request must return within the timeout period.

The biggest risk of this solution is that data may be updated so frequently that a large backlog of update operations builds up in the queues. Many read requests then time out and fall through directly to the database. Be sure to run simulated tests to see how frequently the data is updated.

In addition, a single queue may hold a backlog of update operations for many different data items, so you need to test against your own business workload. You may need to deploy multiple service instances, each handling a share of the data update operations. If inventory modifications for 100 items pile up in one in-memory queue and each modification takes 10 ms, the read request for the last item may wait 10 ms * 100 = 1000 ms = 1 s before it gets its data, which blocks the read request for a long time.

Based on how the business system actually runs, carry out stress tests that simulate the production environment. Find out how many update operations may pile up in an in-memory queue at the busiest time and how long the read request behind the last of them would hang. If read requests must return within 200 ms and your calculation shows a backlog of at most 10 update operations even at the busiest time, with a wait of at most 200 ms, that is acceptable.

If an in-memory queue may accumulate too large a backlog of update operations, add machines so that the service instance on each machine handles less data and each in-memory queue holds a smaller backlog.

In fact, based on previous project experience, data is usually written infrequently, so the backlog of update operations in the queues should normally be very small. Projects that put a cache in front of the database to handle high read concurrency generally have few write requests; a write QPS of a few hundred is already on the high side.

Let’s make a rough calculation

Suppose there are 500 write operations per second. Split each second into five 200 ms time slices, giving 100 write operations per 200 ms. Spread across 20 in-memory queues, each queue may have a backlog of 5 write operations. If performance testing shows that each write operation generally completes in about 20 ms, the backlog clears in about 5 * 20 ms = 100 ms, so a read request for data in any queue hangs only briefly and returns well within 200 ms.
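
The same estimate can be written as a small helper for trying other numbers. It assumes, as the calculation above does, that writes are spread evenly over the queues and that each write takes a fixed time; the class and method names are hypothetical.

```java
class QueueBacklogEstimate {
    // Time for one queue to drain a given backlog, in milliseconds.
    static long drainMillis(int backlog, int millisPerWrite) {
        return (long) backlog * millisPerWrite;
    }

    // Worst-case wait for a read queued behind one time slice's worth of writes.
    static long worstCaseWaitMillis(int writesPerSecond, int sliceMillis,
                                    int queueCount, int millisPerWrite) {
        int writesPerSlice = writesPerSecond * sliceMillis / 1000;
        // Round up: an uneven split leaves some queue with the larger share.
        int backlogPerQueue = (writesPerSlice + queueCount - 1) / queueCount;
        return drainMillis(backlogPerQueue, millisPerWrite);
    }

    public static void main(String[] args) {
        // 500 writes/s, 200 ms slices, 20 queues, 20 ms per write.
        System.out.println(worstCaseWaitMillis(500, 200, 20, 20)); // prints 100
        // The earlier example: 100 queued inventory updates at 10 ms each.
        System.out.println(drainMillis(100, 10)); // prints 1000
    }
}
```

Real traffic is rarely spread this evenly, so treat the result as a rough estimate to be checked against stress test measurements.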

From this simple calculation, a single machine can comfortably support a write QPS of a few hundred. If the write QPS grows 10 times, add 10 times as many machines, each with 20 queues.

2. Read request concurrency too high

Stress testing must also cover another risk: when the situation above occurs, a large number of read requests may suddenly hang on the service for tens of milliseconds each. Check whether the service can withstand this, and how many machines are needed to handle the peak of the most extreme case.

However, not all data is updated at the same moment, so cache entries do not all become invalid at once. At any given time only a small number of entries are invalid, and only the read requests for those entries queue up, so the concurrency should not be particularly high.

3. Request routing when multiple service instances are deployed

Multiple instances of this service may be deployed, so you must ensure that data update requests and cache update requests for the same data are routed to the same service instance, for example through an nginx server.

For example, route all read and write requests for the same product to the same machine. You can implement hash routing between services based on a request parameter, or use nginx's hash routing feature.
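
If the routing is done in application code rather than in nginx, hashing a request parameter could look like the sketch below. The product ID as the routing parameter, the modulo scheme, and the names are assumptions made for illustration only.

```java
class InstanceRouter {
    // Maps a product ID to an instance index in the range [0, instanceCount).
    static int instanceFor(String productId, int instanceCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(productId.hashCode(), instanceCount);
    }

    public static void main(String[] args) {
        int forRead = instanceFor("product-42", 8);
        int forWrite = instanceFor("product-42", 8);
        System.out.println(forRead == forWrite); // prints true
    }
}
```

With plain modulo hashing, changing the number of instances remaps most products to a different instance; consistent hashing is a common alternative when instances are added or removed often.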

4. Hot products cause request skew

If read and write requests for a certain product are extremely frequent and all of them go to the same queue on the same machine, that machine may come under excessive pressure. However, the cache is cleared only when the product data is updated, and only then do reads and writes contend in the queue. So if the update frequency of the business system is not too high, the impact of this problem is limited, although some machines will indeed carry a higher load.