Flow chart + source code in-depth analysis: the principles of cache penetration and breakdown and practical solutions

Time: 2021-1-24

1 overview of the article

The cache system is an important component in Internet-scale scenarios. To keep traffic from hitting the database too frequently, a cache layer is usually placed in front of the database layer as protection.

Cache is a broad concept; its core idea is to store data closer to the user, or to store it in a medium with faster access.

Caches can be divided into in-memory caches and remote caches. Common in-memory caching tools include Guava Cache and Ehcache; common remote cache systems include Redis and Memcached. This article uses the remote cache Redis as its example.

Cache penetration and breakdown are problems that must be faced in high-concurrency scenarios. Both cause requests to bypass the cache and hit the database directly, which can bring the database down or trigger a system avalanche. This article analyzes their principles and solutions following the outline in the figure below.

(Figure: article outline)


2 distinction between cache penetration and breakdown

Judged by the final result, cache penetration and breakdown look alike: traffic bypasses the cache and hits the database, which may bring the database down or cause a system avalanche. On closer inspection, however, there are differences. Let’s start from a general flow chart of how business code reads the cache.

(Figure: business read-cache flow)

Let’s briefly describe this picture in words:

(1) When querying business data, query the cache first. If the cache has the data, return it and the process ends

(2) If the cache has no data, query the database. If the database has no data either, return empty data and the process ends

(3) If the database has data, write it to the cache, return it to the business, and the process ends
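The three steps above can be sketched in code; the in-memory maps standing in for the cache layer and the database are illustrative stand-ins only:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the read flow: cache first, then database, then backfill.
class ReadThroughDemo {
    static final Map<String, String> cache = new HashMap<>();
    static final Map<String, String> db = new HashMap<>();

    static String query(String key) {
        // (1) query the cache first; if present, return it
        String value = cache.get(key);
        if (value != null) {
            return value;
        }
        // (2) cache miss: query the database; if absent there too, return empty data
        value = db.get(key);
        if (value == null) {
            return null;
        }
        // (3) found in the database: write it to the cache, then return it
        cache.put(key, value);
        return value;
    }
}
```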

Suppose the business side wants to query data A. Cache penetration means data A does not exist in the database at all, so there is never anything to write back to the cache; the cache layer becomes meaningless and a large number of requests hit the database over and over.

Cache breakdown means the requests do check the cache before querying the database, which in itself is correct. But concurrency is so high that before the first request has had time to write the data into the cache, a large number of subsequent requests have already queried the cache. Because the data is not in the cache yet, all of those requests hit the database in an instant.


3 CAS examples and source code analysis

Now let’s set the caching problem aside and analyze the source code of CAS. We will borrow this idea later when writing the caching tool.

3.1 an interview question

Let’s take a look at a common interview question that you have probably seen before: analyze the output value of the following code.

class Data {
    volatile int num = 0;

    public void increase() {
        num++;
    }
}

One hundred threads each call data.increase() once, and the main thread prints data.num at the end (the driver code is the same as the VolatileTest class shown in section 3.2). The printed num is generally not 100 but less, because num++ is not atomic. Let’s extract a minimal class to prove it.

public class VolatileTest2 {
    volatile int num = 0;

    public void increase() {
        num++;
    }
}

Execute the following command to get the bytecode:

javac VolatileTest2.java
javap -c VolatileTest2.class

The bytecode file is as follows:

$ javap -c VolatileTest2.class
Compiled from "VolatileTest2.java"
public class com.java.front.test.VolatileTest2 {
  volatile int num;

  public com.java.front.test.VolatileTest2();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: aload_0
       5: iconst_0
       6: putfield      #2                  // Field num:I
       9: return

  public void increase();
    Code:
       0: aload_0
       1: dup
       2: getfield      #2                  // Field num:I
       5: iconst_1
       6: iadd
       7: putfield      #2                  // Field num:I
      10: return
}

Looking at the bytecode for num++, we find three steps:

(1) getfield 
(2) iadd  
(3) putfield

getfield reads the value of num, iadd computes num + 1, and putfield writes the new value back to num. Now it is easy to see why num ends up less than 100: after thread A has read the old value (step 1) but before it writes the new value (step 3), thread B can read that same old value; both threads then write back old + 1, and one increment is lost.
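The lost update can be replayed deterministically by simulating how the two threads’ getfield/iadd/putfield steps interleave (a sketch; the interleaving shown is one of many possible):

```java
// Deterministic replay of the lost update: simulate the three bytecode steps
// (getfield, iadd, putfield) for two threads whose executions interleave.
class LostUpdateDemo {
    static int num = 0;

    static int simulate() {
        int a = num;   // thread A: getfield reads 0
        int b = num;   // thread B: getfield also reads 0
        num = b + 1;   // thread B: iadd + putfield writes 1
        num = a + 1;   // thread A: iadd + putfield also writes 1 — B's increment is lost
        return num;    // 1, not 2: one increment disappeared
    }
}
```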


3.2 CAS case analysis

So how do we solve this problem? There are two common schemes: a locking scheme and a lock-free scheme.

The locking scheme adds the synchronized keyword to increase, which guarantees that only one thread operates at a time. This is not the focus of this article, so we will not expand on it in detail.
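For completeness, here is a minimal sketch of the locking scheme (the class name SyncData is illustrative):

```java
// Marking increase() synchronized makes the read-add-write of num atomic
// with respect to other threads calling methods on the same object.
class SyncData {
    private int num = 0;

    public synchronized void increase() {
        num++;
    }

    public synchronized int get() {
        return num;
    }
}
```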

The lock-free scheme can use the AtomicInteger class provided by JUC (java.util.concurrent). Let’s look at the improved code.

import java.util.concurrent.atomic.AtomicInteger;

class Data {
    volatile AtomicInteger num = new AtomicInteger(0);
    public void increase() {
        num.incrementAndGet();
    }
}

public class VolatileTest {
    public static void main(String[] args) {
        Data data = new Data();
        for (int i = 1; i <= 100; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread.sleep(1000L);
                        data.increase();
                    } catch (Exception ex) {
                        System.out.println(ex.getMessage());
                    }
                }
            }).start();
        }
        while (Thread.activeCount() > 2) {
            Thread.yield();
        }
        System.out.println(data.num);
    }
}

After this rewrite, the result equals 100 as expected, and we did not use any lock. So why does AtomicInteger achieve the expected effect?


3.3 CAS source code analysis

In this section we analyze the source code, using the incrementAndGet method as the entry point.

class Data {
    volatile AtomicInteger num = new AtomicInteger(0);
    public void increase() {
        num.incrementAndGet();
    }
}

Enter the incrementAndGet method:

import sun.misc.Unsafe;

public class AtomicInteger extends Number implements java.io.Serializable {
    private static final Unsafe unsafe = Unsafe.getUnsafe();
    private static final long valueOffset;

    public final int incrementAndGet() {
        return unsafe.getAndAddInt(this, valueOffset, 1) + 1;
    }
}

Here we see a class called Unsafe. It is not commonly used, so what does it do?

Unsafe is a class in the sun.misc package with the ability to operate on low-level resources. For example, it can access the operating system directly, manipulate specific memory, and it provides many CPU-primitive-level APIs.

Continuing through the source code, we follow the getAndAddInt method:

public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = getIntVolatile(o, offset);
    } while (!compareAndSwapInt(o, offset, v, v + delta));
    return v;
}

Let’s describe the parameters: o is the object to modify, offset is the memory offset of the field to modify within that object, and delta is the increment to apply.

The core of the whole method is a do-while loop. The getIntVolatile method is easy to understand: it reads the field of object o at the given offset.

We focus on the compareAndSwapInt method in the while condition:

public final native boolean compareAndSwapInt(
    Object o,
    long offset,
    int expected,
    int x);

Here o and offset have the same meaning as before, expected is the expected current value, and x is the new value. This gives us the three values at the heart of a CAS operation: the value at the memory location, the expected original value, and the new value.

When a CAS operation executes, the value at the memory location is compared with the expected original value. If they match, the processor atomically updates the location to the new value; otherwise, the processor does nothing.

The CAS method provided by Unsafe maps to a CPU atomic instruction; on x86 the underlying implementation is the cmpxchg instruction, so the update cannot produce inconsistent data.

Let’s go back to this Code:

public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = getIntVolatile(o, offset);
    } while (!compareAndSwapInt(o, offset, v, v + delta));
    return v;
}

The code execution process is as follows:

(1) Thread A enters getAndAddInt to do the increment. It first reads the field value v1 at offset of object o

(2) In the while loop, compareAndSwapInt reads the field value v2 again and compares it with the expected value v1. If they are equal, it atomically updates the field to v1 plus the increment, and the loop exits

(3) If thread B has changed the field by the time compareAndSwapInt executes, the method returns false, the loop cannot exit, and the read-and-swap repeats until it succeeds. This is the spin design idea

Through this analysis we know that the Unsafe class and the spin design idea are the core of CAS; the spin idea will reappear in our caching tool.
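The same spin idea can be expressed against the public AtomicInteger API, without touching Unsafe (a minimal sketch; the class name SpinCounter is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Read the current value, then retry compareAndSet until no other thread
// has changed the value between the read and the swap — the same spin loop
// as getAndAddInt, built on the public API.
class SpinCounter {
    private final AtomicInteger num = new AtomicInteger(0);

    public int getAndAdd(int delta) {
        int v;
        do {
            v = num.get();                          // read the expected value
        } while (!num.compareAndSet(v, v + delta)); // CAS; spin on failure
        return v;
    }

    public int get() {
        return num.get();
    }
}
```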


4 example analysis of distributed lock

Within a single JVM process, Java provides locking mechanisms to ensure that a code block is accessed by only one thread at a time; for example, we can use synchronized and ReentrantLock for concurrency control.

In a cluster with multiple servers, each server runs its own JVM process. A JVM-level lock cannot coordinate across JVMs, so we need to introduce a distributed lock. As the name suggests, a distributed lock controls concurrency across multiple JVM processes in a distributed scenario.

When implementing a distributed lock, watch out for the pitfalls: for example, if no timeout is set and the node holding the lock fails to release it for some reason, no other node will ever acquire the lock.

There are many ways to implement distributed locks: with Redis, with ZooKeeper, or directly with the Redisson framework. This chapter gives a Redis distributed lock implemented with Lua scripts.

import javax.annotation.Resource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RedisLockManager {
    private static final Logger LOGGER = LoggerFactory.getLogger(RedisLockManager.class);

    private static final String DEFAULT_VALUE = "lock";

    // EXPIRE takes seconds, matching the lockSeconds argument passed in below
    private static final String LOCK_SCRIPT =
        "\nlocal r = tonumber(redis.call('SETNX', KEYS[1], ARGV[1]));"
        + "\nif r == 1 then"
        + "\nredis.call('EXPIRE', KEYS[1], ARGV[2]);"
        + "\nend"
        + "\nreturn r";

    private static final String UNLOCK_SCRIPT =
        "\nlocal v = redis.call('GET', KEYS[1]);"
        + "\nlocal r = 0;"
        + "\nif v == ARGV[1] then"
        + "\nr = redis.call('DEL',KEYS[1]);"
        + "\nend"
        + "\nreturn r";

    @Resource
    private RedisClient redisClient;

    public boolean tryLock(String key, int lockSeconds) {
        try {
            String lockValue = executeLuaScript(key, lockSeconds);
            return lockValue != null;
        } catch (Exception ex) {
            LOGGER.error("key={},lockSeconds={}", key, lockSeconds, ex);
            return false;
        }
    }

    public boolean unLock(String key) {
        try {
            Long r = (Long) redisClient.eval(UNLOCK_SCRIPT, 1, key, DEFAULT_VALUE);
            if (Long.valueOf(1L).equals(r)) {
                return true;
            }
        } catch (Exception ex) {
            LOGGER.info("key={}", key, ex);
        }
        return false;
    }

    private String executeLuaScript(String key, int lockSeconds) {
        try {
            Long returnValue = (Long) redisClient.eval(LOCK_SCRIPT, 1, key, DEFAULT_VALUE, String.valueOf(lockSeconds));
            if (Long.valueOf(1L).equals(returnValue)) {
                return DEFAULT_VALUE;
            }
        } catch (Exception ex) {
            LOGGER.error("key={},lockSeconds={}", key, lockSeconds, ex);
        }
        return null;
    }
}


5 cache tool example analysis

The chapters above analyzed the CAS principle and a distributed-lock implementation. Now let’s combine that knowledge to implement a caching tool that solves the cache breakdown problem.

The core idea of the caching tool: when the cache has no data, use the distributed lock so that only one JVM process at a time may query the database and write the result back into the cache.

So what about processes that don’t get distributed locks? We offer the following three options:

Scheme 1: return null data directly

The cache tool code is as follows:

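Here is a minimal sketch of such a tool under stated assumptions: the CacheClient and LockClient interfaces, the EMPTY placeholder, and the timeout values are illustrative names introduced for this sketch, not an established API:

```java
import java.util.function.Supplier;

// Hypothetical interfaces standing in for a Redis client and the
// RedisLockManager shown earlier (names are assumptions for illustration).
interface CacheClient {
    String get(String key);
    void set(String key, String value, int expireSeconds);
}

interface LockClient {
    boolean tryLock(String key, int seconds);
    boolean unLock(String key);
}

class CacheTemplate {
    static final String EMPTY = "<EMPTY>"; // placeholder cached for missing rows

    private final CacheClient cache;
    private final LockClient lock;

    CacheTemplate(CacheClient cache, LockClient lock) {
        this.cache = cache;
        this.lock = lock;
    }

    public String getData(String key, Supplier<String> dbLoader) {
        String value = cache.get(key);
        if (value != null) {
            return EMPTY.equals(value) ? null : value; // cache hit (maybe the placeholder)
        }
        String lockKey = "lock:" + key;
        if (!lock.tryLock(lockKey, 10)) {
            return null; // scheme 1: callers that lose the lock return empty data directly
        }
        try {
            String dbValue = dbLoader.get(); // only the lock holder hits the database
            // caching the placeholder for missing rows also blunts cache penetration
            cache.set(key, dbValue == null ? EMPTY : dbValue, 60);
            return dbValue;
        } finally {
            lock.unLock(lockKey);
        }
    }
}
```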

In the code above, we use the distributed lock to restrict access to the database: only one process may query the database at a time. If the database has data, it is put into the cache, which solves cache breakdown; if there is no data, the loop ends, which addresses cache penetration.


6 consistency between database and cache

In this chapter I want to extend the discussion with a question: should we write the cache first or the database first, and how do we guarantee consistency between database and cache?

My conclusion is clear: write the database first, then the cache. The core idea is to pursue eventual consistency between database and cache; unless truly necessary, there is no need to guarantee strong consistency.

(1) Since the cache is a means to improve system performance, there is no need to guarantee strong consistency between database and cache. Insisting on strong consistency would increase the complexity of the system

(2) If the database update succeeds, update the cache next. There are two situations: if the cache update also succeeds, everything is fine; if the cache update fails, it does not matter much — just wait for the cached entry to expire. The expiration time must be set reasonably here

(3) If the database update fails, the operation fails; retry, or let the user try again

(4) The database holds the persistent data and is the basis for judging whether an operation succeeded. The cache is a performance aid and is allowed to be briefly inconsistent with the database

(5) In Internet architectures we generally pursue eventual consistency rather than strong consistency. Insisting that cache and database always agree is essentially a distributed-consistency problem

(6) There are many solutions to distributed consistency, such as two-phase commit, TCC, local message tables, and MQ transactional messages
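The write path described in points (1) to (3) can be sketched as follows; the map-backed stores are stand-ins introduced only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "write the database first, then the cache": the database write
// decides success; a failed cache update is tolerated because the stale
// entry simply expires later.
class WriteThroughDemo {
    static final Map<String, String> db = new HashMap<>();
    static final Map<String, String> cache = new HashMap<>();

    static boolean update(String key, String value) {
        try {
            db.put(key, value);      // (1) persist first; failure here fails the operation
        } catch (RuntimeException ex) {
            return false;            // (3) database failed: report failure, let the caller retry
        }
        try {
            cache.put(key, value);   // (2) then refresh the cache
        } catch (RuntimeException ex) {
            // cache failure is acceptable: the old entry will expire,
            // and the next read will reload from the database
        }
        return true;                 // success is judged by the database write alone
    }
}
```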


7 article summary

This article introduced the causes of and solutions to cache penetration and breakdown, borrowed the spin design idea from the CAS source code, and combined it with a distributed lock to implement a caching tool. I hope it helps you.