Troubleshooting high CPU on a single node of a production Redis cluster

Time:2021-1-27

Problem

Our production Redis (Alibaba Cloud cluster edition, 64 nodes) had been running for a long time and frequently reported high CPU usage. On closer inspection, only node 13 was affected, with CPU usage more than three times that of the other nodes. Most nodes sat at 20% to 30% CPU with about 7,000 QPS, while the problem node was above 80% CPU with 27,000 QPS.

Investigation

Initial diagnosis

Our first suspicion was a big key, so we ran the monitor command during off-peak hours. We found that one large hash had been split into 100 keys, each holding about 300,000 fields. These 100 keys were evenly distributed across the nodes, so they should not cause such severe skew.

We therefore concluded that the high CPU on this node was not caused by a big key.
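A claim like "the shards are evenly distributed" can also be sanity-checked offline. The sketch below is illustrative only: it uses the open-source Redis Cluster slot rule (CRC16 of the key modulo 16384), assumes the slots are divided into equal contiguous ranges across 64 nodes, and uses made-up key names (`big_hash:0` to `big_hash:99`), since the real key names are not given here. Alibaba Cloud's proxy may route keys differently, so the bucket numbers should not be read as real node IDs.

```python
import binascii
from collections import Counter

SLOT_COUNT = 16384  # number of hash slots in open-source Redis Cluster


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16/XMODEM mod 16384)."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains a non-empty {...}, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def shards_per_node(keys, node_count):
    """Count keys per node, ASSUMING equal contiguous slot ranges per node."""
    return Counter(key_slot(k) * node_count // SLOT_COUNT for k in keys)


if __name__ == "__main__":
    # Hypothetical shard names, used only for illustration.
    keys = [f"big_hash:{i}" for i in range(100)]
    counts = shards_per_node(keys, node_count=64)
    print("nodes holding at least one shard:", len(counts))
    print("most shards on a single node:", max(counts.values()))
```

If one bucket held a large share of the 100 shards, the big-key theory would deserve a second look; a flat spread supports ruling it out.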

Second check

Using Alibaba Cloud's imonitor command, we monitored the specified node and saved the output to a text file for analysis.

We connected to the Alibaba Cloud Redis cluster with telnet and ran the following in the telnet session:

imonitor 13

The output was saved as redis-13.txt.

We then counted the commands per second by extracting the whole-second part of each line's timestamp:

grep -o -E "1561806[0-9]+" redis-13.txt | sort | uniq -c

The results are as follows:

[Image: per-second command counts for node 13]

The QPS of this node was about 7,000, far below the 27,000 shown in the monitoring console.
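The same per-second count can be done in a few lines of Python, which is handy when the capture spans more than one timestamp prefix. This is a minimal sketch that assumes each captured line starts with a Unix timestamp such as `1561806123.456789`, optionally preceded by the `+` that appears when capturing over telnet; the sample lines and client addresses are invented for illustration.

```python
import re
from collections import Counter

# Leading "+" is optional; group 1 is the whole-second part of the timestamp.
TIMESTAMP = re.compile(r"^\+?(\d+)\.\d+ ")


def commands_per_second(lines):
    """Return a Counter mapping each Unix second to the number of commands seen."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines in MONITOR-style format.
    sample = [
        '+1561806001.100000 [0 10.0.0.8:51234] "get" "a"',
        '+1561806001.200000 [0 10.0.0.8:51234] "get" "b"',
        '+1561806002.050000 [0 10.0.0.9:40000] "hget" "h" "f"',
    ]
    for second, count in sorted(commands_per_second(sample).items()):
        print(second, count)
```

For a real capture, pass an open file object instead of the sample list, for example `commands_per_second(open("redis-13.txt"))`.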

We therefore concluded that the node numbers in Alibaba Cloud monitoring do not correspond to the node indexes used by imonitor.

Having found this mismatch, we used the iinfo command to locate the node with high CPU.

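# Query the CPU section of every node (indexes 0 to 63).
# $host and $password are placeholders for the instance address and password.
for i in {0..63}
do
    redis-cli -h "$host" -a "$password" iinfo "$i" CPU
done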
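When comparing nodes, note that the `used_cpu_sys` and `used_cpu_user` fields in the CPU section of INFO are cumulative CPU seconds, so a node's current load is better estimated from the difference between two samples. The sketch below assumes that iinfo returns the same `key:value` text as INFO; the sample values are invented.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO-style output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def cpu_percent(before, after, interval_seconds):
    """Average CPU use between two samples, as a percentage of one core."""
    def total(sample):
        info = parse_info(sample)
        return float(info["used_cpu_sys"]) + float(info["used_cpu_user"])

    return (total(after) - total(before)) / interval_seconds * 100.0


if __name__ == "__main__":
    # Invented samples taken 10 seconds apart from the same node.
    first = "# CPU\nused_cpu_sys:100.00\nused_cpu_user:200.00\n"
    second = "# CPU\nused_cpu_sys:104.00\nused_cpu_user:204.50\n"
    print(f"{cpu_percent(first, second, 10):.1f}% of one core")
```

Running this for every node index and sorting by the result makes the outlier stand out without reading 64 raw outputs by eye.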

With this method we found that the CPU of node 5 was significantly higher than that of the other nodes.

We then ran imonitor again, this time against node 5, saved the output as redis-5.txt, and computed the node's QPS:

grep -o -E "1561807[0-9]+" redis-5.txt | sort | uniq -c

[Image: per-second command counts for node 5]

The QPS was about 28,000, which matches the monitoring curve.

Next we analyzed the imonitor output with redis-faina.
First the output needs to be preprocessed by removing the "+" sign at the beginning of each line received from imonitor:

# BSD/macOS sed syntax; with GNU sed use: sed -i "s/^+//" redis-5.txt
sed -i "" "s/^+//" redis-5.txt

Then analyze the cleaned file with redis-faina:

python redis-faina.py redis-5.txt

Among the high-traffic commands we found an hget command whose key and field arguments were obviously reversed, and its QPS was very high. This key was what drove up both the CPU and the miss curve.
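The core of this kind of analysis is simply counting how often each command and key pair appears in the capture. The following is a rough stand-in for that idea, not redis-faina's actual implementation; it assumes MONITOR-style lines in which the command and its arguments are double-quoted, and the sample lines are invented.

```python
import re
from collections import Counter

# Matches one double-quoted argument, allowing backslash escapes inside it.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def top_command_keys(lines, n=3):
    """Return the n most frequent (command, key) pairs in MONITOR-style lines."""
    counts = Counter()
    for line in lines:
        args = QUOTED.findall(line)
        if len(args) >= 2:  # needs at least a command and a key
            counts[(args[0].lower(), args[1])] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented sample: the same "key" appears with a different "field" each time,
    # which is what a reversed key/field pair tends to look like in a capture.
    sample = [
        '1561807001.000001 [0 10.0.0.8:51234] "hget" "profile" "user:1001"',
        '1561807001.000002 [0 10.0.0.8:51234] "HGET" "profile" "user:1002"',
        '1561807001.000003 [0 10.0.0.9:40000] "hget" "profile" "user:1003"',
        '1561807001.000004 [0 10.0.0.9:40000] "get" "a"',
    ]
    for (command, key), count in top_command_keys(sample):
        print(command, key, count)
```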

We searched the code base for the problem key, checked each usage line by line, and found one project that passed the key and field in reverse order, so we asked the developers to fix it.
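To illustrate why swapped arguments can produce both a hot key and a rising miss rate, here is a toy in-memory model. It is only a sketch under assumptions of our own: the data layout (one hash per user, keyed `user:<id>`, with a `name` field) is hypothetical, and `FakeHashStore` is a stand-in written for this example, not a Redis client.

```python
from collections import Counter


class FakeHashStore:
    """In-memory stand-in for Redis hashes that records reads and misses."""

    def __init__(self):
        self.data = {}
        self.reads = Counter()  # how many times each key was read
        self.misses = 0

    def hset(self, key, field, value):
        self.data.setdefault(key, {})[field] = value

    def hget(self, key, field):
        self.reads[key] += 1
        value = self.data.get(key, {}).get(field)
        if value is None:
            self.misses += 1
        return value


def read_names(store, user_ids, reversed_args=False):
    """Read the 'name' field of each user, optionally with key and field swapped."""
    for uid in user_ids:
        key, field = f"user:{uid}", "name"
        if reversed_args:
            key, field = field, key  # the bug: field is passed where the key belongs
        store.hget(key, field)


if __name__ == "__main__":
    store = FakeHashStore()
    for uid in range(5):
        store.hset(f"user:{uid}", "name", f"name-{uid}")
    read_names(store, range(5), reversed_args=True)
    print("keys read:", dict(store.reads))
    print("misses:", store.misses)
```

With the arguments in the right order, the reads spread across many keys and all succeed. With them swapped, every call reads the same nonexistent key, so in a cluster all of that traffic would land on whichever node owns that one key.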

The next day, after the developers released the fix, the formerly hot node became as stable as the others, and the problem was finally solved.

Conclusion

  1. The node indexes used by the imonitor command did not match the node numbers in the Alibaba Cloud monitoring console, which is why the problem went undiagnosed for so long (according to Alibaba Cloud, this mismatch only occurs in early Redis cluster instances; in newer versions the numbering matches).
  2. Combining monitor with redis-faina makes it easy to track down big keys and hot keys in Redis.
  3. Each problem needs to be analyzed on its own terms. When one approach gets stuck, change perspective and try other methods.

This article first appeared on my blog. Please do not reprint it without my authorization.
