To be clear, I did not solve this whole problem myself; I only played a minor supporting role. But I am in charge of this project, so the whole process is quite clear to me. I had also told the colleague who handled it that I would walk him through a project review later. Then I was busy the whole time, almost forgot what had been done, and never took him through that second round. So while I still remember it, I am summing up the problem here, which can also serve as the review.
On Monday, we ran into a problem in our test environment: starting a service made back-end calls slower. I consulted colleagues who had seen this problem before. Their explanation was that each request is sent to two test environments (so that the results can be compared), and the two environments share the same redis cluster. When the second, identical request arrives, it is marked as a duplicate. When the downstream service receives a request marked as a duplicate, it has to query the database again to verify whether the request really is a duplicate, and correct the mark if it is not. That extra check increases the request latency.
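To make the mechanism concrete, here is a minimal sketch that simulates it in memory. How the real service records request IDs in redis is not something I have written down here, so the class, the function names, and the "fast"/"slow" labels are all assumptions for illustration only.

```python
class DedupStore:
    """In-memory stand-in for the redis cluster that records seen request IDs.

    Hypothetical: the real service's key layout and commands are not shown
    in this post.
    """

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, request_id):
        """Return True when the ID has not been seen before, then record it."""
        if request_id in self._seen:
            return False
        self._seen.add(request_id)
        return True


def handle_request(store, request_id):
    """Return 'fast' for a first-seen request, 'slow' when it is flagged."""
    if store.mark_if_new(request_id):
        return "fast"
    # Flagged as a duplicate: the downstream must query the database to verify.
    return "slow"


# Two environments sharing one store: the mirrored request is flagged.
shared = DedupStore()
shared_results = (handle_request(shared, "req-1"), handle_request(shared, "req-1"))

# One store per environment: neither copy is flagged.
store_a, store_b = DedupStore(), DedupStore()
isolated_results = (handle_request(store_a, "req-1"), handle_request(store_b, "req-1"))

print(shared_results)    # ('fast', 'slow')
print(isolated_results)  # ('fast', 'fast')
```

With a shared store, the mirrored copy of the request always takes the slow path, which matches the latency we observed.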
My colleague Xiao A, who was responsible for solving this problem, asked me whether we could build another redis cluster to separate the two environments. I was not convinced it was necessary, because something did not add up: even when one environment's service is not started and writes nothing to redis, the other environment's service still hits the problem as soon as it starts and writes to redis.
So Xiao A went to the colleagues responsible for the service to understand the business logic. Since I am the overall owner of the test environment, he brought me along. We learned that within a single environment the service runs in two machine rooms. If the two machine rooms use the same redis cluster, requests are also marked as duplicates.
That answered the environment question: each environment needs two redis clusters, so two environments need four redis clusters in total.
Often, an answer is just the beginning of a series of questions.
Our plan was to build the two extra redis clusters directly on the machine where the service using redis runs: change only the port and run two more processes.
Problem 1: redis stops when the SSH session logs out
Xiao A described the problem to me: he followed the classic installation and startup tutorial found online, and redis started successfully. But while he was doing other things, his SSH session logged out automatically, and he then saw that the redis service had stopped.
As soon as I heard this, I could basically conclude that redis was running as a non-daemon (foreground) process. So I searched online and found a command that runs it in daemon mode:
redis-server ./redis.conf --daemonize yes
Xiao A saw the solution and added that it must also be possible to configure daemon mode directly in the configuration file. I agreed, and that is what he did. I did not point it out at the time, but I trusted he would soon see that the configuration file and the explicit command-line option are the same thing: one is permanent, the other applies only to that run. I used the command directly just to illustrate the essence of the problem.
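As a rough sketch of "change only the port and run more processes", the helper below builds one redis-server command line per instance. Options passed as --name value override the same directive in the configuration file for that run; writing daemonize yes into redis.conf is the permanent equivalent. The function name and the two port numbers are hypothetical, not the ones we actually used.

```python
def redis_server_argv(conf_path, port, daemonize=True):
    """Build the argument list for one redis-server process.

    Command-line options override the matching directives in conf_path
    for this run only.
    """
    return [
        "redis-server",
        conf_path,
        "--port", str(port),
        "--daemonize", "yes" if daemonize else "no",
    ]


# Hypothetical ports for two extra instances on the same machine.
commands = [" ".join(redis_server_argv("./redis.conf", port)) for port in (7001, 7002)]
for command in commands:
    print(command)
```

Each printed line can be pasted into a shell; because the process daemonizes, it keeps running after the SSH session ends.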
Problem 2: the service reports a "not auth" error when connecting to redis
Xiao A came back with another error. What he had found online blamed the redis version, so he guessed that redis would need to be rebuilt. I took a look: the redis cluster was version 3.x and the jedis client was version 2.9. I had never heard that redis 3.x broke backward compatibility. Also, since the redis installation package came from the team in charge of redis, it should be the same version as the one already running. Even if the package they issued differed from the earlier one, I was sure the earlier version could not be lower than 3.0, because redis only supports clusters from 3.0 onward. So I concluded it was not the redis version and asked him to check again. What I really meant was to search with different keywords, for example by the cause of the error or by the exception name. Different keywords turn up different information.
Then I read the error output myself. The others had not read it carefully, but it was written there in plain sight: not auth. I asked whether the redis service had a password. He said no and demonstrated it, and I confirmed it beside him. Next we checked whether the client configuration set a password. Sure enough, the client had one configured, which did not match the server.
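The check we did by hand can be sketched as a small script: read the requirepass directive from the server's redis.conf and compare it with what the client is configured to send. The function names and the sample values are hypothetical; only requirepass is a real redis directive.

```python
def server_password(conf_text):
    """Return the requirepass value found in redis.conf text, or None if unset."""
    password = None
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and commented-out directives
        parts = stripped.split(None, 1)
        if parts[0].lower() == "requirepass" and len(parts) == 2:
            password = parts[1].strip().strip('"')
    return password


def auth_mismatch(conf_text, client_password):
    """Describe the mismatch between client and server, or return None if they agree."""
    server = server_password(conf_text)
    client = client_password or None
    if server == client:
        return None
    if server is None:
        return "client sends a password but the server has none configured"
    if client is None:
        return "server requires a password but the client has none configured"
    return "client and server passwords differ"


# Hypothetical server config with the password line commented out.
sample_conf = "port 6379\n# requirepass foobared\n"
print(auth_mismatch(sample_conf, "secret"))
print(auth_mismatch(sample_conf, None))
```

In our case the first situation applied: the client had a password configured and the server did not.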
Problem 3: "cluster support disabled" error
After Xiao A removed the client password, repackaged, and redeployed, the not auth error went away. But there had actually been two errors, and the second was still unsolved: cluster support disabled.
I said that cluster mode should be a configuration switch, something like cluster-enabled, that needs to be changed from no to yes. I also offered a bad idea (note the word "bad" here; just as every line in Dream of the Red Chamber foreshadows something, this one is no exception): in theory, a single node can also be regarded as a cluster, so it should be possible to change the configuration and start it in cluster mode directly.
Following my suggestion, Xiao A switched the configuration to cluster mode, and after a restart the client no longer reported an error.
Problem 4: request latency does not improve, and data is not written to the redis server
With the client error gone, Xiao A reproduced the original scenario, and the request latency had not improved. He also found that no data was being written to the redis server.
This time, Xiao A and I first checked whether the configuration was correct. When that turned up nothing, I told him to add more logging: log where the client connects, and log where data is read and written.
With that logging, Xiao A found that the client's connection pool was empty. He eventually traced it to the hash slots of the single-machine cluster: with only one machine, the hash slot allocation had a problem, so data writes failed. In the end, each cluster got two more redis processes to form a three-node cluster, which solved the problem.
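As background on why slot allocation matters for writes: in redis cluster every key maps to one of 16384 hash slots, computed as CRC16 of the key modulo 16384, and a key can only be served if its slot is assigned to a node. The sketch below follows my understanding of the Redis Cluster specification (CRC16 in its XMODEM variant, plus the {hash tag} rule); the function name key_slot is my own.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key):
    """Return the hash slot (0..16383) that a key maps to in redis cluster."""
    data = key.encode("utf-8")
    # Hash tag rule: if the key contains {...} with at least one character
    # inside, only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC16 XMODEM variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


print(key_slot("foo"))  # 12182
print(key_slot("{user1000}.following") == key_slot("user1000"))  # True
```

If slot 12182 is not assigned to any node, a write to the key foo has nowhere to go, no matter how many nodes are running.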
Where the troubleshooting approach could be improved
While troubleshooting problem 4, Xiao A and I checked the configuration to confirm that the redis requests were reaching the correct server. There is a more direct and more convincing method: packet capture. Running tcpdump filtered on the redis port shows whether the request traffic actually leaves the client and where it is sent.
In "Method of technical scheme design" I also mentioned that when I cannot find the information I want, the problem is often the keywords. When investigating a problem, try searching with different keywords as well.
Root cause analysis
One thing was still not fully understood: why does a redis cluster on a single machine have problems?
When I asked Xiao A, he said the getslots method returned null. In other words, the real problem might be that no slots had been allocated.
I asked him whether he had run the cluster info command in redis-cli against the single-machine cluster. He sent me the following screenshot of the operation.
The screenshot confirmed my conjecture: no slots were allocated and the cluster status was failed, so the client could not connect.
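The same check can be scripted. The sketch below parses the text that cluster info returns and looks at the two fields that matter here, cluster_state and cluster_slots_assigned. The sample strings are made up for illustration and are not a transcript of the screenshot; the function names are my own.

```python
def parse_cluster_info(raw):
    """Turn the 'field:value' lines of CLUSTER INFO output into a dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        info[field] = value
    return info


def cluster_usable(raw):
    """True only when the state is ok and all 16384 slots are assigned."""
    info = parse_cluster_info(raw)
    return (
        info.get("cluster_state") == "ok"
        and info.get("cluster_slots_assigned") == "16384"
    )


# Illustrative output for a node with no slots assigned (not the real screenshot).
broken = "cluster_state:fail\r\ncluster_slots_assigned:0\r\ncluster_known_nodes:1\r\n"
healthy = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_known_nodes:1\r\n"

print(cluster_usable(broken))   # False
print(cluster_usable(healthy))  # True
```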
What the connection depends on is not how many nodes the cluster has, but whether the slots are allocated and the cluster status is OK.
To verify this, I set up a cluster with one node and manually assigned slots 0 through 16383 to it. Once the cluster saw that all 16384 slots were allocated, its status changed to OK automatically.
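The experiment boils down to a coverage check, sketched below: given the slot ranges each node owns, list the slots nobody serves. This is a simplified model of the idea rather than how redis implements it, and the node names and layouts are hypothetical.

```python
SLOT_COUNT = 16384


def uncovered_slots(node_ranges):
    """Return the slots that no node serves.

    node_ranges maps a node name to a list of inclusive (start, end) ranges.
    """
    covered = [False] * SLOT_COUNT
    for ranges in node_ranges.values():
        for start, end in ranges:
            for slot in range(start, end + 1):
                covered[slot] = True
    return [slot for slot, is_covered in enumerate(covered) if not is_covered]


# One node that owns every slot: fully covered.
single_node = {"node-1": [(0, 16383)]}
# Three nodes started in cluster mode with nothing assigned: nothing covered.
three_empty = {"a": [], "b": [], "c": []}
# Three nodes with the slots split between them: fully covered.
three_split = {"a": [(0, 5460)], "b": [(5461, 10922)], "c": [(10923, 16383)]}

print(len(uncovered_slots(single_node)))  # 0
print(len(uncovered_slots(three_empty)))  # 16384
print(len(uncovered_slots(three_split)))  # 0
```

In this model the node count makes no difference; only the coverage does.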
The whole process shows the following. Online advice says that a redis cluster must be started with at least three nodes, preferably an odd number, so that a vote can be won by majority (two out of three), and it concludes that the whole cluster becomes unavailable once more than half of the nodes are down. For redis itself, though, whether the cluster is available depends on whether all 16384 slots are being served.
I think Xiao A did very well throughout the whole process. Four points stand out:
1> Initiative
Along the way he found resources by searching online and solved many problems himself. Dealing with the whole problem took very little of my time; he handled all the time-consuming parts on his own.
2> Reasonable use of a variety of resources
He did not understand the business, so he found a colleague who did. For technical problems he came to me. Since I am responsible for the project, coming to me was reasonable, and I very much hope he will come to me in situations like this. First, coming to me shows that he trusts me and believes I can help him to some extent. Second, he was using me as a resource. Being needed as a resource is valuable, and being needed makes people feel secure.
I also ask for help when I cannot manage something myself. For example, when we needed other groups to cooperate, my leader stepped in to sort out the scheduling. And when resources were temporarily short, higher management stepped in to help. A competent superior can be a resource and is willing to be one. The form varies: some provide strategy, some provide moral support, and so on.
3> Summary after the event
After the problem was solved, Xiao A wrote up the whole process on his own wiki so that others can avoid the same pitfalls later, and he also summarized what he had learned.
4> Timely communication
During the process he communicated with me promptly at every key step. After the problem was solved he also gave me feedback: he did not understand how redis works internally, so he did not know why three nodes fixed it. Because I understood his confusion, I have tried to give a root cause analysis here.
About redis, just one sentence: the things many people say are only useful in interviews are the things I keep finding very useful in actual work.