Technology Sharing | A Case of Handling a Soaring Number of MongoDB Cluster Connections

Time: 2021-09-17

Author: Ren Kun

Based in Zhuhai, he has worked as a full-time Oracle and MySQL DBA and is now mainly responsible for the maintenance of MySQL, MongoDB and Redis.

Source: original contribution

*Produced by the Aikesheng open source community. The original content may not be used without authorization; for reprints, please contact the editor and cite the source.


1 Background

At around 22:30 on June 5 I received several alarms: shard2 of a production MongoDB cluster was showing replication delay between the primary and its secondaries, and it recovered automatically after a few minutes. I wanted to pretend I had not seen it, but a while later the delay alarm fired again; although it recovered automatically once more, I felt uneasy not logging in to take a look. This environment is a 3.4 sharded cluster with 3 mongos and 4 shards (each shard has 1 primary and 2 secondaries). The primary node has priority = 2 and the other nodes priority = 1, which ensures that, as long as the network is healthy, the primary stays on node 1.
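For reference, member priorities can be inspected and adjusted from the mongo shell. The sketch below uses the same mongo --eval style as the log-rotation script later in this article; the port, member index and omitted authentication options are illustrative, not taken from the actual cluster.

# Illustrative only: print each replica set member's priority.
/usr/local/mongodb/bin/mongo --port 27017 --eval \
  'rs.conf().members.forEach(function(m){ print(m.host, "priority =", m.priority) })'

# Illustrative only: make member 0 the preferred primary (run on the current primary
# with an account that has clusterAdmin privileges).
/usr/local/mongodb/bin/mongo --port 27017 --eval \
  'var c = rs.conf(); c.members[0].priority = 2; rs.reconfig(c)'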

2 Troubleshooting

Logging in to Grafana to check the primary's monitoring, CPU and memory utilization rose significantly during this period:

[Figure: CPU utilization of the primary]

[Figure: memory utilization of the primary]

QPS dropped straight down to 0 at around 22:30:

[Figure: QPS of the primary]

Meanwhile, the number of connections kept soaring:

[Figure: connection count of the primary]

It can be roughly inferred that a connection storm put excessive pressure on the primary and its secondaries, which in turn caused the replication delay. For the details we need to look at mongod.log. The strength of MongoDB's log file is that it records very detailed information; the weakness is that so much detail makes the file large. For this reason MongoDB provides a dedicated logRotate command, but that command only switches to a new log file, it does not clean up historical files, and MongoDB itself cannot schedule a job to rotate logs automatically, so a separate script is needed.

Create a MongoDB account named backup with the hostManager role, and write the following shell script, which is executed once an hour (a sample crontab entry is shown after the script).

[root@... mongod]# more logrotate_mongo.sh
#!/bin/sh

MONGO_CMD=/usr/local/mongodb/bin/mongo
KEEP_DAY=7

# flush mongod log
datadir=/data/mongodb
port=27017
role=mongod
destdir=/data/backup/mongolog/${role}
if [ ! -d "${destdir}" ]; then
  mkdir -p ${destdir}
fi
$MONGO_CMD --authenticationDatabase admin admin -ubackup -p"*****" --eval \
  "db.runCommand({logRotate:1})" --port $port
mv ${datadir}/mongod.log.????-??-??T??-??-?? ${destdir}/
find $destdir -name "mongod.log.*" -mtime +${KEEP_DAY} -exec rm -rf {} \;
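To run it hourly, a crontab entry along the following lines can be used; the script path and log path are assumptions and should be adjusted to the actual environment.

# root's crontab: rotate the mongod log at the top of every hour.
0 * * * * /bin/sh /data/scripts/logrotate_mongo.sh >> /tmp/logrotate_mongo.cron.log 2>&1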

The archived log files are listed below. The amount of log generated between 22:00 and 23:00 is an order of magnitude larger than usual, which again suggests that MongoDB was under abnormal pressure during this period.

-rw-r--r-- 1 root root  11647760 Jun 5 18:00 mongod.log.2021-06-05T10-00-02
-rw-r--r-- 1 root root  12869316 Jun 5 19:00 mongod.log.2021-06-05T11-00-01
-rw-r--r-- 1 root root  15625361 Jun 5 20:00 mongod.log.2021-06-05T12-00-03
-rw-r--r-- 1 root root  10993819 Jun 5 21:00 mongod.log.2021-06-05T13-00-02
-rw-r--r-- 1 root root  71500070 Jun 5 22:00 mongod.log.2021-06-05T14-00-40
-rw-r--r-- 1 root root 215216326 Jun 5 23:00 mongod.log.2021-06-05T15-00-02

Querying the log around that time, a large number of messages about failing to create service worker threads appear:

2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.65:39890
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.66:45090 #4051839 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable
2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.66:45090
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.65:39892 #4051840 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable
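The "pthread_create failed: Resource temporarily unavailable" messages mean that mongod could not spawn new worker threads, which usually points to the mongod user's process/thread limit (or available memory) being exhausted at the OS level. A quick check on the shard host might look like the sketch below; it assumes a single mongod process is running on the host.

# Max processes/threads the current user may create.
ulimit -u

# Limits actually applied to the running mongod process.
cat /proc/$(pidof mongod)/limits | grep -i "max processes"

# Number of threads mongod is currently using.
grep Threads /proc/$(pidof mongod)/status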

The secondary's log during the same period:

2021-06-05T22:30:07.325+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 172.31.0.183:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2021-06-05T22:30:07.425+0800 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section

The primary had almost stopped responding: the periodic heartbeat connections from the secondaries could not be established, which triggered an election and a secondary was temporarily promoted to primary. Once the old primary became responsive again and rejoined the cluster, its higher priority let it take back the primary role.
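On a live system the member states can also be confirmed directly with rs.status(); a minimal sketch in the same mongo --eval style (port and authentication options are assumptions):

# Print each replica set member's name and current state as seen from this node.
/usr/local/mongodb/bin/mongo --port 27017 --eval \
  'rs.status().members.forEach(function(m){ print(m.name, m.stateStr) })'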

Parsing the log with mtools also confirms this. Under normal circumstances 183 is the primary and 114 is the secondary:

[root@172-31-0-183 mongod]# mloginfo mongod.log.2021-06-05T15-00-02 --rsstate
.....
RSSTATE
date               host                  state/message

Jun 05 22:28:05    172.31.0.114:27017    PRIMARY      # 114 was elected primary
Jun 05 22:36:41    172.31.0.114:27017    SECONDARY    # 114 was demoted back to secondary

[root@172-31-0-114 mongod]# mloginfo mongod.log.2021-06-05T14-48-24 --rsstate
......
RSSTATE
date               host                  state/message

Jun 05 22:31:50    172.31.0.183:27017    ROLLBACK     # 183 rejoined as a secondary and went through rollback and recovery
Jun 05 22:34:48    172.31.0.183:27017    RECOVERING
Jun 05 22:35:52    172.31.0.183:27017    SECONDARY    # once 183 became a secondary, its higher priority triggered a new election
Jun 05 22:36:43    172.31.0.183:27017    PRIMARY      # 183 became primary again

The next step is to confirm which application servers initiated the connection storm. Connections to the shard nodes are created by mongos, so we have to check the mongos instance logs instead, again with mtools. The following output shows the connections created between 21:00 and 22:00:

[root@... mongos]# mloginfo mongod.log.2021-06-05T14-00-01 --connections
......

CONNECTIONS
     total opened: 8904
     total closed: 5645
    no unique IPs: 6
 socket exceptions: 649

127.0.0.1    opened: 2683 closed: 2683
172.31.0.21  opened: 1531 closed: 733
172.31.0.28  opened: 1525 closed: 676
172.31.0.81  opened: 1520 closed: 704
172.31.0.78  opened: 1495 closed: 699
172.31.0.134 opened: 150  closed: 150

And this is the connection information for 22:00 to 23:00; the problem clearly lies in this period:

[root@... mongos]# mloginfo mongod.log.2021-06-05T15-00-01 --connections
......

CONNECTIONS
     total opened: 58261
     total closed: 58868
    no unique IPs: 7
  socket exceptions: 3708
   
172.31.0.78  opened: 14041  closed: 14576
172.31.0.21  opened: 13898  closed: 14284
172.31.0.28  opened: 13835  closed: 13617
172.31.0.81  opened: 13633  closed: 13535
127.0.0.1    opened: 2681   closed: 2680
172.31.0.134 opened: 170    closed: 176
172.31.0.4   opened: 3      closed: 0

This confirms that the incident was caused by a change in connection behaviour on the application servers. What remains is to hand this information over to the developers for further investigation.
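Besides the log-based view, the connection pressure on a live mongos can be checked at any time with serverStatus; a minimal sketch (port and authentication options are assumptions):

# Current, available and total-created connection counts on a mongos.
/usr/local/mongodb/bin/mongo --port 27017 --eval \
  'printjson(db.serverStatus().connections)'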

3 Summary

This case was fairly easy to diagnose, but it relied on two tools: a shell script that rotates the logs on a schedule, and mtools, written by MongoDB engineers. Together they let us browse the logs quickly and extract the key information needed for diagnosis, which greatly improves efficiency.
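For completeness, mtools is distributed as a Python package, so installing it is usually just a pip command (the exact Python/pip setup varies by environment):

# Install mtools and check that mloginfo is available.
pip install mtools
mloginfo --help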
