1. Instance restart background
According to the feedback from the students in the monitoring group, the application connecting to a mongodb instance took an abnormal time and timed out. Check the database monitoring platform and find that the IO of this instance server is soaring abnormally. Check the replica set status (rs.status()), the master-slave is broken and the slave node is unreachable.
Log in to the slave node and check the mongodb service status. It is stop.
Check the server log and find that oom occurs and mongodb is closed. Manual restart is required.
Jan 17 12:02:48 qqorderdb02 kernel: Out of memory: Kill process 83717 (mongod) score 919 or sacrifice child
Jan 17 12:02:48 qqorderdb02 kernel: Killed process 83717 (mongod), UID 1001, total-vm:21256876kB, anon-rss:15529572kB, file-rss:0kB, shmem-rss:0kB
Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service: main process exited, code=killed, status=9/KILL
Jan 17 12:42:42 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state.
Jan 17 12:42:42 qqorderdb02 systemd[1]: mongodbqq.service failed.
After the analysis, the memory of the master node is larger than that of the slave node. When the index is created, the master node executes normally, while the slave node attends the oom (12:02), and after the kill, the service fails to start automatically (12:42).
2. Restart the service
Restart and check mongodb log. We will see the uncommitted transactions of redo to create indexes
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** WARNING: Running wiredTiger with the --nojournal option in a replica set
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten] ** is deprecated and subject to be removed in a future version.
2019-01-17T19:38:11.529+0800 I CONTROL [initandlisten]
2019-01-17T19:38:11.592+0800 I INDEX [initandlisten] found 1 index(es) that wasn't finished before shutdown
2019-01-17T19:38:11.595+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/mongodbqq/db/diagnostic.data'
2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] found 1 interrupted index build(s) on qqorderdb.weixinordersn
2019-01-17T19:38:11.596+0800 I INDEX [initandlisten] note: restart the server with --noIndexBuildRetry to skip index rebuilds
However, after execution for a while, restart failed and exit. Check the server log and the error message is as follows:
Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service stop-final-sigterm timed out. Killing.
Jan 17 19:41:10 qqorderdb02 systemd[1]: Failed to start mongodbqq_service.
Jan 17 19:41:10 qqorderdb02 systemd[1]: Unit mongodbqq.service entered failed state.
Jan 17 19:41:10 qqorderdb02 systemd[1]: mongodbqq.service failed.
Check the mongodb log. The latest information is as follows:
2019-01-17T19:41:00.001+0800 I - [initandlisten] Index Build: 55387600/192576426 28%
2019-01-17T19:41:03.002+0800 I - [initandlisten] Index Build: 57463100/192576426 29%
2019-01-17T19:41:06.002+0800 I - [initandlisten] Index Build: 59385700/192576426 30%
2019-01-17T19:41:09.001+0800 I - [initandlisten] Index Build: 61549000/192576426 31%
By checking the server log and mongodb log, we can judge that it is necessary to rebuild the index that was not completed when it was closed during startup, but rebuilding the index of this large collection (in this case, weixinordersn, 500 million data, 102gsize) takes a long time, exceeding the time allowed for starting the service. The service will be killed after timeout.
3. Solutions
Set the service timeout of SYSTEMd. In mongodb service, specify the timeoutsec parameter.
Timeoutsec: defines the number of seconds that SYSTEMd waits before stopping the current service. The unit is seconds. Setting to 0 is unlimited
For example, mongodbtest Writing of service, addingTimeoutSec=1800
[Unit]
Description=mongodbtest
After=network.target remote-fs.target nss-lookup.target
[Service]
User=mongouser
Group=mongouser
# (open files)
LimitNOFILE=64000
Type=forking
ExecStart=/data/mongodb/mongobin404/bin/mongod --config /data/mongodb/mongobin404/bin/mongodbtest.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/data/mongodb/mongobin404/bin/mongod --shutdown --config /data/mongodb/mongobin404/bin/mongodbtest.conf
PrivateTmp=true
TimeoutSec=1800
[Install]
WantedBy=multi-user.target
4. Analysis of performance degradation
Combined with the time correlation between application timeout and IO surge of database monitoring, the performance degradation is mainly caused by index creation and master-slave disconnection, respectively Slow RS insertion, query and update.
5. References
1.https://cloudblue.freshdesk.com/support/solutions/articles/44001881778
2.https://www.cnblogs.com/f-society/p/13177614.html
The copyright of this article belongs to the author. It cannot be reproduced without the author’s consent. Thank you for your cooperation!!!