Online alerting with Supervisor + DingTalk

Time:2020-10-25

1. Background

To keep the online project stable, we want a DingTalk alert whenever a process managed by Supervisor repeatedly fails to start, so that the responsible engineers can deal with it promptly.
The result looks like this:
[Screenshot: the resulting alert]

2. Basic knowledge

Here is a short primer for readers who have never used Supervisor or DingTalk robots. If you are already familiar with them, you can skip this section.
1. What is Supervisor?
Supervisor is a reliable process management tool written in Python. It monitors the child processes it manages, and when one of them crashes unexpectedly it tries to restart it (you can set the maximum number of restart attempts, so a failing process is normally not restarted indefinitely). A general idea of what it does is enough here. For details, please refer to the official Supervisor website.

2. What is a DingTalk robot?
A DingTalk robot is an extension of DingTalk's group chat feature. Group robots can aggregate information from third-party services into a group chat, so that the information is synchronized automatically. There are many kinds of robots, such as the GitLab robot, the GitHub robot, and the Coding robot. Here we use a custom robot. Please refer to the official documentation for details.

3. Configure a custom DingTalk robot

1. Open DingTalk and find the robot management entry
[Screenshot: robot management entry]

2. Click robot management, and the robot management menu appears. It looks like this:
[Screenshot: robot management menu]

The remaining steps have been explained in detail on the official website, so I will not repeat them here

In the end we get a URL that contains an access_token. To raise an alert, we send a POST request to that URL. It looks roughly like this:

https://oapi.dingtalk.com/robot/send?access_token=XXXXXX

(PS: the robot in this article uses the IP address restriction security setting.)
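
As a minimal sketch of that POST request, the snippet below builds (but does not send) a plain-text message for the webhook using only the standard library. The helper name build_alert_request and the message text are assumptions made for illustration, and XXXXXX stands for your own access_token.

```python
import json
import urllib.request

# Placeholder webhook URL; replace XXXXXX with the access_token of your robot.
WEBHOOK_URL = "https://oapi.dingtalk.com/robot/send?access_token=XXXXXX"


def build_alert_request(text, url=WEBHOOK_URL):
    """Build a POST request carrying a plain-text DingTalk message.

    Hypothetical helper for illustration; it only constructs the request.
    """
    body = {"msgtype": "text", "text": {"content": text}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request("supervisor child process failed")
print(request.get_method(), request.full_url)

# To actually send the alert (needs network access and a real token):
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the JSON body before pointing the script at a real group.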

4. Configure Supervisor's event listener

First, take a look at the configuration file of the Supervisor event listener:

[eventlistener:monitor]
command=/home/zero/supervisord.d/monitor.py
events=PROCESS_STATE_FATAL
stdout_logfile=/var/log/supervisor/script_log/monitor.log
stderr_logfile=/var/log/supervisor/script_log/[email protected]

A brief walk through the configuration above:
First line: [eventlistener:monitor]. Anyone who has used Supervisor knows that the first line of a child process's configuration is normally [program:process name]. Here it is [eventlistener:listener name] instead, which marks this section as the configuration of an event listener named monitor.
Second line: command=/home/zero/supervisord.d/monitor.py is the command that runs the listener process. It is described in detail below, so we skip it for now.
Third line: events=PROCESS_STATE_FATAL names the event the listener subscribes to. As of Supervisor 4.1, 23 event types are supported:

PROCESS_STATE
PROCESS_STATE_STARTING
PROCESS_STATE_RUNNING
PROCESS_STATE_BACKOFF
PROCESS_STATE_STOPPING
PROCESS_STATE_EXITED
PROCESS_STATE_STOPPED
PROCESS_STATE_FATAL
PROCESS_STATE_UNKNOWN
REMOTE_COMMUNICATION
PROCESS_LOG_STDOUT
PROCESS_LOG_STDERR
PROCESS_COMMUNICATION_STDOUT
PROCESS_COMMUNICATION_STDERR
SUPERVISOR_STATE_CHANGE
SUPERVISOR_STATE_CHANGE_RUNNING
SUPERVISOR_STATE_CHANGE_STOPPING
TICK_5
TICK_60
TICK_3600
PROCESS_GROUP
PROCESS_GROUP_ADDED
PROCESS_GROUP_REMOVED

A detailed description of each event type can be found in the events section of the Supervisor documentation.
Here we focus on the PROCESS_STATE_FATAL event. What does it mean? The official documentation says:

Indicates a process has moved from the BACKOFF state to the FATAL state. 
This means that Supervisor tried startretries number of times unsuccessfully to start the process, and gave up attempting to restart it.

In other words, this event fires when a child process of Supervisor moves from the BACKOFF state to the FATAL state.
Supervisor has tried to start the child process several times (the number of attempts is set by startretries in the process configuration), failed every time, and given up.

The fourth and fifth lines set the log files that receive the listener's standard output and standard error.
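
To make the PROCESS_STATE_FATAL event more concrete: according to the Supervisor documentation, both the header line and the payload of an event notification are space-separated key:value tokens. The sketch below parses one such notification. The helper name parse_tokens and the sample values (a process named cat, the serial numbers) are illustrative assumptions, not output captured from a real system.

```python
def parse_tokens(line):
    """Turn 'key:value key:value ...' into a dict (hypothetical helper)."""
    return dict(token.split(":", 1) for token in line.split())


# Sample payload and header in the shape the event protocol uses; values are made up.
sample_payload = "processname:cat groupname:cat from_state:BACKOFF"
sample_header = (
    "ver:3.0 server:supervisor serial:21 pool:monitor poolserial:10 "
    f"eventname:PROCESS_STATE_FATAL len:{len(sample_payload)}"
)

headers = parse_tokens(sample_header)
payload = parse_tokens(sample_payload)

print(headers["eventname"])   # PROCESS_STATE_FATAL
print(payload["from_state"])  # BACKOFF
```

The from_state field reflects the BACKOFF to FATAL transition described above, and processname tells you which child process gave up.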

5. Write the event listener script

Now back to the command setting mentioned above. It points to the program that actually does the listener's work, namely the /home/zero/supervisord.d/monitor.py script. In theory a listener can be written in any language, but Supervisor itself is written in Python and ships a module called supervisor.childutils, which makes writing listeners in Python particularly easy. The official website provides a simple demo, and all we need to do is adapt it. The code is as follows:

#!/usr/bin/env python3
import sys
import json
import urllib.request


def write_stdout(s):
    # only eventlistener protocol messages may be sent to stdout
    sys.stdout.write(s)
    sys.stdout.flush()

def write_stderr(s):
    sys.stderr.write(s)
    sys.stderr.flush()

def main():
    while 1:
        # transition from ACKNOWLEDGED to READY
        write_stdout('READY\n')

        # read the header line of the next event
        line = sys.stdin.readline()

        # parse the header tokens, then read the payload of the announced length
        event_headers = dict([ x.split(':') for x in line.split() ])
        notifyData = sys.stdin.read(int(event_headers['len']))

        title = "warning online"
        content = "supervisor child process failed\n\n" + "> ![screenshot](URL address of an image)\n\n" + "> " + notifyData + "\n"

        url = "link obtained when setting custom robot"
        headers = {'Content-Type':'application/json'}

        # 'at' is a top-level field of the message, next to 'markdown'
        body = {'msgtype':'markdown', 'markdown':{'title':title, 'text':content}, 'at':{'isAtAll':True}}
        request = urllib.request.Request(url, headers=headers, data=json.dumps(body).encode('utf-8'))
        try:
            response = urllib.request.urlopen(request)
        except Exception as e:
            # log the failure to stderr but keep the listener alive
            write_stderr('failed to send alert: %s\n' % e)

        # transition from READY to ACKNOWLEDGED
        write_stdout('RESULT 2\nOK')

if __name__ == '__main__':
    main()

(PS: the alert is sent with DingTalk's markdown message type.)
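
The READY / RESULT handshake used by the listener can also be tried out without Supervisor or network access. The following is a minimal sketch, assuming a hypothetical handle_one_event function that takes the input and output streams and a notify callback as parameters. The in-memory streams and the sample event are made up for the demonstration.

```python
import io


def handle_one_event(stdin, stdout, notify):
    """Process a single event using the READY / RESULT handshake.

    `notify` is a hypothetical callback that receives (headers, payload).
    """
    # Tell Supervisor we are ready for the next event.
    stdout.write("READY\n")
    stdout.flush()

    # The header line carries the event name and the payload length.
    header_line = stdin.readline()
    headers = dict(token.split(":", 1) for token in header_line.split())
    payload = stdin.read(int(headers["len"]))

    notify(headers, payload)

    # Acknowledge the event so that Supervisor can send the next one.
    stdout.write("RESULT 2\nOK")
    stdout.flush()


# Simulated exchange; in production the streams are sys.stdin and sys.stdout.
payload = "processname:cat groupname:cat from_state:BACKOFF"
fake_stdin = io.StringIO(
    "ver:3.0 server:supervisor serial:21 pool:monitor poolserial:10 "
    f"eventname:PROCESS_STATE_FATAL len:{len(payload)}\n{payload}"
)
fake_stdout = io.StringIO()
received = []

handle_one_event(
    fake_stdin, fake_stdout, lambda h, p: received.append((h["eventname"], p))
)

print(received)
print(repr(fake_stdout.getvalue()))
```

Keeping the handshake separate from the alerting callback means the protocol handling can be checked locally before the script is wired to a real webhook.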

OK, the configuration is finished, enjoy! My own knowledge is limited, so if you spot any mistakes, please let me know.

6. References

Supervisor website http://www.supervisord.org/index.html
DingTalk custom robot documentation https://ding-doc.dingtalk.com/doc#/serverapi2/krgddi
