Open platform: monitoring system

Time: 2022-05-14

Monitoring background:
The purpose of monitoring is to detect, locate, and resolve problems quickly when the gateway or a service behaves abnormally. To do that, we need to know the values of key indicators in real time and send a notification (Prometheus / Flink) when a value goes outside its specified range. When developers receive an exception notification, they can quickly find the cause through the logs (ELK). Better awareness of online problems lets us locate them quickly and minimize the resulting loss at the lowest cost.
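The range check described above can be sketched in a few lines. This is only an illustration of the idea, not how Prometheus or Flink evaluate rules; the `Rule` class, the metric name `cpu_usage_percent`, and the bounds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Allowed range for one indicator (name and bounds are hypothetical)."""
    metric: str
    low: float
    high: float


def check(rule: Rule, value: float) -> Optional[str]:
    """Return an alert message if value is outside [low, high], else None."""
    if rule.low <= value <= rule.high:
        return None
    return f"{rule.metric}={value} outside [{rule.low}, {rule.high}]"


# Hypothetical rule: CPU usage should stay at or below 85 percent.
cpu_rule = Rule(metric="cpu_usage_percent", low=0.0, high=85.0)
print(check(cpu_rule, 42.0))  # None
print(check(cpu_rule, 93.5))  # cpu_usage_percent=93.5 outside [0.0, 85.0]
```

In a real deployment the message would be handed to a notification channel rather than printed.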

Purpose:

  • Understand the running state of the system and keep it running smoothly
  • Raise an alarm promptly when the system is abnormal, or give an early warning beforehand
  • Trace problems afterwards through the logs or indicators recorded when the exception occurred
  • Performance analysis

Several dimensions:

  • Indicators: system indicators (service health, memory, CPU usage), business indicators (transaction success rate), and user-defined indicators
  • Logs: system logs can be viewed easily
  • Links: monitor the path a request takes and analyze where its time is spent

Monitoring objects:

  • Open platform services: system health status, CPU usage, JVM indicators (GC count and frequency, number of threads)
  • Core gateway: transaction success rate, business exception rate
  • Middleware: MySQL, ES, Redis cluster, ZooKeeper, Kafka, ActiveMQ, Linux indicators, and whether each is running normally
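As a rough illustration of the two core gateway indicators, the sketch below derives both rates from a batch of transaction outcomes. The status labels (`success`, `business_error`, `system_error`) and the definition of each rate as a share of all transactions are assumptions made for this example; the platform's actual definitions may differ.

```python
from collections import Counter


def gateway_rates(statuses):
    """Compute rates from an iterable of outcome labels (labels are hypothetical)."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        # No traffic in the window: the rates are undefined, not zero.
        return {"success_rate": None, "business_exception_rate": None}
    return {
        "success_rate": counts["success"] / total,
        "business_exception_rate": counts["business_error"] / total,
    }


sample = ["success"] * 8 + ["business_error", "system_error"]
print(gateway_rates(sample))
# {'success_rate': 0.8, 'business_exception_rate': 0.1}
```

Returning `None` for an empty window keeps a quiet period from looking like a 0% success rate.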

Application on the open platform:

  • Prometheus + Grafana currently monitors system indicators for the services, core gateway, database, and middleware
  • Grafana displays gateway transactions, with ES as the data source; this is for display only
  • Kibana: complex calculation and monitoring of gateway transactions
  • Alertmanager handles alarms; the common service exposes an SMS interface used to send them
  • Use ELK to aggregate logs, viewed on the operations and maintenance screen on the management side:

    • Look up normal / abnormal gateway transactions and their link logs by serial number
    • View abnormal service logs, and open the detailed log by serial number
  • SkyWalking for link monitoring
  • Use Redis Manager to monitor and operate Redis clusters
  • Alarms:

    • Prometheus can alarm on indicators
    • Flink CEP

      • Alarm on selected error codes in gateway transactions
      • Alarm on log keywords from open platform services
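To make the error-code alarm concrete, here is a minimal sketch of the kind of rule involved: fire when a watched error code appears a given number of times within a time window. It is plain Python standing in for the idea, not the Flink CEP API; the class name, the error code `E5001`, the threshold, and the window length are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class ErrorCodeWindow:
    """Fire when `threshold` watched error codes arrive within `window_s` seconds."""

    def __init__(self, watched, threshold, window_s):
        self.watched = set(watched)
        self.threshold = threshold
        self.window_s = window_s
        self.hits = deque()  # timestamps of watched error codes, oldest first

    def on_event(self, ts, code):
        """Feed one gateway transaction; return True if the rule fires."""
        if code not in self.watched:
            return False
        self.hits.append(ts)
        # Drop hits that have fallen out of the time window.
        while self.hits and ts - self.hits[0] > self.window_s:
            self.hits.popleft()
        return len(self.hits) >= self.threshold


# Hypothetical rule: three "E5001" errors within 60 seconds.
rule = ErrorCodeWindow(watched={"E5001"}, threshold=3, window_s=60)
events = [(0, "E5001"), (10, "OK"), (20, "E5001"), (50, "E5001"), (200, "E5001")]
fired = [rule.on_event(ts, code) for ts, code in events]
print(fired)  # [False, False, False, True, False]
```

The same structure would apply to log keywords, with a keyword match replacing the error-code lookup.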

Sketch Map:
[Figure: open platform monitoring system]