Architecture design | Layered service monitoring strategy in distributed systems


Source code for this article: GitHub · click here || Gitee · click here

1、 Distributed failures

With sound design thinking and well-specified design documents, the architecture and business development of a distributed system are relatively easy to handle. "Relatively" here means in comparison with sudden failures in the production environment of a distributed architecture.

In actual development there is a perverse pattern: the more complex the core business is, the more you worry about problems, and the more prone to problems it turns out to be.

So when a core service link fails, locating the problem quickly is a headache, especially in the special cases where the symptoms are vague and hard to reproduce. Add pressure from customers or managers, and you have a scenario that haunts most developers. Worse still, one person may own the entry point where the problem surfaces while the actual fault lies in another service on the request link. In that case, blame-shifting escalates sharply.

The more complex the system and the more experienced the developers or operations staff, the more they insist on a monitoring system, especially full-link monitoring that covers the underlying hardware, network, middleware, service links, log observation, and early warning. Its purpose is to locate problems quickly and save time and worry.

2、 Full-link monitoring

1. Monitoring hierarchy

In a distributed system, the systems and levels that need monitoring are extremely complex. They are usually divided into three levels: application services, software services, and hardware services.

Generally, operations staff manage the hardware services, while developers manage the application and software services.

2. Application services

The application layer carries the business logic that developers write, and it is also one of the layers most prone to sudden problems. After a long time at one company, having developed too many business lines, you start to feel less like a developer and more like a handyman, spending a large part of every day dealing with assorted problems. Application-layer monitoring involves the following core modules:

Request traffic

Any service under high concurrent traffic will expose all kinds of problems, so traffic on the core interfaces is the main focus of monitoring.
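As a rough illustration of watching traffic on a core interface, here is a minimal sliding-window counter. It is a sketch under stated assumptions, not a production design: the class name `TrafficMonitor`, the interface path, and the 60-second default window are all hypothetical.

```python
import time
from collections import defaultdict, deque


class TrafficMonitor:
    """Counts requests per interface over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the class can be tested without sleeping
        self._hits = defaultdict(deque)  # interface name -> timestamps of recent requests

    def record(self, interface):
        """Register one request against the given interface."""
        self._hits[interface].append(self.clock())

    def count(self, interface):
        """Return how many requests the interface received within the window."""
        hits = self._hits[interface]
        cutoff = self.clock() - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits)
```

A request handler would call `record("/order/create")` on every request, and a periodic job would compare `count(...)` with whatever limit the team considers abnormal for that interface.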

Service link

When a request fails, it is very important to determine quickly which service the problem is in, or which services are involved.
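One common way to answer "which services did this request touch" is to pass a trace ID along the request link. The sketch below simulates that in a single process. The header name `X-Trace-Id`, the service names, and the in-memory `spans` list are assumptions made for illustration; a real system would send these records to a tracing backend.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name

spans = []  # stand-in for a tracing backend


def handle(service, headers, downstream=()):
    """Simulate a service handling a request and calling downstream services."""
    # Reuse the caller's trace ID, or start a new trace at the entry point.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    spans.append({"trace_id": trace_id, "service": service})
    for name in downstream:
        handle(name, {TRACE_HEADER: trace_id})
    return trace_id


def services_on_link(trace_id):
    """List every service that recorded a span for the given trace, in call order."""
    return [s["service"] for s in spans if s["trace_id"] == trace_id]
```

With the trace ID in hand, the services on a failed request's link can be listed directly instead of being guessed at from the entry point.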

Log system

Logging on core interfaces is also essential. The analysis results of the log system usually let you identify the system's abnormal points and focus optimization there.
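To make "identify the abnormal points from the logs" concrete, here is a small sketch that computes an error ratio per interface. The log line shape `LEVEL /path elapsed_ms` is an assumption made for this example and is not taken from the article's source code.

```python
from collections import Counter


def error_rates(log_lines):
    """Return {interface: error ratio} from lines shaped like 'LEVEL /path elapsed_ms'."""
    total, errors = Counter(), Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        level, interface, _elapsed = parts
        total[interface] += 1
        if level == "ERROR":
            errors[interface] += 1
    return {name: errors[name] / total[name] for name in total}
```

Interfaces with the highest ratio are natural candidates for the focused optimization described above.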

3. Software services

To handle the complex business scenarios of distributed systems, various middleware is usually introduced as support, such as the database, cache, and message queue (MQ). These middleware components usually have their own monitoring and management consoles.

Database: Druid monitoring and analysis is widely used;

Message queue: RocketMQ and its console are commonly used;

Redis cache: provides commands for obtaining the relevant monitoring data;

Some companies even build their own aggregation platform for managing, operating, and monitoring the middleware layer, which makes it easier to analyze problems as a whole.
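For Redis, the `INFO` command replies with plain text made of `# Section` headers and `field:value` lines. The sketch below parses such a reply without needing a live server. The numbers in `SAMPLE` are made up for illustration, and the helper names are hypothetical.

```python
# Made-up sample in the shape of an INFO reply; real replies contain many more fields.
SAMPLE = (
    "# Clients\r\n"
    "connected_clients:12\r\n"
    "\r\n"
    "# Memory\r\n"
    "used_memory:1048576\r\n"
    "\r\n"
    "# Stats\r\n"
    "keyspace_hits:90\r\n"
    "keyspace_misses:10\r\n"
)


def parse_redis_info(text):
    """Parse the text reply of the Redis INFO command into a flat dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '# Section' headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


def hit_ratio(info):
    """Cache hit ratio from keyspace_hits and keyspace_misses, or None if no lookups yet."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    lookups = hits + misses
    return hits / lookups if lookups else None
```

Fields such as `connected_clients`, `used_memory`, and the hit ratio are typical values to feed into a dashboard or an alert rule.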

4. Hardware Services

At the hardware level, operations staff focus on three core items: CPU, memory, and network. Failures of the underlying hardware resources are very likely to be triggered by the upper-layer application services or middleware services.

There are many mature frameworks for hardware-level monitoring, such as Zabbix and Grafana. These components are feature-rich, of course, and are not limited to the hardware layer.

5. Avalanche effect

Some faults paralyze a large number of services, which is known as the avalanche effect. It is a common problem: the fault source is not handled quickly, or there is no circuit-breaking (fuse) mechanism, and the whole service link collapses. When dealing with faults, therefore, learn to identify the core fault point from full-stack monitoring information and quickly cut off the failing single-point service, so that the system as a whole remains available.
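The article does not describe a specific fuse implementation, so the following is only a minimal sketch of the general circuit-breaker pattern. The class name, the thresholds, and the half-open behavior are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                # Fail fast instead of piling more requests onto a failing service.
                raise CircuitOpenError("downstream service is cut off")
            # Cool-down elapsed: allow one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the breaker fully
        return result
```

Wrapping calls to a fragile downstream service in `breaker.call(...)` lets callers fail fast while that service is down, which is one way to keep a single-point fault from spreading along the link.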

3、 Precautions

Although a monitoring system is very valuable, building one in practice is still difficult. It requires the right awareness, which is different from the mindset of business development, and needs from many sides have to be addressed. The basic strategies for building a monitoring system are as follows.

1. Selectivity

Not every environment and interface of every service needs to be monitored. Monitoring usually covers the core links, the core middleware, and the environments those services run in.

For example: the transaction link, the transaction database, and their deployment environment; or the high-concurrency business of large customers. When something goes wrong there, it needs a timely response and immediate handling. To put it bluntly, the services that bring in revenue are the ones to focus on.

Even if a non-critical service has a problem, there is buffer time, so there is no need to spend effort adding monitoring to it. There is a saying among people who build monitoring systems: adding monitoring to a simple link makes it complex and error-prone; adding monitoring to a complex link makes it more complex and more error-prone. The cost is accepted, however, in order to resolve faults better.

2. Independence

A failure of the monitoring system itself must not affect normal business processes. Even if monitoring information is unavailable in some circumstances, normal business services must not be affected by the monitoring service.
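One way to keep monitoring failures away from the business path is to swallow every error raised by the reporting call itself. The decorator below is a sketch of that idea. The `report` callback is a hypothetical stand-in for whatever metrics client is actually in use.

```python
import functools
import logging
import time

log = logging.getLogger("monitoring")


def monitored(report):
    """Wrap a business function; `report(name, elapsed_seconds)` is a hypothetical callback."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                try:
                    report(func.__name__, time.monotonic() - start)
                except Exception:
                    # A broken monitoring backend must never fail the business call.
                    log.warning("monitoring report failed", exc_info=True)
        return wrapper
    return decorator
```

Business exceptions still propagate unchanged; only errors from the reporting step are caught and logged.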

3. Integrity

An aggregated monitoring system shows the global state of the monitored link, so you can quickly pinpoint where a fault is and analyze its cause.

4. Early warning

For example, the CPU usage suddenly spikes, a middleware service suddenly stops, or memory usage is too high. The monitoring system can raise an early warning for each of these and notify the person responsible by e-mail or message, so that they can respond quickly. Most developers are familiar with this scenario, and it haunts them too.
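A threshold rule is the simplest form of early warning. The sketch below evaluates one metrics sample against a list of rules and returns who should be notified. The metric names, thresholds, and e-mail addresses are hypothetical, and actually sending the e-mail or message is left out.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # name of the metric in the sample, e.g. "cpu_percent" (hypothetical)
    threshold: float  # alert when the sampled value is strictly greater than this
    owner: str        # person in charge to notify


def evaluate(rules, sample):
    """Return (owner, message) pairs for every rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append((rule.owner, f"{rule.metric}={value} exceeds {rule.threshold}"))
    return alerts
```

For instance, with a rule `Rule("cpu_percent", 90, "ops@example.com")` and a sample of `{"cpu_percent": 97}`, `evaluate` returns one alert addressed to that owner.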

4、 Source code address

GitHub · address
Gitee · address

Recommended reading: Architecture Design Series

Architecture design: single service, cluster, distributed, basic differences and connections
Architecture design: Global ID generation strategy in distributed business system
Architecture design: distributed system scheduling, zookeeper cluster management
Architecture design: interface idempotency principle, anti duplicate submission token management
Architecture design: cache management mode, monitoring and memory recycling strategy
Architecture design: asynchronous processing flow, detailed explanation of various implementation modes
Architecture design: peak shaving of high concurrent traffic and locking mechanism of shared resources
Architecture design: distributed service, detailed explanation of library table splitting mode
Architecture design: distributed transaction ① concept introduction and basic theory
Architecture design: Based on the e-commerce transaction process, illustrate the piecewise submission of TCC transactions
Architecture design: Based on message oriented middleware, illustrating flexible transaction consistency
Architecture design: Based on Seata middleware, transaction management under microservice mode