Hello’s practice in distributed message governance and microservice governance

Time: 2021-10-14

Introduction: As the company’s business keeps growing, so does its traffic. We found that major production incidents are often caused by sudden traffic spikes, so controlling and protecting traffic to keep the system highly available is especially important.


Author Liang Yong

Background


Hello has evolved into a comprehensive mobility platform covering two-wheel travel (Hello Bike, Hello Moped, Hello Electric Vehicle, Xiaoha Battery Swap) and four-wheel travel (Hello Hitch, network-wide ride-hailing, Hello Taxi), and has also explored local-life services such as hotels and in-store group buying.

As the company’s business keeps growing, so does its traffic. We found that major production incidents are often caused by sudden traffic spikes, so controlling and protecting traffic to keep the system highly available is especially important.

This article shares Hello’s experience in governing message traffic and microservice invocation.

Author introduction


Liang Yong (Lao Liang), co-author of the RocketMQ Practice and Advanced column and a reviewer of RocketMQ Technology Insider; speaker at the ArchSummit Global Architecture Conference and the QCon Case Study Club.

His official account currently focuses on back-end middleware; he has published more than 100 source-code and hands-on articles there, including the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He currently works at Hello as a senior technical expert.

Talk about governance


Before we start, let’s talk about governance. Here’s Lao Liang’s personal understanding:

What does governance do?

  • It makes our environment better

How do we know what is not good enough?

  • Past experience
  • User feedback
  • Industry comparison

How do we know whether it has stayed good?

  • Monitoring and tracking
  • Alarm notification

How do we make it better when it goes bad?

  • Governance measures
  • Emergency plans

Contents

  1. Building a distributed message governance platform
  2. RocketMQ pitfalls in practice and their solutions
  3. Building a high-availability governance platform for microservices

Background

RabbitMQ running bare


The company used RabbitMQ before. Here are the pain points we hit with it; many incidents were caused by RabbitMQ cluster flow control.

  • The backlog is too large: purge it or not? That is a question; let me think it over again.
  • An excessive backlog triggers cluster flow control? That really hurts the business.
  • Want to re-consume data from two days ago? Please resend it.
  • Which services are connected to the cluster? You will have to wait while I dig up the IPs.
  • Are there risks such as large messages? Let me guess.

Services running bare

We once had a failure in which multiple businesses shared one database. During an evening peak, traffic surged and brought the database down.

  • Upgrading the single-instance database to its highest specification still did not help
  • It came back after a restart, then went down again shortly after
  • The cycle repeated; all we could do was suffer and silently wait for the peak to pass

Takeaway: both messages and services need sound governance measures.

Build a distributed message management platform

Design Guide


Which are our key indicators and which are secondary? That is the primary question of message governance.

Design goal

The goal is to shield the complexity of the underlying middleware (RocketMQ / Kafka) and route messages dynamically through a unique identifier, while building a message governance platform that integrates resource management, retrieval, monitoring, alarming, patrol inspection, disaster recovery, and visual operations, so that the message middleware runs smoothly and healthily.

Points to consider in the design of message governance platform

  • Provide an easy-to-use API
  • What are the key metrics for judging whether client usage carries security risks
  • What are the key indicators for measuring cluster health
  • Which common user / O&M operations should be visualized
  • What measures address each kind of unhealthy state

Be as simple and easy to use as possible

Design Guide


Making complex problems simple is a real ability.

Minimalist unified API

We provide a unified SDK that encapsulates the two message middlewares (Kafka / RocketMQ).


One-time application


Automatically creating topics and consumer groups is not suitable for production: automatic creation gets out of control and harms lifecycle management and cluster stability. The application process must be controlled, but it should stay as simple as possible: for example, one application takes effect in every environment and generates the associated alarm rules.


Client governance

Design Guide

Monitor whether client usage is up to standard, and find the appropriate governance measures.

Scenario replay

Scenario 1: instantaneous traffic and cluster flow control

Suppose the cluster’s current TPS is 10,000 and it surges instantly to 20,000 or more. Such an excessively steep traffic spike is very likely to trigger cluster flow control. For this scenario, we monitor the client’s send rate and, once the rate and steepness thresholds are met, smooth the sending.
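As a rough sketch of the smoothing step (illustrative only, not Hello’s actual SDK), a token bucket can pace sends at a fixed ceiling once the spike threshold is hit:

```python
import time

class SmoothingSender:
    """Token-bucket pacing for a producer (hypothetical helper, not a real client API)."""

    def __init__(self, max_rate: float):
        self.max_rate = max_rate          # permitted sends per second
        self.tokens = max_rate            # start with a full bucket
        self.last = time.monotonic()

    def try_send(self) -> bool:
        """Refill tokens by elapsed time; allow the send only if a token is available."""
        now = time.monotonic()
        self.tokens = min(self.max_rate,
                          self.tokens + (now - self.last) * self.max_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller queues or waits, flattening the spike
```

A burst of 100 calls against a 10-per-second bucket lets roughly the first bucketful through and defers the rest, which is the flattening effect described above.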

Scenario 2: large messages and cluster jitter

When a client sends a large message, for example hundreds of KB or even several MB, it may cause long I/O times and cluster jitter. To govern this scenario, we monitor the size of sent messages, identify the services sending large messages through after-the-fact patrol inspection, and push them to adopt compression or refactor so that messages stay within 10 KB.
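A minimal sketch of the compression path, using a hypothetical `prepare_body` helper and the 10 KB target from the text; a real service would set a header flag so consumers know to decompress:

```python
import gzip

MAX_BODY_BYTES = 10 * 1024  # the 10 KB target mentioned above

def prepare_body(body: bytes):
    """Compress oversized payloads before sending (illustrative helper).

    Returns (payload, compressed_flag); the consumer would check the flag,
    e.g. via a message-header property, and decompress when needed.
    """
    if len(body) <= MAX_BODY_BYTES:
        return body, False
    return gzip.compress(body), True
```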

Scenario 3: client version

As features iterate, the SDK version is upgraded as well, and changes may introduce risks besides new features. Using a version that is too old means, first, that some features are unsupported and, second, that there may be security risks. To understand SDK usage, clients report their SDK versions, and patrol inspection pushes users to upgrade.

Scenario 4: consumer traffic removal and recovery

Consumer traffic removal and recovery usually has the following usage scenarios: the first is removing traffic when releasing an application, the other is removing traffic before troubleshooting to help locate a problem. To support this, the client must listen for removal/recovery events and pause or resume consumption accordingly.
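The pause/resume wiring can be sketched as below; `on_event` stands in for the platform’s config-push callback, and all names are hypothetical rather than a real client API:

```python
class GovernedConsumer:
    """Sketch of a consumer that reacts to removal/recovery events."""

    def __init__(self):
        self.paused = False

    def on_event(self, event: str) -> None:
        if event == "REMOVE":        # e.g. an app release, or pre-troubleshooting removal
            self.paused = True       # stop pulling new messages
        elif event == "RECOVER":
            self.paused = False      # resume consumption

    def poll(self, queue: list):
        """Return the next message, or None while paused or when the queue is empty."""
        return None if self.paused or not queue else queue.pop(0)
```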

Scenario 5: send/consume latency detection

How long does it take to send or consume a message? By monitoring latency, we can find low-performance applications and push them to improve.
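One way to capture this is a small timing wrapper around each send/consume call (a sketch; the 800 ms threshold matches the patrol-inspection rule mentioned later, and the `latencies` list is a stand-in for a real metrics reporter):

```python
import time
from contextlib import contextmanager

latencies = []  # in-process stand-in for a metrics reporter

@contextmanager
def timed(op: str, slow_ms: float = 800):
    """Measure one send/consume call; flag it when it crosses the threshold."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        latencies.append(elapsed_ms)
        if elapsed_ms > slow_ms:
            print(f"SLOW {op}: {elapsed_ms:.0f} ms")  # would feed the alarm system

with timed("send"):
    pass  # stand-in for producer.send(...)
```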

Scenario 6: improving troubleshooting efficiency

Troubleshooting often requires retrieving information across a message’s lifecycle: what message was sent, where it is stored, and when it was consumed. The lifecycle inside the messaging system can be stitched together by msgId; in addition, a link identifier similar to rpcId/traceId is embedded in the message header to stitch together all the messages of one request.
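A sketch of the header-embedding idea, with hypothetical header names; the traceId would normally be read from the current RPC context rather than generated here:

```python
import uuid

def build_message(body, trace_id=None):
    """Attach a request-level link identifier to the message header (sketch)."""
    return {
        "headers": {"traceId": trace_id or uuid.uuid4().hex},
        "body": body,
    }

def messages_of_request(msgs, trace_id):
    """Stitch together all messages belonging to one request."""
    return [m for m in msgs if m["headers"]["traceId"] == trace_id]
```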

Distilling governance measures

Required monitoring information

  • Send/consume rate
  • Send/consume latency
  • Message size
  • Node information
  • Link identifier
  • Version information

Common governance measures

  • Regular patrol inspection: with the instrumentation data, risky applications can be found through patrol inspection, e.g. send/consume latency above 800 ms, message size above 10 KB, or a version below a specific one.
  • Send smoothing: for example, when instantaneous traffic of 10,000 TPS is detected to surge by more than 2x, warm-up smoothing levels out the spike.
  • Consumption rate limiting: when a third-party interface needs rate limiting, consumption traffic can be throttled; this can be implemented together with the high-availability framework.
  • Consumer removal: pause and resume the consuming client by listening for removal events.

Topic / consumer group governance

Design Guide


Monitor the resource usage of topics and consumer groups.

Scenario replay


Scenario 1: impact of consumption backlog on the business

Some business scenarios are very sensitive to consumption backlog, while others are not, as long as consumption eventually catches up. For example, unlocking a bike is a matter of seconds, while batch jobs that summarize information are insensitive to backlog. By collecting backlog metrics and alerting the owners of applications that cross the threshold in real time, owners can track consumption status as it happens.
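The per-business sensitivity can be expressed as a simple threshold lookup; the numbers and group names below are illustrative, not Hello’s production values:

```python
def backlog_alarms(backlogs: dict, thresholds: dict, default: int = 10000) -> list:
    """Compare each consumer group's backlog against its per-business threshold.

    Sensitive businesses (e.g. bike unlocking) set a low threshold;
    batch-summary jobs set a high one.
    """
    alerts = []
    for group, lag in backlogs.items():
        limit = thresholds.get(group, default)
        if lag > limit:
            alerts.append((group, lag, limit))  # pushed to the owners in real time
    return alerts
```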

Scenario 2: send/consume rate drops to zero

Should a send/consume rate dropping to zero raise an alarm? In some scenarios the rate must not drop to zero; if it does, the business is abnormal. By collecting rate metrics, applications that cross the threshold are alarmed in real time.

Scenario 3: consumer node offline

When a consumer node goes offline, the application owners need to be notified. This requires collecting registered node information, so that a node going offline triggers an alarm notification in real time.

Scenario 4: send/consume imbalance

Imbalanced sending or consumption often hurts performance. I remember that in one consultation, a team had set the message key to a constant; by default the partition is chosen by hashing the key, so all messages landed in a single partition, and no amount of tuning could improve throughput. In addition, per-partition consumption backlog must be detected, with a real-time alarm triggered when the imbalance becomes excessive.
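The constant-key problem is easy to reproduce with a simplified stand-in for key hashing (not Kafka’s or RocketMQ’s real hash function):

```python
def pick_partition(key: str, num_partitions: int) -> int:
    """Default-style key partitioning, simplified: hash the key, mod the partition count."""
    return sum(key.encode()) % num_partitions  # stand-in for the real hash

# A constant key routes every message to one partition; varied keys spread out.
constant = [pick_partition("FIXED_KEY", 8) for _ in range(1000)]
varied = [pick_partition(f"order-{i}", 8) for i in range(1000)]
```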

Distilling governance measures


Required monitoring information

  • Send/consume rate
  • Per-partition send details
  • Per-partition consumption backlog
  • Consumer group backlog
  • Registered node information

Common governance measures

  • Real-time alarms: real-time alarm notifications for consumption backlog, send/consume rate, node offline, and partition imbalance.
  • Performance improvement: if consumption cannot keep up with the backlog, it can be improved by adding pull threads, adding consumer threads, increasing the number of partitions, and similar measures.
  • Self-service troubleshooting: multi-dimensional retrieval tools, e.g. retrieving a message’s lifecycle by time range, msgId, or link identifier.

Cluster health governance

Design Guide


What are the core indicators for measuring cluster health?

Scenario replay

Scenario 1: cluster health detection

Cluster health detection answers one question: is this cluster in good shape? It is answered by detecting the number of cluster nodes, the heartbeat of each node, the cluster-write TPS water level, and the cluster-consume TPS water level.
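The four checks can be aggregated into a single predicate; all thresholds below are illustrative, not Hello’s production values:

```python
def cluster_healthy(nodes_alive: int, nodes_expected: int,
                    heartbeat_rt_ms: dict, write_tps: float, consume_tps: float,
                    rt_limit_ms: float = 200, tps_capacity: float = 100000) -> bool:
    """Aggregate the four health checks from the text (illustrative thresholds)."""
    if nodes_alive < nodes_expected:
        return False                                        # a node is missing
    if any(rt > rt_limit_ms for rt in heartbeat_rt_ms.values()):
        return False                                        # a node's heartbeat RT is too high
    if write_tps > tps_capacity or consume_tps > tps_capacity:
        return False                                        # water level beyond capacity
    return True
```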

Scenario 2: cluster stability

Cluster flow control often reflects insufficient cluster capacity, and cluster jitter also causes client send timeouts. By collecting each node’s heartbeat latency and the change rate of the cluster-write TPS water level, we can tell whether the cluster is stable.

Scenario 3: cluster high availability

High availability mainly targets extreme scenarios where an availability zone becomes unavailable, or where certain topics and consumer groups on a cluster become abnormal. This calls for targeted measures: for example, master-slave pairs deployed across availability zones in the same city, dynamic migration of topics and consumer groups to a disaster-recovery cluster, and multi-site active-active deployment.

Distilling governance measures


Required monitoring information

  • Number of cluster nodes
  • Cluster node heartbeat latency
  • Cluster write TPS water level
  • Cluster consume TPS water level
  • Change rate of cluster write TPS

Common governance measures

  • Regular patrol inspection: regularly inspect the cluster TPS water level and hardware water level.
  • Disaster recovery: master-slave deployed across availability zones in the same city, dynamic migration to the disaster-recovery cluster, and multi-site active-active deployment.
  • Cluster tuning: OS version/parameter tuning and cluster parameter tuning.
  • Cluster classification: split by business line and by core/non-core service.

Focus on core indicators


If I had to pick the most important of these key indicators, I would choose the heartbeat detection of each node in the cluster, i.e. response time (RT). Let’s look at what can affect RT.


About alarms

  • Most monitoring metrics are detected at second-level granularity
  • Alarms that cross their thresholds are pushed to the company’s unified alarm system for real-time notification
  • Patrol-inspection risk notices are pushed to the company’s patrol-inspection system and summarized weekly

Message platform diagram

Architecture diagram



Dashboards

  • Multi-dimensional: cluster dimension and application dimension
  • Full aggregation: aggregated views of the key indicators


RocketMQ pitfalls in practice and their solutions

Action guide


Whenever we hit a pit, we fill it in.

1. RocketMQ cluster CPU spikes

Problem description


The RocketMQ slave and master nodes frequently showed obvious CPU spikes, and many times the slave nodes simply went down.


Only the system log contained an error message:

2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel: Call Trace:
2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: [] ? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [] ? ip_finish_output+0x192/0x380
2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: [] ?

Tuning various system parameters only alleviated the problem and could not eradicate it; the spikes still exceeded 50%.

Solution

We upgraded the OS of every machine in the cluster from CentOS 6 to CentOS 7, taking the kernel from 2.6 to 3.10, and the CPU spikes disappeared.

2. Delayed messages failed on a production RocketMQ cluster

Problem description

The RocketMQ community edition supports 18 delay levels by default, and each level is consumed precisely at the configured time. We even specifically tested whether the consumption intervals were accurate, and they were. Yet this precise feature still went wrong: we were surprised to receive a report from a business colleague that delayed messages on one production cluster could not be consumed!

Solution

Move delayOffset.json and the consumequeue/SCHEDULE_TOPIC_XXXX directory to another location (equivalent to deleting them), then restart the broker nodes one by one. After the restart, we verified that delayed messages could be sent and consumed normally again.

Build a high availability governance platform for micro services

Design Guide

What are our core services, and what are our non-core services? That is the primary question of service governance.

Design goal

Services must be able to cope with sudden traffic surges, and above all the smooth operation of core services must be guaranteed.

Application classification and grouped deployment

Application classification


Applications are divided into four levels according to business impact and user impact.

  • Business impact: the business scope affected when the application fails
  • User impact: the number of users affected when the application fails

S1: core products whose failure leaves external users unable to use the service or causes large asset losses, such as the core links of the main businesses (bike and moped lock/unlock, hitch order issuing and taking) and the applications those core links strongly depend on.

S2: does not directly affect transactions, but concerns the management and maintenance of important front-office configurations, or back-office business processing functions.

S3: a failure has little impact on users or core product logic and no impact on the main business, or affects only a small amount of new business; also important internal tools that do not directly affect the business, whose management functions have little front-office impact.

S4: systems for internal users that do not directly affect the business, or systems scheduled to be taken offline.

Grouped deployment

S1 services are the company’s core services and the key objects of protection; they must not be accidentally impacted by non-core traffic.

  • S1 services are deployed in groups, split into a stable environment and a standalone environment
  • Traffic from non-core services calling S1 services is routed to the standalone environment
  • S1 services calling non-core services must be configured with a circuit-breaking policy
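The routing rule above can be sketched as a single function; the service names are hypothetical:

```python
CORE_APPS = {"lock-service", "order-core"}   # hypothetical S1 services

def pick_target_group(caller: str, callee: str) -> str:
    """Route a call between service groups (sketch of the grouping rule).

    Calls from non-core services into an S1 service go to the standalone
    environment, so the stable environment only serves core traffic.
    """
    if callee in CORE_APPS and caller not in CORE_APPS:
        return "standalone"
    return "stable"
```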


Building multiple rate-limiting and circuit-breaking capabilities

High-availability platform capabilities we built


Effects of some rate-limiting modes

  • Warm-up

  • Queueing

  • Warm-up + queueing

High availability platform diagram

  • All middleware is integrated
  • Dynamic configuration takes effect in real time
  • Detailed traffic for every resource and IP node

Summary

  • Which are our key indicators and which are secondary is the primary question of message governance
  • Which are our core services and which are non-core is the primary question of service governance
  • Reading source code combined with hands-on practice is a better way to work and learn

Copyright notice: The content of this article is contributed by an Alibaba Cloud real-name registered user, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not bear the corresponding legal liability. For specific rules, see the Alibaba Cloud Developer Community User Service Agreement and the Alibaba Cloud Developer Community Intellectual Property Protection Guidelines. If you find content suspected of plagiarism in the community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the content suspected of infringement.