Kubernetes is being adopted ever more widely in production, and deployments keep growing in complexity, which makes stability assurance increasingly challenging.
For Kubernetes-based cloud products, stability assurance has become a basic requirement. Stability defects can cause heavy losses, such as user churn, declining user confidence, and slower product iteration.
Stability assurance practices differ from one Kubernetes-based product to another, yet products built on the same technology stack can share a common set of standard practices.
Therefore, drawing on past development practice and experience with Kubernetes stability assurance, we are attempting to compile a Kubernetes stability assurance manual that captures these best practices. The aim is for everyone to form a comprehensive understanding of Kubernetes stability assurance theory, for the corresponding tools and services to become infrastructure that is reused across products with similar technology stacks, and for the dissemination, iteration, and application of stability assurance best practices to accelerate.
As the first article in the Kubernetes stability assurance manual, this article distills the core content of stability assurance and serves as a minimalist guide to it.
Minimalist manual objectives
- Understand the objectives of stability assurance in 1 minute
- Grasp the big picture of stability assurance in 3 minutes
- Find recommended stability assurance tools or services in one place
Stability assurance objectives
- Meet the stability requirements of the service or product
- Accelerate service or product iterations
Stability assurance inspection items
Stability assurance levels
- Map out the operation link diagram and mark which links are critical
- Configure observability based on the operation link diagram
- Manage controllability based on link importance
To reduce the cost of putting this into practice, identify the elements and interactions in a cloud product and decompose the complex system in terms of them:
- Elements (2 types): cloud product components and cloud products
- Interactions (2 types, 3 scenarios in total): within a cloud product component, between components, and between cloud products
As shown below:
As the number of elements and interaction relationships grows, the system becomes increasingly complex and stability assurance becomes harder, so avoid introducing unnecessary complexity.
Therefore, map out the current operation link diagram and analyze the importance of each link, then draw up the overall component diagram and assess each component's blast radius. On this basis, also review the people involved to avoid single points of failure in staffing.
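The blast-radius assessment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the `LINKS` list, and the `blast_radius` helper are hypothetical and do not come from any real product. The operation link diagram is modeled as (caller, callee, is_critical) tuples, and the blast radius of a component is taken to be every component that depends on it directly or transitively.

```python
from collections import defaultdict, deque

# Hypothetical operation links: (caller, callee, is_critical).
# Names and criticality marks are illustrative assumptions only.
LINKS = [
    ("console", "api-gateway", True),
    ("api-gateway", "cluster-manager", True),
    ("cluster-manager", "etcd", True),
    ("cluster-manager", "audit-log", False),
    ("billing", "api-gateway", False),
]


def blast_radius(links, failed, critical_only=False):
    """Return the components that depend, directly or transitively, on `failed`.

    With critical_only=True, only links marked as critical are followed.
    """
    # Reverse index: callee -> set of callers that depend on it.
    dependents = defaultdict(set)
    for caller, callee, is_critical in links:
        if critical_only and not is_critical:
            continue
        dependents[callee].add(caller)

    # Breadth-first walk upstream from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents[node]:
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius(LINKS, "etcd")))
    # ['api-gateway', 'billing', 'cluster-manager', 'console']
    print(sorted(blast_radius(LINKS, "etcd", critical_only=True)))
    # ['api-gateway', 'cluster-manager', 'console']
```

Restricting the walk to critical links gives a rough view of which callers are hit on their key path, while the full walk shows everything that could be affected. A real assessment would also need to account for retries, fallbacks, and degradation paths, which this sketch ignores.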
Example of operation link diagram:
Example of link importance:
Example of interaction between cloud products:
With this analysis of system complexity and operation links in hand, solutions for the stability assurance problem domain can be proposed and implemented effectively.
- Maintain the role list, functional flow chart, and operation link diagram over the long term
- Detect when problems occur and when they recover through multiple tiered “alarm groups”
- Handle and review problems in a single “problem handling group”
For complex systems, there are usually the following role relationships:
Mapping out the roles at each layer makes it easy for the people involved to find the right contact, which shortens problem handling time.
The following chapters of the Kubernetes stability assurance manual will be refined and summarized from the perspectives of methodology and tools/services, and shared once the first edition is ready:
Author: Wu Peng
This article is original content from Alibaba Cloud and may not be reproduced without permission.