Kubernetes Stability Assurance Manual: Insight + Plan

Time: 2021-9-15

Introduction: Stability assurance is a complex topic: cluster stability has to be ensured effectively, iteratively and sustainably. A systematic method may solve this problem.

Author | Wu Peng
Source | Alibaba Cloud official account

This article is part of the Kubernetes Stability Assurance Manual series.

Overview


Stability assurance is a complex topic: cluster stability has to be ensured effectively, iteratively and sustainably. A systematic approach may solve this problem.

To form a systematic method, we can sort out the sources of complexity in stability assurance, formulate a data model to describe them, and then, on the basis of that data model, make the stability assurance of the cluster digitized and visualized. With the data model as the core, we can continuously iterate on our understanding, practice and experience of stability assurance.

Sources of complexity in stability assurance


The complexity of stability assurance generally comes from the following dimensions:

  • Number of system components and their interactions: change continuously over time
  • Dynamic behavior of system components and interactions: hard to deduce and observe
  • Types and quantities of system resources: change continuously over time
  • Dynamic behavior of system resources: hard to deduce and observe
  • Cluster stability assurance actions: hard to standardize and execute safely

Data model


The data model for insight and plan can be abstracted as 4 diagrams and 3 tables:

4 diagrams

  • Architecture diagram: describes cluster components and their interactions
  • Architecture operation diagram: describes the dynamic characteristics of cluster components and their interactions
  • Resource composition diagram: describes the composition of cluster resources
  • Resource operation diagram: describes the dynamic usage characteristics of cluster resources

3 tables

  • Event list: describes the events generated by the cluster that need attention
  • Action list: describes the management operations that can be performed in the cluster
  • Plan list: describes the relationship between events and actions in the cluster


Insight


The functions of a cluster are provided by its architecture, and the functional components run on cluster resources. Therefore, the core of insight into cluster stability is to grasp the characteristics of the cluster architecture and of the cluster resources.

1. Architecture diagram


A cluster architecture can usually be represented as a graph: nodes represent components and edges represent their interactions. The graph structure makes the cluster architecture easy to grasp intuitively, as shown in the figure below:

[Figure: cluster architecture diagram]

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "0ce0e913f6e5516846c654dbd81db6ecab1f684e",
            "name": "kube-apiserver",
            "description": "in XXX VPC",
            "type": "managed component",
            "dependencies": {}
        },
        {
            "_id": "f0740d8bb67520857061a9b71d4a9e4fc50bfe3d",
            "name": "etcd",
            "description": "in XXX VPC",
            "type": "managed component | storage",
            "dependencies": {}
        },
        {
            "_id": "05952a825e91cb50a81cbaf23c6941d5c3bb2c89",
            "name": "eni-operator",
            "description": "managing ENI in XXX VPC",
            "type": "component",
            "dependencies": {
                "serviceaccount": "enioperator",
                "clusterrole": "enioperator",
                "clusterrolebinding": "enioperator",
                "configmaps": ["eniconfig"],
                "secrets": ["enioperator"]
            }
        },
        {
            "_id": "42699513a7561e89a5f99881d7b05653a1625c51",
            "name": "Network Service",
            "description": "providing management services for cloud network resources such as VPC / vswitch",
            "type": "cloud service"
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "description": "managing ENI requests"
        },
        {
            "_id": "93f3c21247165f0be3a969fc80f72bc1a402e9f5",
            "source": "eni-operator", "target": "Network Service",
            "description": "accessing Alibaba Cloud ECS OpenAPI to manage VPC / vswitch and other network resources"
        }
    ]
}
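
As a minimal sketch of how such a document might be consumed, the Python snippet below loads an architecture diagram, checks that every edge refers to a declared component, and builds an adjacency list. The function name load_architecture and the trimmed SAMPLE document are illustrative assumptions, not part of the original model; only the "nodes", "edges", "name", "source" and "target" keys come from the structure above.

```python
import json
from collections import defaultdict


def load_architecture(doc: str):
    """Parse an architecture diagram and return (nodes_by_name, adjacency)."""
    graph = json.loads(doc)
    nodes = {node["name"]: node for node in graph["nodes"]}
    adjacency = defaultdict(list)
    for edge in graph["edges"]:
        # Reject edges that point at components missing from the node list.
        for end in (edge["source"], edge["target"]):
            if end not in nodes:
                raise ValueError(
                    f"edge {edge['_id']} references unknown component {end!r}"
                )
        adjacency[edge["source"]].append(edge["target"])
    return nodes, dict(adjacency)


# Trimmed, hypothetical sample that keeps only the keys used above.
SAMPLE = """
{
    "nodes": [
        {"_id": "n1", "name": "kube-apiserver"},
        {"_id": "n2", "name": "eni-operator"},
        {"_id": "n3", "name": "Network Service"}
    ],
    "edges": [
        {"_id": "e1", "source": "eni-operator", "target": "kube-apiserver"},
        {"_id": "e2", "source": "eni-operator", "target": "Network Service"}
    ]
}
"""

nodes, adjacency = load_architecture(SAMPLE)
print(adjacency)
```

Keying nodes by name mirrors the way the edges above refer to components by name rather than by "_id".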

2. Architecture operation diagram


While the cluster is running, the internal state of components and interactions can be inferred from externally observed data such as logs, metrics and traces. Combined with the cluster architecture diagram, this dynamic insight data can be superimposed on the static architecture to give a more intuitive grasp of cluster health, as shown in the following figure:

[Figure: architecture operation diagram]

The numbers in the figure represent insight data, such as the number of anomalies or the request traffic. Besides numbers, insight can also be conveyed by using color to represent health status, line thickness to represent traffic volume, and so on.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "ea4538dc0625d06b0dc93579998e04288656050f",
            "name": "mutatehook",
            "deploy": {
                "type": "K8s:Deployment",
                "namespace": "kube-system",
                "replicas": 3
            },
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "mutatehook",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "fuzzy": "fail OR Fail OR error OR Error"
                        }
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "38bce9ca8a0cec6d8586d96298bd63b0523fc946",
            "source": "eni-operator", "target": "kube-apiserver",
            "insight":[
                {
                    "source": {
                        "vendor": "cloud:aliyun:sls",
                        "log_project": "xxx",
                        "log_store": "xxx",
                        "log_url": "https://sls.console.aliyun.com/lognext/project/xxx"
                    },
                    "signal": {
                        "exception": {
                            "unauthorized": "Unauthorized",
                            "throttling": "'Throttling' OR 'throttling'"
                        }
                    }
                }
            ]
        }
    ]
}
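
To illustrate how the "signal" entries could be turned into the numbers overlaid on the diagram, here is a hedged Python sketch that counts log lines per exception signal. It assumes plain substring matching on the keywords joined by OR, which is a simplification and not the query semantics of the log service itself; parse_or_query, count_exceptions and the sample log lines are hypothetical.

```python
def parse_or_query(query: str) -> list:
    """Split a query such as "'Throttling' OR 'throttling'" into plain keywords."""
    return [term.strip().strip("'\"") for term in query.split(" OR ") if term.strip()]


def count_exceptions(signal: dict, log_lines: list) -> dict:
    """Count, per named exception signal, the log lines that contain any keyword."""
    counts = {}
    for name, query in signal["exception"].items():
        keywords = parse_or_query(query)
        counts[name] = sum(
            any(keyword in line for keyword in keywords) for line in log_lines
        )
    return counts


# Signal definition taken from the edge above; the log lines are made up.
signal = {
    "exception": {
        "unauthorized": "Unauthorized",
        "throttling": "'Throttling' OR 'throttling'",
    }
}
logs = [
    "GET /api/v1/pods 401 Unauthorized",
    "request throttling: retry after 1s",
    "Throttling.User: request was denied",
    "GET /healthz 200 OK",
]
counts = count_exceptions(signal, logs)
print(counts)
```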

3. Resource composition diagram

Resource management is a complex topic. By analyzing the composition of resources in the cluster, the resource composition can also be represented as a graph: nodes represent resources, and edges represent dependency or binding relationships between resources.

It can be described by the following data structure:

{
    "kinds": ["vpc", "vswitch", "securitygroup", "ecs", "clb", "rds", "nat", "eip"],
    "tags": {
        "cluster/product": "xxx",
        "cluster/id": "2736f42d4e882ad6825d6364545a3f1cb5136859",
        "cluster/name": "xxx",
        "cluster/env": "staging"
    },
    "nodes": [
        {
            "kind": "vpc",
            "nodes": [
                {
                    "_id": "c505f21871bac7385c1387988cf226310af0831e",
                    "id": "vpc-xxx",
                    "description": "",
                    "ipv4": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": ""
                    },
                    "url": "https://vpc.console.aliyun.com/vpc/xxx"
                }
            ]
        },
        {
            "kind": "ecs",
            "nodes": [
                {
                    "_id": "47c4fe5cc2585a49f07798a0b8b69cda7f8d4a23",
                    "id": "xxx",
                    "az": "xxx",
                    "interfaces": {
                        "primary": {
                            "ip": "xxx",
                            "eni": "xxx",
                            "mac": "xxx"
                        }
                    },
                    "instance-type-family": "xxx",
                    "instance-type": "xxx",
                    "tags": {
                        "resource/creator": "product",
                        "resource/role": "worker",
                        "node/container-runtime": "xxx",
                        "node/user-networking": "xxx",
                        "node/system-networking": "xxx"
                    },
                    "status": "",
                    "condition": "",
                    "url": "https://ecs.console.aliyun.com/#/server/xxx"
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "a754c748b2723a25c017421dd0969d00df3c000b",
            "source": "vsw-xxx", "target": "vpc-xxx",
            "description": ""
        },
        {
            "_id": "c34b164eba2897cfb2b574a576672d8aa441d709",
            "source": "eip-xxx", "target": "ngw-xxx",
            "description": ""
        }
    ]
}
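
One use of the resource graph is impact analysis. The sketch below is a hypothetical example that assumes each edge means "source depends on or is bound to target" (the original does not fix the direction) and walks the edges in reverse to find everything affected by a given resource. The function name affected_by and the extra "i-xxx" edge are illustrative.

```python
from collections import defaultdict, deque


def affected_by(edges: list, resource_id: str) -> set:
    """Return the ids of resources that directly or transitively bind to resource_id."""
    dependents = defaultdict(set)
    for edge in edges:
        # Assumption: "source" depends on (or is bound to) "target".
        dependents[edge["target"]].add(edge["source"])
    seen, queue = set(), deque([resource_id])
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


edges = [
    {"source": "vsw-xxx", "target": "vpc-xxx"},
    {"source": "i-xxx", "target": "vsw-xxx"},  # hypothetical edge for illustration
    {"source": "eip-xxx", "target": "ngw-xxx"},
]
impacted = affected_by(edges, "vpc-xxx")
print(sorted(impacted))
```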

4. Resource operation diagram


While resources are in use, the internal state of resources and the relationships between them can likewise be inferred from externally observed data such as logs, metrics and events. Combined with the resource composition diagram, dynamic insight data can be superimposed on the static resources to give an intuitive grasp of how cluster resources are being used.

It can be described by the following data structure:

{
    "nodes": [
        {
            "_id": "35103ac62d4ef0a314e2a5128f44c684205bea2f",
            "id": "vpc",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "vpc/exist": "DescribeVpcs",
                        "vswitch/count": "DescribeVSwitches"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/count": "DescribeInstances",
                        "securitygroup/count": "DescribeSecurityGroups"
                    }
                }
            ]
        },
        {
            "_id": "6450e07dc67027f76f29fbfcb841e57200855196",
            "id": "ecs",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "ecs/exist": "DescribeInstances",
                        "ecs/count": "DescribeInstances",
                        "ecs/usage": "DescribeInstanceMonitorData"
                    }
                },
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "auto"
                    },
                    "signal": {
                        "ecs/state_change": ""
                    }
                }
            ]
        }
    ],
    "edges": [
        {
            "_id": "caa1e395c713f47766ca7bcfc20419c0be0f0803",
            "source": "i-xxx", "target": "sg-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:ecs",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeInstances"
                    }
                }
            ]
        },
        {
            "_id": "537dc478d95714792b3694674d6164f72b361bb0",
            "source": "eip-xxx", "target": "ngw-xxx",
            "insight": [
                {
                    "source": {
                        "vendor": "cloud:aliyun:vpc",
                        "type": "OpenAPI"
                    },
                    "signal": {
                        "exist": "DescribeEipAddresses"
                    }
                }
            ]
        }
    ]
}
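
A collector for these signals could be organized roughly as follows. This is only a sketch: it makes no real cloud API calls, and instead takes caller-supplied probe functions keyed by (vendor, action). The names collect_signals and probes are assumptions introduced for illustration.

```python
def collect_signals(node: dict, probes: dict) -> dict:
    """Evaluate every signal of a resource node with caller-supplied probes.

    probes maps (vendor, action) to a zero-argument callable. A signal
    without a registered probe is reported as None.
    """
    readings = {}
    for insight in node["insight"]:
        vendor = insight["source"]["vendor"]
        for signal_name, action in insight["signal"].items():
            probe = probes.get((vendor, action))
            readings[signal_name] = probe() if probe else None
    return readings


node = {
    "id": "vpc",
    "insight": [
        {
            "source": {"vendor": "cloud:aliyun:vpc", "type": "OpenAPI"},
            "signal": {
                "vpc/exist": "DescribeVpcs",
                "vswitch/count": "DescribeVSwitches",
            },
        }
    ],
}
# Stub probe standing in for a real API client; it performs no network call.
probes = {("cloud:aliyun:vpc", "DescribeVpcs"): lambda: True}
readings = collect_signals(node, probes)
print(readings)
```

Injecting the probes keeps the data model independent of any particular SDK, so the same node description can be evaluated against stubs in a drill and against real clients in production.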

Plan


Exceptions in clusters are inevitable and need to be handled safely and effectively.

Exceptions can be characterized by events. Safe and effective operations are those that have been reviewed and rehearsed. Binding exceptions to such operations, so that an exception triggers the operations, yields reviewed and rehearsed plans that can handle cluster exceptions safely and effectively.

1. Event list


Events that require attention are generated while the cluster is running. The event format itself can follow the community CloudEvents standard: https://github.com/cloudevents/spec/blob/v1.0.1/spec.md

It can be described by the following data structure:

{
    "events": [
        {
            "_id": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "description": "restart workload manually",
            "event": {
                "id": "restart-workload",
                "source": "xxx",
                "specversion": "1.0",
                "type": "com.aliyun.trigger.manual",
                "datacontenttype": "application/json",
                "data": "{\"NAMESPACE\": \"\", \"NAME\": \"\", \"TYPE\": \"\"}"
            }
        }
    ]
}
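
As a small sketch of how an entry's "event" could be checked before use, the snippet below verifies the four attributes that CloudEvents 1.0 marks as required (id, source, specversion, type) and decodes the JSON payload. The function name decode_event and the payload values are illustrative assumptions; the original leaves the payload values empty.

```python
import json

# Attributes that the CloudEvents 1.0 specification marks as REQUIRED.
REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def decode_event(event: dict):
    """Check the required CloudEvents attributes and return the decoded payload."""
    missing = [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
    if missing:
        raise ValueError(f"event is missing required attributes: {missing}")
    data = event.get("data")
    # The event list stores the JSON payload as an encoded string, so decode it.
    if event.get("datacontenttype") == "application/json" and isinstance(data, str):
        return json.loads(data)
    return data


event = {
    "id": "restart-workload",
    "source": "xxx",
    "specversion": "1.0",
    "type": "com.aliyun.trigger.manual",
    "datacontenttype": "application/json",
    # Illustrative values; the event list above leaves them empty.
    "data": "{\"NAMESPACE\": \"kube-system\", \"NAME\": \"mutatehook\", \"TYPE\": \"Deployment\"}",
}
payload = decode_event(event)
print(payload)
```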

2. Action list


To reduce the chance of misoperation and to avoid running unreviewed, unverified operations when exceptions occur, it is necessary to define the list of operations that may be performed in the cluster.

It can be described by the following data structure:

{
    "actions": [
        {
            "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
            "name": "Action Restart Workload",
            "exec": "restart-workload",
            "env": [
                "NAMESPACE",
                "NAME",
                "TYPE"
            ]
        }
    ]
}
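
The "env" field can be read as the set of parameters an action requires. The sketch below, under that assumption, validates the parameters and builds the environment for the command without executing anything; prepare_action and the parameter values are hypothetical.

```python
def prepare_action(action: dict, params: dict):
    """Return (command, env) for an action, passing only its declared variables."""
    missing = [name for name in action["env"] if not params.get(name)]
    if missing:
        raise ValueError(
            f"action {action['name']!r} is missing parameters: {missing}"
        )
    # Undeclared parameters are dropped so the action only sees what it declares.
    env = {name: str(params[name]) for name in action["env"]}
    return action["exec"], env


action = {
    "name": "Action Restart Workload",
    "exec": "restart-workload",
    "env": ["NAMESPACE", "NAME", "TYPE"],
}
command, env = prepare_action(
    action,
    {
        "NAMESPACE": "kube-system",
        "NAME": "mutatehook",
        "TYPE": "Deployment",
        "EXTRA": "ignored",
    },
)
print(command, env)
```

Refusing to run when a declared parameter is missing is one way to keep an action within its reviewed and rehearsed scope.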

3. Plan list


Based on the event list and the action list, events can be associated with actions so that exceptions are handled in an event-driven manner; these associations are the contingency plans.

It can be described by the following data structure:

{
    "plans": [
        {
            "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
            "name": "Plan Restart Workload",
            "description": "restarting workload",
            "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
            "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"]
        }
    ]
}
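
Putting the three lists together, an event-driven dispatcher might resolve an incoming event entry to the actions of its plans as sketched below. The function name actions_for_event is an assumption; the ids are the ones used in the examples above.

```python
def actions_for_event(event_entry_id: str, plans: list, actions: list) -> list:
    """Return the action entries of every plan bound to the given event entry."""
    actions_by_id = {action["_id"]: action for action in actions}
    resolved = []
    for plan in plans:
        if plan["event"] != event_entry_id:
            continue
        for action_id in plan["actions"]:
            # A plan may only reference actions that exist in the action list.
            if action_id not in actions_by_id:
                raise KeyError(
                    f"plan {plan['name']!r} references unknown action {action_id}"
                )
            resolved.append(actions_by_id[action_id])
    return resolved


actions = [
    {
        "_id": "47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d",
        "name": "Action Restart Workload",
        "exec": "restart-workload",
    }
]
plans = [
    {
        "_id": "29a091c48d8992991ed69e8694b017a11abe3eec",
        "name": "Plan Restart Workload",
        "event": "a1ab5b61857be35a5c5b203dd84b49248161c823",
        "actions": ["47abc5cd9d64018ebf96dc5b2d6a4fbd35a3cb6d"],
    }
]
matched = actions_for_event("a1ab5b61857be35a5c5b203dd84b49248161c823", plans, actions)
print([action["name"] for action in matched])
```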

Globally visualized stability assurance


Based on the data model of the 4 diagrams and 3 tables above, which forms the Insight + Plan core of cluster stability assurance, a globally visualized stability assurance service can be derived.

Such a service has the following key points:

  • Global perspective
  • Digitization
  • Visualization

This service is built on two principles:

  • People process images far more efficiently than text
  • A global perspective enables “end-to-end understanding of the system”, “accurate location of problems” and “safe handling of problems”

Take the traffic map in daily life as an example:

[Figure: traffic map]

Through a traffic map, we can quickly understand the road layout and key junctions of a region. The conventional red, yellow and green colors intuitively express road congestion. Richer traffic maps also show important events such as road construction and road closures.

In this way, visualization lets us quickly understand the traffic and geography of a region.

The underlying data model is the foundation; visualization makes the value of the data easier to realize.
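
Carried over to a cluster diagram, the red, yellow and green convention could be sketched as a simple mapping from an anomaly count to a color. The thresholds and the function name health_color are purely illustrative assumptions; real thresholds would depend on the component.

```python
def health_color(abnormal_count: int, warn_at: int = 1, critical_at: int = 10) -> str:
    """Map an anomaly count to a traffic-light color for a node or an edge."""
    if abnormal_count >= critical_at:
        return "red"
    if abnormal_count >= warn_at:
        return "yellow"
    return "green"


for count in (0, 3, 10):
    print(count, health_color(count))
```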

An implementation


1) Deployment form

  • Regionalized deployment
  • Serves a single cluster or multiple clusters in the region

2) User experience


Following the best practices of stability assurance, the service is divided into the following columns:

  • Operation link diagram

    • This column is the high-frequency area for daily stability assurance. Through visualization, the occurrence, scope and impact of anomalies can be perceived intuitively, and exceptions can be handled in a GUI-based, visual way
  • Deployment architecture diagram

    • This column is used to understand the deployment architecture of the cluster, and to perceive and handle problems in the deployment dimension
    • Capacity management (including node management, capacity planning, etc.) is carried out in this column
  • Business flow chart

    • This column captures the functional flow charts of the business. On the one hand, it helps the business control functional complexity; on the other hand, it helps the business understand the current state of its functions. Together these support business iteration
    • Business-related data analysis can be placed in this column
  • Data analysis: this column serves two kinds of data needs

    • Business needs

      • View type: SLI information such as cluster size, SLO information such as cluster stability
      • Query type: query statistics by characteristic (for example, query resource requests by label)
    • Stability assurance needs

      • View type: SLI information such as cluster water level (resource utilization), SLO information such as the effect of cluster stability assurance
      • Query type: query statistics by characteristic (for example, query all associated resource information and resource leakage information by label)
  • Observability management

    • This column is used to manage matters related to observability, including:

      • Observation data generation
      • Observation data acquisition
      • Observation data processing
      • Observation data consumption
  • Controllability management

    • This column is used to manage control-related operations, including:

      • Release management
      • Disaster management
      • Plan management
      • Resource management
      • Chaos engineering
      • Security management
      • Regular health checks

During normal system operation:

  • Confirm the coverage and accuracy of the cluster's “observability” and “controllability” through the “data analysis” column
  • In the “observability management” column, manage the observability dimension, including supplementing and governing data sources, monitoring and alarms
  • In the “controllability management” column:

    • Based on problems found in the observation data, carry out plan configuration, issue management, etc.
    • Based on problems found in chaos engineering or drills, carry out plan configuration, etc.
  • In the “operation link diagram” and “deployment architecture diagram”, the configured monitoring, alarms and plans are associated with components or links through visualization

During system abnormality and recovery, in the “operation link diagram”:

  • Perceive the occurrence of anomalies through the cluster operation link diagram or alarms
  • Trigger issue tracking automatically or manually
  • Identify abnormal components, abnormal links and severity through the colors of components and interactions in the cluster operation link diagram
  • Click the anomaly count of a component in the cluster operation link diagram to obtain the associated anomaly details, or jump to the logging and tracing systems for manual queries
  • Determine the plan to execute and the associated components according to the anomaly details or platform prompts
  • Execute the plan in the cluster operation link diagram (to contain the problem or restore service)
  • Confirm the effect of the plan through the colors of components and interactions in the cluster operation link diagram
  • End issue tracking automatically or manually

The main contents recorded during issue tracking include:

  • The issue
  • Time the anomaly occurred
  • Actions performed during exception handling
  • Snapshot of the operation link diagram
  • Time of recovery from the anomaly
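
The recorded items above could be held in a small record type such as the following sketch. The class IssueRecord, its field names and the sample values are assumptions for illustration, not a structure defined by the original.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IssueRecord:
    """One tracked issue, holding the items listed above."""

    issue: str
    occurred_at: datetime
    actions: List[str] = field(default_factory=list)
    link_diagram_snapshot: Optional[dict] = None
    recovered_at: Optional[datetime] = None

    def close(self, recovered_at: datetime) -> timedelta:
        """Mark the issue as recovered and return the time to recovery."""
        if recovered_at < self.occurred_at:
            raise ValueError("recovery cannot precede occurrence")
        self.recovered_at = recovered_at
        return recovered_at - self.occurred_at


record = IssueRecord("mutatehook errors", datetime(2021, 9, 15, 10, 0))
record.actions.append("Plan Restart Workload")
time_to_recover = record.close(datetime(2021, 9, 15, 10, 12))
print(time_to_recover)
```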

Data model and competitiveness analysis


The data model is the medium through which stability assurance best practices are iterated, shared and applied. General-purpose insight and plans can be turned into standardized services, while customized insight and plans can be described with a fixed structure and then executed by a general-purpose controller.

The technical core of the Insight + Plan stability assurance service built on the data model is:

  • Insight model

    • Key questions:

      • How to gain insight into cluster stability?
      • How to gain insight into business iteration efficiency?
  • Data model

    • Key question:

      • How to define valid and extensible data descriptions?

Based on this core technology, we can iterate around the following areas of competitiveness:

  • Insight

    • Global
    • Digitized
    • Visualized
  • Efficiency

    • Shortest operation path
    • Lowest cost of use
  • Advancement

    • Best practices codified as processes

Summary


By specifying the 7 data models (4 diagrams and 3 tables), we can characterize insight + plan through structured descriptions. With this as the core, we can keep iterating on the practice and understanding of stability assurance and accelerate business iteration. Extended further, the model may also feed back into the business and inform its direction of development.

If you are interested, feel free to discuss in the comments.
