Six typical problems in building a K8s log system: how many have you encountered?


Introduction: As Kubernetes continues to evolve, developers building log systems on it run into increasingly complex problems and challenges. In this article, the author draws on many years of experience to analyze the difficulties of building a K8s log system, in the hope of providing a useful reference for readers.

In the past few years, more and more people have asked us how to build a log system for Kubernetes, or how to solve the problems they ran into along the way. It is better to teach someone to fish than to give them a fish, so we want to share the experience we have accumulated over these years as a series of articles, so that readers can avoid detours. This is planned as a long-running series. The content leans toward practice and experience sharing, and it will be updated from time to time as the technology evolves.


I first heard the name Kubernetes in 2016, when it was still in a three-way contest with Docker Swarm and Mesos. Kubernetes emerged from this competition thanks to a set of advantages (extensibility, a declarative interface, cloud friendliness) and eventually gained the dominant position.

Kubernetes, the most central project of CNCF bar none, is the foundation of cloud native. Alibaba is currently carrying out a site-wide cloud native transformation based on Kubernetes. In 1-2 years, 100% of Alibaba's business will run on the public cloud.

The core of CNCF's definition of cloud native is: in public cloud, private cloud, hybrid cloud, and other environments, use containers, service meshes, microservices, immutable infrastructure, and declarative APIs to build and run elastic, scalable application systems that are fault tolerant, easy to manage, observable, and loosely coupled. Observability is an essential part of an application system. One of the design principles of cloud native is diagnosability-oriented design, which covers cluster-level logs, metrics, and traces.

Why we need a log system

Generally, locating an online problem proceeds as follows: discover the problem through metrics, locate the faulty module through traces, and pinpoint the cause through that module's logs. Logs contain errors, key variables, code execution paths, and other information that is at the core of troubleshooting, so logs are always an essential step in online troubleshooting.
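
The last step of that flow, going from a trace to the logs of the suspect module, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used at Alibaba: the LogRecord fields (trace_id, module, level, message) and the function logs_for_trace are hypothetical names, and the sketch assumes every log line already carries the trace ID of the request that produced it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LogRecord:
    """One parsed log line (hypothetical schema, for illustration only)."""
    trace_id: str  # ID of the request, assumed to be propagated by tracing
    module: str    # service or module that wrote the line
    level: str     # e.g. "INFO" or "ERROR"
    message: str


def logs_for_trace(
    records: Iterable[LogRecord],
    trace_id: str,
    module: Optional[str] = None,
) -> List[LogRecord]:
    """Return the records of one request, optionally limited to one module."""
    return [
        r for r in records
        if r.trace_id == trace_id and (module is None or r.module == module)
    ]
```

Given the trace ID of a failing request and the module that the trace points to, logs_for_trace(records, trace_id, module) narrows the search to the few lines that matter. In a real log platform this filter would be an indexed query instead of a scan over a list.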

Over more than ten years at Alibaba, the log system has evolved along with the forms of computing, roughly in three main stages:

In the stand-alone era, almost all applications were deployed on a single machine. When service load increased, the only option was to switch to higher-specification IBM minicomputers. As part of the application system, logs were mainly used for debugging programs, usually analyzed with grep and other common Linux text commands;
As the stand-alone system became the bottleneck restricting Alibaba's business growth, the Feitian project was started in order to truly scale out, and in 2013 the Feitian 5K project was officially launched. At this stage, every business began its distributed transformation, and calls between services changed from local to distributed. To better manage, debug, and analyze distributed applications, we developed a distributed tracing system and a variety of monitoring systems. The common feature of these systems is the centralized storage of all logs (including metrics);
To support faster development and iteration, in recent years we have started the containerization transformation and begun to embrace the Kubernetes ecosystem, move all business to the cloud, adopt serverless, and so on. At this stage, both the volume and the variety of logs have grown explosively, and the demand for digital and intelligent analysis of logs keeps increasing, so a unified log platform emerged.

The ultimate interpretation of observability

In CNCF's definition, the main role of observability is problem diagnosis. At the level of the company as a whole, observability covers not only DevOps but also business, operations, BI, auditing, security, and other fields. The ultimate goal of observability is to make every aspect of the company digital and intelligent.

At Alibaba, almost every business role works with some kind of log data. To support these application scenarios, we have developed many tools and functions: real-time log analysis, distributed tracing, monitoring, data processing, stream computing, offline computing, BI systems, audit systems, and so on. The log system focuses mainly on real-time data collection, cleaning, intelligent analysis, and monitoring, as well as integration with the various stream computing and offline systems.

Difficulties in building a Kubernetes log system

There are many relatively mature solutions for a log system on its own, so we will not go into them here. This time we focus only on building a log system on Kubernetes. Logging on Kubernetes differs considerably from our earlier solutions for physical machines and virtual machines, for example:

The forms of logs become more complex. Besides logs on physical machines and virtual machines, we also need to collect container standard output, files inside containers, container events, Kubernetes events, and other information;
The environment becomes more dynamic. In Kubernetes, machines going down, going offline or online, pods being destroyed, and scaling out or in are all normal. Logs therefore exist only briefly (for example, a pod's logs are no longer visible once the pod is destroyed), so log data must be collected to the server side in real time. At the same time, log collection itself has to adapt to this highly dynamic environment;
There are more kinds of logs. In a typical Kubernetes architecture, a request from the client passes through multiple components such as CDN, ingress, service mesh, and pod, and involves a variety of infrastructure. This adds many more types of logs, such as logs from the K8s system components, audit logs, service mesh logs, and ingress logs;
The business architecture changes. More and more companies are adopting microservice architectures on Kubernetes. In a microservice system, service development is more complex, and services depend more and more on each other and on the underlying products. Troubleshooting becomes more complex as a result, and correlating logs across all these dimensions is a difficult problem;
The log solution is hard to integrate. We usually build a CI/CD system on Kubernetes that integrates and deploys business applications as automatically as possible. Log collection, storage, and cleaning also need to be integrated into this system and kept as consistent as possible with the declarative deployment style of K8s. However, existing log systems are usually relatively independent, and integrating them into CI/CD is costly;
Scale becomes a problem. In the early stage, we usually choose to build our own log system from open-source components. This works well in the test and verification stage or in the early days of a company. However, as the business grows and the log volume reaches a certain scale, a self-built open-source system runs into various problems, such as tenant isolation, query latency, data reliability, and system availability. Although the log system is not the most critical path in IT, the impact is severe if these problems occur at a critical moment. For example, an emergency occurs during a major sales promotion, several engineers query the log system at the same time while troubleshooting and bring it down, and as a result recovery takes longer and the impact of the failure spreads.
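
As a small illustration of the first two points, the following Python sketch shows one way a collector could recover pod metadata for a container's standard output, so that a log line can still be attributed after its pod is gone. It assumes the common kubelet convention of exposing container logs under /var/log/containers/ with file names of the form `<pod>_<namespace>_<container>-<container id>.log`. The names ContainerLogMeta and parse_container_log_path are hypothetical and are not taken from the system described in this article.

```python
import os
from typing import NamedTuple


class ContainerLogMeta(NamedTuple):
    pod: str
    namespace: str
    container: str
    container_id: str


def parse_container_log_path(path: str) -> ContainerLogMeta:
    """Split a kubelet-style log file name into its metadata fields.

    Assumed layout: <pod>_<namespace>_<container>-<container_id>.log
    """
    name = os.path.basename(path)
    if not name.endswith(".log"):
        raise ValueError(f"not a container log file: {path}")
    stem = name[: -len(".log")]

    # Pod, namespace, and container names are assumed not to contain "_",
    # so exactly three "_"-separated parts are expected.
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected log file name: {name}")
    pod, namespace, rest = parts

    # The container ID follows the last "-"; container names may contain "-".
    container, sep, container_id = rest.rpartition("-")
    if not sep or not container or not container_id:
        raise ValueError(f"unexpected log file name: {name}")
    return ContainerLogMeta(pod, namespace, container, container_id)
```

For a hypothetical file /var/log/containers/web-7d4b9_default_nginx-0123abcd.log, the function returns pod web-7d4b9, namespace default, container nginx, and container ID 0123abcd. A production collector would normally also query the Kubernetes API for labels and annotations, which a file name cannot carry.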
I believe anyone who has worked on building a K8s log system will relate to the difficulties analyzed above. In later articles, we will describe in detail, from a practical point of view, how the K8s log system is built at Alibaba. Stay tuned.
Author: Yuanyi, head of the data collection client of Alibaba Cloud Log Service
This is original content from the Yunqi Community and may not be reproduced without permission.