Observability practice in service mesh


Service Mesh Virtual Meetup is a live online series jointly hosted by the ServiceMesher community and CNCF. For this edition we invited four guests from different companies to share service mesh practice from different angles: how to use SkyWalking to observe a service mesh, production practice of service mesh at Momo and Baidu, observability and production practice of service mesh, and the differences between service mesh monitoring and traditional microservice monitoring.

This article is based on the talk given on the evening of May 14 by Ye Zhiyuan, a microservice architect at G7, on the practice of service mesh high availability in enterprise production. Links to the video replay and the slides are included at the end of the article.


When service mesh comes up, people think of microservices and service governance: from Dubbo, to Spring Cloud (which came to the attention of developers in China in 2016 and flourished in 2017), to service mesh (which became widely known in 2018). As the saying goes, each wave of the Yangtze River pushes the one ahead of it, and service mesh is the newest wave. The rise of microservice architecture was a major architectural breakthrough of the Internet era. Service mesh has a real learning cost, and production deployments in China are still few, mostly at cloud vendors and leading companies. As performance and the ecosystem improve, and as the major communities push containerization forward, service mesh is starting to take root in companies large and small, filling the service governance gaps of the container layer and Kubernetes. This time we look at the mainstream observability practices in service mesh from the perspective of someone evaluating technology choices.

Philosophy of observability

Observability is not a new term. Wikipedia defines it as follows: “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” In the cloud native field the term first appeared in 2017, when the cloud native concept was on the rise. Under the cloud native trend, the traditional vocabulary of monitoring no longer captures what this era demands, and “observability” fits much better.

Think back to traditional monitoring. Apart from host monitoring, JVM monitoring, and message queue monitoring at the operations level, how much monitoring was actually planned in advance? Very little! Most of the time, monitoring is added after a failure: while reproducing and fixing the bug during recovery, we add some custom checks so that an alert fires in real time the next time the same situation occurs. Developers can then respond quickly to the alert and keep the loss as small as possible. Traditional monitoring is therefore mostly a matter of mending the fold after the sheep are lost: it is reactive rather than proactive.

Things are different in the container world of the cloud native era. The life cycles of a container and its service are tightly coupled, and containers and Kubernetes provide strong isolation, so troubleshooting is much less convenient than on a traditional physical host or virtual machine. That is why the cloud native era emphasizes observability: like provisions that set out before the troops and horses move, this kind of monitoring has to come first. We need to think in advance about how we want to observe the services inside containers, the topology between services, and the collection of the various metrics. These monitoring capabilities are very important.

There is no clear date for when observability became popular in the cloud native world. The industry generally credits Cindy Sridharan with first proposing it. In fact, Peter Bourgon, an engineer based in Berlin, Germany, had already published an article on observability in February 2017, which makes him the first developer in the industry to discuss it. His well-known blog post “Metrics, tracing, and logging” has been translated into many languages. Observability truly became a standard through the definition of cloud native given by Matt Stine of Pivotal: observability is part of that definition, and so it became a standard theme of the cloud native era.

The three pillars of observability proposed by Peter Bourgon are metrics, tracing, and logging. Together these three dimensions cover almost every observable behavior of an application. By collecting and inspecting data along them, developers can keep track of how the application is running at any time. The three pillars can be understood as follows:

  • Metrics: metrics are aggregated data. QPS, TP99, TP95, and similar figures that we deal with every day all belong to this category. Metrics are closely tied to statistics, and designing them often requires statistical reasoning;
  • Tracing: tracing emerged largely to compensate for the complexity introduced in the SOA era. With the long call chains that come with service orientation, problems are hard to locate from logs alone, so tracing data is more complex in form than metrics. Fortunately, several protocols have emerged in the industry to support a unified implementation of the tracing dimension;
  • Logging: a log entry is triggered by a request or an event and records a snapshot of the application's state at that moment. Printing the log is only the beginning: collecting, storing, and analyzing logs in a unified way is the real challenge. Processing structured and unstructured logs, for example, often requires a high-performance parser and buffer;
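
To make the metrics pillar concrete, here is a minimal Python sketch of how QPS, TP95, and TP99 might be derived from raw request latencies. The nearest-rank percentile method, the function names, and the sample data are illustrative assumptions, not something prescribed by Peter Bourgon's article or by any particular monitoring system.

```python
from typing import Dict, List


def percentile(samples: List[int], p: int) -> int:
    """Nearest-rank percentile for an integer percentage p (1..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms: List[int], window_seconds: int) -> Dict[str, float]:
    """Aggregate one observation window into the usual headline metrics."""
    return {
        "qps": len(latencies_ms) / window_seconds,
        "tp95": percentile(latencies_ms, 95),
        "tp99": percentile(latencies_ms, 99),
    }


# Hypothetical window: 100 requests taking 1 ms .. 100 ms, observed over 10 s.
latencies = list(range(1, 101))
summary = summarize(latencies, window_seconds=10)
```

Real metrics systems usually aggregate into histograms or sketches instead of keeping every sample; sorting raw samples like this is only practical for small windows.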

Peter Bourgon also described the ideal outputs of combining the three pillars, and how much each depends on storage. Because they are aggregated to different degrees, metrics, tracing, and logging depend on storage in increasing order. Interested readers can find more detail in the original article linked at the end of this article.

Peter Bourgon's thinking on the three pillars does not stop there. At GopherCon EU 2018 he also discussed what metrics, tracing, and logging mean in industrial production, this time along four dimensions.

  • CapEx: the up-front cost of collecting the data. Logging is clearly the cheapest, since adding log statements is enough; metrics come second; tracing is the hardest, because even with protocol support many data points still have to be defined to collect the metadata that tracing needs;
  • OpEx: the operating cost, which mostly means storage cost, discussed above;
  • Reaction: how sensitively the data responds to abnormal conditions. Aggregated data shows fluctuations, so metrics are the most sensitive; logging comes second, since anomalies can be found by cleaning and counting logs; tracing contributes little to detection and is mostly used for troubleshooting and fault location;
  • Investigation: the ability to locate faults. This is where tracing is strongest, since the fault can be seen directly on the call chain and located precisely; logging comes second; metrics only reflect fluctuations and help little with fault location;

The CNCF Landscape has an area dedicated to observability solutions for cloud native scenarios, divided into several categories. The figure shows the version as of May 14, 2020, and more excellent solutions will surely appear. Of the 10 projects that have graduated from CNCF so far, three are related to observability, which shows how much importance CNCF attaches to it.

At this point many readers may be curious about observability-related protocols. Several are in circulation, such as OpenTracing, OpenCensus, OpenTelemetry, and OpenMetrics; the first three are currently the more popular ones. The OpenMetrics project is no longer maintained.

OpenTracing is arguably the most widely used distributed tracing protocol; the well-known SkyWalking is based on it. It defines a vendor-neutral and language-neutral tracing API, which makes cross-platform tracing easier to build. It is currently thriving as a CNCF incubating project.

OpenCensus is a protocol proposed by Google for tracing and metrics. It builds on the heritage of Dapper, is also supported by Microsoft, and is currently very popular in the commercial field.

Other protocols, such as W3C Trace Context, are also popular. It goes as far as packing the trace data compactly into request headers, independent of the implementation layer. CNCF perhaps realized that with protocols appearing one after another, every piece of middleware has to do a great deal of compatibility work, which does not help the technology ecosystem as a whole. Hence OpenTelemetry was born. As the name suggests, CNCF intends to take observability “telemetry” all the way. It merges the protocol content of OpenTracing and OpenCensus and aims to unify the collection and processing of observability data in the cloud native era. OpenTelemetry has now entered beta. Most encouragingly, the Java SDK already has a non-intrusive agent, similar to SkyWalking's, built on the Byte Buddy framework, which can currently collect telemetry data automatically from 47 Java libraries. APIs and SDKs are available for Erlang, Go, Java, JavaScript, and Python. In addition, the OpenTelemetry Collector can receive data from OpenTelemetry clients for unified collection and processing. CNCF has put the development of logging-related protocols on hold for now, but a working group is working on standardization in this area.
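
As an illustration of how compact header-based trace context is, the sketch below parses a traceparent header in the version 00 layout of W3C Trace Context: version, trace ID, parent span ID, and flags, separated by hyphens. The class and function names are hypothetical, and the parser deliberately handles only this one layout.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not usable."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are treated as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


context = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Everything a downstream service needs to continue the trace fits in this single short header, which is what makes the format easy to pass through proxies and sidecars.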

Service mesh and observability

Observability is a functional subset of service mesh. Service mesh is one of the most popular technology concepts today. It aims to provide unified service discovery, edge routing, security, traffic control, observability, and other capabilities for large numbers of services running in containers, and it supplements the service governance capabilities of Kubernetes. Service mesh can be seen as an inevitable product of the cloud native container era, and it will have a profound impact on cloud service architecture. Its architectural idea is to treat each unit in which a containerized service runs as a node of the mesh, intercept the traffic of each unit, and handle that traffic uniformly through a single control plane. Every node keeps a connection to the control plane, so the control plane can serve as a bridge between observability solutions and the container environment.

The best-known service mesh technologies include Linkerd, Istio, and Conduit, but to be used in production a technology has to stand up to scrutiny of its performance, its architecture, and the activity of its community.

Linkerd is developed and maintained by Buoyant and is regarded as the first-generation product in the service mesh field. Linkerd 1.x is written in Scala and can run directly on hosts. Scala runs on the JVM, so it consumes a relatively large amount of resources. The maintainers then reworked the project and launched Conduit, a new generation written in Rust (data plane) and Go (control plane), which was merged with Linkerd to become Linkerd 2.x. Overall, Linkerd 2.x performs much better and comes with a visual interface for operations. In China, however, it has never really caught on, and its community there has not grown.

Istio, by contrast, appeared in 2017 and was born with a golden spoon: it was initiated by Google, IBM, and Lyft. Although it arrived a year later than Linkerd, it attracted wide attention and enthusiasm as soon as it launched. Istio is written in Go, fits Kubernetes environments well, integrates Envoy as its data plane, and has a clear division of responsibilities for service governance. It has more production deployments in China than Linkerd.

Istio is still a young open source project, and its component architecture differs considerably between major versions. For example, 1.0 introduced Galley (shown on the left), while 1.5 removed Mixer, merged the control plane into a monolith, and added a WebAssembly (Wasm) extension mechanism (shown on the right). The overall architecture, however, has not changed much: the data plane still focuses on traffic interception and applying forwarding policies, while the control plane still handles telemetry collection, policy distribution, and security. In China, cloud vendors and leading companies are ahead in the use of Istio. Ant Financial, for example, has developed its own Go-based data plane, MOSN, which is compatible with Istio, and has done a great deal of optimization work, setting an example for adopting Istio in China. It is worth studying in depth to see how to build a service mesh architecture better suited to the Internet industry in China.

Mixer was effectively deprecated in version 1.5, is disabled by default, and remains in maintenance until version 1.7. Even so, most production deployments are still on versions 1.0 to 1.4, so without a full upgrade, and with the performance of Wasm still unclear, Mixer remains indispensable. As mentioned above, service mesh is the bridge between the cloud native container environment and observability. Mixer adapters can be seen as the main steel frame of that bridge, and they are highly extensible. Besides checking traffic, the more important job of a Mixer adapter is to collect telemetry data in the precondition check phase and the report phase. The adapter exposes or forwards the telemetry data to the various observability backends, which use it to draw rich traffic traces and event snapshots. Commonly used observability adapters cover commercial solutions such as Datadog and New Relic as well as open source solutions such as Apache SkyWalking, Zipkin, Fluentd, and Prometheus. These are covered below.

A data plane such as Envoy reports logs, traces, metrics, and other data to Mixer. What Envoy reports is raw attribute information. An attribute is a piece of metadata with a name and a type that describes ingress and egress traffic and the environment in which the traffic occurred. Mixer then formats the attributes according to the configured logentry, metric, or tracespan template and hands the result to a Mixer adapter for further processing. Log and trace data, which are large in volume, can instead be reported directly to the processing backend, and Envoy natively supports some specific backends. Different adapters need different attributes; a template defines the schema that maps attributes to the adapter's input data, and one adapter can support multiple templates. Mixer's configuration can be abstracted into three models:

  • Handler: a configured adapter instance;
  • Instance: defines the mapping rules for attribute information;
  • Rule: assigns instances to a handler and defines when the rule is triggered;
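
The relationship between the three models can be sketched in a few lines of Python. This is only a conceptual model under simplifying assumptions: the class shapes, the dispatch function, and the sample attribute names are illustrative and are not Mixer's actual configuration format or API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Attributes = Dict[str, Any]


@dataclass
class Instance:
    """Maps raw attributes to the input shape an adapter expects."""
    name: str
    mapping: Dict[str, str]  # output field -> attribute name
    defaults: Dict[str, Any] = field(default_factory=dict)

    def build(self, attrs: Attributes) -> Dict[str, Any]:
        return {out: attrs.get(src, self.defaults.get(out))
                for out, src in self.mapping.items()}


@dataclass
class Handler:
    """A configured adapter; this toy adapter just records its input."""
    name: str
    received: List[Dict[str, Any]] = field(default_factory=list)

    def handle(self, payload: Dict[str, Any]) -> None:
        self.received.append(payload)


@dataclass
class Rule:
    """Sends the listed instances to a handler when `match` accepts the attributes."""
    match: Callable[[Attributes], bool]
    handler: Handler
    instances: List[Instance]


def dispatch(rules: List[Rule], attrs: Attributes) -> None:
    for rule in rules:
        if rule.match(attrs):
            for instance in rule.instances:
                rule.handler.handle(instance.build(attrs))


# Hypothetical configuration: count HTTP requests per destination.
request_count = Instance(
    name="requestcount",
    mapping={"source": "source.workload.name",
             "destination": "destination.service.name",
             "code": "response.code"},
    defaults={"source": "unknown", "code": 200},
)
metrics_handler = Handler(name="hypothetical-metrics-backend")
rules = [Rule(match=lambda a: a.get("context.protocol") == "http",
              handler=metrics_handler, instances=[request_count])]

dispatch(rules, {"context.protocol": "http",
                 "destination.service.name": "reviews",
                 "response.code": 503})
dispatch(rules, {"context.protocol": "tcp"})  # no rule matches
```

The default value for the missing source attribute is filled in by the instance, which mirrors the idea that defaults live in the mapping and not in the adapter.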

The following figure shows the metric template and the logentry template. Default values can be set in the mapping. For more settings, see the official documentation.

The following figure shows the tracespan template. Readers who know OpenTracing will recognize the mapping: much of it consists of standard OpenTracing values, such as the descriptive fields of a span, http.method, http.status_code, and so on. Interested readers can also consult the OpenTracing standard definitions.
There is also a common problem with tracing in a service mesh. No matter how traffic is intercepted in the data plane, how information is passed through, or how spans are created or inherited, the sidecar cannot link inbound traffic to outbound traffic on its own. To solve this, the service in the main container has to be instrumented to propagate the trace information to its next request. This is unavoidable, although later OpenTelemetry implementations can address the standardization side of it.
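
A minimal sketch of what this propagation looks like inside the application follows. It assumes the mesh uses Zipkin-style B3 headers; the header list and the helper name are assumptions for illustration, so check which headers your mesh and tracer actually expect.

```python
from typing import Dict, Mapping, Optional

# Assumed set of headers to forward (B3 style plus a request ID).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def propagate_trace_headers(
    incoming: Mapping[str, str],
    outgoing: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy trace headers from the inbound request onto an outbound request."""
    result = dict(outgoing or {})
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in incoming.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            result[name] = lowered[name]
    return result


inbound = {"X-B3-TraceId": "abc", "X-B3-SpanId": "def", "Cookie": "secret"}
outbound = propagate_trace_headers(inbound, {"accept": "application/json"})
```

Only the trace headers are copied; unrelated headers such as cookies stay with the inbound request, which is usually what you want when calling another service.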

Istio observability practice

From the Istio Mixer adapters we can see that Istio supports tracing with Apache SkyWalking, Zipkin, and Jaeger. All three support the OpenTracing protocol, so they can all be connected through the tracespan template. The three differ slightly:

  • Zipkin is a veteran tracing middleware. The project was launched in 2012, and recent versions are easy to use;
  • Jaeger is a newer project, launched in 2016 and written in Go. Riding the cloud native wave and dedicated to solving tracing in the cloud native era, it has developed quickly. It is very easy to integrate with Istio and is the solution officially recommended by Istio;
  • SkyWalking is an open source project that started in 2015 and is now flourishing. One difference is that it currently integrates with Istio through an out-of-process adapter, so the integration overhead is slightly higher. Version 8.0 (not yet released) has a solution for this;

SkyWalking is an APM middleware developed and open-sourced by Wu Sheng, and it is something developers in China can be proud of. In the second session of this meetup, Gao Hongtao, one of SkyWalking's core contributors, gave a talk on the observability applications of Apache SkyWalking in service mesh, which interested readers can look up.

SkyWalking provides non-intrusive plug-in SDKs for Java, .NET, Node.js, PHP, and Python, and intrusive SDKs for Go and Lua.

Why can't the Go SDK be non-intrusive? The answer lies in the characteristics of the language. Programming languages are generally divided into compiled languages, interpreted languages, and intermediate languages. Java, for example, is compiled to bytecode, which the JVM then runs. A lot can be done along the way, including modifying the compiled code. Languages such as Python, PHP, JavaScript, and Erlang are translated line by line as they run, so extra code can be added at run time. Go, C, and C++ are compiled languages: by the time compiling and linking are done, the source code has become machine code, which is hard to change at run time. This is why Go cannot use an automatic probe. SkyWalking was started by developers in China, so its user base there is very large and it iterates quickly. Before version 7.0 it supported telemetry and display based on Mixer. Since 8.0 it can collect data from Prometheus or Spring Sleuth. Since 8.0 it also supports Envoy ALS (Access Log Service), but the ALS receiver has to be enabled.
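
The point about interpreted languages can be illustrated with a small Python sketch that wraps an existing method at run time without touching its source. The instrument helper and the InventoryService class are hypothetical, and real agents record far more than a name and a duration.

```python
import functools
import time
from typing import List, Tuple


def instrument(owner: type, method_name: str,
               spans: List[Tuple[str, float]]) -> None:
    """Replace owner.method_name with a wrapper that records (name, seconds)."""
    original = getattr(owner, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Recorded even if the wrapped call raises.
            spans.append((method_name, time.perf_counter() - started))

    setattr(owner, method_name, wrapper)


class InventoryService:
    """Stands in for application code that knows nothing about tracing."""

    def lookup(self, sku: str) -> str:
        return sku.upper()


recorded: List[Tuple[str, float]] = []
instrument(InventoryService, "lookup", recorded)
result = InventoryService().lookup("sku-42")
```

The application class is unchanged, yet every call now produces a timing record. A statically compiled Go binary offers no comparable hook at run time, which is why its SDK has to be called explicitly from application code.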

In our use of SkyWalking, Elasticsearch is the main storage, with some changes: service, endpoint, and instance information is kept in a relational database. The plug-in SDKs are built into the base image. With SkyWalking it is also easy to count calls at the granularity of individual service interfaces.

Another middleware widely used for cloud native tracing is Jaeger, which was open-sourced by Uber and accepted by CNCF, where it is now a graduated project. It supports the OpenTracing protocol, interoperates with other middleware on the market, supports multiple storage backends, and scales flexibly. Envoy supports Jaeger: when a request reaches Envoy, Envoy creates a new span or inherits the existing one to keep the trace continuous. It supports propagating Zipkin's B3 headers as well as the Jaeger and LightStep headers. The following figure shows a trace in Jaeger, where a request can be located precisely by its trace ID.

ELK is the household name among traditional logging solutions and has been a solid choice since Spring Cloud became popular. As the technology has developed, EFK has emerged in recent years. The storage component, Elasticsearch, and the UI, Kibana, have not changed much. But harsh production environments, containers, and resource-sensitive scenarios place ever higher demands on the log collection component, and the popular approach now is to replace Logstash with Fluentd or Filebeat. Here is a brief introduction to the three:

  • Logstash: written in Java and resource-hungry. It is no longer recommended for log collection;
  • Fluentd: the core is written in C and the plug-ins in Ruby. It graduated from CNCF in April 2019. Its resource consumption is very small, usually about 30 MB of memory. It can ship logs to multiple buffers, that is, to multiple receivers, and it is now a common component in container environments;
  • Filebeat: written in Go, but it has a known problem of driving up the host's load average, and its resource consumption is high, about 10 times that of Fluentd. It was widely used on virtual machines before Fluentd came along;

As for logging in Istio, Mixer does provide a Fluentd adapter, but this path is known to be inefficient. It is friendlier for the application to take the raw attribute logs from Envoy, process them, and send them to storage, which saves a lot of resources.

To locate a problem from logs, it is best to tie the logs to the request. That binding needs a specific identifier, which can be a transaction ID or a trace ID, so integrating tracing with logging is a pressing need across the industry. When choosing tracing middleware, be sure to consider how easily the trace ID can be obtained and combined with the logs.
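
One way to bind logs to requests, sketched here with Python's standard logging module, is to keep the trace ID in a context variable and stamp it on every record. How the trace ID reaches the application (for example from an incoming header) is left out, and the logger name, format, and sample ID are assumptions.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("traced-app")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Simulate handling one request, then logging outside of any request.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order created")
current_trace_id.reset(token)
log.info("idle")
```

With the trace ID present in every line, the log backend can be searched by the same identifier that the tracing UI shows.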

Is Fluentd the best solution for collecting and shipping logs?

No. The Fluentd team has released the even lighter Fluent Bit, written in pure C, which uses fewer resources: memory use drops from the MB level of Fluentd to the KB level, making it better suited as a log collector. Fluentd, with nearly a thousand plug-ins, is better suited as the log aggregator that processes and forwards logs after collection. Fluent Bit can run into problems in practice. Earlier versions may not reload configuration dynamically. A workaround is to run a separate process that starts and stops Fluent Bit, watches the configuration for changes, and reloads when a change is detected.
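
The watcher idea could look roughly like the following sketch. To stay independent of any particular Fluent Bit version or command line, the functions that read the configuration and restart the collector are injected as hypothetical callables; a real supervisor would read the file from disk and manage the child process.

```python
import hashlib
from typing import Callable, List, Optional


class ConfigWatcher:
    """Restart the collector whenever the configuration content changes."""

    def __init__(self, read_config: Callable[[], bytes],
                 restart: Callable[[], None]) -> None:
        self._read_config = read_config
        self._restart = restart
        self._last_digest: Optional[str] = None

    def poll(self) -> bool:
        """Check once; return True if a change was seen and a restart triggered."""
        digest = hashlib.sha256(self._read_config()).hexdigest()
        changed = self._last_digest is not None and digest != self._last_digest
        self._last_digest = digest
        if changed:
            self._restart()
        return changed


# Hypothetical stand-ins for the config file and the collector process.
state = {"config": b"[INPUT]\n    Name tail\n"}
restarts: List[str] = []
watcher = ConfigWatcher(lambda: state["config"],
                        lambda: restarts.append("restarted"))

first = watcher.poll()   # first look only records the baseline
second = watcher.poll()  # unchanged content, nothing happens
state["config"] = b"[INPUT]\n    Name systemd\n"
third = watcher.poll()   # content changed, restart is triggered
```

In a real deployment poll would run on a timer or be driven by file system notifications; comparing content hashes avoids restarts when the file is rewritten without changes.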

As for Loki in the figure above, its core idea is stated in the project's tagline, “like Prometheus, but for logs”: a log aggregation solution in the style of Prometheus. It was open-sourced in December 2018 and in just two years has collected nearly 10,000 stars. It is developed by the Grafana team, which shows Grafana's ambition to unify cloud native observability.

In the cloud native era it no longer seems appropriate to write huge volumes of raw logs straight into expensive storage, whether a costly full-text index such as Elasticsearch or a column store such as HBase. Since 99% of raw logs are never queried, logs should be merged, compressed with gzip, and tagged with labels. This is more in line with the principle of lean operation in the cloud native era.

Loki stores large volumes of logs in cheap object storage, labels them, and merges them into daily log streams, which lets us retrieve the matching log entries quickly. Note, however, that it is unwise to use Loki as a replacement for EFK: the two target different scenarios and differ in their data integrity guarantees and retrieval capabilities.
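
The label-and-compress idea can be illustrated with a toy model. This is not Loki's storage format or API: it simply groups lines by label set, gzips each group, and only decompresses groups whose labels match the query. All names and sample data are made up for illustration.

```python
import gzip
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Labels = FrozenSet[Tuple[str, str]]


def build_chunks(entries: Iterable[Tuple[Dict[str, str], str]]) -> Dict[Labels, bytes]:
    """Group log lines into one stream per label set and gzip each stream."""
    streams: Dict[Labels, List[str]] = defaultdict(list)
    for labels, line in entries:
        streams[frozenset(labels.items())].append(line)
    return {key: gzip.compress("\n".join(lines).encode("utf-8"))
            for key, lines in streams.items()}


def query(chunks: Dict[Labels, bytes], selector: Dict[str, str],
          needle: str) -> List[str]:
    """Select streams by label first, then scan only those for the text."""
    wanted = set(selector.items())
    hits: List[str] = []
    for key, blob in chunks.items():
        if wanted <= key:  # every selector label is present on the stream
            for line in gzip.decompress(blob).decode("utf-8").split("\n"):
                if needle in line:
                    hits.append(line)
    return hits


entries = [
    ({"app": "cart", "level": "error"}, "timeout calling inventory"),
    ({"app": "cart", "level": "info"}, "cart loaded"),
    ({"app": "cart", "level": "error"}, "timeout calling payment"),
    ({"app": "search", "level": "error"}, "timeout calling index"),
]
chunks = build_chunks(entries)
cart_errors = query(chunks, {"app": "cart", "level": "error"}, "timeout")
```

Only the labels are indexed; the log text itself is scanned after decompression. That trade-off keeps storage cheap but is also why such a design does not replace a full-text index.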

Since it appeared, Prometheus has held the leading position in metrics monitoring and is probably the most widely used open source system monitoring and alerting platform today. With the rise of container technology centered on Kubernetes, its powerful multi-dimensional data model, efficient data collection, flexible query syntax, extensibility, and ease of integration, and above all its fit with the cloud native ecosystem, have brought it ever wider adoption.

Prometheus was officially released in 2015, joined CNCF in 2016, and in 2018 became the second project to graduate from CNCF (the first was Kubernetes, which says something about its influence). Envoy currently supports the statsd protocol over TCP and UDP. Envoy first pushes metrics to statsd, Prometheus then pulls the metrics from statsd, and Grafana visualizes them. Alternatively, a Mixer adapter can receive and process the telemetry data for Prometheus to scrape.
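
To show what the push side of this pipeline carries, the sketch below parses statsd-style lines (name:value|type) and renders counters and gauges as simple name and value text lines of the kind a scraper could pull. It is a simplified stand-in for a real statsd bridge: timers, sample rates, and tags are ignored, and the metric names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_statsd_line(line: str) -> Tuple[str, float, str]:
    """Split 'name:value|type' into its three parts."""
    name, _, rest = line.strip().partition(":")
    parts = rest.split("|")
    if not name or len(parts) < 2:
        raise ValueError(f"malformed statsd line: {line!r}")
    return name, float(parts[0]), parts[1]


def aggregate(lines: Iterable[str]) -> Dict[str, float]:
    """Sum counters ('c') and keep the latest value of gauges ('g')."""
    totals: Dict[str, float] = {}
    for line in lines:
        name, value, kind = parse_statsd_line(line)
        key = name.replace(".", "_").replace("-", "_")
        if kind == "c":
            totals[key] = totals.get(key, 0.0) + value
        elif kind == "g":
            totals[key] = value
        # Other types (for example timers) are skipped in this sketch.
    return totals


def render(totals: Dict[str, float]) -> str:
    """One 'name value' line per metric, sorted by name."""
    return "".join(f"{name} {value}\n" for name, value in sorted(totals.items()))


pushed = [
    "envoy.cluster.upstream_rq_total:3|c",
    "envoy.cluster.upstream_rq_total:2|c",
    "envoy.server.live:1|g",
]
exposition = render(aggregate(pushed))
```

The bridge turns many small pushed updates into a current total per metric, which is the shape a pull-based system expects when it scrapes.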

Real-world use of Prometheus has its problems. For example, when a pod is killed and a new one has to be started, Prometheus data is lost. A highly available solution with persistent Prometheus data is therefore needed. Thanos, a CNCF sandbox project, addresses this; its core idea is similar to database sharding. It has two architecture modes, sidecar and receiver. The architecture diagram shows the official sidecar mode that we currently use; the receiver is a component that has not been fully released yet. The sidecar mode is relatively mature, more efficient, and easier to scale.

Among service mesh solutions, Linkerd and Conduit come with visual interfaces. Istio is comparatively a black box and has been criticized for it. However, the Istio community and Kiali have jointly launched a visualization solution that provides the following functions:

  • Topology: service topology;
  • Health: visual health checks;
  • Metrics: metrics visualization;
  • Tracing: distributed tracing visualization;
  • Validations: configuration validation;
  • Wizard: routing configuration;
  • Configuration: viewing and editing CRD resources;

The following figure shows Kiali's architecture, which clearly separates front end and back end. It obtains metric data from Prometheus or from cluster-specific APIs, and it also integrates the Jaeger tracing UI and Grafana dashboards. These are not available out of the box, however: the third-party components that Kiali depends on have to be deployed separately.


At many small and medium-sized companies, service mesh is still at the evaluation stage. Many factors have to be weighed before actually adopting it, and getting the best return on investment is something everyone making the technology choice must consider. Whatever the state of adoption, following the cloud native philosophy of observability and building observability in while adopting the mesh solves many problems at once and avoids spending resources on things that do not matter. In terms of the three pillars of observability and the observability support in service mesh, the summary is as follows:

  • Metrics: using Prometheus sensibly and getting persistence and high availability right are the keys;
  • Tracing: the key is choosing suitable tracing middleware, weighing its fit, its integration with logging, its storage, and its display;
  • Logging: be clear about which scenarios need raw logs and which need aggregated logs;

About the speaker

Ye Zhiyuan is a microservice architect at G7, co-founder of the Spring Cloud China community, a member of the ServiceMesher community, author of Redefining Spring Cloud, an early practitioner of microservices in China, and a follower of cloud native technology.

Video replay and slides download

Video replay: https://www.bilibili.com/video/BV13K4y1t7Co
Slides download: https://github.com/servicemesher/meetup-slides/tree/master/2020/05/virtual

References

  • Metrics, tracing, and logging – Peter Bourgon
  • Go for Industrial Programming – Peter Bourgon
  • CNCF Landscape
  • Exploring Istio telemetry and observability – Marton Sereg
  • Istio Service Mesh Observability with Kiali – Gokul Chandra
  • MOSN: https://github.com/mosn/mosn
