An in-depth look at microservice observation | From monitoring to observability: where are we heading?


Author: Liu Haoyang

Reading guide: To help you better understand the design and implementation of the APM system in MSP, we are writing a series of articles on microservice observation that look in depth at the product, architecture design and underlying technology of the APM system. This is the first article in the series, and it mainly shares some of our thoughts on observability.


Erda Cloud is our one-stop developer cloud platform, to be released soon. It provides enterprise development teams with cloud native services such as DevOps (DevOps Platform, DOP), microservice governance (MicroService Platform, MSP), multi-cloud management (Cloud Management Platform, CMP) and fast data management (Fast Data Platform, FDP).

As the core platform in Erda Cloud, MSP provides managed microservice solutions, including an API gateway, a registry, a configuration center, application monitoring and log services, to help users deal with the technical complexity that comes with splitting business systems into microservices. As the product has evolved, we have also redesigned the APM (application performance monitoring) product around service observation, in order to explore best practices for observability in application monitoring.

To help you better understand the design and implementation of the APM system in MSP, we will write a series of articles on microservice governance that go deep into the product, architecture design and underlying technology of the APM system. This is the first article in the series, and it mainly shares some of our thoughts on observability.

From monitoring to observability

With the rise of cloud native concepts and cloud native architecture in recent years, more and more development teams have adopted DevOps for system development and split large systems into small service modules so that they can be deployed more easily in containers. Cloud native capabilities such as DevOps, microservices and containerization help business teams deliver systems quickly, continuously, reliably and at scale. At the same time, they multiply the complexity of the system and bring unprecedented operation and maintenance challenges, such as:

  • Calls between modules change from in-process function calls to calls between processes, and the network is never fully reliable.
  • Service call paths become longer, which makes traffic flow harder to control and troubleshooting more difficult.
  • With the introduction of cloud native systems such as Kubernetes, Docker and service mesh, the infrastructure layer becomes more of a black box for the business development team.

In a traditional monitoring system, we mostly pay attention to metrics such as the CPU, memory and network usage of virtual machines, the interface requests of application services, and resource utilization. In a complex cloud native system, however, watching metrics for a single point or a single dimension is not enough to understand the overall running state of the system. It is in this context that the "observability" of distributed systems emerged. Generally, we believe the biggest change from traditional monitoring to observability is that the data the system has to process has expanded from metrics to a much wider range. Three types of data are commonly regarded as the pillars of observability:

  • Metrics
  • Tracing
  • Logging

Relationship between metrics, tracing and logging

To unify data collection and standards across observability systems and to provide vendor-neutral interfaces, CNCF merged OpenTracing and OpenCensus into the OpenTelemetry project. Through its specification, OpenTelemetry standardizes the data model, collection, processing and export of observability data, but it does not cover how the data is used, stored, displayed or alerted on. The currently recommended official scheme is:

  • Use Prometheus and Grafana to store and display metrics.
  • Use Jaeger to store and display distributed traces.

Thanks to the thriving cloud native open source ecosystem, a technical team can easily build a monitoring system, for example by using Prometheus + Grafana for basic monitoring, SkyWalking or Jaeger for tracing, and ELK or Loki for logging. For users of an observability system, however, this means that different types of observability data are stored in different back ends. Troubleshooting still requires jumping between multiple systems, so neither efficiency nor user experience can be guaranteed. To address the unified storage and analysis of observability data, our self-developed unified storage and query engine provides seamless correlation analysis of metric, trace and log data. In the rest of this article, we describe in detail how we provide observability analysis capabilities for services.

Observation entry point: observability topology

Observability defines how the three kinds of data relate to each other: tags associate metrics with tracing, and the request context links tracing with logs. An interface exception in an online application system can therefore be located as follows: use metrics and alerts to discover the problem, then use tracing to locate the module where the exception may have occurred, and finally use logging to pinpoint the source of the error.
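The correlation described above can be sketched in a few lines of Python. This is only a minimal illustration and not the MSP implementation: the record types (`MetricPoint`, `Span`, `LogLine`), their fields, the tag names (`service`, `interface`) and the sample data are all hypothetical. Metrics and spans are joined on shared tags, while spans and logs are joined on the request context, represented here by a `trace_id`.

```python
from dataclasses import dataclass

# All types, fields and sample data below are hypothetical, for illustration only.

@dataclass
class MetricPoint:
    name: str
    value: float
    tags: dict  # e.g. {"service": "order", "interface": "/pay"}

@dataclass
class Span:
    trace_id: str
    service: str
    interface: str
    error: bool

@dataclass
class LogLine:
    trace_id: str  # request context propagated into the log record
    message: str

def spans_for_metric(metric, spans):
    """Metrics -> tracing: match spans on the tags the metric carries."""
    return [
        s for s in spans
        if s.service == metric.tags.get("service")
        and s.interface == metric.tags.get("interface")
    ]

def logs_for_span(span, logs):
    """Tracing -> logging: match log lines on the shared trace ID."""
    return [line for line in logs if line.trace_id == span.trace_id]

def locate_errors(metric, spans, logs):
    """Follow an alerting metric to its failing spans, then to their logs."""
    messages = []
    for span in spans_for_metric(metric, spans):
        if span.error:
            messages.extend(line.message for line in logs_for_span(span, logs))
    return messages

alert = MetricPoint("http_error_rate", 0.2, {"service": "order", "interface": "/pay"})
spans = [
    Span("t1", "order", "/pay", error=False),
    Span("t2", "order", "/pay", error=True),
    Span("t3", "cart", "/add", error=True),
]
logs = [
    LogLine("t1", "payment accepted"),
    LogLine("t2", "payment gateway timeout"),
    LogLine("t3", "cart item missing"),
]
print(locate_errors(alert, spans, logs))
```

In this sketch only the log line of trace `t2` is returned, because it is the only failing span that carries the same tags as the alerting metric.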

Although this method works most of the time, we do not think it is the best way to observe a system:

  • Although metrics help us discover problems in time, we often end up with a large number of isolated single-point problems and no global view of the state of the whole system.
  • The business development team needs to be familiar with the concepts and usage of the metrics, tracing and logging systems. If the monitoring system is assembled from open source components, a single investigation still requires jumping between several systems. This is very common in many companies today.

While working on the monitoring requirements of users in different fields, we found that the topology is a natural entry point to the observation system. Unlike common distributed tracing platforms, we do not only show the topology as the runtime architecture of the application system. After drawing the topology from the real request relationships captured with 100% sampling, we also surface service request and service instance status on the topology nodes (more observation data, such as traffic ratio and physical node status, will be surfaced in the future).
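As a rough illustration of drawing a topology from real request relationships, the sketch below aggregates every observed call into edges and then rolls request and error counts up to the called node. The `Call` record, the function names and the sample services are assumptions made for this example; they do not describe how MSP actually stores or computes its topology.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record of one observed call between two topology nodes.
@dataclass
class Call:
    source: str  # calling node
    target: str  # called node
    error: bool

def build_topology(calls):
    """Aggregate every observed call into edges with request and error counts."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for call in calls:
        edge = edges[(call.source, call.target)]
        edge["requests"] += 1
        if call.error:
            edge["errors"] += 1
    return dict(edges)

def node_status(edges):
    """Roll edge counts up to the called node, as a node overlay might show."""
    status = defaultdict(lambda: {"requests": 0, "errors": 0})
    for (_, target), counts in edges.items():
        status[target]["requests"] += counts["requests"]
        status[target]["errors"] += counts["errors"]
    return dict(status)

calls = [
    Call("api-gateway", "order", False),
    Call("api-gateway", "order", True),
    Call("order", "mysql", False),
    Call("order", "payment", False),
]
edges = build_topology(calls)
status = node_status(edges)
print(edges[("api-gateway", "order")])
print(status["order"])
```

Because no call is dropped before aggregation, the edge set reflects every dependency that actually received traffic, which is the property that full sampling is meant to provide.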

The topology page is laid out in left and right columns. The status bar on the right displays the key system metrics we need to observe, such as the number of service instances, failed service requests, code exceptions and alert counts. When a topology node is clicked, the status bar detects the node type and displays the corresponding status data. The node types that currently support status display are API gateway, service, external service and middleware.

When you click a service node, the status bar displays the status overview, transaction call overview and QPS line chart of the service.

How to observe services?

Based on the observability topology, we can easily observe the overall state of the system from a global perspective. We also provide a way to drill down from the topology to an individual service in order to locate service faults quickly. When a service exception is found, we provide a link to the service analysis page, which offers observation and analysis in three dimensions: transaction, exception and process.

Take the interface exception mentioned above as an example. Our troubleshooting method is as follows:

  1. On the transaction analysis page, query the interface (*/exception) that triggers the exception.
  2. Click a data point on the request or latency trend graph to bring up the slow and failed transaction traces sampled for that interface.
  3. In the trace list that pops up, view the request call chain details and the logs associated with the request to locate the root cause of the error.
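One way to picture the link between a data point on the trend graph and its traces is to keep the IDs of slow and failed requests alongside each time bucket. The sketch below does this under stated assumptions: the `Request` record, the 60-second bucket size, the 500 ms slow threshold and the sample interfaces are hypothetical and are not taken from the product.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record of one sampled request to an interface.
@dataclass
class Request:
    trace_id: str
    interface: str
    timestamp: int      # seconds
    duration_ms: float
    error: bool

def trend_with_traces(requests, interface, bucket_s=60, slow_ms=500.0):
    """Bucket one interface's requests by time and keep slow/error trace IDs per bucket."""
    buckets = defaultdict(lambda: {"count": 0, "slow": [], "errors": []})
    for req in requests:
        if req.interface != interface:
            continue
        start = req.timestamp - req.timestamp % bucket_s  # start of the time bucket
        bucket = buckets[start]
        bucket["count"] += 1
        if req.error:
            bucket["errors"].append(req.trace_id)
        elif req.duration_ms >= slow_ms:
            bucket["slow"].append(req.trace_id)
    return dict(buckets)

requests = [
    Request("t1", "/pay", 5, 40.0, False),
    Request("t2", "/pay", 30, 900.0, False),
    Request("t3", "/pay", 65, 35.0, True),
    Request("t4", "/add", 70, 20.0, False),
]
trend = trend_with_traces(requests, "/pay")
print(trend[0])
print(trend[60])
```

Each bucket then carries both the value to plot (`count`) and the trace IDs to open when that point is clicked, so the jump from trend graph to call chain needs no second query by time range.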

Failed transaction request found

Automatically associate the calling link of the request

Automatically associate log context for request link

Where are we going?

Because of space limitations, this article does not show many product details. Using the scenario above, we propose a direction for designing observability-oriented APM products: integrate and analyze the different kinds of data at the back end from the perspective of system and service observation, rather than deliberately emphasizing that the system supports separate queries over three kinds of observability data. In terms of product features and interaction logic, the separation of metrics, tracing and logging should be hidden from users as much as possible. In addition, we will continue to explore the possibilities of code-level diagnosis, full call chain analysis and intelligent operation and maintenance in the field of observability.

Welcome to open source

As an open source, one-stop cloud native PaaS platform, Erda provides platform-level capabilities such as DevOps, microservice observation and governance, multi-cloud management and fast data governance. Click the link below to take part in the open source project, and discuss and exchange ideas with other developers as we build the open source community together. You are welcome to follow the project, contribute code and give it a star!