Author | Liu Haoyang
Source | Erda official account
APM stands for application performance management. Vendors put forward the concept of performance management as early as the mid-1990s, so the field has now been developing for nearly 25 years.
Broadly speaking, APM technology has gone through three stages. We can review that history through “The Past Two Decades of APM (Application Performance Management)”, shared in 2014 by He Xiaoyang, the founder of OneAPM.
From 1995 to 2000, the first Internet wave rose, and Yahoo, as the representative Internet company of the time, led the trend of a generation. Americans were busy laying optical fiber and network cables, and websites went up one after another. If a website’s response speed determines the user experience, then at that time network speed determined the website’s response speed. The job of software in the APM 1.0 era was therefore simple: managing the performance of the network.
Readers of the book “Top of the Tide” will have some impression of the period around 2000. Sun Microsystems was at its peak, with a market value of nearly $200 billion, and companies were frantically building data centers and buying all kinds of hardware and software. We use a professional term for these: infrastructure. By then APM had reached its second generation, which monitored and managed the performance of these infrastructure components.
After 2005, with the rise of application providers such as Facebook and Twitter, more and more applications served customers around the world, and the services a user accessed might be distributed across multiple data centers worldwide. Especially after 2010, the rise of mobile access made everyone’s daily life depend even more closely on applications. At this point, the performance of the application itself increasingly became the bottleneck for improving user experience. This is where third-generation APM software comes in: first, it manages the real user experience; second, it performs end-to-end performance analysis of business transactions.
As this history shows, APM focused for a long time on user experience and application performance. The rise of cloud computing in recent years, together with the new paradigm advocated by cloud native, poses new challenges to traditional R&D and operations. Microservices and DevOps make R&D more efficient, but troubleshooting and fault location across a large number of microservices become more difficult. The maturing of containerization and container orchestration technologies such as Kubernetes makes large-scale software delivery easy, but the challenge is how to evaluate capacity and schedule resources more accurately to achieve the best balance between cost and stability.
From monitoring to observability
Cindy Sridharan, an engineer at Apple, first brought the word observability to developers’ attention in the blog post “Monitoring and Observability”. Before that, however, Google’s well-known SRE practice had already laid the theoretical foundation of observability: before the concepts of microservices and observability appeared, practitioners simply called this body of theory monitoring. Google SRE placed particular emphasis on white-box monitoring and relegated black-box monitoring, which was the common practice in the industry at the time, to a secondary position. White-box monitoring is in line with the “proactive” idea in observability.
Baron Schwartz offers this definition: “Monitoring tells us which parts of the system don’t work. Observability tells us why they don’t work.”
Observability is thus a set of concepts for monitoring, diagnosing and analyzing stability and performance in cloud native systems. Compared with monitoring, observability expands from a single kind of measurement to three pillars, namely metrics, tracing and logging:
- Logging records the events and log output produced while an application runs. It can describe the running state of the system in detail, but storing and querying logs consumes a lot of resources, so filters are often used to reduce the amount of data.
- Metrics are aggregated values that take little storage space. They show the state and trend of the system, but they lack the detail needed to locate a problem. Multidimensional data structures such as contour indicators are used to make metrics more expressive. For example, statistics such as the accuracy, success rate and traffic of a service’s TBS are typical metrics for a single indicator or a database.
- Tracing is request oriented and makes it easy to analyze anomalies within a request, but like logging it consumes a lot of resources, so sampling is needed to reduce the amount of data. Its scope is a single request: any call initiated from a browser or mobile phone is a process that needs to be tracked.
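As a rough illustration of the three pillars above, the following Python sketch shows one possible form of each data-reduction technique: a log filter, an aggregated metric, and trace sampling. It uses only the standard library. The names `DropHealthChecks`, `RequestMetrics` and `should_sample`, and the `/health` path, are hypothetical and are not taken from Erda or any specific monitoring product.

```python
import logging
import zlib
from collections import defaultdict


class DropHealthChecks(logging.Filter):
    """Logging: drop low-value records to reduce stored log volume."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record (assumed noisy endpoint: /health).
        return "/health" not in record.getMessage()


class RequestMetrics:
    """Metrics: keep only aggregated counters per service, not every event."""

    def __init__(self) -> None:
        self.total = defaultdict(int)
        self.errors = defaultdict(int)

    def observe(self, service: str, ok: bool) -> None:
        self.total[service] += 1
        if not ok:
            self.errors[service] += 1

    def success_rate(self, service: str) -> float:
        total = self.total[service]
        if total == 0:
            return 1.0  # assumption: no traffic is reported as fully healthy
        return (total - self.errors[service]) / total


def should_sample(trace_id: str, rate: float) -> bool:
    """Tracing: keep roughly `rate` of all traces, decided per trace ID.

    Hashing the trace ID makes the decision deterministic, so every
    service that sees the same ID keeps or drops the same trace.
    """
    bucket = zlib.crc32(trace_id.encode("utf-8")) / 2**32  # value in [0, 1)
    return bucket < rate
```

The filter could be attached with `logging.getLogger().addFilter(DropHealthChecks())`. Deciding the sample from a hash of the trace ID, rather than from a random number, is one common design choice because it keeps a trace complete across services.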
Erda microservice observation platform
As mentioned above, observability provides a set of concepts for monitoring and diagnosing cloud native application systems. To that end, CNCF launched the OpenTelemetry project, hoping to unify the specification and the collection implementation for the three kinds of observability data. In the real world, however, we care more about how the collected data is stored and used. The application monitoring subsystem in Erda MSP (MicroService Platform) is therefore gradually evolving into a microservice observation platform with “observation, analysis and diagnosis” at its core:
- Observation: observe the running status and monitoring indicators of the service itself.
- Analysis: correlate, aggregate and process the observed data.
- Diagnosis: describe the direct cause of a system abnormality based on the analysis of the observed data.
Erda MSP currently collects hundreds of indicators and statuses, covering infrastructure, business systems and end-to-end applications.
Built-in observation views
Erda also provides default observation views for the common scenarios and indicators in monitoring and operations:
- Observation of cloud cluster status and resource utilization
- Observation of cluster node indicators
- Observation of service requests and latency
Slow / error transaction analysis
For slow requests and error requests in the business system, we correlate logs, traces and metrics so that users can easily locate the exception context of a request:
- Error request retrieval
- Top error requests and slow requests
- Analysis of slow requests and error requests
- Exception RIH associated with trace and log
Erda MSP lets users build their own analysis scenarios with custom dashboards. For details on using the dashboard, refer to the article “Only after I got started did I know that this dashboard system is really cool to use”.
For log data, Erda supports full-text retrieval and structured label retrieval, and can correlate logs with call chains in one click.
Log-correlated trace analysis
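The idea of correlating logs with call chains can be sketched in Python as follows, assuming that every span and every log line carries a shared trace ID. The `Span` and `LogLine` types and both functions are hypothetical illustrations, not Erda’s data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


@dataclass
class LogLine:
    trace_id: str
    level: str
    message: str


def slow_or_failed_traces(spans, threshold_ms):
    """Return the IDs of traces with an error span or a span slower than the threshold."""
    return {s.trace_id for s in spans if s.error or s.duration_ms > threshold_ms}


def logs_by_trace(logs, trace_ids):
    """Group log lines under the trace they belong to, keeping only the given traces."""
    grouped = defaultdict(list)
    for line in logs:
        if line.trace_id in trace_ids:
            grouped[line.trace_id].append(line)
    return dict(grouped)


# Example data (made up for illustration).
spans = [
    Span("t1", "GET /order", 1200.0),
    Span("t2", "GET /cart", 30.0),
    Span("t3", "POST /pay", 50.0, error=True),
]
logs = [
    LogLine("t1", "WARN", "db slow"),
    LogLine("t2", "INFO", "ok"),
    LogLine("t3", "ERROR", "timeout"),
]

suspect_ids = slow_or_failed_traces(spans, threshold_ms=1000.0)
context = logs_by_trace(logs, suspect_ids)
```

Here `context` holds only the log lines for the slow trace `t1` and the failed trace `t3`, which is the kind of exception context a user would want next to the trace view.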
Closing notes
As an open-source, one-stop cloud native PaaS platform, Erda provides platform-level capabilities such as DevOps, microservice observation and governance, multi-cloud management and fast data governance. Follow the links below to take part in the open source project, talk with other developers, and help build the community. You are welcome to follow the project, contribute code and give it a star!
- Erda GitHub address: https://github.com/erda-project/erda
- Erda Cloud official website: https://www.erda.cloud/