Observability design for SaaS applications


(Zhou Jianghua)

Senior business architect, NetEase Smart Enterprise


Distributed systems are far more complex than traditional software, so they are much harder to operate and maintain, and the difficulty grows rapidly as the number of distributed nodes increases.

When the system fails, finding the problem among hundreds of application nodes is a great challenge for developers. When the indicators of many nodes are abnormal at the same time, it is often hard to tell cause from effect.

SaaS applications, for their part, serve enterprises, which means extremely high requirements for service reliability. First, when a fault occurs, we need to locate the problem quickly, eliminate the fault and restore the service. More importantly, we need a clear view of the system's operating status at all times, so that we can spot a deteriorating trend early and eliminate faults before they occur.

The most basic and important prerequisite for this is that the system is highly observable.

What is observability

When we drive a car, the speedometer, tachometer, fuel gauge and other instruments on the dashboard show the car's current basic operating state.

A yellow indicator light means that the corresponding part of the vehicle has a potential problem and needs to be checked, but the basic operation and safety functions of the vehicle are not affected. A red indicator light is a serious warning: the vehicle should be sent for inspection and maintenance immediately, otherwise its core components may be damaged or a serious safety hazard may lead to an accident.

The dashboard, together with the sensors and signal transmission system connected to it, is the most basic and intuitive example of observability in a car. If you need a deeper understanding of the state of the vehicle's parts, you can connect to its OBD interface to obtain more current and historical status data. On top of OBD data, many mobile apps have been built that make it simpler to observe the vehicle state, record more historical data and analyze driving-habit trends, which greatly extends the observability of the vehicle.

In IT, observability matters even more. The CNCF (Cloud Native Computing Foundation) definition of cloud native includes observability as an important feature. Following the car example, observability can be summarized simply: the internal state of the system can be collected, analyzed and processed, and the resulting indicator data is reasonably aggregated, summarized and displayed, so that people can understand what is happening in the system in the shortest possible time.

With an observable system, operations and R&D staff can see at a glance whether the overall operating state of the system is healthy, and can easily drill into every detail of its operation. During normal operation, the observation system can evaluate system load and provide suggestions for operations. When a failure occurs, it helps locate and repair the problem quickly. In the general trend toward automated and intelligent operations, observability is the most basic link.

Building a complete observation system

The observability of a system is mainly provided by a complete monitoring system. There are many open source distributed monitoring systems, such as Prometheus, Zabbix, Nagios and CAT, and Prometheus has become a de facto interface standard. Large companies usually build their own monitoring systems, both for flexibility and customization and because they have strong operations development capability. The components of any monitoring system are basically the same. The following is the architecture diagram of Prometheus:

(Figure: Prometheus architecture, from the official Prometheus website)

As the figure shows, an observable system must have the following components:

1. Service state awareness component

This component collects service status information on each node of the system and provides raw data across a comprehensive set of dimensions. Because the volume of collected data is huge, the component is deployed directly on the nodes where the system runs, and various techniques are needed to avoid affecting normal operation. The sensing component collects status data in a variety of ways and then outputs structured data through a standard interface. Common collection methods include:

  • Standalone monitoring tools, such as sar, top and dstat, for monitoring the operating state of the system
  • Bytecode injection
  • Structured logs
  • Behavior event tracking points

2. Status data collection and storage

This component is the core of the whole observation system. It receives the reported data and stores it efficiently. Different data and different analysis methods call for an appropriate storage format and storage medium. Time series databases such as Prometheus and InfluxDB are most commonly used to store monitoring data. For data collectors, several measurement types are provided for reporting structured data, including:

  • Gauges: an instantaneous value, used for simple measurements such as memory usage and thread count.
  • Counters: a cumulative count, used for statistics such as the number of requests and errors.
  • Histograms: used where the mean, variance and quantiles need to be calculated, such as average response time and the 95th percentile response time.
  • Meters: a TPS counter, used for rate statistics such as 1-minute and 5-minute mean rates.
  • Timers: used to measure latency, such as request latency and disk read latency.
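
The first three measurement types can be illustrated with a minimal, library-free sketch. The class and method names here are hypothetical and only mirror the concepts in the list; a real collector library has its own API and usually stores bucketed or sampled data rather than every raw value.

```python
import math


class Counter:
    """Cumulative count that only increases, e.g. total requests or errors."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that can go up or down, e.g. thread count."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Keeps raw samples so that the mean and quantiles can be computed."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples)

    def quantile(self, q):
        # Nearest-rank method: the smallest sample such that at least
        # a fraction q of all samples are less than or equal to it.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


requests = Counter()
threads = Gauge()
latency_ms = Histogram()

for ms in [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]:
    requests.inc()
    latency_ms.observe(ms)
threads.set(42)

print(requests.value, threads.value, latency_ms.mean(), latency_ms.quantile(0.95))
```

With these ten samples the single slow request barely moves the median but dominates the 95th percentile, which is why quantiles are usually watched alongside the mean.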

3. Visual display

The visual interface largely determines whether the observation system produces value. The visualization system must support flexible configuration and composition, be easy to use and display information intuitively. Grafana is the most widely used open source visual monitoring tool.

4. Alarm

Alarming is one of the core values of the whole monitoring system. When the system is abnormal or may become abnormal, the alarm system notifies the relevant parties promptly through email, IM, SMS, telephone and other channels, so that operations and R&D staff can intervene.

(example of IM channel alarm)

There are two types of alarm configuration: status event alarms and trend alarms.

Status events are abnormal events that have already occurred, such as:

  • CPU utilization exceeds 95%
  • Disk space usage exceeds 90%
  • The JVM performs full GC continuously
  • An interface call fails more than n times within a period of time, with a failure rate above x%
  • The number of Tomcat threads exceeds 120
  • The business code writes an error-level log
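
Threshold rules of this kind can be sketched as a simple table of metric names and limits. The rule names, metric keys and sample values here are hypothetical; only the thresholds (95%, 90%, 120) come from the list above.

```python
def check_status_events(sample, rules):
    """Return the names of all rules whose threshold is exceeded by the sample."""
    fired = []
    for rule_name, (metric, threshold) in rules.items():
        # A metric missing from the sample is treated as not exceeding the limit.
        if sample.get(metric, 0) > threshold:
            fired.append(rule_name)
    return fired


# rule name -> (metric key, threshold)
RULES = {
    "cpu_high": ("cpu_percent", 95),
    "disk_nearly_full": ("disk_used_percent", 90),
    "tomcat_threads_high": ("tomcat_threads", 120),
}

sample = {"cpu_percent": 97.5, "disk_used_percent": 71.0, "tomcat_threads": 133}
alerts = check_status_events(sample, RULES)
print(alerts)
```

In practice a rule would also carry a duration ("for 5 minutes") to avoid alerting on momentary spikes; that is omitted here to keep the sketch short.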

Trend alarms analyze how indicators change over time and alert on abnormal changes, such as:

  • The number of messages is 30% lower than a week ago
  • The number of requests for a URL interface has increased by 30% compared with a minute ago
  • Memory usage increased by more than 5% for ten consecutive minutes
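
Trend rules compare the current value with a baseline rather than with a fixed limit. The following is a minimal sketch with hypothetical function names. It assumes one reading of the third example: memory usage rose in every one of the last ten one-minute samples and the total rise exceeds 5 percentage points.

```python
def relative_change(current, baseline):
    """Fractional change of current relative to baseline (0.3 means +30%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def dropped_by(current, baseline, fraction):
    """True if current is at least `fraction` lower than baseline."""
    return relative_change(current, baseline) <= -fraction


def rose_steadily(samples, window, min_total_rise):
    """True if the last `window` steps all went up and the total rise is large enough."""
    if len(samples) < window + 1:
        return False
    recent = samples[-(window + 1):]
    every_step_up = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return every_step_up and (recent[-1] - recent[0]) > min_total_rise


# Message count now versus the same time one week ago.
print(dropped_by(current=650, baseline=1000, fraction=0.3))

# Memory usage in percent, sampled once a minute (eleven samples cover ten minutes).
memory = [50, 51, 52, 52.5, 53, 53.5, 54, 54.5, 55, 55.5, 56]
print(rose_steadily(memory, window=10, min_total_rise=5))
```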

Comprehensive observation dimensions

Resource monitoring

Resources mainly refer to system computing resources and network bandwidth. Common resources include CPU, memory, disk I/O and network card traffic. Such indicators are usually reported as absolute values, percentages and the like, to measure system load intuitively. As the most basic kind of monitoring, it is provided by all kinds of open source monitoring systems and cloud computing platforms.

(example of resource monitoring)

For each type of resource, the following common concerns should be clearly presented in the monitoring system.

  1. CPU: for compute-intensive applications, CPU is the core resource, and its load directly indicates the current system load. For other applications, the CPU is usually lightly loaded. If the CPU load suddenly rises at some point, it usually indicates a bug in the application such as an infinite loop, a virtual machine overwhelmed by frequent full GC, or high network traffic that increases the cost of I/O processing and context switching.
  2. Process liveness: check whether a process with the specified name exists.
  3. Memory: memory usage and remaining memory.
  4. Hard disk: disk space, inode count, disk I/O, etc.
  5. Network card: inbound and outbound traffic, inbound and outbound PPS, packet loss rate, etc.
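
As a small illustration of collecting such indicators, the following sketch reads disk usage and the CPU count with the Python standard library only. The function names and the snapshot layout are hypothetical; a production agent would rely on dedicated tooling to cover memory, network and per-process data as well.

```python
import os
import shutil


def disk_used_percent(path="."):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def collect_resource_snapshot(path="."):
    """Gather a few basic resource indicators for one node."""
    used = disk_used_percent(path)
    return {
        "cpu_count": os.cpu_count(),
        "disk_used_percent": round(used, 1),
        # 90% is an example alarm threshold, not a recommendation.
        "disk_alarm": used > 90,
    }


snapshot = collect_resource_snapshot()
print(snapshot)
```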

Performance monitoring

Resource monitoring focuses on the operating system level, while performance monitoring focuses on the application level, that is, application performance management (APM).

For monitoring at the application process level, indicator data is usually obtained by implicit injection at the virtual machine layer or bytecode execution layer, such as the JVM or the PHP Zend engine. Taking Java applications as an example, this approach yields the following monitoring indicators:

  • JVM memory status
  • JVM GC status
  • Java method call statistics
  • Tomcat thread status
  • Custom thread pool working status

At the interface level, traditional applications can still be monitored by bytecode injection, while cloud native applications can be monitored through a sidecar. This gives us each interface's QPS, number of concurrent calls, response time distribution, number of errors and other indicators.

(HTTP interface call monitoring example)

The interface types that can be monitored in this way include:

  • HTTP interface statistics (outgoing and incoming calls)
  • RPC interface statistics (outgoing and incoming calls)
  • SQL execution statistics
  • Redis access statistics
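
Bytecode injection and sidecars are transparent to application code. The same idea can be shown explicitly with a decorator that wraps a call and records its count, errors and elapsed time. This is only an analogy in Python with hypothetical names (`monitored`, `STATS`, `get_orders`), not how a JVM agent or a sidecar is implemented.

```python
import time
from collections import defaultdict
from functools import wraps

# interface name -> accumulated statistics
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def monitored(name):
    """Decorator that records call count, error count and elapsed time."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            entry = STATS[name]
            entry["calls"] += 1
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                entry["errors"] += 1
                raise  # monitoring must not swallow the error
            finally:
                entry["total_seconds"] += time.perf_counter() - started

        return wrapper

    return decorator


@monitored("GET /orders")
def get_orders(fail=False):
    if fail:
        raise RuntimeError("backend unavailable")
    return ["order-1"]


get_orders()
try:
    get_orders(fail=True)
except RuntimeError:
    pass

print(dict(STATS))
```

Dividing the call count by the length of the reporting interval gives QPS, and feeding each elapsed time into a histogram gives the response time distribution.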

The interface performance of a single node reflects only that node, which is of limited significance. When the interface calls of all services on the platform are linked together, however, the overall operation of the system can be shown intuitively. This is the call chain tracing system that is currently so popular. Today's mainstream call chain tracing systems are basically based on Google's Dapper, and popular open source components include Zipkin, Pinpoint and SkyWalking. With call chain tracing, we can easily carry out:

  • Performance analysis of each service node
  • Fast fault location
  • Request call link analysis
  • Service dependency analysis and governance
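
The core data model behind call chain tracing is small: every unit of work is a span that carries a shared trace ID, its own span ID and the ID of its parent. The sketch below is a simplified illustration of that idea, not the data model of Zipkin, Pinpoint or SkyWalking. The service names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                      # shared by every span of one request
    service: str
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, service):
        """Create a span for a downstream call within the same trace."""
        return Span(trace_id=self.trace_id, service=service, parent_id=self.span_id)


def dependencies(spans):
    """Derive (caller, callee) service pairs from parent/child links."""
    by_id = {span.span_id: span for span in spans}
    return sorted(
        {(by_id[span.parent_id].service, span.service)
         for span in spans if span.parent_id in by_id}
    )


root = Span(trace_id=uuid.uuid4().hex, service="gateway")
session = root.child("session-service")
database = session.child("mysql")
ai = root.child("ai-service")

edges = dependencies([root, session, database, ai])
print(edges)
```

Aggregating these edges over many traces yields the service dependency graph, and attaching timings to each span supports per-node performance analysis.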

(Qiyu full-link monitoring dashboard)

Business monitoring

Business performance is always the most direct concern, and the most direct form of monitoring is business monitoring. Common business monitoring takes the form of business dashboards, such as the big screen of transaction data for Taobao's Double 11. Such big screens are usually meant for external display or for management, so they look impressive and the data types shown are carefully selected.

For the R&D team, the most important purpose of business monitoring is to indicate the health of the business. The data shown is therefore quite different: generally more detailed and covering more dimensions.

First, monitor the overall business status. Overall monitoring focuses on the health of the core business, so that any abnormality in it triggers an immediate alert.

(Part of the Qiyu business monitoring panel)

For example, the core business process of Qiyu is the communication between visitors and customer service agents and the improvement of customer service efficiency. The overall business monitoring indicators therefore include:

  • Concurrent sessions
  • Concurrent traffic
  • Number of AI answers
  • AI resolution rate
  • Number of online agents
  • Messaging rate
  • Work order creation rate

However, the overall business trend may show no obvious abnormality while the business is completely unavailable in one particular segment. Beneath the overall view, monitoring should therefore drill down along different dimensions. Common dimensions are:

  • Region: regional differences mainly concern the network, especially for network-sensitive businesses such as video. The biggest regional difference is the quality of CDN coverage, followed by network access restrictions imposed by regional operators, and then the frequently reported incidents of optical fiber being cut somewhere.
  • Users: different functions may be provided to different types of users. Users are commonly distinguished by VIP level, tags, etc.
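
The need to drill down can be shown with a few lines of aggregation. In this hypothetical data set the overall success rate looks healthy, while one region is completely unavailable. The event fields (`region`, `user_type`, `ok`) are assumptions made for the example.

```python
from collections import defaultdict


def success_rate(events):
    """Fraction of events that succeeded."""
    return sum(1 for event in events if event["ok"]) / len(events)


def success_rate_by(events, dimension):
    """Success rate for each distinct value of one dimension."""
    groups = defaultdict(list)
    for event in events:
        groups[event[dimension]].append(event)
    return {value: success_rate(group) for value, group in groups.items()}


events = (
    [{"region": "east", "user_type": "vip", "ok": True}] * 60
    + [{"region": "south", "user_type": "standard", "ok": True}] * 37
    + [{"region": "west", "user_type": "standard", "ok": False}] * 3
)

overall = success_rate(events)
by_region = success_rate_by(events, "region")
by_user_type = success_rate_by(events, "user_type")
print(overall, by_region, by_user_type)
```

The overall figure of 97% would not trigger most alarms, yet every request from the "west" region fails, which only the per-region view reveals.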

Tenant status tracking

A SaaS business provides services at tenant granularity. To offer personalized services, tenants are given many functions that they can customize freely. The same function may work normally for one tenant but be completely unavailable for another, so business functions need to be monitored along the tenant dimension. Tenant status tracking is divided into two parts: SaaS platform functions and customer interface monitoring.

SaaS platform functions are easy to understand: they are the functional services that the platform provides to customers. It is important to continuously track how customers use these functions, especially large customers and new customers.

First, enterprise customers have high requirements for service stability. Any abnormality, however small, may affect the customer relationship, lead to complaints, and affect subsequent renewals and purchases.

Second, changes in function usage reflect the customer's dependence on the platform. When a customer's use of a function drops suddenly or keeps declining, we should check whether the customer's business has shrunk or whether part of it has been migrated to a competing product. We should find out the reasons promptly and maintain the customer relationship. If a customer suddenly starts using a function heavily, we can start preparing for the customer's upcoming additional purchase.

(Monitoring panel for a key enterprise's message interface)

The other part is the customer's own interfaces. Many functions provided by the SaaS platform involve the customer's own business data and need to connect to the customer's own systems. The SaaS platform therefore usually defines interface standards for the customer to implement and then calls those interfaces in its business processes. These external interfaces are outside the platform's control, their service quality is uneven, and they are prone to exceptions. Such interface failures do not affect the overall service of the SaaS platform, but they can be disastrous for the specific customer. Moreover, a customer's first reaction to a failure is usually to assume that the SaaS platform has failed and to ask us to investigate as soon as possible, which wastes a lot of the team's time. The customer may not have such thorough monitoring in place, so we must monitor these interface calls along the tenant dimension and notify the customer promptly when exceptions occur. This is both responsible toward the customer and reduces our own workload.
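
Monitoring customer-implemented interfaces by tenant can be sketched as a per-tenant error-rate tracker. The class name, the thresholds and the tenant IDs are hypothetical; the minimum call count is an assumption added so that a tenant with only a handful of calls does not trigger a notification.

```python
from collections import defaultdict


class TenantCallMonitor:
    """Tracks outcomes of calls to customer-implemented interfaces, per tenant."""

    def __init__(self, error_rate_threshold=0.2, min_calls=10):
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls
        self._calls = defaultdict(int)
        self._errors = defaultdict(int)

    def record(self, tenant_id, ok):
        self._calls[tenant_id] += 1
        if not ok:
            self._errors[tenant_id] += 1

    def error_rate(self, tenant_id):
        calls = self._calls.get(tenant_id, 0)
        return self._errors.get(tenant_id, 0) / calls if calls else 0.0

    def tenants_to_notify(self):
        """Tenants with enough traffic whose error rate exceeds the threshold."""
        return sorted(
            tenant for tenant, calls in self._calls.items()
            if calls >= self.min_calls
            and self.error_rate(tenant) > self.error_rate_threshold
        )


monitor = TenantCallMonitor()
for i in range(50):
    monitor.record("tenant-a", ok=True)
    monitor.record("tenant-b", ok=(i % 2 == 0))  # half of tenant-b's calls fail

print(monitor.tenants_to_notify())
```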

Business log

Logs are the main means of post-incident analysis. Complete log information delivers the following important value:

  • When the result of a service does not meet expectations, the call chain information can be used to reconstruct the request completely and find the cause of the problem.
  • Error-level logs and logs containing specific keywords can be monitored and alerted on.
  • Structured logs make it possible to count and analyze call volume, execution results and other information, supporting operational statistics.

To realize this value, there are some best practices to follow when using a logging system.

First, good log content records just enough information. The problem with recording too little is obvious, but recording too much also causes problems: it dilutes the truly useful information, increases the burden on the whole logging system, and can even affect the performance of the core business.

Second, log information should be structured. A complete log entry should include the full input parameters, output parameters, caller and callee information, critical-path execution results, elapsed time, etc. Nginx logs are a typical example. To make logging easier, logging can be implemented at the architecture level and offered through a convenient SDK.

Third, in a distributed architecture a complete request passes through many nodes. When analyzing a problem, the call logs on these nodes must be linked together, which requires a complete set of log collection and query tools. The most commonly used is ELK. Because the logs are scattered across different nodes, linking them requires that the whole call chain be tagged with an identifier and that the identifier be recorded in every log entry.
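
One way to produce structured logs is to format every record as a single JSON object. The sketch below uses Python's standard `logging` and `json` modules. The field names (`caller`, `callee`, `elapsed_ms`, `result`) are assumptions chosen to match the items listed above, and the output is written to an in-memory buffer only so that the example is self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""

    FIELDS = ("caller", "callee", "elapsed_ms", "result")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.FIELDS:
            # Values passed through `extra=` become attributes of the record.
            payload[name] = getattr(record, name, None)
        return json.dumps(payload, sort_keys=True)


buffer = io.StringIO()
json_handler = logging.StreamHandler(buffer)
json_handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.addHandler(json_handler)
logger.propagate = False

logger.info(
    "create ticket",
    extra={"caller": "web", "callee": "ticket-service", "elapsed_ms": 37, "result": "ok"},
)

entry = json.loads(buffer.getvalue())
print(entry)
```

Because every entry has the same machine-readable fields, call volume and execution results can be counted directly from the logs.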
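
Tagging every log entry with a call chain identifier can be done without changing each log statement. The sketch below keeps the identifier in a `contextvars` variable and injects it through a `logging.Filter`. The names `trace_id_var`, `TraceIdFilter` and `handle_request` are hypothetical, and how the identifier travels between services (for example in a request header) is outside this example.

```python
import contextvars
import io
import logging
import uuid

# Holds the identifier of the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


output = io.StringIO()
trace_handler = logging.StreamHandler(output)
trace_handler.setFormatter(logging.Formatter("%(trace_id)s %(name)s %(message)s"))
trace_handler.addFilter(TraceIdFilter())

log = logging.getLogger("trace_demo")
log.setLevel(logging.INFO)
log.addHandler(trace_handler)
log.propagate = False


def handle_request(incoming_trace_id=None):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        log.info("request received")
        log.info("calling downstream service")
    finally:
        trace_id_var.reset(token)
    return trace_id


handle_request("abc123")
lines = output.getvalue().splitlines()
print(lines)
```

Searching the collected logs for one identifier then returns every entry written for that request, whichever node produced it.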

An efficient health dashboard

Different monitoring dimensions suit different monitoring methods and tools. If daily use and troubleshooting require switching between tools, efficiency drops considerably. A unified health dashboard that integrates data from the various systems and supports display and operation on a single platform greatly improves the usability of the observation system.

(Simple example of a business monitoring dashboard)

As an aggregation and display platform, the health dashboard is used only to show the real-time health status of the system. Each monitoring system reports data to the platform through a unified interface. The data includes:

  • Service information: service node information, service name, node ID, node IP, etc.
  • Business domain information: the business domain to which the service belongs. If necessary, multi-level business domains can be supported.
  • Data dimension: resource, performance, high availability, business indicators and other monitoring dimensions.
  • Dimension priority: different priorities have different display weights. Exceptions with high priority are displayed first.
  • Health level: indicates the health of the service. For example, healthy when there is no abnormality, sub-healthy when there are a few warnings, unhealthy when there are many alarms, and down when the service is unavailable.
  • Detail link: used to jump to the details.

The health dashboard supports the following functions:

  • Monitoring dimension aggregation: all monitoring dimension data of the same node can be aggregated and displayed, weighted by priority and health level.
  • Application state aggregation: node data belonging to the same service can be aggregated, and services belonging to the same business domain can be aggregated and displayed.
  • Priority display: after aggregation, items are displayed in real time in order of health priority, with the least healthy shown first.
  • Data drill-down analysis: node data can be drilled into by monitoring dimension and service dimension.
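
The reported data and the aggregation rules can be sketched as follows. This is an illustration under stated assumptions, not the actual platform: the field names follow the list of reported data, the aggregated level of a group is taken to be the worst level reported within it, a larger priority number is assumed to mean more important, and the services, domains and URLs are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    SUB_HEALTHY = 1
    UNHEALTHY = 2
    DOWN = 3


@dataclass
class Report:
    service: str
    node_id: str
    business_domain: str
    dimension: str        # resource, performance, business, ...
    priority: int         # assumed convention: larger is more important
    health: Health
    detail_url: str       # link used to jump to the details


def worst_by(reports, attribute):
    """Aggregate reports by one attribute, keeping the worst health level."""
    worst = {}
    for report in reports:
        key = getattr(report, attribute)
        worst[key] = max(worst.get(key, Health.HEALTHY), report.health)
    return worst


def display_order(reports):
    """Least healthy first; among equals, higher priority first."""
    return sorted(reports, key=lambda r: (r.health, r.priority), reverse=True)


reports = [
    Report("session-service", "n1", "customer-service", "resource", 1,
           Health.HEALTHY, "https://monitor.example/n1/resource"),
    Report("session-service", "n1", "customer-service", "business", 3,
           Health.UNHEALTHY, "https://monitor.example/n1/business"),
    Report("ticket-service", "n2", "customer-service", "performance", 2,
           Health.SUB_HEALTHY, "https://monitor.example/n2/performance"),
    Report("ai-service", "n3", "ai", "performance", 2,
           Health.DOWN, "https://monitor.example/n3/performance"),
]

by_service = worst_by(reports, "service")
by_domain = worst_by(reports, "business_domain")
first = display_order(reports)[0]
print(by_service, by_domain, first.service)
```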


At NetEase Smart Enterprise, the number of services has reached the thousands and the number of instances is on the order of 10,000. A complete observation system shields us from the complexity of the system architecture, makes the operating state of the whole system clearly visible, and plays a major role in early fault warning and troubleshooting. Through the observation system, we can clearly see the flowing blood and beating pulse of the system and safeguard its health.

About the author

Zhou Jianghua is a senior business architect at NetEase Smart Enterprise. A full-stack engineer with extensive technical experience, he has led the development of PC clients, mobile apps and servers. He currently focuses on business architecture and technical management.