From DevOps to AIOps: how does Alibaba achieve intelligent operations?


Abstract: AIOps stands for Algorithmic IT Operations, and it is a hot topic in the operations field. The challenge AIOps faces is how to improve platform efficiency and stability and reduce resource cost while still meeting the business SLA.


With the rapid growth of the search business, search systems are becoming platforms, and the operations model has evolved from manual operations through script-based automation to DevOps. But with the rapid development of big data and artificial intelligence, traditional operations methods and solutions can no longer meet demand.

To improve platform efficiency and stability and reduce resource cost, we built Hawkeye, an online service optimization expert, and Torch, a capacity planning platform. After several years of refinement, we have established good practices in four areas: configuration rationality, resource setting rationality, performance bottlenecks, and deployment rationality. The following describes the system architecture and implementation of Hawkeye and Torch.

AIOps practice and implementation

Hawkeye: intelligent diagnosis and optimization

System introduction


Hawkeye is an intelligent diagnosis and optimization system. The platform is generally divided into three parts:

1. Analysis layer, including two parts:

1) Hawkeye blink, the underlying analysis project: built on Blink, it handles data processing, focusing on access log analysis and full data analysis. Using Blink's data processing capability, it analyzes the access logs and full data of all HA3 applications on the search platform every day.

2) Hawkeye experience, the one-click diagnosis project: built on the analysis results of Hawkeye blink, it provides analysis closer to the user, such as field information monitoring (field type rationality, field value monotonicity monitoring, and so on). It also covers, but is not limited to, invalid KMON alarms, smoke test case entry, engine degradation configuration, memory-related configuration, recommended row and column count configuration, and detection of the minimum proportion of in-service rows during switchover.

Hawkeye experience is positioned as a middle platform for engine diagnosis rules. It captures in the system the experience that operations staff accumulate while optimizing and maintaining engines, so that every newly onboarded application benefits from it immediately instead of learning it by hitting the same pitfalls again and again. Giving every user the equivalent of an intelligent diagnosis expert for optimizing their own engine is our goal and the driving force behind our continued work. The data processing flow chart of Hawkeye experience is as follows:


2. Web layer: provides APIs for Hawkeye analysis results and visual monitoring charts.

3. Service layer: provides the Hawkeye analysis and optimization APIs.

Based on the above architecture, our diagnosis and optimization functions are as follows:

• Resource optimization: engine lock memory optimization (invalid field analysis), real-time memory optimization, etc.

• Performance optimization: top-N slow query optimization, buildservice resource setting optimization, etc.

• Intelligent diagnosis: routine inspection, intelligent Q&A, etc.

Engine lock memory optimization

In the HA3 engine, fields are divided into inverted indexes, attribute indexes, and summary indexes. The engine's lock policy determines, for each of these three index types, whether it is locked in memory. The benefit of locking is clear: it speeds up access and reduces response time (RT). But suppose only 50 of 100 fields are accessed over two months and the rest are never accessed at all. Locking those unused fields wastes a great deal of valuable memory. Hawkeye therefore analyzes field usage and slims down the indexes of the head (largest) applications. The following figure shows the process of lock memory optimization, which has saved several million yuan in total.
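The idea behind index slimming can be illustrated with a minimal sketch. It assumes the field names have already been extracted from the access logs; the function name, the field names, and the memory sizes are hypothetical and are not Hawkeye's actual implementation.

```python
from collections import Counter


def find_unused_fields(index_fields, accessed_field_names):
    """Return fields that are locked in memory but never accessed,
    plus the total memory (MB) that unlocking or dropping them would free."""
    access_counts = Counter(accessed_field_names)
    unused = {
        name: size_mb
        for name, size_mb in index_fields.items()
        if access_counts[name] == 0
    }
    return unused, sum(unused.values())


# Hypothetical data: field name -> locked memory in MB.
index_fields = {"title": 512, "price": 128, "legacy_tag": 256, "old_score": 64}
# Hypothetical field names seen in two months of access logs.
accessed_field_names = ["title", "price", "title", "price", "title"]

unused, freed_mb = find_unused_fields(index_fields, accessed_field_names)
print(unused, freed_mb)
```

In this toy input, two of the four fields are never accessed, so they are the candidates for slimming.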


Slow query analysis

Slow query data comes from the application's access logs. The number of queries depends on the application's traffic and is usually in the tens of millions or even hundreds of millions. Extracting the top-N slow queries from such massive logs is a big data analysis problem. Using Blink's big data analysis capability, we solve it with divide and conquer + hashing + a min-heap. We first parse each query and obtain its query time, compute the MD5 of the parsed key-value data, partition the queries into shards by MD5 value, compute the top-N slow queries within each shard, and finally merge the per-shard results to obtain the global top-N. For the resulting top-N slow queries we give users personalized optimization suggestions, helping them improve engine query performance and, indirectly, engine capacity.
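The divide and conquer + hash + min-heap approach can be sketched in a few lines of single-process Python. This is only an illustration of the technique, not the Blink job itself; the record layout (query string, latency in milliseconds), the shard count, and all function names are assumptions.

```python
import hashlib
import heapq


def shard_of(query, num_shards):
    """Pick a shard from the MD5 of the parsed query."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def top_n_in_shard(records, n):
    """Keep the n slowest (latency_ms, query) pairs using a size-n min-heap."""
    heap = []
    for latency_ms, query in records:
        if len(heap) < n:
            heapq.heappush(heap, (latency_ms, query))
        elif latency_ms > heap[0][0]:
            # The heap root is the fastest of the current top n; replace it.
            heapq.heapreplace(heap, (latency_ms, query))
    return heap


def top_n_slow_queries(log_records, n, num_shards=4):
    """log_records: iterable of (query, latency_ms). Returns the n slowest."""
    shards = [[] for _ in range(num_shards)]
    for query, latency_ms in log_records:
        shards[shard_of(query, num_shards)].append((latency_ms, query))
    # The global top n is always contained in the union of per-shard top n.
    candidates = []
    for shard in shards:
        candidates.extend(top_n_in_shard(shard, n))
    return [(query, latency) for latency, query in heapq.nlargest(n, candidates)]


logs = [("q=a", 120), ("q=b", 900), ("q=c", 45), ("q=d", 300), ("q=e", 760)]
slowest = top_n_slow_queries(logs, 2)
print(slowest)
```

Each shard needs only O(n) memory for its heap, which is what makes the per-shard step cheap enough to run in parallel over very large logs.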

One-click diagnosis

We measure engine health with a health score, which tells users clearly how healthy their service is. The diagnosis report briefly lists the diagnosis time, the unreasonable configurations and their details, the expected benefit of optimization, and the diagnosis logic. The results page for a one-click diagnosis that found problems is shown in the figure below; the diagnosis details page is omitted for brevity.


Intelligent Q&A

As the number of applications grows, so does the number of support questions the platform receives. Many of these questions are repetitive, such as incremental updates stopping or common resource alarms. Problems like these have fixed handling procedures, so they can be handled through ChatOps with a Q&A robot. At present, Hawkeye combines KMON metrics with customized alarm message templates and adds a diagnosis to the alarm body. Users paste the alarm body into the Q&A group and @-mention the robot to get the cause of the alarm.

Torch: capacity governance and optimization

Hawkeye improves efficiency and stability through intelligent diagnosis and optimization, while Torch focuses on reducing cost through capacity governance. As the number of applications on the search platform grows, problems such as the following arise, which easily lead to low resource utilization and serious waste of machine resources.

1) Business teams request container resources arbitrarily, which wastes a great deal of resource cost. Either they need clear guidance on how many resources (CPU, memory, and disk) to request, or resource management should be hidden from users and driven by minimizing container cost.

2) As the business keeps changing, nobody knows how much QPS the online service can really carry. When traffic needs to grow (for example, for promotions), is a scale-out needed? If so, should rows be added or should the CPU specification of each container be increased? When the data volume needs to grow, is it better to split into more columns or to enlarge the memory of each container? Any one of these questions can leave the business team confused.


As shown in the figure below, the data already available for capacity evaluation is KMON data, since the status of online systems is reported to KMON. Can KMON data be used directly for capacity evaluation?

Experiments show that it cannot. Many online applications run at a low load level, so a capacity figure for high load fitted from that data is not objective enough. A stress testing service is therefore needed to learn the real performance capacity. With stress testing in place, what is the next problem? Stress testing the online service directly is risky, while stress testing in the pre-release environment cannot reach the real limit because its resources are limited and its machines are poorly configured. Clone simulation is therefore needed: we clone a single instance of the real online service and stress test the clone, which is both accurate and safe. With the stress test data, the next step is to find the lowest-cost resource configuration through algorithmic analysis. With these core capabilities in place, a task management module manages each task to perform automatic capacity evaluation.


The above is our solution. Next, we first introduce the overall architecture and then the implementation of each core module.

System architecture


As shown in the figure, from the bottom up, the first layer is the access layer. To connect a platform, only the application information and cluster information of each application under that platform need to be provided (at present, HA3 and SP under TisPlus are supported). The application management module integrates the application information, and the task management module then abstracts each application into a capacity evaluation task.

The general process of a complete capacity evaluation task is as follows. First, a single instance is cloned, and the clone is automatically stress tested up to its limit capacity. The stress test data and daily data are processed by the data factory, and the formatted data is handed to the decision center. The decision center first uses the stress test data and daily data to perform capacity evaluation through the algorithm service and then assesses the benefit. If the benefit is high, it clones and stress tests again to verify the capacity optimization proposal produced by the algorithm. If verification passes, the result is persisted; if verification fails, a simple capacity evaluation is performed instead, based on the limit performance measured in the stress test. Whether the capacity evaluation completes or fails, the decision center cleans up the temporary clone and stress test resources so that nothing is wasted.
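The task flow described above might be sketched as follows. Everything here is hypothetical: the callables `clone`, `stress_test`, `evaluate`, and `release` stand in for the clone control module, the stress testing service, the algorithm service, and resource cleanup, and the benefit threshold and the handling of low-benefit proposals are assumptions the article does not specify.

```python
def run_capacity_task(app, clone, stress_test, evaluate, release, min_benefit=0.1):
    """Run one capacity evaluation task and always release temporary clones."""
    clones = []
    try:
        # 1. Clone a single instance and stress test it to its limit.
        baseline = clone(app, None)
        clones.append(baseline)
        limit_qps = stress_test(baseline)

        # 2. Ask the algorithm service for a proposal and its expected benefit.
        proposal = evaluate(app, limit_qps)

        # 3. Verify a worthwhile proposal on a fresh clone.
        if proposal["benefit"] >= min_benefit:
            candidate = clone(app, proposal["spec"])
            clones.append(candidate)
            if stress_test(candidate) >= app["target_qps"]:
                return {"mode": "algorithm", "spec": proposal["spec"]}

        # 4. Fall back to the simple evaluation based on the measured limit.
        #    (Using this fallback for low-benefit proposals too is an assumption.)
        return {"mode": "simple", "limit_qps": limit_qps}
    finally:
        # 5. Clean up temporary resources on success and on failure.
        for instance in clones:
            release(instance)


released = []
result = run_capacity_task(
    {"name": "demo", "target_qps": 800},
    clone=lambda app, spec: {"app": app["name"], "spec": spec},
    stress_test=lambda inst: 1000 if inst["spec"] is None else 850,
    evaluate=lambda app, limit: {"benefit": 0.3, "spec": {"cpu": 8, "count": 4}},
    release=released.append,
)
print(result, len(released))
```

Putting the cleanup in a `finally` block mirrors the requirement that temporary clone and stress test resources are released whether the evaluation completes or fails.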

The top layer is the application layer. Because Torch capacity governance is not built only for TisPlus, the application layer provides a capacity dashboard, capacity evaluation, capacity reports, and a benefit dashboard that other platforms can access and embed. It also provides capacity APIs for other systems to call.

Capacity evaluation also relies on many other search systems, including Maat, KMON, Hawkeye, drogo, and the cost system, which together form a closed loop.

Architecture implementation

Clone simulation

Put simply, clone simulation clones a single instance of an online application: for an HA3 application it clones one complete row, and for SP it clones an independent service. With the arrival of Hippo, the search team's powerful tool, all resources are consumed as containers, and with the development of Suez ops and sophon it became possible to clone an application quickly. The implementation of the clone control module is shown below:


At present, cloning comes in two forms: shallow clone and deep clone. A shallow clone is mainly used for HA3 applications: it pulls the index of the main application directly through a shadow table, which skips the build pipeline and speeds up cloning. A deep clone is used for applications that require an offline build.

The advantages of cloning are obvious:

  1. Service isolation: stress testing the clone environment reveals the real online capacity limit indirectly.
  2. Resource optimization suggestions can be verified directly in the clone environment.
  3. After the clone environment is used, it will be released automatically and will not waste online resources.

Stress testing service

Most applications' daily KMON data lacks metrics at high load levels, and the real capacity of an engine can only be obtained through an actual stress test, so a stress testing service is required. Early on, we investigated the company's Amazon stress testing platform and the Alimama stress testing platform and found that neither met our need for automatic stress testing. We therefore developed, on top of Hippo, a distributed stress testing service that adaptively adds stress workers.
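One way to picture adaptive load increase is a ramp that keeps raising the offered load while latency stays within budget and halves its step after an overshoot. The sketch below is only an illustration under that assumption; the article does not describe the actual ramp policy, and `measure_latency_ms`, the latency budget, and the step sizes are hypothetical. In a distributed setup, each load step could correspond to adding stress workers.

```python
def find_limit_qps(measure_latency_ms, start_qps=100, max_latency_ms=50.0,
                   min_step=10, max_qps=1_000_000):
    """Ramp load upward and return the highest QPS that met the latency budget."""
    qps = start_qps
    step = max(min_step, start_qps // 2)
    best = 0
    while step >= min_step and qps <= max_qps:
        if measure_latency_ms(qps) <= max_latency_ms:
            best = qps          # budget met: remember it and push harder
            qps += step
        else:
            step //= 2          # overshoot: retry closer to the last good point
            qps = best + step
    return best


def fake_engine(qps):
    """Hypothetical engine: latency jumps once load exceeds 730 QPS."""
    return 20.0 if qps <= 730 else 80.0


limit = find_limit_qps(fake_engine)
print(limit)
```

The returned value is a lower bound on the true limit, accurate to within roughly `min_step`.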


Algorithm service

The goal of capacity evaluation is to minimize resource cost and improve resource utilization. This presupposes that resources can be quantified as cost. Cost is also an important dimension for measuring the value of turning search into a platform, so we worked out a price formula with the finance department. After many experiments and much analysis with our algorithm colleagues, we found that the problem can be transformed into a constrained optimization (mathematical programming) problem. The objective function is the price formula, whose variables are memory, CPU, and disk. The constraints are that the container specification and the number of containers provided must satisfy the minimum requirements for QPS, memory, and disk.
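A toy version of this constrained optimization can be solved by brute-force enumeration. The sketch below is not Torch's algorithm: the price formula, the container specifications, the per-container QPS figures (which would come from stress testing), and the simplification that memory and disk requirements are totals across containers are all assumptions made for illustration.

```python
from itertools import product


def price(spec):
    """Hypothetical price formula over CPU, memory, and disk (integer units)."""
    return 10 * spec["cpu"] + 2 * spec["mem_gb"] + spec["disk_gb"] // 10


def cheapest_plan(specs, max_count, need_qps, need_mem_gb, need_disk_gb):
    """Return (cost, spec name, container count) of the cheapest feasible plan."""
    best = None
    for spec, count in product(specs, range(1, max_count + 1)):
        feasible = (
            count * spec["qps"] >= need_qps
            and count * spec["mem_gb"] >= need_mem_gb
            and count * spec["disk_gb"] >= need_disk_gb
        )
        if not feasible:
            continue
        cost = count * price(spec)
        if best is None or cost < best[0]:
            best = (cost, spec["name"], count)
    return best


# Hypothetical container specs; "qps" is the measured limit of one container.
specs = [
    {"name": "4c16g", "cpu": 4, "mem_gb": 16, "disk_gb": 100, "qps": 300},
    {"name": "8c32g", "cpu": 8, "mem_gb": 32, "disk_gb": 200, "qps": 700},
]
plan = cheapest_plan(specs, max_count=10, need_qps=2000,
                     need_mem_gb=64, need_disk_gb=500)
print(plan)
```

Here seven small containers would cost 574 while three large ones cost 492, so the larger specification wins even though its unit price is higher. A real search space would call for an integer programming solver rather than enumeration.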

AIOps outlook

Deploying Hawkeye diagnosis and optimization and Torch capacity governance on the TisPlus search platform has greatly reduced cost, improved efficiency and stability, and given us confidence to apply AIOps to other online systems. The next step is therefore to integrate Hawkeye and Torch into an AIOps platform so that other online services can also benefit from AIOps. Openness and ease of use are the two primary considerations in the platform design.

To that end, we will focus on building four foundational libraries:

Operations metrics library: online system logs, monitoring metrics, events, and application information are standardized and integrated, so that operations metrics are easy to obtain when implementing a strategy.

Operations knowledge base: the problem sets and experience accumulated in daily Q&A are stored in ES, which provides retrieval and computation so that similar online problems can be diagnosed automatically and self-healed.

Operations component library: clone simulation, stress testing, and the algorithm models are turned into components, so that users can flexibly choose algorithms when implementing a strategy and can easily use clone simulation and stress testing to verify optimization suggestions.

Operations strategy library: on a canvas, users drag and drop components and write UDP to quickly implement operations strategies for their own systems. The metrics library, knowledge base, and component library supply rich data and components, which makes implementing an operations strategy simple.

Building on this infrastructure and combining strategies, the platform can produce data for various operations scenarios and support applications such as fault handling, intelligent Q&A, capacity governance, and performance optimization.


This article is the AIOps practice installment of the Alibaba search middle platform (Zhongtai) technology series. It has been three years since we began building the search middle platform from 0 to 1, but we are still far from our vision of a world where no search is hard to use. The road ahead is full of challenges. Whether in business-facing SaaS capability, productization of search algorithms, cloud-based DevOps and AIOps, or site building for businesses, we will face world-class challenges.

Author: yunradium


This article is from alitech, a partner of yunqi community. If you need to reprint it, please contact the original author.