In the cloud era, enterprise IT operations and maintenance (O&M) faces complex architectures, diverse business requirements, and massive volumes of O&M data. Accurate alerting, intelligent anomaly diagnosis, root cause localization, anomaly prediction, and automatic repair have become urgent needs in enterprise digital transformation.
On September 26, Teng Shengbo, a senior technical expert at Alibaba, delivered a keynote speech titled “Unattended Cloud Servers and Self-Service in Practice”. He shared how the Alibaba Cloud elastic computing team uses artificial intelligence to automate operations and maintenance, make cloud servers unattended, help users reduce the complexity of managing cloud server instances, and keep instance services running stably and efficiently. This article is based on Teng Shengbo’s speech.
Picture: Teng Shengbo, senior technical expert of Alibaba
This article is organized as follows:
1. Why do cloud servers need to be unattended?
2. Alibaba Cloud’s unattended self-service
3. Data and AI behind unattended operation
1. Why do cloud servers need to be unattended?
Operations and maintenance is a service that includes both infrastructure software services and human services. Traditionally, its customers are the business teams inside an enterprise that use the infrastructure. Cloud computing IaaS is itself an O&M service, and its customers have become the developers and O&M teams that use cloud services. With the wide adoption of cloud computing, most enterprises have moved to the cloud. At present, more than 1 million users run their business on the Alibaba Cloud platform, and the number keeps growing.
As the number of platform users has grown, we have found that users generally face three pain points in operating and maintaining ECS instances:
(1) High cost of communicating background: “Why is something wrong with my instance?”
(2) Manual handling takes a long time: “Why hasn’t this problem been solved for so long?”
(3) Opaque operations on the customer’s resources: “The problem seems to be fixed, but what did you just do?”
Because of these pain points, we would have to keep adding customer service staff so that users’ problems are solved effectively. To avoid customer-side O&M costs growing linearly with the user base, we began using artificial intelligence to support users’ O&M management. As unmanned retail and driverless cars become a trend, we believe that cloud servers will also become unattended.
It has been 10 years since Alibaba Cloud launched its elastic computing products, and over that time we have accumulated a great deal of ECS instance O&M experience and many rules of abnormal “behavior”. Building on this, we used data-driven machine learning and the analysis of abnormal “behavior” data to build an unattended architecture for cloud servers and launched a series of self-services. These provide self-diagnosis, self-repair, self-optimization, and self-operation of ECS instances, helping users reduce the complexity of ECS instance management and keep instance services running stably and efficiently.
2. Unattended self-service in practice
The operations and maintenance of cloud computing IaaS can be divided into service-side O&M and customer-side O&M. Service-side O&M is the O&M work of the cloud platform itself, which is usually invisible to users. It spans three levels: infrastructure, basic products, and upper-level management and control, and it includes the O&M of data centers and physical equipment, resource virtualization, resource scheduling, and live migration. As the user base grows, this work becomes more and more complex. Customer-side O&M is visible to the users themselves. It is mainly the user’s modification and automation of ECS instances, including capacity expansion, restarts, monitoring, customer service, ticket response, resource orchestration, and O&M orchestration.
On top of the unattended architecture for cloud servers, we provide a series of self-services for users of the Alibaba Cloud platform. Broadly speaking, Alibaba Cloud’s self-service covers four dimensions: the ECS instance itself, instance lifecycle management, system management and automation, and marketplace and ecosystem, as shown in the figure below.
Figure: self service in a broad sense
In a narrow sense, Alibaba Cloud self-service lets users diagnose and repair ECS instances and receive recommendations for them. At present, Alibaba Cloud self-service provides a series of tools, such as the instance diagnosis tool, instance optimization recommendations, the automatic repair tool, best template recommendations, and ECS event automation. These cover 80% of common ECS problems and reduce the average time to resolve a problem from a few hours to minutes. The whole process requires no manual involvement from customer service and carries no risk of privacy leakage, so the cloud server is unattended. In the future, continuously driven by AI and data, the diagnosis and repair of ECS instances will become more and more accurate.
Intelligent diagnosis of ECS instance
According to the data statistics of the platform, users mainly face four types of problems when using ECS instances:
(1) The instance cannot be accessed remotely
(2) Instance failed to start / stop
(3) Instance performance exception
(4) Disk expansion does not take effect
Accordingly, intelligent diagnosis covers four dimensions: ECS system services, disk health services, network health services, and guest OS system configuration. Users can run an intelligent health diagnosis of an instance with one click.
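A one-click diagnosis over these four dimensions can be pictured as a list of independent checks run against a snapshot of the instance. The following is only a minimal sketch under stated assumptions: the snapshot keys, thresholds, and function names are hypothetical and are not part of any Alibaba Cloud API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]  # hypothetical view of one instance's state


@dataclass
class Finding:
    dimension: str
    healthy: bool
    detail: str


def check_system_service(s: Snapshot) -> Finding:
    status = s.get("instance_status")
    return Finding("ecs_system_service", status == "Running", f"status={status}")


def check_disk(s: Snapshot) -> Finding:
    usage = float(s.get("disk_usage_pct", 0.0))
    # 95% is an illustrative threshold, not a documented one.
    return Finding("disk_health", usage < 95.0, f"usage={usage}%")


def check_network(s: Snapshot) -> Finding:
    ports = s.get("open_ports", [])
    return Finding("network_health", 22 in ports, f"open_ports={ports}")


def check_guest_os(s: Snapshot) -> Finding:
    running = bool(s.get("sshd_running", False))
    return Finding("guest_os_config", running, f"sshd_running={running}")


CHECKS: List[Callable[[Snapshot], Finding]] = [
    check_system_service,
    check_disk,
    check_network,
    check_guest_os,
]


def diagnose(snapshot: Snapshot) -> List[Finding]:
    """Run every check once and return one finding per dimension."""
    return [check(snapshot) for check in CHECKS]


if __name__ == "__main__":
    demo = {"instance_status": "Running", "disk_usage_pct": 97.0,
            "open_ports": [80], "sshd_running": True}
    for finding in diagnose(demo):
        print(finding)
```

Keeping each dimension as a separate function means a new check can be added to the list without touching the others.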
Automatic repair of ECS instance
Once intelligent diagnosis has located the problem, we also provide an automatic repair scheme for the ECS instance. Automatic repair can resolve the problem within 1 to 3 minutes and mainly covers ECS system service repair, network problem repair, and disk repair.
Automatic repair alone is not enough. We believe that automatic repair should also be transparent and compliant. The Operation Orchestration Service (OOS) provides the automation engine, and Cloud Assistant commands provide the ability to execute inside the guest OS; together they complete the automatic repair for the user. We have also open-sourced the OOS and Cloud Assistant command code so that all repair logic is visible to users. All repair operations are governed by the Alibaba Cloud RAM role of the ECS instance, so all permissions can be controlled, and all records can be audited through Alibaba Cloud ActionTrail, which makes the process truly transparent and compliant.
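The combination of controlled permissions and auditable records can be illustrated with a small stand-alone sketch. It does not call OOS, Cloud Assistant, RAM, or ActionTrail; the class, the action names, and the in-memory audit list are hypothetical stand-ins for those services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class RepairRunner:
    # Stand-in for role-based permission control: only these actions may run.
    allowed_actions: Set[str]
    # Stand-in for an audit trail: every attempt is recorded, allowed or not.
    audit_log: List[Dict[str, str]] = field(default_factory=list)
    actions: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        """Register a repair action; fn receives the instance ID."""
        self.actions[name] = fn

    def _record(self, instance_id: str, action: str, outcome: str) -> None:
        self.audit_log.append(
            {"instance": instance_id, "action": action, "outcome": outcome}
        )

    def run(self, instance_id: str, action: str) -> bool:
        """Run an action if it is known and permitted; always leave a record."""
        if action not in self.allowed_actions or action not in self.actions:
            self._record(instance_id, action, "denied")
            return False
        outcome = self.actions[action](instance_id)
        self._record(instance_id, action, outcome)
        return True


if __name__ == "__main__":
    runner = RepairRunner(allowed_actions={"restart_network"})
    runner.register("restart_network", lambda instance_id: "ok")
    runner.register("reformat_disk", lambda instance_id: "ok")
    runner.run("i-demo", "restart_network")
    runner.run("i-demo", "reformat_disk")  # registered but not permitted
    print(runner.audit_log)
```

The point of the sketch is that a denied attempt is logged just like a successful one, so the record of what was tried is complete.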
3. AI and data capabilities behind unattended operation
What makes intelligent diagnosis and automatic repair possible is AI plus data, the powerful technical support beneath the surface of the iceberg. Relying on the underlying data platform, we collect, clean, and analyze data and build models from it, including physical machine data, virtualization data, network data, control plane data, and data from inside the guest OS. In addition, by continuously optimizing our AI algorithms, we have built user profiles, decision trees, and prediction and recommendation models to keep anomaly diagnosis and automatic repair accurate and efficient.
At present, the overall ECS self-service architecture relies mainly on data gathered by the control and monitoring center: the real-time monitoring log service, middleware monitoring, API request monitoring, console monitoring, and self-service diagnosis. A machine learning engine uses this data to warn of and handle problems early, and then drives the Operation Orchestration Service (OOS) to repair them automatically.
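The flow from monitoring data to early warning to a repair trigger can be sketched as follows. The article does not describe the actual models, so a rolling z-score detector is swapped in here purely as a simple stand-in for the machine learning engine; the window size, threshold, and callback are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Callable, Deque, Iterable


class RollingZScoreDetector:
    """Flags a value that deviates strongly from the recent window."""

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.values: Deque[float] = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # A flat window (sigma == 0) is never flagged in this sketch.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        # The value is kept even when anomalous, so the baseline can drift.
        self.values.append(value)
        return anomalous


def monitor(stream: Iterable[float], detector: RollingZScoreDetector,
            on_anomaly: Callable[[float], None]) -> None:
    """Feed metric samples to the detector and trigger a handler on anomalies."""
    for sample in stream:
        if detector.observe(sample):
            on_anomaly(sample)  # in a real system this would start a repair flow


if __name__ == "__main__":
    memory_pct = [50, 51, 49, 50, 52, 50, 51, 95]
    monitor(memory_pct, RollingZScoreDetector(),
            lambda v: print(f"anomaly at {v}%"))
```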
With this AI-driven self-service architecture, Alibaba Cloud ECS currently detects memory anomalies in real time with an accuracy of more than 70%, and the end-to-end prediction latency is kept within 100 s. In addition, by combining expert experience, a case base, and a knowledge base, we have built a powerful diagnosis decision tree, which gives a solid basis for locating and repairing problems faster.
Over the past two years, the Alibaba Cloud elastic computing team has continuously invested in building data sets of abnormal behavior. In the future, the team plans to develop them into Alibaba Group’s “ImageNet” for anomaly prediction and to open-source them, hoping to contribute more to the development of anomaly prediction in the industry.
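A diagnosis decision tree of this kind can be illustrated with a tiny hand-written example for the “instance cannot be accessed remotely” problem. The questions, fact keys, and repair names below are hypothetical and only show the shape of such a tree, not the rules Alibaba Cloud actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

Facts = Dict[str, Any]  # hypothetical facts gathered by diagnosis


@dataclass
class Leaf:
    diagnosis: str
    repair: Optional[str]  # None means no automatic repair is suggested


@dataclass
class Node:
    question: str
    test: Callable[[Facts], bool]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def walk(node: Union[Node, Leaf], facts: Facts) -> Tuple[Leaf, List[Tuple[str, bool]]]:
    """Follow the tree to a leaf and return it with the path of answers taken."""
    path: List[Tuple[str, bool]] = []
    while isinstance(node, Node):
        answer = node.test(facts)
        path.append((node.question, answer))
        node = node.if_true if answer else node.if_false
    return node, path


REMOTE_ACCESS_TREE = Node(
    "Is the instance running?",
    lambda f: f.get("running", False),
    if_true=Node(
        "Does the security group allow port 22?",
        lambda f: 22 in f.get("allowed_ports", []),
        if_true=Node(
            "Is sshd running in the guest OS?",
            lambda f: f.get("sshd_running", False),
            if_true=Leaf("no known cause found", None),
            if_false=Leaf("sshd is not running in the guest OS", "restart_sshd"),
        ),
        if_false=Leaf("security group blocks port 22", None),
    ),
    if_false=Leaf("instance is stopped", "start_instance"),
)

if __name__ == "__main__":
    leaf, path = walk(REMOTE_ACCESS_TREE,
                      {"running": True, "allowed_ports": [22], "sshd_running": False})
    print(leaf.diagnosis, leaf.repair)
    for question, answer in path:
        print(question, answer)
```

Returning the path along with the leaf keeps the reasoning visible, which fits the article’s emphasis on transparent repair.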
This article is original content from Alibaba Cloud and may not be reproduced without permission.