Round table dialogue: challenges and opportunities for enterprise operation and maintenance in the cloud era


Introduction: four leading enterprise operation and maintenance experts held a dialogue on the “challenges and opportunities faced by enterprise operation and maintenance in the cloud era”.

Editor’s note: going to the cloud has become an irresistible choice for enterprises. The “software defines everything” nature of cloud computing is driving trends in automated operation and maintenance such as elastic scaling, DevOps, intelligent operation and maintenance, and infrastructure as code. These trends bring opportunities to upgrade enterprise R&D and operation and maintenance systems, and new challenges for enterprise architects and operation and maintenance engineers.

On December 10, at the 2021 Cloud Architecture and Operation and Maintenance Summit, Alibaba Cloud invited Dong Shixiao, director of CSDN ecological content, to moderate a dialogue with four guests: Chen Gang, technical director of the efficient operation and maintenance community in East China; Chen Jiong, senior solution architect at Red Hat; Li Tonggang, head of the search vehicle infrastructure department; and You Shouzhi, head of anymen operation and maintenance. They discussed “challenges and opportunities for enterprise operation and maintenance in the cloud era”.

The following is an edited transcript of the round table discussion:

Moderator: Dong Shixiao, director of CSDN ecological content

Four round table dialogue guests

Q1 Why should enterprises go to the cloud?

Moderator: Going to the cloud has become an industry consensus, but there are also some dissenting voices. All of our guests are veterans of cloud adoption. First, I’d like to ask: why do enterprises want to go to the cloud? What are the benefits, and what are the problems?

Chen Gang: This is a frequently discussed topic. The main driver for enterprises to go to the cloud is the large advantage in IT cost. Cloud vendors’ prices are falling year by year, which is a strong temptation for many enterprises. Some entry-level cloud server configurations cost less than 1,000 yuan a year, and small and medium-sized enterprises find it hard to ignore such a low price. If an enterprise instead buys a machine and places it in a hosting facility, the cost, including labor for maintenance, is estimated to be more than 10 times that of ECS. Cost controllability is therefore a major driver for enterprises to go to the cloud.

The second point is that as cloud computing technology develops, IaaS, PaaS, and SaaS platforms and applications are becoming more and more mature. Enterprises large and small hope to keep pace with cloud technology, enjoy the IT advantages it brings, and maintain a technological lead in fierce market competition.

At the same time, enterprises should weigh the advantages and disadvantages of going to the cloud rationally and make the choice that suits them best. I have worked in operation and maintenance for more than 20 years. The enterprises I have served include e-commerce companies, financial enterprises, and cloud vendors at home and abroad. I have taken part in building and operating data centers and have seen many successful and failed cases of enterprises going to the cloud.

The cost savings of going to the cloud are most obvious for small and medium-sized enterprises, because their requirements are largely standardized: front end and back end, middleware, and database. They have basically no need for a customized IT framework, and the standard solutions offered by cloud vendors are enough. However, once an enterprise reaches a certain scale, as with large enterprises in finance, banking, insurance, and securities, going to the cloud becomes more tortuous and can even increase IT costs in the initial stage.

This is because, while going to the cloud, many enterprises must keep their existing physical and virtual machines running stably and also keep services stable during the migration itself. They therefore need to invest additional manpower in technology pilots and exploration, and maintain service compatibility throughout the process. Many large enterprises also set higher requirements, such as an architecture rollback plan, and a rollback is almost a high-risk operation in itself. All of this increases the enterprise’s IT investment in the initial stage.

Chen Jiong: In the traditional operation and maintenance model we often talk about automation, and standardization is the premise of automation. One very important reason to go to the cloud is that we get standardized delivery. The cloud marketplace offers a wide range of standardized software and hardware. By using them we enjoy standardized services, which also makes later operation and maintenance much more convenient.

Li Tonggang: The first advantage of going to the cloud is speed. With external demand changing so rapidly today, every operation and maintenance engineer must consider how to deliver infrastructure faster and speed up business processes. After going to the cloud, you can use cloud resources to deliver quickly and realize business value.

The second advantage is savings, especially on security. After going to the cloud, enterprises can pay for cloud products on demand. Compared with the private deployment they used before, the cost is significantly lower.

You Shouzhi: I see four reasons why enterprises should go to the cloud:

◾ Machine room restrictions. A traditional physical machine room is limited in bandwidth, dedicated lines, and power, and cannot meet the needs of a fast-growing enterprise. Here the advantages of a cloud architecture are more obvious.

◾ Utilization rate. Physical machines in an IDC are configured with relatively high specifications, which inevitably leads to low utilization. Many technologies are designed to solve this problem, including workload co-location and containers, but none of them works particularly well. The core of the problem is the lack of elasticity.

◾ Middleware. Middleware provided on the cloud includes products such as cloud call, real person authentication, and intelligent voice interaction. For a small or medium-sized enterprise like Soul, investing a lot of manpower and energy at this stage to build such functional middleware in-house would certainly not pay off better than using the cloud products directly.

◾ Cost. The cost depends on the form and characteristics of the business. For small and medium-sized enterprises, operation and maintenance costs are lower on the cloud. One saving comes from shared capacity, the other from elasticity. The range of instance types, spot (bidding) instances, WAF, and native protection on the cloud are all provided in a shared model that saves enterprises money.

Q2 What are the biggest challenges of cloud operation and maintenance, and how can they be solved?

Moderator: From what has been shared so far, we can see that after going to the cloud, enterprises enjoy standardized services that are efficient, cost-effective, labor-saving, and secure. However, for application scenarios with special requirements, the relevant systems still need to be improved.

Next, I would like to talk about the challenges of cloud operation and maintenance. How do you solve these challenges in your own practice or when serving customers?

Chen Jiong: What we face now is no longer operation and maintenance in the traditional sense of monitoring, management, and control, but unified operation and maintenance, intelligent operation and maintenance, and in the future even cloud operation and maintenance. To put this into practice, we still need to solve the following problems:

◾ Realize unified operation and maintenance. The environment we face now is very complex. It is not a single machine room or an IDC in the traditional sense, but a multi-cloud environment that includes private cloud, public cloud, virtualization platforms, and, in the future, container platforms. Different platforms follow different logic and require different skills to operate, which places high demands on operation and maintenance personnel. We therefore hope to bridge the differences between platforms and operate all of them in the same way.

◾ Break the isolation of operation and maintenance. At present, each operation and maintenance team operates in isolation and lacks coordination and cooperation with each other. Isolated operation and maintenance will cause great trouble. For example, in the process of the project, all teams protect their own interests and are unwilling to take the initiative to claim and solve problems, which has a great impact on work efficiency.

◾ Avoid manual operation and maintenance. The current process still involves many manual operations, which cause efficiency and security problems. Frequently logging in to servers to run commands is also a potential security risk. We therefore hope to have a platform that does this repetitive work in place of people, so that manual repetition and frequent server logins can be avoided.

◾ Solidify the knowledge of operation and maintenance personnel. A lot of operation and maintenance knowledge is stored in people’s heads. This knowledge is very valuable, but no system captures and preserves it so that others can reuse it. We need to ensure that the team’s operation and maintenance capability does not suffer when these people are absent.

The above four points are the major challenges we are facing at present.

Li Tonggang: There are two stages for enterprises going to the cloud. The first stage is moving the IDC machine room to the cloud, and the second is moving the technical architecture to the cloud. At the level of technical architecture, the cloud and business processes are closely coupled, so compatibility issues arise. Many enterprises need multi-cloud, and how to make the infrastructure compatible with two clouds at the same time is an urgent problem.

We hope that a consensus can be reached on multi-cloud technical architecture and technical protocols in the future, so that the cost of cross-cloud compatibility for enterprises is truly reduced.

You Shouzhi: I think there are four main difficulties for enterprises going to the cloud:

◾ Migration costs. Moving from a traditional IDC to cloud computing is a renewal of the IT infrastructure, and both stability during migration and the original management model have to be rebuilt. This is indeed a big project.

◾ Safety and compliance. There is a risk of data leakage when data is migrated from the enterprise’s original IDC to the cloud.

◾ SLA guarantee and control. Enterprises sign SLA agreements with public cloud providers. Compared with what enterprises achieve themselves, the SLA of a public cloud is relatively high and can generally reach four nines, but when the public cloud does fail, the enterprise can do little about it.

◾ Long-term expenses. In the early stage, the migration happens at a fixed point in time, and the long-term cost of the overall cloud deployment can be calculated. However, as the enterprise expands and its business types change, resource usage tilts toward particular areas, and expenses can become uncontrollable.
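
To make the "four nines" figure in the SLA point above concrete, here is a minimal Python sketch that converts an availability ratio into the downtime it allows per year. The function name and the 365-day year are illustrative assumptions; real SLA contracts define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(availability: float, days: int = 365) -> float:
    """Return the downtime budget, in minutes, for a given availability ratio."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Hypothetical comparison of two availability levels.
    for label, ratio in [("three nines", 0.999), ("four nines", 0.9999)]:
        print(f"{label}: {allowed_downtime_minutes(ratio):.1f} minutes per year")
```

At four nines the budget is roughly 52.6 minutes of downtime per year, which shows why a single long public cloud outage can consume the whole allowance.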

Chen Gang: I mainly share three challenges that large enterprises will face in the process of going to the cloud.

◾ Large enterprises may be required by industry regulators, such as the CSRC, the CIRC, and the CBRC, to ensure the confidentiality and security of their data. As a result, much of their data cannot be handled entirely by a public cloud, so they can only build a private cloud or a unified group-wide cloud platform as a hybrid cloud solution, which in effect amounts to duplicated construction in disguise.

◾ Operation and maintenance personnel face the challenge of technological transformation. Before going to the cloud, many enterprises had built mature and stable operation and maintenance systems for physical and virtual machines. After going to the cloud, retraining operation and maintenance personnel is a major challenge. Training and upgrading the skills of existing staff, or recruiting new staff, is a long process.

◾ The mismatch between existing platforms and cloud native technologies. Some enterprise platforms, whether built in-house, developed on top of other products, or supplied by contractors, may not fully match cloud native services. For these enterprises the difficulty, time, and cost of going to the cloud can be double those of a typical small or medium-sized enterprise, and a successful transformation is still not guaranteed.

Q3 How widely is XOps accepted and implemented in China?

Moderator: The challenges of enterprise cloud operation and maintenance are closely tied to security, stability, compatibility, and the transferability of knowledge on the cloud. If these problems are solved, operation and maintenance work can be done better. In China we refer to the various kinds of XOps collectively as automated operation and maintenance. How well do enterprises accept XOps? In your experience, what are some good practices of automated operation and maintenance?

Li Tonggang: Automation has always been a goal in the field of operation and maintenance. I will look at the implementation of automated operation and maintenance from two aspects.

◾ First, alarm monitoring. Automated operation and maintenance involves a large number of monitored alarms, and if the volume is too large the alarms become ineffective. By accumulating and analyzing historical data we can summarize its trend, then adjust the thresholds of the indicators automatically through automatic learning and mathematical models. Alarms used to rely on fixed values, but fixed values alone are not accurate enough, so machine learning is used to learn the trend of historical data and raise alarms automatically. At present we have sorted out more than 100 indicators and are integrating with Alibaba Cloud’s SLS service.

◾ Second, automatic root cause analysis of faults. In the network topology, the service that raises the alarm should be the first to perceive the problem. If the logs along the whole chain, from the service to the database layer to the service layer, are complete, then in theory we can deduce from a business fault whether the cause lies in the database, the virtual machine, or somewhere else.
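
The idea in the first point, replacing a fixed alarm value with a threshold learned from history, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a rolling mean plus `k` standard deviations as a simple stand-in for the "mathematical models" mentioned, and the function names, window size, and sample latencies are all hypothetical rather than the speaker's actual system.

```python
from statistics import mean, stdev
from typing import List, Sequence


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Upper alarm threshold learned from history: mean + k * standard deviation."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * stdev(history)


def find_alarms(series: Sequence[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Return indexes whose value exceeds the threshold learned from the previous `window` points."""
    alarms = []
    for i in range(window, len(series)):
        if series[i] > dynamic_threshold(series[i - window:i], k):
            alarms.append(i)
    return alarms


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds with one spike.
    latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(find_alarms(latency_ms))
```

Because the threshold is recomputed from the recent window, it follows the normal level of each indicator instead of requiring a hand-tuned fixed value per metric.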

In short, by starting from the data, we hope to bring some excellent and unexpected results to the field of operation and maintenance.
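
The chain-based deduction described in the second point can be illustrated with a small Python sketch. It assumes a hypothetical dependency map and a set of components that are currently alarming, and it reports the alarming components whose own dependencies are healthy. The component names and the rule itself are assumptions for illustration, not the method used by the speaker's team.

```python
from typing import Dict, List, Set


def root_cause_candidates(depends_on: Dict[str, List[str]], alarming: Set[str]) -> List[str]:
    """Return alarming components none of whose dependencies are alarming."""
    candidates = []
    for component in sorted(alarming):
        deps = depends_on.get(component, [])
        if not any(dep in alarming for dep in deps):
            candidates.append(component)
    return candidates


if __name__ == "__main__":
    # Hypothetical call chain: business -> service -> database -> virtual machine.
    topology = {
        "business": ["service"],
        "service": ["database"],
        "database": ["virtual_machine"],
        "virtual_machine": [],
    }
    print(root_cause_candidates(topology, {"business", "service", "database"}))
```

In this example the business, service, and database layers all alarm while the virtual machine is healthy, so the database is the deepest failing component and the likeliest root cause. The approach only works when the logs and alarms along the whole chain are complete, as noted above.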

You Shouzhi: First, let’s talk about the concept of DevOps. DevOps has been accepted by many domestic companies. Its core advantage is that it improves labor efficiency and reduces repetitive work. Moving from DevOps to AIOps is our future direction: it completes the evolution from manual decision-making and manual execution to automatic decision-making and automatic execution. The following two points describe how AIOps has been put into practice at Soul:

◾ The first point is controlling resource cost. We start at the level of resource requests to prevent waste, then control the service load level (watermark) by automatically enabling elastic scale-out and scale-in, business metric awareness, and automatic traffic switching and scheduling, and finally apply an automatic circuit breaker mechanism for the business.

◾ The second point is business monitoring. First, analyzing monitoring indicators helps us quickly locate the root cause of a problem. Second, determining the fault type, the number of users affected, and the fault level, and recommending similar historical faults, helps us resolve the fault quickly.
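
The watermark-driven elastic scaling mentioned in the first point can be sketched as a simple proportional rule. This is a minimal illustration under stated assumptions: the function name, the 60% target utilization, and the replica limits are hypothetical, and this is not Soul's actual policy.

```python
def desired_replicas(current: int, utilization_pct: int, target_pct: int = 60,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Scale the replica count so that average utilization moves toward the target."""
    if current < 1 or target_pct <= 0 or utilization_pct < 0:
        raise ValueError("current must be >= 1, target_pct > 0, utilization_pct >= 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    wanted = (current * utilization_pct + target_pct - 1) // target_pct
    # Clamp to the configured bounds so scaling never runs away.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(4, 90))   # load above target: scale out
    print(desired_replicas(4, 30))   # load below target: scale in
```

The upper bound is what ties scaling back to cost control: the service grows with load but can never exceed the resource budget that was approved for it.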

Chen Gang: I will look at how XOps is being implemented in China from two aspects.

First, several top-tier domestic Internet companies already have a relatively mature understanding and application of XOps. In some fields these big companies themselves set the direction for XOps, and they also contribute original work internationally.

Second, over the past two years I have mainly done consulting and training on DevOps transformation for large domestic financial enterprises. Their understanding of DevOps is still in its infancy, and their attitude is to wait, watch, and follow. At the same time, they also hope to follow and understand AIOps, ChatOps, and GitOps in parallel.

For example, in October this year, Huatai Securities and Zhejiang Mobile passed the AIOps capability certification issued by the Information and Communication Research Institute under the Ministry of Industry and Information Technology, covering anomaly detection, alarm convergence, root cause analysis, and fault prediction. Shanghai Pudong Development Bank, Guotai Junan, and other large banks and securities firms are also building and certifying their AIOps capabilities.

Domestic XOps consulting engagements generally last half a year to a year, because there are many difficulties to overcome when putting XOps into practice. But as long as we keep moving forward, XOps will take root across China.

Chen Jiong: Red Hat also has a relatively complete solution for operation and maintenance automation. From many years of practice, we have found that domestic enterprises use automation mainly in the following scenarios:

◾ Use automation to drive standardization. Introducing an automation platform helps enterprises establish a set of standards, including how to standardize the settings of their IT systems and platforms.

◾ Use automation for routine management, such as automatic system inspection and automatic configuration management.

◾ Use an automation platform to help enterprises perform root cause analysis of faults and even self-healing.

◾ Help enterprises automate application releases and even disaster recovery switchover.

The scenarios that automation can cover are very rich. You can implement as many functions as you can imagine.

Q4 What is the core competitiveness of operation and maintenance personnel in the cloud era?

Moderator: To summarize, top-tier vendors have accepted and applied XOps to a relatively high degree, but across the industry as a whole XOps still has room to grow, for example in systematic adoption and application.

As mentioned earlier, going to the cloud saves money and labor and is more secure. Does this mean that many jobs, including operation and maintenance, will be replaced? How can operation and maintenance personnel build their core competitiveness in the cloud era? What do you think?

You Shouzhi: I will explain my views on this issue from three aspects.

◾ First, we should change our thinking. Operation and maintenance includes repetitive or simple tasks, such as provisioning resources or building the underlying base environment. Teams rely on this work heavily, but it does not necessarily have to be done by them.

◾ Second, the focus of the work changes. After going to the cloud, this repetitive or simple work is replaced by the capabilities of the public cloud itself. For operation and maintenance personnel this is not a bad thing. They can pay more attention to business stability, have more time to improve themselves, and bravely step out of their comfort zone.

◾ Finally, we should understand and make good use of the public cloud. My understanding is that the public cloud can meet more than 80% of the needs of all enterprises, but can hardly meet 100% of them. We should make good use of that 80%, build the private part faster and better, and deliver visible results sooner for the company and the business. The value of operation and maintenance is to improve business stability, which is what enterprises care about most, and having the public cloud provide part of this capability is a good solution.

Chen Gang: Enterprises don’t need as many operation and maintenance personnel after they go to the cloud. Will these people face unemployment? I will answer from my own experience.

A few years ago, I took part in a project to introduce DevOps into an enterprise, migrate its applications to k8s, and adopt some cloud native practices. During the project, I led two colleagues in exploring the technology from beginning to end, and we eventually produced an implementation plan.

At that time, the department had about 20 operation and maintenance colleagues, most of whose skills were based on traditional data centers and physical and virtual machines. During the transformation, some of them were indeed worried that their skills would lose their competitive edge. But in the course of the transformation we accumulated many documents and slide decks on best practices, ran training and briefings within the enterprise, and worked to bring the colleagues who wanted to learn up to the level the company required in time.

In a cloud transformation, an enterprise either trains its existing operation and maintenance personnel to meet the requirements of the transformation or brings in new personnel from outside. There is no third way. I believe that as long as operation and maintenance personnel want to improve their skills and keep pace with the times, they will be able to move forward steadily and contribute even more once they are working on the cloud.

Finally, fierce market competition in modern society is like sailing against the current: if you don’t advance, you fall back. As enterprises go to the cloud, some low-skilled operation and maintenance personnel who are unwilling to transform will indeed be eliminated. This is survival of the fittest, and it applies not only to the operation and maintenance industry but wherever there is a social division of labor.

Chen Jiong: In IT construction, products, processes, and people are always the three main topics. Going to the cloud does not mean unemployment for operation and maintenance personnel. Rather, the requirements placed on them have changed under the new environment and platform.

In the past, operation and maintenance personnel only needed to be able to type commands and write code and scripts. After going to the cloud, that is far from enough. They need to define the standards and the complete processes for operation and maintenance as a whole, and even manage the full life cycle from a closed-loop perspective. When analyzing the root cause of a fault, they need to be able to investigate independently from different angles. When building the environment, it is also very important to identify which software and systems can be integrated so that they work together better.

Therefore, operation and maintenance personnel will not be replaced, but their abilities must improve greatly to meet the needs of IT operation and maintenance in the future. This is my point of view. Thank you, host.

Li Tonggang: Looked at from another angle, I think this is actually an opportunity.

In the past, operation and maintenance personnel took pride in understanding various middleware technologies, but this may not be the most meaningful part of the job. The essence of operation and maintenance is to ensure business stability and reasonable IT cost. These two goals are not achieved by technology alone. They require solutions tailored to the actual situation of the company, and this forms a complete system.

At present this ability is difficult for machines to replace, so we can hand simple and repetitive work to machines and let people do what machines cannot. On the one hand, this improves people’s own skills. On the other hand, the company obtains direct business value.

Summary of round table dialogue

Moderator: Indeed, container and cluster technologies pose great challenges to operation and maintenance personnel, but what they need to do is actively embrace and learn new technologies. In addition, after going to the cloud, operation and maintenance personnel can do more of the things machines cannot do, such as defining processes and specifications.

Although enterprises still face many challenges on the cloud, the opportunities are greater. I believe that as enterprise operation and maintenance systems improve, new operation and maintenance technologies lend their support, and the core competence of operation and maintenance personnel grows, the cloud will become better and better. Let’s work together.

