Stone from other mountain – which is the best operation and maintenance platform?


The full DevOps pipeline

The following figure shows the familiar software R&D pipeline. In an R&D organization that iterates frequently, the team may go through this cycle several times a day. For organizations with a large user base or a rapidly expanding business, shipping quickly is not enough: ensuring high reliability and high availability becomes the focus. In other words, a service should go online fast and run well.


How can development be made simpler and more efficient? Let’s discuss the question from two perspectives:

  • Organization mode
  • R&D tools

Organization of operation and maintenance personnel

  • One way is to set up a dedicated operation and maintenance team, which often supports several development teams at once. Apart from specialists who maintain a particular piece of middleware, such as DBAs, the drawback of this model is that many operation and maintenance engineers are buried in chores such as environment configuration, log collection, service recovery, and recording symptoms. They have no time to read the project source code or improve their skills, so deeper business problems are hard for them to analyze. The development team is often too busy to help, so operation and maintenance easily ends up in a passive position.
  • The other way is for developers to both develop and operate their own modules. The advantage is that developers know the module's source code, so they locate problems far more efficiently. They also get feedback directly from downstream users and can feed it back into R&D. The drawback is that once developers are caught up in frequent user troubleshooting, time for writing code is hard to guarantee.

In recent years, SRE (site reliability engineering), a senior operation and maintenance role, has also emerged in China, especially in the cloud computing industry. An SRE needs solid knowledge of networking, programming, algorithms, data structures, operating systems, security, and more. A network or system failure on a cloud platform can be fatal to cloud tenants and users, so many SREs are former senior developers.

In Google's SRE service reliability hierarchy, SRE keeps application services healthy through product, development, capacity planning, testing, root cause analysis, incident response, and monitoring. The hierarchy shows that Google wants operation and maintenance to actively steer how a service evolves, not just fight fires after an incident. For now, this elite style of operation and maintenance still needs to be explored and practiced in China.


Neither crudely separating development from operation and maintenance nor simply merging the two is particularly suitable. Drawing on the author’s R&D experience, one approach offered for discussion is to divide the work according to the actual business situation: for example, the developers on a team take turns being responsible for operation and maintenance of the whole project. Because every developer has to know the project's shared code and can pick up other modules' code relatively quickly, the developer on duty can resolve most problems, and the small remainder can be handled by pairing with the owner of the module concerned. In addition, “assigning an O&M contact to each service team” and “inviting O&M engineers to development team meetings” are measures that strengthen cooperation between O&M and development.
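The rotation described above could be sketched roughly as follows. This is only an illustration: the team members, module names, weekly cadence, and the helper functions `build_rotation` and `escalation_contact` are all assumptions made for the example, not part of any particular tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(developers, start, weeks):
    """Assign one on-duty developer per week, round-robin, from `start`."""
    schedule = []
    for i, dev in enumerate(islice(cycle(developers), weeks)):
        schedule.append((start + timedelta(weeks=i), dev))
    return schedule


def escalation_contact(module, module_owners, on_call):
    """Return the person to pair with for a module-specific problem.

    Falls back to the on-duty developer when the module has no named owner.
    """
    return module_owners.get(module, on_call)


# Hypothetical team and module ownership, for illustration only.
team = ["alice", "bob", "carol"]
owners = {"billing": "carol", "search": "bob"}

rotation = build_rotation(team, date(2024, 1, 1), weeks=4)
for week_start, dev in rotation:
    print(week_start.isoformat(), dev)
```

A real team would likely also account for holidays and hand-over notes; the point here is only that the on-duty developer handles most issues and pairs with a module owner for the rest.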

About the use of tools

In addition to the right way to organize people, the right tools can also empower the R&D team.

When setting up the R&D environment, an organization can build its own code management and continuous build environment from open source tools. The drawback is that a dedicated CI team is needed to maintain the continuous build environment; once that environment breaks, development stops. And because the open source tools do not share data, developers have to switch between several tools. The other option is an existing software development management system, such as the Coding R&D management system, which provides one-stop R&D process management without the need to build and maintain a collection of R&D tools and environments. It supports the complete software development process in the browser, so coding can happen anytime, anywhere.

Once developers have rapidly developed and deployed an application through the Coding R&D management system, the next step is to keep it running reliably with the help of operation and maintenance monitoring tools. (Not every application needs such tools; they should be chosen for the case at hand.) R&D organizations can develop their own operation and maintenance tools or adopt existing ones.

Operation and maintenance tools are increasingly application-oriented, because applications deliver business capabilities directly to users, and both development and operation and maintenance are driven by business value. Mainstream tools mainly cover infrastructure-level monitoring, application-level monitoring, and business-level analysis and monitoring.


Next, let’s look at the specific capabilities that existing O&M tools generally provide:

  • Monitoring of infrastructure environment: report the utilization of CPU, memory, disk, file system, network and other resources of the server as a whole.
  • Application performance monitoring: monitor the access efficiency of the middleware the application uses, such as persistent databases, cache databases, and message middleware; monitor how quickly the application responds to requests, including latency and throughput.
  • Application call link tracing: in a distributed system, a request often has to be processed by multiple processes. When a user request fails or returns an error, the operation and maintenance platform supports analyzing the whole call link and locating the faulty hop.
  • Log data collection and analysis: logs are collected mainly to support call link analysis and performance monitoring, so operation and maintenance personnel do not need to log in to the servers and search through large volumes of logs.
  • Automatic fault recovery
  • Flexible alarms
  • Visual dashboards that display monitoring and alarm information
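To make the performance monitoring and alarm items above more concrete, here is a minimal sketch using only the Python standard library. The function names, the sample latencies, and the rule format `(metric, comparator, threshold)` are assumptions for illustration; real platforms define their own metric names and rule syntax.

```python
import math
import operator
import statistics

COMPARATORS = {">": operator.gt, "<": operator.lt}


def summarize_latency(samples_ms, window_seconds):
    """Return throughput and latency statistics for one observation window."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: the smallest value that covers 95% of samples.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "throughput_rps": len(ordered) / window_seconds,
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[rank - 1],
    }


def evaluate_rules(metrics, rules):
    """Return one message for every alarm rule whose condition holds."""
    alerts = []
    for metric, op, threshold in rules:
        value = metrics[metric]
        if COMPARATORS[op](value, threshold):
            alerts.append(f"{metric} {op} {threshold} (current: {value})")
    return alerts


# Hypothetical request latencies (milliseconds) observed in a 5-second window.
samples = [12, 15, 11, 14, 13, 250, 16, 12, 13, 15]
stats = summarize_latency(samples, window_seconds=5)

# Rules are plain data, so they can be changed without touching the code.
rules = [("p95_ms", ">", 200), ("throughput_rps", "<", 1)]
alerts = evaluate_rules(stats, rules)
print(alerts)
```

Keeping the rules as data rather than code is one simple way to get the "flexible" part: thresholds can be tuned per service without redeploying the monitor.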

Popular operation and maintenance tools from abroad include Zipkin (distributed tracing), Pinpoint (distributed tracing), and Logstash (data collection). The major domestic cloud vendors also generally provide application operation and maintenance platforms, including Tencent BlueKing, Alibaba ARMS, and Huawei APM. The following is a brief comparison of the capabilities of these platforms:


At present, most operation and maintenance platforms collect application metrics through agents and probes, then aggregate and process the data and present it in a visual interface. Beyond the tools and platforms above, AIOps is gradually becoming a trend. AIOps applies AI techniques to diagnose business faults intelligently and recover applications automatically, in an attempt to let R&D organizations leave the era of purely manual operation and maintenance behind. The author looks forward to that day. Operation and maintenance personnel need not worry about being put out of work by AIOps: tools and platforms only improve efficiency and will not replace the people.
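The probe-and-aggregate pattern mentioned at the start of this paragraph can be illustrated with a small in-process sketch. Everything here is hypothetical: the `Collector` class stands in for the platform side, `probe` is a made-up decorator, and `load_order` is a sample business function. Real agents typically instrument code automatically and ship data over the network.

```python
import time
from collections import defaultdict
from functools import wraps


class Collector:
    """In-memory stand-in for the platform side that receives probe data."""

    def __init__(self):
        self.durations = defaultdict(list)
        self.errors = defaultdict(int)

    def record(self, name, seconds, failed):
        self.durations[name].append(seconds)
        if failed:
            self.errors[name] += 1

    def summary(self):
        """Aggregate raw samples into per-function call, error, and timing stats."""
        return {
            name: {
                "calls": len(values),
                "errors": self.errors[name],
                "avg_seconds": sum(values) / len(values),
            }
            for name, values in self.durations.items()
        }


def probe(collector):
    """Decorator that times a function and reports the outcome to `collector`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            failed = False
            try:
                return func(*args, **kwargs)
            except Exception:
                failed = True
                raise  # the probe observes the error but does not swallow it
            finally:
                elapsed = time.perf_counter() - started
                collector.record(func.__name__, elapsed, failed)
        return wrapper
    return decorator


collector = Collector()


@probe(collector)
def load_order(order_id):
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"id": order_id}


load_order(1)
load_order(2)
try:
    load_order(-1)
except ValueError:
    pass

report = collector.summary()
print(report)
```

The business function stays unaware of the monitoring; only the decorator line ties it to the collector, which is the property that makes probes practical to roll out.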

After defects are found in the operation and maintenance phase, developers can handle them in Coding, recording each defect's type, priority, module, description, handler, and other information. Software defects are inevitable, but only by tracking and reviewing them can we learn their causes (human factors, environmental factors, tool problems, and so on) and so avoid repeating similar defects. Defect management also helps managers assess software quality accurately. The handler can quickly fix a defect and deploy the fix through Coding, which greatly shortens recovery time and reduces the business loss the defect causes.
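As a rough illustration of the defect fields listed above, a record and a simple cause review might look like the sketch below. The field names, the `Priority` levels, and the sample defects are assumptions for this example and do not reflect the data model of Coding or any other system.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Defect:
    defect_type: str
    priority: Priority
    module: str
    description: str
    handler: str
    cause: str = "unknown"  # e.g. "human", "environment", "tool"


def causes_by_module(defects):
    """Count recorded causes per module so recurring patterns stand out."""
    counts = {}
    for defect in defects:
        counts.setdefault(defect.module, Counter())[defect.cause] += 1
    return counts


# Hypothetical defects, for illustration only.
defects = [
    Defect("crash", Priority.HIGH, "billing", "null amount", "carol", "human"),
    Defect("timeout", Priority.MEDIUM, "billing", "slow DB host", "carol", "environment"),
    Defect("crash", Priority.HIGH, "billing", "missing check", "bob", "human"),
    Defect("ui", Priority.LOW, "search", "wrong label", "bob"),
]

review = causes_by_module(defects)
print(review)
```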

Guided by the DevOps philosophy, the author suggests that developers writing business code think beyond features about how to write maintainable code, and improve operation and maintenance efficiency through appropriate logs, error codes, exceptions, and similar measures. Operation and maintenance personnel also need to build up their skills step by step, move away from traditional, labor-intensive operation and maintenance, and take the road toward agile, automated operation and maintenance.

Closing remarks

With the marked improvement in DevOps toolchain automation, the barrier to DevOps keeps getting lower. One result of embracing automation is that the R&D process becomes quieter and quieter. In top open source projects, committers settle things day to day through email and issues alone, without heated, drawn-out meetings or brightly colored worksheets. All of this, however, rests on good DevOps practice. We believe that as DevOps is put into practice, software development will become simpler and more efficient.
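One possible shape for the "log, error code, exception" advice is sketched below. The error codes, the `ServiceError` class, and `find_order` are invented for illustration; the point is only that a stable code plus context in the message lets operation and maintenance staff search logs without reading the source.

```python
import logging
from enum import Enum


class ErrorCode(Enum):
    # Hypothetical codes; a real project would document its own scheme.
    ORDER_NOT_FOUND = "ORD-404"
    PAYMENT_TIMEOUT = "PAY-504"


class ServiceError(Exception):
    """Exception that carries a stable error code and searchable context."""

    def __init__(self, code, message, **context):
        super().__init__(message)
        self.code = code
        self.context = context

    def __str__(self):
        details = " ".join(f"{k}={v}" for k, v in sorted(self.context.items()))
        return f"[{self.code.value}] {self.args[0]} {details}".strip()


logger = logging.getLogger("orders")


def find_order(orders, order_id):
    """Look up an order, translating a bare KeyError into a coded error."""
    try:
        return orders[order_id]
    except KeyError:
        raise ServiceError(
            ErrorCode.ORDER_NOT_FOUND, "order not found", order_id=order_id
        ) from None


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

try:
    find_order({1: "book"}, 42)
except ServiceError as err:
    last_error = err
    logger.error("%s", err)
```

With this in place, a search for `ORD-404` in the collected logs finds every occurrence regardless of how the human-readable message is later reworded.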

References:…
Gene Kim; Jez Humble; Patrick Debois; John Willis. DevOps Practice Guide