Author: Mu Ye
Application Real-Time Monitoring Service (ARMS) is an application performance management (APM) product with three sub-products: application monitoring, Prometheus monitoring, and front-end monitoring. It covers distributed applications, container environments, browsers, mini programs, and apps. It helps users achieve full-stack performance monitoring and end-to-end full-link tracing and diagnosis, making application operation and maintenance easier and more efficient than ever.
I am mainly responsible for the front-end monitoring platform of Alibaba Cloud ARMS, which is a fairly technical product. I want to talk about how to grow through business work, a process in which I have had my own share of confusion. I hope my experience and methods can help front-end engineers in similar situations.
My personal growth falls into three main stages:
(1) First contact with the monitoring field: building my own monitoring knowledge system
(2) Following up on business pain points: building the core capabilities of the monitoring platform
(3) Business scenarios keep expanding: building a system to ensure business stability
First contact with the monitoring field: building my own monitoring knowledge system
The problems the business faced at the start were these: after a business went online, users hit errors during real visits and the business team could not detect them quickly; after an online failure, many scenarios could not be reproduced quickly, so the cause could not be investigated. Based on these pain points, the team decided to build a front-end monitoring platform.
I formally entered the front-end monitoring field with Retcode 2.0. Facing a new field, I needed to build my own monitoring knowledge system quickly, from 0 to 1. It was a demanding but fulfilling process, and finishing it brought a great sense of accomplishment. Here is a summary of the experiences I consider most important when facing the unknown and its challenges.
Have the courage to break your own boundaries and expand your technology stack
The whole front-end monitoring system can be summarized as: acquisition, log storage, log segmentation and calculation, data analysis, and alarming. The work is therefore no longer limited to front-end business development. It also requires Nginx service operation and maintenance, real-time and offline analysis, Node application development and operation, and more. So I took the first step from front end to full stack. This made me familiar with, and able to control, the whole front-end monitoring process from acquisition to final alarm and diagnosis, and on that basis I could cover the whole monitoring platform.
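The stages listed above can be sketched as a tiny in-memory pipeline. This is only an illustrative TypeScript sketch, not how ARMS is implemented: the `LogRecord` shape, the `timestamp|page|type` line format, the function names, and the alarm threshold are all assumptions made for the example.

```typescript
// Hypothetical record shape; the real ARMS log schema is not described in this article.
interface LogRecord {
  page: string;
  type: "error" | "pv";
  timestamp: number;
}

// Acquisition: a client-side SDK would serialize each event into one log line.
// The "timestamp|page|type" format is an assumption made for this sketch.
function collect(record: LogRecord): string {
  return `${record.timestamp}|${record.page}|${record.type}`;
}

// Log segmentation: split a raw line back into fields; malformed lines yield null.
function parseLine(line: string): LogRecord | null {
  const parts = line.split("|");
  if (parts.length !== 3) return null;
  const timestamp = Number(parts[0]);
  if (!isFinite(timestamp)) return null;
  const kind = parts[2];
  if (kind === "error" || kind === "pv") {
    return { page: parts[1], type: kind, timestamp };
  }
  return null;
}

// Calculation: error rate per page (error events divided by all events).
function errorRateByPage(records: LogRecord[]): Record<string, number> {
  const totals: Record<string, { errors: number; all: number }> = {};
  for (const r of records) {
    const t = totals[r.page] || { errors: 0, all: 0 };
    t.all += 1;
    if (r.type === "error") t.errors += 1;
    totals[r.page] = t;
  }
  const rates: Record<string, number> = {};
  for (const page of Object.keys(totals)) {
    rates[page] = totals[page].errors / totals[page].all;
  }
  return rates;
}

// Alarm: pages whose error rate exceeds a (hypothetical) threshold.
function pagesToAlarm(rates: Record<string, number>, threshold: number): string[] {
  return Object.keys(rates).filter((page) => rates[page] > threshold).sort();
}

// Usage: three valid lines and one malformed line flow through every stage.
const rawLines = [
  collect({ page: "/home", type: "pv", timestamp: 1 }),
  collect({ page: "/home", type: "error", timestamp: 2 }),
  collect({ page: "/cart", type: "pv", timestamp: 3 }),
  "garbage line",
];
const records = rawLines
  .map(parseLine)
  .filter((r): r is LogRecord => r !== null);
console.log(pagesToAlarm(errorRateByPage(records), 0.25));
```

In a real deployment each stage would run in a different place (browser SDK, Nginx log server, stream or batch jobs, alarm service), which is why covering the whole chain requires skills beyond front-end development.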
For the products we are responsible for, we need a strong sense of ownership: grow the work, make it stronger, and serve customers well. The development, iteration, optimization, and innovation of every feature should be taken seriously so that every link is as good as it can be. In this process my role also changed, from implementing features at the start to selecting and deciding the technical solutions behind the product's capabilities. Over this period, these changes happened almost without my noticing.
Take one of my own experiences as an example. At first, the log server was deployed by operations colleagues, who configured it directly on the machines and then brought the service up. After I took over, I ran into a big problem: capacity expansion. Expanding a normal application is very simple. You submit an expansion request through PSP and the expansion completes quickly. However, the Nginx log service at that time had no baseline configuration, so it could not be expanded directly through PSP and had to be configured manually.
For capacity expansion, the scheme at that time had two problems:
(1) Manually configuring each expansion took a lot of time.
(2) There was no way to guarantee that package versions were consistent across all machines.
To solve these problems, I needed to understand both the Nginx log service and the related operations capabilities. After discussions with PE and back-end colleagues, we decided to use Docker to solve the capacity expansion problem. This not only improved operations efficiency but also laid a good foundation for later overseas business support.
So my advice is not to stop at completing the product's features. Have a sense of ownership, examine the problems the business faces carefully, and take the initiative to follow up and make changes. The accumulation is slow, but it eventually produces a qualitative change.
Following up on business pain points: building the core capabilities of the monitoring platform
After the platform was built from 0 to 1, it was officially upgraded to the front-end monitoring platform of Alibaba Cloud ARMS in order to serve more businesses and polish the product's capabilities. The basic monitoring capabilities were all online. How to develop from there was a question I had to think about. If we had only kept iterating and optimizing on that basis, neither the product nor I would have seen any obvious breakthrough or growth.
Technical products are mostly led by engineers. When a product reaches a certain stage, it faces the question of how to make the next breakthrough. I have two suggestions:
(1) Dig into the problems the business faces and develop technical solutions for them.
First, ask yourself a few questions:
· Who is the business party?
· What problems do business teams run into when using your product?
· What are the demands of the business?
· What are the problems with your product?
If we dig deep into these questions and list the top five answers, we will find many things worth doing and many places to break through.
In the early days of front-end monitoring, products focused on the statistical display of collected and reported data. Anomalies were found through fluctuations in the indicators, and locating them depended directly on the original logs. If the original logs did not contain the needed information, business engineers had to reproduce and investigate the problem themselves. Much of the time the statistics could not be explained, which led business engineers to question the accuracy of the data. Monitoring products therefore had to evolve from data statistics to problem locating. At this stage, I led and completed the corresponding problem diagnosis link.
(2) Expand your vision (technology and business).
Before leading a product plan or drawing up a technical plan, do research in advance to support the decision. The purpose of the research is to broaden our technical and business vision. Ways to do this include:
· Competitive product analysis: mature products such as Tingyun, Dynatrace, and OneAPM.
· Input from and discussion with colleagues in related fields: product, back-end application monitoring, and so on.
When investigating an online problem, neither front-end monitoring nor application monitoring alone leads directly to the cause. After broadening my understanding of the field, I discussed the problem with colleagues working on back-end middleware, and we eventually developed a full-link monitoring scheme that meets the needs of business problem investigation. This case shows that without stepping outside our own area, we could neither see the problem nor offer a solution.
Business scenarios keep expanding: building a system to ensure business stability
How do you find a breakthrough once the overall product capability has stabilized? I went down the wrong path here too. I hoped to make a breakthrough in technology, so my starting point was to ask which technologies would show the depth of my product, and the more I thought about it, the more confused I became. The correct starting point is still the one from the second part: start from business pain points and develop solutions. The difference at this stage is that the solution is no longer a single-point fix but a systematic one.
I have some experiences to share:
Keep an open mind and pursue win-win cooperation
Technical products receive demands from many business teams, and with limited manpower it is very hard to support all of them. In that situation we can adjust our mindset and invite colleagues from the requesting team to build together. This meets the business team's demands better and lets the product keep expanding the scenarios it supports.
Front-end technology develops very quickly. When mini programs began to take off, monitoring demands for them followed. At that time, however, the team was not familiar with the technical architecture of mini programs, and building monitoring on top of it was costly. DingTalk had many mini programs with heavy traffic and a strong need for monitoring. After weighing the business demands against product development, we cooperated with DingTalk colleagues to support the monitoring needs of various mini programs. In this cooperation I saw clearly how complementary strengths can deliver twice the result with half the effort.
Understand the whole system first, then consider solutions across every part of it
As more and more businesses connected to the platform, problems with the computing and storage resources required for monitoring kept surfacing. At this point my job was to ensure not only the stability of the business but also the stability of the platform itself. I therefore designed and drew up safeguard plans for the acquisition, reporting, storage, and computing stages. With this systematic stability work in place, we can find risks one minute before a major sales promotion, keep the platform stable and support the monitoring needs of every venue during the promotion, and afterwards consolidate what we learned into the daily stability operation and maintenance work.
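The article does not describe the concrete safeguards used at each stage. As one illustrative possibility for the acquisition and reporting stages, the sketch below combines sampling with a cap on reports per time window, so that a traffic spike cannot flood storage and computing. The `ReportLimiter` class, its parameters, and the values used are hypothetical and are not taken from ARMS.

```typescript
// Hypothetical safeguard for the acquisition/reporting stages: sample events and
// cap the number of reports per time window.
class ReportLimiter {
  private sampleRate: number; // fraction of events to keep, between 0 and 1
  private maxPerWindow: number; // hard cap on reports within one window
  private windowMs: number; // window length in milliseconds
  private random: () => number; // injectable random source, for deterministic tests
  private windowStart = -Infinity; // so the first event always opens a window
  private sentInWindow = 0;

  constructor(
    sampleRate: number,
    maxPerWindow: number,
    windowMs: number,
    random: () => number = Math.random,
  ) {
    this.sampleRate = sampleRate;
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
    this.random = random;
  }

  // Returns true if an event occurring at time `now` (in ms) should be reported.
  shouldReport(now: number): boolean {
    if (now - this.windowStart >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = now;
      this.sentInWindow = 0;
    }
    if (this.random() >= this.sampleRate) return false; // dropped by sampling
    if (this.sentInWindow >= this.maxPerWindow) return false; // dropped by the cap
    this.sentInWindow += 1;
    return true;
  }
}

// Usage: keep every event (sample rate 1) but allow at most 2 reports per second.
const limiter = new ReportLimiter(1, 2, 1000, () => 0);
console.log([0, 10, 20, 1000].map((t) => limiter.shouldReport(t)));
// prints [ true, true, false, true ]
```

Dropping data at the client is the cheapest place to shed load, but it trades completeness for stability, so limits like these would need to be tuned per business.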
Everyone's experience and responsibilities are different, so nobody can directly copy another person's success. Many of the points summarized here are also easy to understand but hard to do. Still, you can pick out what you agree with from the experience and summaries of excellent colleagues, stick with it, and keep practicing it yourself. Only through continuous practice does it gradually become your own ability.
This article is original content of Alibaba Cloud and may not be reproduced without permission.