Summary: This article describes how DLI implements monitoring and alarms to strengthen its operation and maintenance capability, so that it can better deliver a serverless service to customers.
DLI is a serverless big data computing service that supports multi-mode engines. Because it is a serverless cloud service, customers do not have to operate or maintain anything themselves, and that is one of its key features. So how do we, as the provider, operate and maintain the service itself? This article describes how DLI implements monitoring and alarms to strengthen its overall operation and maintenance capability and better deliver serverless DLI to customers.
The figure above shows the overall deployment architecture of the DLI service. As a serverless service, DLI fully embraces cloud-native technology. Both the microservices that provide task management and the computing units that ultimately execute tasks are deployed on Kubernetes, which also makes it easier to achieve the rapid elastic scaling that serverless requires.
For monitoring and alarms on the DLI service, we mainly consider the following dimensions:
1. Global dimension: mainly the QPS, success rate, and response latency of the API as a whole
As a serverless big data computing service, DLI exposes its functionality externally as a REST API. The QPS and response latency of the API therefore directly reflect the service's external capability, and the success rate directly reflects the service SLA.
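The three global indicators can be derived from per-request records. The sketch below is only an illustration: the `ApiRequest` record shape, the `summarize` helper, and the rule that only 5xx responses count as failures are all assumptions, not details of how DLI computes its metrics.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApiRequest:
    """One API call as it might appear in an access log (hypothetical shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time from request received to response sent


def summarize(requests: List[ApiRequest], window_seconds: float) -> Dict[str, float]:
    """Compute QPS, success rate, and p99 latency for one time window."""
    total = len(requests)
    if total == 0:
        return {"qps": 0.0, "success_rate": 1.0, "p99_latency_ms": 0.0}
    # Assumption: only 5xx responses count against the success rate.
    succeeded = sum(1 for r in requests if r.status < 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: the smallest sample with at least 99% of
    # all samples at or below it.
    rank = math.ceil(0.99 * total)
    return {
        "qps": total / window_seconds,
        "success_rate": succeeded / total,
        "p99_latency_ms": latencies[rank - 1],
    }


# Ten made-up requests observed over a 5-second window, one of them failing.
window = [ApiRequest(200, 10.0 * (i + 1)) for i in range(9)] + [ApiRequest(500, 400.0)]
stats = summarize(window, window_seconds=5.0)
```

A tail percentile is used here instead of the mean because a small number of slow requests can hide behind a healthy average.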
2. OS dimension: mainly the CPU utilization, memory utilization, disk utilization, and inbound and outbound network traffic of the container hosts
No matter how the deployment architecture and technology evolve, monitoring basic resources remains the most fundamental and necessary layer.
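Host-level alarms of this kind usually reduce to comparing utilization samples against thresholds. The following is a minimal sketch; the threshold values, the metric names, and the `sustained_breaches` helper are hypothetical. It raises an alarm only when a metric stays above its limit for several consecutive samples, which is one common way to avoid alarming on short spikes.

```python
from typing import Dict, List

# Hypothetical alarm thresholds, in percent utilization.
THRESHOLDS: Dict[str, float] = {"cpu": 85.0, "memory": 90.0, "disk": 80.0}


def sustained_breaches(
    samples: Dict[str, List[float]],
    thresholds: Dict[str, float],
    consecutive: int = 3,
) -> List[str]:
    """Return the metrics whose last `consecutive` samples all exceed the limit."""
    alarms = []
    for metric, limit in thresholds.items():
        recent = samples.get(metric, [])[-consecutive:]
        # Require a full run of samples so that a single spike, or a host
        # that has only just started reporting, does not raise an alarm.
        if len(recent) == consecutive and all(value > limit for value in recent):
            alarms.append(metric)
    return alarms


# Made-up utilization samples for one host, oldest first.
host_samples = {
    "cpu": [50.0, 90.0, 91.0, 95.0],   # last three all above 85
    "memory": [95.0, 70.0, 96.0],      # dipped below 90 in between
    "disk": [81.0, 82.0],              # too few samples so far
}
alarms = sustained_breaches(host_samples, THRESHOLDS)
```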
3. Container dimension: mainly CPU utilization, memory utilization, k8s space and user space utilization, and pod health
Containers are an evolution of virtual machines, so container resource monitoring is just as fundamental. Our microservices and computing units run as containers on the Kubernetes cluster, so pod health must be monitored as well.
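A pod health check can be sketched as a pure function over pod objects. The example below assumes the pods have already been fetched and are plain dictionaries shaped like Kubernetes Pod objects (`status.phase`, `status.containerStatuses[].ready`, `status.containerStatuses[].restartCount`). The `unhealthy_pods` helper, the restart limit, and the pod names are hypothetical.

```python
from typing import Any, Dict, List


def unhealthy_pods(pods: List[Dict[str, Any]], max_restarts: int = 5) -> List[str]:
    """Return one short reason string per pod that looks unhealthy."""
    flagged = []
    for pod in pods:
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        containers = status.get("containerStatuses", [])
        if phase == "Succeeded":
            continue  # finished batch work is not a fault
        if phase != "Running":
            flagged.append(f"{name}: phase={phase}")
        elif any(not c.get("ready", False) for c in containers):
            flagged.append(f"{name}: container not ready")
        elif any(c.get("restartCount", 0) > max_restarts for c in containers):
            flagged.append(f"{name}: too many restarts")
    return flagged


# Made-up pods illustrating each branch.
pods = [
    {"metadata": {"name": "api-0"},
     "status": {"phase": "Running",
                "containerStatuses": [{"ready": True, "restartCount": 0}]}},
    {"metadata": {"name": "api-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"ready": False, "restartCount": 2}]}},
    {"metadata": {"name": "job-7"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "worker-3"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "worker-4"},
     "status": {"phase": "Running",
                "containerStatuses": [{"ready": True, "restartCount": 9}]}},
]
report = unhealthy_pods(pods)
```

Keeping the check separate from the code that fetches pods makes it easy to test against recorded cluster states.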
4. Microservice dimension: mainly traffic, performance, health checks, and key logs
The purpose of monitoring is to find and solve problems sooner, so business-level monitoring is the core. DLI is a complex distributed serverless application that is split into microservices along domain-model boundaries. The internal traffic and performance of each microservice are therefore important indicators of its reliability. A good system usually has a well-developed logging system, and monitoring key logs helps us find and locate problems quickly. This is why key logs are also a focus of our business-dimension monitoring.
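Key-log monitoring can be illustrated as pattern matching over log lines, with an alarm raised when a pattern appears often enough. In this sketch the patterns, the log line format, and the `count_key_logs` helper are all hypothetical. They are not taken from DLI's actual logs.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical key-log patterns. A real deployment would curate its own.
KEY_PATTERNS = {
    "oom": re.compile(r"OutOfMemoryError"),
    "timeout": re.compile(r"timed out", re.IGNORECASE),
    "error": re.compile(r"\bERROR\b"),
}


def count_key_logs(lines: Iterable[str]) -> Counter:
    """Count how many lines match each key pattern (a line may match several)."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in KEY_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Made-up log lines in an illustrative format.
sample_lines = [
    "2024-01-01 10:00:00 INFO job submitted",
    "2024-01-01 10:00:01 ERROR java.lang.OutOfMemoryError: Java heap space",
    "2024-01-01 10:00:02 WARN request Timed Out after 30s",
    "2024-01-01 10:00:03 ERROR connection timed out",
]
counts = count_key_logs(sample_lines)
# Raise an alarm for any pattern seen at least twice in this batch.
alarming = sorted(name for name, n in counts.items() if n >= 2)
```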
Monitoring these dimensions is a key step toward automated operation and maintenance of the cloud service. With it, we are better able to find problems before customers do and uphold the service SLA. Of course, this is far from enough. As the saying goes, “The road ahead is long; I shall search high and low.” More automated and intelligent operation and maintenance remains the goal for a serverless service.