Unveiling the top AI model for KPI anomaly detection

Time: 2021-09-25

Abstract: The 2020 GDE global developer competition on KPI anomaly detection has come to an end. The team "Atomic Bomb from Introduction to Mastery" from Lianhua Street, Futian, Shenzhen was fortunate to finish Top 1 on the overall leaderboard. Here I share the team's solution to this competition.


Background

The core network plays an important role in a mobile operator's network. Its anomalies often lead to live-network failures such as call failures and network delays, which significantly degrade the service quality of the whole network, typically affecting hundreds of thousands of users and causing large-scale complaints [1]. It is therefore necessary to detect core-network anomaly risks quickly and eliminate faults before their impact spreads.
KPIs are indicators that reflect network performance and equipment status. This competition provides real KPI data from an operator's core network as KPI time series with a 1-hour sampling interval. Contestants need to model the data from [2019-08-01, 2019-09-23), use the trained model to predict the following 7 days, and identify the outliers in each KPI series over that week.

Evaluation metric:

The F1 score is used as the evaluation metric in this competition, calculated as follows:

P = TP/(TP+FP)

R = TP/(TP+FN)

F1 = 2PR/(P+R)
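For concreteness, here is a minimal Python sketch of the metric; the function name and the counts in the example are illustrative, not part of the competition toolkit:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 80 anomalies correctly flagged, 10 false alarms, 20 missed anomalies
print(f1_from_counts(tp=80, fp=10, fn=20))  # about 0.842
```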

Data exploration

The data contains 20 different KPIs. Different KPIs have different physical meanings and represent different core-network indicators. Since the task is to predict the next 7 days, I also checked the weekly distribution of the training samples. As Fig. 1 shows, the anomaly rate in the first three weeks of the training set is significantly lower than in the following weeks. Further analysis shows that although the competition provides all data in [2019-08-01, 2019-09-23), all 20 KPIs are labeled normal before 2019-08-15, and the first anomaly appears at 2019-08-15 02:00:00 (Fig. 2). I therefore speculate that the data before 8.15 either follows a different distribution or suffers from labeling problems. In experiments, removing the samples before 8.15 also gave better models than keeping them, which further supports this speculation.


Fig. 1. Week-level KPI anomaly statistics


Fig. 2. Earliest timestamps of anomalous vs. normal data
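As a rough illustration of this check, the following pandas sketch computes the week-level anomaly rate and the earliest anomalous timestamp; the file name and column names (timestamp, kpi_id, value, label) are my assumptions, not the official schema:

```python
import pandas as pd

# Assumed layout: one row per (kpi_id, timestamp) with a 0/1 anomaly label
df = pd.read_csv("train.csv", parse_dates=["timestamp"])

# Week-level anomaly rate, as summarised in Fig. 1
weekly_rate = df.groupby(df["timestamp"].dt.isocalendar().week)["label"].mean()
print(weekly_rate)

# Earliest anomalous timestamp across all KPIs, as in Fig. 2
print(df.loc[df["label"] == 1, "timestamp"].min())  # 2019-08-15 02:00:00

# Dropping the samples before 2019-08-15 worked better in experiments
df = df[df["timestamp"] >= "2019-08-15"]
```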

Time-series plots play an important role in quickly understanding the data and the business. After inspecting the 20 time series, I roughly divide the anomalies into four categories, as shown in Fig. 3.


Fig. 3. Anomaly categories (in each time series, red dots are anomalous and blue dots are normal)

1. Boundary anomalies

As shown in part A (red box) of Fig. 3, the value range of the anomalous samples in boundary anomalies is completely different from that of normal values; in other words, a clear decision boundary completely separates the anomalous points.

2. Trend-breaking anomalies

As shown in part B (green box) of Fig. 3, normal sample points tend to follow a trend, while trend-breaking anomalies deviate from it, though their values may still fall within the normal range. Such anomalies differ greatly from adjacent points and from normal points at the same time of day.

3. Zero-value anomalies

As shown in part C (orange box) of Fig. 3, the value of such an anomaly is simply 0. Based on my understanding of the business, 0 should not appear in normal KPIs; analysis shows that 19 of the 20 KPIs should never be 0. Only one KPI takes 0 as its normal value, and for it a non-zero value is an anomaly.

4. Other anomalies

As shown in part D (purple box) of Fig. 3, such anomalies often do not break the trend and their values lie within the normal range, but they may deviate from the normal values observed at the same time of day.

Problem-solving approach

The contest has 20 different KPIs with different physical meanings and various anomaly types. If all KPIs are modeled together with a single binary classification model, the results are unsatisfactory and it is hard to reach the top of the leaderboard. If each KPI is modeled separately, however, at least 20 different models need to be built, maintained and tuned, which is too costly. My idea is therefore to group the KPIs (or anomaly types) and model each group.

3.1 Boundary discovery

A decision tree splits the samples into different regions of feature space according to the target distribution (Fig. 4), which makes it well suited to discovering and fixing boundaries. Therefore, for boundary anomalies, i.e. KPIs whose normal and anomalous samples take completely different values, I use a decision tree to discover and determine the boundary automatically, as follows:

Traverse the 20 KPIs. If a shallow, univariate decision tree built only on the time-series value in the training set achieves an F1 score of 1, the KPI is considered a boundary-anomaly KPI, and the decision tree's predictions are used as the decision boundary for future samples of that KPI.


Fig. 4. Boundary discovery based on a decision tree
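A minimal sketch of this boundary-discovery loop with scikit-learn, reusing the DataFrame `df` from the exploration sketch above; the perfect-training-F1 criterion follows the text, while the depth of 2 is my own choice for "shallow":

```python
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

boundary_models = {}
for kpi_id, group in df.groupby("kpi_id"):
    X, y = group[["value"]], group["label"]       # univariate: the raw value only
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y)
    if f1_score(y, tree.predict(X)) == 1.0:
        # A perfect split exists: anomalies occupy a disjoint value range,
        # so the fitted tree can serve as the decision boundary on future data
        boundary_models[kpi_id] = tree

print(len(boundary_models))  # 7 boundary-anomaly KPIs in this competition
```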

After traversing the KPIs, 7 KPIs turn out to contain only boundary anomalies; that is, for these 7 KPIs the value range of every anomaly in the training set differs from that of the normal samples. The final results also show that this scheme identifies 100% of the anomalies of the boundary KPIs, not only in the training set but also in the test set.

3.2 Exploring non-boundary anomalies

Non-boundary anomalies usually occur in KPIs with some periodicity. If the period is stripped out for analysis, the time series can be viewed in two dimensions.


Fig. 5. Two-dimensional view of a time series

Take KPI_ID = 9415a… as an example: if the date information is stripped so that the x-axis is only the hour of the day and the y-axis remains the value, we get Fig. 5. The whole time series is now laid out in a two-dimensional space, and most anomalous values (red dots) are far from the normal values (blue dots). A natural idea is to identify these anomalies with an unsupervised method. In fact, a production environment can have as many as 5,000+ original KPIs and 300+ derived KPIs, and time series with anomaly labels are hard to obtain, so statistical methods or unsupervised algorithms are commonly used for anomaly detection in production [1,2]. Under this labeled competition setting, however, after many attempts, unsupervised algorithms such as Isolation Forest and DBSCAN and time-series decomposition methods such as Prophet could not outperform supervised machine learning. Therefore, for non-boundary anomalies I finally decided to build supervised machine learning models.
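The two-dimensional view and one unsupervised baseline can be reproduced roughly as follows; the KPI_ID prefix match, the contamination rate and the plotting details are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

kpi = df[df["kpi_id"].str.startswith("9415a")].copy()
kpi["hour"] = kpi["timestamp"].dt.hour  # strip the date, keep the hour of day

# Two-dimensional view as in Fig. 5: hour of day vs. value
colors = kpi["label"].map({0: "tab:blue", 1: "tab:red"})
plt.scatter(kpi["hour"], kpi["value"], c=colors, s=8)
plt.xlabel("hour of day")
plt.ylabel("value")
plt.show()

# One of the unsupervised baselines tried during exploration; in this labeled
# setting it was ultimately outperformed by supervised models
iso = IsolationForest(contamination=0.02, random_state=0)
kpi["iforest_pred"] = (iso.fit_predict(kpi[["hour", "value"]]) == -1).astype(int)
```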

3.3 KPI grouping

Section 3.1 identified seven boundary-anomaly KPIs with a simple decision tree, but the remaining 13 KPIs have different physical meanings and need to be grouped for modeling. The basic idea of grouping is that similar KPIs should fall into the same group. The Pearson correlation coefficient is the most familiar correlation measure; it expresses the degree to which two variables move in the same or opposite direction, which makes it well suited to measuring the similarity of time series. Similarity analysis of the remaining 13 KPIs shows that within each of the following two groups of IDs the pairwise correlation coefficient is 0.9 or higher:

cluster1 = [9415a…, 600a5…, ed63c…]

cluster2 = [b3842…, bb6bb…, 3fe4d…]
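The grouping step can be sketched with pandas: pivot to one column per KPI and inspect the pairwise Pearson correlations. The 0.9 threshold follows the text; everything else is assumed:

```python
# One column per KPI, rows aligned on timestamp
wide = df.pivot_table(index="timestamp", columns="kpi_id", values="value")
corr = wide.corr(method="pearson")

# Report KPI pairs whose correlation is 0.9 or higher (these form cluster1/cluster2)
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] >= 0.9:
            print(a, b, round(corr.loc[a, b], 3))
```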


Fig. 6. Example comparison of cluster1 time series

Taking cluster1 as an example (Fig. 6), the time series of different KPIs in a similarity group not only share similar trends, but when one KPI in the group becomes anomalous, the others become anomalous at the same time, showing very high linkage. Modeling the similarity groups well is therefore critical: each correctly recalled anomaly effectively scores three times, and each false alarm costs three times. Building this part of the model was the key to climbing the leaderboard quickly in the middle of the competition. I divide the remaining seven KPIs into three sub-groups according to their periodicity:

semi-periodic: cluster3_1 = [4f493…]

aperiodic: cluster3_2 = [29374…, 8f522…]

strongly periodic: cluster3_3 = [681cb…, 0a9f5…, 355ed…, 3e1f1…]

Semi-periodic KPIs show a periodic trend only in some time periods, with nearly constant values elsewhere; aperiodic KPIs have no obvious relationship with time; and strongly periodic KPIs fluctuate periodically with time.

Feature construction

Based on the above analysis and my understanding of time series, I construct the following five types of features in this competition:

1. Basic features: the hour of the day, the day of the week, and various encodings of KPI_ID, such as label encoding and target encoding;

2. Difference features: first-, second- and third-order differences;

3. Shift features: the value (or difference) of the KPI_ID at the last n time points and simple derivations, such as the value 24 hours ago;

4. Sliding-window features: various statistics of the KPI_ID over the past n periods and simple derivations, such as the mean value over the past 24 hours;

5. Strong-correlation window statistics: for example, the number of samples within the 0.95-1.05 range in the last seven days.
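A condensed pandas sketch of these feature families for a single KPI; the column names, window lengths and my reading of the 0.95-1.05 statistic are assumptions rather than the exact competition features:

```python
def build_features(g: pd.DataFrame) -> pd.DataFrame:
    """Feature sketch for one KPI; hourly sampling assumed."""
    out = g.sort_values("timestamp").copy()
    # 1. Basic features
    out["hour"] = out["timestamp"].dt.hour
    out["dayofweek"] = out["timestamp"].dt.dayofweek
    # 2. Difference features
    for order in (1, 2, 3):
        out[f"diff_{order}"] = out["value"].diff(order)
    # 3. Shift features, e.g. the value 24 hours (or one week) ago
    for lag in (1, 2, 24, 168):
        out[f"lag_{lag}h"] = out["value"].shift(lag)
    # 4. Sliding-window statistics, e.g. mean/std over the past 24 hours
    out["roll24_mean"] = out["value"].shift(1).rolling(24).mean()
    out["roll24_std"] = out["value"].shift(1).rolling(24).std()
    # 5. Strong-correlation window statistic (my reading): how many of the last
    #    7 days of samples fall within 0.95-1.05 times the current value
    out["near_count_7d"] = out["value"].rolling(168).apply(
        lambda w: ((w >= 0.95 * w.iloc[-1]) & (w <= 1.05 * w.iloc[-1])).sum(),
        raw=False,
    )
    return out

features = df.groupby("kpi_id", group_keys=False).apply(build_features)
```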

Modeling scheme

Since it is difficult to build a single unified model that works for all KPIs, and many models needed to be built and tuned during the competition, I decided, for efficiency, to use LightGBM, which trains quickly and performs well, to build a binary classification model for each group.
In actual modeling, I found that using only the data from [2019-08-15, 2019-09-08] works better than using all the data, or than using the last few weeks that are closer to the test samples. Combined with the fact that the anomaly rate in Fig. 1 keeps dropping significantly in the later weeks, I judged that the anomaly distribution of [2019-09-09, 2019-09-22] may differ, or that it has labeling problems. After further exploration, I found that introducing grafting learning makes full use of all the anomalous data and achieves better results.
Grafting learning is a kind of transfer learning in which the output of one tree model A is fed as an input to another tree model B (A and B often have different data distributions or even belong to different products, which is essentially different from the conventional fusion of identically distributed data). The method resembles grafting in tree cultivation, hence the name [3]. In the IJCAI-2018 advertising algorithm competition, the data distribution of the first six days differed from that of the last day, so most contestants used only the identically distributed data from the first half of day seven to predict its second half. plantsgo instead trained a model on the first six days, used its predicted scores on day seven as features for the day-seven model, and then predicted the second half of day seven, easily winning the solo championship. Afterwards plantsgo said it was the easiest competition he had played; after all, others used half a day of data while he used six and a half days [3,4,5]. Grafting learning also appears in top solutions to other problems with shifted data distributions, such as the Ant Financial ATEC payment risk identification Top 1 scheme [6] and the CCF BDCI 2018 personalized package matching Top 1 scheme [7] [3].

After several attempts, I finally took the samples from dates containing anomalies as the layer-1 training data, and fed the [2019-08-15, 2019-09-08] samples, combined with the layer-1 scores, into the layer-2 model. The framework is shown in Fig. 7. Introducing this framework clearly raised the score in this competition and is one of the keys to the result above.


Fig. 7. Model framework
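A condensed sketch of this two-layer grafting setup with LightGBM for one KPI group, reusing the `features` table from the feature sketch above. The split dates follow the text; the reading that layer 1 trains on all anomaly-containing dates, as well as the parameters and column names, are my assumptions:

```python
import lightgbm as lgb
import pandas as pd

feat_cols = [c for c in features.columns
             if c not in ("timestamp", "kpi_id", "label")]

# Layer 1: trained on all dates that contain anomalies (2019-08-15 onwards),
# including the later weeks whose distribution appears to differ
layer1_data = features[features["timestamp"] >= "2019-08-15"]
layer1 = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
layer1.fit(layer1_data[feat_cols], layer1_data["label"])

# Layer 2: trained only on [2019-08-15, 2019-09-08], with the layer-1 score
# grafted in as an extra feature so the other weeks still contribute information
layer2_data = features[(features["timestamp"] >= "2019-08-15")
                       & (features["timestamp"] < "2019-09-09")].copy()
layer2_data["layer1_score"] = layer1.predict_proba(layer2_data[feat_cols])[:, 1]
layer2 = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
layer2.fit(layer2_data[feat_cols + ["layer1_score"]], layer2_data["label"])

# At inference time the 7-day test window is scored with the same two steps:
# layer-1 probabilities first, then layer 2 with the grafted score appended
def score(test_features: pd.DataFrame) -> pd.Series:
    s1 = layer1.predict_proba(test_features[feat_cols])[:, 1]
    x2 = test_features[feat_cols].assign(layer1_score=s1)
    return pd.Series(layer2.predict(x2), index=test_features.index)
```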

Combining the previous parts, my final modeling scheme is shown in Fig. 8. First, KPI boundaries are discovered automatically, which handles the anomalies of the 7 boundary KPIs. The remaining 13 KPIs are split by similarity into similar groups (6 KPIs) and dissimilar groups (7 KPIs). The similar groups are cluster1 and cluster2, whose in-group correlation coefficients are high; the dissimilar groups are divided by periodicity into the semi-periodic group cluster3_1, the aperiodic group cluster3_2 and the strongly periodic group cluster3_3. Each group is modeled separately, and the results are finally merged to produce the final submission. This scheme achieved the highest online score and the highest score in the defense.


Fig. 8. Modeling scheme

Acknowledgements

Many thanks to Xi Xu, Dr. Tiaoyun, Su Yan and Xiao Ai for their help and guidance during the competition. Xi Xu is as enthusiastic as ever and always answered my questions at the first opportunity.

Thanks also for the wonderful post-competition sharing by the Lushan expert [2], from which I benefited a lot. I had never watched the Huawei Cloud developer salon before; after watching this one I found it excellent, and I will not miss future sessions. Finally, I wish Huawei and NAIE prosperity and continued brilliance!

References
[1] Network AI: KPI anomaly detection secrets. https://bbs.huaweicloud.com/v…
[2] DevRun developer salon: what is KPI anomaly detection. https://vhall.huawei.com/fe/w…
[3] A brief introduction to grafting learning. https://zhuanlan.zhihu.com/p/…
[4] Transfer learning for structured data: grafting. https://zhuanlan.zhihu.com/p/…
[5] IJCAI-2018 Top 1 solution. https://github.com/plantsgo/i…
[6] ATEC payment risk competition Top 1 solution. https://zhuanlan.zhihu.com/p/…
[7] CCF BDCI 2018 personalized package matching Top 1 scheme. https://github.com/PPshrimpGo…
