By Rashida Nasrin Sucky
Source: Towards Data Science
Anomaly detection can be treated as a statistical task of outlier analysis. But if we develop a machine learning model, the process can be automated and save a lot of time.
There are many use cases for anomaly detection. Credit card fraud detection, detection of faulty machines or hardware systems based on abnormal characteristics, and disease detection based on medical records are good examples. There are many more use cases, and the applications of anomaly detection will only keep growing.
In this article, I will explain the process of developing an anomaly detection algorithm from scratch in Python.
Formulas and processes
If the probability of a training instance is high, it is a normal example. If the probability is very low, it is an anomalous example. What counts as a high or a low probability differs from one training set to another. We’ll discuss how to decide that later.
If I have to explain how anomaly detection works, it’s quite simple.
- Calculate the mean using the following formula:
\[\mu = \frac{1}{m}\sum_{i=1}^{m} x^i\]
Here m is the length of the dataset, that is, the number of training examples, and \(x^i\) is a single training example. If you have multiple training features, in most cases you need to calculate the mean of each feature.
- Calculate the variance using the following formula:
\[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^i - \mu)^2\]
Here, \(\mu\) is the mean calculated in the previous step.
- Now, use this probability formula to calculate the probability of each training example:
\[p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)\]
Don’t be confused by the summation symbol in this formula! Here \(\Sigma\) is not a sum. It is the diagonal matrix that holds the variances, and k is the number of features.
You’ll see what it looks like when we implement the algorithm later.
- Now we need to find the threshold for the probability. As I mentioned earlier, if the probability of a training example is very low, it is an anomalous example.
How low is low?
There is no universal limit. We need to find it for our own training dataset.
We take a range of probability values from the output of step 3. Each probability in turn is treated as the threshold, and every data point is labeled as anomalous or normal against it.
Then we calculate the precision, recall and F1 score for each probability in that range.
Precision can be calculated using the following formula:
\[\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]
Recall is calculated as follows:
\[\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
Here, True Positives is the number of cases where the algorithm detects an example as an anomaly and it really is an anomaly.
False Positives occur when the algorithm detects an example as an anomaly but, in reality, it is not one.
False Negatives occur when the algorithm detects an example as normal but, in reality, it is an anomaly.
From the formulas above, you can see that higher precision and higher recall are always good, because they mean we have more true positives. But at the same time, false positives and false negatives play a vital role, as you can see in the formulas. There needs to be a balance. Depending on your industry, you need to decide which of the two is more tolerable for you.
A good way is to take an average of the two. There is a special formula for this average, the F1 score:
\[F1 = \frac{2PR}{P + R}\]
Here, P and R represent precision and recall, respectively.
I won’t elaborate on why this formula is so special, because this article is about anomaly detection. If you are interested in learning more, you can check: https://towardsdatascience.com/a-complete-understanding-of-precision-recall-and-f-score-concepts-23dc44defef6
Based on the F1 score, you need to select your threshold probability.
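To make the three metrics concrete, here is a minimal sketch that computes them from raw counts. The function name `precision_recall_f1` and the counts are made up for illustration, and returning 0 when a ratio is undefined is a convention assumed here, not something the formulas dictate.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Undefined ratios (zero denominators) are reported as 0.0,
    which is an assumption made for this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 6 anomalies caught, 2 false alarms, 3 anomalies missed.
prec, rec, score = precision_recall_f1(tp=6, fp=2, fn=3)
print(round(prec, 3), round(rec, 3), round(score, 3))  # 0.75 0.667 0.706
```

Note how the F1 score (0.706) sits between precision and recall but closer to the lower of the two, which is why it is a useful single number for picking a threshold.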
Anomaly detection algorithm
I will use a dataset from Andrew Ng’s machine learning course, which has two training features. I did not use a real-world dataset in this article because this one is very suitable for learning. It has only two features. A real-world dataset is very unlikely to have only two features.
The advantage of having two features is that the data can be visualized, which is very useful for learners. Feel free to download the dataset from this link and follow along:
First, import the necessary packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import the dataset. This is an Excel dataset. Here, the training data and the cross-validation data are stored in separate sheets. So let’s bring in the training data.
df = pd.read_excel('ex8data1.xlsx', sheet_name='X', header=None)
df.head()
Let’s plot column 0 against column 1.
plt.figure()
plt.scatter(df[0], df[1])
plt.show()
You may be able to tell which data points are anomalous just by looking at this plot.
Check how many training examples are in this dataset:
m = len(df)
Calculate the mean of each feature. Here we have only two features: 0 and 1.
s = np.sum(df, axis=0)
mu = s/m
mu
0    14.112226
1    14.997711
dtype: float64
Let’s calculate the variance according to the formula described in the “Formulas and processes” section above:
vr = np.sum((df - mu)**2, axis=0)
variance = vr/m
variance
0    1.832631
1    1.709745
dtype: float64
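As a sanity check, the manual mean and variance can be compared with NumPy’s built-in functions. The sketch below uses synthetic data as a stand-in for the course dataset, so the numbers will not match the output above.

```python
import numpy as np

# Synthetic stand-in for the two-feature training data (not the course dataset).
rng = np.random.default_rng(0)
data = rng.normal(loc=[14.0, 15.0], scale=[1.3, 1.3], size=(300, 2))
m = len(data)

# Manual estimates, following the formulas in the text.
mu_manual = data.sum(axis=0) / m
var_manual = ((data - mu_manual) ** 2).sum(axis=0) / m

# Built-in equivalents; np.var divides by m when ddof=0 (its default).
mu_builtin = data.mean(axis=0)
var_builtin = data.var(axis=0, ddof=0)

print(np.allclose(mu_manual, mu_builtin), np.allclose(var_manual, var_builtin))
```

One thing to watch out for: pandas’ `DataFrame.var()` divides by m - 1 by default (`ddof=1`), so pass `ddof=0` if you want it to agree with the formula used in this article.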
Now turn it into a diagonal matrix. As I explained in the “Formulas and processes” section after the probability formula, the summation symbol there actually stands for this matrix of variances.
var_dia = np.diag(variance)
var_dia
array([[1.83263141, 0.        ],
       [0.        , 1.70974533]])
Calculate the probability of each training example:
k = len(mu)
X = df - mu
p = 1/((2*np.pi)**(k/2)*(np.linalg.det(var_dia)**0.5)) * np.exp(-0.5 * np.sum(X @ np.linalg.pinv(var_dia) * X, axis=1))
p
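Because the covariance matrix is diagonal, the multivariate density should equal the product of one-dimensional Gaussian densities, one per feature. The sketch below checks this on a few synthetic rows; `gaussian_density` is a hypothetical helper written for this check and is not part of the code above.

```python
import numpy as np

def gaussian_density(X, mu, variance):
    """Multivariate Gaussian density with a diagonal covariance matrix."""
    k = len(mu)
    cov = np.diag(variance)
    diff = X - mu
    # Row-wise quadratic form (x - mu)^T cov^{-1} (x - mu)
    quad = np.sum(diff @ np.linalg.pinv(cov) * diff, axis=1)
    norm = (2 * np.pi) ** (k / 2) * np.linalg.det(cov) ** 0.5
    return np.exp(-0.5 * quad) / norm

# Synthetic rows standing in for the training data.
rng = np.random.default_rng(1)
X = rng.normal(loc=[14.0, 15.0], scale=[1.3, 1.3], size=(5, 2))
mu = X.mean(axis=0)
variance = X.var(axis=0)

# Product of independent one-dimensional Gaussian densities, one per feature.
per_feature = np.exp(-0.5 * (X - mu) ** 2 / variance) / np.sqrt(2 * np.pi * variance)
product = per_feature.prod(axis=1)

print(np.allclose(gaussian_density(X, mu, variance), product))
```

If the two results agree, the matrix expression is doing what the formula says.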
The training part is now complete.
The next step is to find the threshold probability. If the probability is lower than the threshold, the example is anomalous. But we need to find that threshold for our particular case.
For this step, we use the cross-validation data and its labels.
For your own case, you can simply hold out part of the original data for cross-validation.
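If your data comes as a single labeled table, one simple way to hold out a validation part is a shuffled split. This is only a sketch: `holdout_split`, the 20% fraction and the toy data are assumptions for illustration.

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the rows and return (X_train, y_train, X_val, y_val)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Hypothetical data: 100 rows, 2 features, with the last 5 rows marked anomalous.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.zeros(100, dtype=int)
y[-5:] = 1

X_train, y_train, X_val, y_val = holdout_split(X, y)
print(X_train.shape, X_val.shape)  # (80, 2) (20, 2)
```

Since anomalies are rare, a purely random split may leave very few of them in the validation part, so it is worth checking `y_val.sum()` before tuning a threshold on it.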
Now import the cross-validation data and labels:
cvx = pd.read_excel('ex8data1.xlsx', sheet_name='Xval', header=None)
cvx.head()
The labels are as follows:
cvy = pd.read_excel('ex8data1.xlsx', sheet_name='y', header=None)
cvy.head()
I’ll convert ‘cvy’ to a NumPy array because I like working with arrays. A DataFrame would work just as well.
y = np.array(cvy)
(Output: part of the label array.)
Here, a y value of 0 indicates a normal example, and a y value of 1 indicates an anomalous example.
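When a threshold is tuned on held-out data, one common arrangement is to estimate the mean and variance on the training rows and then score the validation rows with those parameters, so that each validation label is paired with the probability of its own row. Here is a minimal sketch of that arrangement on synthetic data; `fit_gaussian` and `score` are hypothetical names and are not used elsewhere in this article.

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate the per-feature mean and variance from the training rows."""
    mu = X_train.mean(axis=0)
    variance = X_train.var(axis=0)
    return mu, variance

def score(X, mu, variance):
    """Density of each row of X under the fitted diagonal Gaussian."""
    k = len(mu)
    diff = X - mu
    quad = np.sum(diff ** 2 / variance, axis=1)
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.prod(variance))
    return np.exp(-0.5 * quad) / norm

# Synthetic training rows, plus validation rows with one far-away point appended.
rng = np.random.default_rng(2)
X_train = rng.normal(loc=[14.0, 15.0], scale=1.3, size=(300, 2))
X_val = np.vstack([rng.normal(loc=[14.0, 15.0], scale=1.3, size=(50, 2)),
                   [[25.0, 5.0]]])

mu, variance = fit_gaussian(X_train)
p_val = score(X_val, mu, variance)
print(p_val.argmin())  # index of the least likely validation row
```

In this toy setup the appended far-away point is the last row, so it should come out as the least likely one.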
Now, how do I choose a threshold?
I don’t want to check every probability in the list. That may not be necessary. Let’s look at the probability values again.
p.describe()
count    3.070000e+02
mean     5.905331e-02
std      2.324461e-02
min      1.181209e-23
25%      4.361075e-02
50%      6.510144e-02
75%      7.849532e-02
max      8.986095e-02
dtype: float64
As the summary shows, we don’t have much anomalous data. So starting from the 75% value should be fine. But to be safe, I’ll start from the mean.
So we will take the range of probabilities from the mean downward and check the F1 score for each probability in this range.
First, define a function to calculate the true positives, false positives and false negatives:
def tpfpfn(ep):
    tp, fp, fn = 0, 0, 0
    for i in range(len(y)):
        # flagged as an anomaly and really an anomaly
        if p[i] <= ep and y[i] == 1:
            tp += 1
        # flagged as an anomaly but actually normal
        elif p[i] <= ep and y[i] == 0:
            fp += 1
        # not flagged but actually an anomaly
        elif p[i] > ep and y[i] == 1:
            fn += 1
    return tp, fp, fn
Make a list of the probabilities that are lower than or equal to the mean probability.
eps = [i for i in p if i <= p.mean()]
Check the length of the list:
len(eps)
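Define a function to calculate the F1 score according to the formula discussed above:
def f1(ep):
    tp, fp, fn = tpfpfn(ep)
    # with no true positives, precision and recall are both 0;
    # return 0 directly to avoid dividing by zero
    if tp == 0:
        return 0.0
    prec = tp/(tp + fp)
    rec = tp/(tp + fn)
    return 2*prec*rec/(prec + rec)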
All functions are ready!
Now calculate the F1 score for every epsilon, that is, for each probability value in the range we selected earlier.
f = []
for i in eps:
    f.append(f1(i))
f
[0.14285714285714285, 0.14035087719298248, 0.1927710843373494, 0.1568627450980392, 0.208955223880597, 0.41379310344827586, 0.15517241379310345, 0.28571428571428575, 0.19444444444444445, 0.5217391304347826, 0.19718309859154928, 0.19753086419753085, 0.29268292682926833, 0.14545454545454545,
This is part of the list of F1 scores. Its length should be 133.
The F1 score always lies between 0 and 1, and the higher the F1 score, the better. So we need to take the highest score from the list of F1 scores we just calculated.
Now use the “argmax” function to find the index of the maximum F1 score.
best = np.array(f).argmax()  # index of the highest F1 score
best
Now use this index to get the threshold probability.
e = eps[best]
e
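The same search can be written more compactly with boolean masks instead of an explicit loop over the rows. The sketch below is an alternative formulation on toy data, not the code used above. `best_threshold` is a hypothetical helper, and it uses the identity F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same as 2PR / (P + R).

```python
import numpy as np

def best_threshold(p, y, candidates):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y).ravel().astype(bool)
    best_ep, best_f1 = None, -1.0
    for ep in candidates:
        flagged = p <= ep                 # rows treated as anomalies
        tp = np.sum(flagged & y)
        fp = np.sum(flagged & ~y)
        fn = np.sum(~flagged & y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_ep, best_f1 = ep, f1
    return best_ep, best_f1

# Toy probabilities and labels: the two lowest-probability rows are anomalies.
p = np.array([0.08, 0.07, 0.001, 0.06, 0.0005, 0.05])
y = np.array([0, 0, 1, 0, 1, 0])
ep, score = best_threshold(p, y, candidates=np.sort(p))
print(ep, score)
```

On this toy input the best threshold is the larger of the two anomalous probabilities, because it flags both anomalies without flagging any normal row.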
Find abnormal instances
We have the threshold probability. Now we can find the labels for our training data.
If the probability is less than or equal to the threshold, the data point is anomalous. Otherwise it is normal. We represent normal and anomalous data as 0 and 1, respectively.
label = []
for i in range(len(df)):
    if p[i] <= e:
        label.append(1)
    else:
        label.append(0)
label
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
This is part of the label list.
I will add this computed label to the training dataset above:
df['label'] = np.array(label)
df.head()
I plotted the data in red where the label is 1 and in black where the label is 0. Here is the result.
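A plot like this can be produced with two scatter calls, one per label. The sketch below uses synthetic points and made-up labels so that it runs on its own; it is one possible way to draw the figure, not necessarily the exact code behind it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the labeled training data (two features plus a 0/1 label).
rng = np.random.default_rng(3)
points = rng.normal(loc=[14.0, 15.0], scale=1.3, size=(100, 2))
label = np.zeros(100, dtype=int)
label[:3] = 1  # pretend the first three rows were flagged as anomalies

fig, ax = plt.subplots()
normal = label == 0
ax.scatter(points[normal, 0], points[normal, 1], color="black", label="normal")
ax.scatter(points[~normal, 0], points[~normal, 1], color="red", label="anomaly")
ax.legend()
# In an interactive session, drop the Agg line above and call plt.show() here.
```

With the DataFrame from this article, the same mask can be built from the new column, for example `df['label'] == 0`.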
Does that make sense?
Yes, right? The data in red is clearly anomalous.
I tried to explain the process of developing an anomaly detection algorithm step by step, and I hope it is understandable. If you can’t follow it just by reading, I suggest you run every piece of code yourself. That will make it clear.