Outlier Detection with RNN Autoencoders

Date: 2021-05-11

By David Woroniuk
Translated by VK
Source: Towards Data Science

What is an anomaly?

Anomalies, often referred to as outliers, are data points, data sequences, or patterns that do not conform to the overall behavior of a data series. Anomaly detection is therefore the task of detecting data points or sequences that deviate from the patterns present in the broader data.

Effective detection and removal of anomalous data can be very useful for many business functions, such as detecting broken links embedded in a website, spikes in internet traffic, or dramatic changes in stock prices. Flagging these phenomena as outliers, or enacting pre-planned countermeasures, can save time and money.

Types of anomalies

Generally, anomalous data can be divided into three categories: additive outliers, temporary change outliers, and level shift outliers.

Additive outliers are characterized by a sudden, sharp increase or decrease in value, which may be driven by exogenous or endogenous factors. Examples of additive outliers include a large spike in website traffic due to a television appearance (exogenous), or a short-term increase in stock trading volume due to strong quarterly performance (endogenous).

Temporary change outliers are characterized by a short sequence that does not conform to the broader trend in the data. For example, if a website server crashes, traffic drops to zero for a sequence of data points until the server restarts, at which point traffic returns to normal.

Level shift outliers are common in commodity markets, as the high demand for electricity is intrinsically linked to poor weather conditions. We can therefore observe a "level shift" between summer and winter electricity prices, driven by weather-dependent changes in demand and variation in renewable energy generation.
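To make these categories concrete, here is a minimal sketch (using hypothetical synthetic data, not part of the original analysis) that injects each anomaly type into a flat numpy series:

#A minimal sketch of the three anomaly types, using hypothetical synthetic data:
import numpy as np

series = np.full(300, 100.0)   #A flat baseline series
series[50] += 40               #Additive outlier: a single sharp spike (e.g. a TV appearance)
series[120:130] = 0            #Temporary change: a short run that breaks the trend (e.g. a server crash)
series[200:] += 25             #Level shift: the series settles at a new level (e.g. seasonal demand)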

What is an autoencoder?

An autoencoder is a neural network designed to learn a low-dimensional representation of a given input. An autoencoder typically consists of two components: an encoder, which learns to map the input data to a low-dimensional representation, and a decoder, which learns to map the representation back to the input data.

Because of this architecture, the encoder network iteratively learns an efficient compression function that maps the data to a low-dimensional representation. After training, the decoder can successfully reconstruct the original input data, and the reconstruction error (the difference between the input and the reconstructed output produced by the decoder) serves as the objective function for the whole training process.
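As a minimal sketch of this idea, the (hypothetical) dense autoencoder below compresses a 30-dimensional input into an 8-dimensional representation and reconstructs it, with the mean squared reconstruction error as the training objective. This is purely illustrative; the RNN-based model used in this article is defined later on.

#A minimal dense autoencoder sketch (illustrative only):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(30,))                          #30-dimensional input
encoded = layers.Dense(8, activation='relu')(inputs)       #Encoder: map the input to a low-dimensional representation
decoded = layers.Dense(30, activation='linear')(encoded)   #Decoder: map the representation back to the input space

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')          #The reconstruction error is the training objective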

Implementation

Now that we understand the underlying architecture of the autoencoder model, we can begin to implement it.

The first step is to install the libraries, packages and modules we will use:

#Data processing:
import numpy as np
import pandas as pd
from datetime import date, datetime

#RNN autoencoder:
from tensorflow import keras
from tensorflow.keras import layers

#Plotting:
!pip install chart-studio
import plotly.graph_objects as go

Next, we need some data to analyse. This article uses the Historic-Crypto package to obtain Bitcoin data from June 6, 2013 to the present. The code below also generates the daily Bitcoin returns and intraday price volatility, removes any rows with missing data, and returns the first five rows of the DataFrame.

#Import the Historic-Crypto package:
!pip install Historic-Crypto
from Historic_Crypto import HistoricalData

#Obtain Bitcoin data and calculate returns and intraday volatility:
dataset = HistoricalData(start_date = '2013-06-06',ticker = 'BTC').retrieve_data()
dataset['Returns'] = dataset['Close'].pct_change()
dataset['Volatility'] = np.abs(dataset['Close']- dataset['Open'])
dataset.dropna(axis = 0, how = 'any', inplace = True)
dataset.head()

Now that we have some data, we should visually inspect each series for potential outliers. The plot_dates_values function below can be called iteratively to plot each series contained in the DataFrame.

def plot_dates_values(data_timestamps, data_plot):
  '''
  This function provides a plot of the input data series.
  Arguments: 
          data_timestamps: the timestamps associated with each data instance.
          data_plot: the data series to be plotted.
  Returns:
          fig: a plot of the series, with a range slider and range-selector buttons.
  '''

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = data_timestamps, y = data_plot,
                           mode = 'lines',
                           name = data_plot.name,
                           connectgaps=True))
  fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1 Years", step="year", stepmode="backward"),
            dict(count=2, label="2 Years", step="year", stepmode="backward"),
            dict(count=3, label="3 Years", step="year", stepmode="backward"),
            dict(label="All", step="all")
        ]))) 
  
  fig.update_layout(
    title=data_plot.name,
    xaxis_title="Date",
    yaxis_title="",
    font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  return fig.show()

We can now call the above function repeatedly to generate plots of Bitcoin trading volume, closing price, opening price, volatility, and returns.

plot_dates_values(dataset.index, dataset['Volume'])

It is worth noting that a number of spikes in trading volume occur in 2020; it may be useful to investigate whether these spikes are anomalous or indicative of the broader series.

plot_dates_values(dataset.index, dataset['Close'])

![](http://qiniu.aihubs.net/newplot (1).png)

The closing price rose significantly in 2018 and then fell back to a level of technical support. However, an upward trend is present throughout the data.

plot_dates_values(dataset.index, dataset['Open'])

![](http://qiniu.aihubs.net/newplot (2).png)

The daily opening prices show a pattern similar to the closing prices above.

plot_dates_values(dataset.index, dataset['Volatility'])

![](http://qiniu.aihubs.net/newplot (3).png)

The volatility spikes around the 2018 price movements are evident. We can therefore investigate whether the autoencoder model considers these volatility spikes anomalous.

plot_dates_values(dataset.index, dataset['Returns'])

![](http://qiniu.aihubs.net/newplot (4).png)

Given the randomness of the returns series, we choose to test for outliers in the daily Bitcoin trading volume, as characterised by the Volume column.

We can now begin preprocessing the data for the autoencoder model. The first step is to determine an appropriate split between the training and test data. The generate_train_test_split function outlined below splits the training and test data by date. When called, it generates two DataFrames, training_data and testing_data, as global variables.

def generate_train_test_split(data, train_end, test_start):
  '''
  This function splits the dataset into training and testing data using date strings. The strings provided to the 'train_end' and 'test_start' arguments must be consecutive days.
  Arguments: 
          data: the data to be split into training and testing data (Pandas DataFrame).
          train_end: the end date of the training data (str).
          test_start: the start date of the testing data (str).
  Returns:
          training_data: the data used in model training (Pandas DataFrame).
          testing_data: the data used in model testing (Pandas DataFrame).
  '''
  if isinstance(train_end, str) is False:
    raise TypeError("train_end argument should be a string.")
  
  if isinstance(test_start, str) is False:
    raise TypeError("test_start argument should be a string.")

  train_end_datetime = datetime.strptime(train_end, '%Y-%m-%d')
  test_start_datetime = datetime.strptime(test_start, '%Y-%m-%d')
  if train_end_datetime >= test_start_datetime:
    raise ValueError("train_end argument cannot occur prior to the test_start argument.")
  if abs((train_end_datetime - test_start_datetime).days) > 1:
    raise ValueError("the train_end argument and test_start argument should be separated by 1 day.")

  training_data = data[:train_end]
  testing_data = data[test_start:]

  print('Train Dataset Shape:',training_data.shape)
  print('Test Dataset Shape:',testing_data.shape)

  return training_data, testing_data


#We now call the above function to generate the training and testing data:
training_data, testing_data = generate_train_test_split(dataset, '2018-12-31','2019-01-01')

To improve model accuracy, we can "normalise" or scale the data. The function below scales the training DataFrame generated above, saving the training mean and training standard deviation so that the test data can be normalised later on.

Note: it is important to scale the training and test data on the same scale, otherwise differences in scale will produce interpretability problems and model inconsistencies.

def normalise_training_values(data):
  '''
  This function normalises the input values using the mean and standard deviation.
  Arguments: 
          data: the DataFrame column to be normalised (Pandas Series).
  Returns:
          values: the normalised data, used for model training (numpy array).
          mean: the training set mean, used to normalise the testing set (float).
          std: the training set standard deviation, used to normalise the testing set (float).
  '''
  if isinstance(data, pd.Series) is False:
    raise TypeError("data argument should be a Pandas Series.")

  values = data.to_numpy(dtype=float)  #Convert to a numpy array so the arithmetic below works element-wise
  mean = np.mean(values)
  values -= mean
  std = np.std(values)
  values /= std
  print("*"*80)
  print("The length of the training data is: {}".format(len(values)))
  print("The mean of the training data is: {}".format(mean.round(2)))
  print("The standard deviation of the training data is {}".format(std.round(2)))
  print("*"*80)
  return values, mean, std


#Now call the above function:
training_values, training_mean, training_std = normalise_training_values(training_data['Volume'])

Having called the normalise_training_values function above, we now have a numpy array containing the normalised training data, called training_values, and we have stored training_mean and training_std as global variables for normalising the test set.

We can now begin generating the sequences used to train the autoencoder model. We define a window size of 30, providing 3D training data of shape (2004, 30, 1): 2004 windows, each containing 30 consecutive daily observations of a single feature.

#Define the number of time steps for each sequence:
TIME_STEPS = 30

def generate_sequences(values, time_steps = TIME_STEPS):
  '''
  This function generates sequences of length 'TIME_STEPS' to be passed to the model.
  Arguments: 
          values: the normalised values from which sequences are generated (numpy array).
          time_steps: the length of each sequence (int).
  Returns:
          train_data: 3D data used for model training (numpy array).
  '''
  if isinstance(values, np.ndarray) is False:
    raise TypeError("values argument must be a numpy array.")
  if isinstance(time_steps, int) is False:
    raise TypeError("time_steps must be an integer object.")

  output = []

  for i in range(len(values) - time_steps):
    output.append(values[i : (i + time_steps)])
  train_data = np.expand_dims(output, axis =2)
  print("Training input data shape: {}".format(train_data.shape))

  return train_data
  
#Now call the above function to generate x_train:
x_train = generate_sequences(training_values)

Now that the training data has been processed, we can define the autoencoder model and fit it to the training data. The define_model function uses the shape of the training data to define an appropriate model, returning the autoencoder model and a summary of the model architecture.

def define_model(x_train):
  '''
  This function uses the dimensions of x_train to generate the RNN model.
  Arguments: 
          x_train: 3D data used for model training (numpy array).
  Returns:
          model: the model architecture (TensorFlow object).
          model_summary: a summary of the model architecture.
  '''

  if isinstance(x_train, np.ndarray) is False:
    raise TypeError("The x_train argument should be a 3 dimensional numpy array.")

  num_steps = x_train.shape[1]
  num_features = x_train.shape[2]

  keras.backend.clear_session()
  
  model = keras.Sequential(
      [
       layers.Input(shape=(num_steps, num_features)),
       layers.Conv1D(filters=32, kernel_size = 15, padding = 'same', data_format= 'channels_last',
                     dilation_rate = 1, activation = 'linear'),
       layers.LSTM(units = 25, activation = 'tanh', name = 'LSTM_layer_1',return_sequences= False),
       layers.RepeatVector(num_steps),
       layers.LSTM(units = 25, activation = 'tanh', name = 'LSTM_layer_2', return_sequences= True),
       layers.Conv1D(filters = 32, kernel_size = 15, padding = 'same', data_format = 'channels_last',
                     dilation_rate = 1, activation = 'linear'),
       layers.TimeDistributed(layers.Dense(1, activation = 'linear'))
      ]
  )

  model.compile(optimizer=keras.optimizers.Adam(learning_rate = 0.001), loss = "mse")
  return model, model.summary()

Next, the model_fit function calls the define_model function internally, then supplies the epochs, batch_size, and validation_split parameters to the model. We then call this function to begin the model training process.

def model_fit():
  '''
  This function calls the 'define_model()' function above, then trains the model on the x_train data.
  Arguments: 
          N/A.
  Returns:
          model: the trained model.
          history: a summary of how the model trained (training error, validation error).
  '''
  #Call the define_model function above on x_train:
  model, summary = define_model(x_train)

  history = model.fit(
    x_train,
    x_train,
    epochs=400,
    batch_size=128,
    validation_split=0.1,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", 
                                              patience=25, 
                                              mode="min", 
                                              restore_best_weights=True)])
  
  return model, history


#Call the above function to generate the model and its history
model, history = model_fit()

Once the model has been trained, we should plot the training and validation loss curves to understand whether the model suffers from bias (underfitting) or variance (overfitting). This can be done by calling the plot_training_validation_loss function below.

def plot_training_validation_loss():
  '''
  This function plots the training and validation loss curves of the trained model, enabling visual diagnosis of underfitting or overfitting.
  Arguments: 
          N/A.
  Returns:
          fig: a visual representation of the model's training and validation loss.
  '''
  training_validation_loss = pd.DataFrame.from_dict(history.history, orient='columns')

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = training_validation_loss.index, y = training_validation_loss["loss"].round(6),
                           mode = 'lines',
                           name = 'Training Loss',
                           connectgaps=True))
  fig.add_trace(go.Scatter(x = training_validation_loss.index, y = training_validation_loss["val_loss"].round(6),
                           mode = 'lines',
                           name = 'Validation Loss',
                           connectgaps=True))
  
  fig.update_layout(
  title='Training and Validation Loss',
  xaxis_title="Epoch",
  yaxis_title="Loss",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  return fig.show()


#Call the above function:
plot_training_validation_loss()

![](http://qiniu.aihubs.net/newplot (5).png)

It is worth noting that the training and validation loss curves converge across the chart, with the validation loss remaining slightly greater than the training loss. Given the shape of the curves and the relative errors, we can conclude that the autoencoder model is neither underfitting nor overfitting.

Now we can define the reconstruction error, one of the core principles of the autoencoder model. The reconstruction error is expressed as the training loss, and the reconstruction error threshold is defined as the maximum training loss value. Therefore, when the test error is calculated, any value greater than the maximum training loss can be regarded as an outlier.
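As a quick illustration of this thresholding rule (with hypothetical error values, not outputs of the model):

#Illustrative sketch of the thresholding rule, using hypothetical error values:
import numpy as np

train_mae_example = np.array([0.05, 0.10, 0.30])   #Training reconstruction errors
threshold = np.max(train_mae_example)              #Reconstruction error threshold = 0.30
test_mae_example = np.array([0.12, 0.45, 0.28])    #Test reconstruction errors
print(test_mae_example >= threshold)               #[False  True False]: the second window is flagged as anomalous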

def reconstruction_error(x_train):
  '''
  This function calculates the reconstruction error and displays a histogram of the training mean absolute error (MAE).
  Arguments: 
          x_train: 3D data used for model training (numpy array).
  Returns:
          fig: a visualisation of the training MAE distribution.
  '''

  if isinstance(x_train, np.ndarray) is False:
    raise TypeError("x_train argument should be a numpy array.")

  x_train_pred = model.predict(x_train)
  global train_mae_loss
  train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis = 1)
  histogram = train_mae_loss.flatten() 
  fig =go.Figure(data = [go.Histogram(x = histogram, 
                                      histnorm = 'probability',
                                      name = 'MAE Loss')])  
  fig.update_layout(
  title='Mean Absolute Error Loss',
  xaxis_title="Training MAE Loss (%)",
  yaxis_title="Number of Samples",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  print("*"*80)
  print("Reconstruction error threshold: {} ".format(np.max(train_mae_loss).round(4)))
  print("*"*80)
  return fig.show()


#Call the above function:
reconstruction_error(x_train)

Above, we saved training_mean and training_std as global variables so that they can be used to scale the test data. We now define the normalise_testing_values function to scale the test data.

def normalise_testing_values(data, training_mean, training_std):
  '''
  This function normalises the test data using the training mean and standard deviation, generating a numpy array of test values.
  Arguments: 
          data: the DataFrame column to be normalised (Pandas Series).
          training_mean: the training set mean (float).
          training_std: the training set standard deviation (float).
  Returns:
          values: the normalised test data (numpy array).
  '''
  if isinstance(data, pd.Series) is False:
    raise TypeError("data argument should be a Pandas Series.")

  values = data.to_numpy(dtype=float)  #Convert to a numpy array so the arithmetic below works element-wise
  values -= training_mean
  values /= training_std
  print("*"*80)
  print("The length of the testing data is: {}".format(data.shape[0]))
  print("The mean of the testing data is: {}".format(data.mean()))
  print("The standard deviation of the testing data is {}".format(data.std()))
  print("*"*80)

  return values

This function is then called on the Volume column of testing_data, so that test_value is materialised as a numpy array.

#Call the above function:
test_value = normalise_testing_values(testing_data['Volume'], training_mean, training_std)

On this basis, the generate_testing_loss function is defined, which calculates the difference between the reconstructed data and the test data. Any value greater than the maximum training loss is stored in the global anomalies list.

def generate_testing_loss(test_value):
  '''
  This function uses the model to predict anomalies in the test set. In addition, the function generates the 'anomalies' global variable, containing the outliers identified by the RNN.
  Arguments: 
          test_value: the array of test values (numpy array).
  Returns:
          fig: a visualisation of the testing MAE distribution.
  '''
  x_test = generate_sequences(test_value)
  print("*"*80)
  print("Test input shape: {}".format(x_test.shape))

  x_test_pred = model.predict(x_test)
  test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis = 1)
  test_mae_loss = test_mae_loss.reshape((-1))

  global anomalies
  anomalies = (test_mae_loss >= np.max(train_mae_loss)).tolist()
  print("Number of anomaly samples: ", np.sum(anomalies))
  print("Indices of anomaly samples: ", np.where(anomalies))
  print("*"*80)

  histogram = test_mae_loss.flatten() 
  fig =go.Figure(data = [go.Histogram(x = histogram, 
                                      histnorm = 'probability',
                                      name = 'MAE Loss')])  
  fig.update_layout(
  title='Mean Absolute Error Loss',
  xaxis_title="Testing MAE Loss (%)",
  yaxis_title="Number of Samples",
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  return fig.show()


#Call the above function:
generate_testing_loss(test_value)

In addition, the distribution of the test MAE is plotted, which can be compared against the distribution of the training MAE above.

![](http://qiniu.aihubs.net/newplot (6).png)

Finally, the outliers are displayed visually below.

def plot_outliers(data):
  '''
  This function determines the position of the outliers within the time series, which are then plotted.
  Arguments: 
          data: the initial dataset (Pandas DataFrame).
  Returns:
          fig: a visual representation of the outliers in the series, as determined by the RNN.
  '''

  outliers = []

  for data_idx in range(TIME_STEPS -1, len(test_value) - TIME_STEPS + 1):
    time_series = range(data_idx - TIME_STEPS + 1, data_idx)
    if all([anomalies[j] for j in time_series]):
      outliers.append(data_idx + len(training_data))

  outlying_data = data.iloc[outliers, :]

  cond = data.index.isin(outlying_data.index)
  no_outliers = data.drop(data[cond].index)

  fig = go.Figure()
  fig.add_trace(go.Scatter(x = no_outliers.index, y = no_outliers["Volume"],
                           mode = 'markers',
                           name = no_outliers["Volume"].name,
                           connectgaps=False))
  fig.add_trace(go.Scatter(x = outlying_data.index, y = outlying_data["Volume"],
                           mode = 'markers',
                           name = outlying_data["Volume"].name + ' Outliers',
                           connectgaps=False))
  
  fig.update_xaxes(rangeslider_visible=True)

  fig.update_layout(
  title='Detected Outliers',
  xaxis_title=data.index.name,
  yaxis_title=no_outliers["Volume"].name,
  font=dict(
        family="Arial",
        size=11,
        color="#7f7f7f"
    ))
  
  
  return fig.show()


#Call the above function:
plot_outliers(dataset)

The data characterised as outlying by the autoencoder model is shown in orange, while conforming data is shown in blue.

![](http://qiniu.aihubs.net/newplot (7).png)

We can see that a large portion of the Bitcoin volume data in 2020 is considered anomalous, perhaps due to the increase in retail trading activity driven by Covid-19?

Try experimenting with the autoencoder parameters and with new datasets to see whether you can find any anomalies in the Bitcoin closing price, or use the Historic-Crypto library to download different cryptocurrencies!

Link to the original article: https://towardsdatascience.com/outlier-detection-with-rnn-autoencoders-b82e2c230ed9
