Bank customer churn prediction model based on a PyTorch machine learning neural network classifier in Python

Time: 2021-3-18

Link to the original text: http://tecdat.cn/?p=8522

Classification is one of the core categories of machine learning problems: given a set of features, the task is to predict a discrete value. Common examples of classification problems are predicting whether a tumor is cancerous or whether a student is likely to pass an exam. In this article, given a number of characteristics of bank customers, we will predict whether a customer is likely to leave the bank after six months. The phenomenon of customers leaving an organization is called customer churn, so our task is to predict customer churn from various customer features.

 $ pip install torch 

Dataset

Let’s import the required libraries and datasets into our Python application:

 import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

We can use the read_csv() method of the pandas library to import the CSV file that contains our dataset.

 dataset = pd.read_csv(r'E:\Datasets\customer_data.csv') 

Let’s print the shape of the dataset:

 dataset.shape 

Output:

 (10000, 14) 

The output shows that the dataset has 10000 records and 14 columns. We can use the data frame’s head() method to print the first five rows of the dataset.

 dataset.head() 

Output:

[Figure: the first five rows of the dataset]

You can see the 14 columns in our dataset. Based on the first 13 columns, our task is to predict the value of the 14th column, Exited.

Exploratory data analysis

Let’s do some exploratory data analysis on the dataset. We will first plot the proportion of customers who actually left the bank after six months, visualized with a pie chart. Let’s first increase the default plot size:

 fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size 

The following script draws a pie chart of the Exited column.

 dataset.Exited.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=['skyblue', 'orange'], explode=(0.05, 0.05)) 

Output:

[Figure: pie chart of the Exited column]

The output shows that 20% of the customers in our dataset left the bank. Here 1 represents a customer who left the bank, and 0 a customer who did not. Let’s plot the number of customers from each geographic location in the dataset:

[Figure: number of customers per geographic location]

The output shows that almost half of the customers are from France, while about 25% are from Spain and 25% from Germany.

Now, let’s plot the number of customers, together with their churn status, for each unique geographic location. We can use the countplot() function from the seaborn library for this.
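The plotting script itself is not included in this excerpt; a plausible sketch on a toy frame (the toy data and the headless Agg backend are assumptions made so the example is self-contained) could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import pandas as pd
import seaborn as sns

# toy stand-in for the real dataset; column names match the article
dataset = pd.DataFrame({
    'Geography': ['France', 'France', 'Spain', 'Germany', 'Germany', 'France'],
    'Exited':    [0,        1,        0,       1,         1,         0],
})

# one group of bars per location, split by churn status
ax = sns.countplot(x='Geography', hue='Exited', data=dataset)
```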

[Figure: countplot of Geography, split by Exited]

The output shows that although the total number of French customers is twice the number of Spanish or German customers, the fraction of customers who left the bank is about the same for French and Spanish customers. Likewise, the German and Spanish totals are the same, but the number of German customers who left the bank is twice the number of Spanish customers, indicating that German customers are more likely to leave the bank after six months.

Data preprocessing

Before training the PyTorch model, we need to preprocess the data. If you look at the dataset, you will see that it has two types of columns: numerical and categorical. The numerical columns contain numerical information, such as CreditScore, Balance, and Age. Similarly, Geography and Gender are categorical columns, since they contain categorical information such as the customer’s location and gender. A few columns could be treated either way: for example, the HasCrCard column only takes the values 1 or 0, but what it really encodes is the categorical information of whether or not the customer has a credit card.

Let’s list all the columns in the dataset again and work out which should be treated as numerical and which as categorical. The columns attribute of the data frame displays all the column names:

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')

Of these columns, we will not use RowNumber, CustomerId, or Surname, because their values are essentially random and independent of the output; for example, a customer’s surname has no effect on whether the customer leaves the bank. Of the remaining columns, Geography, Gender, HasCrCard, and IsActiveMember can be treated as categorical. Let’s create a list of these columns; apart from Exited, all the other columns will be treated as numerical.

 categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary'] 

Finally, the output (the values of the Exited column) will be stored in the outputs variable.

We have created lists of the categorical, numerical, and output columns. At the moment, however, the categorical columns are not yet of the category dtype. You can check the types of all columns in the dataset with the data frame’s dtypes attribute:

Output:

 RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object 

You can see that the Geography and Gender columns have the object dtype, while HasCrCard and IsActiveMember are int64. We need to convert the categorical columns to the category dtype, which we can do with the astype() function.
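The conversion script itself is not shown in this excerpt; a minimal sketch on a toy frame (the toy values are assumptions, the column names match the dataset) might look like this:

```python
import pandas as pd

# toy stand-in for the four categorical columns named above
dataset = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany'],
                        'Gender': ['Female', 'Male', 'Female'],
                        'HasCrCard': [1, 0, 1],
                        'IsActiveMember': [1, 1, 0]})

categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
for category in categorical_columns:
    dataset[category] = dataset[category].astype('category')

print(dataset.dtypes)
```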

Now, if you print the dtypes of the columns in the dataset again, you will see the following result:

Output

 RowNumber             int64
CustomerId            int64
Surname              object
CreditScore           int64
Geography          category
Gender             category
Age                   int64
Tenure                int64
Balance             float64
NumOfProducts         int64
HasCrCard          category
IsActiveMember     category
EstimatedSalary     float64
Exited                int64
dtype: object 

Now let’s look at all the categories in the Geography column:

Index(['France', 'Germany', 'Spain'], dtype='object') 

When you change the data type of a column to category, each category in the column is assigned a unique code. For example, let’s print the first five rows of the Geography column, and then the code values of those same rows:

Output:

 0    France
1     Spain
2    France
3    France
4     Spain
Name: Geography, dtype: category
Categories (3, object): [France, Germany, Spain] 

The following script prints the codes for the values in the first five rows of the Geography column:

Output:

 0    0
1    2
2    0
3    0
4    2
dtype: int8 

The output shows that France is coded as 0 and Spain is coded as 2.
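That mapping can be reproduced on a small toy series; fixing the category order to France, Germany, Spain (an assumption matching the categories listed above) yields the same codes:

```python
import pandas as pd

geo = pd.Series(pd.Categorical(['France', 'Spain', 'France', 'France', 'Spain'],
                               categories=['France', 'Germany', 'Spain']))
print(geo.cat.codes)  # France -> 0, Spain -> 2
```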

The basic purpose of separating the categorical columns from the numerical ones is that values in numerical columns can be fed into the neural network directly, while values in categorical columns must first be converted to a numerical type. Encoding the values of the categorical columns, as above, partly solves this conversion task.

Since we will use PyTorch to train the model, we need to convert both the categorical and numerical columns into tensors. First, let’s convert the categorical columns. In PyTorch, tensors can be created from numpy arrays. We first convert the data in the four categorical columns into numpy arrays, then stack all the columns horizontally, as shown in the following script:

 geo = dataset['Geography'].cat.codes.values
... 

The script above prints the first ten records from the stacked categorical columns. The output is as follows:

 array([[0, 0, 1, 1],
       [2, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [2, 0, 1, 1],
       [2, 1, 1, 0],
       [0, 1, 1, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 1, 1, 1]], dtype=int8) 

Now, to create a tensor from the numpy array above, you simply pass the array to the tensor class of the torch module.

Output:

 tensor([[0, 0, 1, 1],
        [2, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [2, 0, 1, 1],
        [2, 1, 1, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 1, 1, 1]]) 
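The two steps (stacking the per-column category codes, then wrapping them in a tensor) can be reproduced end-to-end on toy data; the frame below and the int64 dtype are assumptions chosen to match the outputs shown:

```python
import numpy as np
import pandas as pd
import torch

df = pd.DataFrame({
    'Geography': pd.Categorical(['France', 'Spain'], categories=['France', 'Germany', 'Spain']),
    'Gender': pd.Categorical(['Female', 'Male'], categories=['Female', 'Male']),
    'HasCrCard': pd.Categorical([1, 0], categories=[0, 1]),
    'IsActiveMember': pd.Categorical([1, 1], categories=[0, 1]),
})
cols = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

# stack the per-column codes as columns of one (rows x 4) array
categorical_data = np.stack([df[c].cat.codes.values for c in cols], 1)
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)
print(categorical_data)
```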

In the output, you can see that the numpy array of categorical data has been converted into a tensor object. In the same way, we can convert the numerical columns into a tensor:

 numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
... 

Output:

 tensor([[6.1900e+02, 4.2000e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00, 1.0135e+05],
        [6.0800e+02, 4.1000e+01, 1.0000e+00, 8.3808e+04, 1.0000e+00, 1.1254e+05],
        [5.0200e+02, 4.2000e+01, 8.0000e+00, 1.5966e+05, 3.0000e+00, 1.1393e+05],
        [6.9900e+02, 3.9000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 9.3827e+04],
        [8.5000e+02, 4.3000e+01, 2.0000e+00, 1.2551e+05, 1.0000e+00, 7.9084e+04]]) 

In the output, you can see the first five rows, containing the values of the six numerical columns of our dataset. The final step is to convert the output numpy array into a tensor object.

Output:

 tensor([1, 0, 1, 0, 0]) 
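That conversion can be sketched on toy values (the flatten() call is an assumption consistent with the one-dimensional shape shown below):

```python
import pandas as pd
import torch

exited = pd.Series([1, 0, 1, 0, 0])  # toy Exited values
outputs = torch.tensor(exited.values).flatten()
print(outputs)  # tensor([1, 0, 1, 0, 0])
```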

Now, let’s print the shapes of the categorical data, the numerical data, and the corresponding outputs:

Output:

 torch.Size([10000, 4])
torch.Size([10000, 6])
torch.Size([10000]) 

Before training the model, there is one more very important step. We converted the categorical columns to numerical values in which each unique value is represented by a single integer; for example, in the Geography column, France is represented by 0 and Germany by 1. We could train our model on these values directly, but a better approach is to represent each value of a categorical column as an N-dimensional vector (an embedding) rather than a single integer.

We need to define the embedding size (vector dimension) for each categorical column. There is no strict rule for the dimension, but a good rule of thumb is to take half the number of unique values in the column (capped at 50). For example, the Geography column has 3 unique values, so its embedding size is 3 / 2 = 1.5, rounded up to 2. The following script creates tuples containing the number of unique values and the embedding size for every categorical column:

 categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
... 

Output:

 [(3, 2), (2, 1), (2, 1), (2, 1)] 
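The rule of thumb can be written out explicitly; the min(50, (n + 1) // 2) form below is an assumption that reproduces the printed sizes:

```python
# number of unique values per categorical column:
# Geography, Gender, HasCrCard, IsActiveMember
categorical_column_sizes = [3, 2, 2, 2]
categorical_embedding_sizes = [(n, min(50, (n + 1) // 2))
                               for n in categorical_column_sizes]
print(categorical_embedding_sizes)  # [(3, 2), (2, 1), (2, 1), (2, 1)]
```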

Supervised deep learning models, such as the one we are developing in this article, are trained on training data, and their performance is evaluated on a test set. We therefore need to divide our dataset into a training set and a test set, as in the following script:

 total_records = 10000
.... 

There are 10000 records in our dataset; 80% of them (8000 records) will be used to train the model, and the remaining 20% to evaluate its performance. Note that in the script above, the categorical data, the numerical data, and the outputs are each divided into training and test sets. To verify that we have divided the data correctly:

 print(len(categorical_train_data))
print(len(numerical_train_data))
print(len(train_outputs))

print(len(categorical_test_data))
print(len(numerical_test_data))
print(len(test_outputs)) 

Output:

 8000
8000
8000
2000
2000
2000
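The 80/20 split logic above can be sketched on a small toy tensor (10 rows instead of 10000):

```python
import torch

total_records = 10
test_records = int(total_records * .2)

data = torch.arange(total_records)
train_data = data[:total_records - test_records]
test_data = data[total_records - test_records:total_records]
print(len(train_data), len(test_data))  # 8 2
```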

Create forecast model

We have divided the data into training and test sets; now it is time to define our model. To do this, we define a class named Model that will be used to train it. Look at the following script:

 class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        x = self.layers(x)
        return x 

Next, to find the size of the input layer, the number of embedding dimensions of the categorical columns and the number of numerical columns are added together and stored in the input_size variable. After that, a for loop iterates over the hidden layer sizes and appends the corresponding layers to the all_layers list. The layers added are:

  • Linear: computes the dot product between the inputs and the weight matrix
  • ReLU: the activation function
  • BatchNorm1d: applies batch normalization
  • Dropout: helps avoid overfitting

After the for loop, the output layer is appended to the list of layers. Since we want all the layers of the neural network to execute in sequence, the list of layers is passed to the nn.Sequential class.

Next, in the forward method, both the categorical and numerical columns are passed as inputs. The categorical columns are embedded in the following lines:

`embeddings = []
…`

Batch normalization is applied to the numerical columns with the following line:

x_numerical = self.batch_norm_num(x_numerical)

Finally, the embedded categorical columns x and the numerical columns x_numerical are concatenated together and passed to the sequential layers.
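The BatchNorm1d step can be illustrated in isolation (the batch size of 8 is a toy value; the 6 features match the number of numerical columns):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(6)           # one entry per numerical column
x_numerical = torch.randn(8, 6)  # a batch of 8 rows with 6 numerical features
out = bn(x_numerical)
print(out.shape)  # torch.Size([8, 6])
```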

Training model

To train the model, we must first create an object of the Model class defined in the previous section.

You can see that we pass in the embedding sizes of the categorical columns, the number of numerical columns, the output size (2 in our case), and the sizes of the hidden layers. Here we have three hidden layers with 200, 100, and 50 neurons.
Let’s output the model and see:

 print(model) 

Output:

 Model(
  (all_embeddings): ModuleList(
 ...
  )
) 

You can see that the in_features value of the first linear layer is 11, because we have six numerical columns and the sum of the embedding dimensions of the categorical columns is 5, so 6 + 5 = 11. The out_features value of the final layer is 2, because we have only two possible outputs.

Before training the model, we need to define the loss function and the optimizer that will be used to train the model. The following script defines the loss function and optimizer:

 loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 

Now, let’s train the model. The following script does so:

 epochs = 300
aggregated_losses = []

for i in range(epochs):
    i += 1
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss.item())
    if i % 25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')
    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}') 

The number of epochs is set to 300, which means that the complete dataset will be passed through the model 300 times during training. The for loop runs once per epoch; in each iteration, the loss is calculated with the loss function and appended to the aggregated_losses list, and the optimizer then updates the weights.

The output of the above script is as follows:

`epoch: 1 loss: 0.71847951
epoch: 26 loss: 0.57145703
epoch: 51 loss: 0.48110831
epoch: 76 loss: 0.42529839
epoch: 101 loss: 0.39972275
epoch: 126 loss: 0.37837571
epoch: 151 loss: 0.37133673
epoch: 176 loss: 0.36773482
epoch: 201 loss: 0.36305946
epoch: 226 loss: 0.36079505
epoch: 251 loss: 0.35350436
epoch: 276 loss: 0.35540250
epoch: 300 loss: 0.3465710580`

The following script plots the loss for each epoch:

`plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');`

Output:

[Figure: training loss vs. epoch]

The output shows that the loss decreases rapidly at first; after roughly epoch 250, there is very little further reduction.

Make predictions

The last step is to make predictions on the test data. To do this, we simply pass categorical_test_data and numerical_test_data to the model object. The returned values can then be compared with the actual test outputs. The following script makes predictions on the test set and prints the cross-entropy loss on the test data.

 with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}') 

Output:

 Loss: 0.36855841 

The loss on the test set is 0.3685, slightly higher than the 0.3465 obtained on the training set, which indicates that our model is slightly overfitting. Since we specified that the output layer contains two neurons, each prediction contains two values. For example, here are the first five predictions:

 print(y_val[:5]) 

Output:

 tensor([[ 1.2045, -1.3857],
        [ 1.3911, -1.5957],
        [ 1.2781, -1.3598],
        [ 0.6261, -0.5429],
        [ 2.5430, -1.9991]]) 

The idea behind these predictions is that if the actual output is 0, the value at index 0 should be greater than the value at index 1, and vice versa. We can retrieve the index of the largest value in each row with the following script:

 y_val = np.argmax(y_val, axis=1) 

Now let’s print the first five values of y_val again:

 print(y_val[:5]) 

Output:

 tensor([0, 0, 0, 0, 0]) 

Because, in the originally predicted output, the value at index 0 was greater than the value at index 1 for the first five records, the processed output contains 0 in its first five entries.
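The argmax step can be checked on a toy prediction tensor (torch.argmax is the tensor-native equivalent of the np.argmax call above):

```python
import torch

y_val = torch.tensor([[1.2045, -1.3857],   # index 0 larger -> predict 0
                      [-0.5429, 0.6261]])  # index 1 larger -> predict 1
preds = torch.argmax(y_val, dim=1)
print(preds)  # tensor([0, 1])
```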

Finally, we can use the confusion_matrix, accuracy_score, and classification_report classes from the sklearn.metrics module to compute the confusion matrix and the accuracy, precision, and recall values for our test set.

`from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs,y_val))
print(classification_report(test_outputs,y_val))
print(accuracy_score(test_outputs, y_val))`

Output:

`[[1527   83]
 [ 224  166]]

              precision    recall  f1-score   support

           0       0.87      0.95      0.91      1610
           1       0.67      0.43      0.52       390

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.77      0.69      0.71      2000
weighted avg       0.83      0.85      0.83      2000

0.8465`

The results show that our model achieves an accuracy of 84.65%, which is quite impressive given that we picked all the parameters of the neural network more or less at random. I suggest you try varying the model parameters, such as the train/test split and the number and size of the hidden layers, to see whether you can obtain better results.

Conclusion

PyTorch is a popular deep learning library developed by Facebook that can be used for various tasks such as classification, regression, and clustering. This article has shown how to use the PyTorch library to classify tabular data.

