# Bank customer churn prediction with a PyTorch neural network classifier in Python

Time: 2021-3-18

### Link to the original text: http://tecdat.cn/?p=8522

Classification problems belong to the category of supervised machine learning problems: given a set of features, the task is to predict a discrete value. Common examples of classification problems are predicting whether a tumor is cancerous or whether a student is likely to pass an exam. In this article, given a number of characteristics of bank customers, we will predict whether a customer is likely to leave the bank after six months. The phenomenon of customers leaving an organization is called customer churn, so our task is to predict customer churn from various customer characteristics.

`` $ pip install torch ``

# Dataset

Let's import the required libraries and the dataset into our Python application:

```python
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

We can use the `read_csv()` method of the `pandas` library to import the CSV file that contains our dataset.

`` dataset = pd.read_csv(r'E:\Datasets\customer_data.csv') ``

Let’s output the dataset:

`` dataset.shape ``

Output:

`` (10000, 14) ``

The output shows that the dataset has 10,000 records and 14 columns. We can use the `head()` method of the data frame to print its first five rows.

`` dataset.head() ``

Output:

You can see 14 columns in our dataset. Based on the first 13 columns, our task is to predict the value of the 14th column, i.e. `Exited`.

# Exploratory data analysis

Let's do some exploratory data analysis on the dataset. We will first look at the proportion of customers who actually left the bank after six months and visualize it with a pie chart. Let's first increase the default plot size:

```python
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
```

The following script draws a pie chart of the `Exited` column.

`` dataset.Exited.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=['skyblue', 'orange'], explode=(0.05, 0.05)) ``

Output:

The output shows that 20% of the customers in our dataset left the bank. Here 1 represents a customer who left the bank and 0 a customer who did not. Let's plot the number of customers in each geographic location in the dataset:
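The plotting code is not shown in this excerpt; a minimal sketch on a toy frame (the article runs this on the full 10,000-row dataset) might be:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the 10,000-row customer frame.
dataset = pd.DataFrame({
    'Geography': ['France', 'Spain', 'France', 'Germany', 'France', 'Spain'],
})

# Count customers per country and draw the counts as a bar chart.
geo_counts = dataset['Geography'].value_counts()
geo_counts.plot(kind='bar', color=['skyblue', 'orange', 'green'])
plt.title('Number of customers per geography')
```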

The output shows that almost half of the customers are from France, compared with 25% in Spain and 25% in Germany.

Now, let's plot the customer counts together with the churn information for each geographic location. We can use the `countplot()` function from the `seaborn` library to do this.
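The call is presumably `sns.countplot(x='Geography', hue='Exited', data=dataset)`; the same counts seaborn would draw can be tabulated without it, as in this toy sketch:

```python
import pandas as pd

# Toy stand-in for the customer frame.
dataset = pd.DataFrame({
    'Geography': ['France', 'France', 'Germany', 'Germany', 'Spain', 'Spain'],
    'Exited':    [0, 1, 0, 1, 0, 0],
})

# One row per country, one column per Exited value (0 = stayed, 1 = left).
churn_by_geo = pd.crosstab(dataset['Geography'], dataset['Exited'])
print(churn_by_geo)
```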

The output shows that although the total number of French customers is twice that of Spanish or German customers, the number of French and German customers who left the bank is the same. Similarly, the total numbers of German and Spanish customers are the same, but twice as many German customers left the bank as Spanish customers, which indicates that German customers are more likely to leave the bank after six months.

# Data preprocessing

Before training the PyTorch model, we need to preprocess the data. If you look at the dataset, you will see that it has two types of columns: numerical and categorical. Numerical columns contain numerical information, such as `CreditScore`, `Balance`, `Age`, and so on. `Geography` and `Gender`, on the other hand, are categorical columns, since they contain categorical information such as the customer's location and gender. A few columns could be treated as either. For example, the `HasCrCard` column can only take the values 1 or 0, yet it conveys categorical information about whether the customer has a credit card.

Let's list all the columns in the dataset again and work out which ones should be treated as numerical and which as categorical. The `columns` attribute of the data frame displays all the column names:

``Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')``

Of these columns, we will not use `RowNumber`, `CustomerId`, and `Surname`, since their values are completely random and unrelated to the output. For example, a customer's surname has no effect on whether the customer leaves the bank. Among the remaining columns, `Geography`, `Gender`, `HasCrCard`, and `IsActiveMember` can be treated as categorical columns. Let's create a list of these columns; all the other columns will be treated as numerical columns.
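Given the column names above, that list presumably looks like:

```python
# Columns to treat as categorical, per the discussion above.
categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
```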

```python
numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
```

Finally, the output (the values in the `Exited` column) will be stored in the `outputs` variable.

We have created lists of the categorical columns, the numerical columns, and the output column. At the moment, however, the categorical columns are not yet of a categorical type. You can check the types of all the columns in the dataset with the following script:
Output:

```
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
```

You can see that the type of the `Geography` and `Gender` columns is object, while that of the `HasCrCard` and `IsActiveMember` columns is int64. We need to convert the types of the categorical columns to `category`. We can do this with the `astype()` function:
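A sketch of that conversion, looping over the categorical column names on a toy frame:

```python
import pandas as pd

dataset = pd.DataFrame({
    'Geography': ['France', 'Spain', 'France'],
    'Gender': ['Female', 'Male', 'Male'],
    'HasCrCard': [1, 0, 1],
    'IsActiveMember': [1, 1, 0],
})

# Convert each categorical column to the pandas 'category' dtype.
categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
for category in categorical_columns:
    dataset[category] = dataset[category].astype('category')

print(dataset.dtypes)
```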

Now, if you print the types of the columns in the dataset again, you will see the following results:

Output:

```
RowNumber             int64
CustomerId            int64
Surname              object
CreditScore           int64
Geography          category
Gender             category
Age                   int64
Tenure                int64
Balance             float64
NumOfProducts         int64
HasCrCard          category
IsActiveMember     category
EstimatedSalary     float64
Exited                int64
dtype: object
```

Now let's look at all the categories in the `Geography` column:

``Index(['France', 'Germany', 'Spain'], dtype='object') ``

When you change a column's data type to category, each category in the column is assigned a unique code. For example, let's print the first five rows of the `Geography` column:

Output:

```
0    France
1     Spain
2    France
3    France
4     Spain
Name: Geography, dtype: category
Categories (3, object): [France, Germany, Spain]
```

The following script prints the codes for the values in the first five rows of the `Geography` column:
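That script is presumably a `cat.codes` lookup; this toy sketch pins the category order to all three countries so that France maps to 0, Germany to 1, and Spain to 2, as in the article's dataset:

```python
import pandas as pd

# Fix the category order so the integer codes match the article's run.
geo_type = pd.CategoricalDtype(categories=['France', 'Germany', 'Spain'])
geography = pd.Series(['France', 'Spain', 'France', 'France', 'Spain'], dtype=geo_type)

# .cat.codes replaces each category with its integer code.
print(geography.head().cat.codes)
```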

Output:

```
0    0
1    2
2    0
3    0
4    2
dtype: int8
```

The output shows that France is coded as 0 and Spain is coded as 2.

The basic purpose of separating the categorical columns from the numerical columns is that values in the numerical columns can be fed directly into the neural network, whereas values in the categorical columns must first be converted to a numeric type. Encoding the values in the categorical columns partly solves this conversion task.

Since we will use PyTorch to train the model, we need to convert the categorical and numerical columns into tensors. First, let's convert the categorical columns. In PyTorch, tensors can be created from numpy arrays. We will first convert the data in the four categorical columns into numpy arrays and then stack all the columns horizontally, as shown in the following script:

```python
geo = dataset['Geography'].cat.codes.values
...
```

The script above prints the first ten records of the stacked categorical columns. The output is as follows:

```
array([[0, 0, 1, 1],
       [2, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [2, 0, 1, 1],
       [2, 1, 1, 0],
       [0, 1, 1, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 1, 1, 1]], dtype=int8)
```

Now, to create a tensor from the above numpy array, you simply pass the array to the `tensor` class of the `torch` module:

Output:

```
tensor([[0, 0, 1, 1],
        [2, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [2, 0, 1, 1],
        [2, 1, 1, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 1, 1, 1]])
```

In the output, you can see that the numpy array of categorical data has now been converted into a `tensor` object. Similarly, we can convert the numerical columns into a tensor:

```python
numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
...
```

Output:

```
tensor([[6.1900e+02, 4.2000e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00, 1.0135e+05],
        [6.0800e+02, 4.1000e+01, 1.0000e+00, 8.3808e+04, 1.0000e+00, 1.1254e+05],
        [5.0200e+02, 4.2000e+01, 8.0000e+00, 1.5966e+05, 3.0000e+00, 1.1393e+05],
        [6.9900e+02, 3.9000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 9.3827e+04],
        [8.5000e+02, 4.3000e+01, 2.0000e+00, 1.2551e+05, 1.0000e+00, 7.9084e+04]])
```

In the output, you can see the first five rows, containing the values of the six numerical columns in our dataset. The final step is to convert the outputs numpy array into a `tensor` object.

Output:

`` tensor([1, 0, 1, 0, 0]) ``
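The conversion of the labels might look like this sketch:

```python
import pandas as pd
import torch

dataset = pd.DataFrame({'Exited': [1, 0, 1, 0, 0]})

# Flatten to a 1-D tensor of integer class labels, which is what
# CrossEntropyLoss expects as targets.
outputs = torch.tensor(dataset['Exited'].values).flatten()
print(outputs)
```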

Now let's print the shapes of the categorical data, the numerical data, and the corresponding outputs:

Output:

```
torch.Size([10000, 4])
torch.Size([10000, 6])
torch.Size([10000])
```

Before training the model, there is one very important step left. We converted the categorical columns to numeric values, where each unique value is represented by a single integer. For example, in the `Geography` column we saw that France is represented by 0 and Germany by 1. We could use these values directly to train our model. However, a better approach is to represent the values of a categorical column as an N-dimensional vector, called an embedding, instead of a single integer.

We need to define the embedding size (the vector dimensions) for each categorical column. There is no strict rule for this dimension, but a good rule of thumb is to divide the number of unique values in the column by 2 (without exceeding 50). For example, the `Geography` column has 3 unique values, so its embedding size is 3 / 2 = 1.5, rounded up to 2. The following script creates a tuple containing the number of unique values and the embedding size for each categorical column:

```python
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
...
```

Output:

`` [(3, 2), (2, 1), (2, 1), (2, 1)] ``
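The rule of thumb can be written directly; this sketch recomputes the tuples above from the category counts:

```python
# Unique-value counts of the four categorical columns, as printed above.
categorical_column_sizes = [3, 2, 2, 2]

# Embedding size = half the category count, rounded up, capped at 50.
categorical_embedding_sizes = [(col_size, min(50, (col_size + 1) // 2))
                               for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)
```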

Supervised deep learning models, such as the one we are developing in this article, are trained on training data, and their performance is evaluated on a test dataset. Therefore, we need to divide the dataset into a training set and a test set, as shown in the following script:

```python
total_records = 10000
...
```

There are 10,000 records in our dataset, of which 80% (8,000 records) will be used to train the model and the remaining 20% to evaluate its performance. Note that in the script above, the categorical and numerical data, as well as the outputs, are divided into training and test sets. To verify that we have divided the data correctly:
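The slicing itself is elided above; assuming the simple head/tail split (no shuffling) that the 80/20 description implies, it might look like this on a 10-record toy set:

```python
import torch

total_records = 10
test_records = int(total_records * .2)

# Stand-ins for the tensors built earlier.
categorical_data = torch.arange(total_records * 4).reshape(total_records, 4)
numerical_data = torch.rand(total_records, 6)
outputs = torch.randint(0, 2, (total_records,))

# First 80% of the rows for training, last 20% for testing.
categorical_train_data = categorical_data[:total_records - test_records]
categorical_test_data = categorical_data[total_records - test_records:]
numerical_train_data = numerical_data[:total_records - test_records]
numerical_test_data = numerical_data[total_records - test_records:]
train_outputs = outputs[:total_records - test_records]
test_outputs = outputs[total_records - test_records:]
```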

```python
print(len(categorical_train_data))
print(len(numerical_train_data))
print(len(train_outputs))

print(len(categorical_test_data))
print(len(numerical_test_data))
print(len(test_outputs))
```

Output:

```
8000
8000
8000
2000
2000
2000
```

# Creating the prediction model

We have divided the data into training and test sets. Now it is time to define our model. To do this, we define a class named `Model` that will be used to train the model. Look at the following script:

```python
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        # ... (the layer construction and the forward() method are elided here)
```

Next, to find the size of the input layer, the number of categorical and numerical columns is added together and stored in the `input_size` variable. After that, a `for` loop iterates and appends the corresponding layers to the `all_layers` list. The layers added are:

• `Linear`: computes the dot product between the inputs and the weight matrix
• `ReLU`: used as the activation function
• `BatchNorm1d`: applies batch normalization to the inputs
• `Dropout`: used to avoid overfitting

After the `for` loop, the output layer is appended to the list of layers. Since we want all layers in the neural network to execute in sequence, we pass the list of layers to the `nn.Sequential` class.

Next, in the `forward` method, both the categorical and the numerical columns are passed as inputs. The embedding of the categorical columns takes place in the following lines:

```python
embeddings = []
...
```

Batch normalization is applied to the numerical columns with the following script:

`x_numerical = self.batch_norm_num(x_numerical)`

Finally, the embedded categorical columns `x` and the numerical columns `x_numerical` are concatenated and passed to the sequential `layers`.
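The full class is elided in the excerpts above. Reconstructed from the description (embedding layers, dropout, batch norm, a `for` loop building `Linear`/`ReLU`/`BatchNorm1d`/`Dropout` blocks, and a `forward` that concatenates embeddings with the normalized numerical features), it might look like this sketch:

```python
import torch
import torch.nn as nn

class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding per categorical column: (num_categories, embedding_dim).
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        num_categorical_cols = sum(nf for ni, nf in embedding_size)  # total embedding width
        input_size = num_categorical_cols + num_numerical_cols

        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding of each categorical column and concatenate them.
        embeddings = [e(x_categorical[:, i]) for i, e in enumerate(self.all_embeddings)]
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Normalize the numerical features, then join them to the embeddings.
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        return self.layers(x)

# Quick shape check on random inputs.
model = Model([(3, 2), (2, 1), (2, 1), (2, 1)], 6, 2, [200, 100, 50])
model.eval()  # eval mode so dropout/batch norm behave deterministically
preds = model(torch.zeros(4, 4, dtype=torch.int64), torch.rand(4, 6))
print(preds.shape)  # torch.Size([4, 2])
```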

# Training the model

To train the model, we must first create an object of the `Model` class defined in the previous section.

You can see that we pass in the embedding sizes of the categorical columns, the number of numerical columns, the output size (2 in our case), and the neurons in the hidden layers. We have three hidden layers with 200, 100, and 50 neurons respectively.
Let’s output the model and see:

`` print(model) ``

Output:

```
Model(
  (all_embeddings): ModuleList(
    ...
  )
)
```

You can see that in the first linear layer the value of `in_features` is 11: we have six numerical columns, and the sum of the embedding dimensions of the categorical columns is 5 (2 + 1 + 1 + 1), so 6 + 5 = 11. The value of `out_features` is 2, because we have only two possible outputs.

Before training the model, we need to define the loss function and the optimizer that will be used to train the model. The following script defines the loss function and optimizer:

`` loss_function = nn.CrossEntropyLoss() ``
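The optimizer itself is not shown in this excerpt. A common choice, often paired with this kind of tutorial model, is Adam with a small learning rate; treat the exact settings here as an assumption:

```python
import torch
import torch.nn as nn

# Stand-in model; the article's Model instance would be used here.
model = nn.Linear(11, 2)

loss_function = nn.CrossEntropyLoss()
# Assumed optimizer settings: Adam with lr=0.001.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```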

Now let's train the model with the following script:

```python
epochs = 300
aggregated_losses = []

for i in range(epochs):
    # ... (forward pass, loss computation, and optimizer step are elided here)
    print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
```

The number of epochs is set to 300, which means that the complete dataset will be used 300 times to train the model. In each iteration of the `for` loop, the loss is computed by the loss function and appended to the `aggregated_losses` list.

The output of the above script is as follows:

```
epoch:   1 loss: 0.71847951
epoch:  26 loss: 0.57145703
epoch:  51 loss: 0.48110831
epoch:  76 loss: 0.42529839
epoch: 101 loss: 0.39972275
epoch: 126 loss: 0.37837571
epoch: 151 loss: 0.37133673
epoch: 176 loss: 0.36773482
epoch: 201 loss: 0.36305946
epoch: 226 loss: 0.36079505
epoch: 251 loss: 0.35350436
epoch: 276 loss: 0.35540250
epoch: 300 loss: 0.3465710580
```
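The elided body of the training loop presumably follows the standard PyTorch pattern (forward pass, loss, `zero_grad`, `backward`, `step`). A self-contained sketch on stand-in data and a stand-in model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data and model; the article would use the tensors built earlier.
X = torch.rand(64, 11)
y = torch.randint(0, 2, (64,))
model = nn.Sequential(nn.Linear(11, 16), nn.ReLU(), nn.Linear(16, 2))

loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 50
aggregated_losses = []

for i in range(1, epochs + 1):
    y_pred = model(X)                      # forward pass
    single_loss = loss_function(y_pred, y)
    aggregated_losses.append(single_loss.item())

    optimizer.zero_grad()                  # clear old gradients
    single_loss.backward()                 # backpropagate
    optimizer.step()                       # update the weights

    if i % 25 == 0:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')
```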

The following script plots the losses against the epochs:

```python
plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');
```

Output:

The output shows that the loss decreases rapidly at first; after about epoch 250 there is very little further reduction.

# Making predictions

The last step is to make predictions on the test data. To do this, we simply pass `categorical_test_data` and `numerical_test_data` to the `model` object and compare the returned values with the actual test output values. The following script predicts the test classes and prints the cross-entropy loss for the test data.

```python
with torch.no_grad():
    ...
```
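The body of the `with` block is elided above; a sketch of what the evaluation presumably does, using a stand-in single-layer model in place of the trained one (the article's model also takes the categorical tensor as input):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the trained model and the held-out test tensors.
model = nn.Sequential(nn.Linear(11, 2))
numerical_test_data = torch.rand(5, 11)
test_outputs = torch.tensor([1, 0, 1, 0, 0])

loss_function = nn.CrossEntropyLoss()

# no_grad() disables gradient tracking; only forward passes are needed here.
with torch.no_grad():
    y_val = model(numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss.item():.8f}')
```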

Output:

`` Loss: 0.36855841 ``

The loss on the test set is 0.3685, slightly higher than the 0.3465 achieved on the training set, which indicates that our model is somewhat overfitting. Since we specified that the output layer contains two neurons, each prediction contains two values. For example, the first five predicted values look as follows:

`` print(y_val[:5]) ``

Output:

```
tensor([[ 1.2045, -1.3857],
        [ 1.3911, -1.5957],
        [ 1.2781, -1.3598],
        [ 0.6261, -0.5429],
        [ 2.5430, -1.9991]])
```

The idea behind these predictions is that if the actual output is 0, the value at index 0 should be greater than the value at index 1, and vice versa. We can retrieve the index of the largest value in each row with the following script:

`` y_val = np.argmax(y_val, axis=1) ``

Now let's print the first five values of the `y_val` list again:

`` print(y_val[:5]) ``

Output:

`` tensor([0, 0, 0, 0, 0]) ``

Because in the originally predicted output the value at index 0 was greater than the value at index 1 for each of the first five records, you see 0 in the first five rows of the processed output.

Finally, we can use the `confusion_matrix`, `accuracy_score`, and `classification_report` utilities from the `sklearn.metrics` module to find the accuracy, precision, and recall values, as well as the confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs, y_val))
print(classification_report(test_outputs, y_val))
print(accuracy_score(test_outputs, y_val))
```

Output:

```
[[1527   83]
 [ 224  166]]

              precision    recall  f1-score   support

           0       0.87      0.95      0.91      1610
           1       0.67      0.43      0.52       390

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.77      0.69      0.71      2000
weighted avg       0.83      0.85      0.83      2000

0.8465
```

The output shows that our model achieves an accuracy of 84.65%, which is quite impressive given that we picked all the parameters of the neural network model rather arbitrarily. I suggest you try changing the model parameters, such as the train/test split and the number and size of the hidden layers, to see if you can obtain better results.

# Conclusion

PyTorch is a popular deep learning library developed by Facebook that can be used for a variety of tasks such as classification, regression, and clustering. This article showed how to use the PyTorch library to classify tabular data.
