### Link to the original text: http://tecdat.cn/?p=8522

Classification problems belong to the category of supervised machine learning problems: given a set of features, the task is to predict a discrete value. Common examples of classification problems are predicting whether a tumor is cancerous or whether a student is likely to pass an exam. In this article, given a number of characteristics of bank customers, we will predict whether a customer is likely to leave the bank after six months. The phenomenon of customers leaving an organization is called customer churn, so our task is to predict customer churn based on various customer characteristics.

` $ pip install torch `

**Dataset**

Let’s import the required libraries and datasets into our Python application:

```
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

We can use the `read_csv()` method of the `pandas` library to import the CSV file that contains our dataset.

` dataset = pd.read_csv(r'E:\Datasets\customer_data.csv') `

Let's output the shape of the dataset:

` dataset.shape `

**Output:**

` (10000, 14) `

The output shows that the dataset has 10,000 records and 14 columns. We can use the `head()` method of the dataframe to output the first five rows of the dataset.

` dataset.head() `

**Output:**

You can see 14 columns in our dataset. Based on the first 13 columns, our task is to predict the value of the 14th column, i.e. `Exited`.

**Exploratory data analysis**

Let's do some exploratory data analysis on the dataset. We will first look at the proportion of customers who actually left the bank after six months and visualize it with a pie chart. Let's first increase the default plot size:

```
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
```

The following script plots a pie chart of the `Exited` column:

` dataset.Exited.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=['skyblue', 'orange'], explode=(0.05, 0.05)) `

**Output:**

The output shows that 20% of the customers in our dataset left the bank. Here, 1 represents a customer who left the bank, and 0 a customer who did not. Let's plot the number of customers from each geographic location in the dataset:

The output shows that almost half of the customers are from France, while Spain and Germany each account for about 25%.

Now, let's plot the number of customers and their churn status for each unique geographic location. We can use the `countplot()` function from the `seaborn` library to do this.
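The plotting call itself is not reproduced above; a minimal sketch of what it might look like (the toy `df` below stands in for the full dataset, so the counts will not match the article's figure):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# toy stand-in for the full dataset
df = pd.DataFrame({
    'Geography': ['France', 'Spain', 'France', 'Germany', 'Spain', 'France'],
    'Exited':    [1, 0, 0, 1, 0, 0],
})

# one group of bars per geography, split by churn status
ax = sns.countplot(x='Geography', data=df, hue='Exited')
```

In a notebook with `%matplotlib inline`, the bar chart is displayed directly below the cell.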

The output shows that although the total number of French customers is twice that of Spanish and German customers, the number of French and German customers who left the bank is the same. Similarly, the totals for German and Spanish customers are the same, but twice as many German customers left the bank as Spanish customers, which suggests that German customers are more likely to leave the bank after six months.

**Data preprocessing**

Before training the PyTorch model, we need to preprocess the data. If you look at the dataset, you will see that it has two types of columns: numerical and categorical. The numerical columns contain numeric information, such as `CreditScore`, `Balance`, and `Age`. Similarly, `Geography` and `Gender` are categorical columns because they contain categorical information such as the customer's location and gender. A few columns could be treated as either numerical or categorical. For example, the `HasCrCard` column takes the value 1 or 0, but semantically it is categorical: it tells us whether or not the customer has a credit card.

Let's output all the columns in the dataset again and decide which should be treated as numerical and which as categorical. The `columns` attribute of the dataframe displays all column names:

`Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')`

From our data columns, we will not use the `RowNumber`, `CustomerId`, and `Surname` columns, because their values are essentially random and independent of the output. For example, a customer's surname has no effect on whether the customer leaves the bank. Among the remaining columns, `Geography`, `Gender`, `HasCrCard`, and `IsActiveMember` can be treated as categorical columns. Let's create a list of these columns; all other remaining columns will be treated as numerical columns.
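The list itself is not shown in the text; based on the four columns just named, it would presumably be:

```python
# the four columns treated as categorical
categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
```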

` numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary'] `

Finally, the output (the values in the `Exited` column) is stored in the `outputs` variable.
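That assignment is not shown either; a minimal sketch, assuming the labels are kept as a NumPy array until the tensor conversion later on (the toy `df` stands in for the dataset; its values are the first five labels shown further below):

```python
import pandas as pd

# toy stand-in for the dataset
df = pd.DataFrame({'Exited': [1, 0, 1, 0, 0]})

# labels as a NumPy array; converted to a tensor later
outputs = df['Exited'].values
```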

We have created lists of the categorical columns, the numerical columns, and the output column. At the moment, however, the dtype of the categorical columns is not yet `category`. You can check the types of all the columns in the dataset with `dataset.dtypes`:

**Output:**

```
RowNumber int64
CustomerId int64
Surname object
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
Exited int64
dtype: object
```

You can see that the `Geography` and `Gender` columns have the type object, while `HasCrCard` and `IsActiveMember` have the type int64. We need to convert the type of the categorical columns to `category`, which we can do with the `astype()` function.
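A sketch of that conversion, looping over the categorical columns with `astype()` (a toy `df` with two of the columns stands in for the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany'],
    'Gender':    ['Female', 'Male', 'Male'],
})
categorical_columns = ['Geography', 'Gender']  # subset for illustration

# convert each categorical column's dtype to category
for category in categorical_columns:
    df[category] = df[category].astype('category')

print(df.dtypes)
```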

Now, if you print the types of the columns in the dataset again, you will see the following results:

**Output**

```
RowNumber int64
CustomerId int64
Surname object
CreditScore int64
Geography category
Gender category
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard category
IsActiveMember category
EstimatedSalary float64
Exited int64
dtype: object
```

Now let's look at all the categories in the `Geography` column:

`Index(['France', 'Germany', 'Spain'], dtype='object') `

When you change a column's data type to `category`, each category in the column is assigned a unique code. For example, let's first display the first five rows of the `Geography` column:

**Output:**

```
0 France
1 Spain
2 France
3 France
4 Spain
Name: Geography, dtype: category
Categories (3, object): [France, Germany, Spain]
```

The following script prints the codes of the values in the first five rows of the `Geography` column:
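That script is not reproduced here; it presumably reads the column's `cat.codes`, roughly like this (a toy series stands in for `dataset['Geography']`; with only two categories present the codes differ from the article's output, where Germany pushes Spain to code 2):

```python
import pandas as pd

# toy stand-in for dataset['Geography']
geography = pd.Series(['France', 'Spain', 'France']).astype('category')

# categories are coded in sorted order: France -> 0, Spain -> 1
print(geography.head().cat.codes)
```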

**Output:**

```
0 0
1 2
2 0
3 0
4 2
dtype: int8
```

The output shows that France is coded as 0 and Spain is coded as 2.

The basic purpose of separating the categorical columns from the numerical columns is that values in the numerical columns can be fed directly into the neural network, whereas the values of the categorical columns must first be converted to numeric types. Encoding the values in the categorical columns partly solves this conversion task.

Since we will use PyTorch to train the model, we need to convert the categorical and numerical columns into tensors. First, let's convert the categorical columns. In PyTorch, you can create tensors from NumPy arrays. We will first convert the data in the four categorical columns into NumPy arrays and then stack the columns horizontally, as shown in the following script:

```
geo = dataset['Geography'].cat.codes.values
gen = dataset['Gender'].cat.codes.values
hcc = dataset['HasCrCard'].cat.codes.values
iam = dataset['IsActiveMember'].cat.codes.values

# stack the four code arrays as columns of a single array
categorical_data = np.stack([geo, gen, hcc, iam], 1)
categorical_data[:10]
```

The script above outputs the first ten records of the categorical columns. The output is as follows:

**Output:**

```
array([[0, 0, 1, 1],
[2, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 0, 0],
[2, 0, 1, 1],
[2, 1, 1, 0],
[0, 1, 1, 1],
[1, 0, 1, 0],
[0, 1, 0, 1],
[0, 1, 1, 1]], dtype=int8)
```

Now, to create a tensor from the NumPy array above, you simply pass the array to the `tensor` class of the `torch` module.
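A sketch of that conversion, using a small array in place of the full `categorical_data` (the `dtype=torch.int64` matters later, because embedding layers expect integer indices):

```python
import numpy as np
import torch

# first two rows of the stacked category codes, for illustration
categorical_data = np.array([[0, 0, 1, 1],
                             [2, 0, 0, 1]], dtype=np.int8)

# convert the NumPy array to a PyTorch tensor of int64 indices
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)
print(categorical_data)
```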

**Output:**

```
tensor([[0, 0, 1, 1],
[2, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 0, 0],
[2, 0, 1, 1],
[2, 1, 1, 0],
[0, 1, 1, 1],
[1, 0, 1, 0],
[0, 1, 0, 1],
[0, 1, 1, 1]])
```

In the output, you can see that the NumPy array of categorical data has been converted to a `tensor` object. Similarly, we can convert the numerical columns into a tensor:

```
numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)
numerical_data[:5]
```

**Output:**

```
tensor([[6.1900e+02, 4.2000e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00, 1.0135e+05],
[6.0800e+02, 4.1000e+01, 1.0000e+00, 8.3808e+04, 1.0000e+00, 1.1254e+05],
[5.0200e+02, 4.2000e+01, 8.0000e+00, 1.5966e+05, 3.0000e+00, 1.1393e+05],
[6.9900e+02, 3.9000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 9.3827e+04],
[8.5000e+02, 4.3000e+01, 2.0000e+00, 1.2551e+05, 1.0000e+00, 7.9084e+04]])
```

In the output, you can see the first five rows, containing the values of the six numerical columns in our dataset. The final step is to convert the output NumPy array to a `tensor` object.

**Output:**

` tensor([1, 0, 1, 0, 0]) `

Now, let's print the shapes of the categorical data, the numerical data, and the corresponding outputs:

**Output:**

```
torch.Size([10000, 4])
torch.Size([10000, 6])
torch.Size([10000])
```

There is one very important step before we can train the model. We converted the categorical columns to numeric values, where each unique value is represented by a single integer. For example, in the `Geography` column, we saw that France is represented by 0 and Germany by 1. We could use these values directly to train our model; however, a better way is to represent the values of a categorical column as an N-dimensional vector rather than a single integer.
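In PyTorch, this N-dimensional representation is provided by `nn.Embedding`, a trainable lookup table from integer codes to vectors. A minimal illustration:

```python
import torch
import torch.nn as nn

# 3 unique categories (France, Germany, Spain), each mapped to a 2-d vector
emb = nn.Embedding(num_embeddings=3, embedding_dim=2)

codes = torch.tensor([0, 2, 0])  # France, Spain, France
vectors = emb(codes)
print(vectors.shape)
```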

We need to define the embedding size (vector dimension) for every categorical column. There is no strict rule for the number of dimensions; a good rule of thumb is to divide the number of unique values in the column by 2, but not to exceed 50. For example, the `Geography` column has 3 unique values, so its embedding size is 3 / 2 = 1.5, rounded up to 2. The following script creates a tuple of (number of unique values, embedding size) for every categorical column:

```
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size, min(50, (col_size + 1) // 2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)
```

**Output:**

` [(3, 2), (2, 1), (2, 1), (2, 1)] `

Supervised deep learning models, such as the one we develop in this article, are trained on training data, and their performance is evaluated on a test dataset. Therefore, we need to divide the dataset into a training set and a test set, as shown in the following script:

```
total_records = 10000
test_records = int(total_records * .2)

categorical_train_data = categorical_data[:total_records - test_records]
categorical_test_data = categorical_data[total_records - test_records:total_records]
numerical_train_data = numerical_data[:total_records - test_records]
numerical_test_data = numerical_data[total_records - test_records:total_records]
train_outputs = outputs[:total_records - test_records]
test_outputs = outputs[total_records - test_records:total_records]
```

There are 10,000 records in our dataset, of which 80% (8,000 records) will be used to train the model and the remaining 20% to evaluate its performance. Note that in the script above, the categorical data, the numerical data, and the outputs are each divided into training and test sets. To verify that we have divided the data correctly:

```
print(len(categorical_train_data))
print(len(numerical_train_data))
print(len(train_outputs))
print(len(categorical_test_data))
print(len(numerical_test_data))
print(len(test_outputs))
```

**Output:**

```
8000
8000
8000
2000
2000
2000
```

**Creating the prediction model**

We have divided the data into training and test sets. Now it's time to define the model. To do this, we can define a class named `Model`, which will be used to train the model. Look at the following script:

```
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)
        ...
```

Next, to find the size of the input layer, the sum of the embedding dimensions of the categorical columns and the number of numerical columns is stored in the `input_size` variable. After that, a `for` loop iterates over the hidden-layer sizes and appends the corresponding layers to the `all_layers` list. The layers added are:

- `Linear`: computes the dot product between the inputs and the weight matrix
- `ReLU`: the activation function
- `BatchNorm1d`: applies batch normalization
- `Dropout`: helps avoid overfitting

After the `for` loop, the output layer is appended to the list of layers. Since we want all the layers in the neural network to execute in sequence, we pass the layer list to `nn.Sequential`.

Next, in the `forward` method, both the categorical and the numerical columns are passed as inputs. The categorical columns are embedded in the following lines:

```
embeddings = []
...
```

Batch normalization of the numerical columns is applied with the following line:

`x_numerical = self.batch_norm_num(x_numerical)`

Finally, the embedded categorical columns `x` and the numerical columns `x_numerical` are concatenated and passed to the sequential `layers`.
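Putting together the pieces described above, the complete class might look like the following sketch; treat it as an illustration of the architecture described in this section, not necessarily the article's exact code:

```python
import torch
import torch.nn as nn

class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        # input size = total embedding dimension + number of numerical columns
        input_size = sum(nf for ni, nf in embedding_size) + num_numerical_cols
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i
        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # embed each categorical column and concatenate the results
        embeddings = [e(x_categorical[:, i]) for i, e in enumerate(self.all_embeddings)]
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)
        # batch-normalize the numerical columns, then join the two parts
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        return self.layers(x)

# embedding sizes computed earlier: [(3, 2), (2, 1), (2, 1), (2, 1)]
model = Model([(3, 2), (2, 1), (2, 1), (2, 1)], 6, 2, [200, 100, 50], p=0.4)
```

With these sizes, the first linear layer sees 2 + 1 + 1 + 1 = 5 embedding dimensions plus 6 numerical columns, i.e. 11 input features, which matches the model summary printed below.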

**Training the model**

To train the model, we must first create an object of the `Model` class defined in the previous section.

You can see that we pass in the embedding sizes of the categorical columns, the number of numerical columns, the output size (2 in our case), and the neurons in the hidden layers. Here we use three hidden layers with 200, 100, and 50 neurons respectively.

Let’s output the model and see:

` print(model) `

**Output:**

```
Model(
(all_embeddings): ModuleList(
...
)
)
```

You can see that in the first linear layer the value of `in_features` is 11: we have six numerical columns, and the sum of the embedding dimensions of the categorical columns is 5, so 6 + 5 = 11. The value of `out_features` in the last layer is 2, because we have only two possible outputs.

Before training the model, we need to define the loss function and the optimizer that will be used to train the model. The following script defines the loss function and optimizer:

```
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # the optimizer is not shown above; Adam is a common choice
```

Now, let’s train the model. The following script training model:

```
epochs = 300
aggregated_losses = []

for i in range(1, epochs + 1):
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss.item())

    if i % 25 == 1 or i == epochs:
        print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()
```

The number of epochs is set to 300, which means that the complete dataset will be used 300 times to train the model. The `for` loop executes once per epoch; during each iteration, the loss is calculated with the loss function and appended to the `aggregated_losses` list.

The output of the above script is as follows:

```
epoch:   1 loss: 0.71847951
epoch:  26 loss: 0.57145703
epoch:  51 loss: 0.48110831
epoch:  76 loss: 0.42529839
epoch: 101 loss: 0.39972275
epoch: 126 loss: 0.37837571
epoch: 151 loss: 0.37133673
epoch: 176 loss: 0.36773482
epoch: 201 loss: 0.36305946
epoch: 226 loss: 0.36079505
epoch: 251 loss: 0.35350436
epoch: 276 loss: 0.35540250
epoch: 300 loss: 0.3465710580
```

The following script plots the loss function for each period:

```
plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');
```

**Output:**

The output shows that the loss decreases rapidly at first; after about 250 epochs there is little further reduction.

**Make predictions**

The last step is to make predictions on the test data. To do this, we simply pass `categorical_test_data` and `numerical_test_data` to the `model` object. The returned values can then be compared with the actual test output values. The following script makes predictions on the test set and prints the cross-entropy loss of the test data.

```
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
```

**Output:**

` Loss: 0.36855841 `

The loss on the test set is 0.3685, slightly higher than the 0.3465 achieved on the training set, which indicates that our model is overfitting slightly. Since we specified that the output layer contains two neurons, each prediction contains two values. For example, the first five predictions look like this:

` print(y_val[:5]) `

**Output:**

```
tensor([[ 1.2045, -1.3857],
[ 1.3911, -1.5957],
[ 1.2781, -1.3598],
[ 0.6261, -0.5429],
[ 2.5430, -1.9991]])
```

The idea behind these predictions is that if the actual output is 0, the value at index 0 should be greater than the value at index 1, and vice versa. We can retrieve the index of the largest value in each row with the following script:

` y_val = np.argmax(y_val, axis=1) `

Now let's output the first five values of `y_val` again:

` print(y_val[:5]) `

**Output:**

` tensor([0, 0, 0, 0, 0]) `

Since in the originally predicted output the value at index zero is greater than the value at index one for the first five records, you see 0 in the first five rows of the processed output.

Finally, we can use the `confusion_matrix`, `accuracy_score`, and `classification_report` classes from the `sklearn.metrics` module to compute the accuracy, precision, and recall values, as well as the confusion matrix.

```
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs, y_val))
print(classification_report(test_outputs, y_val))
print(accuracy_score(test_outputs, y_val))
```

**Output:**

```
[[1527   83]
 [ 224  166]]

              precision    recall  f1-score   support

           0       0.87      0.95      0.91      1610
           1       0.67      0.43      0.52       390

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.77      0.69      0.71      2000
weighted avg       0.83      0.85      0.83      2000

0.8465
```

The results show that our model achieves an accuracy of 84.65%, which is impressive considering that we chose all the parameters of the neural network more or less arbitrarily. I suggest changing the model parameters, such as the train/test split and the number and size of the hidden layers, to see whether you can get better results.

**Conclusion**

PyTorch is a widely used deep learning library developed by Facebook that can be used for a variety of tasks such as classification, regression, and clustering. This article showed how to use the PyTorch library to classify tabular data.
