Deep Learning Tip: Reducing Overfitting in Deep Networks with Weight Constraints

Time: 2019-7-23

Summary: A deep learning tip: constrain the weights of a neural network to reduce the chance of over-fitting, with accompanying Keras implementation code.

In deep learning, techniques such as batch normalization and adding regularization terms to the loss function can generally improve the performance of a model. Weight constraints take a more direct approach: they check the magnitude of the network weights and rescale them when they exceed a limit, which can reduce over-fitting of the training data and improve the performance of the model on new data.

There are several types of weight constraints, such as the maximum norm and the unit norm, and some require hyperparameters to be configured.

In this tutorial, the Keras API is used to add weight constraints to a deep learning neural network model in order to reduce over-fitting.

After completing this tutorial, you will understand:

  • How to use the Keras API to create vector norm constraints;
  • How to use the Keras API to add weight constraints to MLP, CNN and RNN layers;
  • How to reduce over-fitting by adding weight constraints to an existing model.

Next, let’s start.

This tutorial is divided into three parts:

  • Weight constraints in Keras;
  • Weight constraints on layers;
  • Case study on weight constraints;

Weight Constraints in Keras

The Keras API supports weight constraints, and constraints can be specified on a per-layer basis.

Using a constraint generally involves setting the kernel_constraint argument on the layer for the input weights and the bias_constraint argument for the bias weights, although weight constraints are generally not applied to the bias weights.

A suite of different vector norms in the keras.constraints module can be used as constraints:

  • Maximum Norm (max_norm): forces weights to have a magnitude at or below a given limit;
  • Non-Negative Norm (non_neg): forces weights to be non-negative;
  • Unit Norm (unit_norm): forces weights to have a magnitude of exactly 1.0;
  • Min-Max Norm (min_max_norm): forces weights to have a magnitude within a given range.

For example, constraints can be imported and instantiated:

# import norm
from keras.constraints import max_norm
# instantiate norm
norm = max_norm(3.0)
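
Conceptually, a max norm constraint leaves a weight vector untouched while its norm is at or below the limit and rescales it otherwise. The following is a minimal NumPy sketch of this idea, for illustration only; it is not the actual Keras implementation, which applies a similar rescaling per incoming weight vector during training:

# conceptual sketch of a max norm constraint (illustration, not Keras internals)
import numpy as np

def apply_max_norm(w, max_value=3.0):
    # rescale the weight vector only if its L2 norm exceeds the limit
    norm = np.linalg.norm(w)
    if norm > max_value:
        w = w * (max_value / norm)
    return w

w = np.array([2.0, 4.0, 4.0])            # L2 norm is 6.0
print(apply_max_norm(w, max_value=3.0))  # rescaled to [1., 2., 2.], norm 3.0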

Weight Constraints on Layers

Weight norm constraints can be used with most layers in Keras. Here are some common examples:

MLP Weight Constraints

The following example sets a maximum norm weight constraint on a fully connected (Dense) layer:

# example of max norm on a dense layer
from keras.layers import Dense
from keras.constraints import max_norm
...
model.add(Dense(32, kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...

CNN Weight Constraints

The following example sets a maximum norm weight constraint on a convolutional layer:

# example of max norm on a cnn layer
from keras.layers import Conv2D
from keras.constraints import max_norm
...
model.add(Conv2D(32, (3,3), kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...

RNN Weight Constraints

Unlike other layer types, recurrent neural networks allow weight constraints to be set on the input weights and biases as well as on the recurrent input weights. The constraint on the recurrent weights is set via the layer's recurrent_constraint argument.

The following example sets the maximum norm weight constraint on the LSTM layer:

# example of max norm on an lstm layer
from keras.layers import LSTM
from keras.constraints import max_norm
...
model.add(LSTM(32, kernel_constraint=max_norm(3), recurrent_constraint=max_norm(3), bias_constraint=max_norm(3)))
...

With these basics in place, the following is a worked example.

Case Study on Weight Constraints

In this section, we will demonstrate how to use weight constraints to reduce over-fitting of an MLP on a simple binary classification problem.

This example provides a template that readers can use to apply weight constraints to their own neural networks for classification and regression problems.

Binary Classification Problem

A standard binary classification problem is used that defines two semicircles of observations, one semicircle for each class. Each observation has two input variables on the same scale and a class output value of either 0 or 1. This dataset is called the “moons” dataset because of the crescent shape of each class when plotted.

The make_moons() function can be used to generate the observations, with a noise parameter set and the random seed fixed so that the same samples are generated every time the code is run.

# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)

The two input variables can be plotted as x and y coordinates on a graph, with each point colored according to its class value.

The complete example of generating and plotting the dataset is listed below:

# generate two moons dataset
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()

Running this example creates a scatter plot in which the points of each class form semicircular or crescent shapes.

This dataset is a good test problem because the classes cannot be separated by a straight line, so a non-linear method such as a neural network is required.

Only 100 samples are generated, which is small for a neural network. This creates the opportunity to over-fit the training dataset and obtain higher error on the test dataset, making it a good case for applying regularization. In addition, the noise in the samples gives the model the opportunity to learn aspects of the samples that do not generalize.

Overfitting of a Multilayer Perceptron

An MLP model can be developed to address this binary classification problem.

The model will have one hidden layer with more nodes than are needed to solve the problem, providing the opportunity to over-fit.

Before defining the model, the dataset is split into a training set and a test set at a ratio of 3:7, using 30 samples for training and 70 for testing.

# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

Next, the model is defined. The hidden layer has 500 nodes and uses the ReLU activation function, while the output layer uses the sigmoid activation function to predict a class value of 0 or 1.

The model is optimized using the binary cross-entropy loss function, which is suitable for binary classification problems, and the Adam version of gradient descent.

# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The model is then trained for 4,000 epochs with the default batch size of 32.

# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)

The test dataset is used as a validation dataset to monitor the performance of the model during training.

# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Finally, the performance of the model on the training and test sets is plotted for each epoch. If the model does over-fit the training dataset, the accuracy on the training set will continue to increase, while the accuracy on the test set will rise to a point and then begin to fall again.

# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Combining all of these pieces, the complete example is listed below:

# mlp overfit on the moons dataset
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Running the example reports the performance of the model on the training and test datasets.

It can be seen that the model performs better on the training dataset than on the test dataset, one possible sign of over-fitting. Given the stochastic nature of neural networks and the training algorithm, your specific results may vary. Because the model is over-fitting, we would not generally expect the same accuracy from repeated runs on the same dataset.

Train: 1.000, Test: 0.914

A figure is created showing the accuracy of the model on the training and test sets over the course of training. It shows the expected shape of an over-fit model, in which test accuracy rises to a point and then begins to decrease again.

Multilayer Perceptron with Weight Constraints

For comparison with the above, a weight constraint is now added to the MLP. There are a few different weight constraints to choose from. A simple and useful constraint for this model normalizes the weights so that their norm is equal to 1.0. This constraint has the effect of forcing all incoming weights to be small.

This can be implemented in Keras using the unit_norm constraint, added to the first hidden layer as follows:

model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))

The same result can also be achieved using min_max_norm with the min and max values both set to 1.0, for example:

model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=min_max_norm(min_value=1.0, max_value=1.0)))

However, the same result cannot be achieved with a maximum norm constraint, because it allows norms at or below the specified limit; for example:

model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=max_norm(1.0)))
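
The distinction can be seen numerically: a unit norm constraint always rescales a weight vector to a norm of exactly 1.0, whereas a max norm constraint of 1.0 leaves a vector alone when its norm is already below the limit. A small NumPy sketch, for illustration only:

# illustrative difference between unit norm and max norm (not Keras internals)
import numpy as np
w = np.array([0.3, 0.4])      # L2 norm is 0.5, already below a limit of 1.0
print(w / np.linalg.norm(w))  # unit norm behaviour: rescaled up to norm 1.0 -> [0.6 0.8]
print(w)                      # max norm(1.0) behaviour: unchanged, norm stays 0.5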

The complete updated example with the unit norm constraint is listed below:

# mlp overfit on the moons dataset with a unit norm constraint
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from keras.constraints import unit_norm
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Running the example reports the performance of the model on the training and test datasets.

It can be seen that the strict constraint on the size of the weights improves the performance of the model on the test set without affecting performance on the training set.

Train: 1.000, Test: 0.943

Looking at the training and test accuracy curves, the model no longer appears to over-fit the training dataset; accuracy on both the training and test sets remains at a stable level.

Extensions

This section lists some extensions that readers may wish to explore:

  • Report weight norms: update the example to calculate the magnitudes of the network weights and demonstrate that the constraint indeed made them smaller (see the sketch after this list);
  • Constrain the output layer: update the example to add a constraint to the output layer of the model and compare the results;
  • Constrain the bias: update the example to add a constraint to the bias weights and compare the results;
  • Repeated evaluation: update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.
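
As a starting point for the first extension, the weights of the fitted model can be retrieved with get_weights() and the per-unit norms inspected, both with and without the constraint. A minimal sketch, assuming the fitted model named model from the examples above:

# sketch: report the column norms of the first hidden layer's weights
from numpy.linalg import norm
weights, biases = model.layers[0].get_weights()
# one norm per hidden unit (one column of the kernel matrix per unit)
print([round(float(norm(weights[:, i])), 3) for i in range(10)])  # first 10 units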

Further Reading

For further insight, here are some other resources on this topic:

Blog

  • Introduction to Vector Norms in Machine Learning

API

  • Keras Constraints API
  • Keras constraints.py
  • Keras Core Layers API
  • Keras Convolutional Layers API
  • Keras Recurrent Layers API
  • sklearn.datasets.make_moons API

Author: [Direction]

This article is original content from the Yunqi Community and may not be reproduced without permission.