Regularization for simplicity in machine learning: L2 regularization

Time: 2021-10-19

Consider the following generalization curve, which shows the loss on the training set and the validation set as a function of the number of training iterations.

Figure 1. Loss on the training set and validation set

Figure 1 shows a model whose training loss gradually decreases while its validation loss eventually rises. In other words, the generalization curve shows that the model is overfitting the data in the training set. Following Occam's razor, perhaps we could prevent overfitting by penalizing complex models. This principle is called regularization.

In other words, instead of simply aiming to minimize loss (empirical risk minimization):

minimize ( Loss ( Data|Model ))

we now aim to minimize loss plus complexity, which is called structural risk minimization:

minimize(Loss(Data|Model) + complexity(Model))

Our training optimization algorithm is now a function of two terms: the loss term, which measures how well the model fits the data, and the regularization term, which measures model complexity.
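As a minimal sketch of this two-term objective (the function names and the regularization rate `lam` are illustrative assumptions, not from the source):

```python
def l2_penalty(weights):
    # Regularization term: sum of squared feature weights
    return sum(w ** 2 for w in weights)

def structural_risk(data_loss, weights, lam=0.01):
    # Loss term (model fit) plus a weighted complexity term;
    # lam is a hypothetical regularization rate that balances the two
    return data_loss + lam * l2_penalty(weights)
```

For example, with a data loss of 0.5, weights [0.2, 0.5], and `lam=0.1`, the structural risk is 0.5 + 0.1 × 0.29 ≈ 0.529.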

The Machine Learning Crash Course focuses on two common (and somewhat related) ways to measure model complexity:

• Model complexity as a function of the weights of all the features in the model.
• Model complexity as a function of the total number of features with nonzero weights.
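These two measures can be sketched as follows (function names and the zero tolerance are my own assumptions):

```python
def weight_based_complexity(weights):
    # Complexity as a function of the feature weights themselves
    # (here: sum of squares, the weight-based measure)
    return sum(w ** 2 for w in weights)

def count_based_complexity(weights, tol=1e-12):
    # Complexity as the number of features with nonzero weight
    return sum(1 for w in weights if abs(w) > tol)
```

The two are related: shrinking a weight all the way to zero reduces both measures, but the first also rewards merely making weights smaller.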

If model complexity is a function of the weights, then the larger a feature weight's absolute value, the more it contributes to the model's complexity.

We can quantify complexity using the L2 regularization formula, which defines the regularization term as the sum of the squares of all the feature weights:

L_2 regularization term = ||w||_{2}^{2} = w_{1}^{2} + w_{2}^{2} + … +w_{n}^{2}

In this formula, weights close to 0 have little effect on model complexity, while outlier weights can have a huge impact.

For example, a linear model has the following weights:

w_1 = 0.2, w_2 = 0.5, w_3 = 5, w_4 = 1, w_5 = 0.25, w_6 = 0.75

The L_2 regularization term is 26.915:

w_{1}^{2} + w_{2}^{2} + \mathbf{w_{3}^{2}} + w_{4}^{2} + w_{5}^{2} + w_{6}^{2}\\
= 0.2^{2} + 0.5^{2} + \mathbf{5^{2}} + 1^{2} + 0.25^{2} + 0.75^{2}\\
= 0.04 + 0.25 + \mathbf{25} + 1 + 0.0625 + 0.5625\\
= 26.915

But the squared value of w_3 (bold above) is 25, which contributes almost all of the complexity. The sum of the squares of the other five weights contributes only 1.915 to the L_2 regularization term.
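The arithmetic above can be verified with a short snippet (variable names are mine):

```python
weights = [0.2, 0.5, 5, 1, 0.25, 0.75]

l2_term = sum(w ** 2 for w in weights)  # full L2 term, ≈ 26.915
w3_contribution = weights[2] ** 2       # w_3 squared: 25
others = l2_term - w3_contribution      # remaining five weights, ≈ 1.915
```

This makes the point concrete: a single outlier weight dominates the sum of squares.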