# As a loss function

## L1 norm loss function

The **L1 norm loss function**, also known as least absolute error (LAE), minimizes the sum of the **absolute differences** between the target values $y_i$ and the estimates $f(x_i)$:

$$S=\sum_{i=1}^n|y_i-f(x_i)|$$

## L2 norm loss function

The **L2 norm loss function**, also known as least squares error (LSE), minimizes the **sum of the squared differences** between the target values $y_i$ and the estimates $f(x_i)$:

$$S=\sum_{i=1}^n(y_i-f(x_i))^2$$
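As a quick check, both sums can be computed directly; here is a minimal NumPy sketch (the target values and estimates are made-up numbers for illustration):

```python
import numpy as np

# Made-up target values y_i and model estimates f(x_i).
y = np.array([3.0, -0.5, 2.0, 7.0])
f_x = np.array([2.5, 0.0, 2.0, 8.0])

l1_loss = np.sum(np.abs(y - f_x))   # S = sum |y_i - f(x_i)|
l2_loss = np.sum((y - f_x) ** 2)    # S = sum (y_i - f(x_i))^2

print(l1_loss)  # 2.0
print(l2_loss)  # 1.5
```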

| L1 loss function | L2 loss function |
| --- | --- |
| Robust | Not very robust |
| Unstable solution | Stable solution |
| Possibly multiple solutions | Always a single solution |

**To sum up**: the L2 norm loss squares each error (**an error greater than 1 is amplified considerably**), so the total error grows faster than with the L1 norm, and the model becomes more sensitive to individual samples. If a sample is an outlier, the model will be adjusted to fit that single outlier at the expense of many normal samples, because those normal samples have much smaller errors than the outlier.
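This amplification can be seen numerically. In the sketch below (with invented residuals), one outlier residual of 10 adds 10 to the L1 loss but 100 to the L2 loss, so an L2-trained model is pulled much harder toward the outlier:

```python
import numpy as np

residuals = np.array([0.5, -0.3, 0.2, 0.4])   # normal samples
with_outlier = np.append(residuals, 10.0)     # one outlier residual

def l1(r):
    return np.sum(np.abs(r))

def l2(r):
    return np.sum(r ** 2)

# L1: the outlier contributes |10| = 10 to the loss.
print(l1(residuals), l1(with_outlier))   # 1.4  11.4
# L2: the same outlier contributes 10^2 = 100 to the loss.
print(l2(residuals), l2(with_outlier))   # 0.54  100.54
```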

# As regularization

We often see an additional term appended to the loss function, usually the L1 norm or the L2 norm. These are called L1 regularization and L2 regularization, or simply L1 and L2 penalties.

L1 regularization and L2 regularization can be regarded as penalty terms on the loss function. The "penalty" restricts some of the parameters; the term is added after the loss function to prevent the model from overfitting.

## L1 regularization

The L1 norm corresponds to a Laplacian prior and is not differentiable everywhere. Its contour has many corners, and these corners are much more likely to touch the contours of the objective function than other parts, so the optimum tends to land on an axis. **As a result, some weight dimensions become exactly 0**, generating a **sparse weight matrix** and preventing overfitting.

**L1 regularization** of the least-squares loss function:

$$S=\sum_{i=1}^n(y_i-f(x_i))^2+\lambda\sum_j|w_j|$$

L1 regularization refers to the **sum of the absolute values** of the weights.

## L2 regularization

The L2 norm corresponds to a Gaussian prior and is differentiable everywhere. Compared with L1, its contour is much smoother, with no corners, so the optimum generally does not land on an axis. Minimizing the regularization term pushes **the parameters toward zero**, and we end up with very small parameters.

In machine learning, regularization is an important technique to prevent overfitting. Mathematically, it adds a regularization term to keep the coefficients from fitting the training data too closely. The only difference between L1 and L2 is that L2 penalizes the sum of the squares of the weights, while L1 penalizes the sum of their absolute values. As follows:

**L2 regularization** of the least-squares loss function:

$$S=\sum_{i=1}^n(y_i-f(x_i))^2+\lambda\sum_jw_j^2$$

L2 regularization refers to the **square root of the sum of the squares** of the weights (in the penalty term, the square root is usually dropped and the plain sum of squares is used, as above).
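Putting the two penalized objectives side by side as code (the weights $w$ and the strength $\lambda$ below are arbitrary illustrative values):

```python
import numpy as np

# Made-up targets, estimates, and model weights for illustration.
y = np.array([3.0, -0.5, 2.0])
f_x = np.array([2.5, 0.0, 2.0])
w = np.array([1.5, -0.2, 0.0])
lam = 0.1  # regularization strength lambda

base = np.sum((y - f_x) ** 2)            # least-squares loss
l1_obj = base + lam * np.sum(np.abs(w))  # + lambda * sum |w_j|
l2_obj = base + lam * np.sum(w ** 2)     # + lambda * sum w_j^2

print(l1_obj)  # 0.5 + 0.1 * 1.7  = 0.67
print(l2_obj)  # 0.5 + 0.1 * 2.29 = 0.729
```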

# Effects

**L1 regularization**:

- Advantages: sparse output, i.e. it can produce a sparse model, which can be used for feature selection; L1 can also prevent overfitting to some extent
- Disadvantages: computationally inefficient in the non-sparse case

**L2 regularization**:

- Advantages: computationally efficient (an analytical solution exists); can prevent the model from overfitting
- Disadvantages: non-sparse output; no feature selection

**Sparse model and feature selection**: if the weights are sparse, many elements of the weight matrix are 0 and only a few are non-zero, which means that only a few features contribute to the model. Most features contribute nothing, or very little (their coefficients are 0 or very small, so removing them barely changes the model). In that case we can focus only on the features with non-zero coefficients. This is the relationship between sparse models and feature selection.

Reference [1] explains why L1 regularization produces a sparse model (how L1 drives coefficients to exactly 0) and why L2 regularization prevents overfitting. Since many formulas are involved, please refer to it for details.

# Differences

1. L1 regularization is the **sum of the absolute values** of the model parameters.

   L2 regularization is the **square root of the sum of the squares** of the model parameters.

2. L1 tends to select a small number of features and drive **the other features to 0**, generating a sparse weight matrix.

   L2 selects more features, but **their weights will be close to 0**.
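A small sketch of this difference, using a synthetic least-squares problem where only two of five features matter (the data, $\lambda$, and the ISTA solver below are illustrative choices, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0])  # only features 0 and 3 matter
y = X @ true_w + 0.1 * rng.normal(size=n)

lam = 5.0  # regularization strength

# Ridge (L2 penalty): closed-form solution of (X'X + lam*I) w = X'y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso (L1 penalty): proximal gradient descent (ISTA).
w_lasso = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    z = w_lasso - step * grad
    # Soft-thresholding: coordinates within the threshold are set exactly to 0.
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

print(np.round(w_ridge, 3))  # all coordinates shrunk, but generally non-zero
print(np.round(w_lasso, 3))  # irrelevant coordinates driven to (near) 0
```

The L2 solution shrinks every weight a little; the L1 solution zeroes out the weights of the three irrelevant features, which is exactly the sparse-model / feature-selection behavior described above.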

# A few more questions

**1. Why do smaller parameters mean a simpler model?**

The more complex the model, the more it tries to fit every sample, including outliers. This produces large fluctuations over small intervals, which shows up as large derivatives on those intervals.

Larger parameters produce larger derivatives. Therefore, smaller parameters mean a simpler model.

**2. What are the benefits of sparsity?**

Sparse parameters implement feature selection to some extent. In general, most features contribute nothing to the model; although these useless features can reduce the error on the training set, they interfere with the samples of the test set. By introducing sparsity, the weights of useless features can be set to 0.

**3. Why can the L1 norm and L2 norm avoid overfitting?**

Adding a regularization term adds a constraint to the original objective function. The optimal solution is obtained where the contours of the objective function first touch the contour of the L1 or L2 norm constraint region.
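This constraint view can be written out explicitly. For the L1 case, the penalized objective is equivalent, for a suitable radius $t$ depending on $\lambda$, to a constrained problem:

$$\min_w \sum_{i=1}^n\bigl(y_i-f(x_i)\bigr)^2+\lambda\|w\|_1 \quad\Longleftrightarrow\quad \min_{\|w\|_1\le t}\ \sum_{i=1}^n\bigl(y_i-f(x_i)\bigr)^2$$

The L1 constraint region is a diamond whose corners lie on the axes, which is why the first touching point often lands on an axis and zeroes out some weights; the L2 region is a circle, so the touching point is generally off-axis and no weight is exactly 0.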

# References

1. CSDN blog: An intuitive understanding of the regularization terms L1 and L2 in machine learning
2. Differences between L1 and L2 as Loss Function and Regularization