Sparse vectors often contain many dimensions. Creating a **feature cross** results in even more dimensions. With such high-dimensional feature vectors, the model can become very large and require a lot of RAM.

In a high-dimensional sparse vector, it helps to drive as many weights as possible to exactly 0. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features saves RAM and can reduce noise in the model.

Take a housing dataset that covers the entire globe, not just California. Bucketing global latitude at the minute level (60 minutes per degree) gives about 10,000 dimensions in a sparse encoding; bucketing global longitude at the minute level gives about 20,000 dimensions. A feature cross of these two features produces about 200 million dimensions. Many of those 200 million dimensions represent areas with so few residences (such as the middle of the ocean) that the data is hard to generalize from. It would be unwise to pay the RAM cost of storing these unneeded dimensions. It is therefore better to drive the weights of the meaningless dimensions to exactly 0, so that we avoid paying the storage cost of those model coefficients at inference time.
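The rounded counts above follow from simple arithmetic. A quick check, assuming latitude spans 180 degrees and longitude spans 360 degrees, with one bucket per minute (the variable names are illustrative only):

```python
# Back-of-the-envelope check of the bucket counts quoted above.
MINUTES_PER_DEGREE = 60

latitude_buckets = 180 * MINUTES_PER_DEGREE    # -90 to +90 degrees
longitude_buckets = 360 * MINUTES_PER_DEGREE   # -180 to +180 degrees
crossed_dimensions = latitude_buckets * longitude_buckets

print(latitude_buckets)     # 10800, roughly 10,000
print(longitude_buckets)    # 21600, roughly 20,000
print(crossed_dimensions)   # 233280000, roughly 200 million
```

Each crossed dimension needs its own weight, which is why the storage cost scales with the product of the two bucket counts.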

We may be able to encode this idea into the optimization problem solved at training time by adding an appropriately chosen regularization term.

Can L2 regularization accomplish this task? Unfortunately not. L2 regularization encourages weights to be small, but it does not force them to exactly 0.0.

Another idea is to create a regularization term that penalizes the count of non-zero coefficient values in the model. Increasing this count would be justified only if the model's ability to fit the data improved enough in return. Unfortunately, although this count-based approach is intuitively appealing, it turns our convex optimization problem into a non-convex one that is **NP-hard**. (If you look closely, you can see a connection to the knapsack problem.) This idea, known as L0 regularization, is therefore not something we can use effectively in practice.

However, there is a regularization term called L1 regularization that serves as an approximation to L0 but has the advantage of being convex, and is therefore efficient to compute. We can use L1 regularization to drive many of the uninformative coefficients in the model to exactly 0, and so save RAM at inference time.
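To make the three penalties concrete, here is a minimal sketch that computes each one for a small, made-up weight vector. The function names are hypothetical, and the regularization strength that would normally scale each term is left out:

```python
def l0_penalty(weights):
    """Count of non-zero weights (the non-convex, count-based term)."""
    return sum(1 for w in weights if w != 0.0)


def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


# Illustrative weights only; two of the five are already exactly zero.
weights = [0.0, 0.5, -2.0, 0.0, 0.25]

print(l0_penalty(weights))  # 3
print(l1_penalty(weights))  # 2.75
print(l2_penalty(weights))  # 4.3125
```

Note that the L0 value changes only when a weight becomes exactly zero, whereas L1 and L2 decrease smoothly as weights shrink.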

### L1 and L2 regularization

L2 and L1 penalize weights differently:

1. L2 penalizes weight².

2. L1 penalizes |weight|.

Consequently, L2 and L1 have different derivatives:

1. The derivative of L2 is 2 * weight.

2. The derivative of L1 is k (a constant whose value is independent of the weight).
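The two derivatives can be written out directly. This is a sketch for a single weight with the constant k taken to be 1; for negative weights the sign of the L1 derivative flips, and returning 0 at exactly zero is one common convention, assumed here:

```python
def l2_gradient(weight):
    # d/dw (w^2) = 2 * w: proportional to the weight itself.
    return 2.0 * weight


def l1_gradient(weight):
    # d/dw |w| = +1 or -1 for non-zero w: constant magnitude.
    # At w == 0 the derivative is undefined; 0.0 is an assumed convention.
    if weight > 0:
        return 1.0
    if weight < 0:
        return -1.0
    return 0.0


for w in (4.0, 0.5, -0.5):
    print(w, l2_gradient(w), l1_gradient(w))
```

The L2 gradient shrinks along with the weight, while the L1 gradient keeps the same magnitude however small the weight gets.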

You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x% of a number billions of times, the diminished number never quite reaches 0. (Zeno was less familiar with floating-point precision limits, which could produce exactly 0.) In short, L2 does not normally drive weights to 0.
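A small simulation illustrates the point. It assumes a hypothetical shrink rate of 10% per step and ignores the data-fitting part of the gradient entirely:

```python
weight = 1.0
shrink_fraction = 0.1  # hypothetical: remove 10% of the weight per step

for step in range(1000):
    # Proportional shrinkage, mimicking the pull of the L2 derivative.
    weight -= shrink_fraction * weight

# After 1000 steps the weight is tiny but still strictly positive.
print(weight > 0.0)  # True
```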

You can think of the derivative of L1 as a force that subtracts a constant from the weight every time. However, because of the absolute value, L1's derivative has a discontinuity at 0, which causes subtraction results that cross 0 to be zeroed out. For example, if a subtraction would have moved a weight from +0.1 to -0.2, L1 sets the weight to exactly 0. That is how L1 drives a weight to 0.
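One way to picture this behavior is an update that subtracts a fixed amount and clips at zero whenever the step would cross it. The helper below is a hypothetical sketch of that idea for a single weight, not the implementation used by any particular optimizer:

```python
def l1_step(weight, penalty_step):
    """Move weight toward 0 by penalty_step; clip to 0.0 if the step would cross zero."""
    if weight > penalty_step:
        return weight - penalty_step
    if weight < -penalty_step:
        return weight + penalty_step
    return 0.0


print(l1_step(1.0, 0.3))   # about 0.7: a plain constant subtraction
print(l1_step(0.1, 0.3))   # 0.0: a step of 0.3 would have pushed 0.1 to -0.2
print(l1_step(-0.1, 0.3))  # 0.0: the same clipping applies to negative weights
```

Unlike proportional shrinkage, a constant step reaches zero after a finite number of updates and then stays there.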

L1 regularization, which penalizes the absolute value of all the weights, turns out to be quite efficient for wide models.

Note that this description applies to a one-dimensional model.

This work is released under a CC license; reprints must credit the author and link to this article.