The AI Lab
Compiled by: VK
Source: Medium
This article studies overfitting in SparseNN models and explores a variety of regularization methods, such as max-norm / constant-norm constraints on embedding vectors, dropout of sparse feature IDs, parameter freezing, embedding shrinkage, etc. However, as far as we know, none of them significantly reduces overfitting within a single training pass.
Regularizing fully connected layer and sparse parameters
A stochastic gradient descent optimizer uses a mini-batch of samples to update the fully connected layer and sparse parameters. Given a mini-batch of examples, usually all fully connected layer parameters are updated (assuming there is no gating or dropout), while only a small fraction of the sparse parameters is activated in forward propagation and updated in back propagation. For example, suppose a sparse feature encodes the ad IDs that the user clicked in the past week. Although there may be millions of unique ad IDs, the number of ad IDs that appear in a mini-batch (usually 100 samples) is very small compared to the cardinality.
The difference between regularizing sparse parameters and fully connected layer parameters is that, for sparse parameters, we need to identify at run time which ones were activated by the mini-batch, and then regularize only those.
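A minimal sketch of this idea, assuming a NumPy embedding table indexed by ad ID. The function name, the table size, and the learning rate and weight decay values are hypothetical and are not taken from SparseNN.

```python
import numpy as np

def l2_decay_active_rows(embedding, batch_ids, lr, weight_decay):
    """Apply the L2 (weight decay) step only to rows activated by this mini-batch."""
    active = np.unique(batch_ids)  # distinct IDs seen in this mini-batch
    embedding[active] -= lr * weight_decay * embedding[active]
    return active

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 4))      # hypothetical: 1000 ad IDs, 4-dim embeddings
before = table.copy()
batch_ids = np.array([3, 17, 3, 256])   # ID 3 appears twice but is decayed once
active = l2_decay_active_rows(table, batch_ids, lr=0.1, weight_decay=0.01)
```

Rows that do not appear in the mini-batch are left untouched, so the cost of the regularization step scales with the batch rather than with the cardinality of the feature.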
One thing to note when regularizing fully connected layer parameters is that the biases in the fully connected layers usually do not need to be regularized. Therefore, these biases need to be identified and automatically excluded from the regularization.
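One way to exclude biases could look like the sketch below. It assumes parameters are stored in a dictionary and that bias names end in "bias"; both the naming convention and the function are illustrative assumptions.

```python
import numpy as np

def l2_penalty_gradients(params, weight_decay):
    """Return the L2 term of the gradient for each parameter, skipping biases."""
    grads = {}
    for name, value in params.items():
        if name.endswith("bias"):       # assumed naming convention for biases
            grads[name] = np.zeros_like(value)
        else:
            grads[name] = weight_decay * value
    return grads

params = {
    "fc1.weight": np.ones((3, 2)),
    "fc1.bias": np.ones(3),
}
reg = l2_penalty_gradients(params, weight_decay=0.5)
```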
L2 regularization
In the regularized loss, J(W) is the empirical loss, ||W_dense||^2 is the squared L2 norm of the fully connected layer parameters (also known as the L2 regularizer), and ||W_sparse||^2 is the same for the sparse parameters.
The gradient of the loss L with respect to a parameter W_i decomposes into the gradient of the empirical loss J and the so-called "weight decay" term λ * W_i.
To implement the L2 regularizer, add λ * W_i to the gradient used to update W_i. In implementations, λ is called the weight decay.
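Written out, the objective and update described above would look roughly as follows. The factor 1/2 and the single λ shared by both parameter groups are assumptions, chosen so that the regularizer contributes exactly λ W_i to the gradient as the text states; η is an assumed symbol for the learning rate.

```latex
\begin{align*}
L(W) &= J(W) + \frac{\lambda}{2}\,\lVert W_{\text{dense}} \rVert_2^2
             + \frac{\lambda}{2}\,\lVert W_{\text{sparse}} \rVert_2^2 \\
\frac{\partial L}{\partial W_i} &= \frac{\partial J}{\partial W_i} + \lambda W_i \\
W_i &\leftarrow W_i - \eta \left( \frac{\partial J}{\partial W_i} + \lambda W_i \right)
\end{align*}
```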
L2 regularization vs. max-norm regularization

L2 regularization can be applied to both fully connected layer parameters and sparse parameters, both of which may overfit. Max norm, however, is only suitable for sparse parameters, because the vector norm of a weight matrix in a fully connected layer is not well defined.

The L2 regularization term in the loss function is differentiable and is equivalent to adding a decay term to gradient descent that penalizes large weights. Max norm breaks the forward/backward propagation framework, because it renormalizes an embedding vector whenever the norm of the updated vector is greater than 1.
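The max-norm step described above can be sketched as a projection applied after the gradient update, outside the loss. This is an illustrative NumPy version with a hypothetical function name, not the SparseNN implementation.

```python
import numpy as np

def max_norm_project(embedding, rows, max_norm=1.0):
    """Rescale updated rows whose L2 norm exceeds max_norm back to max_norm."""
    norms = np.linalg.norm(embedding[rows], axis=1, keepdims=True)
    # Scale is 1 for rows already inside the ball, max_norm / norm otherwise.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    embedding[rows] *= scale

table = np.array([[3.0, 4.0],    # norm 5, rescaled to norm 1
                  [0.3, 0.4]])   # norm 0.5, left unchanged
max_norm_project(table, rows=np.array([0, 1]))
```

Because the projection is not part of the differentiable loss, it has no gradient of its own; it only post-processes the rows touched by the update.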
Several experiments were motivated by the observation that SparseNN overfits when it makes multiple passes over the training data. The training setup is very simple: we consider only one user feature and one ad feature, with no dense (fully connected layer) features.
The experiments are divided into two parts:
(a) A description of the experiments
(b) Further hypotheses and how to test them.
Let's take the following settings as an example.
Setup

A user-side feature (SPARSE_USER_CLK_AD_IDS) and an ad-side feature (SPARSE_AD_OBJ_ID).

n_train = 1 is the number of training days and n_test = 1 is the number of testing days; within a comparison, the numbers of training and testing days are fixed.

The (average) normalized entropy curve is plotted on test data from the following day to simulate production conditions.
Shuffling: the same results are obtained with across_ts_shuffle, shuffle_all, and shuffle_within_partition.
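The setup above reports normalized entropy without defining it. The sketch below assumes the definition commonly used in ad click prediction: the average log loss divided by the entropy of the empirical average click rate. Whether the experiments use exactly this form is an assumption, and the labels and predictions are made-up illustrative values.

```python
import numpy as np

def normalized_entropy(labels, preds, eps=1e-12):
    """Average log loss divided by the entropy of the empirical click rate."""
    labels = np.asarray(labels, dtype=float)
    preds = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    log_loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    p = np.clip(labels.mean(), eps, 1 - eps)             # background click rate
    background = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return log_loss / background

labels = [1, 0, 0, 1]
ne_baseline = normalized_entropy(labels, [0.5, 0.5, 0.5, 0.5])  # predicts the average rate
ne_good = normalized_entropy(labels, [0.9, 0.1, 0.1, 0.9])
```

Under this definition, a model that always predicts the average click rate scores 1.0, and lower values are better.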
Reducing the learning rate is a regularization method for logistic regression, but it does not work for SparseNN.
Lower learning rate, num_passes = 2
Regularization by constraining the embedding norm (const_norm is shown here; max_norm results are similar). Here, the cost function is independent of the regularization applied.
Const norm
Reducing the learning rate with minimum capacity:
When we try to minimize model capacity by reducing the embedding dimension to 2 and setting num_replicas = 1, we see 0.8711 with num_passes = 1 and sparse_alpha = 0.002, and 0.8703 with num_passes = 2.
Finally, with num_passes > 1, we see an improvement!
But num_passes = 3 ends our short-lived happiness; we have been trying to beat 0.8488, which is what the current SparseNN parameters achieve (dimensionality ≈ 32, learning rate = 0.04, and num_replicas = 2).
Minimum capacity with reduced learning rate, num_passes = 1
Minimum capacity with reduced learning rate, num_passes = 2
Minimum capacity with reduced learning rate, num_passes = 3
SGD optimizer
What if the learning rate is reset? For this experiment, one day of data can be copied into consecutive partitions. Graph (a) shows training with num_passes = 1, and graph (b) shows multiple passes over the training data, where num_passes = 1 means the same data repeated in consecutive partitions and num_passes = 2 means multiple passes over the same partition. The results are the same.
For dropout, SparseNN provides dropout_ratio and sparse_dropout_ratio. Sparse dropout removes connections from the embedding layer to the fully connected layers, while fully connected layer dropout removes connections inside the network.
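The two dropout ratios can be illustrated with a generic inverted-dropout sketch. How SparseNN implements them internally is not described in this article, so the function below, the array shapes, and the ratio values are assumptions; only the names dropout_ratio and sparse_dropout_ratio come from the text.

```python
import numpy as np

def dropout(x, ratio, rng):
    """Inverted dropout: zero each entry with probability `ratio`, rescale the rest."""
    if ratio <= 0.0:
        return x
    keep = rng.random(x.shape) >= ratio
    return x * keep / (1.0 - ratio)

rng = np.random.default_rng(0)
embedded = np.ones((4, 8))    # hypothetical embedding-layer output for 4 samples
hidden = np.ones((4, 16))     # hypothetical fully connected layer activations

sparse_dropout_ratio = 0.5    # applied between the embedding and fully connected layers
dropout_ratio = 0.2           # applied inside the fully connected network

fc_input = dropout(embedded, sparse_dropout_ratio, rng)
hidden_out = dropout(hidden, dropout_ratio, rng)
```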
Dropout num_passes = 1
Dropout num_passes = 2
Maximum entropy regularizer?
One could try to regularize the entropy of the embeddings so that the embedding dimensions stay meaningful, rather than memorizing user-ad pairs. For example, in a user-movie recommendation problem, a movie represented by (action, drama, emotion, comedy) generalizes well, whereas a model that memorizes user-movie pairs fails on the test data.
From these experiments:

Embeddings connect users with ads. As the dimension increases, the embedding captures the user-to-ad relationship better (and performance improves; no upper limit on the dimension was found in the tests). Conversely, when the dimension is low, each ad is associated with more users (and performance is lower).

When the hash size is small, multiple ads are resolved into a single embedding. This happens without any semantic hashing, i.e. completely unrelated ads are resolved into the same embedding, so performance degrades. Perhaps increasing the dimension would restore performance. Perhaps semantic hashing would solve this problem.

The training data is stratified by the number of impressions per ad: frequent ads, medium ads, and infrequent ads.

Frequent ads are perfectly memorized, that is, the average click-through rate per user is encoded in the ad and user dimensions. For them, retraining does not hurt because they are already memorized.

Medium ads cause large variance, but it is averaged out by the law of large numbers. In continuous training, medium ads receive more training data and improve. In single-pass training, ads memorize a small number of users, starting from the initial randomness (due to initialization); in multi-pass training, ads memorize only those few users and do not generalize to other users.
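The hashing observation in the list above can be made concrete with a small sketch. It assumes a plain modulo hash, which is only an illustration; the actual hash function used in the experiments is not stated, and the ad IDs are made up.

```python
def hashed_row(ad_id, hash_size):
    """Map a raw ad ID to an embedding row with a plain modulo hash (assumed)."""
    return ad_id % hash_size

ad_ids = [101, 205, 1101, 2205]                   # hypothetical, unrelated ads

small = [hashed_row(a, 100) for a in ad_ids]      # small hash size: collisions
large = [hashed_row(a, 10_000) for a in ad_ids]   # larger hash size: no collisions here
```

With a hash size of 100, ads 101 and 1101 share one embedding row, as do ads 205 and 2205, even though nothing relates them semantically.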
By starting to experiment with regularizing and storing these ML parameters, you can become a senior ML engineer. Who says ML is hard to learn?
Link to the original text: https://medium.com/swlh/thescienceofoptimizationinml26b0b2bb3d62