Compared with SVM, how does the FM model learn cross features, and how can it be optimized?

Date: 2020-05-22

In computational advertising and recommendation systems, CTR (click-through rate) prediction is a crucial step: whether an item is recommended is decided by ranking items according to their estimated CTR. Methods commonly used in industry include hand-crafted features + LR, GBDT + LR, and the FM and FFM models.

In recent years many FM-based improvements have been proposed, such as DeepFM, FNN, PNN, DCN, and xDeepFM. Today I will share FM with you.

The factorization machine (FM) was proposed by Steffen Rendle in 2010. The model tackles classification on large-scale sparse data mainly through feature combination.

1. What is a factorization machine?

Problems with one-hot encoding

When facing a CTR estimation problem, we often turn it into the following kind of binary classification problem.

| click | Gender | Country |
| ----- | ------ | ------- |
| 1     | male   | China   |
| 0     | female | USA     |
| 1     | female | France  |

Since gender, country, and similar features are categorical, one-hot encoding is commonly used to convert them into numerical features.

| click | Gender=male | Gender=female | Country=China | Country=USA | Country=France |
| ----- | ----------- | ------------- | ------------- | ----------- | -------------- |
| 1     | 1           | 0             | 1             | 0           | 0              |
| 0     | 0           | 1             | 0             | 1           | 0              |
| 1     | 0           | 1             | 0             | 0           | 1              |


As the table above shows, after one-hot encoding the feature space of each sample becomes much larger and the feature matrix becomes very sparse. In practice it is common to see feature vectors with more than $10^7$ dimensions.
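
As a minimal sketch of this encoding step (assuming pandas is available; the column names are illustrative, not from the original post), the toy table above can be one-hot encoded like this:

```python
import pandas as pd

# Toy click data corresponding to the table above.
df = pd.DataFrame({
    "click":   [1, 0, 1],
    "gender":  ["male", "female", "female"],
    "country": ["China", "USA", "France"],
})

# One-hot encode the categorical columns; each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["gender", "country"])
print(encoded)
# With real-world categories (e.g. millions of user/item IDs), the resulting
# feature vector becomes extremely high-dimensional and sparse.
```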

If we use a purely linear model to learn users' click and rating habits, we easily miss potential feature combinations, for example: women like cosmetics, men like playing games, users who buy milk powder often also buy diapers, and so on.

Second-order polynomial kernel SVM

To learn cross features, SVM introduces the concept of a kernel function. The simplest and most direct approach is to assign a weight parameter to each pair of features; these new weights, like the weights on the original features, are learned by the model during training. This leads to the following prediction function:

$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{i,j}\, x_i x_j$

This is in fact an SVM whose kernel is the second-order polynomial kernel. A model designed this way appears able to learn the information carried by feature crosses, but the improvement is only theoretical: the model does not generalize well on large amounts of sparse data.

Since the value of $w_{i,j}$ depends entirely on the product of $x_i$ and $x_j$, with sparse data there may be feature pairs whose product $x_i x_j$ is always zero in the training set; the model then has no way to update the weight $w_{i,j}$ effectively. Furthermore, when the model encounters a sample with $x_i x_j \neq 0$ at prediction time, it is difficult for it to generalize effectively.
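
A minimal NumPy sketch of this explicit pairwise-weight model (the name `poly2_predict` is my own, not from the original post) makes the problem concrete: there are on the order of $n^2$ pair weights, and `W[i, j]` only ever receives a gradient signal when `x[i] * x[j]` is nonzero in training.

```python
import numpy as np

def poly2_predict(x, w0, w, W):
    """Prediction with an explicit weight W[i, j] for every feature pair (i < j)."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            # W[i, j] is only updated when x[i] * x[j] != 0 during training,
            # which rarely happens for sparse one-hot features.
            y += W[i, j] * x[i] * x[j]
    return y
```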

Factorization machine model

The generalization performance of the second-order polynomial kernel SVM is insufficient precisely because the value of $w_{i,j}$ depends entirely on the product of $x_i$ and $x_j$; the most direct way forward is to break through this limitation.

The FM solution is to learn, for each feature dimension $x_i$, a representation vector $v_i$ (which can be understood as the embedding vector of the feature ID), and to set the weight on the product $x_i x_j$ to the dot product of the two representation vectors, giving a prediction function of the following form:

$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j\rangle\, x_i x_j$

Clearly, the FM model retains the advantage of the second-order polynomial kernel SVM: it can learn the information carried by pairwise feature crosses.

From the expression above, FM is very much like computing an embedding for each one-hot encoded feature and then learning how the similarity between the embeddings of different features affects the final prediction. Because a sparse vector with tens of millions of dimensions can be compressed into an embedding of tens or hundreds of dimensions, the number of model parameters is greatly reduced, which strengthens the generalization ability of the model and yields better predictions.

Let's go back to the example from the previous section, where $x_i x_j$ is always zero in the training set. In the second-order polynomial kernel SVM, the weight $w_{i,j}$ can never be updated, so the model cannot learn the cross of $x_i$ and $x_j$. In FM, however, the parameter for the pair $x_i, x_j$ is not determined solely by the product $x_i x_j$. Specifically, the representation vector of each feature dimension is shaped by the crosses of that feature with all other dimensions. So as long as there exists some $k$ such that the product $x_i x_k$ is not always zero, the representation vector $v_i$ of the $i$-th feature can learn useful information; the same conclusion holds for $v_j$. Thus, even if $x_i x_j$ is always zero during training, the pair's parameter $\langle v_i, v_j\rangle$ is still updated through learning, which is why FM shows good generalization performance.
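
For comparison, here is a direct (unoptimized) sketch of the FM prediction function above, where each feature $i$ has a $k$-dimensional vector `V[i]` and the pair weight is the dot product `V[i] @ V[j]` (the function name is my own):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """FM prediction, O(k * n^2): the pair weight is the dot product of two embeddings."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            # Even if x[i] * x[j] never co-occurs in training, V[i] and V[j]
            # are still learned from other pairs, so <V[i], V[j]> generalizes.
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y
```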

FM and matrix factorization


Collaborative filtering based on matrix factorization is a common recommendation approach. Users' ratings of items are collected from historical data; these can be explicit ratings or scores derived from implicit feedback. Since the numbers of users and items are both large, the (user, item) pairs that actually have ratings are usually very sparse. Matrix-factorization-based collaborative filtering predicts the ratings of items the user has not interacted with, which is essentially a rating prediction problem.

The matrix factorization approach assumes that a user's rating of an item is determined by the similarity between the user embedding $p_u$ and the item embedding $q_i$, plus the user bias $b_u$ and the item bias $b_i$: $\hat{r}_{ui} = \langle p_u, q_i\rangle + b_u + b_i$.

These parameters can be obtained by minimizing the empirical error:

$\min_{p,q,b}\sum_{(u,i)\in K}\left(r_{ui}-\hat{r}_{ui}\right)^2+\lambda\left(\|p_u\|^2+\|q_i\|^2+b_u^2+b_i^2\right)$

From the above, the second-order term of FM also uses the matrix factorization trick, so what is the relationship between matrix-factorization-based collaborative filtering and FM? Taking user-item rating prediction as an example, matrix-factorization-based collaborative filtering can be regarded as a special case of FM: each sample's feature vector is simply the concatenation of the one-hot encodings of the user ID and the item ID. Beyond that, FM can take in more features and learn more combination patterns, which a single matrix factorization model cannot do. FM is therefore more general than matrix factorization, and in practice, problems that used to be solved with matrix factorization are now often solved directly with FM.
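
A hedged sketch of this special case (the helper name and argument layout are my own): when the input is just one-hot(user ID) concatenated with one-hot(item ID), only two entries of $x$ are nonzero, so the FM second-order term collapses to a single dot product, i.e. the matrix factorization score $\langle p_u, q_i\rangle$, while the linear weights play the role of the user/item biases.

```python
import numpy as np

def mf_as_fm(num_users, num_items, u, i, w0, w, V):
    """FM prediction when the features are only one-hot(user) ++ one-hot(item)."""
    x = np.zeros(num_users + num_items)
    x[u] = 1.0                # one-hot user ID
    x[num_users + i] = 1.0    # one-hot item ID, offset by num_users
    # With exactly two nonzero features, the pairwise sum has a single term:
    # <V[u], V[num_users + i]>, matching <p_u, q_i> in matrix factorization.
    return w0 + w @ x + V[u] @ V[num_users + i]
```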


Efficiency issues

Consider the second-order feature combinations in the FM model: with $n$ original features there are $(n^2 - n)/2$ cross features. Without any optimization, the complexity of the FM model is therefore $O(n^2)$, or more precisely $O(kn^2)$ (where $k$ is the length of the representation vectors). This is unacceptable when the feature dimension is very large.

So the question is: can the complexity be reduced to $O(kn)$? The answer is yes. Let us look at a sequence of transformations of the cross-feature term.

$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle v_i, v_j\rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2\right]$

You can see that the time complexity is now $O(kn)$.
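
A minimal NumPy sketch of the reformulated second-order term (the function name is my own); it evaluates the pairwise sum in $O(kn)$ and can be checked numerically against the naive double loop shown earlier:

```python
import numpy as np

def fm_predict_fast(x, w0, w, V):
    """FM prediction in O(k*n) using
    sum_{i<j} <V[i],V[j]> x_i x_j = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]."""
    s = V.T @ x                    # shape (k,): sum_i V[i, f] * x_i for each factor f
    s_sq = (V ** 2).T @ (x ** 2)   # shape (k,): sum_i V[i, f]^2 * x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s_sq)
```

For sparse one-hot inputs only the nonzero entries of `x` contribute, so in practice the cost is proportional to $k$ times the number of active features per sample.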

Parameter learning

From the above, FM can make predictions in linear time. The model parameters can therefore be learned with gradient descent (for example, stochastic gradient descent) under various loss functions. The gradient of the FM model is:

$\frac{\partial \hat{y}}{\partial \theta} = \begin{cases} 1, & \theta = w_0 \\ x_i, & \theta = w_i \\ x_i\sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2, & \theta = v_{i,f} \end{cases}$

Since $\sum_{j=1}^{n} v_{j,f}\, x_j$ depends only on $f$ and not on $i$, it can be computed in advance, and each gradient update can then be completed in constant time. The complexity of FM parameter training is therefore also $O(kn)$. In summary, FM can be trained and used for prediction in linear time, which makes it a very efficient model.
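
As a hedged illustration of one training step (squared loss and plain SGD are my own choices here, not prescribed by the post), the embedding gradient $x_i \sum_j v_{j,f} x_j - v_{i,f} x_i^2$ reuses the precomputed sum $s_f = \sum_j v_{j,f} x_j$ from the prediction:

```python
import numpy as np

def fm_sgd_step(x, y_true, w0, w, V, lr=0.01):
    """One SGD step for FM with squared loss; s = V.T @ x is computed once and reused."""
    s = V.T @ x                                   # shape (k,), precomputed
    y_pred = w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    g = 2.0 * (y_pred - y_true)                   # d(loss)/d(y_pred) for squared loss

    w0 = w0 - lr * g
    w = w - lr * g * x
    # grad wrt V[i, f] = x_i * s_f - V[i, f] * x_i^2
    V = V - lr * g * (np.outer(x, s) - V * (x ** 2)[:, None])
    return w0, w, V
```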

Summary

The FM model has two advantages:

  1. Under high sparsity, interactions between features can still be estimated, and they generalize to interactions not observed in training
  2. The time complexity of parameter learning and model prediction is linear

Optimization points of FM model:

  1. FM crosses all pairs of features, which is resource-intensive; in practice, user-user and item-item crosses usually contribute less than user-item crosses.
  2. Replace for-loop computation with matrix computation.
  3. Construct higher-order cross features.

If you found this article helpful, please like, share, and save it.

Writing original content is not easy; your support is what keeps me creating!
