# Compared with SVM, how does FM model learn cross features? How to optimize it?

Date: 2020-05-22

In computational advertising and recommender systems, CTR (click-through rate) prediction is a crucial step: whether an item gets recommended is decided by ranking items on their estimated CTR. Common approaches in industry include hand-crafted features + LR, GBDT + LR, and the FM and FFM models.

In recent years, many FM-based improvements have been proposed, such as DeepFM, FNN, PNN, DCN, and xDeepFM. Today I will share FM with you.

The factorization machine (FM) was proposed by Steffen Rendle in 2010. The model tackles classification on large-scale sparse data by learning feature combinations.

### 1. What is a factorization machine?

#### Problems with one-hot encoding

When facing CTR estimation, we often cast it as a binary classification problem like the following.

| click | gender | country |
|-------|--------|---------|
| 1     | male   | China   |
| 0     | female | USA     |
| 1     | female | France  |

Because gender, country, and the like are categorical features, one-hot encoding is commonly used to convert them into numerical features.

| click | gender=male | gender=female | country=China | country=USA | country=France |
|-------|-------------|---------------|---------------|-------------|----------------|
| 1     | 1           | 0             | 1             | 0           | 0              |
| 0     | 0           | 1             | 0             | 1           | 0              |
| 1     | 0           | 1             | 0             | 0           | 1              |

As the table above shows, after one-hot encoding the feature space of each sample becomes much larger and the feature matrix becomes very sparse. In real applications it is common to see feature vectors with more than $10^7$ dimensions.
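As a minimal sketch of the encoding step above, using only the standard library (column and value names are taken from the table; the function name is my own):

```python
def one_hot(rows, columns):
    """One-hot encode categorical columns of a list of dict-rows."""
    # Distinct values per column, sorted for a stable column order.
    values = {c: sorted({r[c] for r in rows}) for c in columns}
    # One 0/1 slot per (column, value) pair, e.g. "gender=male".
    names = [f"{c}={v}" for c in columns for v in values[c]]
    vectors = [[1 if r[c] == v else 0 for c in columns for v in values[c]]
               for r in rows]
    return names, vectors

rows = [
    {"gender": "male",   "country": "China"},
    {"gender": "female", "country": "USA"},
    {"gender": "female", "country": "France"},
]
names, vectors = one_hot(rows, ["gender", "country"])
```

With three rows the vectors are still dense-looking, but each row activates exactly one slot per column; with millions of categories, almost every entry is zero.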

If we use a plain linear model to learn users' click and rating habits, we easily ignore potential feature combinations, such as: women like cosmetics, men like games, and users who buy milk powder often buy diapers.

#### Second-order polynomial kernel SVM

To learn cross features, SVM introduces the concept of a kernel function. The simplest and most direct way is to assign a weight parameter to every pair of features; these new weights, like the weights of the original features, are learned by the model during training. This yields the following prediction function:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{i,j}\, x_i x_j$$

This is in fact an SVM whose kernel is the second-order polynomial kernel. A model designed this way can in theory learn the information carried by feature crosses, but the improvement is only theoretical: on large amounts of sparse data the model does not generalize well.

Because the value of $w_{i,j}$ depends entirely on the product of $x_i$ and $x_j$, under sparse data the training set may contain no sample in which $x_i x_j$ is nonzero; in that case the model has no way to update the weight $w_{i,j}$. Furthermore, at prediction time, when the model does encounter a sample with $x_i x_j \neq 0$, it is difficult for it to generalize effectively.

## Summary

The FM model has two advantages:

1. Under high sparsity it can still estimate the interactions between features, and it generalizes to interactions never observed in training.
2. Parameter learning and model prediction both run in linear time.
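The linear-time claim in point 2 rests on Rendle's reformulation of FM's pairwise term: $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2}\sum_{f}\left[\left(\sum_i v_{i,f} x_i\right)^2 - \sum_i v_{i,f}^2 x_i^2\right]$, which turns an $O(kn^2)$ double sum into $O(kn)$. A plain-Python sketch checking the two forms against each other (function names are my own):

```python
import itertools
import random

def pairwise_naive(x, V):
    """O(k n^2): explicitly sum <v_i, v_j> x_i x_j over all pairs i < j."""
    k = len(V[0])
    return sum(
        sum(V[i][f] * V[j][f] for f in range(k)) * x[i] * x[j]
        for i, j in itertools.combinations(range(len(x)), 2)
    )

def pairwise_linear(x, V):
    """O(k n): Rendle's identity, one pass over features per factor f."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))         # (sum v_if x_i)
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(len(x))) # sum (v_if x_i)^2
        total += s * s - s2
    return 0.5 * total

random.seed(0)
x = [random.random() for _ in range(6)]
V = [[random.random() for _ in range(3)] for _ in range(6)]
```

The identity holds because squaring the sum produces all ordered pairs $(i, j)$; subtracting the diagonal and halving leaves exactly the $i < j$ pairs.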

Points where the FM model can be optimized:

1. FM crosses all feature pairs, which consumes resources; in general, user-user and item-item crosses contribute less than user-item crosses.
2. Replace for-loop computation with matrix computation.
3. Construct higher-order cross features.
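As a sketch of point 2, assuming NumPy is available, the whole FM prediction for a batch can be written as a few matrix operations instead of nested loops (the function name and shapes are my own choices):

```python
import numpy as np

def fm_predict(X, w0, w, V):
    """Batched FM prediction.

    X: (batch, n) feature matrix
    w0: scalar bias; w: (n,) linear weights
    V: (n, k) latent factor matrix, one k-dim vector per feature
    """
    linear = X @ w
    # Vectorized pairwise term: 0.5 * sum_f [(XV)^2 - X^2 @ V^2],
    # the matrix form of the linear-time identity.
    interaction = 0.5 * np.sum((X @ V) ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return w0 + linear + interaction
```

Besides being shorter, this lets BLAS handle the inner loops, which is typically orders of magnitude faster than Python-level iteration on large batches.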

