A Brief Description of FM Algorithms


1 Linear Model

In its usual form, the linear model is y = a·x + b, where y is the dependent variable and x is the independent variable; the parameter a is fitted by the least squares method or by gradient descent.
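As a minimal sketch (with made-up data), fitting y = a·x + b by least squares can be done with NumPy's `lstsq`:

```python
import numpy as np

# Made-up data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Build the design matrix [x, 1] and solve the least-squares problem.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)  # recovers a ≈ 2, b ≈ 1
```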


1.1 Logistic Regression

For classification problems, a sigmoid transformation is applied to the output of the linear model:

    y = sigmoid(w^T x + b) = 1 / (1 + e^(-(w^T x + b)))

Question: why is the sigmoid function used in logistic regression?
Paper: "The equivalence of logistic regression and maximum entropy models"
http://www.win-vector.com/dfi…
Logistic regression is the special case of the maximum entropy model for two-class problems.

Objective function of logistic regression:

    min_w Σ_{i=1}^{N} log(1 + exp(-y_i · w^T x_i))

Note: the target variable here takes values in {-1, +1}.

The objective function of logistic regression is derived as follows.
Assume that the target variable takes the values +1 and -1:

    P(y = +1 | x) = sigmoid(w^T x),   P(y = -1 | x) = 1 - sigmoid(w^T x) = sigmoid(-w^T x)

Both cases can therefore be written as a single expression:

    P(y | x) = sigmoid(y · w^T x)

Based on maximum likelihood estimation, the likelihood function is:

    L(w) = Π_{i=1}^{N} sigmoid(y_i · w^T x_i)

The log-likelihood:

    log L(w) = Σ_{i=1}^{N} log sigmoid(y_i · w^T x_i) = -Σ_{i=1}^{N} log(1 + exp(-y_i · w^T x_i))

Taking the minus sign and minimizing:

    min_w Σ_{i=1}^{N} log(1 + exp(-y_i · w^T x_i))
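The last step of the derivation can be checked numerically (illustrative values): the per-sample loss -log(sigmoid(y·s)) equals log(1 + exp(-y·s)) for any raw score s.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for y in (+1, -1):
    for s in (-2.0, 0.0, 0.5, 3.0):
        lhs = -math.log(sigmoid(y * s))
        rhs = math.log(1.0 + math.exp(-y * s))
        print(y, s, round(lhs, 6), round(rhs, 6))  # the two columns agree
```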

1.2 Characteristics of the Linear Model

  • Feature discretization is generally used to handle univariate non-linearity.
  • Simple, efficient, and interpretable; adopted early in recommendation, advertising, search, and other businesses.
  • Feature combinations must be crafted by hand.

1.3 Feature Combination

Meaning of introducing feature combination:
By observing large amounts of sample data, we find that the correlation between certain features and the label improves when those features are combined. For example, "cosmetics" goods with "female" gender, "ball-sports accessories" goods with "male" gender, "movie tickets" goods with a "movies" category preference, and so on.

Artificial feature combination:
Middle-aged + male = "middle-aged uncle"; young + female = "young girl"
Disadvantage: requires business experience and large amounts of data analysis.
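A minimal sketch (with made-up feature names) of such hand-crafted crossing: two categorical features are combined into a single crossed one-hot feature.

```python
from itertools import product

ages = ["youth", "middle-aged"]
genders = ["male", "female"]

# Enumerate every crossed category, e.g. "middle-aged&male".
crossed_vocab = [f"{a}&{g}" for a, g in product(ages, genders)]

def cross(age, gender):
    """One-hot encode the crossed feature age x gender."""
    key = f"{age}&{gender}"
    return [1 if key == c else 0 for c in crossed_vocab]

print(crossed_vocab)
print(cross("middle-aged", "male"))
```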

Quadratic feature combination:

    y(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} w_{ij} x_i x_j

The number of parameters grows quadratically with the feature dimension.
More parameters require more data to train.
For highly sparse data, the pattern x_i·x_j != 0 may be entirely absent from the data, which makes the quadratic parameters hard to train.

W weight matrix:

FM draws on the idea of matrix factorization. Given a matrix W ∈ R^(m×n), two matrices U ∈ R^(m×k) and V ∈ R^(k×n) are computed such that the product U·V approximates W as closely as possible:

    W ≈ U·V
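An illustrative sketch (random data) of this low-rank idea: truncated SVD gives the best rank-k factorization W ≈ U·V in the least-squares sense, and recovers a rank-2 matrix exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a matrix that is exactly rank 2, so a rank-2 factorization recovers it.
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
U = U_full[:, :k] * s[:k]  # m x k
V = Vt[:k, :]              # k x n

print(np.linalg.norm(W - U @ V))  # ≈ 0: the rank-2 product reconstructs W
```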


2.1 Improvement of feature combination

A low-rank constraint is imposed on the weight matrix, which is then factorized:

    y(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} ⟨v_i, v_j⟩ x_i x_j,   where ⟨v_i, v_j⟩ = Σ_{f=1}^{k} v_{i,f} v_{j,f}
The cross-term parameters are reduced from the previous n(n-1)/2 to nk.
This reduces the impact of insufficiently trained cross-term parameters.
Specifically, the coefficients of x_h x_i and x_i x_j are ⟨v_h, v_i⟩ and ⟨v_i, v_j⟩ respectively, and they share the common factor v_i. That is, every sample containing a non-zero combination feature of x_i (i.e., there exists some j ≠ i such that x_i x_j ≠ 0) can be used to learn the latent vector v_i, which largely avoids the impact of data sparsity. In the polynomial model, by contrast, w_hi and w_ij are independent of each other.
w_ij needs enough samples to be trained well, but with sparse inputs most of the values x_i in a feature vector x are 0, let alone x_i and x_j being non-zero at the same time, so the polynomial model cannot be trained well.
FM, on the contrary, depends on training v_i, and every other non-zero element contributes to its gradient, so there is far more training data for each v_i and the parameters can be learned better.
Because the cross-term parameters are no longer independent, unseen cross patterns can be learned from the observed ones.
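The parameter-count reduction above can be made concrete with a quick (illustrative) calculation: the polynomial model needs n(n-1)/2 pairwise weights, while FM needs only n·k latent values.

```python
def poly_params(n):
    """Number of pairwise cross-term weights in the polynomial model."""
    return n * (n - 1) // 2

def fm_params(n, k):
    """Number of latent values in the FM cross term."""
    return n * k

n, k = 10_000, 8
print(poly_params(n))   # 49,995,000 pairwise weights
print(fm_params(n, k))  # 80,000 latent values
```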

Training set:
(male, football)
(female, cosmetics)

Test set:
(male, cosmetics) ?
Even though "male" and "cosmetics" never co-occur in training, FM can still estimate their cross weight as ⟨v_male, v_cosmetics⟩, since each latent vector is learned from other samples.
Linear time complexity (depends only on the non-zero features):
The cross term Σ ⟨v_i, v_j⟩ x_i x_j is computed with an identity of the form ((a + b + c)² - a² - b² - c²) / 2. The specific derivation gives:

    Σ_{i=1}^{n} Σ_{j=i+1}^{n} ⟨v_i, v_j⟩ x_i x_j = (1/2) Σ_{f=1}^{k} [ (Σ_{i=1}^{n} v_{i,f} x_i)² - Σ_{i=1}^{n} v_{i,f}² x_i² ]
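This O(kn) identity can be verified numerically (random data) against the naive O(kn²) pairwise sum:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 4
x = rng.standard_normal(n)
V = rng.standard_normal((n, k))

# Naive: sum <v_i, v_j> x_i x_j over all pairs i < j.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Fast: 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)

print(abs(naive - fast))  # ≈ 0
```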

The reduction in time complexity shows up in two places:
1) Computing the feature cross term for prediction
2) Computing the gradients of the cross-term parameters

2.2 Model Solution

For the binary classification problem, the loss function is the log loss:

    loss = Σ_{i=1}^{N} -log sigmoid(y_i · ŷ(x_i))

The gradient of the model output ŷ with respect to each parameter θ is:

    ∂ŷ/∂θ = 1                                               if θ = w_0
    ∂ŷ/∂θ = x_i                                             if θ = w_i
    ∂ŷ/∂θ = x_i · Σ_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i²    if θ = v_{i,f}

Updating parameter θ by gradient descent:

    θ ← θ - α · (sigmoid(y · ŷ(x)) - 1) · y · ∂ŷ/∂θ
How can sparsity be obtained for embedding features (distributed representations)?
One approach, the coupled group lasso, applies an L1 penalty to each latent feature vector as a whole.
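The cross-term gradient formula ∂ŷ/∂v_{i,f} = x_i·Σ_j v_{j,f} x_j - v_{i,f} x_i² can be checked by finite differences (random data; w_0 and w are set to zero so only the cross term remains):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 3
x = rng.standard_normal(n)
V = rng.standard_normal((n, k))

def y_hat(V):
    # FM output with w0 = 0 and w = 0: only the cross term.
    return 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)

i, f, eps = 2, 1, 1e-6
analytic = x[i] * (V[:, f] @ x) - V[i, f] * x[i] ** 2

Vp = V.copy(); Vp[i, f] += eps
Vm = V.copy(); Vm[i, f] -= eps
numeric = (y_hat(Vp) - y_hat(Vm)) / (2 * eps)

print(abs(analytic - numeric))  # ≈ 0
```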

2.3 FM training demo

Training code based on the SGD algorithm (Python):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def initialize_v(n, k):
    '''Initialize the latent matrix V (n x k) with small random values.'''
    return np.mat(np.random.normal(0.0, 0.1, (n, k)))

def getPrediction(dataMatrix, w0, w, v):
    '''Sigmoid output of the FM model for every sample.'''
    m = np.shape(dataMatrix)[0]
    result = []
    for x in range(m):
        inter_1 = dataMatrix[x] * v
        inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * np.multiply(v, v)
        interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
        p = w0 + dataMatrix[x] * w + interaction
        result.append(sigmoid(p[0, 0]))
    return result

def getCost(predict, classLabels):
    '''Loss over the whole training set.'''
    error = 0.0
    for i in range(len(predict)):
        error -= np.log(sigmoid(predict[i] * classLabels[i]))
    return error

def stocGradAscent(dataMatrix, classLabels, k, max_iter, alpha):
    '''Train the FM model by stochastic gradient descent.
    Input:  dataMatrix (mat)   features
            classLabels (list) labels in {-1, +1}
            k (int)            dimension of the latent vectors v
            max_iter (int)     maximum number of iterations
            alpha (float)      learning rate
    Output: w0 (float), w (mat), v (mat): the model weights
    '''
    m, n = np.shape(dataMatrix)
    cost = []
    # 1. Initialize the parameters
    w = np.zeros((n, 1))        # n is the number of features
    w0 = 0.0                    # bias
    v = initialize_v(n, k)      # initialize V

    # 2. Training
    for it in range(max_iter):
        for x in range(m):      # stochastic optimization: one sample at a time
            inter_1 = dataMatrix[x] * v
            inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * \
                      np.multiply(v, v)               # element-wise squares
            # cross term, computed with the O(kn) identity
            interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
            p = w0 + dataMatrix[x] * w + interaction  # predicted raw output
            # derivative of the log loss with respect to the raw output
            loss = sigmoid(classLabels[x] * p[0, 0]) - 1

            w0 = w0 - alpha * loss * classLabels[x]
            for i in range(n):
                if dataMatrix[x, i] != 0:
                    w[i, 0] = w[i, 0] - alpha * loss * classLabels[x] * dataMatrix[x, i]
                    for j in range(k):
                        v[i, j] = v[i, j] - alpha * loss * classLabels[x] * \
                                  (dataMatrix[x, i] * inter_1[0, j] -
                                   v[i, j] * dataMatrix[x, i] * dataMatrix[x, i])

        # track the value of the loss function after each pass
        cost.append(getCost(getPrediction(dataMatrix, w0, w, v), classLabels))
        print("\t------- iter:", it, ", cost:", cost[-1])

    # 3. Return the parameters of the trained FM model
    return w0, w, v, cost

Logistic regression references: