1 Linear Model
Usually, a linear model takes the form y = a x + b, where y is the dependent variable and x is the independent variable; the parameters a and b are estimated by least squares or by gradient descent.
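As a minimal sketch of the least-squares fit just mentioned (toy data assumed for illustration), the slope a and intercept b can be recovered with NumPy:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of y = a*x + b: build the design matrix [x, 1]
# and solve the normal equations via np.linalg.lstsq.
A = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(a, b)  # recovers a ≈ 2, b ≈ 1 on this noiseless data
```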
1.1 Logistic Regression
For classification problems, a sigmoid transformation is applied to the output of the linear model.
Question: why is the sigmoid function used in logistic regression?
Paper: The equivalence of logistic regression and maximum entropy models
Logistic regression is the special case of the maximum entropy model when there are two classes.
Logistic regression objective function:
Note: the target variable here takes values -1, +1
The objective function of logistic regression is derived as follows:
Assume the target variable takes values +1, -1.
Apply maximum likelihood estimation.
Write down the likelihood function.
Take the negative logarithm and solve for the minimum.
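The formulas missing from the steps above can be reconstructed as follows (a standard derivation; the symbols w, x_i, y_i are assumed notation, not from the original):

```latex
% Model: with labels y \in \{-1, +1\}, the sigmoid gives
P(y \mid x) = \sigma\!\left(y\, w^{\top} x\right) = \frac{1}{1 + e^{-y\, w^{\top} x}}

% Likelihood over m samples:
L(w) = \prod_{i=1}^{m} \sigma\!\left(y_i\, w^{\top} x_i\right)

% Take the negative logarithm and minimize:
\min_{w} \; \sum_{i=1}^{m} \log\!\left(1 + e^{-y_i\, w^{\top} x_i}\right)
```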
1.2 Characteristics of Linear Models
- Feature discretization is generally used to handle univariate non-linearity.
- Simple, efficient, and interpretable; applied early on in recommendation, advertising, search, and other businesses.
- Feature combinations must be constructed manually, by hand.
1.3 Feature Combination
Why introduce feature combinations:
By observing large amounts of sample data, we find that the correlation between certain features and the label increases once those features are combined. For example, "cosmetics" products with "female" gender, "ball-sports accessories" products with "male" gender, "movie tickets" products with "movie" category preference, and so on.
Manual feature combination:
middle-aged + male = "middle-aged uncle"; young + female = "loli"
Disadvantage: requires business experience and large amounts of data analysis.
Quadratic feature combination:
The number of parameters grows with the square of the feature dimension, and more parameters require more data to train. For highly sparse data, the pattern x_i x_j != 0 may be entirely absent from the data, which makes the quadratic (cross-term) parameters hard to train.
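For reference, the degree-2 polynomial model described above is commonly written as (notation assumed, matching the FM literature):

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j
% with n(n-1)/2 independent cross-term parameters w_{ij}
```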
Weight matrix W:
FM borrows the idea of matrix factorization:
given a matrix W in R^(m x n), compute two matrices U and V
such that the product U V approximates W as closely as possible.
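As a minimal sketch of this idea (toy data; truncated SVD stands in here for whatever factorization method is actually used), NumPy can recover rank-k factors U and V whose product reconstructs W:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2

# Build a rank-k matrix W (assumed toy data) so that a rank-k
# factorization can reconstruct it exactly.
U_true = rng.standard_normal((m, k))
V_true = rng.standard_normal((k, n))
W = U_true @ V_true

# Truncated SVD gives the best rank-k factors U, V with U @ V ≈ W.
U_svd, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_svd[:, :k] * s[:k]   # m x k
V = Vt[:k, :]              # k x n

err = np.linalg.norm(W - U @ V)
print(err)  # ~0, because W has rank k
```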
2.1 Improvement of feature combination
A low-rank constraint is imposed on the weight matrix, which is then decomposed.
The number of cross-term parameters is reduced from the previous n(n-1)/2 to nk.
This reduces the impact of insufficiently learned cross-term parameters.
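Replacing each cross-term weight w_{ij} with an inner product of k-dimensional latent vectors gives the FM model equation (standard form from the FM literature):

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j,
\qquad v_i \in \mathbb{R}^{k},\quad
\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f}\, v_{j,f}
```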
Specifically, the coefficients of x_h x_i and x_i x_j are <v_h, v_i> and <v_i, v_j> respectively, and they share the common factor v_i. This means that every sample containing a non-zero combination feature involving x_i (i.e., there exists some j != i such that x_i x_j != 0) can be used to learn the latent vector v_i, which largely avoids the impact of data sparsity. In the polynomial model, by contrast, w_hi and w_ij are independent of each other.
Training w_ij well requires enough samples in which both x_i and x_j are non-zero; but with sparse input, most components x_i of the feature vector x are 0, and it is even rarer for x_i and x_j to be non-zero at the same time, so the polynomial model cannot be trained well.
FM, on the other hand, learns v_i from every sample in which x_i is non-zero: all the other non-zero features in that sample contribute to the gradient of v_i, so there is far more training data for each v_i and suitable parameters can be learned.
Because the cross-term parameters are no longer independent, unobserved cross patterns can be learned from the observed ones.
Linear time complexity (depends only on the non-zero features)
The cross terms <v_i, v_j> x_i x_j are computed mainly by using the identity ((a + b + c)^2 - a^2 - b^2 - c^2) / 2 = ab + ac + bc. The specific process is as follows:
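That process can be reconstructed as the standard O(kn) rewriting from the FM literature:

```latex
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
= \frac{1}{2}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} \langle v_i, v_j \rangle x_i x_j
  - \sum_{i=1}^{n} \langle v_i, v_i \rangle x_i x_i\right)
= \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]
```

The final form costs O(kn), and only the non-zero x_i contribute to the sums.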
The reduction in time complexity shows up mainly in:
1) computing the feature cross terms during prediction;
2) computing the gradients of the cross-term parameters.
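A quick numerical check that the O(kn) reformulation matches the naive cross-term sum (random toy data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
x = rng.standard_normal(n)
V = rng.standard_normal((n, k))  # latent vectors v_i are the rows of V

# Naive O(k*n^2) sum of <v_i, v_j> x_i x_j over all pairs i < j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# O(k*n) reformulation: for each factor f,
# ((sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2) / 2
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ (x ** 2))

print(abs(naive - fast))  # ~0: the two computations agree
```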
2.2 Model Solution
For the binary classification problem, the loss function is the logit loss, loss(y_hat, y) = -ln sigmoid(y * y_hat) with y in {-1, +1}.
Each parameter theta is then updated by stochastic gradient descent.
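The update uses the standard FM partial derivatives; with the logit loss and y in {-1, +1}, the gradient of the loss with respect to a parameter theta is (sigma(y * y_hat) - 1) * y * d(y_hat)/d(theta), where

```latex
\frac{\partial \hat{y}}{\partial \theta} =
\begin{cases}
1, & \text{if } \theta = w_0 \\
x_i, & \text{if } \theta = w_i \\
x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^{2}, & \text{if } \theta = v_{i,f}
\end{cases}
```

This matches the update in the training code below, where `loss = sigmoid(classLabels[x] * p) - 1`.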
How can sparsity be achieved for embedding features (distributed representations)?
One approach is coupled group lasso, which applies an L1 penalty to each latent feature vector as a whole group.
2.3 FM training demo
Training code based on the SGD algorithm (Python):
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def initialize_v(n, k):
    # Initialize the latent matrix v with small random values
    return np.mat(np.random.normal(0, 0.2, (n, k)))

def getPrediction(dataMatrix, w0, w, v):
    # FM output for every row of dataMatrix, using the O(kn) cross-term formula
    m = np.shape(dataMatrix)[0]
    result = []
    for x in range(m):
        inter_1 = dataMatrix[x] * v
        inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * np.multiply(v, v)
        interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
        p = w0 + dataMatrix[x] * w + interaction
        result.append(p[0, 0])
    return result

def getCost(predict, classLabels):
    # Logit loss over all samples (labels are +1 / -1)
    error = 0.0
    for i in range(len(predict)):
        error -= np.log(sigmoid(predict[i] * classLabels[i]))
    return error

def stocGradAscent(dataMatrix, classLabels, k, max_iter, alpha):
    '''Train an FM model by stochastic gradient descent.
    Input:  dataMatrix (mat)   features
            classLabels        labels, +1 / -1
            k (int)            dimension of the latent vectors v
            max_iter (int)     maximum number of iterations
            alpha (float)      learning rate
    Output: w0 (float), w (mat), v (mat): model weights, plus the cost history
    '''
    m, n = np.shape(dataMatrix)
    # 1. Initialize parameters
    w = np.zeros((n, 1))    # n is the number of features
    w0 = 0                  # bias
    v = initialize_v(n, k)  # initialize the latent matrix v
    cost = []

    # 2. Training
    for it in range(max_iter):
        for x in range(m):  # stochastic update: one sample at a time
            inter_1 = dataMatrix[x] * v
            inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * \
                np.multiply(v, v)  # element-wise products
            # cross-term value via the O(kn) formula
            interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
            p = w0 + dataMatrix[x] * w + interaction  # predicted output
            # derivative of the logit loss
            loss = sigmoid(classLabels[x] * p[0, 0]) - 1
            w0 = w0 - alpha * loss * classLabels[x]
            for i in range(n):
                if dataMatrix[x, i] != 0:
                    w[i, 0] = w[i, 0] - alpha * loss * classLabels[x] * dataMatrix[x, i]
                    for j in range(k):
                        v[i, j] = v[i, j] - alpha * loss * classLabels[x] * \
                            (dataMatrix[x, i] * inter_1[0, j] -
                             v[i, j] * dataMatrix[x, i] * dataMatrix[x, i])
        # Record the loss every 1000 iterations
        if it % 1000 == 0:
            cost.append(getCost(getPrediction(dataMatrix, w0, w, v), classLabels))
            print("\t------- iter:", it, ", cost:", cost[-1])

    # 3. Return the parameters of the final FM model
    return w0, w, v, cost
Logistic regression references: