## 1 Linear Model

The typical form of a linear model is $y = ax + b$, where $y$ is the dependent variable and $x$ is the independent variable; the parameters $a$ and $b$ are solved by the least-squares method or by gradient descent.
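As a minimal sketch (assuming NumPy and hypothetical data), the parameters can be solved by least squares:

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least squares: stack a bias column and solve min ||X @ [a, b] - y||^2
X = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b)  # recovers a = 2, b = 1 (up to floating point)
```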

### 1.1 Logistic Regression

For classification problems, a sigmoid transformation is applied to the output of the linear model.
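A minimal sketch of that transformation (assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued linear score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The linear output w^T x + b becomes a probability for the positive class.
print(sigmoid(0.0))  # 0.5: a score of 0 is an even split
```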

**Question: Why is the sigmoid function used in logistic regression?**

Paper: The equivalence of logistic regression and maximum entropy models

http://www.win-vector.com/dfi… (details left to the reader)

Logistic regression is the special case of the maximum entropy model with two classes.

The logistic regression objective function:

$$\min_w \sum_{i=1}^N \log\left(1 + e^{-y_i w^T x_i}\right)$$

Note: the target variable here takes values in {-1, +1}.

The objective function of logistic regression is derived as follows.

Assume the target variable $y \in \{+1, -1\}$ and model $P(y = +1 \mid x) = \sigma(w^T x)$ with $\sigma(z) = 1/(1 + e^{-z})$. Since $\sigma(-z) = 1 - \sigma(z)$, both cases combine into $P(y \mid x) = \sigma(y\, w^T x)$.

Based on maximum likelihood estimation, the likelihood function is:

$$L(w) = \prod_{i=1}^N \sigma(y_i w^T x_i)$$

The log-likelihood:

$$\ell(w) = \sum_{i=1}^N \log \sigma(y_i w^T x_i) = -\sum_{i=1}^N \log\left(1 + e^{-y_i w^T x_i}\right)$$

Take the minus sign and solve for the minimum:

$$\min_w \sum_{i=1}^N \log\left(1 + e^{-y_i w^T x_i}\right)$$
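The derivation can be sanity-checked numerically. A sketch (assuming NumPy and synthetic data) comparing the analytic gradient of this objective against a finite difference:

```python
import numpy as np

def lr_loss(w, X, y):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i * w.x_i))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def lr_grad(w, X, y):
    """Gradient: sum_i -sigmoid(-y_i * w.x_i) * y_i * x_i."""
    coef = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ coef

# Synthetic data; labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.where(rng.normal(size=8) > 0, 1.0, -1.0)
w = rng.normal(size=3)

# Finite-difference check of the first coordinate
eps = 1e-6
d = np.array([eps, 0.0, 0.0])
numeric = (lr_loss(w + d, X, y) - lr_loss(w - d, X, y)) / (2 * eps)
print(abs(numeric - lr_grad(w, X, y)[0]) < 1e-4)  # True
```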

### 1.2 Characteristics of Linear Models

- Feature discretization is generally used to handle univariate non-linearity.
- Simple, efficient, and interpretable; applied early on in recommendation, advertising, search, and similar businesses.
- Feature combinations must be crafted manually.

### 1.3 Feature Combination

Meaning of introducing feature combination:

By observing a large amount of sample data, we find that combining certain features improves their correlation with the label. For example: "cosmetics" products with "female" gender, "ball-sports accessories" products with "male" gender, "movie tickets" products with a "movies" category preference, and so on.

Manual feature combination:

middle-aged + male = "middle-aged uncle"; juvenile + female = "loli"

Disadvantage: requires business experience and large amounts of data analysis.
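A sketch of what manual crossing looks like in code (hypothetical field names, for illustration only):

```python
# Cross two categorical fields into a single combined feature id.
def cross(age_group, gender):
    return age_group + "_x_" + gender

samples = [("middle-aged", "male"), ("juvenile", "female")]
print([cross(a, g) for a, g in samples])
# ['middle-aged_x_male', 'juvenile_x_female']
```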

Quadratic (degree-2 polynomial) feature combination:

$$\hat{y}(x) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij}\, x_i x_j$$

Disadvantages:

- The number of parameters grows with the square of the feature dimension.
- More parameters require more data to train.
- For highly sparse data, the pattern $x_i x_j \neq 0$ may be absent from the data, which makes the cross-term parameters hard to train.
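The quadratic blow-up in parameters can be made concrete with a small count:

```python
# Number of pairwise cross-term weights w_ij in a degree-2 polynomial model.
def num_cross_params(n):
    return n * (n - 1) // 2

for n in (10, 100, 10_000):
    print(n, num_cross_params(n))
# 10 -> 45, 100 -> 4950, 10000 -> 49995000: quadratic in n
```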

The cross-term weights form a matrix $W$, and FM borrows the idea of matrix factorization.

Given a matrix $W \in \mathbb{R}^{m \times n}$, compute two matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{k \times n}$ such that $UV$ approximates $W$ as closely as possible.
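A sketch of this low-rank approximation (assuming NumPy; here $W$ is built to have rank 2, so a truncated SVD with $k = 2$ recovers it exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a rank-2 matrix W of shape (m, n) = (6, 5)
A = rng.normal(size=(6, 2))
B = rng.normal(size=(2, 5))
W = A @ B

# Truncated SVD: keep only the top-k singular values
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
U = U_[:, :k] * s[:k]         # (m, k)
V = Vt[:k, :]                 # (k, n)
print(np.allclose(U @ V, W))  # True: rank-2 W is recovered exactly with k = 2
```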

## 2 FM

### 2.1 Improvement of feature combination

A low-rank constraint is imposed on the weight matrix, which is then factorized: $w_{ij} = \langle v_i, v_j \rangle$.

Benefits:

- The number of cross-term parameters drops from the previous $n(n-1)/2$ to $nk$.
- The impact of insufficient data for learning cross-term parameters is reduced.

Specifically, the coefficients of $x_h x_i$ and $x_i x_j$ are $\langle v_h, v_i \rangle$ and $\langle v_i, v_j \rangle$ respectively; they share the common factor $v_i$. That is, every sample containing a non-zero combination involving $x_i$ (there exists some $j \neq i$ such that $x_i x_j \neq 0$) can be used to learn the latent vector $v_i$, which largely avoids the impact of data sparsity. In the polynomial model, by contrast, $w_{hi}$ and $w_{ij}$ are independent of each other.

Training $w_{ij}$ well requires enough samples where both features are active. Under sparse input, however, most components $x_i$ of the input $x$ are 0, let alone $x_i$ and $x_j$ both being non-zero at the same time, so the polynomial model cannot be trained well.

FM, by contrast, learns $v_i$ with the advantage that every other non-zero element contributes to its gradient, so there is far more training data for $v_i$ and suitable parameters can be learned.

Because the cross-term parameters are no longer independent, unseen cross patterns can be inferred from the observed ones.
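A small sketch of this parameter sharing (assuming NumPy):

```python
import numpy as np

n, k = 5, 3
rng = np.random.default_rng(1)
V = rng.normal(size=(n, k))  # one k-dimensional latent vector per feature

# FM ties all cross weights to the latent vectors: w_ij = <v_i, v_j>.
W = V @ V.T
# Every weight in row i is built from the same v_i, so any sample where
# feature i co-occurs with *any* feature j updates v_i, and hence all w_i*.
print(np.isclose(W[0, 1], V[0] @ V[1]))  # True
```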

Training set:

(man, football), (woman, cosmetics)

Test set:

(man, cosmetics)? FM can still score this pair, because $v_{\text{man}}$ and $v_{\text{cosmetics}}$ were each learned from other combinations.

Linear time complexity (the computation involves only the non-zero features).

The cross term $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j$ is computed using the identity $ab + ac + bc = \big[(a+b+c)^2 - a^2 - b^2 - c^2\big]/2$; concretely:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]$$
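The reformulation can be verified directly; a sketch (assuming NumPy and random data) comparing the naive double sum with the linear-time version:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
V = rng.normal(size=(n, k))

# Naive O(k n^2) double sum over i < j
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Reformulated O(k n): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
s1 = (V.T @ x) ** 2             # (sum_i v_if x_i)^2 per factor f
s2 = (V.T ** 2) @ (x ** 2)      # sum_i v_if^2 x_i^2 per factor f
fast = 0.5 * np.sum(s1 - s2)

print(np.isclose(naive, fast))  # True
```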

The reduction in time complexity shows up in:

1) computing the cross term of the prediction;

2) computing the gradients of the cross-term parameters.

### 2.2 Model Solving

For binary classification problems, the loss function is the log loss:

$$\mathrm{loss} = \sum_{i=1}^N -\ln \sigma\big(y_i\, \hat{y}(x_i)\big)$$

The parameters $\theta$ are updated by gradient descent, $\theta \leftarrow \theta - \alpha\, \partial \mathrm{loss}/\partial \theta$, where the model gradients are

$$\frac{\partial \hat{y}}{\partial \theta} = \begin{cases} 1, & \theta = w_0 \\ x_i, & \theta = w_i \\ x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2, & \theta = v_{i,f} \end{cases}$$
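The cross-term gradient $\partial \hat{y} / \partial v_{i,f} = x_i \sum_j v_{j,f} x_j - v_{i,f} x_i^2$ can be checked numerically; a sketch assuming NumPy:

```python
import numpy as np

def fm_predict(w0, w, V, x):
    """FM output: w0 + w.x + sum_{i<j} <v_i, v_j> x_i x_j (linear-time form)."""
    s1 = (V.T @ x) ** 2
    s2 = (V.T ** 2) @ (x ** 2)
    return w0 + w @ x + 0.5 * np.sum(s1 - s2)

rng = np.random.default_rng(0)
n, k = 4, 2
x = rng.normal(size=n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))

# Analytic gradient w.r.t. v_{i,f}: x_i * sum_j v_{j,f} x_j - v_{i,f} * x_i^2
i, f = 1, 0
analytic = x[i] * (V[:, f] @ x) - V[i, f] * x[i] ** 2

# Central finite difference on the same entry
eps = 1e-6
Vp = V.copy(); Vp[i, f] += eps
Vm = V.copy(); Vm[i, f] -= eps
numeric = (fm_predict(w0, w, Vp, x) - fm_predict(w0, w, Vm, x)) / (2 * eps)
print(abs(analytic - numeric) < 1e-5)  # True
```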

How can sparsity be achieved for embedding features (distributed representations)?

One approach is coupled group lasso, which applies an L1-style penalty to each latent feature vector as a whole.

http://proceedings.mlr.press/…

### 2.3 FM training demo

Training code based on the SGD algorithm (Python):

```
def stocGradAscent(dataMatrix, classLabels, k, max_iter, alpha):
    '''Train an FM model by stochastic gradient descent
    Input:  dataMatrix (mat)   features
            classLabels (mat)  labels (+1 / -1)
            k (int)            dimension of v
            max_iter (int)     maximum number of iterations
            alpha (float)      learning rate
    Output: w0 (float), w (mat), v (mat): weights
    '''
    m, n = np.shape(dataMatrix)
    # 1. Initialize the parameters
    w = np.zeros((n, 1))      # n is the number of features
    w0 = 0                    # bias
    v = initialize_v(n, k)    # initialize v
    cost = []
    # 2. Training
    for it in range(max_iter):
        for x in range(m):    # stochastic optimization: one sample at a time
            inter_1 = dataMatrix[x] * v
            inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * \
                np.multiply(v, v)  # element-wise products
            # the cross term, via the O(kn) reformulation
            interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
            p = w0 + dataMatrix[x] * w + interaction  # predicted output
            # derivative of the log loss
            loss = sigmoid(classLabels[x] * p[0, 0]) - 1
            w0 = w0 - alpha * loss * classLabels[x]
            for i in range(n):
                if dataMatrix[x, i] != 0:
                    w[i, 0] = w[i, 0] - alpha * loss * classLabels[x] * dataMatrix[x, i]
                    for j in range(k):
                        v[i, j] = v[i, j] - alpha * loss * classLabels[x] * \
                            (dataMatrix[x, i] * inter_1[0, j] -
                             v[i, j] * dataMatrix[x, i] * dataMatrix[x, i])
        # track the loss value
        cur_cost = getCost(getPrediction(np.mat(dataMatrix), w0, w, v), classLabels)
        print("\t------- iter: ", it, " , cost: ", cur_cost)
        cost.append(cur_cost)
    # 3. Return the parameters of the trained FM model
    return w0, w, v, cost
```
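The demo above relies on helpers (`sigmoid`, `initialize_v`, `getPrediction`, `getCost`) that the post does not show. One plausible set of definitions, written as an assumption rather than the author's originals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def initialize_v(n, k):
    # Small random initialization for the latent factor matrix v (n x k)
    return np.mat(np.random.default_rng(0).normal(0, 0.1, size=(n, k)))

def getPrediction(dataMatrix, w0, w, v):
    # Sigmoid-transformed FM output for every sample
    m = np.shape(dataMatrix)[0]
    result = []
    for x in range(m):
        inter_1 = dataMatrix[x] * v
        inter_2 = np.multiply(dataMatrix[x], dataMatrix[x]) * np.multiply(v, v)
        interaction = np.sum(np.multiply(inter_1, inter_1) - inter_2) / 2.
        p = w0 + dataMatrix[x] * w + interaction
        result.append(sigmoid(p[0, 0]))
    return result

def getCost(predict, classLabels):
    # Log loss for labels in {-1, +1}
    error = 0.0
    for i in range(len(predict)):
        error -= np.log(sigmoid(float(predict[i]) * float(classLabels[i])))
    return error
```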

Logistic regression references:

http://blog.csdn.net/cyh_24/a…