## 1. Logistic regression

The essence of logistic regression: assume the data follow a certain distribution, then use maximum likelihood estimation to fit the parameters.

LR is actually a classification model. Taking simple binary classification as an example, assume the training samples are:

$$

\left\{ {\left( {x_1^1,x_2^1} \right),{y^1}} \right\},\left\{ {\left( {x_1^2,x_2^2} \right),{y^2}} \right\}, \ldots ,\left\{ {\left( {x_1^m,x_2^m} \right),{y^m}} \right\}

$$

where $y$ takes only the values 0 and 1. We are looking for a **hyperplane** that classifies a new sample $\left\{ {\left( {x_1},{x_2} \right),y} \right\}$ as accurately as possible. The hyperplane is defined as:

$$

{w^T}x + b = 0

$$

That is, we need to solve for the parameters $\left( {w_1},{w_2},b \right)$.

Note that ${w^T}x + b$ is continuous while $y$ takes only the values 0 and 1, so we wrap it in a sigmoid function (which is differentiable) to map ${w^T}x + b$ to a probability in $(0,1)$:

$$

\begin{array}{l}

P(y = 1|w,b,x) = \sigma ({w^T}x + b) = \frac{1}{{1 + {e^{ - ({w^T}x + b)}}}},\\

P(y = 0|w,b,x) = 1 - \sigma ({w^T}x + b) = \frac{{{e^{ - ({w^T}x + b)}}}}{{1 + {e^{ - ({w^T}x + b)}}}}

\end{array}

$$
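The sigmoid mapping above can be sketched directly in NumPy (the function names here are illustrative, not from the article):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), compresses any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    # P(y = 1 | w, b, x) = sigma(w^T x + b)
    return sigmoid(x @ w + b)
```

`predict_proba` returns P(y = 1); P(y = 0) is simply its complement, as in the pair of formulas above.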

How do we find $\left( {w_1},{w_2},b \right)$? We use the likelihood function (the product of the predicted probabilities of all samples):

$$

L(w,b) = \prod\limits_{i = 1}^m {{{[p({x_i})]}^{{y_i}}}{{[1 - p({x_i})]}^{1 - {y_i}}}}

$$

Taking the logarithm of both sides gives the **log-likelihood function**:

$$

\begin{array}{l}

L(w,b) = \sum\limits_{i = 1}^m {{y_i}\ln p({x_i}) + (1 - {y_i})\ln (1 - p({x_i}))} ,\\

p({x_i}) = \frac{1}{{1 + {e^{ - ({w^T}{x_i} + b)}}}}

\end{array}

$$

The loss function is the negative of the log-likelihood:

$$

L(w,b) = - \sum\limits_{i = 1}^m {{y_i}\ln p({x_i}) + (1 - {y_i})\ln (1 - p({x_i}))}

$$
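This negative log-likelihood can be evaluated directly; a minimal sketch (the `eps` clipping to avoid `log(0)` is my addition, not part of the derivation):

```python
import numpy as np

def log_loss(w, b, X, y, eps=1e-12):
    # L(w, b) = -sum_i [ y_i ln p(x_i) + (1 - y_i) ln(1 - p(x_i)) ]
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p(x_i) for every row of X
    p = np.clip(p, eps, 1.0 - eps)           # keep log() finite
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

With zero weights every p(x_i) = 0.5, so the loss reduces to m·ln 2, a handy sanity check.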

**The problem is solved with stochastic gradient descent** (one random sample per update)

$$

\nabla F(w) = \sum\limits_{i = 1}^m {({y_i} - \frac{1}{{1 + {e^{ - ({w^T}{x_i} + b)}}}}){x_i}}

$$

Update the parameters $\left( {w,b} \right)$:

$$

{w_{t + 1}} = {w_t} + \eta ({y_i} - \frac{1}{{1 + {e^{ - ({w^T}{x_i} + b)}}}}){x_i}

$$
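The per-sample update above can be sketched as a small training loop (gradient ascent on the log-likelihood; the learning rate, epoch count, and function name are arbitrary choices of this sketch):

```python
import numpy as np

def fit_sgd(X, y, eta=0.1, epochs=200, seed=0):
    # stochastic update, one randomly ordered sample at a time:
    #   w <- w + eta * (y_i - sigma(w^T x_i + b)) * x_i   (likewise for b)
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # p(x_i)
            err = y[i] - p
            w += eta * err * X[i]
            b += eta * err
    return w, b
```

On a linearly separable toy set this converges to a hyperplane that classifies all training points correctly.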

- LR vs. linear regression: the former does classification; compared with the latter, it simply adds a sigmoid layer on top of the linear function.
- LR vs. SVM: LR uses the cross-entropy loss while SVM uses the hinge loss. LR is a parametric model, which presupposes that the data follow a certain distribution; SVM is a non-parametric model, where a distribution exists but its form is unknown. SVM depends on distances between points.
- In practice, raw data is usually not fed directly to LR; the features are discretized first. This improves generalization and speeds up computation.
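The discretization mentioned in the last point can be as simple as equal-width binning followed by one-hot encoding (a sketch; the bin count and function name are my choices, not from the article):

```python
import numpy as np

def discretize_onehot(x, n_bins=4):
    # Equal-width bins over the feature's observed range, then one-hot
    # encode the bin index; LR then learns one weight per bin, so the
    # feature's effect becomes piecewise-constant (more robust to
    # outliers, and the sparse one-hot inputs are cheap to compute on).
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]  # interior edges
    idx = np.digitize(x, edges)   # bin index in [0, n_bins - 1]
    return np.eye(n_bins)[idx]
```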

Reference: https://zhuanlan.zhihu.com/p/…