**From machine learning alchemy**

Linear regression solves regression problems; logistic regression builds on linear regression to solve classification problems.

# 1 Formula

What linear regression is goes without saying. Its form looks like this:

$f_{w,b}(x)=\sum_i{w_ix_i}+b$

What about logistic regression?

$f_{w,b}(x)=\sigma(\sum_i{w_ix_i}+b)$

The first thing to remember: **logistic regression can be understood as linear regression with a sigmoid function applied to its output, turning the unbounded linear output into a value between 0 and 1 suitable for classification.**

# 2 Sigmoid

The sigmoid function is:

$\sigma(z)=\frac{1}{1+e^{-z}}$

Its graph is the familiar S-shaped curve that maps every real number into (0, 1).

Where linear regression outputs a value greater than 0, logistic regression outputs a value in 0.5 ~ 1;

Where linear regression outputs a value less than 0, logistic regression outputs a value in 0 ~ 0.5.
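The mapping above can be checked numerically. A minimal sketch in plain Python (the function name `sigmoid` is mine):

```python
import math

def sigmoid(z):
    """Map any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z > 0 lands in (0.5, 1); z < 0 lands in (0, 0.5); z = 0 gives exactly 0.5.
print(sigmoid(2.0))   # greater than 0.5
print(sigmoid(-2.0))  # less than 0.5
print(sigmoid(0.0))   # exactly 0.5
```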

The main point of this article: **linear regression estimates its parameters with least squares**, while **logistic regression uses maximum likelihood estimation**. (Of course, gradient descent can be used for both.)

# 3 Likelihood estimation of logistic regression parameters

For example, suppose we have a training dataset for a binary classification problem.

The $x^1, x^2, \dots$ on top are the samples, and the $C_1, C_2$ below are their categories. There are two categories in total.

Now suppose we have a logistic regression model:

$f_{w,b}(x)=\sigma(\sum_i{w_ix_i}+b)$

So $f_{w,b}(x^1)$ is a number between 0 and 1. We can simply define this number to be the probability that the sample belongs to category $C_1$; conversely, 1 minus this number is the probability that it belongs to category $C_2$.

**Likelihood, understood simply, means maximizing the probability of observing the dataset above.**

Let's spell it out:

- The probability that $x^1$ is $C_1$ is $f_{w,b}(x^1)$;
- The probability that $x^2$ is $C_1$ is $f_{w,b}(x^2)$;
- The probability that $x^3$ is $C_2$ is $1-f_{w,b}(x^3)$;
- ……
- The probability that $x^N$ is $C_1$ is $f_{w,b}(x^N)$.

Since the samples are independent of each other, the probability of the whole dataset is the product over all samples. This product is the likelihood:

$L(w,b)=f_{w,b}(x^1)\,f_{w,b}(x^2)\,(1-f_{w,b}(x^3))\cdots f_{w,b}(x^N)$
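The product over samples is easy to write down directly. A sketch with a made-up toy dataset (all names and values here are mine, purely for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(w, b, x):
    # Logistic model for a single feature vector x.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy dataset: x^1, x^2 belong to C_1, x^3 belongs to C_2.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0]   # 1 means C_1, 0 means C_2
w, b = [0.5, 0.5], 0.0

# Likelihood: product over samples of P(correct class | x).
L = 1.0
for xi, yi in zip(X, y):
    p = f(w, b, xi)
    L *= p if yi == 1 else (1.0 - p)
print(L)  # a probability-like number between 0 and 1
```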

We want the parameter estimates of $w, b$ to be the ones that maximize this likelihood. That is to say:

$w^*,b^*=\arg\max_{w,b}L(w,b)$

Adding a minus sign turns this into a minimization problem. Taking the log does not affect the estimates of $w$ and $b$ either, because log is monotonically increasing: $\log L(w,b)$ is largest exactly when $L(w,b)$ is largest. So we get:

$w^*,b^*=\arg\min_{w,b}\,-\log L(w,b)$

[Note: all logs are natural logarithms based on e]

The log also turns the product into a sum:

$\log L(w,b)=\log f(x^1)+\log f(x^2)+\log(1-f(x^3))+\cdots$
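A quick numerical sanity check that the log of a product equals the sum of the logs (the probabilities below are made-up example values):

```python
import math

ps = [0.8, 0.7, 0.6]  # made-up per-sample probabilities
log_of_product = math.log(ps[0] * ps[1] * ps[2])
sum_of_logs = sum(math.log(p) for p in ps)
print(abs(log_of_product - sum_of_logs))  # essentially zero
```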

Then, to simplify this expression, we encode the categories $C_1, C_2$ as 1 and 0, and denote the true label of each sample by $y^i$:

$\log L(w,b)=\sum_{i=1}^N\left[y^i\log f(x^i)+(1-y^i)\log(1-f(x^i))\right]$

[This doesn't just look like binary cross-entropy; up to the minus sign, it *is* binary cross-entropy.]

- When $y^i=1$, i.e. the category is $C_1$, the term is $\log f(x^i)$;
- When $y^i=0$, i.e. the category is $C_2$, the term is $\log(1-f(x^i))$.

So in fact, the loss function we get is:

$loss=-\log L(w,b)=-\sum_{i=1}^N\left[y^i\log f(x^i)+(1-y^i)\log(1-f(x^i))\right]$
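This loss translates directly into code. A minimal sketch in plain Python on a made-up single-feature toy dataset (function and variable names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, X, y):
    """loss = -sum_i [ y^i log f(x^i) + (1 - y^i) log(1 - f(x^i)) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total

# Made-up toy data: y = 1 for C_1, y = 0 for C_2.
X = [[1.0], [2.0], [-1.0]]
y = [1, 1, 0]
loss = bce_loss([1.0], 0.0, X, y)
print(loss)  # a positive number; smaller means a better fit
```

Note that a weight fitting the data better (e.g. `w = [5.0]` here) yields a strictly smaller loss, which is exactly what minimization exploits.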

As mentioned before, how do we find the $w$ and $b$ that minimize the loss?

[Merciless gradient descent.]

So we just compute $\frac{\partial\, loss}{\partial w}$ and multiply it by the learning rate. I won't continue the derivation here; if you have the patience, you can work it out slowly. It is definitely derivable.

Here is the result:

$-\frac{\partial \ln L(w,b)}{\partial w_i}=\sum_{n=1}^N-\left(y^n-f_{w,b}(x^n)\right)x_i^n$

- $w_i$ is the $i$-th parameter to be estimated, corresponding to the $i$-th feature;
- $x_i^n$ is the value of the $i$-th feature of the $n$-th sample;
- $y^n$ is the true class of the $n$-th sample, 0 or 1.
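Putting the gradient formula above into a plain gradient-descent loop gives a working trainer. A self-contained sketch in plain Python (the toy dataset, learning rate, and step count are all my made-up choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.1, steps=1000):
    """Gradient descent on -log L(w,b).

    dLoss/dw_i = sum_n -(y^n - f(x^n)) * x_i^n, and similarly for b
    (with x_i^n replaced by 1).
    """
    n_feats = len(X[0])
    w, b = [0.0] * n_feats, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * n_feats, 0.0
        for xn, yn in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, xn)) + b)
            err = yn - p                      # y^n - f_{w,b}(x^n)
            for i in range(n_feats):
                gw[i] += -err * xn[i]         # gradient w.r.t. w_i
            gb += -err                        # gradient w.r.t. b
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Made-up linearly separable data: positive x -> class 1, negative -> class 0.
X = [[1.0], [2.0], [3.0], [-1.0], [-2.0], [-3.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)
print(sigmoid(w[0] * 2.0 + b))   # close to 1 for a positive input
print(sigmoid(w[0] * -2.0 + b))  # close to 0 for a negative input
```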