In theory, a linear regression model can be applied to both regression and classification. For regression, it predicts a continuous target value. For classification it is a poor fit, because the output of linear regression is unbounded and does not map cleanly onto a set of class labels. Even for binary classification, linear regression plus a threshold rarely yields a robust classifier. Logistic regression was developed to classify better. It is mainly used for binary classification, where it expresses the probability that an event occurs. Logistic regression assumes that the data follow a Bernoulli distribution, so LR is a parametric model whose purpose is to find the optimal parameters. Logistic regression is also a generalized linear model.

[Supplement] In statistics, a parametric model assumes that the population (a random variable) follows a particular distribution determined by a few parameters (for example, the normal distribution is determined by its mean and variance). A model built on such an assumption is called a parametric model. A nonparametric model makes no assumption about the form of the population distribution: we only know that the population is a random variable whose distribution exists, but we know neither its form nor its parameters. Inference must then rely on nonparametric statistical methods applied to the given samples.

Let’s first review simple linear regression (linear regression with a single input variable and a single output variable). $x^{(i)}$ denotes the input variable (independent variable) and $y^{(i)}$ denotes the output variable (dependent variable). A pair $(x^{(i)}, y^{(i)})$ is one training sample, and the $m$ training samples $\{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$ are called the training set. The superscript $i$ denotes the $i$-th sample. The uppercase $X$ denotes the space of all input values, and the uppercase $Y$ denotes the space of all output values. Regression belongs to supervised learning. In supervised learning, given a training set, our goal is to “learn” a function $h: X \to Y$ such that $h(x)$ is a “good” predictor of the true value of $y$. Here $h$ is called a model, also known as a hypothesis.

If the output value we want to predict is continuous, the problem is called a regression problem. For simple linear regression, the model $h$ can be written as $h_\theta(x) = \theta_0 + \theta_1 x$, where $\theta_0$ and $\theta_1$ are the parameters of the model. The objective of linear regression is to find the $\theta_0$ and $\theta_1$ that make the model fit best.

**The general steps of regression problem are as follows:**

- Choose the $h$ function (the hypothesis);
- Construct the $J$ function (the loss function);
- Find a way to minimize the $J$ function and obtain the regression parameters ($\theta$).
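The three steps listed above can be sketched for simple linear regression. This is a minimal illustration rather than a reference implementation: the function names (`h`, `J`, `fit`), the toy data, the learning rate, and the iteration count are all assumptions made for the example, and plain batch gradient descent on half the mean squared error stands in for "a way to minimize J".

```python
def h(theta0, theta1, x):
    """Step 1, the hypothesis: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def J(theta0, theta1, xs, ys):
    """Step 2, the loss: half the mean squared error over the training set."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def fit(xs, ys, lr=0.05, n_iter=5000):
    """Step 3, minimize J with batch gradient descent (illustrative settings)."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iter):
        errors = [h(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        theta0 -= lr * grad0
        theta1 -= lr * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit(xs, ys)
print(theta0, theta1, J(theta0, theta1, xs, ys))
```

On this noise-free toy data the fitted parameters should approach the generating values 1 and 2. On real data the learning rate and iteration count would need tuning.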

## From linear regression to logistic regression

We know that a linear regression model finds the linear coefficients $\theta$ relating the output vector $Y$ to the input sample matrix $X$, such that $Y = X\theta$. Here $Y$ is continuous, so this is a regression model. What if we want $Y$ to be discrete? One idea is to apply a further function transformation $g(Y)$ to $Y$. If values of $g(Y)$ in one real interval mean category A, values in another real interval mean category B, and so on, we obtain a classification model. If there are only two categories, it is a binary classification model. This is the starting point of logistic regression. Let’s start with binary logistic regression.

Both logistic regression and linear regression are generalized linear models. Logistic regression assumes that the dependent variable $y$ follows a Bernoulli distribution, while linear regression assumes that it follows a Gaussian distribution, so the two have much in common. If the sigmoid mapping is removed, logistic regression reduces to linear regression. In that sense logistic regression rests on linear regression theory, but it introduces a nonlinearity through the sigmoid function, which lets it handle 0/1 classification problems easily.

First, we introduce the sigmoid function, also known as the logistic function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The function curve is as follows:

[Figure: curve of the sigmoid function]

From the figure, we can see that the sigmoid function is an S-shaped curve, that its values lie strictly between 0 and 1, and that it quickly approaches 0 or 1 as $z$ moves away from 0. This property is very important for solving the binary classification problem.
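The behaviour of the sigmoid function described above is easy to check numerically. The sketch below assumes the standard definition g(z) = 1 / (1 + e^(-z)); the function name `sigmoid` and the two-branch form, used only to avoid floating-point overflow for very negative inputs, are choices made for this example.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equal form exp(z) / (1 + exp(z)),
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output stays inside [0, 1] and saturates quickly away from z = 0.
for z in (-10, -2, 0, 2, 10):
    print(z, sigmoid(z))
```

At `z = 0` the function returns exactly 0.5, which is the value the classification threshold is built around.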

We know the model of linear regression:

$$Y = X\theta$$

As mentioned for generalized linear regression, applying a function $g$ to the output of linear regression turns it into logistic regression. In logistic regression $g$ is usually taken to be the sigmoid function, so the hypothesis function of logistic regression has the form given below.

The binary logistic regression model is:

$$h_\theta(x) = g(x\theta) = \frac{1}{1 + e^{-x\theta}}$$

Here $x$ is the sample input, $h_\theta(x)$ is the model output, which can be understood as a probability, and $\theta$ is the vector of model parameters that we need to solve for. To map the model output $h_\theta(x)$ to the binary sample output $y$ (assumed to be 0 or 1): if $h_\theta(x) > 0.5$, i.e. $x\theta > 0$, then $y$ is 1; if $h_\theta(x) < 0.5$, i.e. $x\theta < 0$, then $y$ is 0. The value $h_\theta(x) = 0.5$, i.e. $x\theta = 0$, is the critical case, for which the logistic regression model itself cannot determine the class. The smaller the value of $h_\theta(x)$, the higher the probability of class 0; conversely, the larger the value, the higher the probability of class 1. Close to the critical point, classification accuracy decreases.

The model can also be written in matrix form:

$$h_\theta(X) = \frac{1}{1 + e^{-X\theta}}$$

where $h_\theta(X)$ is the $m \times 1$ vector of model outputs, $X$ is the $m \times n$ sample feature matrix, and $\theta$ is the $n \times 1$ vector of model coefficients.
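The decision rule described above (predict 1 when the model output exceeds 0.5, which is the same as the linear score being positive) can be sketched as follows. The parameter values and samples are hypothetical, the first feature is assumed to be a constant 1 acting as the intercept, and assigning the critical case of exactly 0.5 to class 0 is an arbitrary convention of this example, since the model itself does not decide that case.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, theta):
    """Model output g(x . theta): the estimated probability that y = 1."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    return sigmoid(z)


def predict(x, theta, threshold=0.5):
    """Return 1 if the model output exceeds the threshold, otherwise 0."""
    return 1 if predict_proba(x, theta) > threshold else 0


# Hypothetical parameters and samples; x[0] = 1 plays the role of the intercept.
theta = [-1.0, 2.0]
samples = [[1.0, 0.0], [1.0, 0.5], [1.0, 2.0]]  # linear scores: -1, 0, 3
labels = [predict(x, theta) for x in samples]
print(labels)
```

The middle sample has a linear score of exactly 0, so its probability is 0.5 and it lands on the decision boundary.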

**The assumptions of logistic regression**

Like the linear regression model, logistic regression rests on assumptions. There are two:

(1) The data follow a Bernoulli distribution

(2) The output value of the model is the probability that the sample is positive

Based on these two assumptions, we obtain the posterior probability estimates for classes 1 and 0 respectively (this is the specific meaning of the value of the binary logistic regression model function $h_\theta(x)$): $P(y = 1 \mid x, \theta) = h_\theta(x)$ and $P(y = 0 \mid x, \theta) = 1 - h_\theta(x)$.

Now that we understand the binary logistic regression model, we turn to its loss function. Our goal is to minimize the loss function to obtain the corresponding model coefficients $\theta$.

## The loss function of binary logistic regression

Let’s review the loss function of linear regression. Because the output of linear regression is continuous, the sum of squared errors can be used to define the loss function. The output of logistic regression is not continuous, so the way linear regression defines its loss function does not carry over. However, we can use the maximum likelihood method to derive the loss function.

According to the definition of binary logistic regression above, assume that the sample output is 0 or 1. Then we have:

$$P(y = 1 \mid x, \theta) = h_\theta(x)$$

$$P(y = 0 \mid x, \theta) = 1 - h_\theta(x)$$

Writing these two formulas as one gives:

$$P(y \mid x, \theta) = h_\theta(x)^{y} \, (1 - h_\theta(x))^{1 - y}$$

where $y$ can only take the values 0 or 1.

As mentioned above, logistic regression is a probabilistic model, so we derive its loss function by maximum likelihood estimation (MLE). Having obtained the probability distribution of $y$, we can maximize the likelihood function to solve for the model coefficients $\theta$ that we need (the weights of the model's independent variables). The likelihood function $L(\theta)$ is:

$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \, (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$$

where $m$ is the number of samples.

A product is awkward to compute, so we take the logarithm of both sides to turn it into a sum. For convenience we maximize the log-likelihood function:

$$\ln L(\theta) = \sum_{i=1}^{m} \left( y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln (1 - h_\theta(x^{(i)})) \right)$$

The maximum likelihood estimate is the $\theta$ that maximizes $\ln L(\theta)$, and it could be found by gradient ascent. In optimization, however, the convention is to minimize a function, so we multiply the log-likelihood by $-1$. The $\theta$ that minimizes the result is then the best parameter, and the minimum can be found by gradient descent. Adding the negative sign gives the final loss function:

$$J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{m} \left( y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln (1 - h_\theta(x^{(i)})) \right)$$

The loss function can also be expressed in matrix form:

$$J(\theta) = -Y^T \ln h_\theta(X) - (E - Y)^T \ln (E - h_\theta(X))$$

where $E$ is the all-ones vector.
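The loss discussed above, the negative log-likelihood of the Bernoulli model, can be computed directly. The sketch below is illustrative only: the function name `log_loss`, the toy data, and the small clipping constant `eps` (added so the logarithm never receives exactly 0) are assumptions of this example, not part of the derivation.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(X, y, theta, eps=1e-12):
    """J(theta) = -sum_i [ y_i ln h_i + (1 - y_i) ln(1 - h_i) ], h_i = g(x_i . theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(a * b for a, b in zip(xi, theta)))
        # Clip h away from 0 and 1 so both logarithms stay finite.
        h = min(max(h, eps), 1.0 - eps)
        total -= yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total


# Toy data (assumed): the first column is a constant 1 for the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
loss_zero = log_loss(X, y, [0.0, 0.0])  # every prediction is 0.5
loss_good = log_loss(X, y, [0.0, 2.0])  # predictions agree with the labels
print(loss_zero, loss_good)
```

With all parameters at zero every prediction is 0.5, so the loss equals the number of samples times ln 2. Parameters whose predictions agree with the labels give a smaller value.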

The loss function of logistic regression, the (negative) log-likelihood function, is also used for GBDT classification and is also known as “cross entropy”. Cross entropy is the most commonly used cost function in logistic regression, and it is also widely used in neural networks. In 1948, Claude Elwood Shannon introduced the thermodynamic concept of entropy into information theory, so it is also called Shannon entropy; it is the expectation of the Shannon information content (SIC). Shannon information measures uncertainty: if the Shannon information of an event equals 0, the occurrence of that event provides no new information. For example, a certain event has probability 1 and causes no surprise when it occurs; when an impossible event occurs, its Shannon information is infinite, which means that it provides infinite new information and surprises us immensely. For more explanation, please refer to the blog: https://blog.csdn.net/rtygbwwwerr/article/details/50778098
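The quantities mentioned above can be illustrated with a few lines of code. This sketch assumes the usual definitions (information content of an event with probability p is -log p, entropy is its expectation, cross entropy replaces the inner probability with a predicted one) and uses base-2 logarithms, so results are in bits. The logistic regression loss uses the natural logarithm instead, which only changes the unit. All function names are hypothetical.

```python
import math


def information_content(p):
    """Shannon information content log2(1 / p) of an event with probability p."""
    if p == 0:
        return math.inf  # an "impossible" event carries infinite information
    return math.log2(1.0 / p)


def entropy(dist):
    """Shannon entropy: the expected information content, in bits."""
    return sum(p * information_content(p) for p in dist if p > 0)


def cross_entropy(p_true, q_pred):
    """Cross entropy H(p, q) = sum_k p_k * log2(1 / q_k), in bits."""
    return sum(p * information_content(q) for p, q in zip(p_true, q_pred) if p > 0)


print(information_content(1.0))                # a certain event: no surprise
print(information_content(0.0))                # an impossible event: infinite surprise
print(entropy([0.5, 0.5]))                     # a fair coin
print(cross_entropy([1.0, 0.0], [0.5, 0.5]))   # true label 0, prediction 0.5
```

For a single sample with a one-hot true label, the cross entropy reduces to the information content of the probability the model assigned to the true class, which is exactly the per-sample term of the logistic regression loss.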

## Optimization method of loss function in binary logistic regression

For a linear regression model, the least squares method can be used, but for logistic regression the traditional least squares method is not suitable. There are many methods for minimizing the loss function of binary logistic regression; the most common are gradient descent, coordinate descent, and so on. Gradient descent can be derived either algebraically or in matrix form; the algebraic derivation is just more tedious. For the detailed derivation of gradient descent on the logistic regression loss function, see https://www.cnblogs.com/pinard/p/6029432.html and https://zhuanlan.zhihu.com/p/51279024.
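As a rough illustration of the gradient descent approach mentioned above, the sketch below uses the standard result that the gradient of the negative log-likelihood is the sum over samples of (prediction minus label) times the input, written X^T (h - Y) in matrix form. The function names, the toy data, the learning rate, and the iteration count are assumptions of this example; the linked articles give the full derivation.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient(X, y, theta):
    """Gradient of the negative log-likelihood: sum_i (h_i - y_i) * x_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(a * b for a, b in zip(xi, theta))) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad


def fit_logistic(X, y, lr=0.05, n_iter=5000):
    """Batch gradient descent: theta <- theta - lr * gradient (illustrative settings)."""
    theta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = gradient(X, y, theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data (assumed): first column is a constant 1; the classes overlap,
# so the optimum is finite and the iteration converges.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]
theta = fit_logistic(X, y)
print(theta, gradient(X, y, theta))
```

The learning rate here was picked by hand for this small data set. Library implementations typically use more robust solvers with convergence checks rather than a fixed number of iterations.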

In practice, we generally don’t have to worry about optimization methods. Most machine learning libraries have built-in optimization methods for logistic regression, but it is still worth understanding them.

## Regularization of binary logistic regression

Logistic regression also faces the problem of overfitting, so regularization should be considered. The common choices are L1 regularization and L2 regularization.

**Main cause:** overfitting often results from having too many features.

**Solutions:**

1) Reduce the number of features (this loses some information, even if the features are well selected)

- Select the features to keep manually;
- Use a model selection algorithm;

2) Regularization (more effective when there are many features)

- L2 regularization keeps all features but shrinks the magnitude of θ
- L1 regularization tends to drive some components of θ to exactly zero, which yields a sparse model

The loss function of L1-regularized logistic regression is shown below. Compared with the ordinary logistic regression loss function, it adds the L1 norm as a penalty term, with the hyperparameter $\alpha$ as the penalty coefficient that controls the size of the penalty:

$$J(\theta) = -Y^T \ln h_\theta(X) - (E - Y)^T \ln (E - h_\theta(X)) + \alpha \|\theta\|_1$$

where $\|\theta\|_1$ is the L1 norm of $\theta$. The commonly used optimization methods for the L1-regularized logistic regression loss function are coordinate descent and least angle regression.

The L2-regularized loss function of binary logistic regression is:

$$J(\theta) = -Y^T \ln h_\theta(X) - (E - Y)^T \ln (E - h_\theta(X)) + \frac{1}{2} \alpha \|\theta\|_2^2$$

where $\|\theta\|_2$ is the L2 norm of $\theta$. The L2-regularized loss function is optimized in much the same way as that of ordinary logistic regression.
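The two penalties described above can be sketched by adding a term to the unregularized loss. This is an illustration under stated assumptions: the function names are hypothetical, the L2 penalty is taken as half of alpha times the squared L2 norm, and every component of theta (including any intercept) is penalized for simplicity, whereas practical implementations often leave the intercept unpenalized.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(X, y, theta, eps=1e-12):
    """Unregularized negative log-likelihood of logistic regression."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(a * b for a, b in zip(xi, theta)))
        h = min(max(h, eps), 1.0 - eps)  # keep the logarithms finite
        total -= yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total


def l1_loss(X, y, theta, alpha):
    """Loss plus alpha * ||theta||_1 (sum of absolute values)."""
    return log_loss(X, y, theta) + alpha * sum(abs(t) for t in theta)


def l2_loss(X, y, theta, alpha):
    """Loss plus 0.5 * alpha * ||theta||_2^2 (half the sum of squares)."""
    return log_loss(X, y, theta) + 0.5 * alpha * sum(t * t for t in theta)


# Toy data and parameters (assumed for illustration).
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = [1.0, -3.0]
alpha = 0.1
base = log_loss(X, y, theta)
print(base, l1_loss(X, y, theta, alpha), l2_loss(X, y, theta, alpha))
```

For `theta = [1, -3]` and `alpha = 0.1` the L1 penalty is 0.1 × 4 and the L2 penalty is 0.5 × 0.1 × 10, so larger coefficients are penalized more heavily under L2 than under L1.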

## The extension of binary logistic regression: multinomial logistic regression

The models and loss functions above are limited to binary logistic regression. In fact, they extend easily to multinomial (multi-class) logistic regression. For example, we can always treat one class as positive and all the other classes as 0. This is the most commonly used one-vs-rest method, or OvR for short.

Another approach to multinomial logistic regression is many-vs-many (MvM), which each time selects the samples of some classes and the samples of some other classes and fits a binary logistic regression on them. The most common variant is one-vs-one (OvO), a special case of MvM in which each binary logistic regression is trained on the samples of just two classes.
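The way one-vs-rest and one-vs-one turn a multi-class problem into binary problems can be sketched as below. Only the relabelling step is shown, not the fitting of each binary classifier; the function names, the example labels, and the choice of which class in a pair counts as positive are assumptions made for this illustration.

```python
from itertools import combinations


def ovr_targets(labels):
    """One-vs-rest: one binary target vector per class (1 = that class, 0 = rest)."""
    classes = sorted(set(labels))
    return {c: [1 if lab == c else 0 for lab in labels] for c in classes}


def ovo_subsets(labels):
    """One-vs-one: for each pair of classes, keep only the samples of those two classes."""
    classes = sorted(set(labels))
    subsets = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, lab in enumerate(labels) if lab in (a, b)]
        # Within each pair, the second class is (arbitrarily) the positive one.
        subsets[(a, b)] = (idx, [1 if labels[i] == b else 0 for i in idx])
    return subsets


labels = ["cat", "dog", "bird", "dog", "cat"]
ovr = ovr_targets(labels)   # 3 binary problems, each over all 5 samples
ovo = ovo_subsets(labels)   # 3 binary problems, each over a subset of samples
print(ovr["dog"])
print(ovo[("bird", "cat")])
```

With K classes, one-vs-rest needs K binary classifiers trained on all samples, while one-vs-one needs K(K-1)/2 classifiers, each trained on a smaller subset.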

Multinomial logistic regression can also be handled by algorithms such as softmax regression.
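Softmax regression replaces the sigmoid with the softmax function, which turns one linear score per class into a probability per class. The sketch below assumes the standard softmax definition; the function names and the parameter vectors in `thetas` are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(x, thetas):
    """Score each class with x . theta_k and return (most probable class, probabilities)."""
    scores = [sum(a * b for a, b in zip(x, theta_k)) for theta_k in thetas]
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best, probs


# Hypothetical parameters for 3 classes and 2 features.
thetas = [[0.0, 0.0], [1.0, -1.0], [-1.0, 2.0]]
best, probs = predict_class([1.0, 2.0], thetas)  # scores: 0, -1, 3
print(best, probs)
```

With two classes and scores `[z, 0]`, the first softmax probability equals the sigmoid of `z`, so binary logistic regression is the two-class special case.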

## The advantages and disadvantages of logistic regression

**Advantages:**

1. It models the class probability directly without assuming a data distribution in advance, which avoids the problems caused by an inaccurate distributional assumption

2. The form is simple, the probability score of each sample is easy to inspect, and the model is highly interpretable: the feature weights show how strongly each feature influences the final result

3. Training is fast, and classification is very fast

4. Low memory consumption

**Disadvantages:**

1. Accuracy is generally not very high: because the form is very simple, it is hard to fit the true distribution of the data

2. When the feature space is large, logistic regression does not perform very well

3. It handles imbalanced data poorly

**The differences between logistic regression and linear regression are as follows:**

1. Linear regression solves regression problems, while logistic regression solves classification problems

2. In linear regression the sample output is continuous, $y \in (-\infty, +\infty)$, whereas in logistic regression $y \in \{0, 1\}$ can only take the values 0 and 1

3. The fitting functions differ in essence:

Linear regression decision function: $f(x) = \theta^T x = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

Logistic regression decision function: $f(x) = P(y = 1 \mid x; \theta) = g(\theta^T x)$

4. The fitting function of linear regression fits the output variable $y$ itself, while that of logistic regression fits the probability that a sample belongs to class 1. In linear regression, $\theta^T x$ is the predicted value; in logistic regression, $\theta^T x = 0$ is the decision boundary. In other words, the line obtained by linear regression fits the distribution of the samples, while the line obtained by logistic regression is a decision boundary that separates the samples of different classes as well as possible. The two serve different purposes.

**Logistic regression in one sentence:** logistic regression assumes that the data follow a **Bernoulli distribution**; building on linear regression, it applies the sigmoid function for binary classification, derives the loss function by the **maximum likelihood method**, and optimizes that loss function with gradient descent. It is a **discriminative classification algorithm**. The essence of classification is to find a decision boundary in space that completes the classification decision.

Reference articles:

https://www.cnblogs.com/pinard/p/6029432.html

https://www.cnblogs.com/huangyc/p/9813891.html

https://zhuanlan.zhihu.com/p/28408516

https://blog.csdn.net/pakko/article/details/37878837

https://zhuanlan.zhihu.com/p/51279024