Machine learning — fundamentals of algorithms (VI) linear regression and logistic regression



Linear regression

In classification the target is discrete; in regression the target value is continuous. The purpose of linear regression is to find the linear trend that best describes the data.

For a sample with n features, the model can be written as

$$f(x)=w_1x_1+w_2x_2+\dots+w_nx_n+b=w^Tx+b$$

A model of this form is called a linear model. $w$ is called the weight and $b$ is called the offset term (bias).

Regression is usually solved iteratively. Predictions rarely match the actual values exactly, so we need a quantity that describes how accurate they are. This quantity is the loss function.

For the least squares method, the loss function is the sum of the squared errors.
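A minimal sketch of a linear model and its least squares loss, in plain Python. The function names `predict` and `sum_squared_error` and the toy data are assumptions made for illustration, not part of the original article.

```python
def predict(w, b, x):
    """Linear model: weighted sum of the features plus the offset term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sum_squared_error(w, b, X, y):
    """Least squares loss: sum over samples of (prediction - target)^2."""
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y))


# Toy data generated by y = 2*x + 1 (made up for illustration)
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]

perfect_loss = sum_squared_error([2.0], 1.0, X, y)  # parameters that fit exactly
worse_loss = sum_squared_error([1.0], 0.0, X, y)    # parameters that fit poorly
```

The better the parameters describe the data, the smaller the loss, which is what the optimization below tries to exploit.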

Gradient descent

When there is a large amount of data, the value of the parameter W cannot practically be obtained by solving for it directly, so we need another method to calculate it.

$$w=w_0-\alpha\frac{\partial f}{\partial w_0}$$
$f$ is the loss function and $\alpha$ is the learning rate, which must be specified manually. $\frac{\partial f}{\partial w_0}$ gives the direction in which the function increases, so each step moves against it, along the descending direction of the function, until the lowest point is found.
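The update rule can be sketched on a one-parameter function. The helper name `gradient_descent`, the example function f(w) = (w - 3)^2 and the values of `alpha` and `steps` are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - alpha * df/dw."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# f(w) = (w - 3)^2 has derivative 2 * (w - 3) and its lowest point at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

A learning rate that is too large makes the iterates overshoot and diverge, while one that is too small makes progress slow, which is why it has to be tuned by hand.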

Reference article:…

Batch gradient descent (BGD)

In batch gradient descent, every update of a parameter requires computing the gradient over all sample points.

Stochastic gradient descent (SGD) (to be expanded)

Stochastic gradient descent computes the gradient for each update from a single randomly chosen sample point.

Mini-batch stochastic gradient descent (to be expanded)

Mini-batch gradient descent sits between BGD and SGD: each update takes a subset of the samples to compute the descent direction.
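The three variants differ only in how many samples feed each update, which the following sketch makes explicit for a one-feature linear model. The function names, the learning rate, the epoch count and the noise-free toy data are all assumptions for illustration.

```python
import random


def gradient(w, b, batch):
    """Gradient of the mean squared error over `batch` for the model w*x + b."""
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def fit(data, batch_size, alpha=0.3, epochs=500, seed=0):
    """batch_size == len(data): BGD; batch_size == 1: SGD; otherwise mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for start in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[start:start + batch_size])
            w -= alpha * gw
            b -= alpha * gb
    return w, b


# Noise-free toy data from y = 2x + 1 (made up for illustration)
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(10)]

w_bgd, b_bgd = fit(data, batch_size=len(data))  # all samples per update
w_sgd, b_sgd = fit(data, batch_size=1)          # one random sample per update
w_mini, b_mini = fit(data, batch_size=4)        # a subset per update
```

On this small, noise-free dataset all three settings recover roughly the same parameters; on real data the trade-off is between the cost of each update (BGD) and the noise in each update (SGD).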

Regression performance evaluation

Mean square error (MSE)
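MSE is the average of the squared differences between the true values and the predictions. A minimal sketch follows; the function name and the sample numbers are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Errors are 1, 0 and -2, so the squared errors sum to 5
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```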

Linear regression characteristics

Linear regression is the simplest and easiest regression model to use, but on its own it cannot solve the overfitting problem.

Collinearity feature

Multicollinearity refers to distorted or hard-to-obtain model estimates caused by exact or high correlation between the explanatory variables in a linear regression model. Perfect collinearity is rare; in practice it is usually collinearity to a certain extent, that is, approximate collinearity.

Overfitting and underfitting (linear regression)

Causes of overfitting
1. There are too many original features, some of them noisy, and the model is too complex
Solutions
1. Perform feature selection to eliminate highly correlated features
2. Cross validation
3. Regularization

Causes of underfitting
1. Too few original features
Solutions
1. Add more features or use a more complex model
2. Cross validation


Effect of regularization: it can make each element of W very small, close to 0
Advantage: the smaller the parameters, the simpler the model, and the simpler the model, the less likely it is to overfit

Regularization reduces the complexity of the model by adjusting the model parameters (their number and size), so as to avoid overfitting. "Regularization" is the machine learning term; other fields use different names:

What machine learning calls L1 and L2 regularization is called a penalty term in statistics and a norm in mathematics.

Regularization is a typical method for model selection. It implements the structural risk minimization strategy, which adds a regularization term (penalty term) to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity: the more complex the model, the larger the regularization value.

$$R(f)=\frac1N\sum_{i=1}^N{L(y_i,f(x_i))}+\lambda J(f)$$

  • The first term is the empirical risk, expressed through the loss function
  • $J(f)$ in the second term represents the complexity of the model. It is a functional defined on the hypothesis space F (a functional usually means a "function" whose domain is a set of functions and whose range is the real numbers; in other words, a mapping from a vector space of functions to the real numbers, so its input is a function and its output is a real number; from Wikipedia). The more complex the model $f$, the greater the complexity $J(f)$; conversely, the simpler the model, the smaller $J(f)$. In other words, the complexity term is a penalty on complex models. $\lambda \ge 0$ is the coefficient that trades off empirical risk against model complexity. Models with low structural risk tend to predict well on both the training data and unknown test data.
Lasso regression

Linear least squares method with L1 regularization.

Ridge regression

Linear least squares method with L2 regularization.
Ridge regression yields regression coefficients that are more realistic and reliable. It also reduces the fluctuation range of the estimated parameters, making them more stable. It has great practical value in studies with ill-conditioned data.

Elasticnet regression

Linear least squares method with L1 and L2 regularization.
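Written out for a linear model $f(x)=w^Tx+b$ with squared loss, the three objectives are commonly given in the following form. The symbols $\lambda$, $\lambda_1$ and $\lambda_2$ (non-negative regularization strengths) and the sample count $N$ are notation assumed here; scaling conventions differ between textbooks and libraries.

Lasso regression (L1):

$$\min_{w,b}\ \sum_{i=1}^N\left(y_i-w^Tx_i-b\right)^2+\lambda\lVert w\rVert_1$$

Ridge regression (L2):

$$\min_{w,b}\ \sum_{i=1}^N\left(y_i-w^Tx_i-b\right)^2+\lambda\lVert w\rVert_2^2$$

Elastic net regression (L1 and L2):

$$\min_{w,b}\ \sum_{i=1}^N\left(y_i-w^Tx_i-b\right)^2+\lambda_1\lVert w\rVert_1+\lambda_2\lVert w\rVert_2^2$$

The L1 term tends to push some weights exactly to 0, while the L2 term shrinks all weights toward 0 without usually zeroing them.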


Logistic regression

Logistic regression is suited to regression problems in which the dependent variable is categorical. The most common case is binary classification, where the target follows a binomial (0-1) distribution. For example, a person's height, weight, skin color and so on can be used to predict whether they are prone to illness, with 0 meaning sick and 1 meaning not sick. This kind of problem is called a logistic regression problem. Although the name says regression, logistic regression actually solves a classification problem.

Generalized linear model (GLM)

For linear regression, the model is

$$y=w^Tx+b$$

The output of a linear model is continuous and cannot be used directly for classification, so the predicted value needs one more processing step. The function that connects the linear predictor to the target is called the link function $g$, and the resulting generalized linear model is:

$$g(y)=w^Tx+b,\quad\text{that is,}\quad y=g^{-1}(w^Tx+b)$$

The generalized linear model is a flexible extension of the linear model. It assumes that the dependent variable follows an exponential family distribution (which can be understood as a restriction), that is, for a given input $x$, $y$ follows a distribution from the exponential family.

Exponential family distribution

If a class of distributions can be written in the following form, it is called an exponential family distribution.

$$p(y;\eta)=h(y)\exp\left(\eta^TT(y)-a(\eta)\right)$$

If $T$, $a$ and $h$ in the distribution are fixed, varying $\eta$ yields the different distributions in this family.

The Bernoulli distribution (also known as the two-point distribution or 0-1 distribution), the Poisson distribution and the Gaussian distribution all belong to the exponential family.

Sigmoid function

The sigmoid function is a mathematical function with a smooth S-shaped curve. It is widely used in logistic regression and artificial neural networks. Its mathematical form is:

$$\sigma(x)=\frac{1}{1+e^{-x}}$$

As x tends to negative infinity the sigmoid function approaches 0, and as x tends to positive infinity it approaches 1. The function is also continuous, smooth and monotonic, which is why it is used in the classifier of the logistic regression model. The sigmoid output can be interpreted as a probability.
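A minimal sketch of the function in plain Python. The two-branch form is an implementation choice assumed here to avoid overflow for large negative inputs; it computes the same value as 1 / (1 + e^(-x)).

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    z = math.exp(x)
    return z / (1.0 + z)


values = [sigmoid(x) for x in (-30.0, 0.0, 30.0)]
```

The three values illustrate the behavior described above: close to 0 on the far left, exactly 0.5 at the origin, and close to 1 on the far right.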

Why the sigmoid function?

The fundamental reason is that logistic regression assumes the dependent variable obeys a Bernoulli distribution (i.e. the 0-1 distribution, which is why logistic regression applies to binary classification problems).
Working through the derivation leads to the conclusion that when the dependent variable obeys a Bernoulli distribution, the generalized linear model is logistic regression.
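The derivation can be sketched as follows. This is the standard textbook argument rather than a reproduction of the original figure; the symbol $\phi$ for the Bernoulli parameter $P(y=1)$ and $\eta$ for the natural parameter are notation assumed here. Writing the Bernoulli distribution in exponential form:

$$p(y;\phi)=\phi^{y}(1-\phi)^{1-y}=\exp\left(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\right)$$

The coefficient of $y$ is the natural parameter, and solving for $\phi$ gives the sigmoid:

$$\eta=\log\frac{\phi}{1-\phi}\quad\Rightarrow\quad\phi=\frac{1}{1+e^{-\eta}}$$

With the generalized linear model assumption that $\eta$ is linear in the input, $\eta=\theta^Tx$, the predicted probability becomes

$$P(y=1\mid x)=\phi=\frac{1}{1+e^{-\theta^Tx}}$$

which is exactly the logistic regression model.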

Logistic regression formula

$$h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$$

Loss function and optimization of logistic regression

The principle is the same as for linear regression, but because this is a classification problem the loss has no closed-form minimizer, so it is solved iteratively, typically by gradient descent.

Logistic regression has two possible outcomes: a sample belongs to one class or the other. The loss function measures how accurate the predictions of logistic regression are; when the prediction is completely accurate, the loss is 0. The complete loss function can be expressed as follows:

$$\mathrm{cost}\left(h_\theta\left(x\right),y\right)=\sum_{i=1}^m-y_i\log\left(h_\theta\left(x_i\right)\right)-\left(1-y_i\right)\log\left(1-h_\theta\left(x_i\right)\right)$$

The loss function is in the form of cross entropy.
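A minimal sketch of this loss in plain Python. The function name, the clipping constant `eps` and the sample probabilities are assumptions for illustration; clipping is added only to keep `log` finite when a predicted probability is exactly 0 or 1.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Sum over samples of -y*log(p) - (1-y)*log(1-p)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total


# Same labels, once with confident correct predictions and once with confident wrong ones
confident_right = cross_entropy([1, 0], [0.99, 0.01])
confident_wrong = cross_entropy([1, 0], [0.01, 0.99])
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.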

Why not choose MSE (mean squared error) as the loss function of logistic regression?

The basic form of LR (univariate case) is as follows, where $\sigma$ is the sigmoid function

$$\hat{y}=\sigma(z),\quad z=wx+b$$

If univariate logistic regression used the MSE loss $L=\frac12(\hat{y}-y)^2$ and updated $w$ and $b$ by gradient descent, we would need the derivatives of the loss with respect to these two parameters:

$$\frac{\partial L}{\partial w}=(\hat{y}-y)\,\sigma'(z)\,x$$

$$\frac{\partial L}{\partial b}=(\hat{y}-y)\,\sigma'(z)$$

It can be seen that the update rate of $w$ and $b$ depends on the derivative of the sigmoid function at the current predicted value. The sigmoid curve is nearly flat at both ends, so its derivative there is close to 0.
Therefore, if the output of the current model is close to 0 or 1, the gradient is very small, the parameter updates of gradient descent are close to 0, and the loss converges very slowly.
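The effect can be checked numerically by comparing the gradient of a squared-error loss with the gradient of the cross-entropy loss at a confidently wrong prediction. The function names and the chosen values of `w`, `b`, `x` and `y` are assumptions for illustration; the squared-error loss is taken as 0.5 * (p - y)^2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad_w(w, b, x, y):
    """d/dw of 0.5 * (p - y)^2 with p = sigmoid(w*x + b)."""
    p = sigmoid(w * x + b)
    return (p - y) * p * (1.0 - p) * x  # p * (1 - p) is the sigmoid derivative


def ce_grad_w(w, b, x, y):
    """d/dw of -y*log(p) - (1-y)*log(1-p) with p = sigmoid(w*x + b)."""
    p = sigmoid(w * x + b)
    return (p - y) * x  # the sigmoid derivative cancels out


# A confidently wrong prediction: the true label is 0 but p is close to 1
w, b, x, y = 10.0, 0.0, 1.0, 0
g_mse = mse_grad_w(w, b, x, y)
g_ce = ce_grad_w(w, b, x, y)
```

With cross entropy the gradient stays proportional to the prediction error, whereas with the squared error it is multiplied by the vanishing sigmoid derivative and almost disappears exactly when the model is most wrong.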

Parameter estimation of logistic regression

For the binary classification problem, we assume

$$P(y=1\mid x;\theta)=h_\theta(x),\qquad P(y=0\mid x;\theta)=1-h_\theta(x)$$

These combine into the general probability formula

$$P(y\mid x;\theta)=h_\theta(x)^{y}\left(1-h_\theta(x)\right)^{1-y}$$

Maximum likelihood estimation over $m$ samples gives the log-likelihood

$$\ell(\theta)=\sum_{i=1}^m y_i\log h_\theta(x_i)+(1-y_i)\log\left(1-h_\theta(x_i)\right)$$

Maximizing $\ell(\theta)$ is the same as minimizing the loss $-\ell(\theta)$. According to the definition of the gradient, the gradient descent formula of logistic regression is obtained, where $x_{i,j}$ is the j-th feature of sample $i$:

$$\theta_j:=\theta_j-\alpha\sum_{i=1}^m\left(h_\theta(x_i)-y_i\right)x_{i,j}$$
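A minimal sketch of fitting logistic regression by batch gradient descent on the negative log-likelihood, in plain Python. The function names, the learning rate, the step count and the toy data are assumptions for illustration; each input row is assumed to start with a constant 1.0 so that the first parameter acts as the offset term.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def fit_logistic(X, y, alpha=0.05, steps=2000):
    """Gradient descent: theta_j <- theta_j - alpha * sum_i (h(x_i) - y_i) * x_ij."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Toy, made-up data: the label is 1 when the single feature is large
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = fit_logistic(X, y)
p_low = predict_proba(theta, [1.0, 1.0])   # a point from the 0 region
p_high = predict_proba(theta, [1.0, 3.5])  # a point from the 1 region
```

Thresholding the predicted probability at 0.5 then turns the model into a classifier.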

Softmax classification

Softmax extends logistic regression to a K-class classification problem. The probability of class $k$ is

$$P(y=k\mid x;\theta)=\frac{e^{\theta_k^Tx}}{\sum_{j=1}^{K}e^{\theta_j^Tx}}$$
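A minimal sketch of the softmax function itself, which turns K scores (one per class) into K probabilities. The function name and the example scores are assumptions for illustration; subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    m = max(scores)  # subtract the max so that exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # index of the most probable class
```

With K = 2 the softmax reduces to the sigmoid of the difference between the two scores, which is how it connects back to binary logistic regression.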

Logistic regression and linear regression

The differences and connections between logistic regression and linear regression

  • Differences

Linear regression assumes that the response variable obeys a normal distribution, while logistic regression assumes that it obeys a Bernoulli distribution
The objective optimized by linear regression is the mean squared error (least squares), while logistic regression optimizes the likelihood function (cross entropy)
Linear regression requires a linear relationship between the independent variables and the dependent variable, while logistic regression does not
Linear regression analyzes the relationship between the dependent variable and the independent variables, while logistic regression studies how the probability of the dependent variable's outcome depends on the independent variables
Logistic regression deals with classification problems and linear regression deals with regression problems, which also gives the two models different output ranges: 0 to 1 and the real numbers, respectively
In terms of parameter estimation, both use maximum likelihood estimation, but the assumed distribution differs (the Gaussian distribution leads to mean squared error as the loss of the linear model, and the Bernoulli distribution leads to cross entropy as the loss of logistic regression)

  • Connections

Both are linear models: linear regression is an ordinary linear model, and logistic regression is a generalized linear model
In terms of expression, logistic regression is linear regression passed through a sigmoid function

Discriminative model and generative model

| | Discriminative model | Generative model |
|---|---|---|
| | Logistic regression | Naive Bayes |
| Problem solved | Binary classification | Multi-class classification |
| Application scenario | Cancer, binary classification that requires a probability | Text classification |
| Parameter | Regularization strength | None |
| Similar algorithms | K-nearest neighbor, decision tree, random forest, neural network | Hidden Markov model |

The difference between the discriminative model and the generative model is that the generative model needs the prior probability, which is estimated from historical data, in order to judge which class a sample belongs to.

ROC curve

In a binary classification model that outputs a continuous score, suppose a threshold such as 0.6 has been chosen: instances scoring above it are classified as positive and those below it as negative. If the threshold is lowered to 0.5, more positive instances are identified, which raises the ratio of identified positives to all positives, that is, the TPR (the proportion of all actually positive samples that are correctly judged positive, TPR = TP / (TP + FN)). At the same time, more negative instances are treated as positive, which raises the FPR (the proportion of all actually negative samples that are wrongly judged positive, FPR = FP / (FP + TN)). To visualize this trade-off the ROC curve is introduced; it can be used to evaluate a classifier.
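The effect of moving the threshold can be sketched with a few made-up scores. The function name `tpr_fpr`, the labels and the scores are assumptions for illustration.

```python
def tpr_fpr(y_true, scores, threshold):
    """Classify score > threshold as positive and return (TPR, FPR)."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s > threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Three actual positives followed by three actual negatives (made-up scores)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.52, 0.4, 0.2]

at_06 = tpr_fpr(y_true, scores, 0.6)  # stricter threshold
at_05 = tpr_fpr(y_true, scores, 0.5)  # looser threshold
```

Lowering the threshold from 0.6 to 0.5 picks up the remaining positive but also lets one negative through, so both rates go up. Plotting (FPR, TPR) for every threshold gives the ROC curve.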

AUC area

When comparing two classifiers, one can be judged better if its ROC curve completely covers the other's. They can also be compared by the area their curves cover.
The AUC is the area under the ROC curve, and its size indicates the quality of the classifier.

  • AUC = 1: a perfect classifier.
  • AUC in [0.85, 0.95]: the effect is very good.
  • AUC = 0.5: the same as random guessing (e.g. tossing a coin); the model has no predictive value.
  • AUC < 0.5: worse than random guessing.
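One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below counts such pairs directly; the function name and the example scores are assumptions for illustration, and the quadratic loop is meant for clarity rather than speed.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])   # every positive above every negative
inverted = auc([1, 1, 0, 0], [0.1, 0.3, 0.8, 0.9])  # every positive below every negative
mixed = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])     # one of four pairs is misordered
```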

AR autoregressive model

An autoregressive model is a linear regression model that describes the random variable at a later time as a linear combination of the random variables at earlier times. The model assumes that the value at the current time point can be predicted from a linear combination of past values of the time series plus white noise. It is a simple extension of the random walk. If a time series can be expressed in the following form, it follows an autoregressive process of order p, written AR(p):

$$x_t=\theta_1x_{t-1}+\theta_2x_{t-2}+\dots+\theta_px_{t-p}+u_t$$

Here $u_t$ is white noise, the random fluctuation of values in the time series; these fluctuations cancel each other out, so their mean is 0. $\theta$ denotes the autoregressive coefficients.


MA (moving average model). The moving average equation is a weighted sum of the white noise terms over a period of time, written MA(q):

$$x_t=u_t+\theta_1u_{t-1}+\theta_2u_{t-2}+\dots+\theta_qu_{t-q}$$

$\theta$ denotes the moving average coefficients, and $u_t$ denotes the white noise at different time points.



ARMA (autoregressive moving average model) is composed of the autoregressive and moving average models, so it is written ARMA(p, q), where p is the autoregressive order and q is the moving average order.

The ARMA model combines the characteristics of the two models: AR captures the relationship between current data and later data, and MA handles random variation, that is, noise.
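How the two parts combine can be sketched with a small generator that builds a series from a given noise sequence. The function name `simulate_arma`, the parameter names `ar_coefs` and `ma_coefs`, the absence of a constant term and the treatment of values before the start of the series as 0 are all assumptions for illustration. Passing the noise in explicitly keeps the example deterministic.

```python
def simulate_arma(ar_coefs, ma_coefs, noise):
    """Build x_t = sum_i ar[i] * x_(t-1-i) + u_t + sum_j ma[j] * u_(t-1-j).

    Values before the start of the series are treated as 0.
    """
    x = []
    for t, u_t in enumerate(noise):
        value = u_t
        for i, a in enumerate(ar_coefs):  # autoregressive part: past values of x
            if t - 1 - i >= 0:
                value += a * x[t - 1 - i]
        for j, m in enumerate(ma_coefs):  # moving average part: past noise terms
            if t - 1 - j >= 0:
                value += m * noise[t - 1 - j]
        x.append(value)
    return x


# A single shock at t = 0 shows how each model propagates noise
impulse = [1.0, 0.0, 0.0, 0.0]
ar1 = simulate_arma([0.5], [], impulse)       # AR(1)
ma1 = simulate_arma([], [0.5], impulse)       # MA(1)
arma11 = simulate_arma([0.5], [0.5], impulse) # ARMA(1, 1)
```

In the AR(1) series the shock decays gradually through the past values, while in the MA(1) series it vanishes after one step; the ARMA(1, 1) series shows both effects.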

Time series analysis

A time series is a sequence of time-varying data arranged in chronological order.
Time series data is everywhere across fields and industries, for example sales, number of customers, traffic, stock prices, oil prices, GDP, temperature…

The characteristics of a random process include its mean, variance, covariance and so on.
If these characteristics change with time, the process is non-stationary; conversely, if they do not change with time, the process is stationary.
[Figure: a non-stationary series (left) and a stationary series (right)]
In the analysis of non-stationary time series, if the cause of non-stationarity is deterministic, the main methods are trend fitting models, seasonal adjustment models, moving averages, exponential smoothing and so on.
If the cause of non-stationarity is random, the main methods are ARIMA (autoregressive integrated moving average) and the autoregressive conditional heteroscedasticity (ARCH) model.


ARIMA (autoregressive integrated moving average model), also called the differenced autoregressive moving average model. ARIMA builds on ARMA to handle non-stationary series. It still requires a series that is stationary, or that becomes stationary after differencing. The previous models can be regarded as special cases of ARIMA. It is written ARIMA(p, d, q), where p is the autoregressive order, q is the moving average order, and d is the number of times the series is differenced to make it stationary, which is what the word "integrated" means here.
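The role of d can be sketched with plain differencing. The function name `difference` and the two trend series are assumptions for illustration.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: each pass replaces x_t with x_t - x_(t-1)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


linear_trend = [2 * t + 1 for t in range(6)]  # 1, 3, 5, 7, 9, 11
quadratic_trend = [t * t for t in range(6)]   # 0, 1, 4, 9, 16, 25

once = difference(linear_trend, d=1)      # one pass removes a linear trend
twice = difference(quadratic_trend, d=2)  # two passes remove a quadratic trend
```

Each differencing pass shortens the series by one value, and the smallest d that leaves a series with no remaining trend is the d used in ARIMA(p, d, q).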