1. Introduction to the regression model
Let’s take a look at the regression model. The following explanation comes from Baidu Encyclopedia:
A regression model is a predictive modeling technique that studies the relationship between a dependent variable (the target) and one or more independent variables (the predictors). It is commonly used for forecasting, time-series modeling, and analyzing causal relationships between variables.
The two most important applications of regression models are forecasting and causal analysis. Take the linear equation in one variable that we learned in school, y = kx + b: this is the simplest regression model. Given a value of x, say the month, we can compute the corresponding y, say the sales volume, from the equation. Finding y from x in this way is a forecasting process.
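As a minimal sketch of this forecasting idea, the snippet below evaluates y = kx + b for a given month. The values of k and b here are made-up placeholders, not estimates from any real data:

```python
def predict(x, k=2.0, b=10.0):
    """Forecast y from x with the simple linear model y = k*x + b.

    k and b are illustrative defaults; in practice they come from
    parameter estimation (see the next section).
    """
    return k * x + b

# Forecast sales volume for month 12 under the assumed k and b
december_sales = predict(12)
```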
Regression models are mainly divided into univariate (simple) linear regression and multiple linear regression. In this section, we will cover univariate linear regression.
2. Parameter estimation
What does parameter estimation do, and which parameters are estimated? It estimates k and b in the equation y = kx + b. Some readers may wonder why we need to estimate them rather than compute them directly. In school we could compute them directly because we had exactly two points: the straight line through two points is uniquely determined, so the parameters are fixed.
In practical applications, we usually have many data points, and they do not lie on a single straight line. However, we want to find a line that comes as close as possible to all of them, so that the line approximates the overall trend of the points. The k and b of this line are the parameters we estimate.
The principle for finding this line is that the distance from each point to the line should be as small as possible; more precisely, we minimize the total squared distance from all the points to the line. This method is called the least squares method, and it is one method of parameter estimation.
You can learn more about the least square method by yourself.
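The least squares estimates for a simple linear regression have a well-known closed form: k = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − k·x̄. Below is a minimal sketch of that computation; the month/sales data are made up for illustration:

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates of k and b for y = k*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: k = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    k = sxy / sxx
    # Intercept: the least-squares line always passes through (x_mean, y_mean)
    b = y_mean - k * x_mean
    return k, b

# Illustrative data: month vs. sales volume (made up for this sketch)
months = [1, 2, 3, 4, 5, 6]
sales = [12, 15, 17, 20, 24, 25]
k, b = fit_line(months, sales)
```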
3. Judging the goodness of fit
Through the parameter estimation above, we have obtained a line that reflects the trend of the data points. But we still need to judge how accurate this line is, that is, how well it fits the trend of the actual data points.
Here are some concepts used to judge the degree of fit.
Total sum of squares (SST): the sum of squared distances between each actual value and the mean of the actual values. It can be understood as a variance (not the actual variance, since it is not divided by n) and reflects the total fluctuation of the actual values y.
Regression sum of squares (SSR): the sum of squared distances between each regression value (i.e., the predicted y) and the mean of the actual values. This part of the variation is caused by changes in the independent variable and can be explained by the regression line.
Residual sum of squares (SSE): the sum of squared distances between each regression value and the corresponding actual value. This part is caused by factors other than the independent variable and is the unexplainable part.
SST = SSR + SSE
The formula above says that the fluctuation of the actual values y is determined by two factors: the change in y caused by changes in the independent variable x (the regression sum of squares), and the part determined by factors other than the independent variable (the residual sum of squares).
Ideally, as much as possible of the fluctuation of the actual values y should be caused by changes in the independent variable x: the higher this proportion, the better the regression line fits. This indicator is R^2 = SSR / SST.
R^2 lies in [0, 1], and the larger it is, the better the degree of fit.
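The decomposition above can be sketched in code: fit the line, then compute SST, SSR, SSE, and R² from their definitions. The data are the same illustrative month/sales numbers as before:

```python
def r_squared(xs, ys):
    """Fit y = k*x + b by least squares, then decompose the variation."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - k * x_mean
    y_hat = [k * x + b for x in xs]
    sst = sum((y - y_mean) ** 2 for y in ys)               # total fluctuation
    ssr = sum((yh - y_mean) ** 2 for yh in y_hat)          # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained residual
    return sst, ssr, sse, ssr / sst

sst, ssr, sse, r2 = r_squared([1, 2, 3, 4, 5, 6], [12, 15, 17, 20, 24, 25])
```

For a least-squares fit with an intercept, SST = SSR + SSE holds exactly, which the decomposition makes easy to verify numerically.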
4. Significance test
Through the previous steps we have obtained the parameters, that is, k and b in y = kx + b. Can we use them directly? Obviously not. Why? Because the parameter estimates are based on the sample data at hand, so the line only reflects the trend of that sample. Whether this trend also represents the trend of the population needs to be tested, and this is what the significance test does.
The regression line is meant to reflect a linear relationship between x and y, so the first thing to test is whether that linear relationship is significant. How? With the hypothesis testing method we discussed before.
Assume, as the null hypothesis, that there is no linear relationship between x and y. If there is no linear relationship, then k = 0, and the fluctuation in the total sum of squares is entirely determined by the residual sum of squares, so SSR / SSE should be close to 0. This is the conclusion that follows from the null hypothesis of no linear relationship.
As mentioned in the earlier discussion of analysis of variance, a sum of squares grows with the amount of sample data, so we convert each sum of squares into a mean square, i.e., sum of squares / degrees of freedom.
In univariate linear regression, the regression sum of squares has 1 degree of freedom (the number of independent variables), and the residual sum of squares has n − 2 degrees of freedom. The test statistic is F = (SSR / 1) / (SSE / (n − 2)).
We compute the F value from the sample data, choose a significance level, and look up the F critical value for that significance level. If F > the critical value, we reject the null hypothesis; otherwise we do not reject it.
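The F-test procedure can be sketched as follows. The data are illustrative (n = 10, so the degrees of freedom are 1 and 8), and the critical value 5.32 is F(0.05; 1, 8) taken from a standard F table:

```python
def f_statistic(xs, ys):
    """F = (SSR / 1) / (SSE / (n - 2)) for a simple linear regression."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - k * x_mean
    y_hat = [k * x + b for x in xs]
    ssr = sum((yh - y_mean) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return (ssr / 1) / (sse / (n - 2))

# Illustrative, strongly linear data with a little noise
xs = list(range(1, 11))                      # n = 10 -> df = (1, 8)
ys = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9, 18.1, 20.0]

f_value = f_statistic(xs, ys)
F_CRIT = 5.32                                # F(0.05; 1, 8) from an F table
reject_null = f_value > F_CRIT               # True -> the relationship is significant
```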
5. Applications of the regression equation
The univariate regression equation is mainly used for forecasting, which comes in two forms: point forecasting and interval forecasting. A point forecast uses the regression equation to predict a specific value, e.g., the exact sales volume in December this year; an interval forecast uses the regression equation to obtain a range for the sales volume in December this year.
Point prediction is relatively simple: substitute x into the equation to get the result. Interval prediction is a little more involved, but in essence it is still the calculation of a confidence interval (see the earlier discussion of confidence and confidence intervals). There are two key ingredients: the sample mean and the standard deviation. The sample mean is straightforward, and the formula for the standard deviation is as follows:
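As a sketch of both forecast types, the code below computes a point forecast and the standard textbook prediction interval for an individual y at x0: ŷ0 ± t · s · sqrt(1 + 1/n + (x0 − x̄)²/Sxx), where s = sqrt(SSE / (n − 2)) is the standard error of the estimate. The data are illustrative, and the t value (t(0.975, 8) ≈ 2.306 for a 95% interval with n = 10) comes from a standard t table:

```python
import math

def prediction_interval(xs, ys, x0, t_crit):
    """Point forecast at x0 plus a prediction interval for an individual y."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    k = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    b = y_mean - k * x_mean
    sse = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))             # standard error of the estimate
    y0 = k * x0 + b                          # point forecast
    half = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_mean) ** 2 / sxx)
    return y0 - half, y0, y0 + half

# Illustrative data; x0 = 12 plays the role of "December"
xs = list(range(1, 11))
ys = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9, 18.1, 20.0]
low, point, high = prediction_interval(xs, ys, 12, t_crit=2.306)
```

Note that the interval widens as x0 moves away from the sample mean x̄, which is why forecasts far outside the observed data range are less reliable.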