Statistics: Multiple Regression Analysis


01. Preface

Previously, we talked about univariate linear regression. If you haven’t read it yet, you can take a look at it first: [univariate linear regression analysis]. In this article, let’s talk about multiple linear regression. Univariate linear regression has only one independent variable x, while multiple linear regression has several independent variables x.

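The general form of the multiple regression model is as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

Here y is the dependent variable, x_1, ..., x_k are the k independent variables, \beta_0 is the intercept, \beta_1, ..., \beta_k are the regression coefficients, and \varepsilon is the random error term.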

02. Parameter estimation

The parameters in the multiple regression equation also need to be estimated; why they must be estimated was explained in the article on univariate linear regression. The difference is that univariate linear regression fits a line, while multiple regression fits a plane (with two independent variables) or, more generally, a hyperplane. The method used is again least squares: choose the coefficients that minimize the sum of squared residuals.
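
As a rough illustration of the least squares step, here is a minimal sketch in Python with NumPy. The function name `fit_ols` and the data are made up for this example; they are not from the original article.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate [b0, b1, ..., bk] by least squares (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(y)), X])
    # lstsq minimizes the sum of squared residuals ||A @ coef - y||^2.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

coef = fit_ols(X, y)
print(coef)  # intercept first, then one coefficient per column of X
```

Because the data contain no error term, the fit should recover the intercept and the two coefficients used to generate y, up to floating-point rounding.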

03. Goodness of fit

In multiple regression, goodness of fit is judged much as in univariate regression, using the total sum of squares (SST), the regression sum of squares (SSR) and the residual sum of squares (SSE), where SST = SSR + SSE.
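
The three sums of squares can be computed directly from a fitted model. The sketch below assumes a least squares fit with an intercept; the helper name `sums_of_squares` and the small data set are invented for illustration.

```python
import numpy as np

def sums_of_squares(X, y):
    """Return (SST, SSR, SSE) for a least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    sst = float(np.sum((y - y.mean()) ** 2))      # total variation in y
    ssr = float(np.sum((y_hat - y.mean()) ** 2))  # variation explained by the fit
    sse = float(np.sum((y - y_hat) ** 2))         # variation left in the residuals
    return sst, ssr, sse

# Hypothetical data: y = 1 + 2*x1 + 3*x2 plus a small error term.
x1 = np.array([0, 1, 2, 3, 4, 5])
x2 = np.array([1, 0, 1, 0, 2, 5])
err = np.array([1, -1, -1, 1, 0, 0])
y = 1 + 2 * x1 + 3 * x2 + err

sst, ssr, sse = sums_of_squares(np.column_stack([x1, x2]), y)
print(sst, ssr, sse)
print(ssr / sst)  # proportion of the total variation explained by the fit
```

The decomposition SST = SSR + SSE holds here because the model includes an intercept and is fitted by least squares.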

Multiple regression also has R^2: R^2 = SSR / SST = 1 - SSE / SST. However, adding independent variables reduces the residual sum of squares SSE (or at worst leaves it unchanged), so R^2 goes up as variables are added, whether or not they are useful.

Why does adding a new variable reduce SSE? Because each new variable accounts for part of the sum of squares, and that part is taken out of the residual. In the worst case the new variable gets a coefficient of 0 and SSE stays the same, so SSE can never increase.
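
This effect is easy to check numerically. The sketch below is an illustration on simulated data, not part of the original article: it fits one model with a single useful variable and a second model that also includes a variable unrelated to y. The names `sse` and `noise_var` are made up.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)
noise_var = rng.normal(size=n)  # generated independently of y

sst = float(np.sum((y - y.mean()) ** 2))
sse_one = sse(x1.reshape(-1, 1), y)
sse_two = sse(np.column_stack([x1, noise_var]), y)

r2_one = 1 - sse_one / sst
r2_two = 1 - sse_two / sst
print(sse_one, sse_two)  # the second value is never larger than the first
print(r2_one, r2_two)    # so the second R^2 is never smaller than the first
```

Even though `noise_var` has nothing to do with y, the larger model fits the sample at least as well, which is why R^2 alone rewards adding variables.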

To keep blindly added independent variables from producing an artificially high R^2, statisticians introduced a corrected index, the adjusted R^2. The formula is as follows:

$$R_{adj}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

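In the formula, n is the sample size and k is the number of independent variables. The adjustment penalizes each additional variable through n and k, so the adjusted R^2 no longer rises automatically when a variable is added: it increases only if the new variable reduces SSE enough to offset the lost degree of freedom.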
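
A small worked example may help. The sketch below assumes the standard definition of adjusted R^2, 1 - (1 - R^2)(n - 1) / (n - k - 1); the function name `adjusted_r2` and the numbers are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n samples and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more samples than parameters (n > k + 1)")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical case: a fourth variable lifts R^2 only from 0.800 to 0.801.
before = adjusted_r2(0.800, n=30, k=3)
after = adjusted_r2(0.801, n=30, k=4)
print(round(before, 4), round(after, 4))
```

In this made-up case R^2 rises slightly, but the adjusted R^2 falls, which suggests the extra variable is not worth keeping.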

We usually use the adjusted R^2 to judge the goodness of fit of a multiple regression.

In addition to R^2, we can also use the standard error of the estimate to measure the quality of the regression model. It is the square root of the mean squared error, MSE = SSE / (n - k - 1), and represents the typical size of the error made when predicting the dependent variable y from the independent variables x.
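
The calculation can be sketched as follows, assuming a least squares fit with an intercept so that the residual degrees of freedom are n - k - 1. The helper name `standard_error` and the data are hypothetical.

```python
import numpy as np

def standard_error(X, y):
    """Square root of MSE = SSE / (n - k - 1) for a fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    mse = float(resid @ resid) / (n - k - 1)
    return mse ** 0.5

# Hypothetical data: y = 1 + 2*x1 + 3*x2 plus a small error term.
x1 = np.array([0, 1, 2, 3, 4, 5])
x2 = np.array([1, 0, 1, 0, 2, 5])
err = np.array([1, -1, -1, 1, 0, 0])
y = 1 + 2 * x1 + 3 * x2 + err

se = standard_error(np.column_stack([x1, x2]), y)
print(se)  # same units as y; smaller means predictions sit closer to the data
```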

04. Significance test

We did a significance test in univariate linear regression, and we also need one in multiple regression.

4.1 Linear relationship test

The linear relationship test checks whether the relationship between y and the x variables taken together is significant. It is a test of the overall significance of the regression.

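The method is the same as in univariate linear regression: we assume that there is no linear relationship, that is, that all the regression coefficients of the x variables are 0, and then carry out an F-test. The test statistic is F = (SSR / k) / (SSE / (n - k - 1)), which follows an F distribution with k and n - k - 1 degrees of freedom if the assumption holds. For a detailed introduction, refer to the explanation in univariate linear regression.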
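
As an illustration, the F statistic can be computed from the sums of squares as F = (SSR / k) / (SSE / (n - k - 1)). The sketch below only computes the statistic; comparing it with the F distribution (from a table or a statistics library) is left out. The function name `f_statistic` and the data are made up.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F statistic for a least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    ssr = float(np.sum((y_hat - y.mean()) ** 2))  # explained by the regression
    sse = float(np.sum((y - y_hat) ** 2))         # left in the residuals
    return (ssr / k) / (sse / (n - k - 1))

# Hypothetical data: y = 1 + 2*x1 + 3*x2 plus a small error term.
x1 = np.array([0, 1, 2, 3, 4, 5])
x2 = np.array([1, 0, 1, 0, 2, 5])
err = np.array([1, -1, -1, 1, 0, 0])
y = 1 + 2 * x1 + 3 * x2 + err

f_value = f_statistic(np.column_stack([x1, x2]), y)
print(f_value)  # compare with the critical value of F(k, n - k - 1)
```

A large F value relative to the critical value leads us to reject the assumption that all coefficients are 0.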

4.2 Regression coefficient test

The linear relationship test judges all the variables jointly: as long as at least one x has a significant effect on y, the linear relationship is significant. It does not tell us which x matters. The regression coefficient test examines whether the coefficient of each individual x is significant. To test the coefficient of one variable, assume that it equals 0 and then carry out a t-test.

For the details of the t-test, see the article on hypothesis testing: [statistical hypothesis test].
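
To make the coefficient test concrete, the sketch below computes a t statistic for each coefficient as the estimate divided by its standard error, using the usual least squares formula MSE * (A^T A)^(-1) for the coefficient variances. This is an illustration on simulated data; the function name `coef_t_stats` is made up, and the p-value lookup in the t distribution with n - k - 1 degrees of freedom is omitted.

```python
import numpy as np

def coef_t_stats(X, y):
    """Return (coef, std_err, t) for a least squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    mse = float(resid @ resid) / (n - k - 1)
    # Estimated covariance matrix of the coefficients.
    cov = mse * np.linalg.inv(A.T @ A)
    std_err = np.sqrt(np.diag(cov))
    return coef, std_err, coef / std_err

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated data: y depends on x1 but not on x2.
y = 1.0 + 3.0 * x1 + rng.normal(size=n)

coef, std_err, t_stats = coef_t_stats(np.column_stack([x1, x2]), y)
print(t_stats)  # order: intercept, x1, x2
```

In a run like this, the t statistic for x1 should be far from 0, while the one for x2 would typically be small, since x2 plays no role in generating y.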

05. Multicollinearity

There is another difference between multiple regression and univariate regression: multicollinearity may exist in multiple regression.

What is multicollinearity? In multiple regression, we hope that each x acts on y separately, that is, each x is related to y on its own. In practice, however, x1 and x2 may be related to each other. When the x variables are correlated with one another, we call it multicollinearity. Multicollinearity can make the regression results unreliable: the coefficient estimates become unstable, and their signs or significance may be misleading.

Since multicollinearity is a serious problem, how can we detect it? The simplest way is to compute the correlation coefficients between the independent variables. If two variables are highly correlated, multicollinearity can be considered present.
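
The correlation check could look roughly like this. The cutoff of 0.8 is an assumption chosen for the example, not a rule from the original article, and the function name `highly_correlated_pairs` and the data are made up.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.8):
    """Return index pairs (i, j) of columns whose |correlation| exceeds threshold."""
    X = np.asarray(X, dtype=float)
    # rowvar=False treats each column of X as one variable.
    corr = np.corrcoef(X, rowvar=False)
    k = corr.shape[0]
    return [(i, j)
            for i in range(k)
            for j in range(i + 1, k)
            if abs(corr[i, j]) > threshold]

# Hypothetical data: x2 is roughly 2 * x1, while x3 is unrelated to both.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5, 1, 4, 2, 6, 3]

pairs = highly_correlated_pairs(np.column_stack([x1, x2, x3]))
print(pairs)  # column indices of the variable pairs flagged as collinear
```

Pairwise correlation only catches relationships between two variables at a time, so it is a first check rather than a complete diagnosis.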

For variables that show multicollinearity, we usually drop one of them.

The above is a brief introduction to multiple regression. Many topics were not discussed in detail here, mainly because they were covered in previous articles. If you have not read them yet, you can go back to the corresponding articles.