Lectures on Statistical Science


01. Preface

In a previous article we discussed multiple linear regression. In this article, let's talk about stepwise regression. What is stepwise regression? As the name suggests, it is regression performed step by step.

Recall that in multiple regression, "multiple" refers to the independent variables: there are several of them, i.e. several X's. One question we need to consider is whether these X's actually have an effect on y. The answer is that sometimes all of them do, and sometimes only some of them do. The useless ones are better left out of the regression model. The process of keeping the variables that matter, or eliminating the ones that don't, is called variable selection.

We just mentioned that an independent variable may or may not be useful. How do we judge which it is? The basis for the judgment is a significance test on the variable. Concretely, when an independent variable is added to the model, we check whether the residual sum of squares decreases significantly. If it does, the variable is useful and can be added to the model; otherwise it is useless and can be deleted from the model. Whether the reduction is significant is judged with an F statistic.
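As a concrete sketch of this test (the helper names are my own, and the OLS fits use numpy's least-squares routine): the partial F statistic compares the residual sum of squares of the model before and after adding variables.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """F statistic for the extra columns of X_full over X_reduced:
    F = ((RSS_reduced - RSS_full) / q) / (RSS_full / (n - p - 1)),
    where q is the number of added variables and p the full model size."""
    n = len(y)
    q = X_full.shape[1] - X_reduced.shape[1]   # number of added variables
    rss_r, rss_f = rss(X_reduced, y), rss(X_full, y)
    return ((rss_r - rss_f) / q) / (rss_f / (n - X_full.shape[1] - 1))

# A variable that truly drives y produces a very large F statistic:
x = np.arange(10.0)
y = 1.0 + 2.0 * x + np.tile([0.1, -0.1], 5)   # strong linear signal, tiny noise
F = partial_f(np.empty((10, 0)), x.reshape(-1, 1), y)
```

Comparing F against the appropriate F distribution quantile (or its p-value) then decides whether the reduction in the residual sum of squares is significant.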

We covered the F statistic and significance testing in the article on analysis of variance; you can refer back to it.

The main variable-selection methods are: forward selection, backward elimination, stepwise regression, and best-subset selection. In this article we mainly discuss the first three.

02. Forward Selection

Forward selection can be understood as selecting from scratch, because the model starts with no independent variables at all. The specific steps are as follows:

Step 1: Fit a separate regression of y on each of the k available variables, giving k one-variable models, each with an F statistic and p-value for its variable. Among the models that are significant, pick the one with the largest F statistic and add its variable to the model. If none of the k models is significant, selection ends.

Step 2: Step 1 gave us one significant variable in the model. Next, add each of the remaining k−1 variables in turn to this one-variable model, producing k−1 candidate models. From these, pick the variable with the largest significant F value and add it to the model. If no variable is significant, selection ends.

Repeat the above steps until no significant variable can be added to the model. This is forward selection.
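The forward-selection loop above can be sketched in Python (a minimal illustration, not production code; the function names and the 0.05 threshold are my own choices, and scipy is assumed for the F distribution's p-values):

```python
import numpy as np
from scipy.stats import f as f_dist

def fit_rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return float(r @ r)

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: each round, add the candidate variable
    with the largest partial F statistic, as long as it is significant."""
    n, k = X.shape
    selected = []
    while len(selected) < k:
        rss_cur = fit_rss(X[:, selected], y)
        best_j, best_F = None, 0.0
        for j in range(k):
            if j in selected:
                continue
            rss_new = fit_rss(X[:, selected + [j]], y)
            df = n - len(selected) - 2        # residual df of the larger model
            F = (rss_cur - rss_new) / (rss_new / df)
            if F > best_F:
                best_j, best_F = j, F
        df = n - len(selected) - 2
        if best_j is None or f_dist.sf(best_F, 1, df) >= alpha:
            break                             # no significant variable left
        selected.append(best_j)
    return selected

# Example: only the first two columns actually drive y.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(120)
sel = forward_select(X, y)
```

On data like this, where only some of the X's drive y, the loop stops as soon as no remaining candidate passes the F test.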

03. Backward Elimination

Backward elimination is the counterpart of forward selection; it runs the same idea in reverse.

Step 1: Start with all k independent variables in the model. Then remove each variable in turn, giving k candidate models with k−1 variables each. Compare these k models to see which variable's removal increases the residual sum of squares the least, i.e. the variable with the smallest impact, and delete that variable from the model.

Step 2: Step 1 deleted one useless variable. Continue on the model with the remaining variables: again remove each variable in turn, and delete from the model the variable whose removal increases the residual sum of squares the least.

Repeat the above steps until removing any remaining variable would significantly increase the residual sum of squares. At that point, all remaining variables are significant.
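Backward elimination can be sketched along the same lines (again a hedged illustration with my own function names; scipy supplies the F distribution, and the OLS fits use numpy's least-squares routine):

```python
import numpy as np
from scipy.stats import f as f_dist

def fit_rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return float(r @ r)

def backward_eliminate(X, y, alpha=0.05):
    """Start from the full model; repeatedly drop the variable whose
    removal increases the RSS least, while that increase is insignificant."""
    n, k = X.shape
    selected = list(range(k))
    while selected:
        rss_full = fit_rss(X[:, selected], y)
        worst_j, min_rss = None, np.inf
        for j in selected:
            rest = [i for i in selected if i != j]
            r = fit_rss(X[:, rest], y)
            if r < min_rss:
                worst_j, min_rss = j, r
        df = n - len(selected) - 1            # residual df of the full model
        F = (min_rss - rss_full) / (rss_full / df)
        if f_dist.sf(F, 1, df) < alpha:
            break                             # removal would hurt significantly
        selected.remove(worst_j)
    return selected

# Example: the last two columns are pure noise and should tend to drop out.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(120)
kept = backward_eliminate(X, y)
```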

04. Stepwise Regression

Stepwise regression combines forward selection and backward elimination. It interleaves the two methods: each time a variable is selected, elimination is also considered.

Stepwise regression uses forward selection when adding a variable to the model: the variable with the largest significant F statistic goes in. After each addition, backward elimination is applied to all variables currently in the model, removing any that are no longer significant. This select-and-eliminate process repeats until adding a variable no longer leads to a significant reduction in the residual sum of squares and no variable in the model can be removed.

As for the Python implementation of stepwise regression, there is plenty of ready-made code online. Once the principle is clear, the code is easy to follow.
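For reference, here is one minimal sketch of the combined procedure (my own assumptions: an entry threshold of 0.05 and a looser exit threshold of 0.10, a common convention that prevents a just-added variable from being removed immediately; function names are hypothetical):

```python
import numpy as np
from scipy.stats import f as f_dist

def fit_rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return float(r @ r)

def stepwise(X, y, alpha_in=0.05, alpha_out=0.10):
    """Stepwise regression: a forward step, then backward checks on the
    current model, repeated until nothing can be added or removed."""
    n, k = X.shape
    selected = []
    changed = True
    while changed:
        changed = False
        # --- forward step: add the most significant candidate, if any ---
        rss_cur = fit_rss(X[:, selected], y)
        best_j, best_F = None, 0.0
        for j in range(k):
            if j in selected:
                continue
            rss_new = fit_rss(X[:, selected + [j]], y)
            df = n - len(selected) - 2
            F = (rss_cur - rss_new) / (rss_new / df)
            if F > best_F:
                best_j, best_F = j, F
        if best_j is not None:
            df = n - len(selected) - 2
            if f_dist.sf(best_F, 1, df) < alpha_in:
                selected.append(best_j)
                changed = True
        # --- backward step: drop variables that are no longer significant ---
        while len(selected) > 1:
            rss_full = fit_rss(X[:, selected], y)
            worst_j, min_rss = None, np.inf
            for j in selected:
                rest = [i for i in selected if i != j]
                r = fit_rss(X[:, rest], y)
                if r < min_rss:
                    worst_j, min_rss = j, r
            df = n - len(selected) - 1
            F = (min_rss - rss_full) / (rss_full / df)
            if f_dist.sf(F, 1, df) < alpha_out:
                break                 # removing anything would hurt significantly
            selected.remove(worst_j)
            changed = True
    return selected

# Example: only the first two of four columns actually drive y.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(120)
model_vars = stepwise(X, y)
```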