Link to the original text: http://tecdat.cn/?p=9706
Here we relax the linearity assumption of the popular linear methods. Sometimes the linear assumption is simply a poor approximation. There are many ways to address this; some of them, such as regularization, reduce the complexity of the model, but these techniques still rely on a linear model and can only improve it so far. This post focuses on extensions of the linear model:
- *Polynomial regression*: a simple way to provide a nonlinear fit to the data.
- *Step functions*: divide the range of a variable into *K* distinct regions to produce a qualitative variable, which has the effect of fitting a piecewise constant function.
- *Regression splines*: more flexible than polynomials and step functions, and in fact an extension of both.
- *Local regression*: similar to regression splines, but the regions are allowed to overlap, and to do so smoothly.
- *Smoothing splines*: also similar to regression splines, but they minimize a residual sum of squares criterion subject to a smoothness penalty.
- *Generalized additive models (GAMs)*: extend the methods above to handle multiple predictors.
Polynomial regression is the most traditional way to extend the linear model. By adding polynomial terms, it lets us generate nonlinear curves while still using least squares to estimate the coefficients.
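Concretely, a polynomial regression of degree *d* replaces the standard linear model with

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_d x_i^d + \epsilon_i, $$

and the coefficients can still be estimated by ordinary least squares, since the model is linear in the coefficients.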
It is often used in biostatistics and epidemiology.
Regression splines are one of many applications of the *basis function* approach that generalizes polynomial and stepwise regression techniques. In fact, polynomials and step functions are simply special cases of basis functions.
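In the basis function approach we fix a family of functions b_1, …, b_K ahead of time and fit

$$ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i $$

by least squares. For polynomial regression the basis functions are b_j(x) = x^j, and for step functions they are indicators b_j(x) = I(c_j ≤ x < c_{j+1}).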
This is an example of piecewise cubic fitting (top left).
To solve this problem, a better approach is to impose constraints so that the fitted curve must be continuous (and, for a cubic spline, have continuous first and second derivatives as well).
Choose the location and number of knots
One option is to place more knots where we think the change is fastest and fewer knots where it is more stable. But in practice, knots are usually placed in a uniform way.
It should be clear that in this case, there are actually five knots, including boundary knots.
So how many knots should we use? A simple choice is to try many knots and see which produces the best curve. However, a more objective approach is to use cross validation.
Compared with polynomial regression, splines often produce more stable fits, especially near the boundaries of the data.
We have discussed regression splines, which are created by specifying a set of knots, generating a sequence of basis functions, and then estimating the spline coefficients by least squares. Smoothing splines are another way to create splines. Recall that our goal is to find a function that fits the observed data well, that is, that minimizes the RSS. However, if we place no restrictions on the function, we can always drive the RSS to zero by choosing a function that interpolates all of the data exactly.
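A smoothing spline avoids this by minimizing a penalized criterion,

$$ \sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 + \lambda \int g''(t)^2 \, dt, $$

where the tuning parameter λ ≥ 0 controls the trade-off: λ = 0 allows an interpolating function, while λ → ∞ forces g toward a straight line.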
Select the smoothing parameter lambda
Again, we turn to cross-validation. It turns out that LOOCV can be computed very efficiently for smoothing splines, as well as for regression splines and other arbitrary basis functions.
Smoothing splines are generally preferable to regression splines because they usually produce simpler models with comparable fit.
Local regression involves computing the fit at a target point *x*0 using only the nearby training observations.
Local regression can be generalized in many ways. One particularly useful generalization in the multivariate setting is to fit a linear regression model that is global in some variables but local in others, so that some variables are fitted globally while others are fitted locally.
Generalized additive model
The GAM framework provides a general way to extend the linear model by allowing a nonlinear function of each variable while maintaining additivity.
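For a quantitative response, a GAM replaces each linear term β_j x_ij with a smooth nonlinear function f_j:

$$ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i. $$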
Fitting a GAM with smoothing splines is not as simple, because least squares cannot be used. Instead, a method known as *backfitting* is used.
Advantages and disadvantages of GAMs
- GAM allows nonlinear functions to be fitted to each predictor so that we can automatically model the nonlinear relationships that standard linear regression will miss. We don’t have to try many different transformations for each variable.
- The nonlinear fits can potentially yield more accurate predictions of the response _Y_.
- Because the model is additive, we can still examine the effect of each predictor on _Y_ while holding the other variables fixed.
- The main limitation is that the model is restricted to be additive, so important interactions may be missed.
Polynomial regression and step function
We can easily fit a polynomial with the `poly()` function, specifying the variable and the degree of the polynomial. This function returns a matrix of orthogonal polynomials, which means each column is a linear combination of the variables age, age^2, age^3 and age^4. To obtain the raw powers directly, specify `raw=TRUE`; this does not affect the predictions, but it is useful for inspecting the coefficient estimates.
```r
fit = lm(wage ~ poly(age, 4), data = Wage)
kable(coef(summary(fit)))
```
Now we create a vector of `ages` at which we want predictions. Finally, we plot the data and the fitted degree-4 polynomial.
```r
ageLims <- range(age)
age.grid <- seq(from = ageLims[1], to = ageLims[2])
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(pred$fit + 2 * pred$se.fit, pred$fit - 2 * pred$se.fit)
```
```r
plot(age, wage, xlim = ageLims, cex = .5, col = "darkgrey")
lines(age.grid, pred$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, col = "blue", lty = 3)
```
In this simple setting, we can use an ANOVA F-test to compare nested models and choose the polynomial degree.
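The five nested fits compared in the table below are presumably produced along these lines (a sketch, assuming the `Wage` data is loaded):

```r
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
# F-tests comparing each model to the next more complex one
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```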
```
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)    
## 1   2998 5022216                               
## 2   2997 4793430  1    228786 143.59 <2e-16 ***
## 3   2996 4777674  1     15756   9.89 0.0017 ** 
## 4   2995 4771604  1      6070   3.81 0.0510 .  
## 5   2994 4770322  1      1283   0.80 0.3697    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The p-value comparing the linear model _M_1 to the quadratic model _M_2 is essentially zero, which indicates that a linear fit is not sufficient, while the p-value for the degree-5 term is high. We can therefore conclude that a quadratic or cubic model is more suitable for this data, preferring the simpler of the two.
We can also use cross-validation to select the polynomial degree.
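One common approach (a sketch, not necessarily the exact code used here; assumes the `boot` package and the `Wage` data) is k-fold cross-validation over candidate degrees:

```r
library(boot)   # for cv.glm()
set.seed(1)
cv.error <- rep(0, 5)
for (d in 1:5) {
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  # 10-fold cross-validation estimate of test error for a degree-d polynomial
  cv.error[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
which.min(cv.error)
```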
In fact, the minimum cross-validation error here occurs for the quartic polynomial, but choosing the cubic or quadratic model costs very little. Next, we consider predicting whether an individual earns more than $250,000 per year.
However, computing confidence intervals directly on the probability scale is unreasonable, because we end up with some negative probabilities. To generate sensible confidence intervals, it makes more sense to transform the *logit* predictions.
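A sketch of this approach (logistic regression on the polynomial basis, then mapping the logits and their standard-error bands back to probabilities; `age.grid` is the prediction grid created earlier):

```r
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)  # logit scale
pfit <- exp(preds$fit) / (1 + exp(preds$fit))                     # probabilities
# build the band on the logit scale, then transform it
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit,
                        preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))
```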
```r
plot(age, I(wage > 250), xlim = ageLims, type = "n", ylim = c(0, .2))
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
```
Step functions
Here, we need to split the range of the data into bins, which `cut()` does automatically.
```
## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72
```
```r
fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
```
```
##                        Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              94.158      1.476  63.790 0.000e+00
## cut(age, 4)(33.5,49]     24.053      1.829  13.148 1.982e-38
## cut(age, 4)(49,64.5]     23.665      2.068  11.443 1.041e-29
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532 1.256e-01
```
Here, we will use cubic splines.
Because we use the cubic spline of three knots, the generated spline has six basis functions.
```r
dim(bs(age, df = 6))
##  3000 6
attr(bs(age, df = 6), "knots")
##   25%   50%   75% 
## 33.75 42.00 51.00
```

With `df = 6`, the knots are placed automatically at the 25th, 50th and 75th percentiles of age.
Fit the spline curve.
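The fit itself can be sketched as follows (using `bs()` from the `splines` package, with the six degrees of freedom from the text; the plotting details are illustrative):

```r
library(splines)
fit <- lm(wage ~ bs(age, df = 6), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
# approximate 95% pointwise bands
lines(age.grid, pred$fit + 2 * pred$se.fit, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se.fit, lty = "dashed")
```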
We can also fit a smoothing spline. Here, we first fit a spline with 16 degrees of freedom, and then let cross-validation choose the smoothness, which yields 6.8 effective degrees of freedom.
```r
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
fit2$df
##  6.795
lines(fit, col = 'red', lwd = 2)
lines(fit2, col = 'blue', lwd = 1)
legend('topright', legend = c('16 DF', '6.8 DF'),
       col = c('red', 'blue'), lty = 1, lwd = 2, cex = 0.8)
```
Next, we perform local regression.
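A sketch using base R's `loess()` (the span values are assumed for illustration; a smaller span uses fewer neighbors and gives a wigglier fit):

```r
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
plot(age, wage, cex = .5, col = "darkgrey")
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
```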
Now we use a GAM to predict wage using splines of year and age, plus education. Since this is just a big linear regression model with the appropriate basis functions, we can simply fit it with `lm()` using natural splines.
To fit more general models using smoothing splines, we need to use the `gam` package, since least squares no longer applies.
Plotting these two models, the function of `year` looks rather linear. We can create a new model that uses a linear term for `year` and then compare the candidates with an ANOVA test.
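The three candidate models compared in the ANOVA table below can be sketched as (assuming the `gam` and `splines` packages and the `Wage` data):

```r
library(splines)
library(gam)
gam1 <- gam(wage ~ ns(age, 5) + education, data = Wage)              # no year
gam2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)        # linear year
gam3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)  # smooth year
anova(gam1, gam2, gam3, test = "F")
```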
```
## Analysis of Variance Table
## 
## Model 1: wage ~ ns(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Res.Df     RSS Df Sum of Sq    F  Pr(>F)    
## 1   2990 3712881                              
## 2   2989 3693842  1     19040 15.4 8.9e-05 ***
## 3   2986 3689770  3      4071  1.1    0.35    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The results suggest that a GAM with a linear `year` term is significantly better than a GAM without `year`, but there is no evidence that a nonlinear function of `year` is needed.
```
## 
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -19.70   -3.33   14.17  213.48 
## 
## (Dispersion Parameter for gaussian family taken to be 1236)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29888 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## s(year, 4)     1   27162   27162      22 2.9e-06 ***
## s(age, 5)      1  195338  195338     158 < 2e-16 ***
## education      4 1069726  267432     216 < 2e-16 ***
## Residuals   2986 3689770    1236                    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)    
## (Intercept)                          
## s(year, 4)        3    1.1   0.35    
## s(age, 5)         4   32.4 <2e-16 ***
## education                            
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
In the summary of the model with the nonlinear term, the large p-value for `s(year, 4)` in the nonparametric ANOVA confirms again that a nonlinear function of `year` contributes nothing to the model.
Next, we use local regression terms as building blocks in a GAM.
We can also use local regression to create interaction terms before calling `gam()`.
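A sketch using the `lo()` function from the `gam` package (the span values are assumed for illustration):

```r
library(gam)
# local regression term for age within an otherwise additive GAM
gam.lo <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
              data = Wage)
# an interaction between year and age, fit jointly by local regression
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
```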
We can plot the resulting surface.