Nonlinear models in R language: gam analysis of polynomial regression, local spline, smooth spline and generalized additive model

Time:2021-4-19

Link to the original text:http://tecdat.cn/?p=9706

Overview

Here we relax the linearity assumption of the popular linear methods. Sometimes the linear assumption is just a poor approximation. There are many ways to address this; some, such as regularization, improve on least squares by reducing the complexity of the model. However, these techniques still use a linear model, which can only be improved so far. This article focuses on extensions of the linear model:

  • Polynomial regression: a simple way to provide a nonlinear fit to the data.
  • Step functions: cut the range of a variable into K distinct regions to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
  • Regression splines: more flexible than polynomials and step functions, and in fact an extension of both.
  • Local regression: similar to regression splines, but the regions are allowed to overlap, and they do so in a smooth way.
  • Smoothing splines: also similar to regression splines, but they result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
  • Generalized additive models: allow the methods above to be extended to handle multiple predictors.

Polynomial regression

This is the most traditional way to extend the linear model. As we add higher-order polynomial terms, polynomial regression lets us produce nonlinear curves while still estimating the coefficients by least squares.
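Written out, a degree-d polynomial regression replaces the straight-line model with the expression below. The symbols d, beta_j and epsilon_i are standard notation introduced here for illustration, not taken from the original post.

```latex
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \varepsilon_i
```

The model is nonlinear in x but still linear in the coefficients, which is why ordinary least squares applies.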


Step functions

Step functions are often used in biostatistics and epidemiology.
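In formulas, a step-function fit cuts the range of X at points c_1 < c_2 < ... < c_K and regresses on indicator variables for the resulting bins. The notation below (cut points c_k, indicators C_k) is a standard textbook convention assumed here, not something stated in the original post.

```latex
C_k(x) = I(c_k \le x < c_{k+1}) \quad (k = 1, \dots, K-1), \qquad C_K(x) = I(c_K \le x)

y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \varepsilon_i
```

The bin x < c_1 is the baseline, so beta_0 is the mean response there and each beta_k is the average shift in bin k relative to it.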


Regression spline

Regression splines are one of many basis-function approaches that extend polynomial regression and step functions. In fact, polynomials and step functions are just special cases of the basis-function approach.

This is an example of piecewise cubic fitting (top left).


A piecewise polynomial fitted without constraints can jump at the knot. To solve this problem, a better approach is to impose constraints so that the fitted curve must be continuous.
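One way to see how the continuity constraints work is the truncated power basis. The sketch below uses simulated data and an arbitrarily chosen knot (both assumptions, not from the original post) and checks that it gives the same fitted curve as the bs() basis from the splines package.

```r
library(splines)  # ships with the standard R distribution

set.seed(1)
x <- sort(runif(200, 0, 10))          # simulated predictor (assumption)
y <- sin(x) + rnorm(200, sd = 0.3)    # simulated response (assumption)
knot <- 5                             # one interior knot, chosen arbitrarily

# Truncated power basis: a global cubic plus one term that switches on past the knot.
# (x - knot)^3 and its first two derivatives are zero at the knot, so the fitted
# curve stays continuous there, along with its slope and curvature.
trunc3 <- pmax(x - knot, 0)^3
fit_tp <- lm(y ~ x + I(x^2) + I(x^3) + trunc3)

# The B-spline basis from bs() spans the same space of cubic splines.
fit_bs <- lm(y ~ bs(x, knots = knot))

# Largest difference between the two sets of fitted values
max_diff <- max(abs(fitted(fit_tp) - fitted(fit_bs)))
```

Both fits use five coefficients (intercept plus four basis functions); bs() is usually preferred in practice because its basis is numerically better conditioned.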

Choose the location and number of knots

One option is to place more knots where we think the change is fastest and fewer knots where it is more stable. But in practice, knots are usually placed in a uniform way.

It should be clear that in this case, there are actually five knots, including boundary knots.


So how many knots should we use? A simple choice is to try many knots and see which produces the best curve. However, a more objective approach is to use cross validation.
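A minimal sketch of choosing the spline flexibility by 10-fold cross-validation is shown below. The data are simulated, and the candidate range of degrees of freedom and the fold count are illustrative assumptions.

```r
library(splines)

set.seed(2)
n <- 300
dat <- data.frame(x = runif(n, 0, 10))       # simulated data (assumption)
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

rng   <- range(dat$x)                        # shared boundary knots for every fold
folds <- sample(rep(1:10, length.out = n))   # random 10-fold assignment
dfs   <- 3:10                                # candidate degrees of freedom

cv_mse <- sapply(dfs, function(k) {
  fold_err <- sapply(1:10, function(f) {
    train <- dat[folds != f, ]
    test  <- dat[folds == f, ]
    # bs() with df = k places k - 3 interior knots at quantiles of the training x
    fit <- lm(y ~ bs(x, df = k, Boundary.knots = rng), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_err)                             # average held-out error for this df
})

best_df <- dfs[which.min(cv_mse)]
```

Fixing Boundary.knots to the full data range avoids warnings when a held-out point falls just outside the range of its training fold.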

Compared with polynomial regression, splines tend to produce more stable estimates.

Smoothing splines

We have discussed regression splines, which are created by specifying a set of knots, generating a sequence of basis functions, and then estimating the spline coefficients by least squares. Smoothing splines are another way to create a spline. Recall that our goal is to find a function that fits the observed data well, that is, one that makes RSS small. However, if we put no restrictions on the function, we can always drive RSS to zero by choosing a function that interpolates all of the data exactly. Such a function would badly overfit, so what we really want is a function that makes RSS small but is also smooth.
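The usual way to formalize this trade-off is a penalized criterion. The notation below (function g, tuning parameter lambda) is the standard textbook form and is assumed here rather than quoted from the original post.

```latex
\hat{g} = \arg\min_{g} \; \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 + \lambda \int g''(t)^2 \, dt
```

The first term is the RSS and the second penalizes curvature. With lambda = 0 the minimizer interpolates the data, and as lambda grows the solution approaches the least squares straight line.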

Select the smoothing parameter lambda

Again, we turn to cross-validation. It turns out that the leave-one-out cross-validation (LOOCV) error can be computed very efficiently for smoothing splines, regression splines and other fits based on arbitrary basis functions.
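The efficiency comes from a leverage identity: for a least squares fit on a fixed basis, the LOOCV error can be obtained from a single fit. The sketch below checks this on simulated data (an assumption) for a regression spline; for smoothing splines the analogous formula uses the diagonal of the smoother matrix in place of the hat values.

```r
library(splines)

set.seed(3)
n <- 100
x <- runif(n, 0, 10)                  # simulated data (assumption)
y <- sin(x) + rnorm(n, sd = 0.3)

B   <- bs(x, df = 6)                  # fixed cubic-spline basis, knots held constant
fit <- lm(y ~ B)

# Shortcut: one fit, residuals rescaled by the leverages h_ii
h <- hatvalues(fit)
loocv_fast <- mean((residuals(fit) / (1 - h))^2)

# Brute force: refit n times, leaving one observation out each time
X <- cbind(1, B)
loocv_slow <- mean(sapply(seq_len(n), function(i) {
  beta <- lm.fit(X[-i, , drop = FALSE], y[-i])$coefficients
  (y[i] - sum(X[i, ] * beta))^2
}))

same <- isTRUE(all.equal(loocv_fast, loocv_slow))
```

The two numbers agree up to floating-point error, but the shortcut needs one fit instead of n.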

Smoothing splines are generally preferable to regression splines because they usually produce simpler models with comparable fit.

Local regression

Local regression computes the fit at a target point x_0 using only the nearby training observations.

Local regression can be generalized in many ways. In a setting with multiple predictors, a particularly useful generalization is a multiple linear regression model that is global in some variables and local in others.
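To make the idea concrete, here is a minimal sketch of a local linear fit at a single target point. The helper local_linear() is hypothetical, and the tricube weights, the span of 0.3 and the simulated data are illustrative assumptions; in practice one would normally call loess().

```r
# Hypothetical helper: local linear regression evaluated at one target point x0
local_linear <- function(x, y, x0, span = 0.3) {
  k <- ceiling(span * length(x))              # number of neighbours to use
  d <- abs(x - x0)
  h <- sort(d)[k]                             # distance to the k-th nearest neighbour
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)    # tricube weights, zero outside the window
  fit <- lm(y ~ x, weights = w)               # weighted least squares line
  unname(predict(fit, newdata = data.frame(x = x0)))
}

set.seed(4)
x <- runif(200, 0, 10)                        # simulated data (assumption)
y <- sin(x) + rnorm(200, sd = 0.2)

grid   <- seq(1, 9, by = 0.5)
smooth <- sapply(grid, function(x0) local_linear(x, y, x0))
```

A smaller span gives a wigglier fit that follows the data more closely; a larger span approaches a global straight line.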


Generalized additive model

GAMs provide a general framework for extending the linear model by allowing nonlinear functions of each variable while maintaining additivity.

Fitting a GAM with smoothing splines is not as simple, because least squares cannot be used. Instead, we use an approach called backfitting.
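The iterative scheme, usually called backfitting, updates one component function at a time by smoothing the partial residuals that remain after removing all the others. The sketch below is a simplified illustration on simulated data with two predictors; the fixed 20 iterations and df = 5 are assumptions, and it is not the exact algorithm implemented in any particular package.

```r
set.seed(5)
n  <- 300
x1 <- runif(n)                                # simulated predictors (assumption)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

alpha <- mean(y)        # intercept estimate
f1 <- rep(0, n)         # current estimate of f1(x1) at the data points
f2 <- rep(0, n)         # current estimate of f2(x2) at the data points

for (iter in 1:20) {
  # Smooth the partial residuals against x1, holding f2 fixed
  s1 <- smooth.spline(x1, y - alpha - f2, df = 5)
  f1 <- predict(s1, x1)$y
  f1 <- f1 - mean(f1)   # centre for identifiability
  # Then do the same for x2, holding f1 fixed
  s2 <- smooth.spline(x2, y - alpha - f1, df = 5)
  f2 <- predict(s2, x2)$y
  f2 <- f2 - mean(f2)
}

fitted_gam <- alpha + f1 + f2
```

A production implementation would stop when the component functions change by less than a tolerance instead of running a fixed number of sweeps.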

Advantages and disadvantages of GAMs

Advantages

  • GAMs allow a nonlinear function to be fitted to each predictor, so that we can automatically model nonlinear relationships that standard linear regression would miss. We do not have to try many different transformations of each variable by hand.
  • The nonlinear fits can potentially make more accurate predictions for the response Y.
  • Because the model is additive, we can still examine the effect of each predictor on Y while holding the other variables fixed.

Disadvantages

  • The main limitation is that the model is restricted to be additive, so important interactions may be missed.

Example

Polynomial regression and step function

library(ISLR)   # Wage data
library(knitr)  # kable(), used below to print tables
attach(Wage)

We can easily fit polynomial functions with poly(), specifying the variable and the degree of the polynomial. The function returns a matrix of orthogonal polynomials, which means that each column is a linear combination of age, age^2, age^3 and age^4. To use the raw powers directly, specify raw=TRUE; this changes the coefficient estimates but does not affect the fitted values. The table below shows the coefficient estimates.

fit = lm(wage~poly(age, 4), data=Wage)
kable(coef(summary(fit)))


Now we create a grid of ages at which we want predictions. Finally, we plot the data together with the fitted degree-4 polynomial.

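ageLims <- range(age)
age.grid <- seq(from=ageLims[1], to=ageLims[2])

pred <- predict(fit, newdata = list(age = age.grid),
                se.fit=TRUE)
# Approximate 95% band: fitted value plus/minus two standard errors
se.bands <- cbind(pred$fit + 2*pred$se.fit,
                  pred$fit - 2*pred$se.fit)

plot(age,wage,xlim=ageLims ,cex=.5,col="darkgrey")
lines(age.grid,pred$fit,lwd=2,col="blue")
matlines(age.grid,se.bands,lwd=2,col="blue",lty=3)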


In this simple example, we can use an ANOVA test to compare nested polynomial fits of increasing degree.
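The model formulas listed in the table that follows suggest a comparison along these lines. This is a sketch reconstructed from that output; the object names fit.1 to fit.5 and aov_tab are assumptions.

```r
library(ISLR)  # provides the Wage data

# Nested polynomial fits of increasing degree
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)

# Sequential F-tests: each row compares a model with the one before it
aov_tab <- anova(fit.1, fit.2, fit.3, fit.4, fit.5)
aov_tab
```

The F-tests are only valid because the models are nested: each one contains all the terms of the previous one.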

 ## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)    
## 1   2998 5022216                               
## 2   2997 4793430  1    228786 143.59 <2e-16 ***
## 3   2996 4777674  1     15756   9.89 0.0017 ** 
## 4   2995 4771604  1      6070   3.81 0.0510 .  
## 5   2994 4770322  1      1283   0.80 0.3697    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

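We see that the p-value comparing the linear model M1 with the quadratic model M2 is essentially zero, which indicates that a linear fit is not sufficient. The cubic term is also clearly significant (p = 0.0017), the quartic term is borderline (p = 0.051) and the quintic term is not needed (p = 0.37). We can therefore conclude that a cubic or quartic polynomial fits these data reasonably well, and we lean toward the simpler model.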

We can also use cross validation to select polynomial degree.


In fact, the minimum cross-validation error we see here is for the quartic polynomial, but choosing the cubic or quadratic model would not cost much. Next, we consider predicting whether an individual earns more than $250,000 per year.

However, confidence intervals computed directly on the probability scale are unreasonable, because we end up with some negative probabilities. To generate the confidence intervals, it makes more sense to work with the predictions on the logit (log-odds) scale and then transform them.
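A minimal sketch of that transformation is given below. It assumes a degree-4 logistic regression, a choice suggested by the earlier polynomial fit but not stated in the original, and it defines the objects pfit and se.bands used in the plotting code that follows.

```r
library(ISLR)

age.grid <- seq(from = min(Wage$age), to = max(Wage$age))

# Logistic regression for the event wage > 250 (wage is recorded in thousands of dollars)
fit.bin <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)

# Predictions and standard errors on the logit (log-odds) scale
preds <- predict(fit.bin, newdata = list(age = age.grid), se.fit = TRUE)

# Map the fit and the +/- 2 SE band through the inverse logit,
# so that everything stays between 0 and 1
pfit     <- plogis(preds$fit)
se.bands <- plogis(cbind(preds$fit + 2 * preds$se.fit,
                         preds$fit - 2 * preds$se.fit))
```

Because the inverse logit is monotone, transforming the endpoints of the interval gives a valid interval for the probability.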

Plot:

plot(age,I(wage>250),xlim=ageLims ,type="n",ylim=c(0,.2))
lines(age.grid,pfit,lwd=2, col="blue")
matlines(age.grid,se.bands,lwd=1,col="blue",lty=3)


Step functions

Here, we need to cut age into bins.

table(cut(age, 4)) 

## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72

fit <- lm(wage~cut(age, 4), data=Wage)
coef(summary(fit))

##                        Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              94.158      1.476  63.790 0.000e+00
## cut(age, 4)(33.5,49]     24.053      1.829  13.148 1.982e-38
## cut(age, 4)(49,64.5]     23.665      2.068  11.443 1.041e-29
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532 1.256e-01

Spline functions

Here, we will use cubic splines, built with bs() from the splines package.


Because we use a cubic spline with three knots, the resulting spline has six basis functions.

## [1] 3000    6

dim(bs(age, df=6))

## [1] 3000    6

attr(bs(age, df=6), "knots")

##   25%   50%   75% 
## 33.75 42.00 51.00 

Fit the spline curve.


We can also fit smoothing splines. Here, we fit one smoothing spline with 16 degrees of freedom, and then another whose smoothing level is chosen by cross-validation, which yields about 6.8 degrees of freedom.
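The two fits described here can be sketched with smooth.spline() from base R. The object names fit and fit2 match the code that follows; the exact calls are an assumption consistent with the reported degrees of freedom.

```r
library(ISLR)

# Fixed effective degrees of freedom
fit  <- smooth.spline(Wage$age, Wage$wage, df = 16)

# Let leave-one-out cross-validation choose the smoothing level.
# R warns that CV with non-unique x values is doubtful, since age has many ties.
fit2 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
```

With df = 16 the smoothing parameter is solved for internally; with cv = TRUE it is chosen to minimize the LOOCV criterion.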

fit2$df

## [1] 6.795

plot(age, wage, cex=.5, col="darkgrey")
lines(fit, col='red', lwd=2)
lines(fit2, col='blue', lwd=1)
legend('topright', legend=c('16 DF', '6.8 DF'),
       col=c('red','blue'), lty=1, lwd=2, cex=0.8) 


Local regression

Local regression was performed.


GAMs

Now we use a GAM to predict wage from splines of year and age together with education. Since this is just a linear regression model on a suitable set of basis functions, we can simply use the lm() function.

To fit more flexible components, we need smoothing splines.
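The two fits can be sketched as follows. The degrees of freedom for the smoothing-spline terms are taken from the model output shown later, while the natural-spline degrees of freedom and the object names gam1 and gam.m3 are assumptions.

```r
library(ISLR)
library(splines)
library(gam)   # Hastie's gam package, assumed to be installed

# Natural-spline version: a linear model on a larger basis, so lm() suffices
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)

# Smoothing-spline version: the s(variable, df) terms are fitted by backfitting
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
```

education is a factor, so it enters both models through dummy variables rather than a smooth term.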

Plot the two models.


The effect of year appears to be linear. We can fit a new model with a linear year term and then use an ANOVA test.
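The formulas in the ANOVA table that follows suggest a comparison like the one sketched here. The model names gam.m1 to gam.m3 and the use of F-tests are assumptions.

```r
library(ISLR)
library(splines)
library(gam)   # assumed to be installed; provides gam() and s()

gam.m1 <- gam(wage ~ ns(age, 5) + education, data = Wage)               # no year term
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)         # linear year
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)   # smooth year

# Sequential F-tests between the three models
gam_anova <- anova(gam.m1, gam.m2, gam.m3, test = "F")
gam_anova
```

Each row asks whether the extra flexibility in year is worth its additional degrees of freedom.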

 ## Analysis of Variance Table
## 
## Model 1: wage ~ ns(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Res.Df     RSS Df Sum of Sq    F  Pr(>F)    
## 1   2990 3712881                              
## 2   2989 3693842  1     19040 15.4 8.9e-05 ***
## 3   2986 3689770  3      4071  1.1    0.35    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

A GAM with a linear year component appears to be much better than a GAM without year (p = 8.9e-05), while replacing the linear term with a smooth function of year brings no significant improvement (p = 0.35).

 ## 
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -19.70   -3.33   14.17  213.48 
## 
## (Dispersion Parameter for gaussian family taken to be 1236)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29888 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value  Pr(>F)    
## s(year, 4)    1   27162   27162      22 2.9e-06 ***
## s(age, 5)     1  195338  195338     158 < 2e-16 ***
## education     4 1069726  267432     216 < 2e-16 ***
## Residuals  2986 3689770    1236                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)    
## (Intercept)                          
## s(year, 4)        3    1.1   0.35    
## s(age, 5)         4   32.4 <2e-16 ***
## education                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

In the summary of the model with nonlinear terms, we can confirm again that the nonlinear part of year makes no significant contribution to the model.

Next, we fit a GAM that uses local regression.


We can also use local regression to create interaction terms in the call to gam().
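Both uses of local regression can be sketched with lo() from the gam package. The span values and object names below are illustrative assumptions, not values given in the original.

```r
library(ISLR)
library(gam)   # assumed to be installed; provides gam(), s() and lo()

# Local regression for age (span = 0.7), smoothing spline for year
gam.lo <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
              data = Wage)

# A two-dimensional local regression surface in year and age acts as an interaction
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
```

The second model gives up separate additive effects for year and age in exchange for a joint surface, which is what the surface plot displays.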

We can plot the resulting surface.

