Polynomial and spline regression analysis of the ISLR Wage data in R

Time: 2021-3-27

Link to the original text: http://tecdat.cn/?p=8531

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree for the polynomial. How does the chosen degree compare with the result of an ANOVA hypothesis test? Plot the resulting polynomial fit to the data.

Load the Wage data set and create a vector to hold the cross-validation errors. We perform k-fold cross-validation with k = 10.

rm(list = ls())
set.seed(1)
library(ISLR)   # Wage data
library(boot)   # cv.glm()

# test error
cv.MSE <- NA

# loop over powers of age
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  # use cv.glm for cross-validation, keeping the CV error
  cv.MSE[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
# inspect the results
cv.MSE

##  [1] 1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500
##  [8] 1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253
## [15] 1595.332
    

We plot the results with type = "b" to show both the points and the connecting lines.

# illustrate the results with a line plot
plot(x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error",
     type = "b", pch = 19, lwd = 2, bty = "n",
     ylim = c(min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE)))

# horizontal line one standard deviation above the minimum
abline(h = min(cv.MSE) + sd(cv.MSE), lty = "dotted")

# highlight the minimum
points(x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5)

[Figure: CV error by polynomial degree of age]

We refit the models with increasing powers of age and compare them with an analysis of variance.
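The refitting code is omitted in the original. A minimal sketch that would reproduce the table below; the loop index a is an assumption, chosen because every model prints its formula as wage ~ poly(age, a):

# fit wage ~ poly(age, a) for a = 1, ..., 15 and compare the nested models
fits <- lapply(1:15, function(a) glm(wage ~ poly(age, a), data = Wage))
do.call(anova, c(fits, test = "F"))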


## Analysis of Deviance Table
## 
## Model  1: wage ~ poly(age, a)
## Model  2: wage ~ poly(age, a)
## Model  3: wage ~ poly(age, a)
## Model  4: wage ~ poly(age, a)
## Model  5: wage ~ poly(age, a)
## Model  6: wage ~ poly(age, a)
## Model  7: wage ~ poly(age, a)
## Model  8: wage ~ poly(age, a)
## Model  9: wage ~ poly(age, a)
## Model 10: wage ~ poly(age, a)
## Model 11: wage ~ poly(age, a)
## Model 12: wage ~ poly(age, a)
## Model 13: wage ~ poly(age, a)
## Model 14: wage ~ poly(age, a)
## Model 15: wage ~ poly(age, a)
##    Resid. Df Resid. Dev Df Deviance        F    Pr(>F) 
## 1       2998    5022216 
## 2       2997    4793430  1   228786 143.5637 < 2.2e-16 ***
## 3       2996    4777674  1    15756   9.8867  0.001681 ** 
## 4       2995    4771604  1     6070   3.8090  0.051070 . 
## 5       2994    4770322  1     1283   0.8048  0.369731 
## 6       2993    4766389  1     3932   2.4675  0.116329 
## 7       2992    4763834  1     2555   1.6034  0.205515 
## 8       2991    4763707  1      127   0.0795  0.778016 
## 9       2990    4756703  1     7004   4.3952  0.036124 * 
## 10      2989    4756701  1        3   0.0017  0.967552 
## 11      2988    4756597  1      103   0.0648  0.799144 
## 12      2987    4756591  1        7   0.0043  0.947923 
## 13      2986    4756401  1      190   0.1189  0.730224 
## 14      2985    4756158  1      243   0.1522  0.696488 
## 15      2984    4755364  1      795   0.4986  0.480151 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

Based on the F-tests, we would choose the model with age raised to the third power, which is consistent with the cross-validation result.

Now, let’s plot the result of polynomial fitting.



plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
...
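The rest of the plotting code is elided in the original. A sketch of how the cubic fit could be overlaid; the grid object age.grid is an assumption introduced here:

age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
fit <- lm(wage ~ poly(age, 3), data = Wage)
lm.pred <- predict(fit, newdata = data.frame(age = age.grid), se.fit = TRUE)
# fitted curve plus an approximate 95% confidence band
lines(age.grid, lm.pred$fit, col = "red", lwd = 2)
matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")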
    

[Figure: cubic polynomial fit of wage on age]

  2. Fit a step function to predict wage using age, and plot the fit.
cv.error <- NA
...

# highlight the minimum
points(x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
       col = "red", pch = "X", cex = 1.5)
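The loop that fills cv.error is elided. A reconstruction, assuming we try 2 to 10 intervals and precompute the factor so that cv.glm sees fixed levels:

for (k in 2:10) {
  Wage$age.cut <- cut(Wage$age, k)             # k intervals of age
  step.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error[k] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
# CV error against the number of intervals
plot(2:10, cv.error[-1], type = "b", pch = 19, bty = "n",
     xlab = "number of age intervals", ylab = "CV error")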

[Figure: CV error by number of age intervals]

Cross-validation shows that the test error is smallest at k = 8. The simplest model within one standard deviation of the minimum has k = 4, so we cut the data into four regions.




lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
# prediction over the age grid (this step is elided in the original)
lm.pred <- predict(lm.fit, newdata = data.frame(age = age.grid), se.fit = TRUE)
matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")
    

The Wage data set contains other variables, such as marital status (maritl) and job class (jobclass). Explore the relationships between wage and some of these other predictors, and use nonlinear fitting techniques to fit models to the data.



...
summary(Wage[, c("maritl", "jobclass")])
    


##               maritl               jobclass 
##  1. Never Married: 648   1. Industrial :1544 
##  2. Married      :2074   2. Information:1456 
##  3. Widowed      :  19 
##  4. Divorced     : 204 
##  5. Separated    :  55
    
# boxplots of wage against each factor
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage, frame.plot = "FALSE")
plot(Wage$jobclass, Wage$wage, frame.plot = "FALSE")  # second panel, reconstructed

[Figure: boxplots of wage by maritl and jobclass]

Married individuals appear to earn more on average than the other groups, and information jobs pay more on average than industrial jobs.

Polynomial and step function



m1 <- lm(wage ~ maritl, data = Wage)
deviance(m1)  # fit (RSS for linear models; -2*logLik otherwise)
    

## [1] 4858941



m2 <- lm(wage ~ jobclass, data = Wage)
deviance(m2)
    

## [1] 4998547



m3 <- lm(wage ~ maritl + jobclass, data = Wage)
deviance(m3)
    

## [1] 4654752

As expected, the most complex model achieves the lowest in-sample RSS.

We cannot fit a spline to a factor, but we can fit a spline to a continuous variable (age) and add the other predictors to the model.



library(gam)
# the formula is taken from Model 4 in the ANOVA table below
m4 <- gam(wage ~ maritl + jobclass + s(age, 4), data = Wage)
deviance(m4)
    

## [1] 4476501

anova(m1, m2, m3, m4) 

## Analysis of Variance Table
## 
## Model 1: wage ~ maritl
## Model 2: wage ~ jobclass
## Model 3: wage ~ maritl + jobclass
## Model 4: wage ~ maritl + jobclass + s(age, 4)
##   Res.Df     RSS      Df Sum of Sq      F    Pr(>F) 
## 1   2995 4858941 
## 2   2998 4998547 -3.0000   -139606 31.082 < 2.2e-16 ***
## 3   2994 4654752  4.0000    343795 57.408 < 2.2e-16 ***
## 4   2990 4476501  4.0002    178252 29.764 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

The F-tests show that model 4 improves significantly on the simpler models: wage depends on maritl and jobclass, and nonlinearly on age.

Boston data regression

The variables used in this question are dis (the weighted mean of distances to five Boston employment centres) and nox (nitrogen oxides concentration, in parts per 10 million). We use dis as the predictor and nox as the response.



rm(list = ls())
set.seed(1)
library(MASS)
attach(Boston)
    


## The following objects are masked from Boston (pos = 14):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,
##     rad, rm, tax, zn
    
  1. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the data and polynomial fit.


m1 <- lm(nox ~ poly(dis, 3))
summary(m1)



## 
## Call:
## lm(formula = nox ~ poly(dis, 3))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|) 
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
    


dislim <- range(dis)
...
lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2)
matlines(x = dis.grid, y = cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                                 lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lwd = 1.5, lty = "dashed")
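The elided middle of this block builds the prediction grid; a plausible reconstruction (dis.grid and lm.pred are assumptions consistent with the lines above):

dis.grid <- seq(from = dislim[1], to = dislim[2], by = 0.1)
lm.pred <- predict(m1, newdata = data.frame(dis = dis.grid), se.fit = TRUE)
plot(nox ~ dis, col = "darkgrey", bty = "n")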
    

[Figure: cubic polynomial fit of nox on dis]

The summary shows that all polynomial terms are significant for predicting nox from dis. The plot shows a smooth curve that fits the data well.

  2. Plot polynomial fits for a range of degrees (for example, 1 to 10) and report the associated residual sums of squares.

We fit polynomials of degree 1 to 10 and save the RSS.

train.rss <- NA
...
# model fit on the training data
train.rss

##  [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484
##  [8] 1.835630 1.833331 1.832171
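The fitting loop is elided above; a sketch consistent with the printed RSS values:

for (i in 1:10) {
  m <- lm(nox ~ poly(dis, i))
  train.rss[i] <- sum(residuals(m)^2)   # training RSS for degree i
}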
    

As expected, the training RSS decreases monotonically with the degree of the polynomial.

  3. Perform cross-validation or another method to select the optimal degree for the polynomial, and explain your results.

We perform LOOCV, coding it by hand.



cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)
...
names(result) <- paste("^", 1:10, sep = "")
result
    


##          ^1          ^2          ^3          ^4          ^5          ^6 
## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971 
##          ^7          ^8          ^9         ^10 
## 0.003655106 0.003627727 0.003623183 0.003620892
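The elided portion of the block above performs the leave-one-out fits; a reconstruction, assuming cv.error holds one squared error per observation and degree:

for (i in 1:nrow(Boston)) {
  for (p in 1:10) {
    fit <- lm(nox ~ poly(dis, p), data = Boston[-i, ])  # fit without observation i
    pred <- predict(fit, newdata = Boston[i, ])         # predict the held-out point
    cv.error[i, p] <- (Boston$nox[i] - pred)^2
  }
}
result <- colMeans(cv.error)  # LOOCV MSE for each degree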
    


plot(result ~ seq(1, 10), type = "b", pch = 19, bty = "n",
     xlab = "powers of dist to empl. center", ylab = "cv error")
abline(h = min(cv.error) + sd(cv.error), col = "red", lty = "dashed")
    

[Figure: LOOCV error by polynomial degree of dis]

Based on cross-validation, we would choose the square of dis.

  4. Use the bs() function to fit a regression spline to predict nox using dis.

We split the range of dis into four roughly equal intervals using knots at 3, 6, and 9.



library(splines)
m4 <- lm(nox ~ bs(dis, knots = c(3, 6, 9)))
summary(m4)
    


## 
## Call:
## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.132134 -0.039466 -0.009042  0.025344  0.187258 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|) 
## (Intercept)                   0.709144   0.016099  44.049  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))1  0.006631   0.025467   0.260    0.795 
## bs(dis, knots = c(3, 6, 9))2 -0.258296   0.017759 -14.544  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))3 -0.233326   0.027248  -8.563  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))4 -0.336530   0.032140 -10.471  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))5 -0.269575   0.058799  -4.585 5.75e-06 ***
## bs(dis, knots = c(3, 6, 9))6 -0.303386   0.062631  -4.844 1.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0612 on 499 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7211 
## F-statistic: 218.6 on 6 and 499 DF,  p-value: < 2.2e-16
    
# plot the results
...
matlines(dis.grid,
         ...
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
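A sketch of the elided plotting code, assuming the same dis.grid as before and a three-column matrix holding the fit and its band:

sp.pred <- predict(m4, newdata = data.frame(dis = dis.grid), se.fit = TRUE)
plot(nox ~ dis, col = "darkgrey", bty = "n")
matlines(dis.grid,
         cbind(sp.pred$fit,
               sp.pred$fit + 2 * sp.pred$se.fit,
               sp.pred$fit - 2 * sp.pred$se.fit),
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))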

[Figure: regression spline fit of nox on dis with knots at 3, 6, 9]

  5. Now fit regression splines over a range of degrees of freedom, plot the resulting fits, and report and describe the results.

We fit regression splines with degrees of freedom between 3 and 16.



box <- NA

for (i in 3:16) {
  ...
}

box[-c(1, 2)]
    


##  [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653
##  [8] 1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546
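The loop body is elided above; a reconstruction that matches the printed values:

for (i in 3:16) {
  sp.fit <- lm(nox ~ bs(dis, df = i))
  box[i] <- sum(residuals(sp.fit)^2)   # training RSS for df = i
}

The RSS decreases (with small fluctuations) as the degrees of freedom grow, reaching its minimum at df = 14 before rising slightly.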
    

This question uses the College data set in the ISLR package.

  1. Split the data into a training set and a test set. Using out-of-state tuition (Outstate) as the response and the other variables as predictors, perform stepwise selection on the training set to identify a satisfactory model that uses just a subset of the predictors.


rm(list = ls())
set.seed(1)
library(leaps)
attach(College)
    


## The following objects are masked from College (pos = 14):
## 
##     Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate,
##     Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private,
##     Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc
    
# indices for the training / test split
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
...
abline(h = max.adjr2 - std.adjr2, col = "red", lty = 2)
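The elided middle performs the subset selection. A sketch, assuming forward stepwise selection with regsubsets() judged by adjusted R-squared; College.train, College.test, m5, max.adjr2, and std.adjr2 are reconstructions that the later code reuses:

College.train <- College[train, ]
College.test  <- College[test, ]
m5 <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
m5.sum <- summary(m5)
max.adjr2 <- max(m5.sum$adjr2)
std.adjr2 <- sd(m5.sum$adjr2)
plot(m5.sum$adjr2, type = "b", xlab = "subset size", ylab = "adjusted R-squared", bty = "n")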

[Figure: model selection scores by subset size]

The Cp, BIC, and adjusted R-squared scores all point to a subset of size 6. However, applying the one-standard-error rule, we choose the model with four predictors.



...
coefi <- coef(m5, id = 4)
names(coefi)
    

## [1] "(Intercept)" "PrivateYes" "Room.Board" "perc.alumni" "Expend"

  2. Fit a GAM to the training data, using out-of-state tuition as the response and the features selected in the previous step as predictors. Plot the results and explain your findings.


library(gam)
...
plot(gam.fit, se = TRUE, col = "blue")
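The construction of gam.fit is elided, but its formula can be read off the summary(gam.fit) output further below, so a faithful reconstruction is:

gam.fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
                 s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2),
               data = College.train)

Note that this model includes s(PhD) and s(Grad.Rate) in addition to the four predictors selected above.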
    

[Figure: GAM smooth terms with standard-error bands]

  3. Evaluate the model on the test set and interpret the results.


gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
    

## [1] 3745460



gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.rss <- 1 - gam.err / gam.tss  # despite its name, this is the test R-squared
test.rss
    

## [1] 0.7696916

  4. For which variables, if any, is there evidence of a nonlinear relationship with the response?

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -4977.74 -1184.52    58.33  1220.04  7688.30 
## 
## (Dispersion Parameter for gaussian family taken to be 3300711)
## 
##     Null Deviance: 6221998532 on 387 degrees of freedom
## Residual Deviance: 1231165118 on 373 degrees of freedom
## AIC: 6941.542 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F) 
## Private                  1 1779433688 1779433688 539.106 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1221825562 1221825562 370.171 < 2.2e-16 ***
## s(PhD, df = 2)           1  382472137  382472137 115.876 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  328493313  328493313  99.522 < 2.2e-16 ***
## s(Expend, df = 5)        1  416585875  416585875 126.211 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   55284580   55284580  16.749 5.232e-05 ***
## Residuals              373 1231165118    3300711 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F) 
## (Intercept) 
## Private 
## s(Room.Board, df = 2)        1  3.5562   0.06010 . 
## s(PhD, df = 2)               1  4.3421   0.03786 * 
## s(perc.alumni, df = 2)       1  1.9158   0.16715 
## s(Expend, df = 5)            4 16.8636 1.016e-12 ***
## s(Grad.Rate, df = 2)         1  3.7208   0.05450 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    

The nonparametric ANOVA tests show strong evidence of a nonlinear relationship between the response and Expend, and moderate evidence of a nonlinear relationship with Grad.Rate (p ≈ 0.05).


