# Polynomial regression and spline regression analysis of the ISLR wage data in R

Time: 2021-03-27

### Link to the original text: http://tecdat.cn/?p=8531

1. Perform polynomial regression to predict `wage` using `age`. Use cross-validation to select the optimal degree for the polynomial. How does this choice compare with the degree suggested by hypothesis testing with `anova()`? Plot the resulting polynomial fit to the data.

Load the Wage data set and create a vector to hold the cross-validation errors. We perform k-fold cross-validation with k = 10.

```r
rm(list = ls())
set.seed(1)
library(ISLR)   # Wage data
library(boot)   # cv.glm()

# vector of test-error estimates
cv.MSE <- NA

# loop over polynomial degrees of age
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  # cv.glm() runs the cross-validation; delta[1] is the raw CV error
  cv.MSE[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

# inspect the results
cv.MSE
```
```
##   1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500
##   1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253
##   1595.332
```

We plot the results with `type = "b"`, which shows both points and lines.

```r
# show the results with a line plot
plot( x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error",
      type = "b", pch = 19, lwd = 2, bty = "n",
      ylim = c( min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE) ) )

# horizontal line one standard deviation above the minimum
abline(h = min(cv.MSE) + sd(cv.MSE), lty = "dotted")

# mark the minimum
points( x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5 )
```

We now refit the models with increasing polynomial degrees of `age` and compare them with an analysis of variance.
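The refitting code is not shown in the original; a minimal sketch, assuming the fifteen `glm` fits are collected in a list (the name `fits` is an assumption) and compared with sequential F-tests. Because the models are fit inside a loop, each stored formula prints literally as `poly(age, a)` in the table below.

```r
# refit wage ~ poly(age, a) for a = 1, ..., 15
fits <- vector("list", 15)
for (a in 1:15) {
  fits[[a]] <- glm(wage ~ poly(age, a), data = Wage)
}

# sequential F-tests between the nested fits
do.call(anova, c(fits, list(test = "F")))
```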

```
## Analysis of Deviance Table
##
## Model  1: wage ~ poly(age, a)
## Model  2: wage ~ poly(age, a)
## Model  3: wage ~ poly(age, a)
## Model  4: wage ~ poly(age, a)
## Model  5: wage ~ poly(age, a)
## Model  6: wage ~ poly(age, a)
## Model  7: wage ~ poly(age, a)
## Model  8: wage ~ poly(age, a)
## Model  9: wage ~ poly(age, a)
## Model 10: wage ~ poly(age, a)
## Model 11: wage ~ poly(age, a)
## Model 12: wage ~ poly(age, a)
## Model 13: wage ~ poly(age, a)
## Model 14: wage ~ poly(age, a)
## Model 15: wage ~ poly(age, a)
##    Resid. Df Resid. Dev Df Deviance        F    Pr(>F)
## 1       2998    5022216
## 2       2997    4793430  1   228786 143.5637 < 2.2e-16 ***
## 3       2996    4777674  1    15756   9.8867  0.001681 **
## 4       2995    4771604  1     6070   3.8090  0.051070 .
## 5       2994    4770322  1     1283   0.8048  0.369731
## 6       2993    4766389  1     3932   2.4675  0.116329
## 7       2992    4763834  1     2555   1.6034  0.205515
## 8       2991    4763707  1      127   0.0795  0.778016
## 9       2990    4756703  1     7004   4.3952  0.036124 *
## 10      2989    4756701  1        3   0.0017  0.967552
## 11      2988    4756597  1      103   0.0648  0.799144
## 12      2987    4756591  1        7   0.0043  0.947923
## 13      2986    4756401  1      190   0.1189  0.730224
## 14      2985    4756158  1      243   0.1522  0.696488
## 15      2984    4755364  1      795   0.4986  0.480151
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

According to the F-tests, we should choose the model with `age` raised to the third power, which is consistent with the cross-validation results.

Now, let's plot the resulting polynomial fit.

```r
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
...
```
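The rest of the plotting code was elided; a minimal sketch that overlays the chosen cubic fit and an approximate 95% band (the object names `fit.3`, `age.grid`, and `preds` are assumptions):

```r
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)

# grid of ages spanning the observed range
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
preds <- predict(fit.3, newdata = data.frame(age = age.grid), se = TRUE)

plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
lines(age.grid, preds$fit, col = "red", lwd = 2)
# fit +/- 2 standard errors
matlines(age.grid, cbind(preds$fit + 2 * preds$se.fit,
                         preds$fit - 2 * preds$se.fit),
         col = "red", lty = "dashed")
```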
2. Fit a step function to predict `wage` using `age`, and plot the fit.

```r
cv.error <- NA
...
# highlight the minimum
points( x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
        col = "red", pch = "X", cex = 1.5 )
```

Cross-validation shows that the test error is smallest at k = 8. The simplest model within one standard deviation of the minimum has k = 4, so `age` is cut into four intervals in the fit below.
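The cross-validation loop itself was elided above; a minimal sketch, assuming the step functions are fit with a precomputed `cut()` factor and scored with 10-fold CV via `cv.glm()` (the range of cuts tried is an assumption):

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)
cv.error <- NA
# cut() needs at least two intervals, so start at k = 2
for (k in 2:10) {
  Wage$age.cut <- cut(Wage$age, k)   # precompute so all folds share the same breaks
  step.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error[k] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
cv.error
```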

```r
lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
# (reconstructed) prediction grid and fitted values used by matlines() below
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
lm.pred <- predict(lm.fit, data.frame(age = age.grid), se = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
matlines(age.grid, cbind( lm.pred$fit + 2* lm.pred$se.fit,
                          lm.pred$fit - 2* lm.pred$se.fit),
         col = "red", lty = "dashed")
```

3. The `Wage` data set contains a number of other variables, such as marital status (`maritl`) and job class (`jobclass`). Explore the relationships between some of these other predictors and `wage`, and use non-linear fitting techniques to fit flexible models to the data.

```r
...
summary(Wage[, c("maritl", "jobclass")])
```
```
##               maritl               jobclass
##  1. Never Married: 648   1. Industrial :1544
##  2. Married      :2074   2. Information:1456
##  3. Widowed      :  19
##  4. Divorced     : 204
##  5. Separated    :  55
```
```r
# boxplots of the relationships
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage, frame.plot = FALSE)
plot(Wage$jobclass, Wage$wage, frame.plot = FALSE)  # (reconstructed) second panel
```

On average, married people appear to earn more than the other groups, and wages in information jobs are higher than in industrial jobs.

# Polynomial and step function

```r
m1 <- lm(wage ~ maritl, data = Wage)
deviance(m1)  # for an lm fit, deviance = RSS
```

`##  4858941`

```r
m2 <- lm(wage ~ jobclass, data = Wage)
deviance(m2)
```

`##  4998547`

```r
m3 <- lm(wage ~ maritl + jobclass, data = Wage)
deviance(m3)
```

`##  4654752`

As expected, the most complex model achieves the lowest in-sample RSS.

We cannot fit a spline to a categorical variable. We can, however, fit a spline to a continuous variable and add the other predictors to the model.

```r
library(gam)
m4 <- gam(wage ~ maritl + jobclass + s(age, 4), data = Wage)
deviance(m4)
```

`##  4476501`

`anova(m1, m2, m3, m4)`
```
## Analysis of Variance Table
##
## Model 1: wage ~ maritl
## Model 2: wage ~ jobclass
## Model 3: wage ~ maritl + jobclass
## Model 4: wage ~ maritl + jobclass + s(age, 4)
##   Res.Df     RSS      Df Sum of Sq      F    Pr(>F)
## 1   2995 4858941
## 2   2998 4998547 -3.0000   -139606 31.082 < 2.2e-16 ***
## 3   2994 4654752  4.0000    343795 57.408 < 2.2e-16 ***
## 4   2990 4476501  4.0002    178252 29.764 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The F-tests show that model 4, which models `wage` with `maritl`, `jobclass`, and a smooth term in `age`, improves significantly on the simpler models 1 through 3.

# Boston data regression

This question uses the variables `dis` (weighted mean of distances to five Boston employment centres) and `nox` (nitrogen oxides concentration, in parts per 10 million) from the Boston data set. We treat `dis` as the predictor and `nox` as the response.

```r
rm(list = ls())
set.seed(1)
library(MASS)
attach(Boston)
```
```
## The following objects are masked from Boston (pos = 14):
##
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,
##     rad, rm, tax, zn
```
1. Use the `poly()` function to fit a cubic polynomial regression to predict `nox` using `dis`. Report the regression output, and plot the resulting data and polynomial fit.
```r
m1 <- lm(nox ~ poly(dis, 3))
summary(m1)
```
```
##
## Call:
## lm(formula = nox ~ poly(dis, 3))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.121130 -0.040619 -0.009738  0.023385  0.194904
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
```
```r
dislim <- range(dis)
...
lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2)
matlines(x = dis.grid, y = cbind(lm.pred$fit + 2* lm.pred$se.fit,
                                 lm.pred$fit - 2* lm.pred$se.fit),
         col = "red", lwd = 1.5, lty = "dashed")
```

The summary shows that all polynomial terms of `dis` are significant for predicting `nox`. The plot shows a smooth curve that fits the data well.
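The construction of `dis.grid` and `lm.pred` was elided from the block above; a minimal sketch of the missing steps (the 0.1 grid spacing is an assumption):

```r
dislim <- range(dis)
# prediction grid over the observed range of dis
dis.grid <- seq(from = dislim[1], to = dislim[2], by = 0.1)
# fitted values and standard errors along the grid
lm.pred <- predict(m1, newdata = list(dis = dis.grid), se = TRUE)
# base scatterplot that lines() and matlines() then draw on
plot(nox ~ dis, col = "darkgrey", bty = "n")
```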

2. Fit polynomials of a range of degrees (say, 1 to 10), and report the associated residual sums of squares.

We fit polynomials of degrees 1 to 10 and save the RSS of each.

```r
...
# show the model fit on the training set
```
```
##   2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484
##   1.835630 1.833331 1.832171
```

As expected, the RSS decreases monotonically with the polynomial degree.
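The fitting loop was elided above; a minimal sketch that produces training RSS values of this kind (the name `rss` is an assumption):

```r
rss <- NA
# training RSS for polynomial degrees 1 through 10
for (i in 1:10) {
  rss[i] <- sum(lm(nox ~ poly(dis, i))$residuals^2)
}
rss
```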

3. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

We perform LOOCV, coded by hand.

```r
cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)
...
names(result) <- paste("^", 1:10, sep = "")
result
```
```
##          ^1          ^2          ^3          ^4          ^5          ^6
## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971
##          ^7          ^8          ^9         ^10
## 0.003655106 0.003627727 0.003623183 0.003620892
```
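The body of the hand-coded LOOCV loop was elided above; a minimal sketch, assuming each observation is held out in turn, its squared prediction error is stored per polynomial degree, and `result` averages over observations:

```r
# row i holds observation i's held-out squared error for each degree
cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)
for (i in 1:nrow(Boston)) {
  for (j in 1:10) {
    fit <- lm(nox ~ poly(dis, j), data = Boston[-i, ])
    pred <- predict(fit, newdata = Boston[i, ])
    cv.error[i, j] <- (Boston$nox[i] - pred)^2
  }
}
# average over observations to get the LOOCV error per degree
result <- colMeans(cv.error)
names(result) <- paste("^", 1:10, sep = "")
result
```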
```r
plot(result ~ seq(1, 10), type = "b", pch = 19, bty = "n",
     xlab = "powers of dist to empl. center", ylab = "cv error")
# one-standard-error threshold above the minimum CV error
abline(h = min(result) + sd(result), col = "red", lty = "dashed")
```

Based on cross-validation, we would choose the quadratic polynomial in `dis`.

4. Use the `bs()` function to fit a regression spline to predict `nox` using `dis`.

We split the range of `dis` into four roughly equal intervals, with knots at 3, 6, and 9.

```r
library(splines)
m4 <- lm(nox ~ bs(dis, knots = c(3, 6, 9)))
summary(m4)
```
```
##
## Call:
## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9)))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.132134 -0.039466 -0.009042  0.025344  0.187258
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   0.709144   0.016099  44.049  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))1  0.006631   0.025467   0.260    0.795
## bs(dis, knots = c(3, 6, 9))2 -0.258296   0.017759 -14.544  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))3 -0.233326   0.027248  -8.563  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))4 -0.336530   0.032140 -10.471  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))5 -0.269575   0.058799  -4.585 5.75e-06 ***
## bs(dis, knots = c(3, 6, 9))6 -0.303386   0.062631  -4.844 1.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0612 on 499 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7211
## F-statistic: 218.6 on 6 and 499 DF,  p-value: < 2.2e-16
```
```r
# plot the results
...
matlines( dis.grid,
...
          col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
```
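Most of this plotting code was elided; a minimal sketch that reuses `dis.grid` from above and matches the three-column `matlines()` call (solid fit plus dashed bands); `sp.pred` is an assumed name:

```r
sp.pred <- predict(m4, newdata = list(dis = dis.grid), se = TRUE)
plot(nox ~ dis, col = "darkgrey", bty = "n")
# fitted spline plus approximate 95% confidence band
matlines(dis.grid,
         cbind(sp.pred$fit,
               sp.pred$fit + 2 * sp.pred$se.fit,
               sp.pred$fit - 2 * sp.pred$se.fit),
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
```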

5. Now fit regression splines for a range of degrees of freedom, plot the resulting fits, and describe the results obtained.

We fit regression splines with degrees of freedom between 3 and 16.

```r
box <- NA
for (i in 3:16) {
...
}
box[-c(1, 2)]
```
```
##   1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653
##   1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546
```
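The loop body was elided above; a minimal sketch, assuming `box` stores the training RSS for each df (indices 1 and 2 are never filled, which is why the code drops them at the end):

```r
library(splines)
box <- NA
# regression spline for each df; record the training RSS
for (i in 3:16) {
  spline.fit <- lm(nox ~ bs(dis, df = i))
  box[i] <- sum(spline.fit$residuals^2)
}
box[-c(1, 2)]
```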

The next question uses the `College` data set in the `ISLR` package.

1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as predictors, perform stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
```r
rm(list = ls())
set.seed(1)
library(leaps)
library(ISLR)  # College data
attach(College)
```
```
## The following objects are masked from College (pos = 14):
##
##     Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private,
##     Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc
```
```r
# index numbers for the training / test split
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
...
```

The Cp, BIC, and adjusted R² scores all suggest that 6 is the best subset size. Applying a one-standard-error rule, however, we choose the model with four predictors.
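The selection code itself was elided; a minimal sketch using forward stepwise selection with `regsubsets()` from `leaps` (the object name `m5` matches the `coef()` call below; the Cp/BIC/adjusted R² plots are omitted):

```r
College.train <- College[train, ]
College.test <- College[test, ]

# forward stepwise selection on the training half
m5 <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17,
                 method = "forward")
m5.summary <- summary(m5)

# subset sizes favoured by each criterion
which.min(m5.summary$cp)
which.min(m5.summary$bic)
which.max(m5.summary$adjr2)
```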

```r
...
coefi <- coef(m5, id = 4)
names(coefi)
```

`##  "(Intercept)" "PrivateYes" "Room.Board" "perc.alumni" "Expend"`

2. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results and explain your findings.
```r
library(gam)
# formula as reported by summary(gam.fit) below
gam.fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
                 s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2),
               data = College.train)
plot(gam.fit, se = TRUE, col = "blue")
```

3. Evaluate the model obtained on the test set, and explain the results.
`````` 1. Evaluate the models obtained on the test set and interpret the results obtained.
```r
gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
```

`##  3745460`

```r
gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
# proportion of test-set variance explained (test R-squared)
test.rss <- 1 - gam.err / gam.tss
test.rss
```

`##  0.7696916`

4. For which variables, if any, is there evidence of a non-linear relationship with the response?
`summary(gam.fit)`
```
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
##     df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -4977.74 -1184.52    58.33  1220.04  7688.30
##
## (Dispersion Parameter for gaussian family taken to be 3300711)
##
##     Null Deviance: 6221998532 on 387 degrees of freedom
## Residual Deviance: 1231165118 on 373 degrees of freedom
## AIC: 6941.542
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)
## Private                  1 1779433688 1779433688 539.106 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1221825562 1221825562 370.171 < 2.2e-16 ***
## s(PhD, df = 2)           1  382472137  382472137 115.876 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  328493313  328493313  99.522 < 2.2e-16 ***
## s(Expend, df = 5)        1  416585875  416585875 126.211 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   55284580   55284580  16.749 5.232e-05 ***
## Residuals              373 1231165118    3300711
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2)        1  3.5562   0.06010 .
## s(PhD, df = 2)               1  4.3421   0.03786 *
## s(perc.alumni, df = 2)       1  1.9158   0.16715
## s(Expend, df = 5)            4 16.8636 1.016e-12 ***
## s(Grad.Rate, df = 2)         1  3.7208   0.05450 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The non-parametric ANOVA tests show strong evidence of a non-linear relationship between the response and `Expend`, and moderate evidence of non-linearity for `PhD` (p ≈ 0.04) and `Grad.Rate` (p ≈ 0.05).

preface Hello everyone, I’m Lin Sanxin. A lot of things have happened these days (I won’t say what’s specific). These things have scared me to treasure my collection these yearsVue knowledge pointsI took out my notes and tried my best to recall them. Finally, I realized these 50Knowledge points(let’s not be too vulgar. It’s not […]