Prediction of air quality ozone data with linear regression model in R language

Time:2022-1-15

Original link:http://tecdat.cn/?p=11387

Although linear models are one of the simplest machine learning techniques, they are still powerful tools for prediction. This is especially because linear models are particularly easy to explain this fact. Here, I will discuss the most important aspects of interpreting linear models using ordinary least squares regression examples of air quality data sets.

Air quality data set

The air quality data set contains 154 measurements of the following four air quality indicators:

  • Ozone: average ozone level, in parts per billion
  • Solar. R: Solar radiation
  • Wind: average wind speed, miles per hour
  • Temperature: maximum daily temperature in degrees Fahrenheit

We will delete all byNAAnd excludeMonthAndDayColumn to clean up the dataset and select some predictive variables.

data(airquality)
ozone <- subset(na.omit(airquality), 
        select = c("Ozone", "Solar.R", "Wind", "Temp"))

Data exploration and preparation

The prediction tasks are as follows: can we predict the ozone level according to solar radiation, wind speed and temperature? To see whether the assumptions of the linear model are suitable for the data at hand, we will calculate the correlation between variables:

#Scatter matrix
plot(ozone)

 Prediction of air quality ozone data with linear regression model in R language

#Correlation of paired variables
cors <- cor(ozone)
print(cors)

##              Ozone    Solar.R       Wind       Temp
## Ozone    1.0000000  0.3483417 -0.6124966  0.6985414
## Solar.R  0.3483417  1.0000000 -0.1271835  0.2940876
## Wind    -0.6124966 -0.1271835  1.0000000 -0.4971897
## Temp     0.6985414  0.2940876 -0.4971897  1.0000000

#Which variables are highly correlated and exclude autocorrelation

print(cor.names)

## [1] "Wind+Ozone: -0.61" "Temp+Ozone: 0.7"   "Ozone+Wind: -0.61"
## [4] "Ozone+Temp: 0.7"

Since ozone is involved in two linear interactions, namely:

  • Ozone is positively correlated with temperature
  • Ozone is negatively correlated with wind speed

This suggests that it should be possible to use the remaining features to form a linear model for predicting ozone levels.

It is divided into training and test sets

We will take 70% of the samples for training and 30% of the samples for testing:

set.seed(123)
N.train <- ceiling(0.7 * nrow(ozone))
N.test <- nrow(ozone) - N.train
trainset <- sample(seq_len(nrow(ozone)), N.train)
testset <- setdiff(seq_len(nrow(ozone)), trainset)

Study linear model

In order to illustrate the most important aspects of interpreting linear models, we will train the ordinary least squares model of training data in the following ways:

To explain the model, we use the followingsummaryFunction:

model.summary <- summary(model)
print(model.summary)

## 
## Call:

## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.135 -12.670  -2.221   9.420  65.914 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -65.76604   22.52381  -2.920 0.004638 ** 
## Solar.R       0.05309    0.02305   2.303 0.024099 *  
## Temp          1.56320    0.25530   6.123 4.03e-08 ***
## Wind         -2.61904    0.68921  -3.800 0.000295 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.17 on 74 degrees of freedom
## Multiple R-squared:  0.5924, Adjusted R-squared:  0.5759 
## F-statistic: 35.85 on 3 and 74 DF,  p-value: 2.039e-14

residual

The first piece of information we get is the residual.

The residual value shows that the ozone value usually predicted by the model is slightly higher than the observed value. However, the large maximum indicates that some outlier predictions are too low. Looking at the numbers may be a bit abstract, so let’s draw the prediction of the model based on the observations:

res <- residuals(model)

#Add residuals to the graph
segments(obs, pred, obs, pred + res)

Prediction of air quality ozone data with linear regression model in R language

coefficient

Now that we know the residuals, let’s look at the coefficients. We can use thiscoefficientsFunction to obtain the fitting coefficient of the model:

##  (Intercept)      Solar.R         Temp         Wind 
## -65.76603538   0.05308965   1.56320267  -2.61904128

Notice that the intercept value of the model is very low. This is the value predicted by the model when all independent values are zero. Low coefficientSolar.RIt is not surprising that solar radiation does not play an important role in predicting ozone levels, because it has no significant correlation with ozone levels in our exploratory analysis. CoefficientTempIndicates that the ozone level is high at high temperature (because ozone forms faster). CoefficientWindTell us that the ozone level will decrease when the wind speed is fast (because the ozone will be blown away).

Other values associated with the coefficients provide information about the statistical certainty of the estimate.

##                 Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) -65.76603538 22.52380940 -2.919845 4.638426e-03
## Solar.R       0.05308965  0.02305379  2.302860 2.409936e-02
## Temp          1.56320267  0.25530453  6.122894 4.034064e-08
## Wind         -2.61904128  0.68920661 -3.800081 2.946349e-04

  • Std. ErrorIs the standard error of coefficient estimation
  • t valueThe value of the coefficient is expressed as standard error
  • Pr(>|t|)Is the p value of t-test, indicating the importance of test statistics

Standard error

The standard error of the coefficient is defined as the standard deviation of the characteristic variance:

Prediction of air quality ozone data with linear regression model in R language

In R, the standard error of model estimation can be calculated in the following way:

##              (Intercept)       Solar.R         Temp          Wind
## (Intercept) 507.32198977  2.893612e-02 -5.345957524 -9.940961e+00
## Solar.R       0.02893612  5.314773e-04 -0.001667748  7.495211e-05
## Temp         -5.34595752 -1.667748e-03  0.065180401  6.715467e-02
## Wind         -9.94096142  7.495211e-05  0.067154670  4.750058e-01

## (Intercept)     Solar.R        Temp        Wind 
## 22.52380940  0.02305379  0.25530453  0.68920661

Now, you may want to know these valuesvcov 。 It is defined as the variance covariance matrix of the matrix, which is standardized according to the variance of the error:

##              (Intercept)       Solar.R         Temp          Wind
## (Intercept) 507.32198977  2.893612e-02 -5.345957524 -9.940961e+00
## Solar.R       0.02893612  5.314773e-04 -0.001667748  7.495211e-05
## Temp         -5.34595752 -1.667748e-03  0.065180401  6.715467e-02
## Wind         -9.94096142  7.495211e-05  0.067154670  4.750058e-01

The variance of the variance covariance matrix used for standardization is the estimated variance of the error, which is defined as

Prediction of air quality ozone data with linear regression model in R language

 cov.unscaledThe parameter is the variance covariance matrix:

#Via 'model 'matrix 'as intercept
X <- model.matrix(model) # design matrix
#Solve x ^ t% *% x = I and find the inverse of x ^ t * X
unscaled.var.matrix <- solve(crossprod(X), diag(4))
print(paste("Is this the same?", isTRUE(all.equal(unscaled.var.matrix, model.summary$cov.unscaled, check.attributes = FALSE))))

## [1] "Is this the same? TRUE"

T value

The value of T is defined as

Prediction of air quality ozone data with linear regression model in R language

In R

## (Intercept)     Solar.R        Temp        Wind 
##   -2.919845    2.302860    6.122894   -3.800081

P value

In all coefficients β Calculate the p value under the assumption of I = 0. The T value follows the t distribution

model.df <- df.residual(model)

freedom. The degree of freedom of the linear model is defined as

Prediction of air quality ozone data with linear regression model in R language

Where n is the number of samples and P is the characteristic number (including inctercept). The p value represents the possibility that the obtained coefficient estimation is purely accidental and different from zero. Therefore, low P values indicate a significant correlation between variables and results.

Further statistics

ShouldsummaryThe function provides the following additional statistics: R-square, adjusted R-square, and F statistics.  

Residual standard deviation

As the name suggests_ Residual standard deviation_ Is the square root of the average RSS (MSE) of the model:

## [1] 18.16979

_ Residual standard deviation_ Represents only the average accuracy of the model. In this case, the value is very low, indicating that the model has good fitting.

R square

R represents the coefficient of determination. It is defined as the square of the correlation between the estimated value and the observed results:

## [1] 0.5924073

In contrast to the correlation in [- 1,1], R squared is in [0,1].

Adjusted R-square

The adjusted R-square value will adjust the R-square according to the complexity of the model:

Prediction of air quality ozone data with linear regression model in R language

Where n is the observation number and P is the characteristic number. Therefore, the adjusted R-square can be calculated as follows:

N < - length (trainset) # sample number

print(r.squared.adj)

## [1] 0.5758832

If there is a considerable difference between the R square and the adjusted R square, it indicates that reducing the feature space can be considered.

F statistics

F statistic is defined as the ratio of explained variance to unexplained variance. For regression, the F statistic always indicates the difference between the two models, where model 1 (P1) is defined by the feature subset of model 2 (P2):

Prediction of air quality ozone data with linear regression model in R language

The F statistic describes the degree to which the prediction performance of model 2 (in terms of RSS) is better than that of model 1. The default F statistic reported refers to the difference between the trained model and the intercept only model:

## 
## Call:
## 
## Coefficients:
## (Intercept)  
##       36.76

 Prediction of air quality ozone data with linear regression model in R language

Therefore, the null hypothesis tested is the only intercept – the fitting of the model is equal to the specified model. If the original hypothesis can be rejected, it means that the specified model has better fitting than the original model.

Let’s get this idea by hand:

rss <- function(model) {
    return(sum(model$residuals^2))
}

#Compare intercept only model with 3 models
f.statistic(null.model, model)

## [1] 35.85126

In this case, the F statistic has a large value, which shows that our trained model is obviously better than the intercept only model.

confidence interval

Confidence intervals are useful tools for interpreting linear models. By defaultconfintCalculate the 95% confidence interval (± 1.96 σ^ ±1.96 σ^):

ci <- confint(model) 

##                (Intercept)                    Solar.R 
## "95% CI: [-110.65,-20.89]"       "95% CI: [0.01,0.1]" 
##                       Temp                       Wind 
##      "95% CI: [1.05,2.07]"    "95% CI: [-3.99,-1.25]"

These values indicate that the estimation of intercept by the model is uncertain. This may indicate that more data is needed to obtain a better fit.

Retrieve the confidence and prediction interval of the estimated value

By providing fromintervalVariable, which can convert the prediction of linear model into interval. These intervals give confidence in the predicted value. There are two types of intervals: confidence interval and prediction interval. Let’s apply the model to the test set, using different parameters asintervalParameter to view the difference between the two interval types:

#Calculate the confidence interval (CI) of the forecast
preds.ci <- predict(model, newdata = ozone[testset,], interval = "confidence")

##      fit                 lwr                 upr                Method
## [1,] "-4.42397219873667" "-13.4448767773931" "4.59693237991976" "CI"  
## [2,] "-22.0446990408131" "-61.0004555440645" "16.9110574624383" "PI"

The confidence interval is a narrow interval and the prediction interval is a wide interval. Their values are based onlevelParameter specifies the level of importance provided (default: 0.95).

Their definitions are slightly different. Given the new observations x, CI and PI are defined as follows

Prediction of air quality ozone data with linear regression model in R language

Where t α/ 2,dft α/ 2, DF is DF = 2 degrees of freedom and the significance level is α The value of T, σ Err is the standard error of the residual, σ 2x σ X2 is the variance of independent features, and X (x) represents the average value of features.


Prediction of air quality ozone data with linear regression model in R language

Most popular insights

1.Application case of multiple logistic regression in R language

2.Implementation of panel smooth transfer regression (PSTR) analysis case

3.Partial least squares regression (PLSR) and principal component regression (PCR) in MATLAB

4.Case study of Poisson Poisson regression model in R language

5.Hosmer lemeshow goodness of fit test in R language regression

6.Implementation of lasso regression, ridge ridge regression and elastic net model in R language

7.Implementation of logistic regression in R language

8.Python uses linear regression to predict stock price

9.How does R language calculate IDI and NRI in survival analysis and Cox regression