# Prediction of air quality ozone data with a linear regression model in R

Time: 2022-01-15

Although linear models are among the simplest machine learning techniques, they remain powerful tools for prediction, not least because they are particularly easy to interpret. Here, I will discuss the most important aspects of interpreting linear models, using an ordinary least squares regression on the air quality data set as an example.

# Air quality data set

The air quality data set contains 153 daily measurements of the following four air quality indicators:

• Ozone: average ozone level, in parts per billion
• Solar.R: solar radiation, in Langleys
• Wind: average wind speed, in miles per hour
• Temp: maximum daily temperature, in degrees Fahrenheit

To clean up the data set, we remove all rows containing `NA` values, exclude the `Month` and `Day` columns, and keep only the outcome and the predictive variables.

```
data(airquality)
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
```

# Data exploration and preparation

The prediction tasks are as follows: can we predict the ozone level according to solar radiation, wind speed and temperature? To see whether the assumptions of the linear model are suitable for the data at hand, we will calculate the correlation between variables:

```
#Scatter matrix
plot(ozone)
```

```
#Correlation of paired variables
cors <- cor(ozone)
print(cors)
```

```
##              Ozone    Solar.R       Wind       Temp
## Ozone    1.0000000  0.3483417 -0.6124966  0.6985414
## Solar.R  0.3483417  1.0000000 -0.1271835  0.2940876
## Wind    -0.6124966 -0.1271835  1.0000000 -0.4971897
## Temp     0.6985414  0.2940876 -0.4971897  1.0000000
```

```
#Find highly correlated variable pairs, excluding self-correlations
high.cors <- which(abs(cors) > 0.5 & abs(cors) < 1, arr.ind = TRUE)
cor.names <- paste0(rownames(cors)[high.cors, 1],], "+",
                    colnames(cors)[high.cors[, 2]], ": ",
                    round(cors[high.cors], 2))
print(cor.names)
```

```
## [1] "Wind+Ozone: -0.61" "Temp+Ozone: 0.7"   "Ozone+Wind: -0.61"
## [4] "Ozone+Temp: 0.7"
```

Ozone is involved in two strong linear relationships:

• Ozone is positively correlated with temperature
• Ozone is negatively correlated with wind speed

This suggests that it should be possible to form a linear model predicting ozone levels from the remaining features.

# Splitting into training and test sets

We will take 70% of the samples for training and 30% of the samples for testing:

```
set.seed(123)
N.train <- ceiling(0.7 * nrow(ozone))
N.test <- nrow(ozone) - N.train
trainset <- sample(seq_len(nrow(ozone)), N.train)
testset <- setdiff(seq_len(nrow(ozone)), trainset)
```

# Fitting a linear model

To illustrate the most important aspects of interpreting linear models, we will train an ordinary least squares model on the training data.
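The fitting call itself is not shown in the original text; a fit consistent with the summary output below could look like this (the setup from the previous sections is repeated so the snippet runs on its own):

```r
data(airquality)  #setup from the previous sections
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
trainset <- sample(seq_len(nrow(ozone)), ceiling(0.7 * nrow(ozone)))
#Ordinary least squares fit using only the training samples
model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone, subset = trainset)
```

With 111 complete observations, the training set contains 78 samples, leaving 74 residual degrees of freedom for the four estimated coefficients.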

To interpret the model, we use the `summary` function:

```
model.summary <- summary(model)
print(model.summary)
```

```
##
## Call:
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -36.135 -12.670  -2.221   9.420  65.914
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65.76604   22.52381  -2.920 0.004638 **
## Solar.R       0.05309    0.02305   2.303 0.024099 *
## Temp          1.56320    0.25530   6.123 4.03e-08 ***
## Wind         -2.61904    0.68921  -3.800 0.000295 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.17 on 74 degrees of freedom
## Multiple R-squared:  0.5924, Adjusted R-squared:  0.5759
## F-statistic: 35.85 on 3 and 74 DF,  p-value: 2.039e-14
```

# Residuals

The first piece of information in the summary is the distribution of the residuals.

The residuals show that the model typically predicts ozone values slightly higher than the observed ones (the median residual is negative). However, the large maximum indicates that some predictions are far too low. Since the raw numbers are a bit abstract, let's plot the model's predictions against the observations:

```
res <- residuals(model)
pred <- predict(model)           #fitted values on the training set
obs <- ozone[trainset, "Ozone"]  #observed ozone levels
plot(obs, pred, xlab = "Observed ozone", ylab = "Predicted ozone")
#Draw each residual as a vertical segment from prediction to observation
segments(obs, pred, obs, pred + res)
```

# Coefficients

Now that we have considered the residuals, let's look at the coefficients. We can use the `coefficients` function to obtain the fitted coefficients of the model:

```
coefficients(model)
##  (Intercept)      Solar.R         Temp         Wind
## -65.76603538   0.05308965   1.56320267  -2.61904128
```

Notice that the intercept of the model is strongly negative. This is the value the model predicts when all independent variables are zero. The low coefficient for `Solar.R` is not surprising: solar radiation showed no strong correlation with ozone levels in our exploratory analysis, so it does not play an important role in the prediction. The coefficient for `Temp` indicates that ozone levels are high at high temperatures (because ozone forms faster then). The coefficient for `Wind` tells us that ozone levels drop when wind speeds are high (because the ozone is blown away).

The other values associated with the coefficients provide information about the statistical certainty of the estimates.

```
coef(model.summary)
##                 Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) -65.76603538 22.52380940 -2.919845 4.638426e-03
## Solar.R       0.05308965  0.02305379  2.302860 2.409936e-02
## Temp          1.56320267  0.25530453  6.122894 4.034064e-08
## Wind         -2.61904128  0.68920661 -3.800081 2.946349e-04
```

• `Std. Error` is the standard error of the coefficient estimate
• `t value` is the coefficient estimate divided by its standard error
• `Pr(>|t|)` is the p-value of a t-test, indicating the significance of the test statistic

### Standard errors

The standard error of a coefficient is the square root of its estimated variance, i.e. the square root of the corresponding diagonal entry of the coefficient variance-covariance matrix:

SE(β̂ⱼ) = sqrt( σ̂² · [(XᵀX)⁻¹]ⱼⱼ )

In R, the variance-covariance matrix is available via `vcov`, so the standard errors of the model estimates can be calculated in the following way:

```
print(vcov(model))
##              (Intercept)       Solar.R         Temp          Wind
## (Intercept) 507.32198977  2.893612e-02 -5.345957524 -9.940961e+00
## Solar.R       0.02893612  5.314773e-04 -0.001667748  7.495211e-05
## Temp         -5.34595752 -1.667748e-03  0.065180401  6.715467e-02
## Wind         -9.94096142  7.495211e-05  0.067154670  4.750058e-01
```

```
#Standard errors are the square roots of the diagonal entries
print(sqrt(diag(vcov(model))))
## (Intercept)     Solar.R        Temp        Wind
## 22.52380940  0.02305379  0.25530453  0.68920661
```

Now, you may wonder how `vcov` obtains these values. It multiplies the unscaled variance-covariance matrix by the estimated variance of the error:

```
#Reconstruct the variance-covariance matrix by hand
print(model.summary$sigma^2 * model.summary$cov.unscaled)
##              (Intercept)       Solar.R         Temp          Wind
## (Intercept) 507.32198977  2.893612e-02 -5.345957524 -9.940961e+00
## Solar.R       0.02893612  5.314773e-04 -0.001667748  7.495211e-05
## Temp         -5.34595752 -1.667748e-03  0.065180401  6.715467e-02
## Wind         -9.94096142  7.495211e-05  0.067154670  4.750058e-01
```

The variance used for this scaling is the estimated variance of the error, which is defined as

σ̂² = RSS / df = (1 / (n − p)) · Σᵢ rᵢ²

where the rᵢ are the residuals. The `cov.unscaled` entry of the summary is the unscaled variance-covariance matrix, (XᵀX)⁻¹:

```
#Obtain the design matrix (includes the intercept column) via 'model.matrix'
X <- model.matrix(model)
#Solve (X^T X) B = I to find the inverse of X^T X
unscaled.var.matrix <- solve(crossprod(X), diag(4))
print(paste("Is this the same?", isTRUE(all.equal(unscaled.var.matrix,
    model.summary$cov.unscaled, check.attributes = FALSE))))
```

``## [1] "Is this the same? TRUE"``

### T value

The t value is defined as the coefficient estimate divided by its standard error:

tⱼ = β̂ⱼ / SE(β̂ⱼ)

In R:

```
print(coef(model) / sqrt(diag(vcov(model))))
## (Intercept)     Solar.R        Temp        Wind
##   -2.919845    2.302860    6.122894   -3.800081
```

### P value

The p-value is calculated for each coefficient under the null hypothesis that βⱼ = 0. Under this hypothesis, the t value follows a t-distribution with the model's residual degrees of freedom:

``model.df <- df.residual(model)``

The degrees of freedom of a linear model are defined as df = n − p, where n is the number of samples and p is the number of features (including the intercept). The p-value represents the probability that a coefficient estimate at least this far from zero could have been obtained purely by chance. Therefore, low p-values indicate a significant relationship between a feature and the outcome.
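As a sketch of how these p-values arise, they can be reproduced by hand from the t values and the residual degrees of freedom using a two-sided t-test (the setup from earlier sections is repeated so the snippet runs on its own):

```r
data(airquality)  #setup from the previous sections
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
trainset <- sample(seq_len(nrow(ozone)), ceiling(0.7 * nrow(ozone)))
model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone, subset = trainset)
#Two-sided p-values from the t distribution with n - p degrees of freedom
model.df <- df.residual(model)
t.values <- coef(model) / sqrt(diag(vcov(model)))
p.values <- 2 * pt(-abs(t.values), df = model.df)
print(p.values)
```

The values agree with the `Pr(>|t|)` column of the summary's coefficient table.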

### Further statistics

The `summary` function provides the following additional statistics: the R-squared, the adjusted R-squared, and the F statistic.

### Residual standard deviation

As the name suggests, the _residual standard deviation_ is the square root of the residual sum of squares averaged over the residual degrees of freedom:

σ̂ = sqrt( RSS / df )

``sqrt(sum(residuals(model)^2) / df.residual(model))``

``## [1] 18.16979``

The _residual standard deviation_ indicates the typical size of a prediction error on the training data, in the units of the outcome. Here, the value is small relative to the spread of the observed ozone values, indicating a reasonable fit.

### R square

R² is the coefficient of determination. For least squares models with an intercept, it equals the square of the correlation between the fitted and the observed values:

R² = cor(ŷ, y)²

``print(model.summary$r.squared)``

``## [1] 0.5924073``

In contrast to the correlation, which lies in [−1, 1], R² lies in [0, 1].
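This identity can be checked directly on the trained model (the setup from earlier sections is repeated so the snippet runs on its own):

```r
data(airquality)  #setup from the previous sections
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
trainset <- sample(seq_len(nrow(ozone)), ceiling(0.7 * nrow(ozone)))
model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone, subset = trainset)
#R squared as the squared correlation of fitted vs observed values
r.squared.manual <- cor(fitted(model), ozone[trainset, "Ozone"])^2
print(r.squared.manual)
```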

The adjusted R² adjusts the R² for the complexity of the model:

R²ₐdⱼ = 1 − (1 − R²) · (n − 1) / (n − p)

where n is the number of observations and p is the number of coefficients (including the intercept). Therefore, the adjusted R² can be calculated as follows:

```
n <- length(trainset)             #number of training samples
p <- length(coefficients(model))  #number of coefficients, incl. intercept
adj.r.squared <- 1 - (1 - model.summary$r.squared) * (n - 1) / (n - p)
print(adj.r.squared)
## [1] 0.5758832
```

If there is a considerable difference between the R² and the adjusted R², this suggests that reducing the feature space should be considered.

### F statistic

The F statistic is defined as the ratio of explained to unexplained variance. For regression, the F statistic always compares two nested models, where model 1 (with p₁ coefficients) uses a subset of the features of model 2 (with p₂ coefficients):

F = ( (RSS₁ − RSS₂) / (p₂ − p₁) ) / ( RSS₂ / (n − p₂) )

The F statistic describes the degree to which the predictive performance of model 2 (in terms of RSS) exceeds that of model 1. The F statistic reported by default refers to the difference between the trained model and the intercept-only model:

```
##
## Call:
##
## Coefficients:
## (Intercept)
##       36.76
```

The null hypothesis tested is therefore that the intercept-only model fits the data as well as the specified model. If the null hypothesis can be rejected, the specified model provides a better fit than the intercept-only model.

Let's verify this by computing the F statistic by hand:

```
rss <- function(model) {
    sum(residuals(model)^2)  #residual sum of squares
}
#F statistic for two nested models (model1 nested within model2)
f.statistic <- function(model1, model2) {
    p1 <- length(coefficients(model1))
    p2 <- length(coefficients(model2))
    n <- length(residuals(model2))
    ((rss(model1) - rss(model2)) / (p2 - p1)) / (rss(model2) / (n - p2))
}
#Compare the intercept-only model with the 3-feature model
null.model <- lm(Ozone ~ 1, data = ozone, subset = trainset)
f.statistic(null.model, model)
```

``## [1] 35.85126``

In this case, the F statistic is large, which shows that the trained model is clearly superior to the intercept-only model.
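For reference, R's built-in `anova` function performs the same nested-model comparison and reports an identical F statistic (the setup from earlier sections is repeated so the snippet runs on its own):

```r
data(airquality)  #setup from the previous sections
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
trainset <- sample(seq_len(nrow(ozone)), ceiling(0.7 * nrow(ozone)))
model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone, subset = trainset)
null.model <- lm(Ozone ~ 1, data = ozone, subset = trainset)
#anova performs an F test between the two nested models
print(anova(null.model, model))
```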

## Confidence intervals

Confidence intervals are a useful tool for interpreting linear models. By default, `confint` computes the 95% confidence interval (approximately the estimate ± 1.96 · SE):

``ci <- confint(model) ``

```
##                (Intercept)                    Solar.R
## "95% CI: [-110.65,-20.89]"       "95% CI: [0.01,0.1]"
##                       Temp                       Wind
##      "95% CI: [1.05,2.07]"    "95% CI: [-3.99,-1.25]"
```

These values show that the model's estimate of the intercept is quite uncertain. This may indicate that more data are needed to obtain a better fit.
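The intervals reported by `confint` can be reproduced by hand as estimate ± t(α/2, df) · SE (the setup from earlier sections is repeated so the snippet runs on its own):

```r
data(airquality)  #setup from the previous sections
ozone <- subset(na.omit(airquality),
                select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
trainset <- sample(seq_len(nrow(ozone)), ceiling(0.7 * nrow(ozone)))
model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone, subset = trainset)
#Manual 95% confidence intervals: estimate +/- t * SE
t.crit <- qt(0.975, df = df.residual(model))
est <- coef(model)
se <- sqrt(diag(vcov(model)))
ci.manual <- cbind(lwr = est - t.crit * se, upr = est + t.crit * se)
print(ci.manual)
```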

## Retrieving confidence and prediction intervals for predictions

By supplying the `interval` argument to `predict`, the predictions of a linear model can be turned into intervals. These intervals quantify the confidence in the predicted values. There are two types of intervals: confidence intervals and prediction intervals. Let's apply the model to the test set with different values of the `interval` parameter to see the difference between the two interval types:

```
#Calculate the confidence intervals (CI) of the predictions
preds.ci <- predict(model, newdata = ozone[testset,], interval = "confidence")
#Calculate the prediction intervals (PI) of the predictions
preds.pi <- predict(model, newdata = ozone[testset,], interval = "prediction")
```

```
##      fit                 lwr                 upr                Method
## [1,] "-4.42397219873667" "-13.4448767773931" "4.59693237991976" "CI"
## [2,] "-22.0446990408131" "-61.0004555440645" "16.9110574624383" "PI"
```

The confidence intervals are narrow, while the prediction intervals are wide. Their bounds depend on the confidence level specified via the `level` parameter (default: 0.95).

Their definitions differ slightly. Given a new observation x, the CI and PI are defined as follows:

CI: ŷ ± t(α/2, df) · σ̂err · sqrt( 1/n + (x − x̄)² / ((n − 1) · σ²ₓ) )

PI: ŷ ± t(α/2, df) · σ̂err · sqrt( 1 + 1/n + (x − x̄)² / ((n − 1) · σ²ₓ) )

where t(α/2, df) is the t value with df degrees of freedom at significance level α, σ̂err is the standard error of the residuals, σ²ₓ is the variance of the independent feature, and x̄ denotes the mean value of the feature. The extra term of 1 under the square root is what makes the prediction interval wider: it accounts for the variance of an individual new observation rather than only the uncertainty of the estimated mean.
