ARIMA model for predicting time series of CO2 concentration

Time:2021-4-3

Link to the original text:http://tecdat.cn/?p=20424 

introduce

Time series provide a method for forecasting future data. Based on the previous values, time series can be used to predict economic, weather trends. The specific attributes of time series data mean that special statistical methods are usually needed.

In this tutorial, we will first introduce and discuss the concepts of autocorrelation, stationarity and seasonality, and then continue to apply one of the most commonly used time series forecasting methods, called Arima.

One method available in Python for modeling and predicting future points of time series is calledSARIMAXWhich meansWith seasonal regressionOfSeasonal autoregressive comprehensive moving average. Here, we will focus on ARIMA, which is used to fit time series data to better understand and predict future points in time series.

To make the most of this tutorial, familiarity with time series and statistics may help.

In this tutorial, we will use theJupyter NotebookProcessing data.

Step 1 – install the package

To set up our time series prediction environment:

cd environments
 
. my_env/bin/activate

From here, create a new directory for our project.

mkdir ARIMA
cd ARIMA 

Now let’s install itstatsmodelsAnd data drawing software packagematplotlib

pip install pandas numpy statsmodels matplotlib 

Step 2 – Import package and load data

To get started with our data, we’ll start jupyter Notebook:

To create a new notebook file, select “create new notebook file” from the drop-down menu at the top right“New”  >“  Python 3 ”:

ARIMA model for predicting time series of CO2 concentration

First, import the required Library:

import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

We will use the CO2 data set, which collected CO2 samples from March 1958 to December 2001. We can import these data as follows:

y = data.data

Let’s do some preprocessing of the data. It’s troublesome to process the data every week, because the time is short, so let’s use the monthly average. We can also use the fillna() function to ensure that there are no missing values in the time series.

#The "Ms" string groups the data into storage by the beginning of the month
y = y['co2'].resample('MS').mean()

#Fill in missing values
y = y.fillna(y.bfill())

Output
co2
1958-03-01  316.100000
1958-04-01  317.200000
1958-05-01  317.433333
...
2001-11-01  369.375000
2001-12-01  371.020000

Let’s explore this time series with data visualization:

plt.show()

ARIMA model for predicting time series of CO2 concentration

When we plot the data, we can find that the time series have obvious seasonal patterns, and the overall trend is upward.

Now, we continue to use Arima for time series prediction.

Step 3 – ARIMA time series model

The most common method used in time series prediction is called ARIMA model. Arima is a model that can fit time series data to better understand or predict future points in the series.

There are three different integers(p,  d,  q)It is used to parameterize ARIMA model. Therefore, ARIMA model is represented by symbolARIMA(p, d, q). These three parameters together explain the seasonality, trend and noise in the dataset

  • pIt’s a model_ Autoregression_ Part. It enables us to incorporate the effects of past values into the model. Intuitively, it’s like saying that if the weather has been warm for the past three days, it might be warm tomorrow.
  • dIt is the difference part of the model. Contains the difference component to be applied to the time series (that is, the number of past time points to be subtracted from the current value). Intuitively, this is similar to if the temperature difference in the last three days is very small, the temperature may be the same tomorrow.
  • qIt’s a model_ Moving average_ part. This allows us to set the error of the model as a linear combination of the error values observed in the past at previous time points.

When dealing with seasonal impacts, we use_ Seasonality_ ARIMA (expressed as)ARIMA(p,d,q)(P,D,Q)s. Here(p, d, q)It’s a non seasonal parameter, though(P, D, Q)Follow the same definition, but apply to the seasonal component of the time series. The termsIt is the periodicity of time series(4Quarter12Each year).

In the next section, we will describe how to automatically identify the best parameters for the seasonal ARIMA time series model.

Step 4 time series parameter selection of ARIMA model

When we want to use the seasonal ARIMA model to fit the time series data, our primary goal is to find the best wayARIMA(p,d,q)(P,D,Q)sThe value of objective index can be optimized. There are many guidelines and best practices to achieve this goal, but the correct parameterization of ARIMA model can be a hard manual process, requiring domain expertise and time. Other statistical programming languages (for example)RProvides automated ways to solve this problem, but these methods have not yet been ported to python. In this section, we will programmatically select by writing Python codeARIMA(p,d,q)(P,D,Q)sThe best parameter value of time series model is used to solve this problem.

We will use “grid search” to iteratively explore different combinations of parameters. For each parameter combination, we use theSARIMAX()Fitting the new seasonal ARIMA model. After exploring the whole parameter range, our best parameter set will become a set of parameters that produce the best performance. Let’s first generate the various parameter combinations we want to evaluate:

#Define the P, D, and Q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

#Generate all the different combinations of P, Q, and Q triples
pdq = list(itertools.product(p, d, q))

#Generate all the different seasonal P, Q and Q combinations
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

Output
Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 12)
SARIMAX: (0, 0, 1) x (0, 1, 0, 12)
SARIMAX: (0, 1, 0) x (0, 1, 1, 12)
SARIMAX: (0, 1, 0) x (1, 0, 0, 12)

Now, we can use the triple parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In statistics and machine learning, this process is called grid search (or hyper parameter optimization) for model selection.

When evaluating and comparing statistical models with different parameters, each model can be ranked according to its degree of fitting data or its ability to accurately predict future data points. We will useAIC(Akaike information criterion) value, which can be easily returned by using the fitted ARIMA modelstatsmodelsAICWhile considering the overall complexity of the model, the degree of data fitting is measured. Compared with the model using less features to achieve the same goodness of fit, the model using more features will get higher AIC score. So we look for the lowest yieldAICIt’s a good model.

The following code block iterates through a combination of parameters and uses theSARIMAXFunctionstatsmodelsTo fit the corresponding season ARIMA model. Fit eachSARIMAX()After modeling, the code outputs their respectiveAICScore.

warnings.filterwarnings ("ignore")? Specifies that the warning message is ignored


        try:
            mod = sm.tsa.statespace.SARIMAX(y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

The above code produces the following result

Output
SARIMAX(0, 0, 0)x(0, 0, 1, 12) - AIC:6787.3436240402125
SARIMAX(0, 0, 0)x(0, 1, 1, 12) - AIC:1596.711172764114
SARIMAX(0, 0, 0)x(1, 0, 0, 12) - AIC:1058.9388921320026
SARIMAX(0, 0, 0)x(1, 0, 1, 12) - AIC:1056.2878315690562
SARIMAX(0, 0, 0)x(1, 1, 0, 12) - AIC:1361.6578978064144
SARIMAX(0, 0, 0)x(1, 1, 1, 12) - AIC:1044.7647912940095
...
...
...
SARIMAX(1, 1, 1)x(1, 0, 0, 12) - AIC:576.8647112294245
SARIMAX(1, 1, 1)x(1, 0, 1, 12) - AIC:327.9049123596742
SARIMAX(1, 1, 1)x(1, 1, 0, 12) - AIC:444.12436865161305
SARIMAX(1, 1, 1)x(1, 1, 1, 12) - AIC:277.7801413828764

The output of the code shows thatSARIMAX(1, 1, 1)x(1, 1, 1, 12)TheAICThe lowest value was 277.78. Therefore, among all the models we consider, we should consider it as the best choice.

Step 5 – fitting ARIMA time series model

Using grid search, we identify a set of parameters that generate the best fit model for our time series data. We can continue to analyze this model in more depth.

We will insert the best parameter value from theSARIMAXModel start:

 results = mod.fit() 

Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3182      0.092      3.443      0.001       0.137       0.499
ma.L1         -0.6255      0.077     -8.165      0.000      -0.776      -0.475
ar.S.L12       0.0010      0.001      1.732      0.083      -0.000       0.002
ma.S.L12      -0.8769      0.026    -33.811      0.000      -0.928      -0.826
sigma2         0.0972      0.004     22.634      0.000       0.089       0.106
==============================================================================

summaryAttribute generated by output resultSARIMAXA lot of information was returned, but we focused on the coefficients.  coefColumns show the weight (i.e., importance) of each function and how each function affects the time series.  P>|z|Columns tell us the importance of each feature weight. Here, the p value of each weight is less than or close to0.05So it is reasonable to keep all the weights in our model.

When fitting the seasonal ARIMA model, it is important to run the model diagnostic program to ensure that the assumptions made by the model are not violated.

plt.show()

ARIMA model for predicting time series of CO2 concentration

Our main concern is to ensure that the residuals of the model are uncorrelated and zero mean normal distribution. If the seasonal ARIMA model does not meet these attributes, it can be further improved.

In this case, our model diagnosis suggests normal distribution of model residuals according to the following contents:

  • In the upper right corner, we see the red lineKDEClose toN(0,1)Red lineN(0,1))It’s the mean0And the standard deviation is a normal distribution. This shows that the residuals are normally distributed.
  • Bottom leftQQ chartIt is shown that the distribution of residuals (blue dots) follows from the standard normal distribution. It also shows that the residuals are normally distributed.
  • The residuals over time (top left) do not show any significant seasonal variation, but white noise. This is confirmed by the autocorrelation graph at the bottom right, which shows that the time series residuals have low correlation.

These observations lead us to conclude that our model produces a satisfactory fit, which can help us understand time series data and predict future value.

Although we have a satisfactory fit, some parameters of seasonal ARIMA model can be changed to improve the fit. Therefore, if we expand the scope of grid search, we may find a better model.

Step 6 – validate predictions

We have obtained models for time series, which can now be used to generate predictions. We first compare the predicted value with the actual value of the time series, which will help us understand the accuracy of the prediction.

pred_ci = pred.conf_int()

The code above indicates that the forecast starts in January 1998.

We can plot the actual and predicted values of CO2 time series to evaluate our effect.

 ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)


plt.show()

ARIMA model for predicting time series of CO2 concentration

Overall, our forecast is very consistent with the real value, showing an overall growth trend.

It’s also useful to quantify the accuracy of our predictions. We will use MSE (mean square error) to summarize the average error of our predictions. For each predicted value, we calculate its difference from the true value and square the result. The results are squared and the positive / negative differences do not cancel each other when calculating the population mean.

y_truth = y['1998-01-01':]

#Calculation of mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()

Output
The mean square error of our prediction is 0.07

The MSE we predicted one step ahead of time is0.07Because it’s close to zero, so it’s very low. If MSE is 0, it is ideal to estimate the observed values of parameters with ideal accuracy, but it is usually impossible.

However, the use of dynamic forecasting can better express our real prediction ability. In this case, we only use the information up to a specific point in the time series, and then use the values in the previous prediction time points to generate the prediction.

In the following code block, we specify that the dynamic prediction and confidence interval should be calculated from January 1998.

By plotting the observed and predicted values of the time series, we can see that the overall prediction is accurate even if dynamic prediction is used. All the predicted values (red line) are very close to the real situation (blue line) and are within the confidence interval of our prediction.

We calculate MSE again to quantify the effect of prediction

#Extract the predicted value and real value of time series
y_forecasted = pred_dynamic.predicted_mean
y_truth = y['1998-01-01':]

#Calculation of mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

Output
The mean square error of our prediction is 1.01

The MSE of the predicted value from dynamic prediction is 1.01. This is slightly higher than the previous one, which can be expected, because we rely on less historical data of time series.

One step ahead and dynamic prediction confirm the validity of the time series model. However, the interest of time series prediction is to be able to predict the future value in advance.

Step 7 – generate and visualize predictions

Finally, we describe how to use the seasonal ARIMA time series model to predict future data.

#Get a forecast for the next 500 steps
pred_uc = results.get_forecast(steps=500)

#Obtain the confidence interval of the prediction
pred_ci = pred_uc.conf_int()

ARIMA model for predicting time series of CO2 concentration

We can use the output of this code to plot the time series and predict its future value.

 ARIMA model for predicting time series of CO2 concentration

Now, the prediction and the confidence interval can be used to further understand the time series and predict the expected results. Our forecast shows that the time series is expected to continue to grow steadily.

As we further predict the future, the confidence interval will be larger and larger.

conclusion

In this tutorial, we describe how to implement the seasonal ARIMA model in Python. It shows how to diagnose the model and how to generate the prediction of carbon dioxide time series.

You can try some other things:

  • Change the start date of the dynamic forecast to understand how this affects the overall quality of the forecast.
  • Try more parameter combinations to see if you can improve the goodness of fit of the model.
  • Select other indicators to select the best model. For example, we use theAICFind the best model.
    • *

ARIMA model for predicting time series of CO2 concentration

Most popular insights

1.Time series prediction using LSTM and python in Python

2.Time series prediction analysis based on long term and short term memory model LSTM in Python

3.Using R language to analyze time series (ARIMA, exponential smoothing)

4.R language multivariate copula GARCH model time series prediction

5.R language copulas and financial time series cases

6.Using R language stochastic volatility model SV to deal with stochastic volatility in time series

7.Tar threshold autoregressive model of R language time series

8.R language K-shape time series clustering method for stock price time series clustering

9.Time series prediction with ARIMA model in Python 3

Recommended Today

CentOS makes document server based on nginx

CentOS makes document server based on nginx Basic environment for installation yum install gcc-c++ yum install -y pcre pcre-devel yum install -y zlib zlib-devel yum install -y openssl openssl-devel Find the latest stable version of nginx on the official website https://nginx.org/en/download.html Stable version: https://nginx.org/download/nginx-1.18.0.tar.gz cd /home wget -c https://nginx.org/download/nginx-1.18.0.tar.gz Decompress compressed package data tar -zxvf […]