# Time series analysis in R: ARIMA and ARCH/GARCH models for stock prices

Original source: Tuoduan data tribe official account (tecdat)

# Introduction

Time series analysis is a major branch of statistics. It focuses on analyzing data sets to study the characteristics of the data and extract meaningful statistics for predicting future values of the series. There are two approaches to time series analysis: frequency domain and time domain. The former is based mainly on the Fourier transform, while the latter studies the autocorrelation of the series and uses Box-Jenkins and ARCH/GARCH methods to forecast it.

This article walks through the time-domain approach to analyzing and modeling financial time series in R. The first part covers stationary time series. The second part provides guidance on ARIMA and ARCH/GARCH modeling. We then study the combined model and its performance and effectiveness in modeling and forecasting time series. Finally, the methods are summarized.

Stationarity and differencing of time series data sets:

### 1. Stationarity:

The first step in modeling time series data is to convert a non-stationary series into a stationary one. This matters because many statistical and econometric methods rest on this assumption and can only be applied to stationary time series. Non-stationary series are erratic and unpredictable, while a stationary process is mean-reverting: it fluctuates around a constant mean with constant variance. Stationarity and the independence of random variables are also closely related, because many results that hold for independent random variables also hold for stationary time series. Most of these methods assume that the random variables are independent (or uncorrelated), that the noise is independent (or uncorrelated), and that the variables and the noise are independent (or uncorrelated) of each other. So what is a stationary time series?

Roughly speaking, a stationary time series has no long-term trend, and its mean and variance remain constant. More precisely, there are two definitions of stationarity: weak stationarity and strict stationarity.

a. Weak stationarity: the time series {Xt, t ∈ Z} (where Z is the set of integers) is weakly stationary if it has a constant mean, a constant variance, and an autocovariance function that depends only on the lag t − s (not on t or s individually)

b. Strict stationarity: the time series {Xt, t ∈ Z} is strictly stationary if the joint distribution of (Xt1, Xt2, …, Xtk) is the same as that of (Xt1+h, Xt2+h, …, Xtk+h) for every k, every choice of times t1, …, tk, and every shift h.

Generally, in the statistical literature, stationarity refers to weak stationarity: the series satisfies three conditions — constant mean, constant variance, and an autocovariance function that depends only on t − s (not on t or s individually). Strict stationarity, in contrast, means that the probability distribution of the time series does not change over time.

For example, white noise is stationary: its random variables are uncorrelated, but not necessarily independent. Strict white noise, however, requires independence between the variables. Moreover, because a Gaussian distribution is characterized by its first two moments, Gaussian white noise is strictly stationary; for Gaussian variables, uncorrelatedness implies independence.

In strict white noise, the noise term {et} cannot be predicted either linearly or nonlinearly. In general white noise, it cannot be predicted linearly, but it may be predicted nonlinearly, for example by the ARCH/GARCH models discussed later. Three points are worth noting:

• strict stationarity does not imply weak stationarity, because it does not require a finite variance

• weak stationarity does not imply strict stationarity, because strict stationarity requires that the probability distribution not change over time

• a nonlinear function of a strictly stationary series is also strictly stationary; the same does not hold for weak stationarity

### 2. Differencing:

To convert a non-stationary series into a stationary one, we can difference it, subtracting the lagged series from the original series: ∇Xt = Xt − Xt−1.

In financial time series, the series is usually transformed first and then differenced. Financial series typically exhibit exponential growth, so a log transformation straightens (linearizes) the series, and differencing then helps stabilize its variance. Here is an example using Apple’s stock price:

• the top-left chart is the original Apple stock price series from January 1, 2007 to July 24, 2012, showing exponential growth.

• the bottom-left chart shows the differenced Apple stock price. The variance of this series grows with the level of the original series, so it is not stationary.

• the top-right chart shows the log Apple price. Compared with the original series, this series is much more linear.

• the bottom-right chart shows the differenced log Apple price. This series looks much more mean-reverting, with a roughly constant variance that no longer changes with the level of the original series.

To perform the difference in R, perform the following steps:

• read the data file in R and store the closing price in a variable

`appl.close=appl$adjclose  # read and store the adjusted closing price from the file`

• plot the original stock price

`plot(appl.close,type='l')`

• difference the original series

`diff.appl=diff(appl.close)`

• plot the differenced series

`plot(diff.appl,type='l')`

• take the logarithm of the original series and plot the log price

`log.appl=log(appl.close)`

• difference the log price and plot it

`difflog.appl=diff(log.appl)`

The difference in log prices represents returns, similar to the percentage change in stock prices.
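The steps above can be collected into one short, self-contained R sketch. The price series is simulated here (the article reads `appl.close` from a file), so the numbers are illustrative only:

```r
# Simulated stand-in for the Apple closing-price series (appl.close in the article)
set.seed(1)
appl.close <- 100 * exp(cumsum(rnorm(500, mean = 0.0005, sd = 0.02)))

log.appl     <- log(appl.close)    # log transformation linearizes exponential growth
difflog.appl <- diff(log.appl)     # differencing the log price gives log returns

# log returns approximate percentage price changes for small moves
head(difflog.appl)
```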

# ARIMA model:

### Model identification:

The time-domain method is built around autocorrelation: autocorrelation and partial autocorrelation are at the core of the ARIMA model. The Box-Jenkins method provides a way to identify an ARIMA model from the autocorrelation and partial autocorrelation plots of the series. An ARIMA model has three parameters: p (the autoregressive order), d (the number of differences), and q (the moving average order).

There are three rules for identifying ARIMA model:

• if the ACF (autocorrelation plot) cuts off after lag n and the PACF (partial autocorrelation plot) tails off: ARIMA(0, d, n), identifying MA(n)

• if the ACF tails off and the PACF cuts off after lag n: ARIMA(n, d, 0), identifying AR(n)

• if both the ACF and PACF tail off: a mixed ARIMA model is needed

Note that the number of differences in an ARIMA specification can be written in different ways even when the same model is meant. For example, ARIMA(1,1,0) on the original series can be written as ARIMA(1,0,0) on the differenced series. Likewise, check for over-differencing when the lag-1 autocorrelation is negative (usually below −0.5): over-differencing inflates the standard deviation.

The following is an example using the Apple series:

• the upper left shows the ACF of the log Apple stock price; the ACF decays slowly (rather than cutting off), so the model likely requires differencing.

• the lower left is the PACF of the log Apple price, which shows a significant value at lag 1 and then cuts off. The model for the log price might therefore be ARIMA(1,0,0)

• the upper right shows the ACF of the differenced log Apple price, with no significant lags (lag 0 is not counted)

• the lower right is the PACF of the differenced log Apple price, again with no significant lags. The differenced log series is therefore white noise, and the original model resembles a random walk, ARIMA(0,1,0)

In fitting an ARIMA model, parsimony is important: the model should have as few parameters as possible while still explaining the series (p and q should each be at most 2, or the total number of parameters at most 3, following the Box-Jenkins approach). The more parameters, the more noise enters the model, and the larger the standard deviation.

Therefore, when comparing models by AICc, we examine the models with p and q of 2 or less. To compute the ACF and PACF in R:

• ACF and PACF of the log price

```
acf.appl=acf(log.appl,main='ACF Apple',lag.max=100)
pacf.appl=pacf(log.appl,main='PACF Apple',lag.max=100)
```

• ACF and PACF of the differenced log price

```
acf.appl=acf(difflog.appl,main='ACF Differenced Log Apple')
pacf.appl=pacf(difflog.appl,main='PACF Differenced Log Apple')
```

In addition to the Box-Jenkins method, AICc provides another way to check and identify the model. AICc is the corrected Akaike information criterion, calculated as follows:

AICc = n * log(SS/n) + 2(p + q + 1) * n / (n − p − q − 2), if the model has no constant term

AICc = n * log(SS/n) + 2(p + q + 2) * n / (n − p − q − 3), if the model has a constant term

n: number of observations after differencing (n = N − d)

SS: sum of squared residuals of the differenced series

p, q: the orders of the autoregressive and moving average parts

According to this method, the model with the lowest AICC will be selected. When performing time series analysis in R, the program will provide AICC as part of the results. However, in other software, you may need to calculate the number manually by calculating the sum of squares and following the above formula. When using different software, the numbers may be slightly different.
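As a concrete sketch, the formula above can be wrapped in a small helper (the name `aicc.arima` is ours). Note that this is the Box-Jenkins-style AICc computed from the residual sum of squares; R's `arima()` reports a likelihood-based AICc, so absolute values can differ, but the model ranking is what matters:

```r
# AICc from the formulas above: SS = residual sum of squares, n = observations
# after differencing, p/q = AR and MA orders; 'constant' switches the two cases
aicc.arima <- function(SS, n, p, q, constant = FALSE) {
  k <- if (constant) 2 else 1            # one extra parameter when a constant is fitted
  n * log(SS / n) + 2 * (p + q + k) * n / (n - p - q - k - 1)
}

aicc.arima(SS = 1, n = 100, p = 1, q = 0)  # no-constant case
```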

```
p d q   AICc
0 1 0   -6493
1 1 0   -6491.02
0 1 1   -6493.02
1 1 1   -6489.01
0 1 2   -6492.84
1 1 2   -6488.89
2 1 0   -6491.1
2 1 1   -6489.14
2 1 2   -6501.86
```

Based on AICc, we should choose ARIMA(2,1,2). The two methods can sometimes produce different results, so once all estimates are obtained, the model must be checked and tested. The following is the code to fit the ARIMA model in R:

`arima212=arima(log.appl,order=c(2,1,2))`

`summary(arima212)`

### Parameter estimation

To estimate the parameters, run the code shown above. The output provides an estimate of each coefficient of the model. With ARIMA(2,1,2) as the selected model, the results are as follows:

```
Series: log.appl
ARIMA(2,1,2)
Coefficients:
ar1 ar2 ma1 ma2
-0.0015 -0.9231 0.0032 0.8803
s.e. 0.0532 0.0400 0.0661 0.0488
sigma^2 estimated as 0.000559: log likelihood=3255.95
AIC=-6501.9 AICc=-6501.86 BIC=-6475.68
```

Complete model:

(Yt − Yt−1) = −0.0015(Yt−1 − Yt−2) − 0.9231(Yt−2 − Yt−3) + 0.0032εt−1 + 0.8803εt−2 + εt

Note that when an ARIMA model with differencing is fitted, R omits the constant (mean) term. The following is the output from Minitab:

```
Final Estimates of Parameters
Type Coef SE Coef T P
AR 1 0.0007 0.0430 0.02 0.988
AR 2 -0.9259 0.0640 -14.47 0.000
MA 1 0.0002 0.0534 0.00 0.998
MA 2 -0.8829 0.0768 -11.50 0.000
Constant 0.002721 0.001189 2.29 0.022
Differencing: 1 regular difference
Number of observations: Original series 1401, after differencing 1400
Residuals: SS = 0.779616 (backforecasts excluded)
MS = 0.000559 DF = 1395
Modified Box-Pierce (Ljung-Box) Chi-Square statistic
Lag 12 24 36 48
Chi-Square 6.8 21.2 31.9 42.0
DF 7 19 31 43
P-Value 0.452 0.328 0.419 0.516
```

Note that R can give different estimates for the same model, depending on how the code is written. For example:

`arima(log.appl,order=c(2,1,2))`

`arima(difflog.appl,order=c(2,0,2))`

The parameter estimates of ARIMA(2,1,2) from these two calls will differ in R, even though they refer to the same model. In Minitab the results are consistent either way, so there is less confusion for users.

### Diagnostic checking

This process involves examining the residual plot together with its ACF and PACF plots, and checking the Ljung-Box results.

If the ACF and PACF of the model residuals show no significant lags, an appropriate model has been selected.

Here, the residual ACF and PACF show no significant lags, indicating that ARIMA(2,1,2) is a good model for the series.

In addition, the Ljung-Box test provides another way to examine the model. The Ljung-Box test checks whether the autocorrelations of a time series differ from 0. If the test fails to reject the null hypothesis, the residuals are consistent with being independent and uncorrelated; otherwise, serial correlation remains in the series and the model needs to be revised.

```
Modified Box-Pierce (Ljung-Box) Chi-Square statistic
Lag 12 24 36 48
Chi-Square 6.8 21.2 31.9 42.0
DF 7 19 31 43
P-Value 0.452 0.328 0.419 0.516
```

Minitab’s output shows that all p-values are greater than 0.05, so we cannot reject the null hypothesis that the autocorrelations are 0. The selected model is therefore an adequate model for the Apple stock price.
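In R, the same Ljung-Box check can be run with the base function `Box.test`; `fitdf` subtracts the number of fitted ARMA parameters (p + q = 4 for ARIMA(2,1,2)) from the degrees of freedom. The residuals are simulated here as a stand-in for the actual `arima212` residuals:

```r
set.seed(42)
res <- rnorm(300)   # stand-in for residuals(arima212)

# Ljung-Box test: H0 = autocorrelations up to lag 12 are all zero
bt <- Box.test(res, lag = 12, type = "Ljung-Box", fitdf = 4)
bt$parameter        # degrees of freedom: lag - fitdf = 8
bt$p.value          # a large p-value means no evidence of serial correlation
```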

# ARCH/GARCH models

Although the ACF and PACF of the residuals show no significant lags, the time series plot of the residuals shows some volatility. It is important to remember that ARIMA models data linearly and its forecasts remain constant, because the model cannot reflect recent changes or incorporate new information. In other words, it provides the best linear forecast for the series, but plays little role in nonlinear forecasting. To model volatility, the ARCH/GARCH method is needed. How do we know whether ARCH/GARCH is needed for a given time series?

First, check whether the residual plot shows any volatility. Next, look at the squared residuals. If volatility is present, ARCH/GARCH should be used to model the volatility of the series and reflect its more recent changes and fluctuations. Finally, the ACF and PACF of the squared residuals help confirm whether the residuals (noise terms) are independent and predictable. As mentioned earlier, strict white noise cannot be predicted linearly or nonlinearly, while ordinary white noise cannot be predicted linearly but may still be predicted nonlinearly. If the residuals are strict white noise, they are independent, have zero mean, are normally distributed, and the ACF and PACF of the squared residuals show no significant lags.

The following are plots of the squared residuals:

• the squared residual plot shows volatility at certain points in time

• the PACF still cuts off at lag 10, even though several later lags remain significant

Therefore, the residuals show patterns that can be modeled, and ARCH/GARCH is needed to model the volatility. As the name suggests, this method is concerned with the conditional variance of the series. The general form of ARCH(q) is:

ht = a0 + a1ε²t−1 + … + aqε²t−q

To extract and square the residuals in R:

```
res.arima212=arima212$res
squared.res.arima212=res.arima212^2
```

Select the ARCH/GARCH order and parameters according to AICc, as follows:

AICc = −2 * loglik + 2(q + 1) * N / (N − q − 2), if the model has no constant term

AICc = −2 * loglik + 2(q + 2) * N / (N − q − 3), if the model has a constant term

To calculate AICc, we fit the ARCH/GARCH model to the residuals and then compute the log likelihood with the logLik function in R. Note that since we only want to model the noise of the ARIMA model, we fit the ARCH model to the residuals of the previously selected ARIMA model, not to the original series or to the log or differenced-log series.
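The ARCH-order AICc formulas can be checked numerically against the table below; `aicc.arch` is our own helper name, with `loglik` taken from `logLik()` on the fitted model:

```r
# AICc for an ARCH(q) fit on N residuals, from the formulas above
aicc.arch <- function(loglik, N, q, constant = FALSE) {
  k <- if (constant) 2 else 1
  -2 * loglik + 2 * (q + k) * N / (N - q - k - 1)
}

# ARCH(1) row of the table: log likelihood 3314.55 on N = 1400 residuals
aicc.arch(3314.55, 1400, 1)                   # about -6625.09 (no constant)
aicc.arch(3314.55, 1400, 1, constant = TRUE)  # about -6623.08 (with constant)
```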

```
Model      N    q  Log-likelihood  AICc (no const)  AICc (const)
ARCH(0)    1400 0  3256.488        -6510.973139     -6508.96741
ARCH(1)    1400 1  3314.55         -6625.09141      -6623.082808
ARCH(2)    1400 2  3331.168        -6656.318808     -6654.307326
ARCH(3)    1400 3  3355.06         -6702.091326     -6700.076958
ARCH(4)    1400 4  3370.881        -6731.718958     -6729.701698
ARCH(5)    1400 5  3394.885        -6777.709698     -6775.68954
ARCH(6)    1400 6  3396.683        -6779.28554      -6777.262477
ARCH(7)    1400 7  3403.227        -6790.350477     -6788.324504
ARCH(8)    1400 8  3410.242        -6802.354504     -6800.325613
ARCH(9)    1400 9  3405.803        -6791.447613     -6789.415798
ARCH(10)   1400 10 3409.187        -6796.183798     -6794.149054
GARCH(1,1) 1400 2  3425.365        -6844.712808     -6842.701326
```

AICc values for the constant and no-constant cases are given above. Note that AICc decreases from ARCH(1) to ARCH(8) and then increases for ARCH(9) and ARCH(10). Why? We need to check the convergence of the models. For the lower-order models, the R output reports “relative function convergence”, while ARCH(9) and ARCH(10) report “false convergence”. When the output reports false convergence, the predictive ability of the model is questionable and we should exclude it from selection. Although GARCH(1,1) has the lowest AICc, that model also converges falsely and is therefore excluded. ARCH(8) is the selected model.

In addition, we include ARCH(0) in the analysis because it can be used to check whether any ARCH effects exist, i.e., whether the residuals are independent.

R code for fitting the ARCH/GARCH model (`garch()` is in the tseries package; `order=c(0,8)` fits an ARCH(8)):

```
arch08=garch(res.arima212,order=c(0,8))
loglik08=logLik(arch08)
summary(arch08)
```

Note that R does not allow an order of q = 0, so we cannot obtain the log likelihood of ARCH(0) from R; instead we compute it from the formula:

−0.5 * N * (1 + log(2 * pi * mean(x^2)))

N: number of observations after differencing (N = n − d)

x: the residual series
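Under the stated assumption that x is the ARIMA residual series, the ARCH(0) log likelihood can be computed by hand as follows; this is just the Gaussian log likelihood with constant variance mean(x²), and `loglik.arch0` is our own name:

```r
# ARCH(0) log likelihood: Gaussian likelihood with constant variance mean(x^2)
loglik.arch0 <- function(x) {
  N <- length(x)
  -0.5 * N * (1 + log(2 * pi * mean(x^2)))
}

# toy check: for x = c(1, -1), mean(x^2) = 1, so the value is -(1 + log(2*pi))
loglik.arch0(c(1, -1))
```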

Output of arch 8:

```
Call:
Model:
GARCH(0,8)
Residuals:
Min 1Q Median 3Q Max
-4.40329 -0.48569 0.08897 0.69723 4.07181
Coefficient(s):
Estimate Std. Error t value Pr(>|t|)
a0 1.472e-04 1.432e-05 10.282 < 2e-16 ***
a1 1.284e-01 3.532e-02 3.636 0.000277 ***
a2 1.335e-01 2.839e-02 4.701 2.59e-06 ***
a3 9.388e-02 3.688e-02 2.545 0.010917 *
a4 8.678e-02 2.824e-02 3.073 0.002116 **
a5 5.667e-02 2.431e-02 2.331 0.019732 *
a6 3.972e-02 2.383e-02 1.667 0.095550 .
a7 9.034e-02 2.907e-02 3.108 0.001885 **
a8 1.126e-01 2.072e-02 5.437 5.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Diagnostic Tests:
Jarque Bera Test
data: Residuals
X-squared = 75.0928, df = 2, p-value < 2.2e-16
Box-Ljung test
data: Squared.Residuals
X-squared = 0.1124, df = 1, p-value = 0.7374
```

Except for the sixth parameter, the p-values of all parameters are less than 0.05, indicating that they are statistically significant. In addition, the p-value of the Box-Ljung test is greater than 0.05, so we cannot reject the null hypothesis that the autocorrelations of the squared residuals are 0. The model is therefore adequate for the residuals.

The complete ARCH(8) model:

ht = 1.472e-04 + 1.284e-01ε²t−1 + 1.335e-01ε²t−2 + 9.388e-02ε²t−3 + 8.678e-02ε²t−4 + 5.667e-02ε²t−5 + 3.972e-02ε²t−6 + 9.034e-02ε²t−7 + 1.126e-01ε²t−8

# ARIMA-ARCH/GARCH

In this section, we compare the results of the ARIMA model with those of the combined ARIMA-ARCH/GARCH model. As noted earlier, the models for the log Apple price series are ARIMA(2,1,2) and ARCH(8), respectively. We also look at the Minitab results and compare them with R’s. Remember that R excludes the constant when fitting an ARIMA model with differencing, so the result we generated in R is ARIMA(2,1,2) without a constant. Using the forecast function, the one-step forecast from ARIMA(2,1,2) is:

```
Point Forecast Lo 95 Hi 95
1402 6.399541 6.353201 6.445882
```

The complete ARIMA(2,1,2)-ARCH(8) model combines the ARIMA equation above with the ARCH(8) conditional variance equation.

The following table summarizes all the models; the point forecasts and forecast intervals were computed and compiled in Excel:

```
                                                 95% confidence interval
Model                                           Forecast    Lower       Upper       Actual
ARIMA(2,1,2) in R                               6.399541    6.353201    6.445882    6.354317866
ARIMA(2,1,2) in Minitab (constant)              6.40099     6.35465     6.44734
ARIMA(2,1,2) in Minitab (no constant)           6.39956     6.35314     6.44597
ARIMA(2,1,2) + ARCH(8) in R                     6.39974330  6.35340330  6.44608430
ARIMA(2,1,2) in Minitab (constant) + ARCH(8)    6.40119230  6.35485230  6.44754230
ARIMA(2,1,2) in Minitab (no constant) + ARCH(8) 6.39976230  6.35334230  6.44617230
```

By converting logarithmic prices into prices, we obtain the prediction of the original series:

```
                                                 95% confidence interval
Model                                           Forecast     Lower        Upper        Actual
ARIMA(2,1,2) in R                               601.5688544  574.3281943  630.1021821  574.9700003
ARIMA(2,1,2) in Minitab (constant)              602.4411595  575.1609991  631.0215411
ARIMA(2,1,2) in Minitab (no constant)           601.5802843  574.2931614  630.1576335
ARIMA(2,1,2) + ARCH(8) in R                     601.6905666  574.4443951  630.2296673
ARIMA(2,1,2) in Minitab (constant) + ARCH(8)    602.5630482  575.2773683  631.1492123
ARIMA(2,1,2) in Minitab (no constant) + ARCH(8) 601.7019989  574.409355   630.28513
```
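The log-to-price conversion in the table above is just exp() applied to the point forecast and both bounds; for instance, the ARIMA(2,1,2)-in-R row:

```r
# one-step log-price forecast and 95% bounds from the ARIMA(2,1,2) fit in R
log.fc <- c(point = 6.399541, lower = 6.353201, upper = 6.445882)

price.fc <- exp(log.fc)   # back to the price scale
round(price.fc, 2)        # roughly 601.57, 574.33, 630.10, matching the table
```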

On July 25, 2012, Apple released a lower-than-expected earnings report, which hit the company’s share price, sending the stock from $600.92 on July 24, 2012 to $574.97 on July 25, 2012. This kind of event risk occurs frequently when a company releases positive or negative news. However, since the actual price lies within our 95% forecast interval, very close to the lower bound, our model appears to have handled this risk successfully.

Note that the 95% forecast interval of ARIMA(2,1,2) is wider than that of the combined ARIMA(2,1,2)-ARCH(8) model. This is because the latter reflects and incorporates recent changes and volatility in the stock price by analyzing the residuals and their conditional variance (the variance as affected by newly arriving information).

So how is the conditional variance ht of ARCH(8) calculated?

• generate the one-step forecast and forecast plot:

`forecast212step1=forecast(arima212,1,level=95)`

• calculate ht, the conditional variance:

`ht.arch08=arch08$fit[,1]^2  # use the first column of the fitted values`

• generate graphs of logarithmic prices, upper and lower limits of 95%

```
plot(log.appl,type='l',main='Log Apple,Low,High')
lines(low,col='red')
lines(high,col='blue')
```

To calculate ht, we list all the parameters of the model in one column, find the residuals associated with each coefficient, square those residuals, multiply each coefficient by its squared residual, and then sum to obtain ht. For example, to estimate point 1402 (our data set has 1401 observations), we need the residuals of the last 8 days, because our model is ARCH(8). The resulting table:

```
      ht coeff  res        squared res  ht components
const 1.47E-04                          1.47E-04
a1    1.28E-01  -5.18E-03  2.69E-05     3.45E-06
a2    1.34E-01   4.21E-04  1.77E-07     2.37E-08
a3    9.39E-02  -1.68E-02  2.84E-04     2.66E-05
a4    8.68E-02   1.25E-02  1.57E-04     1.36E-05
a5    5.67E-02  -7.41E-04  5.49E-07     3.11E-08
a6    3.97E-02   8.33E-04  6.93E-07     2.75E-08
a7    9.03E-02   2.92E-03  8.54E-06     7.72E-07
a8    1.13E-01   9.68E-03  9.37E-05     1.05E-05
ht                                      2.02E-04
```
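The table's calculation can be sketched as a small function: a0 plus each coefficient times its lagged squared residual, with the most recent residual paired with a1. `compute.ht` is our own name; in the article, `a` would be the fitted ARCH(8) coefficients and `res` the ARIMA residuals:

```r
# h_t = a0 + a1*e^2(t-1) + ... + aq*e^2(t-q); residuals are taken from the end
# of the series, most recent first (pairing the newest residual with a1)
compute.ht <- function(a0, a, res) {
  q <- length(a)                    # ARCH order, 8 in the article
  last.sq <- rev(tail(res, q))^2    # squared residuals ordered t-1, t-2, ..., t-q
  a0 + sum(a * last.sq)
}

# toy check: a0 = 0.1, a1 = 0.5, most recent residual 2 -> 0.1 + 0.5*4 = 2.1
compute.ht(0.1, c(0.5), c(0, 2))
```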

To obtain the one-step forecast and 95% forecast interval of the mixed model described above, we take the ARIMA forecast from R or Minitab and add ht to it. Log price and conditional variance:
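Numerically, this combination amounts to shifting the ARIMA forecast and both bounds by h_t, leaving the interval width unchanged; with the R forecast and the h_t of about 2.02e-04 from the table, this reproduces the combined row of the comparison table:

```r
arima.fc <- c(point = 6.399541, lower = 6.353201, upper = 6.445882)
ht <- 2.02e-04            # conditional variance for point 1402 (from the table)

combined.fc <- arima.fc + ht   # ARIMA(2,1,2) + ARCH(8) forecast and bounds
round(combined.fc, 6)          # about 6.399743, 6.353403, 6.446084
```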

• the conditional variance plot successfully reflects the volatility over the whole series

• high volatility coincides closely with periods of stock price crashes

95% forecast interval of price:

The final check of the model is the QQ plot of the ARIMA-ARCH residuals, i.e., et = εt / sqrt(ht) = residual / sqrt(conditional variance). We can compute this directly in R and then draw the QQ plot to check the normality of the residuals. The code and QQ plot follow:

```
archres=res.arima212/sqrt(ht.arch08)
qqnorm(archres,main='ARIMA-ARCH Residuals')
qqline(archres)
```

The plot shows that the residuals are roughly normally distributed, although some points lie off the line. Compared with the ARIMA model, however, the residuals of the hybrid model are closer to a normal distribution.

# Conclusion

The time-domain method is a useful way to analyze financial time series. Some aspects should be considered when forecasting with an ARIMA-ARCH/GARCH model:

First, the ARIMA model focuses on linear analysis of the series and cannot reflect recent changes arising from new information. To update the model, the user must merge in new data and re-estimate the parameters. The variance in an ARIMA model is the unconditional variance and remains constant. ARIMA applies to stationary series, so non-stationary series should be transformed (for example, by a log transformation) and differenced first.

In addition, ARIMA is often used together with an ARCH/GARCH model. ARCH/GARCH is a way to measure the volatility of a series, or more specifically, to model the noise term of the ARIMA model. ARCH/GARCH incorporates new information and analyzes the series through its conditional variance, so users can forecast future values with the latest information. The forecast interval of the hybrid model is narrower than that of a pure ARIMA model.
