# Python panel time series data prediction: Granger causality test, drug sales example and visualization

Time: 2022-5-7

A time series is a sequence of observations recorded at fixed time intervals. This guide walks you through the process of analyzing the characteristics of a given time series in Python.

## Contents

1. What is a time series?
2. How to import a time series in Python?
3. What is panel data?
4. Visualization of a time series
5. Patterns in a time series
6. Additive and multiplicative time series
7. How to decompose a time series into its components?
8. Stationary and non-stationary time series
9. How to make a time series stationary?
10. How to test for stationarity?
11. What is the difference between white noise and a stationary series?
12. How to detrend a time series?
13. How to deseasonalize a time series?
14. How to test the seasonality of a time series?
15. How to handle missing values in a time series?
16. What are autocorrelation and partial autocorrelation functions?
17. How to compute the partial autocorrelation function?
18. Lag plots
19. How to estimate the forecastability of a time series?
20. Why and how to smooth a time series?
21. How to use the Granger causality test to know whether one time series can help forecast another?

## 1. What is a time series?

A time series is a sequence of observations recorded at fixed time intervals.

Depending on the frequency of observation, a time series is typically hourly, daily, weekly, monthly, quarterly or annual. Sometimes you may also have a sub-minute series, such as clicks and user visits every minute.

Why analyze a time series?

Because it is the preparatory step before you forecast the series.

In addition, time series prediction is of great commercial significance, because things that are very important to enterprises, such as demand and sales, website visits, stock prices and so on, are basically time series data.

So, what is involved in analyzing a time series?

Time series analysis involves understanding various aspects of the inherent nature of the series, so that you are better informed to create meaningful and accurate forecasts.

## 2. How to import time series in Python?

So, how to import time series data?

Time series data are usually stored in CSV files or other spreadsheet formats and contain two columns: the date and the measured value.

We use `read_csv()` from the pandas package to read the time series dataset (a CSV file of drug sales) as a pandas DataFrame. Adding the `parse_dates=['date']` argument makes the date column be parsed as a date field.

import pandas as pd

# Import data
df = pd.read_csv('a10.csv', parse_dates=['date'])
df.head()

DataFrame time series

Alternatively, you can import it as a pandas Series with the date as the index. You just need to specify the `index_col` argument in `pd.read_csv()`.

ser = pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')
ser.head()

## 3. What is panel data?

Panel data is also a time-based data set.

The difference is that in addition to the time series, it also contains one or more related variables measured in the same time period.

Generally, the columns existing in the panel data contain explanatory variables that help to predict y, provided that these columns are available in the future prediction period.

The following is an example of panel data.

df.head()

Panel data

## 4. Visualization of time series

Let’s use Matplotlib to visualize this sequence.

# Time series data source: fpp package in R
import matplotlib.pyplot as plt

# Draw plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16, 5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.value)

Visualization of the time series

Since all values are positive, you can display them on both sides of the Y axis to emphasize growth.

# Import data
df = pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')
x = df.index.values
y1 = df['value'].values

# Plot
fig, ax = plt.subplots(1, 1, figsize=(16, 5), dpi=120)
plt.fill_between(x, y1=y1, y2=-y1, alpha=0.5)
plt.show()

Air passenger data (two-sided series)

Because it is a monthly time series and follows a certain repetition pattern every year, you can draw the situation of each year as a separate line in the same chart. This allows you to compare the annual patterns side by side.

Seasonal plot of the time series

# Import data
df = pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

# Prepare data
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]
years = df['year'].unique()

# Prepare colors
import numpy as np
import matplotlib as mpl
mycolors = np.random.choice(list(mpl.colors.XKCD_COLORS.keys()), len(years), replace=False)

# Draw plot
plt.figure(figsize=(16, 12), dpi=80)
for i, y in enumerate(years):
    plt.plot('month', 'value', data=df.loc[df.year==y, :], color=mycolors[i], label=y)
    plt.text(df.loc[df.year==y, :].shape[0]-.9, df.loc[df.year==y, 'value'][-1:].values[0], y, fontsize=12, color=mycolors[i])

plt.gca().set(xlim=(-0.3, 11), ylabel='Drug Sales', xlabel='Month')
plt.title("Seasonal plot of drug sales time series", fontsize=20)
plt.show()

Seasonal plot of drug sales

Every year, drug sales drop sharply in February, rise again in March, fall again in April, and so on. Clearly, the pattern repeats within a given year, every year.

However, as the years progress, drug sales increase overall. You can visualize this trend and how it varies each year in a nice year-wise boxplot. Likewise, you can make a boxplot arranged by month to visualize the monthly distributions.

Boxplots of the monthly (seasonal) and yearly (trend) distributions

You can group the data at seasonal intervals to see how the values are distributed within a given year or month and how they compare across periods.

# Import data
df = pd.read_csv('a10.csv', parse_dates=['date'], index_col='date')
df.reset_index(inplace=True)

# Prepare data
df['year'] = [d.year for d in df.date]
df['month'] = [d.strftime('%b') for d in df.date]

# Draw plot
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(20, 7), dpi=80)
sns.boxplot(x='year', y='value', data=df, ax=axes[0])
sns.boxplot(x='month', y='value', data=df.loc[~df.year.isin([1991, 2008]), :], ax=axes[1])
plt.show()

Box chart by year and month

The boxplots make the year-wise and month-wise distributions evident. Also, in the month-wise boxplot, drug sales in December and January are clearly higher, which can be attributed to the holiday discount season.

So far, we have seen ways to identify the patterns. Now, how do you spot any deviations from the usual pattern?

## 5. Patterns in time series

Any time series may be split into the following components: Base Level + Trend + Seasonality + Error

A trend is observed when there is an increasing or decreasing slope in the time series. Seasonality is observed when there is a distinct repeated pattern between regular intervals due to seasonal factors. It could be the month of the year, the day of the month, weekdays or even the time of day.

However, not all time series must have a trend and/or seasonality. A time series may have no distinct trend but have a seasonality, and vice versa.

Therefore, a time series can be imagined as a combination of trend, seasonality and error terms.

fig, axes = plt.subplots(1, 3, figsize=(20, 4), dpi=100)

# The three panels show a series with trend only, a series with
# seasonality only, and a series with both; the drug sales series
# (trend and seasonality) goes in the last panel.
df.plot(legend=False, ax=axes[2], title='Trend and Seasonality')
plt.show()

Patterns in a time series

Another area to consider is cyclical behavior. This happens when the rising and falling patterns in the series do not occur within a fixed calendar based time interval. Care should be taken not to confuse the “cyclical” effect with the “seasonal” effect.

So, how to distinguish between “cyclical” and “seasonal” patterns?

If the patterns are not of fixed calendar-based frequencies, then they are cyclic. Unlike seasonality, cyclic effects are typically influenced by business and other socio-economic factors.

## 6. Time series of addition and multiplication

Depending on the nature of the trend and seasonality, a time series can be modeled as additive or multiplicative, where each observation in the series can be expressed as either a sum or a product of the components.

Additive time series: Value = Base Level + Trend + Seasonality + Error

Multiplicative time series: Value = Base Level x Trend x Seasonality x Error

## 7. How to decompose a time series into its components?

You can do a classical decomposition of a time series by considering the series as an additive or multiplicative combination of the base level, trend, seasonal index and residual.

`seasonal_decompose` in statsmodels implements this conveniently.

from statsmodels.tsa.seasonal import seasonal_decompose

# Multiplicative decomposition
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Additive decomposition
result_add = seasonal_decompose(df['value'], model='additive', extrapolate_trend='freq')

# Plot
result_mul.plot().suptitle('Multiplicative Decompose', fontsize=22)
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Setting `extrapolate_trend='freq'` takes care of any missing values in the trend and residuals at the beginning of the series.

If you look closely at the residuals of the additive decomposition, some pattern is left over. The multiplicative decomposition, however, looks quite random. So ideally, multiplicative decomposition should be preferred for this particular series.

The numerical output of the trend, seasonal and residual components is stored in the `result_mul` output itself. Let's extract them and put them in a DataFrame.

# Extract the components ----
# Actual values = product of (seasonal * trend * resid)
df_reconstructed = pd.concat([result_mul.seasonal, result_mul.trend,
                              result_mul.resid, result_mul.observed], axis=1)
df_reconstructed.columns = ['seas', 'trend', 'resid', 'actual_values']
df_reconstructed.head()

If you check, the product of the `seas`, `trend` and `resid` columns should exactly equal `actual_values`.
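The same identity can be sanity-checked on a tiny hand-made multiplicative series; a minimal sketch with made-up component values (no statsmodels needed):

```python
# A toy multiplicative series: actual = trend * seasonal * resid.
trend = [10.0, 11.0, 12.0, 13.0]
seasonal = [0.9, 1.1, 0.9, 1.1]
resid = [1.0, 1.05, 0.95, 1.0]
actual = [t * s * r for t, s, r in zip(trend, seasonal, resid)]

# Recover the residual component by dividing out trend and seasonality,
# mirroring the identity stored in the decomposition output.
recovered = [a / (t * s) for a, t, s in zip(actual, trend, seasonal)]
print([round(r, 6) for r in recovered])  # matches resid
```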

## 8. Stationary and non-stationary time series

Stationarity is a property of a time series. A stationary series is one where the values of the series are not a function of time.

That is, the statistical properties of the series, such as the mean, variance and autocorrelation, are constant over time. The autocorrelation of the series is simply the correlation of the series with its previous values; more on that coming up.

A stationary time series is also devoid of seasonal effects.

So how do you identify whether a series is stationary or not? Let's plot some examples to make it clear.

Stationary and non-stationary time series

So why is a stable sequence important?

I'll come to that in a bit, but understand that it is possible to make nearly any time series stationary by applying a suitable transformation. Most statistical forecasting methods are designed to work on a stationary time series, so the first step in the forecasting process is typically to do some transformation to convert a non-stationary series into a stationary one.

## 9. How to make a time series stationary?

You can make a series stationary in the following ways:

1. Difference the sequence (one or more times)
2. Take the logarithm of the sequence
3. Take the nth root of the sequence
4. Combination of the above methods

The most common and convenient method to make the series stationary is to difference it at least once until it becomes approximately stationary.

So, what is differencing?

If y_t is the value at time t, then the first difference of y is y_t - y_(t-1). Put simply, differencing the series is nothing but subtracting the previous value from the current value.

If the first difference does not make the series stationary, you can go for second-order differencing, and so on.

For example, consider the following series: [1, 5, 2, 12, 20]

The first differences are: [5-1, 2-5, 12-2, 20-12] = [4, -3, 10, 8]

The second differences are: [-3-4, 10-(-3), 8-10] = [-7, 13, -2]
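The two differencing steps above can be sketched in plain Python (the `difference` helper is ours, not a library function):

```python
def difference(series):
    """First difference: subtract the previous value from the current one."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

series = [1, 5, 2, 12, 20]
first = difference(series)
second = difference(first)
print(first)   # [4, -3, 10, 8]
print(second)  # [-7, 13, -2]
```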

It is relatively easy to predict a stationary series, and the prediction results are more reliable.

An important reason is that the autoregressive prediction model is essentially a linear regression model, which uses the lag of the sequence itself as the prediction factor.

We know that linear regression works best when the predictors (the X variables) are not correlated with each other. So, stationarizing the series solves this problem, since it removes any persistent autocorrelation, thereby making the predictors (the lags of the series) in the forecasting model nearly independent.

Now that we have determined the importance of sequence stationarity, how do you check whether a given sequence is stationary?

## 10. How to test for stationarity?

The stationarity of a sequence can be determined by observing the graph of the sequence, as we did before.

Another method is to split the series into two or more contiguous parts and compute the summary statistics, such as the mean, variance and autocorrelation, for each part. If the statistics are quite different, the series is not likely to be stationary.
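A minimal sketch of that split-and-compare check, using only the standard library on a deterministic upward-trending series (so the partial means must differ):

```python
from statistics import mean, variance

# A deterministic upward-trending (hence non-stationary) series.
series = [i + (i % 4) for i in range(40)]

# Split into two contiguous halves and compare summary statistics.
first_half, second_half = series[:20], series[20:]
print(mean(first_half), mean(second_half))  # 11 vs 31: very different means
print(variance(first_half), variance(second_half))
```

The large gap between the two means is the informal signal that the series is not stationary.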

However, you need a method to quantitatively determine whether a given series is stationary or not. This can be done using a statistical test called the "unit root test". There are multiple variations of it, where the test checks whether a time series is non-stationary and possesses a unit root.

There are multiple implementations of unit root tests, such as:

1. Augmented Dickey-Fuller test (ADF test)
2. Kwiatkowski-Phillips-Schmidt-Shin test (KPSS test, trend stationary)
3. Phillips-Perron test (PP test)

The most commonly used is the ADF test, where the null hypothesis is that the time series possesses a unit root and is non-stationary. So, if the p-value of the ADF test is less than the significance level (0.05), you reject the null hypothesis.

The KPSS test, on the other hand, is used to test for trend stationarity. The null hypothesis and the interpretation of the p-value are the opposite of those of the ADF test. The code below implements both checks using the statsmodels package in Python.

from statsmodels.tsa.stattools import adfuller, kpss

# ADF test
result = adfuller(df.value.values, autolag='AIC')
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

# KPSS test
result = kpss(df.value.values, regression='c')
print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])

## 11. What is the difference between white noise and stationary series?

Like a stationary series, white noise is also not a function of time, that is, its mean and variance do not change over time. The difference is that white noise is completely random, with a mean of 0.

In white noise, there is no pattern. If you think of the sound signal in the FM radio as a time series, the blank sound you hear between channels is white noise.

Mathematically, a completely random sequence of numbers with an average of 0 is white noise.

randvals = np.random.randn(1000)
pd.Series(randvals).plot(title='Random White Noise', color='k')

Random white noise
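The two defining properties (mean near zero, no autocorrelation) can be sanity-checked with the standard library; `random.gauss` stands in for `np.random.randn` here:

```python
import random
from statistics import mean

random.seed(42)
noise = [random.gauss(0, 1) for _ in range(10000)]

# Property 1: the mean is close to 0.
m = mean(noise)

# Property 2: the lag-1 autocorrelation is also close to 0 (no pattern).
num = sum((a - m) * (b - m) for a, b in zip(noise, noise[1:]))
den = sum((x - m) ** 2 for x in noise)
print(abs(m) < 0.05, abs(num / den) < 0.05)
```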

## 12. How to detrend a time series?

Detrending a time series means removing the trend component from the series. But how do you extract the trend? There are multiple approaches.

1. Subtract the line of best fit from the time series. The line of best fit may be obtained from a linear regression model with the time steps as the predictor. For more complex trends, you may want to use quadratic terms (x^2) in the model.
2. Subtract the trend component obtained from the time series decomposition we saw earlier.
3. Subtract the mean.
4. Apply a filter like the Baxter-King filter or the Hodrick-Prescott filter to remove the moving average trend line or the cyclical component.

Let’s implement the first two methods.

# Using scipy: subtract the line of best fit
from scipy import signal
detrended = signal.detrend(df.value.values)
plt.plot(detrended)
plt.title('Drug sales detrended by subtracting the least squares fit', fontsize=16)

Detrended by subtracting the least squares fit
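For intuition, the first method (subtracting the line of best fit) can also be done by hand; a stdlib-only sketch with our own helper name:

```python
def detrend_linear(y):
    """Subtract the least squares line of best fit from y."""
    n = len(y)
    x = range(n)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = y_mean - slope * x_mean
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# A perfectly linear series detrends to all zeros.
print(detrend_linear([3, 5, 7, 9, 11]))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```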

# Using statsmodels: subtract the trend component
from statsmodels.tsa.seasonal import seasonal_decompose
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')
detrended = df.value.values - result_mul.trend
plt.plot(detrended)
plt.title('Drug sales detrended by subtracting the trend component', fontsize=16)

Detrended by subtracting the trend component

## 13. How to deseasonalize a time series?

There are multiple approaches to deseasonalize a time series. Here are a few:

- 1. Take a moving average with length equal to the seasonal window. This will smooth the series in the process.

- 2. Seasonal differencing of the series (subtract the value of the previous season from the current value).

- 3. Divide the series by the seasonal index obtained from STL decomposition.

If dividing by the seasonal index does not work well, try taking the log of the series and then deseasonalizing. You can later restore the original scale by taking the exponential.
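Method 2 (seasonal differencing) is easy to sketch by hand. With a made-up series whose seasonal cycle is 4 steps long, subtracting the value one full season back removes the repeating pattern (the helper name is ours):

```python
def seasonal_difference(series, season):
    """Subtract the value one full season earlier."""
    return [curr - prev for prev, curr in zip(series, series[season:])]

# A seasonal pattern of length 4 on top of a +1-per-step trend.
seasonal = [10, 20, 15, 25] * 3
series = [s + i for i, s in enumerate(seasonal)]

# After seasonal differencing, only the constant trend step remains.
print(seasonal_difference(series, season=4))  # [4, 4, 4, 4, 4, 4, 4, 4]
```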

# Time series decomposition
result_mul = seasonal_decompose(df['value'], model='multiplicative', extrapolate_trend='freq')

# Deseasonalize
deseasonalized = df.value.values / result_mul.seasonal

# Plot
plt.plot(deseasonalized)
plt.title('Drug Sales Deseasonalized', fontsize=16)
plt.show()

Deseasonalized time series

## 14. How to test the seasonality of a time series?

The common way is to plot the series and check for repeatable patterns at fixed time intervals. The types of seasonality are thus determined by the clock or the calendar:

1. Hour of day
2. Day of month
3. Weekly
4. Monthly
5. Yearly

However, if you want a more definitive inspection of the seasonality, use the autocorrelation function (ACF) plot. More on the ACF in the upcoming sections. When there is a strong seasonal pattern, the ACF plot usually reveals definitive repeated spikes at the multiples of the seasonal window.

For example, the drug sales time series is a monthly series with patterns repeating every year, so you can see spikes at the 12th, 24th, 36th... lines.

In real-world datasets, such strong patterns are hardly noticed and may be distorted by noise, so you need a careful eye to capture them.
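A rough stdlib sketch of what the ACF measures, on a synthetic series with a 12-step cycle; the autocorrelation at lag 12 comes out far higher than at off-season lags (`acf` here is our own helper, not the statsmodels function):

```python
import math
from statistics import mean

def acf(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    m = mean(series)
    num = sum((series[i] - m) * (series[i + lag] - m)
              for i in range(len(series) - lag))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# Synthetic monthly series with a 12-step seasonal cycle.
series = [math.sin(2 * math.pi * i / 12) for i in range(120)]
print(round(acf(series, 12), 2))  # strong spike at the seasonal lag (~0.9)
print(round(acf(series, 3), 2))   # near zero away from it
```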

# Draw plot
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df.value.tolist())

Autocorrelation diagram

Alternatively, if you want a statistical test, the CHTest can determine whether seasonal differencing is required to stationarize the series.

## 15. How to deal with missing values in time series?

Sometimes, your time series will have missing dates/times. That means the data was not captured or was not available for those periods. In such cases, you can fill those periods with zeros.

Secondly, when it comes to time series, you should typically not replace missing values with the mean of the series, especially if the series is not stationary. What you can do instead is forward-fill the previous value.

However, depending on the nature of the series, you should try multiple approaches before concluding. Some effective alternative imputation methods are:

• Backward fill
• linear interpolation
• Nearest neighbor average
• Seasonal difference average

To measure the imputation performance, I manually introduce missing values into the time series, impute them with the above approaches, and then measure the mean squared error between the imputed and the actual values.

# Generate dataset: df holds the drug sales series with missing values
# introduced; df_orig holds the original complete series
from scipy.interpolate import interp1d
plt.rcParams.update({'xtick.bottom': False})

## 1. Actual -------------------------------
df_orig.plot(style=".-")

## 2. Forward fill --------------------------
df_ffill = df.ffill()

## 3. Backward fill -------------------------
df_bfill = df.bfill()

## 4. Linear interpolation ------------------
df['rownum'] = np.arange(df.shape[0])
df_nona = df.dropna(subset=['value'])
f = interp1d(df_nona['rownum'], df_nona['value'])
df['linear_fill'] = f(df['rownum'])

## 5. Cubic interpolation --------------------
f2 = interp1d(df_nona['rownum'], df_nona['value'], kind='cubic')
df['cubic_fill'] = f2(df['rownum'])

## 6. Mean of n nearest past neighbors -------
# knn_mean averages the n nearest past neighbors
# (its definition was lost in the source)
df['knn_mean'] = knn_mean(df.value.values, 8)

## 7. Seasonal mean --------------------------
def seasonal_mean(ts, n, lr=0.7):
    """
    Compute the mean of the corresponding seasonal periods
    ts: 1D array-like of the time series
    n:  seasonal window length of the time series
    """
    out = np.copy(ts)
    for i, val in enumerate(ts):
        if np.isnan(val):
            ts_seas = ts[i-1::-n]  # previous seasons only
            if np.isnan(np.nanmean(ts_seas)):
                ts_seas = np.concatenate([ts[i-1::-n], ts[i::n]])  # previous and forward
            out[i] = np.nanmean(ts_seas) * lr
    return out

df['seasonal_mean'] = seasonal_mean(df.value, n=12, lr=1.25)

Missing value imputation approaches
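The forward-fill and linear-interpolation ideas above are simple enough to sketch without pandas; a stdlib-only illustration (helper names are ours), using `None` for a missing value:

```python
def forward_fill(series):
    """Replace each missing value (None) with the last seen value."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

def linear_interpolate(series):
    """Fill internal gaps with a straight line between known points."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (out[b] - out[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = out[a] + step * (i - a)
    return out

data = [10.0, None, None, 16.0, 18.0]
print(forward_fill(data))        # [10.0, 10.0, 10.0, 16.0, 18.0]
print(linear_interpolate(data))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```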

You can also consider the following methods, depending on how accurate you want to infer.

1. If you have explanatory variables, use prediction models, such as random forests or k-nearest neighbors.
2. If you have enough past observations, predict the missing values.
3. If you have enough future observations, reverse predict the missing values
4. Predict the corresponding value of the previous period.

## 16. What are autocorrelation and partial autocorrelation functions?

Autocorrelation is simply the correlation of a series with its own lags. If a series is significantly autocorrelated, that means the previous values of the series (its lags) may be helpful in predicting the current value.

Partial autocorrelation also conveys similar information, but it conveys the pure correlation of a series and its lag, excluding the correlation contributions of the intermediate lags.

from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Calculate ACF and PACF up to 50 lags
acf_50 = acf(df.value, nlags=50)
pacf_50 = pacf(df.value, nlags=50)

# Draw plot
fig, axes = plt.subplots(1, 2, figsize=(16, 3), dpi=100)
plot_acf(df.value.tolist(), lags=50, ax=axes[0])
plot_pacf(df.value.tolist(), lags=50, ax=axes[1])

ACF and PACF

## 17. How to calculate the partial autocorrelation function?

So, how to calculate partial autocorrelation?

The partial autocorrelation of lag k of a series is the coefficient of that lag in the autoregressive equation of Y. The autoregressive equation of Y is nothing but a linear regression of Y against its own lags as predictors.

For example, if y_t is the current series and y_(t-1) is the lag-1 of y, then the partial autocorrelation of lag 3 (y_(t-3)) is the coefficient alpha_3 of y_(t-3) in the following equation:

$$y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + \epsilon_t$$
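That definition can be checked directly with a small regression, here with numpy least squares on a simulated AR(1) series. Since the true order is 1, the lag-2 coefficient (the lag-2 partial autocorrelation) comes out near zero; the setup below is our own illustration, not statsmodels' `pacf`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: y_t = 0.8 * y_{t-1} + noise.
y = np.zeros(2000)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Regress y_t on y_{t-1} and y_{t-2}: the coefficient on y_{t-2}
# is the lag-2 partial autocorrelation.
Y = y[2:]
X = np.column_stack([np.ones(len(Y)), y[1:-1], y[:-2]])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(coef[1])  # close to 0.8 (the true AR coefficient)
print(coef[2])  # close to 0: an AR(1) has no lag-2 partial autocorrelation
```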

## 18. Lag plots

A lag plot is a scatter plot of a time series against a lag of itself. It is normally used to check for autocorrelation. If any pattern exists in the series like the one you see below, the series is autocorrelated. If there is no such pattern, the series is likely to be random white noise.

In the example below, for the sunspot area time series, the plots get more and more scattered as n_lag increases.

# Import
from pandas.plotting import lag_plot
plt.rcParams.update({'ytick.left': False, 'axes.titlepad': 10})

# Plot
fig, axes = plt.subplots(1, 4, figsize=(10, 3), sharex=True, sharey=True, dpi=100)
for i, ax in enumerate(axes.flatten()[:4]):
    lag_plot(df.value, lag=i + 1, ax=ax, c='firebrick')
    ax.set_title('Lag ' + str(i + 1))

fig.suptitle('Lag plots of drug sales', y=1.05)

Lag plots of drug sales

Lag plots of sunspots

## 19. How to estimate the predictability of a time series?

The more regular and repeatable the patterns of a time series, the easier it is to forecast. "Approximate entropy" can be used to quantify the regularity and unpredictability of fluctuations in a time series.

The higher the approximate entropy, the more difficult it is to predict.

Another better alternative is “sample entropy”.

Sample entropy is similar to approximate entropy but is more consistent in estimating complexity, even for smaller time series. For example, a random time series with fewer data points can have a lower "approximate entropy" than a more "regular" time series, whereas a longer random time series will have a higher "approximate entropy".

Sample entropy deals with this problem well. Please see the demonstration below.

import numpy as np

def ApEn(U, m, r):
    """Compute approximate entropy"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0) ** (-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def SampEn(U, m, r):
    """Compute sample entropy"""
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m)] for i in range(N - m + 1)]
        C = [len([1 for j in range(len(x)) if i != j and _maxdist(x[i], x[j]) <= r]) for i in range(len(x))]
        return sum(C)

    N = len(U)
    return -np.log(_phi(m + 1) / _phi(m))

# Example: ApEn(df.value, m=2, r=0.2 * np.std(df.value))

## 20. Why and how to smooth the time series?

Smoothing a time series may be useful in the following aspects.

• Reduce the influence of noise in the signal and obtain an approximate value of the noise filtered sequence.
• The smoothed sequence can be used as a feature to explain the original sequence itself.
• Better observation of underlying trends

So, how to smooth a sequence? Let’s discuss the following methods.

1. Take a moving average.
2. Do a LOESS smoothing (localized regression).
3. Do a LOWESS smoothing (locally weighted regression).

A moving average is simply the average of a rolling window of defined width. You must choose the window width wisely, because a large window size will over-smooth the series. For example, a window size equal to the seasonal duration (e.g. 12 for a monthly series) will effectively nullify the seasonal effect.
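That window-width effect can be shown in plain Python (our own `moving_average` helper): with a window equal to the 4-step seasonal cycle of a made-up series, the seasonal wiggle averages out and only the trend remains:

```python
def moving_average(series, window):
    """Average of each full rolling window of the given width."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# A seasonal pattern of length 4 on top of a +1-per-step trend.
seasonal = [10, 20, 15, 25] * 3
series = [s + i for i, s in enumerate(seasonal)]

# A window equal to the season length nullifies the seasonal effect:
# only the steady trend remains.
print(moving_average(series, window=4))
# [19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0]
```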

LOESS, short for "localized regression", fits multiple regressions in the local neighborhood of each point. It is implemented in the statsmodels package, where you can control the degree of smoothing using the frac argument, which specifies the percentage of nearby data points that should be considered to fit the regression model.

# Import
from statsmodels.nonparametric.smoothers_lowess import lowess
df_orig = pd.read_csv('ele.csv', parse_dates=['date'], index_col='date')

# 1. Moving average
df_ma = df_orig.value.rolling(3, center=True).mean()

# 2. LOESS smoothing (5% and 15%)
df_loess_5 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.05)[:, 1], index=df_orig.index, columns=['value'])
df_loess_15 = pd.DataFrame(lowess(df_orig.value, np.arange(len(df_orig.value)), frac=0.15)[:, 1], index=df_orig.index, columns=['value'])

# Plot
fig, axes = plt.subplots(4, 1, figsize=(7, 7), sharex=True, dpi=120)
df_orig['value'].plot(ax=axes[0], color='k', title='Original Series')
df_loess_5['value'].plot(ax=axes[1], title='Loess Smoothed 5%')
df_loess_15['value'].plot(ax=axes[2], title='Loess Smoothed 15%')
df_ma.plot(ax=axes[3], title='Moving Average (3)')

Smoothing time series

## 21. How to use the Granger causality test to know whether one time series can help forecast another?

Granger causality test is used to determine whether one time series is helpful to predict another time series.

How does the Granger causality test work?

It is based on the idea that if X causes Y, then the forecast of Y based on the previous values of Y and the previous values of X should outperform the forecast of Y based on the previous values of Y alone.

So, understand that Granger causality should not be used to test whether a lag of Y causes Y.

The null hypothesis is that the series in the second column does not Granger-cause the series in the first column. If the p-value is less than the significance level (0.05), you reject the null hypothesis and conclude that the said lag of X is indeed useful.

The second argument, maxlag, says up to how many lags of Y should be included in the test.

from statsmodels.tsa.stattools import grangercausalitytests
df = pd.read_csv('a10.csv', parse_dates=['date'])
df['month'] = df.date.dt.month
grangercausalitytests(df[['value', 'month']], maxlag=2)

In the above case, the p-values of all the tests are zero. So the "month" can indeed be used to forecast drug sales.
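The intuition behind the test (do lags of X shrink the prediction error of Y beyond what Y's own lags achieve?) can be sketched with two small regressions on a toy pair of series where X leads Y by one step. This is only an illustration of the idea, not the F-test that `grangercausalitytests` performs, and all helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# X is random; Y follows X with a one-step delay (so X "Granger-causes" Y).
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.9 * x[t - 1] + 0.1 * rng.normal()

def sse(design, target):
    """Residual sum of squares of an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

target = y[1:]
ones = np.ones(len(target))

# Restricted model: lags of y only.  Full model: lags of y and x.
sse_restricted = sse(np.column_stack([ones, y[:-1]]), target)
sse_full = sse(np.column_stack([ones, y[:-1], x[:-1]]), target)

print(sse_full < sse_restricted)  # adding x's lag sharply cuts the error
```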
