Data recorded as a sequence of points at equally spaced time intervals is usually called time series data. Monthly retail sales, daily weather forecasts, unemployment figures, and consumer sentiment surveys are classic examples of time series data. In fact, most variables in nature, science, business, and many other fields can be measured at fixed time intervals.
One of the key reasons for analyzing time series data is to understand the past and predict the future. Scientists can use historical climate data to predict future climate change. Marketing managers can view the historical sales of a product and predict future demand.
In this article, we will look at a time series data set of the users who visit this blog. I will connect to the Google Analytics API from R and pull in the daily user counts. Then we will create a forecast of the number of users the blog may attract. The date range was chosen arbitrarily, for illustrative purposes only.
The idea is to learn how to query Google Analytics data from R and how to create a time series forecast.
Let’s first set up our working directory and load the necessary libraries:
#Set working directory
setwd("/Users/")
#Load the required packages
library(RGoogleAnalytics) # is used to query the Google Analytics API
library(ggplot2) # is used to draw some initial graphs
library(forecast) # is used to forecast time series
The first step is to query the daily time series of blog users from the Google Analytics API. The following initial query sets the parameters I want. In this example, I simply pull the blog's daily user counts from mid-January 2017 to mid-May 2017.
#Create a list of parameters used in Google Analytics queries
#(the start.date and end.date arguments for the mid-January to mid-May 2017 range are not shown here)
list_param <- Init(
  dimensions = "ga:date",
  metrics = "ga:users",
  sort = "ga:date",
  max.results = 10000,
  table.id = "ga:99395442"
)
After setting up my query parameter list, I can start querying the Google Analytics API:
#Store the query built from the parameter list
res <- QueryBuilder(list_param)
#Get data from Google Analytics using the query and an OAuth token
#(oauth_token is assumed to have been created beforehand)
df <- GetReportData(res, oauth_token, split_daywise = TRUE)
#Reorder results
#Check users in the first 30 days
head(df, 30)
As the 30 records above show, the only features or variables are the date and the number of users. This is a very simple data set, but it serves well to illustrate time series forecasting.
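For readers without access to a Google Analytics account, a synthetic stand-in with the same two columns can be useful for following along. The sketch below is not the blog's real data: the start date, the trend, the weekly profile, and the noise level are all assumptions chosen only to resemble the shape described in this article, and the YYYYMMDD date strings assume the format in which `ga:date` is usually returned.

```r
# Synthetic stand-in for the Google Analytics result (NOT real blog data)
set.seed(42)                                    # reproducible noise
n_days <- 126                                   # 18 full weeks
dates  <- seq(as.Date("2017-01-16"), by = "day", length.out = n_days)  # a Monday

# Hypothetical weekly profile, Monday..Sunday: busier midweek, quieter weekends
weekly <- c(-20, 15, 20, 60, 5, -45, -35)

users <- round(100 + 2 * seq_len(n_days) +           # upward trend
               rep(weekly, length.out = n_days) +    # weekly seasonality
               rnorm(n_days, sd = 15))               # irregular component
users <- pmax(users, 0)                         # user counts cannot be negative

df <- data.frame(date = format(dates, "%Y%m%d"), users = users,
                 stringsAsFactors = FALSE)
head(df, 30)
```

Any analysis run on this data frame only demonstrates the mechanics; the numbers will differ from those reported in the article.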
One of the most important steps in time series analysis is to plot the data and check for patterns and fluctuations in the series. Let's plot our daily users as a time series chart to check for trend or seasonality. Weekends are marked with a symbol:
#Process date and draw daily users
#(ga:date is assumed to arrive as a YYYYMMDD string)
df$date <- as.Date(df$date, '%Y%m%d')
df$wkd <- as.factor(weekdays(df$date))
ggplot(data = df, aes(date, users)) + geom_line()
Looking at the chart above of this blog's daily user data, the number of users appears to increase over time. The series starts with fewer than 100 users per day and rises to more than 400 users per day at certain points. There seems to be a general upward trend in users. We can also roughly identify peaks and troughs, or ups and downs, in the series. This pattern may be related to seasonal variation. In other words, the number of users who visit this blog each day appears to be seasonal to some extent. We can draw a simple box plot by day of the week to better visualize this pattern:
#Create the day of the week as a factor and plot it
df$wd <- factor(df$wkd)
ggplot(df, aes(x = wd, y = users)) + geom_boxplot()
As the figure above shows, Tuesday to Thursday are the most visited days and attract many users. On some weekdays there are more than 400 users. In fact, Thursday appears to contain many outliers compared with the other days of the week. By contrast, Saturday, Sunday, and Monday attract the fewest users.
Therefore, revisiting the time series chart above, we can now say that users tend to visit this blog more during the week (peaking on Thursday) and less on weekends (Sunday). This is the seasonal variation we observed earlier.
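The day-of-week pattern can also be summarized numerically rather than read off a box plot. The following is a minimal sketch in base R on synthetic data (the weekly profile and noise level are assumptions, not the blog's figures); it uses the numeric weekday from `as.POSIXlt()` so that the result does not depend on the system locale.

```r
set.seed(42)
dates  <- seq(as.Date("2017-01-16"), by = "day", length.out = 126)  # a Monday
weekly <- c(-20, 15, 20, 60, 5, -45, -35)   # hypothetical Mon..Sun effect
users  <- 100 + 2 * seq_along(dates) + rep(weekly, length.out = 126) +
  rnorm(126, sd = 15)

# Locale-independent weekday: POSIXlt$wday runs from 0 (Sunday) to 6 (Saturday)
wday <- as.POSIXlt(dates)$wday
day  <- factor(wday, levels = c(1:6, 0),
               labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Average users for each day of the week
mean_by_day <- round(tapply(users, day, mean), 1)
mean_by_day
names(which.max(mean_by_day))   # busiest day in this synthetic data
```

Because the series also trends upward, a per-day mean mixes trend and seasonality; the decomposition discussed next separates the two.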
In time series analysis, we typically decompose the observed time series into three basic components: trend, seasonality, and irregular.
We do this to observe the characteristics of the series and to separate signal from noise. We decompose time series to recognize patterns, make estimates, model the data, and improve our ability to understand what is happening and predict future behavior. Decomposing a time series enables us to fit the model that best describes its behavior.
The trend component is the long-term direction of the time series, reflecting the underlying level or pattern of the observations. In our example, the trend is upward, which reflects the increasing number of users visiting the blog.
The seasonal component captures systematic effects in the data that recur consistently in timing, magnitude, and direction. For example, here we see that users regularly peak from Wednesday to Thursday. Seasonality may be driven by many factors. In the retail industry, seasonality occurs around specific dates, such as long holidays and Double 11 (Singles' Day). In our blog example, it seems that users visit more during the study / work week and less on weekends.
Irregularity, also known as residuals, is what remains after we remove trend and seasonal components. It reflects unpredictable fluctuations in the sequence.
Before we decompose the time series of this blog's daily users, we need to convert the queried user data frame into a time series object using R's ts() function:
#Convert data frame to time series object (7 observations per weekly cycle)
dfts <- ts(df$users, frequency = 7)
#Decompose the time series and show the results
dcmp <- decompose(dfts, type = "additive")
print(dcmp)
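To make the three components concrete, here is a small sketch on synthetic data (the trend, weekly profile, and noise are assumptions, not the blog's figures). It checks that the components returned by `decompose()` add back up to the observed series, which is what "additive" means here.

```r
set.seed(42)
n <- 126
weekly <- c(-20, 15, 20, 60, 5, -45, -35)   # hypothetical Mon..Sun effect
users  <- 100 + 2 * seq_len(n) + rep(weekly, length.out = n) + rnorm(n, sd = 15)

x   <- ts(users, frequency = 7)             # 7 observations per weekly cycle
dec <- decompose(x, type = "additive")

# The components add back up to the observed series. The moving-average
# trend is NA for the first and last 3 days, so those points are skipped.
rebuilt <- dec$trend + dec$seasonal + dec$random
ok <- !is.na(rebuilt)
max(abs(rebuilt[ok] - x[ok]))               # numerically zero

round(dec$figure, 1)                        # estimated effect for each of the 7 days
```

The `figure` element holds one seasonal effect per position in the cycle, centered so that the effects sum to zero.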
Usually, time series decomposition takes an additive or a multiplicative form. There are other forms of decomposition, but we won't cover them in this example.
In short, additive decomposition is used for time series in which the underlying level of the series fluctuates but the seasonal amplitude remains relatively stable. As the trend level changes over time, the amplitude of the seasonal and irregular components does not change significantly.
Multiplicative decomposition, on the other hand, is used when the amplitude of the seasonal and irregular components grows as the trend level rises.
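Written out, the two forms look as follows. The symbols are my own notation rather than the article's: $Y_t$ is the observed value at time $t$, $T_t$ the trend, $S_t$ the seasonal component, and $I_t$ the irregular component.

```latex
% Additive form: the seasonal swing has a roughly constant size
Y_t = T_t + S_t + I_t

% Multiplicative form: the seasonal swing scales with the trend level
Y_t = T_t \times S_t \times I_t
```

In the additive form the seasonal and irregular terms are in the same units as the series (users per day); in the multiplicative form they are factors around 1.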
We can see from the initial time series chart above that the seasonal amplitude remains broadly stable across the whole series, which suggests that additive decomposition is more appropriate.
One of our goals in this exercise is to fit a model so that we can extrapolate from the data and forecast the future users of this blog. Obviously, a key assumption in forecasting a time series is that the current trend will continue. In other words, in the absence of surprising changes or shocks, the overall trend should remain similar in the future (at least in the short term). We will also not consider any potential causes of the observed patterns (for example, a post from this blog becoming popular on a WeChat official account, which could drive many users to the blog).
When we forecast a time series, our aim is to predict a future value at a given time point, conditional on the history of past observations. Many considerations go into choosing the form of exponential smoothing used to fit a time series model. For simplicity, we will cover only one method here.
Therefore, given the seasonality of our time series and the use of additive decomposition, an appropriate smoothing method is Holt-Winters, which uses exponentially weighted moving averages to update its estimates.
Let's fit a forecasting model to the time series data:
#Apply the Holt-Winters model to the time series and check the fit
pred <- HoltWinters(dfts)
print(pred)
From the model fit above, we can see that Holt-Winters smoothing is carried out with three parameters: alpha, beta, and gamma. Alpha controls the estimate of the level component, beta the estimate of the slope of the trend component, and gamma the estimate of the seasonal component. These estimates are based on the latest time points in the series and are used for forecasting. Alpha, beta, and gamma range from 0 to 1, with values close to 0 indicating that recent observations carry little weight in the estimate.
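For reference, the additive Holt-Winters recursions can be written as below. This follows the form given in the documentation of R's `HoltWinters()`; the symbols are notation introduced here, with $a_t$ the level, $b_t$ the slope, $s_t$ the seasonal effect, $p$ the period length (7 for daily data with a weekly cycle), and $h$ the forecast horizon.

```latex
% Level, slope and seasonal updates
a_t = \alpha \, (Y_t - s_{t-p}) + (1 - \alpha)\,(a_{t-1} + b_{t-1})
b_t = \beta \, (a_t - a_{t-1}) + (1 - \beta)\, b_{t-1}
s_t = \gamma \, (Y_t - a_t) + (1 - \gamma)\, s_{t-p}

% h-step-ahead forecast made at time t
\hat{Y}_{t+h} = a_t + h \, b_t + s_{t - p + 1 + ((h - 1) \bmod p)}
```

Each update is a weighted average of new information (weight $\alpha$, $\beta$, or $\gamma$) and the previous estimate, which is why values near 0 make the estimates change slowly.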
From the results above, the smoothing estimate of alpha is 0.4524278, beta is 0.0211364, and gamma is 0.5593518. The value of alpha is about 0.5, which indicates that both recent observations and more distant historical observations contribute to the level estimate of the time series. The beta value is close to zero, which indicates that the slope of the trend component (the change in level from one time period to the next) stays relatively similar throughout the series. The value of gamma is fairly close to alpha, which indicates that the seasonal estimate is based on both recent and more distant observations.
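To see what a smoothing parameter near 0 or near 1 does in practice, consider the simplest case of exponential smoothing with a level only. The function `exp_smooth()` and the toy series below are hypothetical illustrations, not part of the Holt-Winters fit above.

```r
# Simple exponential smoothing: level_t = alpha * x_t + (1 - alpha) * level_{t-1}
exp_smooth <- function(x, alpha) {
  level <- numeric(length(x))
  level[1] <- x[1]                       # start from the first observation
  for (t in seq_along(x)[-1]) {
    level[t] <- alpha * x[t] + (1 - alpha) * level[t - 1]
  }
  level
}

x <- c(100, 100, 100, 200, 200, 200)     # hypothetical series with a jump

exp_smooth(x, alpha = 0.1)               # reacts slowly to the jump
exp_smooth(x, alpha = 0.9)               # follows the jump almost immediately
```

With a small alpha the estimate leans on history; with a large alpha it leans on the latest observation. A value around 0.5, like the one estimated here, sits between the two.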
The following plot of the model fit (red) against the actual observations (black) helps illustrate the results:
#Draw the model fit
plot(pred)
We can now extrapolate from the model to forecast the future users of this blog:
#Forecast future users
forecast(pred, h=28)
In the figure above, the thick bright blue line shows the forecast of users visiting this blog over roughly the next month. The dark blue shaded area represents the 80% prediction interval, and the light blue shaded area represents the 95% prediction interval. As we can see, the model generally captures the observed patterns and does a relatively good job of estimating the number of future users of this blog.
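The same kind of forecast with prediction intervals can be sketched in base R with `predict()`, as an alternative to the forecast package. The data below is synthetic, and the smoothing parameters are fixed at values close to those reported above rather than estimated; both are assumptions made so the example is reproducible.

```r
set.seed(42)
n <- 126
weekly <- c(-20, 15, 20, 60, 5, -45, -35)   # hypothetical Mon..Sun effect
users  <- 100 + 2 * seq_len(n) + rep(weekly, length.out = n) + rnorm(n, sd = 15)
x <- ts(users, frequency = 7)

# Smoothing parameters fixed near the values reported in the article
hw <- HoltWinters(x, alpha = 0.45, beta = 0.02, gamma = 0.56)

# 28-day-ahead forecast with 80% and 95% prediction intervals
p80 <- predict(hw, n.ahead = 28, prediction.interval = TRUE, level = 0.80)
p95 <- predict(hw, n.ahead = 28, prediction.interval = TRUE, level = 0.95)

fc <- data.frame(fit   = as.numeric(p95[, "fit"]),
                 lwr80 = as.numeric(p80[, "lwr"]),
                 upr80 = as.numeric(p80[, "upr"]),
                 lwr95 = as.numeric(p95[, "lwr"]),
                 upr95 = as.numeric(p95[, "upr"]))
head(fc)
```

The point forecast is the same at both levels; only the width of the band changes, with the 95% interval always the wider of the two.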
It is good practice to check the accuracy of the forecasting model. First, let's look at the sum of squared errors (SSE), which measures the gap between our model and the actually observed data. The errors are squared to avoid negative values and to give more weight to large differences. A value closer to 0 is better. In our example, the SSE of our model is 9.8937336 × 10^4.
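The SSE can be reproduced by hand from the one-step-ahead errors, which makes it clearer what the number measures. This sketch again uses synthetic data and fixed smoothing parameters (both assumptions), so its SSE will not match the value quoted above.

```r
set.seed(42)
n <- 126
weekly <- c(-20, 15, 20, 60, 5, -45, -35)   # hypothetical Mon..Sun effect
users  <- 100 + 2 * seq_len(n) + rep(weekly, length.out = n) + rnorm(n, sd = 15)
x  <- ts(users, frequency = 7)
hw <- HoltWinters(x, alpha = 0.45, beta = 0.02, gamma = 0.56)

# One-step-ahead errors: observed minus fitted.
# The first week is used for start-up, so fitted values begin at day 8.
err <- as.numeric(x)[-(1:7)] - as.numeric(hw$fitted[, "xhat"])
sse_manual <- sum(err^2)

c(reported = hw$SSE, manual = sse_manual)   # the two should agree
sqrt(sse_manual / length(err))              # RMSE, in users per day
```

Because the SSE grows with the length of the series, the root mean squared error on the last line is often easier to interpret: it is in the same units as the data.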
It is also important to check autocorrelation. Generally speaking, autocorrelation assesses whether a time series is correlated with lagged copies of itself. It is like copying the time series many times and shifting each copy forward by one step to see whether a pattern emerges. Applied to the model errors, autocorrelation is used to look for patterns remaining in the residuals. This is what the plot below does.
#Correlation diagram and histogram of model residuals
acf(residuals[8:length])
Without spending too much time on the correlogram itself, we can note that although the model works relatively well, it does not work as well in the early stages of the series. In the acf() output, the autocorrelation values (y-axis) that stand out are at lags 2 and 3.
In addition, the histogram shows that the model's residuals deviate to some degree from a normal distribution.
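These residual checks can also be done numerically. The sketch below uses synthetic data and fixed smoothing parameters (assumptions, as before) and relies only on base R: `acf()` for the autocorrelations, `Box.test()` for a joint Ljung-Box test, and `shapiro.test()` as one possible normality check in place of reading a histogram.

```r
set.seed(42)
n <- 126
weekly <- c(-20, 15, 20, 60, 5, -45, -35)   # hypothetical Mon..Sun effect
users  <- 100 + 2 * seq_len(n) + rep(weekly, length.out = n) + rnorm(n, sd = 15)
x  <- ts(users, frequency = 7)
hw <- HoltWinters(x, alpha = 0.45, beta = 0.02, gamma = 0.56)

# Residuals start at day 8 because the first week has no fitted values
res <- as.numeric(x)[-(1:7)] - as.numeric(hw$fitted[, "xhat"])

ac    <- acf(res, lag.max = 14, plot = FALSE)
bound <- qnorm(0.975) / sqrt(length(res))        # approximate 95% band for white noise
lags_outside <- which(abs(ac$acf[-1]) > bound)   # drop lag 0, which is always 1
lags_outside

Box.test(res, lag = 14, type = "Ljung-Box")      # joint test for autocorrelation
shapiro.test(res)                                # normality check for the residuals
```

Lags listed in `lags_outside` are the ones that would poke through the dashed band on a correlogram; a few such lags suggest structure the model has not captured.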