Monte Carlo analysis of web page views

Time: 2020-10-31

By Michael Grogan
Compile | VK
Source: Towards Data Science

The Monte Carlo method has been widely used in finance and other fields to model various risk scenarios.

However, the method also has important applications in other areas of time series analysis. In this example, let’s look at how it can be used to model web page views.

The time series used here comes from Wikimedia Toolforge and records the daily page views for the term “medical” on Wikipedia from January 2019 to July 2020.

The series shows significant day-to-day volatility, along with the occasional sharp “peaks” typical of such data, that is, days on which the number of searches for the term is unusually high.

It is usually futile to try to forecast such a time series directly. It is not statistically possible to predict when searches for a term will peak, because the peaks are driven by events that are independent of past data. For example, a major health-related news event can cause a spike in searches for the word.

What is more interesting, however, is that we can build a simulation to analyze many potential scenarios for the page view statistics and estimate how high or low the page views for this search term could go under unusual conditions.

Probability distribution

When running Monte Carlo simulations, it is important to note the type of distribution used.

Since page views cannot be negative, we assume that the distribution is positively skewed.

The following is a histogram of the data:

The distribution is positively skewed, with several outliers stretching the tail of the distribution to the right.

>>> series = value  # value holds the daily page views
>>> skewness = series.skew()
>>> print("Skewness:")
>>> print(round(skewness, 2))
Skewness:
0.17

The skewness of this distribution is 0.17.
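As a minimal sketch of the skewness check above, the snippet below applies the same pandas `skew()` call to synthetic data. The series `views` and its parameters are assumptions chosen only for illustration; they stand in for the real page views, which are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for daily page views (assumed values, not the article's data).
rng = np.random.default_rng(42)
views = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=500))

# Sample skewness: a positive value indicates a longer right tail.
skewness = views.skew()
print("Skewness:", round(skewness, 2))
```

A value above zero means the right tail is longer than the left, which is what a few very high-traffic days would produce.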

The Q-Q plot shows that most of the values follow a normal distribution, except for the outliers.

However, given the positive skewness, the data is more likely to follow a lognormal distribution. Converting the data to logarithmic form should therefore yield a distribution that is closer to normal.

>>> import numpy as np
>>> # logvalue holds the log-transformed page views
>>> mu = np.mean(logvalue)
>>> sigma = np.std(logvalue)
>>> x = mu + sigma * np.random.lognormal(mu, sigma, 10000)
>>> num_bins = 50

This is the distribution of the log-transformed data, which is more representative of a normal distribution.

In addition, the skewness of this distribution is now -0.41.

>>> logvalue = pd.Series(logvalue)
>>> logseries = logvalue
>>> skewness = logseries.skew()
>>> print("Skewness:")
>>> print(round(skewness, 2))
Skewness:
-0.41

This indicates a slight negative skewness, but the Q-Q plot still suggests an approximately normal distribution.
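The effect of the log transform can be sketched on synthetic data as follows. The series `views` is an assumed stand-in for the real page views, and the natural logarithm is also an assumption, since the article does not state which base was used. In place of a Q-Q plot, the snippet computes the correlation between the sorted log values and the theoretical normal quantiles, which should be close to 1 when the data is approximately normal.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

# Synthetic, positively skewed stand-in for daily page views (assumed data).
rng = np.random.default_rng(0)
views = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=1000))

# Natural log transform (the base is an assumption).
log_views = np.log(views)

raw_skew = views.skew()
log_skew = log_views.skew()
print("Raw skewness:", round(raw_skew, 2))
print("Log skewness:", round(log_skew, 2))

# Numerical counterpart of a Q-Q plot: compare sorted sample values
# with the standard normal quantiles at the same plotting positions.
n = len(log_views)
theoretical = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
qq_corr = np.corrcoef(theoretical, np.sort(log_views.to_numpy()))[0, 1]
print("Q-Q correlation:", round(qq_corr, 4))
```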

Monte Carlo simulation

Now that the data has been transformed appropriately, Monte Carlo simulations can be generated to analyze the potential range of outcomes for the page view statistics. In line with the chosen distribution, page views are expressed in logarithmic form.

First, the mean and volatility (measured by the standard deviation) of the time series are calculated.

>>> mu = np.mean(logvalue)    # mean of the log page views
>>> sigma = np.std(logvalue)  # volatility (standard deviation)
>>> x = mu + sigma * np.random.lognormal(mu, sigma, 10000)
>>> num_bins = 50             # number of histogram bins

Next, the array x is defined by generating 10,000 random numbers using mu and sigma. These random numbers follow a lognormal distribution with the defined mean and standard deviation.

array([5.21777304, 5.58552424, 5.39748092, ..., 5.27737933, 5.42742056, 5.52693816])
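For comparison, one way to simulate values on the log scale is to draw directly from a normal distribution with the fitted mean and standard deviation, then exponentiate to get back to page-view units. The sketch below does this on synthetic data. It is not the article's exact formula, and `log_views`, its parameters, and the use of the natural logarithm are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stand-in for the article's logvalue (log of daily page views).
log_views = rng.normal(loc=5.4, scale=0.1, size=500)

mu = np.mean(log_views)
sigma = np.std(log_views)

# 10,000 simulated log page views drawn from a normal distribution
# with mean mu and standard deviation sigma.
sim_log = rng.normal(mu, sigma, 10000)

# Exponentiating gives lognormally distributed page views
# (this assumes the natural log was used for the transform).
sim_views = np.exp(sim_log)

print(sim_log[:3])
print(sim_views[:3])
```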

Now, let’s draw the histogram.

Again, these values are expressed in logarithmic form, and the shape resembles a normal distribution. As mentioned earlier, the idea of the Monte Carlo simulation is not to predict web page views per se, but to provide estimates of page views across many different simulations in order to determine:

  • 1) The range of most web page views;
  • 2) The range of extreme values in a distribution.
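Both ranges can be read off the simulated values with percentiles. The sketch below uses assumed log-scale parameters (`mu`, `sigma`) and a plain normal draw as a stand-in for the simulation output. The 5th/95th and 1st/99th percentile cut-offs are illustrative choices, not values taken from the article.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed log-scale parameters; in practice these come from the log page views.
mu, sigma = 5.4, 0.1
sim_log = rng.normal(mu, sigma, 10000)

# 1) Range containing most simulated values: the central 90%.
low, high = np.percentile(sim_log, [5, 95])

# 2) Extreme values: the 1st and 99th percentiles.
extreme_low, extreme_high = np.percentile(sim_log, [1, 99])

print("Central 90% (log scale):", round(low, 2), "to", round(high, 2))
print("1st and 99th percentiles (log scale):",
      round(extreme_low, 2), "and", round(extreme_high, 2))

# Converting back to page views (natural log assumed).
print("Central 90% (page views):", round(np.exp(low)), "to", round(np.exp(high)))
```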

Conclusion

In this article, you have seen:

  • Application of Monte Carlo simulation

  • The role of skewness in defining distribution

  • How to use simulation to identify the probability of obtaining extreme values

Link to the original text: https://towardsdatascience.com/monte-carlo-simulations-in-python-analysing-web-page-views-b6dbec2ba683
