# Pairs Trading in Python: A Statistical Arbitrage Strategy

Date: 2022-05-24

When it comes to making money in the stock market, there are countless approaches. And it seems that everywhere you go in the financial world, people tell you to learn Python. After all, Python is a popular programming language used across many fields, including data science. A rich ecosystem of packages supports it, and many companies use Python to build data-centric applications and the scientific computing the financial community relies on.

Most importantly, Python lets us take advantage of trading strategies that would (without it) be difficult to analyze by hand or in a spreadsheet. One such strategy is called pairs trading.

# Pairs trading

Pairs trading is a form of _mean reversion_ with the distinctive advantage that it is always hedged against market movements. The strategy is grounded in mathematical analysis.

The principle is as follows. Suppose we have a pair of securities X and Y with some underlying economic link, for example two companies that produce the same product, or two companies in the same supply chain. If we can model this economic link mathematically, we can trade on it.

To understand pairs trading, we need to understand three mathematical concepts: stationarity, differencing, and cointegration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, coint
```

# Stationarity / non-stationarity

Stationarity is the most commonly untested assumption in time series analysis. We generally say a series is stationary when the parameters of its data-generating process do not change over time. Consider two series, A and B: series A is generated with fixed parameters and is therefore stationary, while the parameters of series B change over time, so B is non-stationary.

We will create a function that draws data points from a probability density function, here the Gaussian. The probability density of the Gaussian distribution is:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation; the square of the standard deviation, $\sigma^2$, is the variance. The rule of thumb states that about 68% of the data should lie between $\mu - \sigma$ and $\mu + \sigma$, which means the `normal` sampler is more likely to return samples close to the mean than samples far from it.

```python
def generate_datapoint(params):
    mu = params[0]
    sigma = params[1]
    return np.random.normal(mu, sigma)
```

From there, we can create two graphs showing stationary and non-stationary time series.

```python
# Set the parameters and the number of data points
params = (0, 1)
T = 100

A = pd.Series(index=range(T), dtype=float, name='A')
B = pd.Series(index=range(T), dtype=float, name='B')

for t in range(T):
    A[t] = generate_datapoint(params)
    # Now the parameters depend on time:
    # specifically, the mean of series B changes over time
    B[t] = generate_datapoint((t * 0.1, 1))

fig, ax = plt.subplots()
ax.plot(A)
ax.plot(B)
ax.legend(['Series A', 'Series B'])
plt.show()
```

# Why stationarity matters

Many statistical tests require that the data being tested be stationary. Using certain statistics on a non-stationary data set may produce garbage results. As an example, let's take the mean of our non-stationary series B.

```python
m = np.mean(B)

plt.figure()
plt.plot(B)
plt.hlines(m, 0, len(B), linestyles='dashed', colors='r')
plt.legend(['Series B', 'Mean'])
plt.show()
```

The calculated mean is the average of all the data points, but it is useless for predicting any future state of the series. Compared with any specific point in time it is meaningless, because it mixes together different states at different times. This is just one simple, clear example of how non-stationarity distorts analysis; subtler problems arise in practice.

# Testing for stationarity: the augmented Dickey-Fuller (ADF) test

To test for stationarity, we need to test for something called a _unit root_. The autoregressive unit root test is based on the hypothesis test

$$H_0: \phi = 1 \;\text{(unit root, non-stationary)} \quad \text{vs.} \quad H_1: |\phi| < 1 \;\text{(stationary)}$$

for the AR(1) model $y_t = \phi\, y_{t-1} + \epsilon_t$. It is called a unit root test because, under the null hypothesis, the autoregressive polynomial of $y_t$, $\phi(z) = 1 - \phi z = 0$, has a root equal to 1. Under the alternative hypothesis, the series is trend-stationary. If $\phi = 1$, we difference the series once, and it becomes

$$\Delta y_t = y_t - y_{t-1} = \epsilon_t,$$

which is stationary.

The test statistic is

$$t_{\phi=1} = \frac{\hat{\phi} - 1}{\widehat{SE}(\hat{\phi})}$$

where $\hat{\phi}$ is the least squares estimate and $\widehat{SE}(\hat{\phi})$ is the usual standard error estimate. The test is a one-sided left-tail test. If $\{y_t\}$ is stationary, it can be shown that $\hat{\phi}$ is a consistent estimator of $\phi$; under the null hypothesis of non-stationarity, however, the statistic follows the non-standard Dickey-Fuller distribution rather than a normal distribution. The following function will let us check stationarity using the augmented Dickey-Fuller (ADF) test.

```python
def stationarity_test(X, cutoff=0.01):
    # H_0 in the ADF test is that a unit root exists (non-stationary).
    # We must observe a significant p-value to conclude the series is stationary.
    pvalue = adfuller(X)[1]
    verdict = 'stationary' if pvalue < cutoff else 'non-stationary'
    print('p-value = %s; the series %s is likely %s.' % (pvalue, X.name, verdict))

stationarity_test(A)
stationarity_test(B)
```

As we can see, based on the test statistic for time series A (and its corresponding p-value), we can reject the null hypothesis of a unit root, so series A is likely stationary. For series B, on the other hand, the null hypothesis cannot be rejected, so that time series is likely non-stationary.

# Cointegration

Correlation between financial quantities is notoriously unstable. Nevertheless, correlation is routinely used in almost every aspect of diversified finance. A related statistical measure is cointegration. It may be a more robust measure of the link between two financial quantities, but so far there is little derivatives theory based on the concept.

Two stocks may be perfectly correlated over the short term yet diverge in the long run, one growing while the other falls. Conversely, two stocks may track each other, never drifting more than a certain distance apart, even while their correlation flips between positive and negative. If we trade over the short term, correlation matters; if we hold the stocks in a portfolio for a long time, it matters much less.

Below we construct two examples of cointegrated series and then plot the difference between them.

```python
# Generate daily returns
X_returns = np.random.normal(0, 1, 100)

# Sum them to build a price-like series
X = pd.Series(np.cumsum(X_returns), name='X')

X.plot()

# Y is X plus an offset and some noise, so X and Y are cointegrated
some_noise = np.random.normal(0, 1, 100)
Y = X + 6 + some_noise
Y.name = 'Y'
Y.plot()
plt.legend(['X', 'Y'])
plt.show()
```

```python
(Y - X).plot()  # plot the difference
plt.axhline((Y - X).mean(), color='red', linestyle='--')  # add the mean
plt.xlabel('Time')
plt.legend(['Y - X', 'Mean'])
plt.show()
```

# Cointegration test

The cointegration testing procedure has two steps:

1. Check each component series for a unit root, using a univariate unit root test such as the ADF or PP test.
2. If the unit-root null cannot be rejected, test for a cointegrating relationship between the components, i.e. test whether the residual $u_t = y_{1t} - \beta_0 - \beta_1 y_{2t}$ is I(0).

If we find that each series has a unit root, we continue with the cointegration procedure. There are three main cointegration tests: Johansen, Engle-Granger, and Phillips-Ouliaris. We will mainly use the Engle-Granger test.

Let's consider the regression model

$$y_{1t} = \beta_0 + \beta_1 y_{2t} + u_t$$

in which $\beta_0$ is a deterministic term. The hypothesis test is

$$H_0: u_t \sim I(1) \;\text{(no cointegration)} \quad \text{vs.} \quad H_1: u_t \sim I(0) \;\text{(cointegration)}$$

with normalized cointegrating vector $(1, -\beta_1)$.

We then use the residuals $\hat{u}_t$ for a unit root test. This hypothesis test applies to the model

$$\Delta \hat{u}_t = \pi\, \hat{u}_{t-1} + \sum_{j=1}^{p} \psi_j\, \Delta \hat{u}_{t-j} + \epsilon_t$$

with the test statistic

$$t_{\pi=0} = \frac{\hat{\pi}}{\widehat{SE}(\hat{\pi})}.$$

Now that we understand what it means for two time series to be cointegrated, we can test for it and measure it in Python:

```python
score, pvalue, _ = coint(X, Y)
print(pvalue)

# A low p-value means high cointegration!
```

# Correlation and Cointegration

Correlation and cointegration are conceptually similar, but they are not the same. To demonstrate this, we can look at two examples of time series that are correlated but not cointegrated.

A simple example is two series that diverge.

```python
X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)

X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')

pd.concat([X_diverging, Y_diverging], axis=1).plot()
plt.show()
```

Next, we can print the correlation coefficient and the cointegration test p-value.

As we can see, there is a very strong correlation between series X and Y. However, the p-value of our cointegration test comes out to 0.7092, which means there is no cointegration between the two time series.

Another example of this is a normally distributed series and a square wave.

```python
Y2 = pd.Series(np.random.normal(0, 1, 800), name='Y2') + 20
Y3 = Y2.copy()
# Overwrite Y3 with a square wave that oscillates around Y2
for start in range(0, 800, 200):
    Y3[start:start + 100] = 30
    Y3[start + 100:start + 200] = 10

plt.figure()
Y2.plot()
Y3.plot()
plt.show()

# The correlation is almost zero
print('Correlation: ' + str(Y2.corr(Y3)))
score, pvalue, _ = coint(Y2, Y3)
print('Cointegration test p-value: ' + str(pvalue))
```


Although the correlation is very low, the p value indicates that these time series are cointegrated.

```python
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf
yf.pdr_override()
```

# Data Science in trading

Before we begin, let's define a function that makes it easy to find cointegrated pairs using the concepts we have covered.

```python
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            result = coint(data[keys[i]], data[keys[j]])
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
```

We are looking at a group of technology companies to see whether any of them are cointegrated. We will first define the list of securities we want to examine, and then pull the closing prices of each security from 2013 to 2018.

We want to test whether a cointegrating relationship exists within the securities of a single industry. This produces a much smaller multiple-comparison bias than searching through hundreds of securities, and only slightly more than forming a hypothesis for an individual test.
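To make the multiple-comparison point concrete, here is a small illustration (mine, not from the original article) of a Bonferroni-style correction: with n securities there are n(n-1)/2 pairwise tests, so the per-test significance threshold shrinks quickly as the universe grows.

```python
def bonferroni_cutoff(n_securities, alpha=0.05):
    # Each unordered pair of securities gets one cointegration test
    n_tests = n_securities * (n_securities - 1) // 2
    # Bonferroni: divide the significance level by the number of tests
    return alpha / n_tests

# 5 tickers -> 10 pairwise tests -> per-test threshold of roughly 0.005;
# scanning 100 securities would mean 4950 tests and a far stricter cutoff
print(bonferroni_cutoff(5))
print(bonferroni_cutoff(100))
```

A fixed 0.05 cutoff, as used in `find_cointegrated_pairs` above, will therefore produce some false positives when many pairs are scanned.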

```python
import datetime

# Illustrative ticker list (the article's full list isn't shown)
tickers = ['AAPL', 'ADBE', 'EBAY', 'MSFT', 'SYMC']

start = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2018, 1, 1)

df = pdr.get_data_yahoo(tickers, start, end)['Close']
df.tail()
```

```python
# Heatmap of cointegration-test p-values between each pair of stocks;
# only pairs with p-values below 0.05 are displayed
scores, pvalues, pairs = find_cointegrated_pairs(df)
import seaborn
seaborn.heatmap(pvalues, xticklabels=tickers, yticklabels=tickers,
                cmap='RdYlGn_r', mask=(pvalues >= 0.05))
plt.show()
print(pairs)
```

Our algorithm found two cointegrated pairs: AAPL/EBAY and ADBE/MSFT. We can analyze their patterns.

```python
S1 = df['ADBE']
S2 = df['MSFT']
score, pvalue, _ = coint(S1, S2)
print(pvalue)
```

As we can see, the p-value is less than 0.05, which means ADBE and MSFT are indeed a cointegrated pair.

# Calculate price difference

Now we can plot the spread between the two time series. To actually compute the spread, we use linear regression to obtain the coefficient of the linear combination of our two securities, as in the Engle-Granger method mentioned earlier.

```python
import statsmodels.api as sm

S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['ADBE']
b = results.params['ADBE']

spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread'])
plt.show()
```

Alternatively, we can examine the ratio between the two time series:

```python
ratio = S1 / S2
ratio.plot()
plt.axhline(ratio.mean(), color='black')
plt.legend(['Ratio'])
plt.show()
```

Whether we use the spread method or the ratio method, we can see that our plot for ADBE/MSFT tends to move around the mean. We now need to standardize this ratio, because the absolute ratio is not the best way to analyze this trend. For that, we use z-scores.

The z-score measures how many standard deviations a data point lies above or below the population mean. It is computed as:

$$z = \frac{x - \mu}{\sigma}$$

```python
def zscore(series):
    return (series - series.mean()) / np.std(series)

zscore(ratio).plot()
plt.axhline(zscore(ratio).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Ratio z-score', 'Mean', '+1', '-1'])
plt.show()
```

By placing two more lines at z-scores of +1 and -1, we can clearly see that, most of the time, any large deviation from the mean eventually converges back to it. This is exactly the behavior a pairs trading strategy wants.

# Transaction signal

When running any kind of trading strategy, it is important to clearly define the points at which you will actually trade. For example, what is the best indicator that we need to buy or sell a particular stock?

# Set rules

We will use the ratio time series we created to see whether it tells us to buy or sell at a particular moment in time. We start by creating a prediction variable Y: if the ratio's next move is positive, Y is "buy"; otherwise it is "sell".

The advantage of pairs trading signals is that we don't need absolute information about the price trend; we only need to know its direction: up or down.

# Train/test split

When training and testing models, a 70/30 or 80/20 split is common. A time series of only 252 points (the number of trading days in a year) is not much, so before training and splitting the data we will add more data points to each time series.

```python
ratios = df['ADBE'] / df['MSFT']
print(len(ratios) * 0.70)
```

```python
train = ratios[:881]
test = ratios[881:]
```

# Feature engineering

We need to find out which features actually matter in determining the direction the ratio moves. Knowing that the ratio eventually reverts to the mean, moving averages and other mean-related indicators seem promising.

Let’s try:

• 60 day moving average
• 5-day moving average
• 60 day standard deviation
• Z score
```python
ratios_mavg5 = train.rolling(window=5, center=False).mean()
ratios_mavg60 = train.rolling(window=60, center=False).mean()
std_60 = train.rolling(window=60, center=False).std()
zscore_60_5 = (ratios_mavg5 - ratios_mavg60) / std_60

plt.figure()
plt.plot(train.index, train.values)
plt.plot(ratios_mavg5.index, ratios_mavg5.values)
plt.plot(ratios_mavg60.index, ratios_mavg60.values)
plt.legend(['Ratio', '5d Ratio MA', '60d Ratio MA'])
plt.ylabel('Ratio')
plt.show()
```

```python
plt.figure()
zscore_60_5.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Rolling ratio z-score', 'Mean', '+1', '-1'])
plt.show()
```

# Create model

A standard normal distribution has mean 0 and standard deviation 1. From the figure it is clear that whenever the time series moves more than one standard deviation from the mean, it tends to revert. Using this, we can create the following trading signals:

• Whenever the z-score is below -1, buy the ratio (+1), meaning we expect it to increase.
• Whenever the z-score is above +1, sell the ratio (-1), meaning we expect it to decrease.
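The two rules can be sketched as a tiny signal function; the helper name and the encoding (+1 buy, -1 sell, 0 flat) are my own illustration:

```python
def trade_signal(z):
    """Map a ratio z-score to a position signal (illustrative helper)."""
    if z < -1:
        return 1   # buy the ratio: expect it to rise back to the mean
    if z > 1:
        return -1  # sell the ratio: expect it to fall back to the mean
    return 0       # otherwise, no position change

print([trade_signal(z) for z in (-1.5, 0.2, 1.5)])  # [1, 0, -1]
```

Thresholding like this is the simplest possible rule; the exit band used later in the article (clearing positions near zero) refines it.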

# Training optimization

We can now apply our model to actual data:

```python
buy = train.copy()
sell = train.copy()
buy[zscore_60_5 > -1] = 0
sell[zscore_60_5 < 1] = 0
train[160:].plot()
buy[160:].plot(color='g', linestyle='None', marker='^')
sell[160:].plot(color='r', linestyle='None', marker='v')
plt.show()
```

```python
plt.figure()
S1 = df['ADBE'].iloc[:881]
S2 = df['MSFT'].iloc[:881]
S1[60:].plot(color='b')
S2[60:].plot(color='c')

buyR = 0 * S1.copy()
sellR = 0 * S1.copy()

# When you buy the ratio, you buy stock S1 and sell S2
buyR[buy != 0] = S1[buy != 0]
sellR[buy != 0] = S2[buy != 0]

# When you sell the ratio, you sell stock S1 and buy S2
buyR[sell != 0] = S2[sell != 0]
sellR[sell != 0] = S1[sell != 0]

buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='v')
plt.show()
```

Now we can clearly see when we should buy or sell the corresponding stocks.

Now, how much can we expect from this strategy?

```python
# Trade using a simple strategy
def trade(S1, S2, window1, window2):
    # If either window length is 0, the algorithm is meaningless, so exit
    if (window1 == 0) or (window2 == 0):
        return 0

    # Compute the rolling mean and rolling standard deviation
    ratios = S1 / S2
    ma1 = ratios.rolling(window=window1, center=False).mean()
    ma2 = ratios.rolling(window=window2, center=False).mean()
    std = ratios.rolling(window=window2, center=False).std()
    zscore = (ma1 - ma2) / std

    # Simulate trading: start with no money and no positions
    money = 0
    countS1 = 0
    countS2 = 0
    for i in range(len(ratios)):
        # If the z-score is > 1, sell short
        if zscore[i] > 1:
            money += S1[i] - S2[i] * ratios[i]
            countS1 -= 1
            countS2 += ratios[i]
        # If the z-score is < -1, buy long
        elif zscore[i] < -1:
            money -= S1[i] - S2[i] * ratios[i]
            countS1 += 1
            countS2 -= ratios[i]
        # Clear positions if the z-score is between -0.75 and 0.75
        elif abs(zscore[i]) < 0.75:
            money += S1[i] * countS1 + S2[i] * countS2
            countS1 = 0
            countS2 = 0
    return money

trade(df['ADBE'].iloc[881:], df['MSFT'].iloc[881:], 5, 60)
```

That is a respectable profit for such a simply formulated strategy.

# Areas for improvement and further steps

This is by no means a perfect strategy, and our implementation of it is not the best. However, several things could be improved.

1. Use more securities and a more diversified time horizon

For the cointegration test of the pairs trading strategy, I only used a handful of stocks. Naturally (and in practice) it is more effective to use clusters within an industry. I also used a time frame of only five years, which may not be representative of stock market volatility.

2. Dealing with overfitting

Anything involving data analysis and model training is deeply entangled with the problem of overfitting. There are many ways to deal with overfitting, such as validation methods, Kalman filters, and other statistical techniques.

3. Adjusting the trading signals

Our trading algorithm does not account for stock prices that overlap and cross each other. Since the code only decides to buy or sell based on the ratio, it never considers which stock is actually higher or lower.

4. More advanced methods

This is just the tip of the iceberg of algorithmic trading. What we did here is simple because it only deals with moving averages and ratios. More complex approaches include topics such as the Hurst exponent, the half-life of mean reversion, and Kalman filters.
