The simplest and fastest worldquant 101 alpha factor implementation

Time:2021-8-5

The simplest and fastest worldquant 101 alpha factor implementation

Dolphin DB database is a high-performance distributed time series database. It is especially suitable for quantitative query and analysis of investment banks, hedge funds and exchanges. It can be used to build strategy test based on historical data. Next, we will illustrate how to quickly build complex alpha factors in dolphin dB.

The famous paper 101 formal alpha gives 101 alpha factor formulas used by worldquant, one of the world’s top quantitative hedge funds. Many individuals and institutions try to implement these 101 alpha factors in different languages. In this paper, we illustrate the implementation of the simpler alpha #001 and the more complex alpha #098 two factors, using2Line sum4This line of dolphin DB SQL code is the simplest in history.

Factor introduction

Alpha#001 formula: rank (TS)_ ArgMax(SignedPower((returns<0? stddev(returns,20):close), 2), 5))-0.5

For detailed interpretation of alpha #001, please refer to[most detailed in history] worldquant alpha 101 factor series #001 research

Alpha #98 formula: (rank (decay)_ linear(correlation(vwap, sum(adv5,26.4719), 4.58418), 7.18088))- rank(decay_ linear(Ts_ Rank(Ts_ ArgMin(correlation(rank(open), rank(adv15), 20.8187), 8.62571), 6.95668) ,8.07206)))

Both cross sectional information and a large number of time series are used in the calculation of these two factors. That is, when calculating the factor of a stock on a certain day, we need to use not only the historical data of the stock, but also the information of all stocks on that day, so the amount of calculation is very large.

Required data

The input data is a table containing the following fields:

Symbol: stock code

Date: Date

Volume: trading volume

VWAP: weighted average price of trading volume

Open: opening price

Close: closing price

When calculating the alpha #001 factor, you only need three fields: stock code, date and closing price.

code implementation

def alpha1(stock){
    t= select date,symbol ,mimax(pow(iif(ratios(close) < 1.0, mstd(ratios(close) - 1, 20),close), 2.0), 5) as maxIndex from stock context by symbol
    return select date,symbol, rank(maxIndex) - 0.5 as A1 from t context by date
}

def alpha98(stock){
    t = select symbol, date, vwap, open, mavg(volume, 5) as adv5, mavg(volume,15) as adv15 from stock context by symbol
    update t set rank_open = rank(open), rank_adv15 = rank(adv15) context by date
    update t set decay7 = mavg(mcorr(vwap, msum(adv5, 26), 5), 1..7), decay8 = mavg(mrank(9 - mimin(mcorr(rank_open, rank_adv15, 21), 9), true, 7), 1..8) context by symbol
    return select symbol, date, rank(decay7)-rank(decay8) as A98 from t context by date 
}

The above code uses some built-in functions of dolphin DB:

POW: calculate exponential power

IIF: conditional operation function

Ratios: calculate x (n) x (n-1) for each element of vector X

MSTd: calculate standard deviation in sliding window

Mavg: calculates the average in a sliding window

Mcorr: calculate correlation in sliding window

Msum: sum in sliding window

Mrank: returns the position of elements in the sliding window sorted in ascending or descending order

Mimin: returns the index position of the minimum value in the sliding window

Mimax: returns the index position of the maximum value in the sliding window.

All core codes are implemented in SQL, which is very convenient for users and readable. The key function in SQL is the grouping calculation function implemented by the context by clause. Unlike group by, which generates one line of records for each group, context by will output records with the same number of lines as the input, which is convenient for nesting multiple functions. In cross sectional calculation, we use date to group. In time series calculation, we use symbol (stock code) to group.

performance analysis

We use a total of 11711 stocks in the U.S. stock market from 2007 to 2016 for back testing. Each stock generates a factor value every trading day, resulting in a total of 17 million factor values. The test machine configuration is as follows:

CPU: Intel Core i7-9700 @ 3.0 GHz

Memory: 32GB

Operating system: Ubuntu 18.04.4

Using single thread calculation, the alpha #001 factor takes only 1.5 seconds, and the complex alpha #098 factor takes only 4.3 seconds. Using pandas calculation, alpha #98 takes 760.1 seconds, and the performance difference is more than 150 times. See the end of the article for the implementation code of pandas. The high performance of dolphin DB benefits from its design idea and technical architecture. For details, please refer toUncover high performance dolphin DB

When calculating the alpha factor, in addition to considering the performance, the simplicity and readability of the code can not be ignored. Dolphin DB only needs 2 lines of core code to implement the alpha #001 factor, 4 lines of core code to implement the alpha #098 factor, and other system implementations need large pieces of code. You can refer to pandas implementation orOther systems calculate alpha #001 factors。 Why is the implementation of dolphin DB so concise? This is due to the powerful scripting language of dolphin dB. In dolphin dB, scripting language and SQL are completely integrated. SQL query can be directly assigned to a variable or as a parameter of a function. In addition to supporting standard SQL, the SQL Engine of dolphin DB also makes many useful extensions for big data analysis, especially time series data analysis. For example, the context by used above is one of the features of dolphin dB. It is equivalent to the window functions of other systems (SQL server and PostgreSQL), but it has much richer functions than the window functions of other systems, is more concise and flexible in syntax, and is very friendly to panel data. Dolphin DB has built-in and optimized many functions related to time series data, which greatly improves the computing performance. For example, the complexity of the functions used to calculate sliding window indicators such as mavg, mcorr, mrank and Mimin is only O (n) or O (nlogk), and K is the size of the window. If you want to know more about dolphin DB’s scripting language, you can refer toMixed paradigm programming of dolphin DB scripting language

Interested friends can go toOfficial websiteDownload dolphin DB database and try to implement its own alpha factor and policy backtesting.

enclosure

Pandas Code:

from time import time

import pandas as pd
import numpy as np
from scipy.stats import rankdata

def rank(df):
    return df.rank(pct=True)


def decay_linear(df, period=10):
    if df.isnull().values.any():
        df.fillna(method='ffill', inplace=True)
        df.fillna(method='bfill', inplace=True)
        df.fillna(value=0, inplace=True)
    na_lwma = np.zeros_like(df)
    na_lwma[:period, :] = df.iloc[:period, :]
    na_series = df.as_matrix()
    divisor = period * (period + 1) / 2
    y = (np.arange(period) + 1) * 1.0 / divisor
    # Estimate the actual lwma with the actual close.
    # The backtest engine should assure to be snooping bias free.
    for row in range(period - 1, df.shape[0]):
        x = na_series[row - period + 1: row + 1, :]
        na_lwma[row, :] = (np.dot(x.T, y))
    return pd.DataFrame(na_lwma, index=df.index, columns=['CLOSE'])


def rolling_rank(na):
    return rankdata(na)[-1]


def ts_rank(df, window=10):
    return df.rolling(window).apply(rolling_rank)


def ts_argmin(df, window=10):
    return df.rolling(window).apply(np.argmin) + 1


def correlation(x, y, window):
    return x.rolling(window).corr(y)


def decay7(df):
    return rank(decay_linear(correlation(df.vwap, df.adv5, 5).to_frame(), 7).CLOSE)


def decay8(df):
    return rank(decay_linear(ts_rank(ts_argmin(correlation(rank(df.open), rank(df.adv15), 21), 9), 7).to_frame(), 8).CLOSE)


def alpha098(df):
    return (decay7(df) - decay8(df)).to_frame()


path = 'your_path/USPrices.csv'
df = pd.read_csv(path, parse_dates=[1])
df = df[df.date.between('2007.01.01', '2016.12.31')]

print("loaded")

df["vwap"] = df["PRC"]
df["open"] = df["PRC"] + np.random.random(len(df))
df['adv5'] = df.groupby('PERMNO')['VOL'].transform(lambda x: x.rolling(5).mean())
df['adv15'] = df.groupby('PERMNO')['VOL'].transform(lambda x: x.rolling(15).mean())
df['rank_open'] = df.groupby('date')['open'].rank(method='min')
df['rank_adv15'] = df.groupby('date')['adv15'].rank(method='min')

print("start")
start = time()
df['A98'] = df.groupby('PERMNO').apply(alpha098)
end = time()

print(end - start)