How to use a time series database to process Tushare financial data


DolphinDB is a new-generation time series database. It can serve not only as a distributed data warehouse or an in-memory database, but also, thanks to its rich built-in computing tools, as a research tool or research platform. It is well suited to analyzing massive data in fields such as quantitative finance and the Internet of Things. Many problems in quantitative finance, such as trading-signal research, strategy backtesting, transaction cost analysis, stock correlation research, and market risk control, can be solved with DolphinDB.

This tutorial shows how to import the daily market data and daily indicator data of Shanghai and Shenzhen stocks from 2008 to 2017 from Tushare into a DolphinDB database, and how to use them for financial analysis. Tushare is an open financial big-data community with rich data on stocks, funds, futures, digital currencies, and more; it provides free financial data for quantitative practitioners and financial researchers.

1. Data overview

The daily market data of Shanghai and Shenzhen stocks provided by Tushare contains the following fields:

Name        Description
ts_code     stock code
trade_date  trading date
open        opening price
high        highest price
low         lowest price
close       closing price
pre_close   previous closing price
change      price change
pct_change  percentage change
vol         volume (lots)
amount      turnover (thousand yuan)

The daily indicator data contains the following fields:

Name             Description
ts_code          stock code
trade_date       trading date
close            closing price
turnover_rate    turnover rate
turnover_rate_f  turnover rate (free float)
volume_ratio     volume ratio
pe               price/earnings ratio (total market value / net profit)
pe_ttm           price/earnings ratio (TTM)
pb               price/book ratio (total market value / net assets)
ps               price/sales ratio
ps_ttm           price/sales ratio (TTM)
total_share      total share capital (10,000 shares)
float_share      floating share capital (10,000 shares)
free_share       free-floating share capital (10,000 shares)
total_mv         total market value (10,000 yuan)
circ_mv          circulating market value (10,000 yuan)

2. Create a DolphinDB database

2.1 install DolphinDB

Download the DolphinDB installation package and the DolphinDB GUI from the official website.

For single-node deployment of DolphinDB, please refer to Single node deployment.

For single-server cluster deployment of DolphinDB, please refer to Single server cluster deployment.

For multi-server cluster deployment of DolphinDB, please refer to Multi server cluster deployment.

2.2 create the database

We can use the database function to create a partitioned database.

Syntax: database(directory, [partitionType], [partitionScheme], [locations])


_directory:_ the directory where the database is saved. DolphinDB has three types of databases: in-memory databases, on-disk databases, and databases in the distributed file system (DFS). To create an in-memory database, leave the directory empty; to create a local on-disk database, set the directory to a local file system path; to create a database in the distributed file system, the directory should start with "dfs://". This tutorial uses a database in the distributed file system.

_partitionType:_ the partition type. There are six: sequential (SEQ), range (RANGE), hash (HASH), value (VALUE), list (LIST), and composite (COMPO).

_partitionScheme:_ the partition scheme. Each partition type takes a corresponding scheme; a RANGE partition, for example, takes a vector of ordered boundary values.

(Table: the partition scheme corresponding to each partition type)

Before importing the data, we should plan the partitioning, mainly considering two factors: the partition field and the partition granularity.

In daily query and analysis, queries by date are the most frequent, so we choose the date field trade_date as the partition field. If each day were its own partition, every partition would be too small: only about 3,000 records, less than 1 MB in size, and the number of partitions would be very large. When a distributed system executes a query, it splits the query into subtasks and sends them to different partitions; such a scheme would produce a great many subtasks, each with a very short execution time, so the system would spend more time managing tasks than executing them. Clearly this is unreasonable. Instead, we partition by date range, from January 1 of each year to January 1 of the next year, which improves query efficiency without creating partitions that are too small.

The time span of existing data is 2008-2017, but in order to leave enough space for future data, we set the time range as 2008-2030. Execute the following code:

yearRange=date(2008.01M + 12*0..22)

Since the daily market data and the daily indicator data share the same partition scheme, we store them in two tables of the same distributed database dfs://daily_line: hushen_daily_line stores the daily market data and hushen_daily_indicator stores the daily indicator data. To use an in-memory database instead, set _directory_ to empty; to use an on-disk database, set _directory_ to a disk path. The code for creating the database is as follows:

yearRange=date(2008.01M + 12*0..22)
db=database("dfs://daily_line", RANGE, yearRange)


3. Use the Python API to import data into DolphinDB

Tushare provides two common ways to retrieve data:

  • Through the tushare Python package, which returns data as a pandas DataFrame.
  • Directly over HTTP, which returns data in JSON format.

This tutorial uses the first method to retrieve the daily market data and daily indicator data of Shanghai and Shenzhen stocks from 2008 to 2017.

3.1 download and install Python 3.x and tushare

For details, please refer to the Tushare website.

3.2 install the DolphinDB Python 3 API

Download the Python 3 API from the official website and unzip the package to any directory. Enter that directory in a console and execute the following command:

python setup.py install

Update the Python API with the following command:

python setup.py install --force

3.3 data import

We use the daily and daily_basic interfaces of the tushare Python package to retrieve the daily market data and the daily indicator data respectively; both return pandas DataFrames. Note that you need to register a Tushare account to obtain a token. Then, through the DolphinDB Python API, we connect to the DolphinDB data node at IP localhost and port 8941 (adjust the IP and port to your own cluster) and append the DataFrames returned by Tushare to the DFS tables created above.

The specific Python code is as follows:

import datetime
import tushare as ts
import pandas as pd
import dolphindb as ddb

pro = ts.pro_api("YOUR_TOKEN")   # token from your Tushare account
s = ddb.session()
s.connect("localhost", 8941, "admin", "123456")

def dateRange(beginDate, endDate):
    dates = []
    dt = datetime.datetime.strptime(beginDate, "%Y%m%d")
    date = beginDate
    while date <= endDate:
        dates.append(date)
        dt = dt + datetime.timedelta(1)
        date = dt.strftime("%Y%m%d")
    return dates

for dates in dateRange('20080101','20171231'):   # daily market data
    df = pro.daily(trade_date=dates)
    if len(df):
        df['trade_date'] = pd.to_datetime(df['trade_date'])   # YYYYMMDD string to date
        s.upload({'t': df})
        s.run("loadTable('dfs://daily_line','hushen_daily_line').append!(t)")

for dates in dateRange('20080101','20171231'):   # daily indicator data
    ds = pro.daily_basic(trade_date=dates)
    if len(ds):
        ds['trade_date'] = pd.to_datetime(ds['trade_date'])
        s.upload({'t': ds})
        s.run("loadTable('dfs://daily_line','hushen_daily_indicator').append!(t)")

After the data is imported successfully, we can see the partitions of the two tables in the variable browser at the lower right corner of the DolphinDB GUI.


View data volume:

select count(*) from hushen_daily_line

select count(*) from hushen_daily_indicator

So far, we have imported the daily market data and the daily indicator data of Shanghai and Shenzhen stocks from 2008 to 2017 into DolphinDB.

4. Financial analysis

DolphinDB integrates a database, a programming language, and distributed computing; it can be used not only as a data warehouse but also as a computing and analysis tool. DolphinDB has many optimized built-in time series functions, which makes it especially suitable for the quantitative queries and analysis performed by investment banks, hedge funds, and exchanges, and for building strategy backtests on historical data. The following shows how to use the Tushare data for financial analysis.

4.1 calculate the rolling volatility of each stock

daily_line= loadTable("dfs://daily_line","hushen_daily_line")
t=select ts_code,trade_date,mstd(pct_change/100,21) as mvol from daily_line context by ts_code
select * from t where trade_date=2008.11.14
ts_code            trade_date    mvol
000001.SZ    2008.11.14    0.048551
000002.SZ    2008.11.14    0.04565
000004.SZ    2008.11.14    0.030721
000005.SZ    2008.11.14    0.046655
000006.SZ    2008.11.14    0.043092
000008.SZ    2008.11.14    0.035764
000009.SZ    2008.11.14    0.051113
000010.SZ    2008.11.14    0.027254

Only one line of code is needed to calculate each stock's one-month rolling volatility. DolphinDB has finance in its genes: a large number of finance-related functions are built in, so financial indicators can be computed with very little code.
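For readers more familiar with pandas, the same query can be sketched roughly as follows (column names are assumed to match the field table in section 1; this mirrors the mstd(pct_change/100, 21) ... context by ts_code query above, not DolphinDB's distributed execution):

```python
import pandas as pd

def rolling_volatility(df, window=21):
    """21-day rolling standard deviation of daily returns per stock.

    df: columns ts_code, trade_date, pct_change (in percent).
    """
    df = df.sort_values(["ts_code", "trade_date"]).copy()
    # per-stock rolling std, like `context by ts_code` in DolphinDB
    df["mvol"] = (
        df.groupby("ts_code")["pct_change"]
          .transform(lambda s: (s / 100).rolling(window).std())
    )
    return df[["ts_code", "trade_date", "mvol"]]
```

The first window-1 rows of each stock are NaN, just as mstd returns nulls before the window fills.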

4.2 find the most correlated stocks

Using the daily market data of Shanghai and Shenzhen stocks, we calculate the pairwise correlations between stocks. First, generate the stock return matrix:

retMatrix=exec pct_change/100 as ret from loadTable("dfs://daily_line","hushen_daily_line") pivot by trade_date,ts_code

exec and pivot by are distinctive features of the DolphinDB programming language. exec is used in the same way as select, but while a select statement produces a table, an exec statement produces a vector. pivot by rearranges the data along two dimensions; combined with exec, it produces a matrix.

Then, generate the stock correlation matrix:

corrMatrix=cross(corr, retMatrix, retMatrix)
The cross used above is a higher-order function (template function) in DolphinDB. It takes a function and objects as input and applies the function to each pair of objects, here computing corr for every pair of return columns. Template functions are very useful in complex batch computations.
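A rough pandas counterpart of this pivot-and-correlate pipeline (a sketch under the same assumed column names, not DolphinDB's implementation):

```python
import pandas as pd

def pairwise_correlation(df):
    """Pairwise correlation of daily returns between stocks.

    df: long format with columns ts_code, trade_date, pct_change (percent).
    """
    # one row per date, one column per stock: pandas' analogue of `pivot by`
    ret = df.assign(ret=df["pct_change"] / 100).pivot(
        index="trade_date", columns="ts_code", values="ret"
    )
    # DataFrame.corr plays the role of cross(corr, retMatrix, retMatrix)
    return ret.corr()
```

The result is a symmetric stock-by-stock matrix, from which the top-correlated pairs can be ranked.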

Then, find the 10 stocks most correlated with each stock:

mostCorrelated = select * from table(corrMatrix).rename!(`ts_code`corr_ts_code`corr) context by ts_code having rank(corr,false) between 1:10

Find the 10 stocks most correlated with 000001.SZ:

select * from mostCorrelated where ts_code="000001.SZ" order by corr desc
ts_code            corr_ts_code    corr
000001.SZ    601166.SH    0.859
000001.SZ    600000.SH    0.8406
000001.SZ    002920.SZ    0.8175
000001.SZ    600015.SH    0.8153
000001.SZ    600036.SH    0.8129
000001.SZ    600016.SH    0.8022
000001.SZ    002142.SZ    0.7956
000001.SZ    601169.SH    0.7882
000001.SZ    601009.SH    0.7778
000001.SZ    601328.SH    0.7736

The two examples above are relatively simple; next, let's try a more complex calculation.

4.3 build WorldQuant Alpha #001 and #98

The paper 101 Formulaic Alphas published by WorldQuant LLC gives 101 alpha factor formulas. Many individuals and organizations have tried to implement these 101 alpha factors in different languages. Here we illustrate the implementation of two of them, Alpha #001 and Alpha #098.

Alpha #001 formula: rank(Ts_ArgMax(SignedPower((returns < 0 ? stddev(returns, 20) : close), 2.), 5)) - 0.5

For a detailed interpretation of Alpha #001, please refer to the article series Research on the WorldQuant Alpha 101 factors.

Alpha #98 formula: (rank(decay_linear(correlation(vwap, sum(adv5, 26.4719), 4.58418), 7.18088)) - rank(decay_linear(Ts_Rank(Ts_ArgMin(correlation(rank(open), rank(adv15), 20.8187), 8.62571), 6.95668), 8.07206)))

These two factors use not only cross-sectional information but also a large amount of time series computation. That is, to compute the factor of one stock on one day, we need not only that stock's historical data but also the data of all stocks on that day, so the amount of computation is very large.

To construct these two factors, the following fields are needed: stock code, date, trading volume, volume-weighted average price (VWAP), opening price, and closing price. The VWAP can be computed from the closing price and trading volume, so the daily market data suffice to build both factors.

The code of the build factor is as follows:

def alpha1(stock){
    t= select trade_date,ts_code,mimax(pow(iif(ratios(close) < 1.0, mstd(ratios(close) - 1, 20), close), 2.0), 5) as maxIndex from stock context by ts_code
    return select trade_date,ts_code,rank(maxIndex) - 0.5 as A1 from t context by trade_date
}

def alpha98(stock){
    t = select ts_code,trade_date, wavg(close,vol) as vwap, open, mavg(vol, 5) as adv5, mavg(vol,15) as adv15 from stock context by ts_code
    update t set rank_open = rank(open), rank_adv15 = rank(adv15) context by trade_date
    update t set decay7 = mavg(mcorr(vwap, msum(adv5, 26), 5), 1..7), decay8 = mavg(mrank(9 - mimin(mcorr(rank_open, rank_adv15, 21), 9), true, 7), 1..8) context by ts_code
    return select ts_code,trade_date, rank(decay7)-rank(decay8) as A98 from t context by trade_date
}

Alpha #001 takes only two lines of core code and Alpha #98 only four, all implemented in SQL with very good readability. The key is the grouped computation of the context by clause. context by is DolphinDB's extension to standard SQL. Unlike group by, which produces one row per group, context by outputs the same number of rows as the input, so multiple functions can be nested conveniently. In cross-sectional computations we group by trade_date; in time series computations we group by ts_code. Unlike traditional analysis languages such as MATLAB and SAS, the DolphinDB scripting language is tightly integrated with the distributed database and distributed computing; it is expressive, fast, and easy to scale, meeting the needs of rapid development and modeling.
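As a cross-check of the formula, Alpha #001 can also be sketched in pandas (a rough single-machine sketch under assumed column names; the window lengths 20 and 5 and the cross-sectional rank come from the formula above):

```python
import numpy as np
import pandas as pd

def alpha001(df):
    """Rough pandas sketch of Alpha #001.

    df: long format with columns ts_code, trade_date, close, ret.
    """
    df = df.sort_values(["ts_code", "trade_date"]).copy()

    def ts_argmax_part(g):
        # 20-day stddev of returns where the return is negative, else close
        base = np.where(g["ret"] < 0, g["ret"].rolling(20).std(), g["close"])
        signed_power = np.sign(base) * base ** 2
        # position (0..4) of the maximum within each 5-day window
        return pd.Series(signed_power, index=g.index).rolling(5).apply(np.argmax, raw=True)

    # per-stock time series part (DolphinDB: context by ts_code)
    df["maxIndex"] = df.groupby("ts_code", group_keys=False).apply(ts_argmax_part)
    # cross-sectional rank per day (DolphinDB: context by trade_date)
    df["A1"] = df.groupby("trade_date")["maxIndex"].rank() - 0.5
    return df[["trade_date", "ts_code", "A1"]]
```

This runs on one core and one machine; it is meant only to make the two nested groupings in the DolphinDB version explicit.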

View results:

select * from alpha1(stock1) where trade_date=2017.07.06
trade_date    ts_code        A1
2017.07.06    000001.SZ    252.5
2017.07.06    000002.SZ    1,103.5
2017.07.06    000004.SZ    252.5
2017.07.06    000005.SZ    252.5
2017.07.06    000006.SZ    1,103.5
2017.07.06    000008.SZ    1,972.5
2017.07.06    000009.SZ    1,103.5
2017.07.06    000010.SZ    1,103.5
2017.07.06    000011.SZ    1,103.5

select * from alpha98(stock1) where trade_date=2017.07.19
ts_code    trade_date    A98
000001.SZ    2017.07.19    -1,073
000002.SZ    2017.07.19    -1,270
000004.SZ    2017.07.19    -1,805
000005.SZ    2017.07.19    224
000006.SZ    2017.07.19    791
000007.SZ    2017.07.19    -609
000008.SZ    2017.07.19    444
000009.SZ    2017.07.19    -1,411
000010.SZ    2017.07.19    -1,333

With single-threaded computation, Alpha #001 takes only about 4 seconds, and the more complex Alpha #98 only about 5 seconds: excellent performance.

4.4 momentum trading strategy

Momentum strategies are among the most popular investment strategies. Roughly speaking, a momentum strategy "buys the winners and sells the losers": buy the stocks that have risen sharply and sell those that have fallen sharply. Below we show how to backtest a momentum trading strategy in DolphinDB.

The most commonly used momentum factor is the return over the past year excluding the most recent month. A momentum strategy is usually rebalanced monthly with a one-month holding period. In this tutorial, we rebalance 1/21 of the portfolio every day and hold each new tranche for 21 days.
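Since this signal is the ratio of the cumulative-return index at a 21-day lag to that at a 252-day lag, it can be sketched in pandas as follows (the column names ts_code, trade_date, ret are illustrative assumptions):

```python
import pandas as pd

def momentum_signal(df):
    """12-month-minus-1-month momentum per stock.

    df: columns ts_code, trade_date, ret (daily return).
    signal[t] = cumret[t-21] / cumret[t-252] - 1, i.e. the return earned
    from roughly 12 months ago up to roughly 1 month ago.
    """
    df = df.sort_values(["ts_code", "trade_date"]).copy()
    # cumulative-return index per stock, like cumprod(1+ret) context by ts_code
    df["cumretIndex"] = df.groupby("ts_code")["ret"].transform(
        lambda s: (1 + s).cumprod()
    )
    # ratio of the two lagged index values, like move(cumretIndex, k)
    df["signal"] = df.groupby("ts_code")["cumretIndex"].transform(
        lambda s: s.shift(21) / s.shift(252) - 1
    )
    return df
```

The first 252 rows of each stock have no signal, matching the null-filtering step in the backtest below.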

To backtest the momentum trading strategy, the data needs the following fields: stock code, date, price per share (closing price), circulating market value, daily stock return, and daily trading volume.

Obviously, the daily market data alone is not enough; we need to join the hushen_daily_line and hushen_daily_indicator tables.

Select the required fields from the two tables through an equal join:

daily_line = loadTable("dfs://daily_line","hushen_daily_line")
daily_indicator = loadTable("dfs://daily_line","hushen_daily_indicator")
s=select ts_code,trade_date,close,change,pre_close,vol,amount,turnover_rate,total_share,float_share,free_share,total_mv,circ_mv from ej(daily_line,daily_indicator,`ts_code`trade_date)

(1) Clean and filter the data, and construct for each stock the momentum signal: the return of the past year excluding the most recent month.

def loadPriceData(inData){
    stocks = select ts_code, trade_date, close, change/pre_close as ret, circ_mv, vol from inData where weekday(trade_date) between 1:5, isValid(close), isValid(vol) order by ts_code, trade_date
    stocks = select ts_code, trade_date, close, ret, circ_mv, vol, cumprod(1+ret) as cumretIndex from stocks context by ts_code
    return select ts_code, trade_date, close, ret, circ_mv, vol, move(cumretIndex,21)/move(cumretIndex,252)-1 as signal from stocks context by ts_code
}
priceData = loadPriceData(s)

(2) Portfolio generation

Choose the tradable stocks that meet the following conditions: a valid (non-null) momentum signal, positive trading volume on the day, a circulating market value of more than 100 million yuan, and a price of more than 5 yuan per share.

def genTradables(indata){
    return select trade_date, ts_code, circ_mv, signal from indata where close>5, circ_mv>10000, vol>0, isValid(signal) order by trade_date
}
tradables = genTradables(priceData)

Based on each day's momentum signal, the tradable stocks are divided into 10 groups, and only the two most extreme groups (winners and losers) are kept. Suppose we always hold 1 yuan long and 1 yuan short over 21 days: every day we go long 1/21 in the winner group and short 1/21 in the loser group. Within each group, the weight of each stock on the portfolio formation date can be set by equal weighting or by value weighting.
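For intuition, the decile-and-weight step for a single formation date can be sketched in pandas (equal-weight case; the 10 groups and 21-day spread come from the text above, and pd.qcut stands in for DolphinDB's rank(signal,,groups), which may break ties slightly differently):

```python
import pandas as pd

def equal_weights(day, groups=10, holding_days=21):
    """Weights for one formation date.

    day: DataFrame with columns ts_code, signal for a single trade date.
    Long the top decile, short the bottom decile, spreading 1 yuan of
    long and 1 yuan of short exposure over holding_days days.
    """
    day = day.copy()
    # decile rank of the momentum signal: 0 = losers, groups-1 = winners
    day["rank"] = pd.qcut(day["signal"], groups, labels=False)
    day["wt"] = 0.0
    losers = day["rank"] == 0
    winners = day["rank"] == groups - 1
    day.loc[losers, "wt"] = -1.0 / losers.sum() / holding_days
    day.loc[winners, "wt"] = 1.0 / winners.sum() / holding_days
    return day
```

By construction the weights of each day's tranche sum to zero (a dollar-neutral long-short book).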

// WtScheme=1: equal weight; WtScheme=2: value weight
def formPortfolio(startDate, endDate, tradables, holdingDays, groups, WtScheme){
    ports = select date(trade_date) as trade_date, ts_code, circ_mv, rank(signal,,groups) as rank, count(ts_code) as symCount, 0.0 as wt from tradables where date(trade_date) between startDate:endDate context by trade_date having count(ts_code)>=100
    if (WtScheme==1){
        update ports set wt = -1.0/count(ts_code)/holdingDays where rank=0 context by trade_date
        update ports set wt = 1.0/count(ts_code)/holdingDays where rank=groups-1 context by trade_date
    }
    else if (WtScheme==2){
        update ports set wt = -circ_mv/sum(circ_mv)/holdingDays where rank=0 context by trade_date
        update ports set wt = circ_mv/sum(circ_mv)/holdingDays where rank=groups-1 context by trade_date
    }
    return select ts_code, trade_date as tranche, wt from ports where wt != 0 order by ts_code, trade_date
}
// parameters used in this tutorial: 10 groups, 21-day holding period, 2008-2017 data
startDate=2008.01.01
endDate=2018.01.01
holdingDays=21
groups=10
ports = formPortfolio(startDate, endDate, tradables, holdingDays, groups, 2)
dailyRtn = select date(trade_date) as trade_date, ts_code, ret as dailyRet from priceData where date(trade_date) between startDate:endDate

(3) Calculate the profit and loss of each stock in the portfolio over the following 21 days, closing each position 21 days after its formation date.

def calcStockPnL(ports, dailyRtn, holdingDays, endDate, lastDays){
    ages = table(1..holdingDays as age)
    dates = sort distinct ports.tranche
    dictDateIndex = dict(dates, 1..dates.size())
    dictIndexDate = dict(1..dates.size(), dates)
    pos = select dictIndexDate[dictDateIndex[tranche]+age] as date, ts_code, tranche, age, take(0.0,size age) as ret, wt as expr, take(0.0,size age) as pnl from cj(ports,ages) where isValid(dictIndexDate[dictDateIndex[tranche]+age]), dictIndexDate[dictDateIndex[tranche]+age]<=min(lastDays[ts_code], endDate)
    update pos set ret = dailyRet from ej(pos, dailyRtn,`date`ts_code,`trade_date`ts_code)
    update pos set expr = expr*cumprod(1+ret) from pos context by ts_code, tranche
    update pos set pnl = expr*ret/(1+ret)
    return pos
}
lastDaysTable = select max(date(trade_date)) as date from priceData group by ts_code
lastDays = dict(lastDaysTable.ts_code, lastDaysTable.date)
stockPnL = calcStockPnL(ports, dailyRtn, holdingDays, endDate, lastDays)

(4) Calculate the profit and loss of the whole portfolio and plot the cumulative return curve of the momentum strategy.

portPnL = select sum(pnl) as pnl from stockPnL group by date order by date
plot(cumsum(portPnL.pnl) as cumulativeReturn,, "Cumulative Returns of the Momentum Strategy")

The following shows the backtest results on Shanghai and Shenzhen stocks from 2008 to 2017. In the backtest, a new tranche is generated every day and held for 21 days. With single-threaded computation, the whole backtest takes only about 6 seconds.

(Figure: Cumulative Returns of the Momentum Strategy)

If pandas is used to process financial data instead, it demands a lot of memory: peak memory usage is typically 3-4 times the data size, and as data accumulates this becomes more and more of a problem. In terms of performance, pandas is weak at multithreading and cannot make full use of multi-core CPUs. Moreover, pandas cannot partition data by business fields and has no partitioned columnar storage on disk, so queries must scan whole tables, which is inefficient.

5. Summary

As a database, DolphinDB supports petabyte-scale storage in a single table and flexible partitioning schemes. As a research platform, DolphinDB not only offers rich functionality, supporting fast data cleaning, efficient data import, interactive analysis, in-database analysis, a stream computing framework, offline computing, and code reuse in production, but also delivers excellent performance: even on huge data sets it achieves interactive analysis with second-level or even millisecond-level latency. In addition, DolphinDB is user-friendly and provides rich programming interfaces, such as Python, C++, Java, C#, and R APIs, an Excel add-in, and ODBC and JDBC plug-ins. It also provides a powerful integrated development tool that supports graphical data display, making experimental results more intuitive and easier to understand.