Window-based processing of time series data in TensorFlow

Time: 2022-5-6


Dataset introduction

Data source: Kaggle Ubiquant Market Prediction

The dataset contains 300 anonymous features ("f_0" to "f_299") and one target feature ("target") for many investments observed over a time series. The task is to predict the target at subsequent time steps from the anonymous features.


The main goal of this article is to build fixed-length window-sequence training and test sets for an RNN.

Splitting the training, validation and test sets

Since the task is to predict the target at subsequent time points, the model rests on the assumption that patterns learned from the past continue to hold in the future. For such a model it is therefore reasonable to split the training, validation and test sets along the time axis. The dataset used here contains 314,140 rows spanning time IDs ("time_id") 0 to 1219. The last nineteen time IDs, 1201 to 1219, roughly two percent of the time range, are held out as the test set.
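A minimal sketch of this split, assuming the competition data has been converted to a parquet file (the file name is an assumption, and only the test boundary at time_id 1201 is stated above):

import pandas as pd

data = pd.read_parquet('train.parquet')     # hypothetical file name for the raw data
test_data = data[data['time_id'] >= 1201]   # last ~2% of time ids held out as the test set
train_val = data[data['time_id'] < 1201]    # remaining rows for training and validation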


Building and using window sequence data

The basic idea is simple. Each investment in the dataset is treated as an independent time series: the dataset is partitioned by investment_id, and fixed-length sequences are then extracted from each partition with a sliding window.

In practice, however, some problems arise. First, the window sequences produced by a sliding window are highly redundant: with a target sequence length of 20, each row appears in up to 20 overlapping windows, so writing the window dataset directly to disk would occupy nearly 20 times the space of the original dataset.

Conversely, computing the window sequences entirely on the fly during training is not desirable either. The windowing computation would be repeated in every epoch, so the efficiency of that function directly limits training speed.

A compromise is to record, for each window, only the row numbers of its time steps in the original dataset, and to write these to disk as an index dataset. During training, batches are then generated by reading the original dataset together with the index dataset.

Although an RNN accepts variable-length sequences as input, returning batches requires a fixed matrix shape, and ragged input hurts input efficiency. Window sequences shorter than the required length are therefore padded with all-zero rows, and a single all-zero row is appended to the original dataset for this purpose. (Note: the all-zero row must be inserted after preprocessing steps such as standardization and normalization.)
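A minimal sketch of appending the padding row, assuming all columns are numeric after preprocessing; its row number becomes ZERO_INDEX below:

data.loc[len(data)] = 0.  # broadcast zero to every column; with 314,140 data rows this row gets index 314140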

import pandas as pd
from tqdm import tqdm

MIN_LEN = 20         # minimum window sequence length; shorter windows are padded with all-zero rows
FEATURE_NUM = 300
ZERO_INDEX = 314140  # row number of the all-zero padding row

def form_indexes(data, time_range):  # data: original dataset; time_range: range of time ids to cover
    id_list = sorted(data['investment_id'].unique())
    if 0 in id_list:
        id_list.remove(0)  # skip investment_id 0 (the all-zero padding row)
    indexes_list = []
    for id in tqdm(id_list):
        sub_data = data[data['investment_id'] == id].sort_values(by=['time_id'])
        time_list = tuple(sorted(sub_data['time_id'].unique()))
        for t in range(time_range[0], time_range[1]):
            if t in time_list:
                i_t = time_list.index(t)
                # row numbers of up to MIN_LEN time steps ending at time t
                temp = list(sub_data[max(i_t - MIN_LEN + 1, 0):i_t + 1].index.values)
                # left-pad with the padding row's number so every window has length MIN_LEN
                indexes = [ZERO_INDEX] * (MIN_LEN - len(temp)) + temp
                indexes_list.append(indexes)
    return indexes_list
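The resulting index lists can then be written to disk as the index datasets read below. A minimal sketch; the train/validation time boundaries here are assumptions, only the file names reappear in the original code:

train_indexes = form_indexes(data, (0, 1150))    # hypothetical training range
val_indexes = form_indexes(data, (1150, 1201))   # hypothetical validation range
cols = [str(i) for i in range(MIN_LEN)]          # parquet requires string column names
pd.DataFrame(train_indexes, columns=cols).to_parquet('trainindex.parquet')
pd.DataFrame(val_indexes, columns=cols).to_parquet('valindex.parquet')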

Building the window-sequence training and validation sets before training

Building the datasets with tf.data.Dataset.from_generator has the advantage that the generator function runs only when data is actually consumed (read or prefetched), so it does not occupy too much memory. Shuffle and batch operations are also easy to apply.

import tensorflow as tf

train_indexset = pd.read_parquet('trainindex.parquet')
val_indexset = pd.read_parquet('valindex.parquet')

def gen_func(train_val_or_test):  # generator function: 1 = training set, 2 = validation set
    if train_val_or_test == 1:
        indexset = train_indexset
    elif train_val_or_test == 2:
        indexset = val_indexset
    else:
        raise ValueError("train_val_or_test must be 1 (train) or 2 (validation)")
    for indexes in indexset.iterrows():
        # gather the window's MIN_LEN rows; in this layout the feature columns start at position 4
        features = data.iloc[indexes[1].values].values[:, 4:]
        # the label is the target of the window's last time step
        label = data.iloc[indexes[1].values[-1]]['target']
        yield (features, label)

# Specify the shape and data type of the generator's output
featureSpec = tf.TensorSpec(
    shape=[MIN_LEN,FEATURE_NUM],
    dtype=tf.dtypes.float32,
    name=None
)

labelSpec = tf.TensorSpec(
    shape=[],
    dtype=tf.dtypes.float32,
    name=None
)


train_data = tf.data.Dataset.from_generator(generator=gen_func, args=[1], output_signature=(featureSpec, labelSpec))
val_data = tf.data.Dataset.from_generator(generator=gen_func, args=[2], output_signature=(featureSpec, labelSpec))
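The shuffle operation mentioned above could be inserted before batching; a one-line sketch, with a hypothetical buffer size:

train_data = train_data.shuffle(buffer_size=10000)  # buffer size is an assumption, tune to available memory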

The following model and hyperparameters are for demonstration only and carry no guidance value.

MIN_LEN = 20
FEATURE_NUM = 300
BATCH_SIZE = 1000
EPOCH_NUM = 50 

def build_RNNmodel():
    model = tf.keras.models.Sequential(
        [
            # the Masking layer makes the LSTMs skip the all-zero padding rows
            tf.keras.layers.Masking(mask_value=0.,
                                    input_shape=(MIN_LEN, FEATURE_NUM)),
            tf.keras.layers.LSTM(1024, activation='tanh',
                                 return_sequences=True,
                                 dropout=0.5,
                                 kernel_initializer=tf.initializers.TruncatedNormal(stddev=0.01),
                                 ),
            tf.keras.layers.LSTM(256, activation='tanh',
                                 dropout=0.5,
                                 kernel_initializer=tf.initializers.TruncatedNormal(stddev=0.01),
                                 ),
            tf.keras.layers.Dense(1, activation='relu')
        ]
    )
    return model
train_batchs = train_data.batch(batch_size=BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_batchs = val_data.batch(batch_size=BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
# prefetch pre-reads subsequent batches while the model trains, improving throughput;
# its argument counts batches (it comes after .batch()), so AUTOTUNE lets tf.data pick the buffer size

model = build_RNNmodel()
model.compile(loss='mae', optimizer=tf.keras.optimizers.Adam(0.0001))

history = model.fit(train_batchs, epochs=EPOCH_NUM, validation_data=val_batchs)

Only part of the full dataset is used here as a demonstration: each batch contains 1000 window sequences, each epoch has 451 batches, and one epoch takes about 530 seconds to run.
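The per-epoch losses recorded in history can be inspected afterwards; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='training MAE')
plt.plot(history.history['val_loss'], label='validation MAE')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()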
