NLP in the financial market: sentiment analysis


By Yuki Takahashi
Compiled by VK
Source: Towards Data Science

Since AlexNet's breakthrough on ImageNet, deep learning for computer vision has been successfully applied to a wide range of applications. By contrast, NLP has lagged behind in adopting deep neural networks. Many applications that claim to use artificial intelligence actually rely on rule-based algorithms and traditional machine learning rather than deep neural networks.

In 2018, a state-of-the-art (SOTA) model called BERT outperformed human scores on some NLP tasks. Here, I apply several models to a sentiment analysis task to see how useful they are in the financial markets I work in. The code is available as a Jupyter notebook and in a Git repo:


NLP tasks can be roughly divided into the following categories.

  1. Text classification: filtering spam, classifying documents

  2. Word sequence: word translation, part-of-speech tagging, named entity recognition

  3. Text meaning: topic modelling, search, question answering

  4. Seq2seq: machine translation, text summarization, question answering

  5. Dialogue systems

Different tasks require different methods, and in most cases a combination of several NLP techniques. When developing a chatbot, the back-end logic is usually a rule-based search engine plus ranking algorithms that together produce natural-sounding communication.

There are good reasons for this. Language has grammar and word order, which rule-based methods handle well, while machine learning methods are better at learning word similarity. Vectorization techniques such as word2vec and bag-of-words help models represent text mathematically. The most famous examples are:

King - Man + Woman = Queen

Paris - France + UK = London

The first example captures the gender relationship, and the second captures the concept of a capital city. However, these methods always represent the same word with the same vector, whatever text it appears in, so they cannot capture context, and the resulting representation is wrong in many cases.
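The analogy arithmetic above can be reproduced with a toy example. This is a minimal sketch, assuming hand-made 3-dimensional vectors (the dimensions and values are hypothetical; real word2vec embeddings are learned and have hundreds of dimensions). The helper names `cosine` and `analogy` are illustrative only.

```python
import numpy as np

# Hypothetical embeddings with dimensions [royalty, male, female].
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "child": np.array([0.0, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("king", "man", "woman", vectors))  # queen
```

Each word has exactly one entry in `vectors`, which is also why this kind of representation cannot distinguish between different uses of the same word.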

A recurrent neural network (RNN) uses information from earlier in the input sequence to process sequential data, and performs well at capturing and memorizing context. LSTM is a typical architecture; it is composed of an input gate, an output gate and a forget gate, and it overcomes the vanishing-gradient problem of plain RNNs. Many improved models build on LSTM, such as the bidirectional LSTM, which captures context not only from the preceding words but also from the following ones. These methods are useful for some specific tasks, but they often fall short in practical applications.

In 2017, a new way to solve this problem appeared: the attention-based Transformer architecture. BERT is a masked language model built from a stack of multiple encoders, launched by Google in 2018. It achieved SOTA results, with large improvements, on the GLUE, SQuAD and SWAG benchmarks. Many articles and blogs explain this architecture, such as Jay Alammar's article:

I work in the financial industry. Over the past few years, I have rarely seen our machine learning models for NLP perform well enough to be used in production trading systems. Now BERT-based models are becoming mature and easy to use, thanks to the Hugging Face implementation and the many pre-trained models that have been made public.

My goal is to see whether these latest developments in NLP have reached a level that is good enough for use in my field. In this article, I compare different models on a fairly simple task, sentiment analysis of financial texts, which can serve as a baseline for judging whether further R&D toward a real solution is worth trying.

The models compared here are:

  1. Rule-based lexicon approach

  2. Traditional machine learning based on TF-IDF

  3. LSTM, a recurrent neural network architecture

  4. BERT (and ALBERT)

Input data

For the sentiment analysis task, I use the following two inputs, which represent different styles of language in the industry.

  1. Financial news headlines: formal

  2. Posts from Stocktwits: informal

I'll write another article about the latter, so this one focuses on the former. It is an example of text written in more formal, domain-specific financial language. I used the Financial PhraseBank by Malo et al. ("Good Debt or Bad Debt: Detecting Semantic Orientations in Economic Texts"), which contains 4,845 headline texts labelled by 16 people and provides agreement levels. I used the subset with a 75% agreement level, 3,448 texts, as training data.

##Input text example

positive "Finnish steel maker Rautaruukki Oyj ( Ruukki ) said on July 7 , 2008 that it won a 9.0 mln euro ( $ 14.1 mln ) contract to supply and install steel superstructures for Partihallsforbindelsen bridge project in Gothenburg , western Sweden."

neutral "In 2008 , the steel industry accounted for 64 percent of the cargo volumes transported , whereas the energy industry accounted for 28 percent and other industries for 8 percent."

negative "The period-end cash and cash equivalents totaled EUR6 .5 m , compared to EUR10 .5 m in the previous year."

Please note that all data belongs to the source and users must abide by their copyright and license terms.


Here’s how I compared the performance of the four models.

A. Lexicon-based approach

Creating a domain-specific dictionary is a traditional method. It can be simple and powerful when the source text comes from a specific person or media outlet. Here I used Loughran and McDonald's sentiment word list, which contains more than 4,000 words that appear in financial statements, each tagged with a sentiment label. Note: this data requires a license for use in commercial applications. Please check their website before using it.


negative: ABANDON
negative: ABANDONED
constraining: STRICTLY

I used 2,355 negative words and 354 positive words. The list already contains the inflected word forms, so do not apply stemming or lemmatization to the input. For this method it is important to take negation into account: words such as not, no and don't flip the meaning of a negative word to positive. Here, if a negation word appears within the three words preceding a negative word, I simply treat that negative word as positive.

Then, the tone score is defined as follows.

tone_score = 100 * (pos_count - neg_count) / word_count
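The counting and negation rule described above could be sketched as follows. This is a rough illustration under stated assumptions, not the author's implementation: the word sets are tiny hypothetical stand-ins for the Loughran-McDonald lists, tokenization is plain whitespace splitting, and only negated negative words are flipped, as the text describes.

```python
# Tiny hypothetical word lists; the real Loughran-McDonald lists are far larger.
POSITIVE = {"gain", "improved", "strong"}
NEGATIVE = {"loss", "decline", "weak"}
NEGATIONS = {"not", "no", "never", "don't"}

def tone_score(text, window=3):
    """Return 100 * (pos_count - neg_count) / word_count for one text."""
    words = text.lower().split()
    pos_count = neg_count = 0
    for i, word in enumerate(words):
        if word in POSITIVE:
            pos_count += 1
        elif word in NEGATIVE:
            # Look for a negation among the `window` words before this one.
            negated = any(w in NEGATIONS for w in words[max(0, i - window):i])
            if negated:
                pos_count += 1   # "no loss" is counted as positive
            else:
                neg_count += 1
    return 100 * (pos_count - neg_count) / len(words) if words else 0.0

print(tone_score("profit showed a decline"))                    # -25.0
print(tone_score("the company reported no loss this quarter"))  # about 14.29
```

A score like this compresses a whole sentence into a single number, which keeps the downstream classifier simple but discards most of the information in the text.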

Fourteen different classifiers are trained with default parameters, and then the hyperparameters of the random forest are tuned with grid search cross-validation.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# X_train, Y_train, random_state, scoring, kfold, refit and verbose are defined elsewhere in the notebook
classifiers = []
classifiers.append(("SVC", SVC(random_state=random_state)))
classifiers.append(("DecisionTree", DecisionTreeClassifier(random_state=random_state)))
classifiers.append(("AdaBoost", AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.1)))
classifiers.append(("RandomForest", RandomForestClassifier(random_state=random_state, n_estimators=100)))
classifiers.append(("ExtraTrees", ExtraTreesClassifier(random_state=random_state)))
classifiers.append(("GradientBoosting", GradientBoostingClassifier(random_state=random_state)))
classifiers.append(("MultipleLayerPerceptron", MLPClassifier(random_state=random_state)))
classifiers.append(("KNeighbors", KNeighborsClassifier(n_neighbors=3)))
classifiers.append(("LogisticRegression", LogisticRegression(random_state=random_state)))
classifiers.append(("LinearDiscriminantAnalysis", LinearDiscriminantAnalysis()))
classifiers.append(("GaussianNB", GaussianNB()))
classifiers.append(("Perceptron", Perceptron()))
classifiers.append(("LinearSVC", LinearSVC()))
classifiers.append(("SGD", SGDClassifier()))

# Cross-validate every classifier with its default parameters
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_validate(classifier[1], X_train, y=Y_train, scoring=scoring, cv=kfold, n_jobs=-1))

# Use the random forest classifier
rf_clf = RandomForestClassifier()

# Perform grid search
param_grid = {'n_estimators': np.linspace(1, 60, 10, dtype=int),
              'min_samples_split': [2, 3, 5, 10],  # the original listing had 1, which scikit-learn rejects (minimum is 2)
              'min_samples_leaf': [1, 2, 3, 5],
              'max_features': [1, 2, 3],
              'max_depth': [None],
              'criterion': ['gini'],
              'bootstrap': [False]}

model = GridSearchCV(rf_clf, param_grid=param_grid, cv=kfold, scoring=scoring, verbose=verbose, refit=refit, n_jobs=-1, return_train_score=True)
model.fit(X_train, Y_train)
rf_best = model.best_estimator_

B. Traditional machine learning based on TF-IDF vectors

The input is tokenized with NLTK's word_tokenize(), then stemmed and stripped of stop words. It is then fed into TfidfVectorizer and classified with logistic regression and a random forest classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

### Logistic regression
pipeline1 = Pipeline([
    ('vec', TfidfVectorizer(analyzer='word')),
    ('clf', LogisticRegression())])
pipeline1.fit(X_train, Y_train)

### Random forest and grid search
pipeline2 = Pipeline([
    ('vec', TfidfVectorizer(analyzer='word')),
    ('clf', RandomForestClassifier())])

# The original listing used 'auto' for max_features; recent scikit-learn
# versions removed it, and 'sqrt' is the equivalent setting for classifiers.
param_grid = {'clf__n_estimators': [10, 50, 100, 150, 200],
              'clf__min_samples_leaf': [1, 2],
              'clf__min_samples_split': [4, 6],
              'clf__max_features': ['sqrt']}

model = GridSearchCV(pipeline2, param_grid=param_grid, cv=kfold, scoring=scoring, verbose=verbose, refit=refit, n_jobs=-1, return_train_score=True)
model.fit(X_train, Y_train)
tfidf_best = model.best_estimator_


C. LSTM (recurrent neural network)

Since LSTM is designed to retain long-term memory that expresses context, I use a custom tokenizer whose input units are characters rather than words, so stemming and stop-word removal are not required. The input goes first to an embedding layer and then to two LSTM layers. To avoid overfitting, dropout is applied, followed by a fully connected layer, and finally log softmax.

import torch.nn as nn

class TextClassifier(nn.Module):
  def __init__(self, vocab_size, embed_size, lstm_size, dense_size, output_size, lstm_layers=2, dropout=0.1):
    """Initialize the model."""
    super().__init__()
    self.vocab_size = vocab_size
    self.embed_size = embed_size
    self.lstm_size = lstm_size
    self.dense_size = dense_size
    self.output_size = output_size
    self.lstm_layers = lstm_layers

    self.embedding = nn.Embedding(vocab_size, embed_size)
    self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, dropout=dropout, batch_first=False)
    self.dropout = nn.Dropout(dropout)

    # One linear layer if dense_size is 0, otherwise two
    if dense_size == 0:
      self.fc = nn.Linear(lstm_size, output_size)
    else:
      self.fc1 = nn.Linear(lstm_size, dense_size)
      self.fc2 = nn.Linear(dense_size, output_size)

    self.softmax = nn.LogSoftmax(dim=1)

  def init_hidden(self, batch_size):
    """Initialize the hidden and cell states with zeros."""
    weight = next(self.parameters()).data
    hidden = (weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size),
              weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size))
    return hidden

  def forward(self, nn_input_text, hidden_state):
    """Run the forward pass on nn_input_text, shaped (seq_len, batch) because batch_first=False."""
    nn_input_text = nn_input_text.long()
    embeds = self.embedding(nn_input_text)
    lstm_out, hidden_state = self.lstm(embeds, hidden_state)
    # Keep the LSTM output at the last time step, then apply dropout
    lstm_out = lstm_out[-1,:,:]
    lstm_out = self.dropout(lstm_out)
    # Fully connected layer(s)
    if self.dense_size == 0:
      out = self.fc(lstm_out)
    else:
      dense_out = self.fc1(lstm_out)
      out = self.fc2(dense_out)
    # Log softmax
    logps = self.softmax(out)

    return logps, hidden_state
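The character-level input described for this model could be prepared along the following lines. This is only a sketch under assumptions: the article does not show the custom tokenizer, so the function names, the reserved IDs (0 for padding, 1 for unknown characters) and the left-padding choice are all hypothetical. Left padding is used here because the model reads the LSTM output at the last time step.

```python
def build_char_vocab(texts):
    """Map each character in the corpus to an integer ID (0 = padding, 1 = unknown)."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode(text, vocab, max_len):
    """Convert a text to a fixed-length list of character IDs, padded on the left."""
    ids = [vocab.get(ch, 1) for ch in text[:max_len]]
    return [0] * (max_len - len(ids)) + ids

texts = ["profit up", "loss"]
vocab = build_char_vocab(texts)
encoded = [encode(t, vocab, max_len=10) for t in texts]
print(encoded[1])  # [0, 0, 0, 0, 0, 0, 5, 6, 9, 9]
```

The size of `vocab` plus the two reserved IDs would then be passed as `vocab_size` when constructing the model.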

As an alternative, I also tried Stanford's GloVe word embeddings. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Here I used the vectors pre-trained on Wikipedia and Gigaword, with 6 billion tokens, a vocabulary of 400,000 words and 300 dimensions. About 90% of the words in our vocabulary are found in GloVe, and the rest are initialized randomly.
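Filling an embedding matrix from pre-trained vectors, with random initialization for the words that are missing, could look like the sketch below. The function name, the tiny 4-dimensional vectors and the random-initialization scale are assumptions for illustration; real GloVe vectors are read from a text file and have 300 dimensions.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Copy pre-trained vectors where available; leave random rows for the rest."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
    found = 0
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
            found += 1
    return matrix, found / len(vocab)

# Hypothetical stand-ins for the GloVe file and the model vocabulary.
pretrained = {
    "profit": np.array([0.1, 0.2, 0.3, 0.4]),
    "loss": np.array([-0.1, -0.2, -0.3, -0.4]),
}
vocab = {"profit": 0, "loss": 1, "rautaruukki": 2}

matrix, coverage = build_embedding_matrix(vocab, pretrained, dim=4)
print(matrix.shape, round(coverage, 2))  # (3, 4) 0.67
```

The resulting matrix could then be loaded into the model's embedding layer, for example with `nn.Embedding.from_pretrained` in PyTorch.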

D. BERT and ALBERT

I used Hugging Face's transformers library to implement the BERT model. It now provides a tokenizer and encoder that generate the token IDs, padding mask and segment IDs, which can be fed directly into BertModel. I use the standard training process.
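To make the three tokenizer outputs concrete, here is a toy illustration of their shape. It is not the Hugging Face implementation: it splits on whitespace instead of using WordPiece, and the vocabulary and ID values are made up. Only the output keys (`input_ids`, `attention_mask`, `token_type_ids`) mirror what the BERT tokenizer returns.

```python
def toy_encode(text, vocab, max_length):
    """Mimic the structure of a BERT tokenizer's output using whitespace tokens."""
    tokens = ["[CLS]"] + text.lower().split()[: max_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [vocab["[PAD]"]] * pad,    # token IDs, padded to max_length
        "attention_mask": [1] * len(input_ids) + [0] * pad,  # 1 = real token, 0 = padding
        "token_type_ids": [0] * max_length,                  # segment IDs: a single sentence
    }

# Hypothetical vocabulary; real BERT uses about 30,000 WordPiece entries.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "sales": 4, "rose": 5}
enc = toy_encode("Sales rose sharply", vocab, max_length=8)
print(enc["input_ids"])       # [2, 4, 5, 1, 3, 0, 0, 0]
print(enc["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The padding mask tells the model which positions to ignore, so sentences of different lengths can share one batch.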

As with the LSTM model, the output of BERT is passed to dropout and a fully connected layer, and then log softmax is applied. Without a large compute budget and enough data, training the model from scratch is not an option, so I used pre-trained models and fine-tuned them. The pre-trained models are as follows:

  • BERT: bert-base-uncased

  • ALBERT: albert-base-v2

The training process for the pre-trained BERT is as follows.

import numpy as np
import torch
from tqdm import tqdm
# AdamW_HF is assumed to be the AdamW optimizer shipped with transformers
# (deprecated in newer releases of the library)
from transformers import AdamW as AdamW_HF
from transformers import BertForSequenceClassification, BertTokenizer, get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# batch_size, num_epochs, learning_rate, warm_up_proportion, max_seq_length,
# max_grad_norm, patience, metric and SimpleDataset are defined elsewhere in the notebook
def train_bert(model, x_train, y_train, x_valid, y_valid, tokenizer=tokenizer):
  # Move the model to the GPU / CPU device
  device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
  model = model.to(device)

  # Load data into SimpleDataset (custom dataset class)
  train_ds = SimpleDataset(x_train, y_train)
  valid_ds = SimpleDataset(x_valid, y_valid)

  # Use DataLoader to load the datasets in batches
  train_loader = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True)
  valid_loader = torch.utils.data.DataLoader(valid_ds, batch_size=batch_size, shuffle=False)

  # Optimizer and learning rate decay
  num_total_opt_steps = int(len(train_loader) * num_epochs)
  optimizer = AdamW_HF(model.parameters(), lr=learning_rate, correct_bias=False)
  scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_total_opt_steps*warm_up_proportion, num_training_steps=num_total_opt_steps)  # PyTorch scheduler

  # Set training mode
  model.train()

  # Tokenizer parameters
  param_tk = {
    'return_tensors': "pt",
    'padding': 'max_length',
    'max_length': max_seq_length,
    'add_special_tokens': True,
    'truncation': True
  }

  best_f1 = 0.
  early_stop = 0
  train_losses = []
  valid_losses = []

  for epoch in tqdm(range(num_epochs), desc="Epoch"):
    # print('================     epoch {}     ==============='.format(epoch+1))
    train_loss = 0.

    for i, batch in enumerate(train_loader):
      # Tokenize and transfer to device
      x_train_bt, y_train_bt = batch
      x_train_bt = tokenizer(x_train_bt, **param_tk).to(device)
      y_train_bt = torch.tensor(y_train_bt, dtype=torch.long).to(device)

      # Reset gradients
      optimizer.zero_grad()

      # Feed-forward prediction; index the output so that both tuple and
      # ModelOutput return types work
      outputs = model(**x_train_bt, labels=y_train_bt)
      loss, logits = outputs[0], outputs[1]

      # Back propagation
      loss.backward()

      train_loss += loss.item() / len(train_loader)

      # Gradient clipping
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

      # Update weights and learning rate
      optimizer.step()
      scheduler.step()

    # Evaluation mode
    model.eval()

    val_loss = 0.
    y_valid_pred = np.zeros((len(y_valid), 3))

    with torch.no_grad():
      for i, batch in enumerate(valid_loader):
        # Tokenize and transfer to device
        x_valid_bt, y_valid_bt = batch
        x_valid_bt = tokenizer(x_valid_bt, **param_tk).to(device)
        y_valid_bt = torch.tensor(y_valid_bt, dtype=torch.long).to(device)
        outputs = model(**x_valid_bt, labels=y_valid_bt)
        loss, logits = outputs[0], outputs[1]
        val_loss += loss.item() / len(valid_loader)
        # Store the logits so the metrics below see the predictions
        # (the listing as published never filled y_valid_pred)
        start = i * batch_size
        y_valid_pred[start:start + logits.size(0)] = logits.cpu().numpy()

    # Calculate metrics
    acc, f1 = metric(y_valid, np.argmax(y_valid_pred, axis=1))

    # If improved, reset the early-stopping counter; otherwise count up
    if best_f1 < f1:
      early_stop = 0
      best_f1 = f1
    else:
      early_stop += 1

    print('epoch: %d, train loss: %.4f, valid loss: %.4f, acc: %.4f, f1: %.4f, best_f1: %.4f, last lr: %.6f' %
          (epoch+1, train_loss, val_loss, acc, f1, best_f1, scheduler.get_last_lr()[0]))

    if device == 'cuda:0':
      torch.cuda.empty_cache()

    # Stop early once the patience limit is reached
    if early_stop >= patience:
      break

    # Return to training mode
    model.train()
  return model


First, the input data are split into a training set and a test set at a ratio of 8:2. The test set is left untouched until all parameters are fixed, and is used only once for each model. Because the dataset is small, no separate validation set is held out. Instead, to cope with both the imbalance and the small size of the dataset, stratified k-fold cross-validation is used for hyperparameter tuning.

Because the input data are imbalanced, the evaluation is based primarily on the F1 score, with accuracy as a reference.

from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

def metric(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='macro')
    return acc, f1

scoring = {'Accuracy': 'accuracy', 'F1': 'f1_macro'}
refit = 'F1'
kfold = StratifiedKFold(n_splits=5)
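To see why macro F1 is a better primary metric than accuracy on imbalanced data, consider a model that always predicts the majority class. The sketch below computes both metrics by hand on made-up labels (these are not the article's data); `macro_f1` is a hypothetical helper that mirrors what `f1_score(..., average='macro')` computes.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Made-up labels: 8 neutral, 1 positive, 1 negative.
y_true = ["neutral"] * 8 + ["positive", "negative"]
y_pred = ["neutral"] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred, ["negative", "neutral", "positive"])
print(accuracy, round(f1, 3))  # 0.8 0.296
```

Accuracy looks respectable at 0.8, while macro F1 exposes that two of the three classes are never predicted correctly.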

Models A and B use grid search cross-validation, while the deep neural network models C and D use custom cross-validation.

# Stratified k-fold (rand_seed is defined elsewhere in the notebook)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rand_seed)

for n_fold, (train_indices, valid_indices) in enumerate(skf.split(x_train, y_train)):
  # Start each fold from a freshly loaded pre-trained model
  model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
  # Input data for this fold
  x_train_fold = x_train[train_indices]
  y_train_fold = y_train[train_indices]
  x_valid_fold = x_train[valid_indices]
  y_valid_fold = y_train[valid_indices]
  train_bert(model, x_train_fold, y_train_fold, x_valid_fold, y_valid_fold)


After spending roughly the same amount of time on hyperparameter tuning for each model, the fine-tuned BERT-based model is clearly better than the others.

Model A performs poorly because the input is reduced to a tone score, a single value for judging sentiment, and the random forest model ends up labelling most of the data as neutral. A simple linear model does better just by applying thresholds to the tone score, but its accuracy and F1 score are still very low.

I did not use undersampling, oversampling or SMOTE to balance the input data: it could correct this problem, but the data would then deviate from the real, imbalanced situation. A potential improvement for this model is to build a custom dictionary in place of the L-M dictionary, provided the cost of building a dictionary for each problem can be justified.

Model B is much better than the previous model, but it overfits: it fits the training set with almost 100% accuracy and F1 score, yet does not generalize. I tried reducing the model's complexity to avoid overfitting, but that ended with a lower score on the validation set. Balancing the data or collecting more data may help with this problem.

Model C produced results similar to the previous model, with little improvement. In practice, the amount of training data is not enough to train a neural network from scratch; it needs many epochs of training, which tends to overfit. Pre-trained GloVe did not improve the results. A possible improvement to the latter is to train GloVe on a large body of text from a similar domain (such as 10-K and 10-Q financial statements) instead of using the model pre-trained on Wikipedia.

Model D achieved accuracy and F1 scores above 90% both in cross-validation and on the final test. It correctly classifies 84% of negative texts and 94% of positive texts. The gap may be due to the number of inputs in each class, but it is worth examining more closely to improve performance further. This shows that, thanks to transfer learning and the language model, fine-tuning a pre-trained model performs well on this small dataset.
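The per-class figures quoted above are per-class recall: the share of texts of each true class that the model labels correctly. A minimal sketch of that calculation, using made-up labels rather than the article's results and a hypothetical helper name, is:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}

# Made-up labels for illustration only.
y_true = ["negative"] * 4 + ["positive"] * 2 + ["neutral"] * 2
y_pred = ["negative"] * 3 + ["neutral"] + ["positive"] * 2 + ["neutral"] * 2

print(per_class_recall(y_true, y_pred))
# {'negative': 0.75, 'positive': 1.0, 'neutral': 1.0}
```

Looking at recall per class, rather than a single overall score, shows which sentiment class the model struggles with.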


This experiment shows the potential of BERT-based models in my field, where the previous models did not deliver sufficient performance. However, the results are not deterministic and may differ if the hyperparameters are tuned differently.

It is worth noting that in practical applications, obtaining the right input data is just as important. A model cannot be trained well without high-quality data, a problem often summed up as "garbage in, garbage out".

I will cover these issues next time. All the code used here can be found in the Git repo:

Original link:
