NLP text classification



To be honest, I've been struggling lately. I'm a little anxious, because I've always hoped to move toward natural language processing. I dream of becoming an NLP algorithm engineer, doing work I actually enjoy rather than working just to get by, and I think this is one of the few chances I have left to pursue what I like. Reality is harsh, though. Most companies' algorithm engineers come from famous universities, and the master's degree requirement in job postings feels like a threshold I can't cross, which is daunting. Even though this road may not be as easy as other, more readily available ones, I have always believed in it without hesitation. Anyway, you have to have a dream. What if it comes true~

That was a bit of a digression, haha. Back to the article. I have projects involving text analysis (sentiment analysis), and I also want to build up some knowledge for related projects in the future. Recently I started practicing NLP with deep learning in TensorFlow, and along the way I learned a bit about how deep learning models are applied to text classification (though I'm still a beginner who only half understands it, haha). Here I summarize what I learned, to consolidate my own knowledge and share it with you~

(Disclaimer in advance: these are a beginner's views, not an expert's. Corrections are welcome, and I will keep revising.)

Continuous updates

Text classification based on deep learning

Text classification tasks can be divided into five areas:

  • Sentiment analysis
  • News categorization
  • Topic classification
  • Question answering
  • Natural language inference (NLI)

(Of course, there are other sub-areas that are not discussed here.) In recent years, with the application of deep learning to natural language processing, text classification has entered a third stage led by deep learning models, which are gradually surpassing methods based on traditional machine learning.

At present, academia has proposed more than 150 deep learning models for text classification, which can be grouped into 11 categories by architecture:

  • Feedforward networks: treat text as a bag of words.
  • RNN-based models: treat text as a sequence of words, capturing word dependencies and text structure.
  • CNN-based models: trained to recognize patterns in text (such as key phrases) for classification.
  • Capsule networks: address the information loss caused by the pooling operations in CNNs.
  • Attention mechanisms: effectively identify the relevant words in a text, and have become a useful tool for developing deep learning models.
  • Memory-augmented networks: combine a neural network with a form of external memory that the model can read from and write to.
  • Graph neural networks: designed to capture the internal graph structure of natural language, such as syntactic and semantic parse trees.
  • Siamese neural networks: designed for text matching, which is a special case of text classification.
  • Hybrid models: combine attention, RNNs, CNNs, etc. to capture both local and global features of sentences and documents.
  • Transformers: allow more parallelization than RNNs, so very large language models can be (pre-)trained efficiently on GPUs.
  • Modeling techniques beyond supervised learning: unsupervised learning using autoencoders and adversarial training, and reinforcement learning.
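To make the difference between the first two categories a little more concrete, here is a minimal sketch of what "text as a bag of words" versus "text as a sequence of words" means. The helper names and the two example sentences are made up for illustration, and real models work on learned embeddings rather than raw word counts.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts only; word order is thrown away."""
    return Counter(text.lower().split())


def word_sequence(text):
    """Return the words in order, which is the view a sequence model consumes."""
    return text.lower().split()


a = "dog bites man"
b = "man bites dog"

# The two sentences mean different things, but their bags of words are identical.
print(bag_of_words(a) == bag_of_words(b))
# As ordered sequences they are different, so a sequence model can tell them apart.
print(word_sequence(a) == word_sequence(b))
```

This is why a bag-of-words model can struggle when meaning depends on word order, while sequence models keep that information.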

Blah blah blah. It doesn't matter if you don't understand all of it. I only know a few of the classic ones myself, haha, but it's worth keeping them in reserve, since you may run into them one day~ The following figure shows some classic deep learning text embedding and classification models released from 2013 to 2020. We can see that the entry-level word2vec was proposed in 2013, while the popular Tree-LSTM and BERT were proposed only a few years ago. The field really moves fast, emmmm

How to select the appropriate neural network model

When choosing the best neural network architecture for a text classification task, we are often confused. A beginner like me just searches Baidu for whichever model is more popular in the community and reproduces the experts' code, haha (of course, wide use also suggests that the model's performance and applicability won't be too bad). Teachers often say we should try several models and use the one with the highest accuracy. Sob. In reality, annotating the data set needed to validate accuracy is a lot of work up front, and implementing each model is also very! complicated! and seriously brain-burning. After all, building a neural network is not like using a traditional machine learning algorithm, where you just tune one or two parameters~
Of course we should try several models, but we should do so selectively. That makes it very important to do a preliminary screening and settle on a category of model. This really depends on experience, and I don't want to mislead you, so let's look at the experts' official advice:

The choice of neural network architecture depends on the target task and domain, the availability of in-domain labels, and the latency and capacity constraints of the application, all of which lead to very different choices. Although developing a text classifier is undoubtedly a process of trial and error, by analyzing recent results on public benchmarks (such as GLUE), we propose the following recipe to simplify the process, which consists of five steps:

  • PLM selection: using a PLM (pre-trained language model) brings significant improvements on all popular text classification tasks, and autoencoding PLMs (such as BERT or RoBERTa) usually work better than autoregressive PLMs (such as OpenAI GPT). Hugging Face maintains a rich repository of PLMs developed for various tasks.
  • Domain adaptation: most PLMs are trained on general-domain text corpora (such as the web). If the target domain differs greatly from the general domain, consider adapting the PLM by continuing to pre-train it on in-domain data. For domains with abundant unlabeled text, such as biomedicine, pre-training a language model from scratch may also be a good choice.
  • Task-specific model design: given the input text, the PLM produces a sequence of vectors as a contextual representation. One or more task-specific layers are then added on top to produce the final output for the target task. The choice of architecture for the task-specific layers depends on the nature of the task, for example, which linguistic structure of the text needs to be captured. For example, feedforward networks treat text as a bag of words, RNNs capture word order, CNNs are good at recognizing patterns such as key phrases, attention mechanisms effectively identify the relevant words in a text, and Siamese networks can be used for text matching. If the graph structure of natural language (e.g. parse trees) is useful for the target task, a GNN may be a good choice.
  • Task-specific fine-tuning: depending on the availability of in-domain labels, the task-specific layers can be trained alone with the PLM kept fixed, or trained together with the PLM. If several similar text classifiers need to be built (for example, news classifiers for different domains), multi-task fine-tuning is a good way to make use of labeled data from similar domains.
  • Model compression: PLMs are expensive to serve. They often need to be compressed, for example via knowledge distillation, to meet the latency and capacity constraints of real-world applications.
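The last step mentions knowledge distillation without saying what it looks like. As a rough sketch only: one common formulation trains a small "student" model to match the softened output distribution of a large "teacher" model. The function names, the temperature of 2.0, and the logits below are all made up for illustration, and this is not necessarily the exact variant the quoted recipe has in mind.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical logits for one sentence over three classes.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]     # prefers the opposite class

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The student that agrees with the teacher gets the lower loss. In practice this term is usually combined with the ordinary loss on the true labels.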

Couldn't follow it? I'll come clean: after the first read I was confused too. Then I read it carefully twice more and seemed to understand a little, looked up some technical terms on Baidu, and got lost again, so I still only half understand it, hahaha. I suspect you really have to go through the whole process yourself before you can explain it fully! When I have time to put it into practice, I'll update this with a proper interpretation!

Model performance analysis

There are four commonly used metrics for evaluating the performance of text classification models:

  • Accuracy and error rate: the main metrics for evaluating the quality of a classification model, measured over the whole data set. Let TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, and let N be the total number of samples. Accuracy and error rate are then defined as:

    • Accuracy = (TP+TN)/N
    • Error Rate = (FP+FN)/N
    • Error Rate = 1-Accuracy
  • Precision / recall / F1 score: these are also main metrics. Precision is measured relative to the predicted positives, and recall relative to the actual positives. The F1 score is the harmonic mean of precision and recall. They are usually computed for a single class. For multi-class problems, you can compute precision and recall for each class label and analyze the performance per class, or average these values to obtain overall precision and recall. A high F1 indicates that the model does well on both precision and recall.

    • Precision = TP/(TP+FP)
    • Recall = TP/(TP+FN)
    • F1 score = 2*Precision*Recall/(Precision+Recall)
  • Exact Match (EM): a popular metric for question answering systems. It measures the percentage of predictions that exactly match any one of the ground-truth answers. EM is one of the main metrics for SQuAD.

  • Mean Reciprocal Rank (MRR): usually used to evaluate the performance of ranking algorithms in NLP tasks such as query-document ranking and QA. In the formula below, Q is the set of all possible answers and rank_i is the rank position of the ground-truth answer.

    • MRR = (1/|Q|) * Σ (1/rank_i), summed over i = 1..|Q|

  • Other widely used metrics include mean average precision (MAP), area under the curve (AUC), false discovery rate, and false omission rate, to name a few.
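The definitions above are easier to check with numbers, so here is a small sketch that computes them by hand. The function names, labels, answers and ranks are made up for illustration. For real work a library such as scikit-learn is the safer choice, and real QA evaluation scripts usually normalize the text before comparing answers.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, error rate, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n = tp + fp + tn + fn

    accuracy = (tp + tn) / n
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def exact_match(predictions, gold_answers):
    """Fraction of predictions that equal any of the gold answers for that question."""
    hits = sum(1 for pred, golds in zip(predictions, gold_answers) if pred in golds)
    return hits / len(predictions)


def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based position of the correct answer for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
print(classification_metrics(y_true, y_pred))

print(exact_match(["paris", "42", "blue"],
                  [["paris", "paris, france"], ["41"], ["blue"]]))
print(mean_reciprocal_rank([1, 2, 4]))
```

In this toy example TP=3, FP=2, TN=2 and FN=1, so accuracy (5/8) and precision (3/5) come out different, which is a handy reminder that they are not the same thing.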

There are a lot of related definitions, haha. Here are some points I noticed while studying:

  • Many people confuse accuracy with precision. The Chinese translations are often mixed up, but remember that they are two different things: accuracy is measured over the whole data set, while precision is measured over the predicted positives.
  • Generally, precision and recall cannot both be improved at the same time. Many people overlook this, so let me explain. When precision is low, we usually raise the decision threshold, so that the model labels fewer samples as positive and the ones it does label are more likely to be real positives. Precision goes up, but some real positives that the model would otherwise have caught now fall below the threshold and are missed, so recall goes down. The reverse also holds: lowering the threshold raises recall and lowers precision. So in general, high precision comes with low recall, and high recall comes with low precision.
  • P-R curve: here is a very thorough note by an expert that I found online, included as a screenshot.
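The threshold trade-off between precision and recall can be seen on a toy example. This is only a sketch: the scores and labels are invented, and on real data precision does not always rise smoothly as the threshold goes up, although the overall trend is usually the same.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold, then compute precision and recall."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # With no predicted positives there are no false alarms, so report precision 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores (probability of the positive class) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.2, 0.5, 0.85):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this data, raising the threshold from 0.2 to 0.85 moves precision from 4/7 up to 1.0 while recall drops from 1.0 to 0.5. Plotting such pairs over many thresholds gives the P-R curve.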


Of course, besides using the common metrics to evaluate a text classification model, you can also compare its results against traditional machine learning algorithms or other non-deep-learning models to further highlight the performance improvement.

View: opportunities and challenges

In fact, deep learning was proposed a long time ago, but it has only become red-hot in recent years, thanks not only to progress in the underlying theory and hardware but also to its strong results in applications. With the help of deep learning models, great progress has been made in CV and NLP. Novel cutting-edge ideas keep emerging, such as neural embeddings, attention mechanisms, self-attention, the Transformer, BERT and XLNet, and they have driven the rapid progress of the past decade. In this respect, China still falls short of the large number of mature production projects and open source tools abroad. However, because of the particularities of the Chinese language, Chinese natural language processing and Chinese text analysis still have a lot of room for development and improvement. Many companies have also built one-stop Chinese text analysis services and are working to improve their applicability and accuracy.

The lack of large-scale Chinese domain data sets is a major gap I noticed in domestic academia while exploring Chinese natural language processing, and to some extent it holds back the development of related fields. It also makes it hard to establish shared evaluation benchmarks for niche domains, which in turn makes it hard for many small teams to promote their research results. Building new data sets for more challenging text classification tasks, such as QA with multi-step reasoning, text classification for multilingual documents, and text classification for extremely long documents, could therefore become the next breakthrough for the rapid development of Chinese text analysis. (Just my personal opinion, haha)

In addition, because most current deep learning models are supervised, they require large amounts of annotated domain text and a large investment of labor and time. Although few-shot learning and zero-shot learning have been proposed, they are still not mature enough. How to reduce the cost of building a model in a balanced way is the next problem to solve.

Explore the modeling of commonsense knowledge more. Here, building domain knowledge graphs is an important branch. Combining knowledge graphs with deep learning can improve model performance. While improving interpretability, it also deepens the machine's understanding of semantics, bringing it closer to human thinking instead of being just an unpredictable black box. On this basis, a model can reason about unknowns using "default" assumptions, in a way similar to human thinking, rather than relying only on a numerical model. Research on knowledge graph construction has been explored extensively in China and is approaching maturity. However, few teams have explored combining deep learning with knowledge graphs, and that work is still in its infancy (if I remember correctly, most of it is deep learning built on graph databases). At least there is no well-known open source project yet (maybe I just didn't find it, hahaha).

Dig deeper into black-box models. Although deep learning models have achieved impressive performance on challenging benchmarks, most of them are not interpretable. For example, why does one model outperform another on one data set but perform worse on others? What exactly did the deep learning model learn? What is the smallest neural network architecture that can reach a given accuracy on a given data set? Although attention and self-attention mechanisms provide some insight into these questions, detailed research on the underlying behavior and dynamics of these models is still lacking. A better understanding of the theoretical aspects of these models can help develop better models for various text analysis scenarios.
