A Survey on Text Classification: From Shallow to Deep Learning
Text classification is the most fundamental task in natural language processing. Driven by the unprecedented success of deep learning, research in this field has surged over the past decade. The existing literature proposes many methods, data sets and evaluation metrics, so a comprehensive summary is needed. This paper reviews text classification methods from 1961 to 2020, focusing on models from shallow learning to deep learning. A taxonomy for text classification is created according to the text involved and the models used for feature extraction and classification. Each category is then discussed in detail, covering technical developments and the benchmark data sets that support predictive testing. The paper also provides a comprehensive comparison between different techniques and identifies the advantages and disadvantages of various evaluation metrics. Finally, it summarizes the key implications, future research directions and challenges facing the research field.
Text classification process.
In many NLP applications, text classification, the process of assigning predefined labels to text, is a fundamental and important task. The main process is as follows. First, the text data is preprocessed for the model. Shallow learning models usually need good sample features obtained through manual feature engineering, which are then classified with classical machine learning algorithms. The effectiveness of this approach is therefore largely limited by feature extraction. Deep learning, by contrast, integrates feature engineering into the model fitting process by learning a set of nonlinear transformations.
The development of text classification.
The schematic diagram of the main text classification methods is shown in Figure 2. From the 1960s to 2010, text classification models based on shallow learning were dominant. Shallow learning refers to statistics-based models such as naive Bayes (NB), k-nearest neighbors (KNN) and support vector machines (SVM). Compared with the earlier rule-based methods, these models have clear advantages in accuracy and stability. However, they still require feature engineering, which is time-consuming and expensive. In addition, they usually ignore the natural sequential structure or context information in text data, which makes it difficult to learn the semantic information of words. Since 2010, text classification has gradually shifted from shallow learning models to deep learning models. Compared with methods based on shallow learning, deep learning methods avoid the manual design of rules and features and automatically provide semantically meaningful representations for text mining. As a result, most research on text classification is based on DNNs, which are data-driven methods with high computational complexity. Few studies focus on shallow learning models to address the limitations of computation and data.
The main contributions of this paper are as follows.
This paper summarizes the existing models from shallow learning to deep learning. Shallow learning models emphasize feature extraction and classifier design. Once the text has well-designed features, the classifier can be trained to converge quickly. DNNs can extract and learn features automatically without domain knowledge. Data sets and evaluation metrics are then provided for single-label and multi-label tasks, and future research challenges are summarized from the data, model and performance perspectives. In addition, various information is summarized in four tables, covering the necessary information of the classic shallow and deep learning models, the technical details of DNNs, the main information of the main data sets, and the general benchmark of the latest methods in different applications. In summary, the main contributions of this study are as follows:
The process and development of text classification are introduced, and Table 1 summarizes the necessary information of the classic models by year of publication, including venue, application, citations and code link.
According to the model structure, the main models from shallow learning to deep learning are comprehensively analyzed and studied. Classic and more specialized models are summarized, and Table 2 outlines their design differences in terms of basic models, metrics and experimental data sets.
This paper introduces the current data sets and describes the main evaluation metrics for both single-label and multi-label text classification tasks. Table 3 summarizes the necessary information of the main data sets, including the number of categories, average sentence length, size of each data set, related papers and data address.
In Table 5, the classification accuracy scores of classical models on benchmark data sets are summarized, and the main challenges of text classification are discussed.
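The evaluation measures mentioned in the contributions above can be illustrated with a minimal sketch. Accuracy for single-label tasks and micro- and macro-averaged F1 for multi-label tasks are assumed here as typical choices; the summary itself does not name the specific measures, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Single-label case: fraction of documents whose predicted label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_scores(y_true, y_pred):
    """Multi-label case: return (micro_f1, macro_f1).

    y_true and y_pred are lists of label sets, one set per document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        for label in pred & gold:
            tp[label] += 1  # predicted and correct
        for label in pred - gold:
            fp[label] += 1  # predicted but wrong
        for label in gold - pred:
            fn[label] += 1  # missed
    labels = set(tp) | set(fp) | set(fn)

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools the counts of all labels; macro-F1 averages per-label F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro
```

Micro-averaging weights every label decision equally, so frequent labels dominate, while macro-averaging gives each label the same weight regardless of frequency.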
Text classification model.
Text classification refers to extracting features from raw text data and predicting the categories of the text based on these features. Over the past decades, many models for text classification have been proposed, as shown in Table 1, which tabulates the main information of the main models, including venues, applications, citations and code links. Applications in the table include sentiment analysis (SA), topic labeling (TL), news classification (NC), question answering (QA), dialog act classification (DAC), natural language inference (NLI) and event prediction (EP). Among shallow learning models, NB is the first model used for the text classification task. After that, general classification models such as KNN, SVM and RF were proposed; these are called classifiers and are widely used in text classification. More recently, XGBoost and LightGBM have shown the potential to provide excellent performance. Among deep learning models, TextCNN has the most citations; it introduced the CNN model to text classification for the first time. Although BERT is not specifically designed for text classification, it has been widely used in the design of text classification models because of its effectiveness on many text classification data sets.
Shallow learning model
Shallow learning models speed up text classification, improve its accuracy, and broaden the application scope of shallow learning. First, the raw input text is preprocessed for training the shallow learning model; preprocessing usually includes word segmentation, data cleaning and data statistics. Then, text representation aims to express the preprocessed text in a form that is easier for computers to handle while minimizing information loss, using methods such as bag-of-words (BOW), n-gram, term frequency-inverse document frequency (TF-IDF), word2vec and GloVe. The core of BOW is to represent each text with a dictionary-sized vector. Each element of the vector holds the frequency of the word assigned to that position. Compared with BOW, n-gram takes the information of adjacent words into account and builds the dictionary from sequences of adjacent words. TF-IDF uses term frequency and inverse document frequency to model the text. Word2vec uses local context information to obtain word vectors. GloVe, which uses both local context and global statistical features, is trained on the non-zero elements of the word-word co-occurrence matrix. Finally, the represented text is fed into the classifier according to the selected features.
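The count-based representations described above (BOW, n-gram and TF-IDF) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: whitespace tokenization is assumed, the TF-IDF weighting shown (relative term frequency times log(N / df)) is only one of several common variants, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Sequences of n adjacent tokens, joined into single dictionary terms."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(docs_terms):
    """Map every distinct term to a fixed position in the vector."""
    vocab = sorted({t for terms in docs_terms for t in terms})
    return {t: i for i, t in enumerate(vocab)}


def bow_vector(terms, vocab):
    """Dictionary-sized vector whose entries are term frequencies."""
    vec = [0] * len(vocab)
    for term, count in Counter(terms).items():
        if term in vocab:
            vec[vocab[term]] = count
    return vec


def tfidf_vectors(docs_terms, vocab):
    """TF-IDF with tf = count / document length and idf = log(N / df)."""
    n_docs = len(docs_terms)
    df = Counter(t for terms in docs_terms for t in set(terms))
    vectors = []
    for terms in docs_terms:
        vec = [0.0] * len(vocab)
        for term, count in Counter(terms).items():
            vec[vocab[term]] = (count / len(terms)) * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors
```

Passing the output of `ngrams(tokens, 2)` instead of single tokens to `build_vocab` and `bow_vector` gives a bigram representation with the same code. Terms that occur in every document receive a TF-IDF weight of zero under this variant.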
Shallow learning is a kind of machine learning. It learns from data represented by pre-defined features, which are important to prediction performance. However, feature engineering is hard work. Before training the classifier, knowledge or experience must be collected to extract features from the raw text. The shallow learning method trains the initial classifier based on various text features extracted from the raw text. For small data sets, shallow learning models usually perform better than deep learning models under limited computational complexity. Therefore, some researchers have studied the design of shallow models for specific domains with less data.
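As an illustration of training a classical classifier on features extracted from text, the following is a minimal sketch of a multinomial naive Bayes classifier over word counts. Whitespace tokenization and add-one (Laplace) smoothing are assumptions made for the example, and the class name is hypothetical; it is not the implementation used by any work in the survey.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, doc):
        # Words never seen in training are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(class) + sum of log P(word | class), smoothed
            score = math.log(count / n_docs)
            denom = self.total_words[label] + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training only counts words, which is one reason such models remain practical on small data sets with limited computation.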
Deep learning model
DNNs consist of artificial neural networks that simulate the human brain to automatically learn high-level features from data, and they achieve better results than shallow learning models in speech recognition, image processing and text understanding. The input data set should first be analyzed to determine its type, such as single-label, multi-label, unsupervised or unbalanced. According to the characteristics of the data set, the input word vectors are fed into the DNN for training until the termination condition is reached. The performance of the trained model is verified on downstream tasks such as sentiment classification, question answering and event prediction. The DNNs proposed over the years are shown in Table 2, including the designs that differ from the corresponding basic models, the evaluation metrics and the experimental data sets. As shown in Table 2, feedforward neural network and recurrent neural network are the first two deep learning methods used for text classification tasks. Compared with shallow learning models, they improve performance. Then, CNNs, RNNs and attention mechanisms were applied to text classification. Many researchers improve text classification performance on different tasks by improving CNNs, RNNs and attention, or through model fusion and multi-task methods. The emergence of Bidirectional Encoder Representations from Transformers (BERT), which can generate contextualized word vectors, is an important turning point in the development of text classification and other NLP technologies. Many researchers have studied text classification models based on BERT, which perform better than the above models on multiple NLP tasks including text classification. In addition, some researchers have studied GNN-based text classification to capture the structural information in text, which cannot be replaced by other methods.
Deep learning models consist of neural networks with multiple hidden layers, which have higher complexity and can be trained on unstructured data. Deep learning architectures can learn feature representations directly from the input without much human intervention or prior knowledge. However, deep learning is a data-driven approach that usually requires a large amount of data to achieve high performance. Although models based on self-attention can bring some word-level interpretability to DNNs, this is not enough, compared with shallow models, to explain why and how they work.
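To make the word-level interpretability of self-attention concrete, the following minimal sketch computes scaled dot-product self-attention weights for a short token sequence. It is a simplification under stated assumptions: the queries, keys and values are taken to be the token embeddings themselves (identity projections), whereas practical models use learned projection matrices and multiple heads, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with Q = K = V = embeddings.

    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j, and each row of weights sums to 1.
    """
    d = len(embeddings[0])
    weights = []
    for q in embeddings:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in embeddings]
        weights.append(softmax(scores))
    # Each output vector is the attention-weighted average of all token vectors.
    outputs = [
        [sum(w * v[dim] for w, v in zip(row, embeddings)) for dim in range(d)]
        for row in weights
    ]
    return outputs, weights
```

Inspecting a row of `weights` shows which tokens contributed most to a given token's representation. This is the limited form of interpretability referred to above: it exposes word-to-word associations but does not by itself explain the final prediction.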
Text classification, as an effective information retrieval and mining technology, plays an important role in managing text data. It uses NLP, data mining, machine learning and other techniques to automatically classify and discover different text types. Text classification takes multiple types of text as input, and the text is represented as vectors by a pre-trained model. The vectors are then fed into a DNN for training until the termination condition is reached. Finally, downstream tasks verify the performance of the trained model. Existing models have shown their usefulness in text classification, but many possible improvements remain to be explored. Although some new text classification models have repeatedly raised the accuracy scores on most classification tasks, this does not indicate whether the model "understands" the text at the semantic level as humans do. In addition, when noisy samples are present, even a small amount of noise may substantially change decision confidence or even reverse the decision. Therefore, the semantic representation ability and robustness of a model need to be demonstrated in practice. Furthermore, pre-trained semantic representation models, typified by word vectors, can usually improve the performance of downstream NLP tasks, yet existing research on transfer strategies for context-free word vectors is still relatively preliminary. Therefore, from the perspectives of data, model and performance, we conclude that text classification faces the following challenges:
For text classification tasks, data is essential to model performance, whether the model is based on shallow learning or deep learning. The text data studied mainly includes multi-chapter, short, cross-language, multi-label and few-sample text. Given the characteristics of these data, the existing technical challenges are as follows:
Current deep learning models rely too heavily on large amounts of labeled data. Their performance degrades significantly in zero-shot or few-shot learning settings.
It is well known that the more useful information a DNN receives as input, the better it performs. Therefore, adding external knowledge (a knowledge base or knowledge graph) is an effective way to improve model performance. However, what to add and how to add it remain a challenge.
Multi-label text classification task.
Multi-label text classification needs to fully consider the semantic relationships between labels, and the embedding and encoding performed by the model is a lossy compression process. Therefore, how to reduce the loss of hierarchical semantics during training and how to retain rich and complex document semantics remain urgent problems to be solved.
Special domains with many terminology vocabularies.
Domain-specific texts (such as financial and medical texts) contain many specialized words, terms understandable only to domain experts, abbreviations and so on, which make existing pre-trained word vectors difficult to use.
Most existing shallow and deep learning models have been tried for text classification, including ensemble methods. BERT learns a language representation that can be fine-tuned for many NLP tasks. The main ways to obtain better results are to increase the data, improve computing power and design training procedures. How to trade off data and computing resources against prediction performance is worth studying.
Performance evaluation level:
Shallow and deep models can achieve good performance on most text classification tasks, but the robustness of their results to interference needs to be improved. How to interpret deep models is also a technical challenge.
Semantic robustness of the model.
In recent years, researchers have designed many models to improve the accuracy of text classification. However, if the data set contains adversarial samples, model performance drops sharply. Therefore, how to improve model robustness is a current research hotspot and challenge.
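One simple way to probe this kind of fragility is to measure how often predictions change when the input is slightly perturbed. The sketch below is only an illustration under stated assumptions: the keyword-based classifier is a hypothetical stand-in for a trained model, and the adjacent-character swap is a typo-style perturbation rather than a true adversarial attack, which would search for the most damaging change.

```python
def keyword_classifier(text):
    """Hypothetical stand-in model: 'pos' if positive cue words outnumber negative ones."""
    pos = {"great", "good", "excellent"}
    neg = {"bad", "boring", "terrible"}
    tokens = text.lower().split()
    score = sum(t in pos for t in tokens) - sum(t in neg for t in tokens)
    return "pos" if score > 0 else "neg"


def swap_adjacent_chars(word, i=1):
    """Swap characters i and i+1, imitating a typing error."""
    if len(word) < i + 2:
        return word  # too short to perturb at this position
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb(text):
    """Apply the character swap to every word in the text."""
    return " ".join(swap_adjacent_chars(w) for w in text.split())


def flip_rate(model, texts, perturbation):
    """Fraction of texts whose predicted label changes after perturbation."""
    flips = sum(model(t) != model(perturbation(t)) for t in texts)
    return flips / len(texts)
```

A high flip rate under such small, meaning-preserving changes suggests that the model depends on surface forms rather than semantics. The same `flip_rate` function can be reused with any callable model and any perturbation function.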
The interpretability of the model.
DNNs have unique advantages in feature extraction and semantic mining and have performed excellently on text classification tasks. However, deep learning is a black-box model: the training process is difficult to reproduce, and the implicit semantics and the output are poorly interpretable. As a result, improving and optimizing the model lacks clear guidelines. In addition, we cannot explain exactly why a model improves performance.