A Survey on Text Classification: From Shallow to Deep Learning
Text classification is the most fundamental task in natural language processing. Driven by the unprecedented success of deep learning, research in this field has surged over the past decade. The existing literature proposes many methods, datasets and evaluation metrics, so a comprehensive summary is needed. This paper reviews text classification methods from 1961 to 2020, focusing on models from shallow learning to deep learning. A taxonomy for text classification is created according to the text involved and the models used for feature extraction and classification. Each category is then discussed in detail, covering technical developments and the benchmark datasets that support predictive testing. The paper also provides a comprehensive comparison between different techniques and identifies the advantages and disadvantages of various evaluation metrics. Finally, it summarizes the key implications, future research directions and challenges facing the research field.
Text classification process.
In many NLP applications, text classification, the process of assigning predefined labels to text, is a fundamental and important task. The process begins with preprocessing the text data for the model. Shallow learning models usually need good sample features obtained by manual methods, which are then classified with classical machine learning algorithms. The effectiveness of this approach is therefore largely limited by feature extraction. Deep learning, by contrast, integrates feature engineering into the model fitting process by learning a set of nonlinear transformations that map features directly to outputs.
The development of text classification.
The schematic diagram of the main text classification methods is shown in Figure 2. From the 1960s to the 2010s, text classification models based on shallow learning were dominant. Shallow learning refers to statistics-based models such as naive Bayes (NB), k-nearest neighbors (KNN) and support vector machines (SVM). Compared with the earlier rule-based methods, these methods have obvious advantages in accuracy and stability. However, they still require feature engineering, which is time-consuming and expensive. In addition, they usually ignore the natural sequential structure or context information in text data, which makes it difficult to learn the semantic information of words. Since 2010, text classification has gradually shifted from shallow learning models to deep learning models. Compared with shallow learning methods, deep learning methods avoid manually designed rules and features and automatically provide semantically meaningful representations for text mining. Therefore, most text classification research is now based on deep neural networks (DNNs), which are data-driven and have high computational complexity. Few studies focus on shallow learning models to address the limitations of computation and data.
The main contribution of this paper.
This paper summarizes the existing models from shallow learning to deep learning. Shallow learning models emphasize feature extraction and classifier design. Once the text has well-designed features, the classifier can be trained to converge quickly. DNNs can extract and learn features automatically without domain knowledge. The paper then presents datasets and evaluation metrics for single-label and multi-label tasks, and summarizes future research challenges from the perspectives of data, models and performance. In addition, four tables summarize various kinds of information: the necessary information of classic shallow and deep learning models, the technical details of DNNs, the main information of the main datasets, and a general benchmark of state-of-the-art methods in different applications. In summary, the main contributions of this study are as follows:
The process and development of text classification are described, and Table 1 summarizes the necessary information of the classic models by year of publication, including venue, application, citations and code link.
The main models, from shallow learning to deep learning, are comprehensively analyzed according to their model structures. The classic or more specific models are summarized, and Table 2 outlines their design differences in terms of basic models, metrics and experimental datasets.
The paper introduces the current datasets and presents the main evaluation metrics for both single-label and multi-label text classification tasks. Table 3 summarizes the necessary information for the main datasets, including the number of categories, the average sentence length, the size of each dataset, related papers and data links.
Table 5 summarizes the classification accuracy scores of classic models on benchmark datasets, and the main challenges of text classification are discussed.
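One common way to compute evaluation measures for single-label and multi-label tasks is sketched below. The text above does not name specific metrics, so accuracy and micro/macro-averaged F1 are assumptions, chosen only because they are widely used; the function names and the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose single predicted label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_scores(true_sets, pred_sets, labels):
    """Micro- and macro-averaged F1 over label sets (one set per document)."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for true, pred in zip(true_sets, pred_sets):
        for label in labels:
            if label in pred and label in true:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in true:
                fn[label] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Single-label toy example (made-up labels).
acc = accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])

# Multi-label toy example (made-up labels).
labels = ["sports", "politics", "economy"]
true_sets = [{"sports"}, {"politics", "economy"}, {"economy"}]
pred_sets = [{"sports"}, {"politics"}, {"economy", "sports"}]
micro_f1, macro_f1 = f1_scores(true_sets, pred_sets, labels)
```

Micro-averaging weights every label decision equally, so frequent labels dominate; macro-averaging weights every label equally, so it is more sensitive to rare labels.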
Text classification model.
Text classification refers to extracting features from raw text data and predicting the categories of the text based on these features. Over the past few decades, many models for text classification have been proposed, as shown in Table 1, which lists the main information (including venue, application, citations and code link) of the main text classification models. The applications in this table include sentiment analysis (SA), topic labeling (TL), news classification (NC), question answering (QA), dialog act classification (DAC), natural language inference (NLI) and event prediction (EP). Among shallow learning models, NB was the first model used for the text classification task. After that, general classification models such as KNN, SVM and random forest (RF) were proposed; these are called classifiers and are widely used in text classification. More recently, XGBoost and LightGBM have shown the potential to provide excellent performance. Among deep learning models, TextCNN has the most citations; it introduced the CNN model to the text classification problem for the first time. Although BERT was not designed specifically for text classification, it has been widely used in the design of text classification models because of its effectiveness on many text classification datasets.
Shallow learning model.
Shallow learning models speed up text classification, improve its accuracy and expand its range of applications. First, the original input text is preprocessed for training the shallow learning model, which usually includes word segmentation, data cleaning and data statistics. Then, text representation aims to express the preprocessed text in a form that is easier for the computer to process while minimizing information loss, such as bag of words (BOW), n-gram, term frequency-inverse document frequency (TF-IDF), word2vec and GloVe. The core of BOW is to represent each text with a vector of dictionary size, in which each value is the frequency in the text of the word assigned to that position. Compared with BOW, n-gram considers the information of adjacent words and builds its dictionary from sequences of adjacent words. TF-IDF models text using term frequency and inverse document frequency. Word2vec uses local context information to obtain word vectors. GloVe, which uses both local context and global statistical features, trains on the nonzero elements of the word-word co-occurrence matrix. Finally, the text represented by the selected features is fed into the classifier.
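The count-based representations described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the survey's implementation: it assumes whitespace tokenization, raw counts for BOW, and the basic TF-IDF variant tf = count / document length, idf = log(N / df); libraries typically use smoothed variants. All function names and the three toy documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(token_lists):
    """Assign each distinct term a fixed position in the feature vector."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: idx for idx, term in enumerate(terms)}


def bow_vector(tokens, vocab):
    """Bag of words: a vocabulary-sized vector of raw term counts."""
    vec = [0] * len(vocab)
    for term, count in Counter(tokens).items():
        if term in vocab:
            vec[vocab[term]] = count
    return vec


def tfidf_vectors(token_lists, vocab):
    """Basic TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        vec = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            vec[vocab[term]] = (count / len(tokens)) * math.log(n_docs / df[term])
        vectors.append(vec)
    return vectors


docs = ["the movie was good", "the movie was bad", "good good plot"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocab(token_lists)       # bad, good, movie, plot, the, was
bow = [bow_vector(t, vocab) for t in token_lists]
tfidf = tfidf_vectors(token_lists, vocab)
bigrams = ngrams(token_lists[0], 2)    # adjacent word pairs of the first document
```

To use n-grams as features, the same `build_vocab` and `bow_vector` functions can be applied to the output of `ngrams` instead of the single tokens. Word2vec and GloVe are learned embeddings and are not covered by this sketch.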
Shallow learning is a kind of machine learning. It learns from data represented by predefined features, which are important to prediction performance. However, feature engineering is an arduous task. Before training the classifier, knowledge or experience must be collected to extract features from the original text. The shallow learning method then trains the initial classifier on the various text features extracted from the original text. For small datasets, and under limits on computational complexity, shallow learning models usually perform better than deep learning models. Therefore, some researchers have studied the design of shallow models for specific domains with less data.
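As a concrete instance of a shallow classifier trained on extracted text features, a minimal multinomial naive Bayes sketch follows. It assumes whitespace-tokenized text, raw term counts as features and add-alpha smoothing with alpha = 1; the function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(token_lists, labels, alpha=1.0):
    """Fit multinomial naive Bayes with add-alpha smoothing."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    term_counts = defaultdict(Counter)  # label -> term -> count
    for tokens, label in zip(token_lists, labels):
        term_counts[label].update(tokens)
    log_prior, log_likelihood = {}, {}
    for label, n_docs in Counter(labels).items():
        log_prior[label] = math.log(n_docs / len(labels))
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_likelihood[label] = {
            term: math.log((term_counts[label][term] + alpha) / total)
            for term in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, tokens):
    """Return the label with the highest log posterior; unseen words are ignored."""
    log_prior, log_likelihood = model
    scores = {}
    for label in log_prior:
        known = [t for t in tokens if t in log_likelihood[label]]
        scores[label] = log_prior[label] + sum(log_likelihood[label][t] for t in known)
    return max(scores, key=scores.get)


train_texts = [
    "great acting and a great plot",
    "a wonderful and moving film",
    "boring plot and terrible acting",
    "a dull and terrible film",
]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb([t.split() for t in train_texts], train_labels)

pred_pos = predict_nb(model, "great film".split())
pred_neg = predict_nb(model, "terrible boring plot".split())
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never occurred with a class from forcing that class's probability to zero.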
Deep learning model.
A DNN consists of artificial neural networks that simulate the human brain to automatically learn high-level features from data. It achieves better results than shallow learning models in speech recognition, image processing and text understanding. The input dataset should first be analyzed to determine its type, such as single-label, multi-label, unsupervised or unbalanced. According to the characteristics of the dataset, the input word vectors are fed into the DNN for training until the termination condition is reached. The performance of the trained model is verified on downstream tasks, such as sentiment classification, question answering and event prediction. Table 2 lists DNNs over the years, including the designs that differ from the corresponding basic models, the evaluation metrics and the experimental datasets. As shown in Table 2, the feedforward neural network and the recursive neural network were the first two deep learning methods applied to text classification, and they improve performance compared with shallow learning models. CNNs, RNNs and attention mechanisms were then applied to text classification. Many researchers have improved text classification performance on different tasks by improving CNNs, RNNs and attention, or by using model fusion and multi-task methods. The emergence of Bidirectional Encoder Representations from Transformers (BERT), which can generate contextualized word vectors, is an important turning point in the development of text classification and other NLP technologies. Many BERT-based text classification models perform better than the models mentioned above on multiple NLP tasks. In addition, some researchers have studied text classification based on graph neural networks (GNNs) to capture the structural information in text, which other methods cannot replace.
Deep learning uses neural networks with multiple hidden layers, which have higher complexity and can be trained on unstructured data. Deep learning architectures can learn feature representations directly from the input without much manual intervention or prior knowledge. However, deep learning is a data-driven approach that usually requires a large amount of data to achieve high performance. Although models based on self-attention can bring some interpretability of the relations between words to DNNs, this is not enough, compared with shallow models, to explain why and how DNNs work.
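The sketch below illustrates how self-attention weights can be read as word-to-word associations. It is a simplified assumption-laden example: the word vectors themselves serve as queries and keys, whereas real Transformer models apply learned query, key and value projections, and the two-dimensional embeddings are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(vectors):
    """Scaled dot-product attention weights with queries = keys = inputs."""
    d = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
tokens = ["film", "good", "great"]
vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]

weights = self_attention_weights(vectors)
row_good = weights[tokens.index("good")]  # how much "good" attends to each token
```

Each row is a probability distribution over the tokens, so a row can be inspected to see which words a given word attends to most. Such weights indicate association only; they do not explain the model's final decision.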
Text classification, as an effective information retrieval and mining technology, plays an important role in managing text data. It uses NLP, data mining, machine learning and other techniques to automatically classify and discover different text types. Text classification takes multiple types of text as input, and a pre-trained model represents the text as vectors. The vectors are then fed into the DNN for training until the termination condition is reached. Finally, the performance of the trained model is verified on downstream tasks. Existing models have shown their usefulness in text classification, but many possible improvements remain to be explored. Although some new text classification models repeatedly set new accuracy records on most classification tasks, this does not show whether the model “understands” the text at the semantic level as humans do. In addition, when noisy samples are present, a small amount of sample noise may cause substantial changes in decision confidence, or even reverse the decision. Therefore, the semantic representation ability and robustness of the model need to be demonstrated in practice. In addition, pre-trained semantic representation models, represented by word vectors, can generally improve the performance of downstream NLP tasks. Existing research on transfer strategies for context-free word vectors is still relatively preliminary. Therefore, from the perspectives of data, model and performance, we conclude that text classification faces the following challenges:
For text classification tasks, data is essential to model performance, whether the method is shallow learning or deep learning. The text data studied mainly includes multi-chapter, short, cross-language, multi-label and few-sample text. Given the characteristics of these data, the existing technical challenges are as follows:
Current deep learning models rely too heavily on large amounts of labeled data. Their performance degrades significantly in zero-shot or few-shot learning.
It is well known that the more useful information is provided as input, the better a DNN performs. Therefore, adding external knowledge (a knowledge base or knowledge graph) is an effective way to improve model performance. However, how to add it and what to add remain a challenge.
Multi-label text classification task.
Multi-label text classification needs to fully consider the semantic relationships between labels, and the embedding and encoding performed by the model is a lossy compression process. Therefore, how to reduce the loss of hierarchical semantics during training and how to retain rich and complex document semantic information are still urgent problems to be solved.
Special domains with many terminologies.
Domain-specific texts (such as financial and medical texts) contain many specialized words, expressions understood only by domain experts, abbreviations and so on, which makes existing pre-trained word vectors difficult to use.
Most existing shallow and deep learning models, including ensemble methods, have been applied to text classification. BERT learns a language representation that can be fine-tuned for many NLP tasks. The main ways to obtain better results are to increase the amount of data, improve computing power and design better training procedures. How to trade off data and computing resources against prediction performance is worth studying.
Performance evaluation level:
Shallow and deep models can achieve good performance on most text classification tasks, but the robustness of their results to interference needs to be improved. How to interpret deep models is also a technical challenge.
Semantic robustness of the model.
In recent years, researchers have designed many models to improve the accuracy of text classification. However, if the dataset contains adversarial samples, the performance of a model can drop sharply. Therefore, how to improve the robustness of models is a hot topic and a challenge.
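A crude way to probe this kind of fragility is to apply small perturbations to an input and count how often the prediction changes. The sketch below uses adjacent-character swaps as the perturbation, which is an assumption chosen for simplicity; adversarial attacks in the literature are typically much stronger. The keyword classifier is a hypothetical stand-in for a trained model, and all names are illustrative.

```python
def swap_adjacent(word, i):
    """Swap characters i and i + 1: a simple typo-style perturbation."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturbations(text):
    """Yield each variant of `text` with one adjacent-character swap in one word."""
    words = text.split()
    for w_idx, word in enumerate(words):
        for i in range(len(word) - 1):
            swapped = swap_adjacent(word, i)
            if swapped != word:  # skip swaps of identical characters
                yield " ".join(words[:w_idx] + [swapped] + words[w_idx + 1:])


def flip_rate(classify, text):
    """Fraction of perturbed variants whose predicted label differs from the original."""
    original = classify(text)
    variants = list(perturbations(text))
    if not variants:
        return 0.0
    flips = sum(classify(v) != original for v in variants)
    return flips / len(variants)


def keyword_classifier(text):
    """Hypothetical brittle model: positive only if the exact token 'good' appears."""
    return "pos" if "good" in text.lower().split() else "neg"


variants = list(perturbations("a good film"))
rate = flip_rate(keyword_classifier, "a good film")
```

A flip rate well above zero on perturbations that a human reader would barely notice suggests that the model depends on surface forms rather than meaning.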
The interpretability of the model.
DNNs have unique advantages in feature extraction and semantic mining, and have performed excellently on text classification tasks. However, deep learning is a black-box model: the training process is difficult to reproduce, and the implicit semantics and outputs are poorly interpretable. As a result, there are no clear guidelines for improving and optimizing the model. In addition, we cannot explain exactly why a model improves performance.