The Chinese natural language processing pipeline
A corpus is language material: the content of linguistic research, and the basic unit from which a corpus collection is built. In practice, people simply use text as a stand-in for language, and treat the context within a text as a stand-in for context in the real world. A set of texts is called a corpus; when there are several such text sets, we call them corpora. (Definition adapted from Baidu Encyclopedia.) According to its source, we divide corpus data into the following two types:
1. Existing corpus
Many businesses, companies, and other organizations accumulate large amounts of paper or electronic text as their operations grow. We can consolidate these materials, digitizing any paper documents, and use the result as our corpus.
2. Downloading or crawling a corpus online
What if you have no data at hand? In that case, you can obtain standard open datasets from home or abroad, such as the Sogou corpus or the People's Daily corpus for Chinese. (Foreign open datasets are mostly in English, so they are not covered here for now.) You can also crawl some data yourself and then proceed with the subsequent steps.
This article focuses on preprocessing the corpus. In a complete Chinese natural language processing application, corpus preprocessing accounts for roughly 50%–70% of the total workload, so developers spend most of their time on it. The preprocessing of a corpus consists of four steps: data cleaning, word segmentation, part-of-speech tagging, and stop-word removal.
1. Corpus cleaning
Data cleaning, as the name implies, means keeping the content of the corpus we are interested in and deleting the uninteresting content that we regard as noise. This includes extracting the title, abstract, body text, and other information from raw documents, and stripping advertisements, tags, HTML, JavaScript, and other code and comments from crawled web pages. Common cleaning methods include manual deduplication, alignment, deletion, and annotation, as well as rule-based content extraction, regular-expression matching, extraction based on part of speech or named entities, and batch processing with scripts or code.
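As a minimal sketch of the regular-expression approach mentioned above, the helper below (a hypothetical example, not from the original) strips scripts, styles, comments, and tags from a crawled page, keeping only the visible text:

```python
import re

def clean_text(raw_html: str) -> str:
    """Strip script/style blocks, HTML comments, and tags; collapse whitespace."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw_html)  # drop JS/CSS blocks
    text = re.sub(r"(?s)<!--.*?-->", " ", text)                        # drop HTML comments
    text = re.sub(r"<[^>]+>", " ", text)                               # drop remaining tags
    text = re.sub(r"\s+", " ", text)                                   # collapse whitespace
    return text.strip()

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>语料 corpus</p><script>var x=1;</script></body></html>")
print(clean_text(page))  # 语料 corpus
```

Real pages are messier than this, so in practice such rules are accumulated iteratively against the actual data.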
2. Word segmentation
Chinese corpus data is a collection of short or long texts, such as sentences, abstracts, paragraphs, or whole articles. In general, the characters within sentences and paragraphs run together continuously, with no delimiters between words. For text mining and analysis, we want the minimum unit of text processing to be the word, so we need word segmentation to split all the text into words.
Common word segmentation algorithms include string-matching-based, understanding-based, statistics-based, and rule-based methods.
At present, the main difficulties in Chinese word segmentation are ambiguity resolution and new-word recognition. For example, “羽毛球拍卖完了” can be segmented as “羽毛球拍 / 卖完了” (“the badminton rackets are sold out”) or as “羽毛球 / 拍卖 / 完了” (“the badminton auction is over”). Without other sentences from the context, it is hard to know which reading is intended.
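To make the string-matching family of methods concrete, here is a minimal forward-maximum-matching segmenter over a toy dictionary (both the function and the dictionary are illustrative assumptions, not a production tool). On the ambiguous sentence above, the greedy longest-match rule commits to the “rackets sold out” reading:

```python
def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"羽毛球", "羽毛球拍", "拍卖", "卖完", "完了"}
print(fmm_segment("羽毛球拍卖完了", toy_dict))  # ['羽毛球拍', '卖完', '了']
```

This is exactly why statistics-based segmenters, which score whole segmentations by probability, handle such ambiguity better than pure string matching.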
3. Part-of-speech tagging
Part-of-speech tagging means labeling every word with its part of speech, such as adjective, verb, or noun, so that the text carries more useful linguistic information for later processing. Part-of-speech tagging is a classic sequence-labeling problem. However, it is not necessary for every Chinese NLP task: ordinary text classification, for example, does not care about part of speech, while tasks such as sentiment analysis and knowledge reasoning do need it. The figure below shows a common Chinese part-of-speech tag set.
Common part-of-speech tagging methods fall into rule-based and statistics-based approaches. Statistical methods include maximum-entropy POS tagging, maximum-probability POS tagging, and HMM-based POS tagging.
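The HMM approach can be sketched with the Viterbi algorithm on a toy model. All probabilities and the two-tag set below are invented for illustration; a real tagger would estimate them from an annotated corpus:

```python
import math

# Toy HMM: states are POS tags (n = noun, v = verb); all probabilities hypothetical.
states = ["n", "v"]
start = {"n": 0.6, "v": 0.4}
trans = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
emit = {"n": {"我": 0.5, "爱": 0.1, "北京": 0.4},
        "v": {"我": 0.1, "爱": 0.8, "北京": 0.1}}

def viterbi(obs):
    """Return the most probable tag sequence for the observed words."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for word in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[-1][p] + math.log(trans[p][s])) for p in states),
                key=lambda x: x[1])
            row[s] = score + math.log(emit[s][word])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):          # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["我", "爱", "北京"]))  # ['n', 'v', 'n']
```

Even this tiny model tags 我/n 爱/v 北京/n correctly, because the transition probabilities prefer a noun–verb–noun pattern.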
4. Stop-word removal
Stop words generally refer to words that contribute nothing to the text's features, such as punctuation marks, modal particles, and pronouns. So in typical text processing, the step after word segmentation is removing stop words. For Chinese, however, stop-word removal is not one-size-fits-all: the stop-word dictionary depends on the specific scenario. In sentiment analysis, for example, modal particles and exclamation marks should be retained, because they contribute to expressing the degree of mood and the emotional coloring.
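The scenario-dependence described above can be shown with two stop-word lists (both minimal, hypothetical examples), one general and one for sentiment analysis that keeps the modal particle 吗 and the exclamation mark:

```python
# General-purpose stop-word list (hypothetical minimal example).
stopwords_default = {"的", "了", "吗", "！", "，", "。"}
# For sentiment analysis, keep modal particles and exclamation marks.
stopwords_sentiment = stopwords_default - {"吗", "！"}

tokens = ["这", "电影", "太", "好看", "了", "！"]
print([t for t in tokens if t not in stopwords_default])    # drops 了 and ！
print([t for t in tokens if t not in stopwords_sentiment])  # keeps ！ for sentiment
```

The second filtering preserves the exclamation mark, which signals emotional intensity.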
After preprocessing, we need to consider how to represent the segmented words in a form a computer can work with. Clearly, to compute anything we must at least convert each Chinese word into a number; more precisely, into a mathematical vector. There are two commonly used representation models: the bag-of-words model and word vectors.
The bag-of-words (BoW) model ignores the original order of words in a sentence: each word or symbol is put directly into a collection (such as a list), and its occurrences are counted. Raw word-frequency counting is only the most basic approach; TF-IDF is a classic refinement of the bag-of-words model.
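A bare-bones TF-IDF over bag-of-words counts can be written in a few lines of plain Python (a sketch with a toy three-document corpus; library implementations add smoothing and normalization variants):

```python
import math
from collections import Counter

docs = [["我", "爱", "北京"], ["我", "爱", "编程"], ["北京", "欢迎", "你"]]

def tf_idf(docs):
    """Bag-of-words term frequencies weighted by inverse document frequency."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

w = tf_idf(docs)
# "我" appears in 2 of 3 docs, "编程" in only 1, so "编程" gets the higher weight.
print(w[1]["编程"] > w[1]["我"])  # True
```

This is exactly the intuition behind TF-IDF: terms frequent in one document but rare across the corpus are the most discriminative features.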
A word-vector model transforms words into a vector matrix. The simplest word representation is one-hot encoding, which represents each word as a very long vector whose dimension equals the vocabulary size: all elements are 0 except for a single 1 in the dimension representing the current word. There is also word2vec from the Google team, which includes two models, skip-gram and continuous bag of words (CBOW), along with two efficient training methods, negative sampling and hierarchical softmax. Notably, word2vec vectors capture similarity and analogy relations between words well. Beyond these, there are other word-vector representations such as doc2vec, WordRank, and fastText.
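One-hot encoding is simple enough to write directly (a minimal sketch; the vocabulary and lookup scheme here are illustrative, and real vocabularies make these vectors extremely long and sparse):

```python
corpus = ["我", "爱", "北京", "我"]
vocab = sorted(set(corpus))          # fixed vocabulary order: ['北京', '我', '爱']

def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("北京"))  # [1, 0, 0]
print(one_hot("爱"))    # [0, 0, 1]
```

The weakness this exposes is why word2vec matters: every pair of one-hot vectors is equally distant, so no similarity between words can be expressed.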
As in data mining, feature engineering is essential in text mining. In a practical problem, constructing a good feature vector means selecting appropriate features with strong expressive power. Text features are generally words carrying semantic information. Feature selection finds a feature subset that still retains the semantic information, whereas the feature subspace found by feature extraction loses some of it. Feature selection is therefore a challenging process that relies heavily on experience and domain knowledge, although many algorithms for it already exist. Six common feature selection methods are DF, MI, IG, CHI, WLLR, and WFO.
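The simplest of the six, document frequency (DF), can be sketched directly: keep only terms that appear in at least a minimum number of documents (the threshold and toy corpus below are illustrative choices):

```python
from collections import Counter

def select_by_df(docs, min_df=2):
    """Document-frequency selection: keep terms appearing in >= min_df documents."""
    df = Counter(w for doc in docs for w in set(doc))
    return {w for w, c in df.items() if c >= min_df}

docs = [["我", "爱", "北京"], ["我", "爱", "编程"], ["北京", "欢迎", "你"]]
print(sorted(select_by_df(docs)))  # ['北京', '我', '爱'] each occur in >= 2 documents
```

DF is cheap but crude; methods such as MI, IG, and CHI additionally use class labels to score how well a term discriminates between categories.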
After selecting the feature vectors, the next step is of course training a model. Different applications call for different models: traditional supervised and unsupervised machine learning models such as KNN, SVM, naive Bayes, decision trees, GBDT, and K-means; or deep learning models such as CNN, RNN, LSTM, seq2seq, fastText, and TextCNN. These models will be used in the later examples on classification, clustering, sequence tasks, and sentiment analysis, so they are not described here.
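To close the loop from segmented, cleaned text to a trained model, here is a tiny multinomial naive Bayes classifier with Laplace smoothing over word-segmented documents (the class, training data, and labels are all invented for illustration):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over word-segmented documents, Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.prior = Counter(labels)                  # class frequencies
        self.counts = defaultdict(Counter)            # per-class word counts
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        return self

    def predict(self, doc):
        def score(y):
            total = sum(self.counts[y].values())
            s = math.log(self.prior[y] / sum(self.prior.values()))
            for w in doc:                             # add-one smoothed likelihoods
                s += math.log((self.counts[y][w] + 1) / (total + len(self.vocab)))
            return s
        return max(self.prior, key=score)

train = [["好看", "精彩"], ["喜欢", "精彩"], ["无聊", "难看"], ["难看", "失望"]]
y = ["pos", "pos", "neg", "neg"]
clf = NaiveBayes().fit(train, y)
print(clf.predict(["精彩", "喜欢"]))  # pos
```

Note that every preprocessing step discussed above (cleaning, segmentation, stop-word removal) happens before documents reach `fit`, which is why preprocessing dominates the engineering effort.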