Natural Language Processing: Text Vectorization (2)


1. Abstract

This article continues the previous discussion of text vectorization. Last time we covered two approaches: the bag-of-words representation and deep-learning-based word vectors. Although the bag-of-words model can represent words as vectors in a simple way, it suffers from the curse of dimensionality and cannot exploit word order or other information in the text. The NNLM model instead builds a probabilistic language model, but training it is expensive: computing the weights from the hidden layer to the output layer is a very time-consuming step. Here we look at three models that address this cost: the C&W model, the CBOW model, and the Skip-gram model.

2. The C&W model

The C&W model is designed to produce word vectors directly. As noted for NNLM, the cost of the weight computation is a major obstacle. The C&W model does not use a language model to estimate the conditional probability of a word given its context; instead it scores n-grams directly, which is a faster and more efficient way to obtain word vectors. Its core idea: if an n-gram appears in the corpus, the model assigns it a high score; n-grams that do not appear in the corpus, or appear only rarely, receive a low score. The structure of the C&W model is shown below:


Figure 1: structure diagram of the C&W model

Over the whole corpus, the objective function that the C&W model optimizes is:


Figure 2: C&W model objective function

Here (w, c) is an n-gram extracted from the corpus; n is chosen odd so that the target word has the same number of context words on each side. w is the target word, c is its context, and w′ is a word drawn at random from the vocabulary. The C&W model optimizes the objective over pairs of samples: as the expression in Figure 2 shows, it requires the positive sample to score at least 1 point higher than the negative sample. (w, c) is the positive sample, taken from the corpus; (w′, c) is the negative sample, obtained by replacing the middle word of the positive sequence with a randomly chosen word. The resulting sequence almost never conforms to grammatical usage, so this way of constructing negative samples is reasonable. Moreover, since only the middle word changes, the surrounding context stays the same, so the corruption does not distort the classification signal.
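The image for Figure 2 is not available here; the standard pairwise ranking (hinge) objective, reconstructed from the description above (treat the exact notation as an assumption), is:

```latex
\sum_{(w,c)\in D}\;\sum_{w'\in V}
  \max\bigl(0,\; 1 - \mathrm{score}(w,c) + \mathrm{score}(w',c)\bigr)
```

where D is the set of n-grams in the corpus, V is the vocabulary, and score(·) is the model's output. Minimizing this loss forces each positive sample (w, c) to score at least 1 higher than its corrupted version (w′, c); once that margin is reached, the pair contributes zero loss.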

Unlike NNLM, whose output layer predicts the target word over the whole vocabulary, the C&W model moves the target word into the input layer and reduces the output layer to a single node, whose value is the score of the n-gram. The last layer of the C&W model therefore needs only |h| operations, far fewer than NNLM's |V| × |h|. In terms of weight computation, the C&W model greatly reduces the cost compared with NNLM.

3. The CBOW model and the Skip-gram model

To further speed up the learning of word vectors, the CBOW (continuous bag-of-words) model and the Skip-gram model were developed on the basis of NNLM and the C&W model.

The CBOW model takes the middle word of a text window as the target word and removes the hidden layer from the structure, which greatly improves running speed by saving a large amount of weight-matrix computation. In addition, CBOW replaces NNLM's concatenation of the context word vectors with their average. Because the hidden layer is removed, this averaged vector, the representation of the semantic context, is effectively the model's input layer.
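The averaging step can be sketched as follows. This is a minimal illustration with a toy embedding matrix; the vocabulary, matrix values, and function name are all illustrative, not from any library.

```python
import numpy as np

# Toy embedding matrix: one row per vocabulary word (values illustrative).
vocab = ["the", "cat", "sat", "on", "mat"]
emb = np.arange(len(vocab) * 3, dtype=float).reshape(len(vocab), 3)

def cbow_input(context_words):
    # CBOW replaces NNLM's concatenation with the average of the
    # context word vectors; with no hidden layer, this average is
    # passed directly to the output layer.
    idx = [vocab.index(w) for w in context_words]
    return emb[idx].mean(axis=0)

h = cbow_input(["the", "cat", "on", "mat"])  # context of target "sat"
print(h.shape)  # (3,)
```

Note that averaging keeps the input dimension fixed at the embedding size regardless of window width, whereas NNLM's concatenation grows linearly with the number of context words.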


Figure 3: structure diagram of the CBOW model

The CBOW model computes the conditional probability of the target word as:


Figure 4: CBOW probability calculation expression
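The image for Figure 4 is not available here; the usual softmax form, consistent with the description above (the notation is my reconstruction), is:

```latex
P(w \mid c) \;=\;
  \frac{\exp\bigl(e'(w)^{\top}\,\bar{e}(c)\bigr)}
       {\sum_{w_i \in V} \exp\bigl(e'(w_i)^{\top}\,\bar{e}(c)\bigr)}
```

where c is the context, \(\bar{e}(c)\) is the average of the context word vectors, \(e'(w)\) is the output-side vector of word w, and V is the vocabulary.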

Expressed in the general form of a neural network, the CBOW model is:


Figure 5: general form of the CBOW model

Like NNLM, the CBOW model's objective function is a log-likelihood to be maximized:


Figure 6: CBOW model objective function
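The image for Figure 6 is not available here; reconstructed from the description (notation is an assumption on my part), the maximized objective over the corpus is:

```latex
\mathcal{L} \;=\; \sum_{(w,c)\in D} \log P(w \mid c)
```

where D is the set of (target word, context) pairs extracted from the corpus and \(P(w \mid c)\) is the conditional probability of the target word given its averaged context.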

The Skip-gram model likewise has no hidden layer. Unlike CBOW, it does not feed in the averaged context word vectors: instead it selects one word at a time from the context of the target word w and uses that word's vector as the representation of the context.


Figure 7: structure diagram of the Skip-gram model

Expressed in the general form of a neural network, the Skip-gram model is:


Figure 8: general form of the Skip-gram model

Over the whole corpus, the objective function of the Skip-gram model is:


Figure 9: Skip-gram model objective function
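The image for Figure 9 is not available here; reconstructed from the description (notation is an assumption on my part), the objective sums the log-probability of each context word given the target:

```latex
\mathcal{L} \;=\; \sum_{(w,c)\in D} \;\sum_{w_j \in c} \log P(w_j \mid w)
```

where D is again the set of (target word, context) pairs and the inner sum runs over the individual words of the context c. The direction of prediction is the mirror image of CBOW's.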

Skip-gram and CBOW are in fact the two training strategies implemented by word2vec. The CBOW strategy predicts the probability of the current word from its context, and every context word carries the same weight in that prediction, hence the name continuous bag of words: like drawing words from a bag, you take out enough of them without caring about the order. The Skip-gram strategy is the opposite: it takes the current word as input and predicts the probability of its context.
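The two strategies can be contrasted by the training pairs each one generates from the same sliding window. A minimal sketch (the function name and sample sentence are my own):

```python
def training_pairs(tokens, window=2):
    # From one sliding window, CBOW yields (context -> target) pairs,
    # while Skip-gram yields (target -> one context word) pairs.
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))
        for c in context:
            skipgram.append((target, c))
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat", "on", "mat"], window=1)
print(cbow[1])  # (['the', 'sat'], 'cat')
print(sg[:2])   # [('the', 'cat'), ('cat', 'the')]
```

Notice that Skip-gram produces more training pairs from the same text (one per target-context word combination), which is one reason it is slower to train but often better for rare words.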


Both models fall under word2vec. In NLP terms, if x is a word in a sentence and y is its context, we need a language model f(x, y) whose job is to judge whether the sample (x, y) conforms to natural language. Word2vec does not care how accurate this language model ends up being; it cares about the parameters learned during training, which are taken as the vector representation of x. That is how the word vectors arise, and the CBOW and Skip-gram models above are both derived from this idea.