NLP tutorial (1) – word vector, SVD decomposition and Word2Vec



Author: Han Xinzi @ ShowMeAI
Disclaimer: All rights reserved. To reprint, please contact the platform and the author, and cite the source.


This series is a full set of study notes for Stanford CS224n, “Natural Language Processing with Deep Learning”. Videos of the corresponding course are available online.

ShowMeAI has produced Chinese translations and annotations for all the CS224n courseware and turned them into GIF animations. The annotated courseware for “Lecture 1 – Introduction to NLP and Preliminary Word Vectors” can be studied alongside these notes. See the end of the article for more information.


CS224n is a specialized course on deep learning and natural language processing offered by Stanford University. Its core content covers RNN, LSTM, CNN, Transformer, BERT, question answering, summarization, text generation, language models, reading comprehension, and other cutting-edge topics.

These notes correspond to the first knowledge section of Stanford CS224n: NLP and word vectors. They first introduce the concept of natural language processing (NLP) and the problems it faces, and then introduce word vectors and how to construct them (including dimensionality reduction of co-occurrence matrices and Word2Vec).

content points

  • Natural Language Processing (NLP)
  • Word Vectors
  • SVD Matrix Decomposition
  • Skip-gram
  • negative sampling
  • transformer
  • CBOW
  • Hierarchical softmax
  • Word2Vec

1. Introduction to Natural Language Processing

1.1 What is special about natural language processing

What is special about human language? Human language is a system built specifically to convey meaning, and speech and writing are high-level abstract representations of it. This makes NLP quite different from computer vision or other machine learning tasks.

Most words are just symbols for an extra-linguistic entity: a word is a signifier that maps to a signified (an idea or thing). For example, the word “rocket” refers to the concept of a rocket and, by extension, can designate a particular instance of a rocket. There are some exceptions, where words and letters are used for expressive signaling, as in “whoompaa”.

Most importantly, the symbols of a language can be encoded in several modalities (sounds, gestures, writing, etc.) and are transmitted to the brain as continuous signals; the brain itself seems to encode things in a continuous manner. A great deal of work in the philosophy of language and in linguistics has gone into conceptualizing human language and distinguishing words from their references, meanings, etc.

❐ Natural language is a discrete / symbolic / categorical system.

1.2 Natural Language Processing Tasks

Natural language processing tasks come at different levels, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to design algorithms that allow computers to “understand” language in order to perform some task. Tasks vary in difficulty:

1) Simple tasks

  • Spell Checking
  • Keyword Search
  • Finding Synonyms

2) Intermediate tasks

  • Parse information from websites, documents, etc.

3) Complex tasks

  • Machine Translation
  • Semantic Analysis
  • Coreference Resolution
  • Question Answering

1.3 How to represent vocabulary

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any model. Much early NLP work, which we do not cover here, treated words as atomic symbols.

To perform well on most NLP tasks, we first need some notion of the similarities and differences between words. With word vectors, we can easily encode this information in the vectors themselves.

(For this section, see also ShowMeAI's summary of Andrew Ng's course: Deep Learning Tutorial | Natural Language Processing and Word Embeddings.)

2. Word vector

We encode words as word vectors. There may be some \( N \)-dimensional space that is sufficient to encode all the semantics of the language, where each dimension encodes some information that we convey using language.

Simple one-hot vectors cannot express any similarity between words. We therefore need to reduce the \( \left | V \right | \)-dimensional space to a low-dimensional subspace, giving dense word vectors that capture the relationships between words.
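As a minimal sketch of why one-hot vectors carry no similarity information, consider a hypothetical three-word vocabulary (the words and the helper names `one_hot` and `dot` are illustrative assumptions, not part of the course material). Any two distinct one-hot vectors are orthogonal, so their dot product is always 0, no matter how related the words are.

```python
# Toy vocabulary (illustrative assumption)
vocab = ["hotel", "motel", "banana"]

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for `word`."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

hotel, motel = one_hot("hotel"), one_hot("motel")
sim_same = dot(hotel, hotel)   # a word compared with itself
sim_diff = dot(hotel, motel)   # two related but distinct words
print(sim_same, sim_diff)
```

Dense, low-dimensional vectors avoid this problem because related words can point in similar directions.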

3. Word vector based on SVD dimensionality reduction

(For dimensionality reduction algorithms, see the ShowMeAI machine learning tutorial article Graphical Machine Learning | Detailed explanation of dimensionality reduction algorithm; for applying dimensionality reduction in Python, see the ShowMeAI article Machine Learning in Action | The most comprehensive interpretation of machine learning feature engineering.)

Building on a word co-occurrence matrix and its SVD is one way to construct word embeddings (that is, word vectors).

  • We first loop over a large dataset and accumulate the word co-occurrence counts in a matrix \( X \)
  • We then perform SVD on \( X \) to obtain \( USV^T \)
  • We then use the rows of \( U \) as the word vectors of all words in the dictionary

Next we discuss several options for the matrix \( X \) .

3.1 Word-Document Matrix

The first approach is based on the word-document co-occurrence matrix. We conjecture that related words often appear in the same documents:

  • For example, “banks”, “bonds”, “stocks”, “money”, etc. are likely to appear together
  • However, “banks”, “octopus”, “banana”, “hockey” are unlikely to appear together consistently

Based on this observation, we build a word-document matrix \( X \) as follows: loop over hundreds of millions of documents and, each time word \( i \) appears in document \( j \), add one to entry \( X_{ij} \).

This is obviously a large matrix \( \mathbb{R}^{ \left | V \right | \times M} \), and its size is proportional to the number of documents \( M \). So we can try something better.

3.2 Word co-occurrence matrix based on sliding window

Computing statistics over whole documents is very costly. Instead, we can restrict counting to a text window: we count the number of times each word appears inside a window of a fixed size around the word of interest, and collect these counts in the co-occurrence matrix \( X \).

Here is a simple example. We build the co-occurrence matrix of the following text using a sliding window that covers one word on each side (window size 1).

  • I enjoy flying.
  • I like NLP.
  • I like deep learning.
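The window-based counting for these three sentences can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and treating the final period as its own token; the variable names are hypothetical.

```python
corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
window = 1  # one word on each side of the center word

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# X[i][j] counts how often word j appears within the window around word i
X = [[0] * len(vocab) for _ in vocab]
for sent in tokens:
    for pos, word in enumerate(sent):
        lo = max(0, pos - window)
        hi = min(len(sent), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[index[word]][index[sent[ctx_pos]]] += 1

print(vocab)
for w, row in zip(vocab, X):
    print(f"{w:>9}", row)
```

Because the window is symmetric, the resulting matrix is symmetric as well: the count for ("I", "like") equals the count for ("like", "I").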


Using word co-occurrence matrix

  • Generate a co-occurrence matrix \( X \) of dimension \( \left | V \right |\times \left | V \right | \)
  • Apply SVD to \( X \) to get \( X = USV^T \)
  • Select the first \( k \) columns of \( U \) to get \( k \)-dimensional word vectors
  • \( \frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{ \left | V \right |} \sigma_i} \) indicates the amount of variance captured by the first \( k \) dimensions

3.3 Using SVD to reduce the dimensionality of the co-occurrence matrix

  • We apply SVD to the matrix \(X\), inspect the singular values (the diagonal entries of the matrix \(S\)), and truncate according to the desired percentage of variance captured, keeping the first \(k\) singular values:

\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{ \left | V \right |} \sigma_i}

  • Then take the sub-matrix \( U_{1: \left | V \right |, 1:k} \) as the word embedding matrix. This gives a \( k \) dimensional representation of each word in the vocabulary.
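The truncation step can be sketched with NumPy's `np.linalg.svd`. The small matrix below is a made-up toy example (an assumption for illustration), and `k = 2` is chosen arbitrarily.

```python
import numpy as np

# Small symmetric co-occurrence matrix (toy values chosen for illustration)
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# X = U @ diag(s) @ Vt, with singular values s sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
explained = s[:k].sum() / s.sum()   # ratio of the first k singular values to all of them
word_vectors = U[:, :k]             # row i is the k-dimensional vector of word i

print("singular values:", s)
print("fraction kept by the first k:", explained)
print(word_vectors)
```

Each row of `word_vectors` is the embedding of one word, which matches taking the sub-matrix \( U_{1: \left | V \right |, 1:k} \).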

Figure: applying SVD to the matrix \( X \)

Figure: dimensionality reduction by selecting the first \( k \) singular vectors

Both of the methods above give us word vectors that are more than sufficient to encode semantic and syntactic (part-of-speech) information, but they also bring some problems:

  • The dimensions of the matrix change very often (new words are added frequently and the corpus changes in size)
  • The matrix is extremely sparse, since most words do not co-occur
  • The matrix is generally very high-dimensional ( \( \approx 10^6 \times 10^6 \) )
  • Some tricks must be applied to \( X \) to account for the drastic imbalance in word frequency

❐ SVD-based methods have a high computational cost (for an \( m \times n \) matrix, the cost is \( O(mn^2) \) ), and it is hard to incorporate new words or documents.

Some solutions exist for the problems discussed above:

  • Ignore function words such as “the”, “he”, “has”, etc.
  • Apply a ramp window, i.e. weight the co-occurrence counts according to the distance between the words in the document
  • Use the Pearson correlation coefficient instead of raw counts, and set negative values to \(0\)

❐ Count-based methods make efficient use of the statistics, but the iteration-based approach described next can build word embeddings on large corpora efficiently while keeping the complexity under control.

4. Iterative optimization algorithm – Word2Vec

(For this section, see also ShowMeAI's summary of Andrew Ng's course: Deep Learning Tutorial | Natural Language Processing and Word Embeddings.)

Word2Vec is an iteration-based model: rather than computing and storing global information about some huge dataset (which might be billions of sentences), it learns from the text one iteration at a time and eventually encodes the probability of a word given its context in the word vectors.

The idea is to design a model whose parameters are the word vectors. We then train the model on a certain objective: at every iteration we run the model, evaluate the error, and adjust the parameters (the word vectors) with an optimization algorithm so as to reduce the loss. In the end we obtain the learned word vectors.

In neural networks, the corresponding idea of propagating the error back to update the parameters is known as “backpropagation”. The simpler the model and the task, the faster it is to train.

❐ Iteration-based methods capture co-occurrences of words one at a time, instead of capturing all co-occurrence counts directly as SVD methods do.

Many researchers have tested different methods along this line of thought. The model designed by [Collobert et al., 2011] first converts each word into a vector. For each specific task (named entity recognition, part-of-speech tagging, etc.), they not only trained the parameters of the model, but also trained word vectors, and achieved very good performance while calculating very good word vectors.

A particularly efficient method is Word2Vec. Word2Vec is an open-source software package from Google that includes the following core components:

  • Two algorithms: continuous bag-of-words (CBOW) and skip-gram

    • CBOW predicts the center word from the word vectors of the surrounding context words
    • Skip-gram does the opposite: it predicts the probability distribution of the surrounding context words from the center word
  • Two training methods: negative sampling and hierarchical softmax

    • Negative sampling defines the objective by sampling negative examples
    • Hierarchical softmax defines the objective using an efficient tree structure to compute the probabilities of all words

❐ Word2Vec relies on a very important hypothesis in linguistics, “distributional similarity”: similar words have similar contexts.

4.1 Language Model

Let’s first understand the language model. Start with an example:

I love learning about natural language processing techniques

A good language model would give this sentence a high probability because it is a completely valid sentence, both syntactically and semantically. Similarly, the sentence “natural learning love processing language i technology” should get a very low probability because it makes no sense.

Mathematically, we can say that the probability for a given sequence of \( n \) words is:

P(w_1, w_2, \cdots, w_n)

In the Unigram model approach, we assume that word occurrences are completely independent, thus decomposing the probabilities

P(w_1, w_2, \cdots, w_n)=\prod_{i=1}^{n} P(w_i)

Strictly speaking, the above assumption is unreasonable, because the next word is highly dependent on the previous word sequence. If the above language model is used, it may make a meaningless sentence have a high probability. So we let the probability of the sequence depend on the pairwise probability of the word in the sequence and the word next to it. We call it a bigram model:

P(w_1, w_2, \cdots, w_n)=\prod_{i=2}^{n} P(w_i \mid w_{i-1})

Indeed, it is a bit simple to only care about adjacent words. If you consider the co-occurrence of n consecutive words, you will get n-grams. But even using bigram can bring significant improvement relative to unigram. Considering that in the word-word co-occurrence matrix, the co-occurrence window is \( 1 \), we can basically get such pairwise probabilities. However, this in turn requires computing and storing global information for large datasets.
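The difference between the unigram and bigram factorizations can be illustrated with a small count-based sketch. The toy corpus and the function names `p_unigram` and `p_bigram` are assumptions made for this example; probabilities are simple relative frequencies without smoothing.

```python
from collections import Counter

# Tiny toy corpus (illustrative assumption)
corpus = "i love nlp . i love deep learning . i enjoy nlp .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(sentence):
    """Product of P(w_i): each word is treated as independent."""
    p = 1.0
    for w in sentence:
        p *= unigram_counts[w] / total
    return p

def p_bigram(sentence):
    """Product of P(w_i | w_{i-1}) for i >= 2, estimated from counts."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

good = "i love nlp".split()
bad = "nlp love i".split()
print(p_unigram(good), p_unigram(bad))  # same value: word order is ignored
print(p_bigram(good), p_bigram(bad))    # the shuffled sentence scores 0 here
```

The unigram model cannot tell the valid sentence from its shuffled version, while the bigram model assigns the shuffled version zero probability because the pair ("nlp", "love") never occurs in the toy corpus.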

Now that we understand how to consider sequences of words with probabilities, let’s look at some example models that are able to learn these probabilities.

4.2 CBOW Continuous Bag of Words Model

This approach treats {"I", "Love", "Learning", "NLP", "Technology"} as a context and aims to predict or generate the center word “study” from these words. We call this type of model a continuous bag-of-words (CBOW) model.

❐ CBOW predicts the center word from its context. For each word, we want to learn two vectors:

  • \( v \) (input vector, used when the word appears in the context)
  • \( u \) (output vector, used when the word is the center word)

The input to the model is a set of one-hot word vectors. We denote the input one-hot vectors (the context) by \( x^{(c)} \) and the output by \( y^{(c)} \). Since the CBOW model has only one output, we simply call it \( y \): the one-hot vector of the known center word.

Next we define the unknown parameters of the model

We create two matrices, \( \mathcal{V}\in \mathbb{R}^{n\times \left | V \right |} \) and \( \mathcal{U}\in \mathbb{R}^{ \left | V \right |\times n} \), where:

  • \( n \) is an arbitrary size that defines the dimension of the embedding space
  • \( \mathcal{V} \) is the input word matrix: the \( i \)-th column of \( \mathcal{V} \) is the \( n \)-dimensional embedding vector of word \( w_i \) when it is an input to the model. We denote this \( n \times 1 \) vector by \( v_i \)
  • Similarly, \( \mathcal{U} \) is the output word matrix: the \( j \)-th row of \( \mathcal{U} \) is the \( n \)-dimensional embedding vector of word \( w_{j} \) when it is an output of the model. We denote this row of \( \mathcal{U} \) by \( u_j \)
  • Note that for each word \( w_i \) we therefore learn two vectors (the input word vector \( v_i \) and the output word vector \( u_i \) ).

❐ We use the following notation for the CBOW model:

  • \( w_i \) : word \( i \) in the vocabulary \( V \)
  • \( \mathcal{V}\in \mathbb{R}^{n\times \left | V \right |} \) : input word matrix
  • \( v_i \) : column \( i \) of \( \mathcal{V} \), the input vector representation of word \( w_i \)
  • \( \mathcal{U}\in \mathbb{R}^{ \left | V \right |\times n} \) : output word matrix
  • \( u_i \) : row \( i \) of \( \mathcal{U} \), the output vector representation of word \( w_i \)

We decompose this model into the following steps

  • ① We generate the one-hot word vectors of the input context of size \( m \): \( (x^{(c-m)}, \cdots ,x^{(c-1)},x^{(c+1)}, \cdots ,x^{(c+m)}\in \mathbb{R}^{ \left | V \right |}) \)
  • ② We compute the embedded word vectors of the context from these one-hot inputs: \( (v_{c-m}=\mathcal{V}x^{(c-m)},v_{c-m+1}=\mathcal{V}x^{(c-m+1)},\cdots ,v_{c+m}=\mathcal{V}x^{(c+m)}\in \mathbb{R}^{n}) \)
  • ③ We average these vectors: \( \widehat{v}=\frac{v_{c-m}+v_{c-m+1}+ \cdots +v_{c+m}}{2m}\in \mathbb{R}^{n} \)
  • ④ We compute the score vector \( z = \mathcal{U}\widehat{v}\in \mathbb{R}^{ \left | V \right |} \). Because the dot product of similar vectors is larger, training pushes similar words close to each other in order to achieve a high score.
  • ⑤ We turn the scores into probabilities with softmax: \( \widehat{y}=\operatorname{softmax}(z)\in \mathbb{R}^{ \left | V \right |} \)
  • ⑥ We want the generated probabilities \( \widehat{y} \in \mathbb{R}^{ \left | V \right |} \) to match the true probabilities \( y \in \mathbb{R}^{ \left | V \right |} \) (in fact the one-hot vector of the actual word) as closely as possible; for this we later build a cross-entropy loss function and optimize it iteratively.

Here softmax is a commonly used function. It maps the score vector \( z \) to another vector whose \( i \)-th element is \( \frac{e^{z_i}}{\sum_{k=1}^{ \left | V \right |}e^{z_k}} \).

Because the exponential function is always positive, every element of the result is positive.

Dividing by \( \sum_{k=1}^{ \left | V \right |}e^{z_k} \) normalizes the vector (so that \( \sum_{k=1}^{ \left | V \right |}\widehat{y}_k=1 \) ), which yields a probability distribution.
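The forward pass described in the steps above can be sketched in NumPy. The sizes, random initialization, and chosen context indices are toy assumptions; selecting columns of the input matrix stands in for the multiplication by one-hot vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 4                    # |V| and embedding size (toy values)

V = rng.normal(size=(n, vocab_size))    # input word matrix, column i is v_i
U = rng.normal(size=(vocab_size, n))    # output word matrix, row j is u_j

def softmax(z):
    """Map scores to probabilities; subtracting the max avoids overflow."""
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids = [0, 1, 3, 4]              # indices of the 2m context words (m = 2)

# Steps 1-2: multiplying V by a one-hot vector just selects a column
v_context = V[:, context_ids]
# Step 3: average the context vectors
v_hat = v_context.mean(axis=1)
# Steps 4-5: scores and probabilities over the whole vocabulary
z = U @ v_hat
y_hat = softmax(z)

print(y_hat, y_hat.sum())
```

The output `y_hat` has one entry per vocabulary word, and the entry at the index of the true center word is the probability that the loss function will later try to push toward 1.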

Figure: computation diagram of the CBOW model

Given \( \mathcal{V} \) and \( \mathcal{U} \), we know how the model works, but how do we update the parameters and learn these two matrices? As in all machine learning tasks, we build an objective function. Here we use the cross-entropy \( H(\widehat{y}, y) \), a measure from information theory of the distance between two probability distributions, to construct the loss function.

(For cross-entropy, see the article Graphical AI Mathematical Foundation | information theory in ShowMeAI's Graphical AI Math Basic Course.)

In the discrete case, the loss function can be formulated intuitively using cross-entropy

H(\widehat{y}, y)=-\sum_{j=1}^{ \left | V \right |} y_{j} \log (\widehat{y}_{j})

In the above formula, \( y \) is a one-hot vector, so the loss function simplifies to:

H(\widehat{y}, y)= - y_{c} \log(\widehat{y}_{c})

Here \( c \) is the index at which the one-hot vector of the correct word is 1. If our prediction is perfect, \( \widehat{y}_{c}=1 \), we get \( H(\widehat{y}, y)=-1 \log(1)=0 \). A perfect prediction therefore incurs no penalty or loss.

Now consider the opposite case, where the prediction is very poor and the probability assigned to the correct word is \( \widehat{y}_{c}=0.01 \). A similar calculation gives a loss of \( H(\widehat{y}, y)=-1 \log(0.01)\approx 4.605 \) (using the natural logarithm), which is comparatively large and reflects how far the prediction is from the correct answer.
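The two cases can be checked numerically with a short sketch. The function name `cross_entropy` and the three-word probability vectors are assumptions made for illustration; the natural logarithm is used.

```python
import math

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j), skipping terms where y_j = 0."""
    return sum(-t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

y = [0, 1, 0]                                   # one-hot target, correct index c = 1
perfect = cross_entropy([0.0, 1.0, 0.0], y)     # prediction puts all mass on c
poor = cross_entropy([0.5, 0.01, 0.49], y)      # prediction gives c only 0.01

print(perfect, round(poor, 3))
```

Only the entry at the correct index contributes, so the loss depends solely on the probability the model assigns to the true word.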

As can be seen from the above example, for probability distributions, cross-entropy provides us with a good measure of distance. Therefore, our optimization objective function formula is:

\begin{aligned}
\text{minimize } J &=-\log P(w_{c} \mid w_{c-m}, \cdots, w_{c-1}, w_{c+1}, \cdots, w_{c+m}) \\
&=-\log P(u_{c} \mid \widehat{v}) \\
&=-\log \frac{\exp (u_{c}^{T} \widehat{v})}{\sum_{j=1}^{|V|} \exp (u_{j}^{T} \widehat{v})} \\
&=-u_{c}^{T} \widehat{v}+\log \sum_{j=1}^{|V|} \exp (u_{j}^{T} \widehat{v})
\end{aligned}

We use SGD (stochastic gradient descent) to update all relevant word vectors \( u_{c} \) and \( v_j \).
(For optimization algorithms such as SGD, see ShowMeAI's summary of Andrew Ng's course: Deep Learning Tutorial | Neural Network Optimization Algorithm.)

❐ The map \( \widehat{y} \mapsto H(\widehat{y}, y) \) attains its minimum when \( \widehat{y} = y \). If we find a \( \widehat{y} \) such that \( H(\widehat{y}, y) \) is close to the minimum, then \( \widehat{y} \approx y \). This means our model is very good at predicting the center word from the context!

❐ To learn the vectors (the matrices \( \mathcal{U} \) and \( \mathcal{V} \) ), CBOW defines a loss function that measures how well it predicts the center word. We then optimize this loss by updating the matrices \( \mathcal{U} \) and \( \mathcal{V} \) with stochastic gradient descent.

❐ SGD calculates gradients and updates parameters for a window:

\( \mathcal{U}_{new} \leftarrow \mathcal{U}_{old} -\alpha \nabla_{\mathcal{U}} J \)

\( \mathcal{V}_{new} \leftarrow \mathcal{V}_{old}-\alpha \nabla_{\mathcal{V}} J \)
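A minimal sketch of these updates for a single CBOW window is given below. It assumes the softmax loss \( J \) derived above, for which the gradient with respect to the scores is \( \widehat{y} - y \); the sizes, learning rate, and word indices are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 4
V = rng.normal(scale=0.1, size=(n, vocab_size))   # input word matrix
U = rng.normal(scale=0.1, size=(vocab_size, n))   # output word matrix
context_ids, center_id, alpha = [0, 1, 3, 4], 2, 0.1

def loss_and_grads(U, V):
    """Return J and its gradients with respect to U and V for one window."""
    v_hat = V[:, context_ids].mean(axis=1)
    z = U @ v_hat
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()
    J = -np.log(y_hat[center_id])
    delta = y_hat.copy()
    delta[center_id] -= 1.0                        # dJ/dz = y_hat - y
    grad_U = np.outer(delta, v_hat)                # dJ/dU
    grad_V = np.zeros_like(V)
    # each context column receives an equal share of dJ/dv_hat
    grad_V[:, context_ids] = (U.T @ delta)[:, None] / len(context_ids)
    return J, grad_U, grad_V

losses = []
for _ in range(50):
    J, grad_U, grad_V = loss_and_grads(U, V)
    losses.append(J)
    U = U - alpha * grad_U   # U_new <- U_old - alpha * grad_U J
    V = V - alpha * grad_V   # V_new <- V_old - alpha * grad_V J

print(losses[0], losses[-1])
```

Repeating the update on the same window is only for demonstration; in training, each step would use a different window drawn from the corpus.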

4.3 Skip-Gram model

The Skip-Gram model is largely the same as CBOW, except that the input and output are swapped: what was \( x \) in CBOW is now \( y \), and what was \( y \) is now \( x \). We denote the input one-hot vector (the center word) by \( x \) and the output vectors by \( y^{(j)} \). The matrices \( \mathcal{V} \) and \( \mathcal{U} \) are defined as in CBOW.

Skip-Gram model: predict the surrounding context words given a center word.

Let us break the Skip-Gram model down in detail. First we define some notation:

  • \( w_i \) : word \( i \) in the vocabulary \( V \)
  • \( \mathcal{V}\in \mathbb{R}^{n\times \left | V \right |} \) : input word matrix
  • \( v_i \) : Column \( i \) of \( \mathcal{V} \), input vector representation of word \( w_i \)
  • \( \mathcal{U}\in \mathbb{R}^{ \left | V \right |\times n} \) : output word matrix
  • \( u_i \) : row \( i \) of \( \mathcal{U} \), the output vector representation of word \( w_i \)

The working method of Skip-Gram can be disassembled into the following steps

  • ① We generate the one-hot vector of the center word \( x\in \mathbb{R}^{ \left | V \right |} \)
  • ② We compute the embedded word vector of the center word \( v_{c}=\mathcal{V}x\in \mathbb{R}^{n} \)
  • ③ We generate the score vector \( z = \mathcal{U}v_{c} \)
  • ④ We turn the score vector into probabilities, \( \widehat{y}=\operatorname{softmax}(z) \). Note that \( \widehat{y}_{c-m},\cdots,\widehat{y}_{c-1}, \widehat{y}_{c+1},\cdots,\widehat{y}_{c+m} \) are the probabilities of observing each context word
  • ⑤ We want the generated probability vector to match the true probabilities \( y^{(c-m)}, \cdots ,y^{(c-1)},y^{(c+1)}, \cdots ,y^{(c+m)} \), the one-hot vectors of the actual outputs

As with CBOW, we need an objective function to evaluate the model. A key difference is that we invoke a naive Bayes assumption to factor the probabilities. This is a strong (naive) conditional independence assumption: given the center word, all output words are completely independent (this is the step from line 1 to line 2 of the derivation below).

\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m}, \cdots, w_{c-1}, w_{c+1}, \cdots, w_{c+m} \mid w_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} P(w_{c-m+j} \mid w_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} P(u_{c-m+j} \mid v_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} \frac{\exp (u_{c-m+j}^{T} v_{c})}{\sum_{k=1}^{ \left | V \right |} \exp (u_{k}^{T} v_{c})} \\
&=-\sum_{j=0, j \neq m}^{2 m} u_{c-m+j}^{T} v_{c}+2 m \log \sum_{k=1}^{ \left | V \right |} \exp (u_{k}^{T} v_{c})
\end{aligned}

From this objective function (loss function), we can compute the gradients associated with the unknown parameters and update them via SGD at each iteration.


J =-\sum_{j=0, j \neq m}^{2 m} \log P(u_{c-m+j} \mid v_{c}) =\sum_{j=0, j \neq m}^{2 m} H(\widehat{y}, y_{c-m+j})

  • where \( H(\widehat{y},y_{c-m+j}) \) is the cross-entropy between the probability vector \( \widehat{y} \) and the one-hot vector \( y_{c-m+j} \).

❐ Only one probability vector \( \widehat{y} \) is computed. Skip-Gram treats each context word equally: the model calculates the probability of each word appearing in the context, independent of its distance from the center word.
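As a sanity check on the derivation, the following sketch evaluates the Skip-Gram loss in two ways: as a sum of cross-entropies against each context word, and with the closed form from the last line of the derivation. The matrices are random toy values and the indices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, n = 6, 4
V = rng.normal(size=(n, vocab_size))   # input word matrix, column i is v_i
U = rng.normal(size=(vocab_size, n))   # output word matrix, row j is u_j

center_id = 2
context_ids = [0, 1, 3, 4]             # the 2m observed context words (m = 2)

v_c = V[:, center_id]                  # embedding of the center word
z = U @ v_c                            # one score per vocabulary word
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()                   # a single softmax shared by all context positions

# Form 1: sum of cross-entropies against each one-hot context target
J_ce = -sum(np.log(y_hat[j]) for j in context_ids)

# Form 2: closed form, -sum u_j^T v_c + 2m * log sum_k exp(u_k^T v_c)
J_closed = -sum(U[j] @ v_c for j in context_ids) \
           + len(context_ids) * np.log(np.exp(z).sum())

print(J_ce, J_closed)
```

Both expressions give the same value, which reflects that only one probability vector is computed and reused for every context position.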

Figure: computation diagram of the Skip-Gram model

4.4 Negative sampling

Let us return to the objective function we need to optimize. With a large vocabulary, the summation over \( \left | V \right | \) is computationally huge: any update or evaluation of the objective takes \( O( \left | V \right |) \) time. A simple idea is to approximate this sum instead of computing it exactly.

❐ The loss function \( J \) of CBOW and Skip-Gram is expensive to compute because the softmax normalization sums over all \( \left | V \right | \) scores!

At every training step, instead of looping over the entire vocabulary, we sample only a few negative examples. We sample from a noise distribution \( P_n(w) \) whose probabilities match the ordering of the word frequencies.

Mikolov et al. proposed negative sampling in the paper “Distributed Representations of Words and Phrases and their Compositionality”. Although negative sampling is based on the Skip-Gram model, it actually optimizes a different objective.

Consider a pair \( (w,c) \) of a center word and a context word. Did this pair come from the training data? We write \( P(D=1\mid w,c) \) for the probability that \( (w,c) \) came from the corpus, and \( P(D=0\mid w,c) \) for the probability that it did not.

This is a binary classification problem, which we model based on the sigmoid function:

P(D=1 \mid w, c, \theta)=\sigma(v_c^{T} v_w)=\frac{1}{1+e^{(-v_c^{T} v_w)}}

❐ The sigmoid function is a binary version of softmax, which can be used to build a probability model: \( \sigma(x)=\frac{1}{1+e^{-x}} \)


We now build a new objective function that tries to maximize the probability \( P(D=1\mid w,c) \) when the center word and context word really are in the corpus, and to maximize the probability \( P(D=0\mid w,c) \) when they really are not.

(This is the logistic regression approach to binary classification. For related content, see the ShowMeAI machine learning tutorial article Graphical Machine Learning | Detailed explanation of logistic regression algorithm.)

We take a simple maximum likelihood approach to these two probabilities (here \( \theta \) denotes the parameters of the model, in our case \( \mathcal{V} \) and \( \mathcal{U} \) ):

\begin{aligned}
\theta &=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \widetilde{D}} P(D=0 \mid w, c, \theta) \\
&=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \widetilde{D}}(1-P(D=1 \mid w, c, \theta)) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log P(D=1 \mid w, c, \theta)+\sum_{(w, c) \in \widetilde{D}} \log (1-P(D=1 \mid w, c, \theta)) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}+\sum_{(w, c) \in \widetilde{D}} \log (1-\frac{1}{1+\exp (-u_{w}^{T} v_{c})}) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}+\sum_{(w, c) \in \widetilde{D}} \log (\frac{1}{1+\exp (u_{w}^{T} v_{c})})
\end{aligned}

Here maximizing the likelihood function is equivalent to minimizing the negative log-likelihood:

J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}-\sum_{(w, c) \in \widetilde{D}} \log (\frac{1}{1+\exp (u_{w}^{T} v_{c})})

Note that \( \widetilde{D} \) is a “false” or “negative” corpus, made of sentences such as “natural learning love processing language i technology”. Such nonsensical sentences should have a very low probability of ever occurring. We can generate \( \widetilde{D} \) on the fly by randomly sampling words from the corpus.

For the Skip-Gram model, our new objective function for the context word \( c-m+j \) observed given the center word \( c \) is

-\log \sigma(u_{c-m+j}^{T} \cdot v_{c})-\sum_{k=1}^{K} \log \sigma(-\tilde{u}_{k}^{T} \cdot v_{c})

For the CBOW model, our new objective function for observing the center word \( u_{c} \) given the context vector \( \widehat{v}=\frac{v_{c-m}+v_{c-m+1}+ \cdots +v_{c+m}}{2m} \) is:

-\log \sigma(u_{c}^{T}\cdot \widehat{v})-\sum_{k=1}^{K}\log \sigma(-\widetilde{u}_{k}^{T}\cdot \widehat{v})

In the formulas above, \( \{\widetilde{u}_{k}\mid k=1, \cdots ,K\} \) are the words sampled from \( P_{n}(w) \). Negative samples could be drawn uniformly at random, but the authors of the paper propose a better distribution:

p(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum^m_{j=0}f(w_j)^{\frac{3}{4}}}

In the formula, \( f(w_i) \) represents the frequency of words \( w_i \) appearing in the corpus. The above formula is smoother and can increase the possibility of selecting low-frequency words.
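The following sketch shows the effect of the 3/4 power and evaluates the Skip-Gram negative-sampling loss for one (center, context) pair. The frequencies, sizes, and indices are toy assumptions. As a simplification, a sampled negative may coincide with the true context word; practical implementations usually handle that case separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word frequencies f(w_i) (illustrative assumption)
freq = np.array([100.0, 50.0, 10.0, 5.0, 1.0])
p_unigram = freq / freq.sum()
p_noise = freq ** 0.75 / (freq ** 0.75).sum()   # P_n(w) with the 3/4 power

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, K = 4, 3
V = rng.normal(scale=0.1, size=(n, len(freq)))  # input vectors (columns)
U = rng.normal(scale=0.1, size=(len(freq), n))  # output vectors (rows)

center_id, context_id = 0, 1
neg_ids = rng.choice(len(freq), size=K, p=p_noise)  # K negative samples from P_n(w)

v_c = V[:, center_id]
# -log sigma(u_o^T v_c) - sum_k log sigma(-u_k^T v_c)
loss = -np.log(sigmoid(U[context_id] @ v_c)) \
       - np.log(sigmoid(-U[neg_ids] @ v_c)).sum()

print(p_unigram.round(3), p_noise.round(3), loss)
```

Comparing `p_unigram` and `p_noise` shows that the rarest word gains probability mass and the most frequent word loses some, while the loss touches only K + 1 output vectors instead of the whole vocabulary.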

4.5 Hierarchical Softmax

Mikolov et al. also present hierarchical softmax in the paper “Distributed Representations of Words and Phrases and their Compositionality” as a much more efficient alternative to the ordinary softmax. In practice, hierarchical softmax tends to perform better for infrequent words, while negative sampling works better for frequent words and lower-dimensional vectors.

Hierarchical softmax uses a binary tree to represent all the words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from the root to each leaf. In this model there is no output representation for words. Instead, each inner node of the tree (every node that is not a leaf) is associated with a vector that the model learns. The probability of a word being the output word is defined as the probability of a random walk from the root ending at the leaf that corresponds to that word. The computational cost becomes \( O(\log \left | V \right |) \) instead of \( O( \left | V \right |) \).

In this model, the probability \( p(w\mid w_i) \) of a word \( w \) given a word \( w_i \) is equal to the probability that a random walk starting at the root ends at the leaf node corresponding to \( w \). The main advantage of computing the probability this way is that the cost is only \( O(\log \left | V \right |) \), corresponding to the length of the path.

Figure: schematic of the binary tree used by hierarchical softmax

Let us introduce some notation. Let \( L(w) \) be the number of nodes on the path from the root to the leaf \( w \), counting both ends. For example, a leaf \( w_2 \) that is reached from the root by three edges has \( L(w_2)=4 \). We write \( n(w,i) \) for the \( i \)-th node on this path, with associated vector \( v_{n(w,i)} \). So \( n(w,1) \) is the root, \( n(w,L(w)) \) is the leaf \( w \) itself, and \( n(w,L(w)-1) \) is the parent of \( w \). Now, for each inner node \( n \), we arbitrarily choose one of its children and call it \( ch(n) \) (e.g. always the left child). Then we can compute the probability as

p(w \mid w_i)=\prod_{j=1}^{L(w)-1} \sigma([n(w, j+1)=ch(n(w, j))] \cdot v_{n(w, j)}^{T} v_{w_i})


[x]=\left\{\begin{array}{ll}{1} & {\text { if } x \text { is true }} \\ {-1} & {\text { otherwise }}\end{array}\right.

This formula looks very complicated, let’s explain it in detail.

  • First, we compute the terms of the product from the shape of the path (left and right branches) from the root \( (n(w,1)) \) to the leaf \( (w) \). If we assume that \( ch(n) \) is always the left child of \( n \), then the term \( [n(w,j+1)=ch(n(w,j))] \) is 1 when the path goes left and -1 when it goes right.
  • Furthermore, the term \( [n(w,j+1)=ch(n(w,j))] \) provides normalization. At a node \( n \), if we sum the probabilities of going to the left and to the right child, we can check that for any value of \( v_{n}^{T}v_{w_i} \), \( \sigma(v_{n}^{T} v_{w_i})+\sigma(-v_{n}^{T} v_{w_i})=1 \). This normalization also guarantees that \( \sum_{w=1}^{ \left | V \right |}P(w\mid w_i)=1 \), just as in the ordinary softmax.
  • Finally, we compare the similarity of the input vector \( v_{w_i} \) with each inner node vector \( v_{n(w,j)} \) using a dot product. As an example, take \( w_2 \) in the figure above: reaching \( w_2 \) from the root takes two left edges and then one right edge, so

\begin{aligned}
p(w_2 \mid w_i) &=p(n(w_2, 1), \text {left}) \cdot p(n(w_2, 2), \text {left}) \cdot p(n(w_2, 3), \text { right }) \\
&=\sigma(v_{n(w_2, 1)}^{T} v_{w_i}) \cdot \sigma(v_{n(w_2, 2)}^{T} v_{w_i}) \cdot \sigma(-v_{n(w_2, 3)}^{T} v_{w_i})
\end{aligned}

The goal of training is to minimize the negative log-likelihood \( -\log P(w\mid w_i) \). Instead of updating an output vector per word, we update the vectors of the nodes on the path from the root to the leaf in the binary tree.
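The path-product probability can be sketched on a small hypothetical tree. The balanced four-leaf tree, the random vectors, and the names `paths` and `prob` are assumptions for illustration; the sign +1 stands for taking the left child and -1 for the right child, matching the bracket term in the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
node_vecs = rng.normal(size=(3, n))   # one vector per inner node (index 0 is the root)
v_in = rng.normal(size=n)             # input vector v_{w_i}

# Hypothetical balanced tree over 4 words: for each leaf, the inner nodes on
# its path and the branch taken at each (+1 = left child, -1 = right child)
paths = {
    "w0": [(0, +1), (1, +1)],
    "w1": [(0, +1), (1, -1)],
    "w2": [(0, -1), (2, +1)],
    "w3": [(0, -1), (2, -1)],
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob(word):
    """p(word | w_i): product of sigmoid(+/- v_node . v_in) along the path."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * node_vecs[node] @ v_in)
    return p

probs = {w: prob(w) for w in paths}
print(probs, sum(probs.values()))
```

The probabilities over all leaves sum to 1 without any explicit normalization over the vocabulary, and each one needs only as many sigmoid evaluations as the path is long.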

The speed of this method depends on how the binary tree is constructed and how words are assigned to leaf nodes. In the paper “Distributed Representations of Words and Phrases and their Compositionality”, Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree.

5. Build word vectors based on the python gensim tool library

In Python, word vectors are easy to build and use with the Gensim library, which provides application APIs such as most_similar and doesnt_match. We can wrap most_similar to output the result of a three-term analogy; the code is as follows:

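from gensim.models import KeyedVectors

# Placeholder path (assumption): pretrained vectors stored in word2vec text format
word2vec_glove_file = "path/to/vectors.word2vec.txt"
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
def analogy(x1, x2, y1):
    # "x1 is to x2 as y1 is to ?": nearest word to the vector x2 - x1 + y1
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
print(analogy('japan', 'japanese', 'australia'))
# Word that does not belong with the others
print(model.doesnt_match("breakfast cereal dinner lunch".split()))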

6. Extended reading

Word2Vec Tutorial – The Skip-Gram Model

6.1 Task example

We’re going to train a simple neural network with a single hidden layer to perform a certain task, but then we won’t actually use that network for the task we trained it on. Instead, the goal is really just to learn the weights of the hidden layer, which are actually the “word vectors” we’re trying to learn. This technique is also commonly used in unsupervised feature learning: you train an auto-encoder to compress an input vector in the hidden layer and decompress it back to the original in the output layer. After training, you strip off the output layer and just use the hidden layer.

Figure: the process of drawing training samples from the source text

Figure: network architecture diagram

If two different words have very similar “contexts” (that is, the words likely to appear around them are similar), then our model needs to output very similar results for these two words. One way for the network to output similar context predictions for the two words is for their word vectors to be similar. So, if two words have similar contexts, our network is motivated to learn similar word vectors for them!

  • Efficient Estimation of Word Representations in Vector Space (original word2vec paper)
  • Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)

7. References

ShowMeAI Deep Learning and Natural Language Processing Tutorial (Full Version)

ShowMeAI Stanford NLP famous course CS224n with detailed explanation (20 lectures · full version)
