Author: Han Xinzi @ShowMeAI
Tutorial address: https://www.showmeai.tech/tutorials/36
Article address: https://www.showmeai.tech/articledetail/230
Disclaimer: All rights reserved. For reprint permission, please contact the platform and the author, and cite the source.
This series is a full set of study notes for the Stanford CS224n course “Natural Language Processing with Deep Learning”; the corresponding course videos can be viewed here.
ShowMeAI has produced Chinese translations and annotations for all of the CS224n courseware, along with GIF animations! Click here to view the annotated slides for “Lecture 1 – Introduction to NLP and Preliminary Word Vectors”. See the end of the article for more information.
Introduction
CS224n is a professional course on deep learning and natural language processing produced by Stanford. Its core content covers RNNs, LSTMs, CNNs, Transformers, BERT, question answering, summarization, text generation, language models, reading comprehension, and other cutting-edge topics.
This note corresponds to the first knowledge section of the Stanford CS224n Natural Language Processing course: NLP and word vectors. It first introduces the concept of natural language processing (NLP) and the problems it faces, and then introduces word vectors and their construction methods (including dimensionality reduction based on a co-occurrence matrix, and Word2Vec).
content points
- Natural Language Processing (NLP)
- Word vectors
- SVD matrix factorization
- Skip-gram
- Negative sampling
- Transformer
- CBOW
- Hierarchical softmax
- Word2Vec
1. Introduction to Natural Language Processing
1.1 What is special about natural language processing
What is special about human language? Human language is a system dedicated to expressing meaning; language and writing are higher-level abstract representations. This makes NLP very different from computer vision and most other machine learning tasks.
Most words are simply symbols for an extra-linguistic entity: a word is a signifier that maps to a signified (an idea or thing). For example, the word “rocket” refers to the concept of a rocket, and by extension can designate a particular instance of a rocket. There are exceptions, such as when we use words and letters as expressive symbols, as in “whoompaa”.
Most importantly, these linguistic symbols can be encoded in several forms (sounds, gestures, writing, etc.), transmitted to the brain as continuous signals, and the brain itself seems to decode them in a continuous way. A great deal of work in the philosophy of language and in linguistics has gone into conceptualizing human language and distinguishing words from their references, meanings, and so on.
❐ Natural language is a discrete / symbolic / categorical system.
1.2 Natural Language Processing Tasks
Natural language processing involves tasks at different levels, from speech processing to semantic interpretation to discourse processing. The goal of NLP is to design algorithms that allow computers to “understand” language well enough to perform certain tasks. These tasks vary in difficulty:
1) Simple tasks
- Spell checking
- Keyword search
- Finding synonyms
2) Intermediate tasks
- Parsing information from websites, documents, etc.
3) Complex tasks
- Machine translation
- Semantic analysis
- Coreference resolution
- Question answering
1.3 How to represent vocabulary
The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any model. We will not dwell on early NLP work that treated words as atomic symbols.
To perform well on most NLP tasks, we first need a notion of the similarities and differences between words. With word vectors, we can easily encode this information into the vectors themselves.
(This section can also refer to ShowMeAI’s summary of Andrew Ng’s course: “Deep Learning Tutorial: Natural Language Processing and Word Embeddings”.)
2. Word vector
When we use word vectors to encode words, an \( N \)-dimensional space is enough to encode all the semantics of a language, with each dimension encoding some aspect of the meaning we convey with language.
A simple one-hot vector cannot capture similarity between words. We need to reduce the dimensionality from \( \left| V \right| \) to a low-dimensional subspace to obtain dense word vectors that encode the relationships between words.
3. Word vector based on SVD dimensionality reduction
(For dimensionality reduction algorithms, see the ShowMeAI machine learning tutorial article “Graphical Machine Learning: Dimensionality Reduction Algorithms Explained”, and the practical article “Machine Learning in Action: A Comprehensive Guide to Feature Engineering” for how to apply dimensionality reduction in Python.)
Building word embeddings (i.e., word vectors) from a word co-occurrence matrix via SVD decomposition works as follows:
- First, traverse a large dataset and accumulate the co-occurrence count matrix \( X \)
- Then apply SVD to \( X \) to obtain the decomposition \( USV^T \)
- Finally, use the rows of \( U \) as the word vectors for all words in the dictionary
Next we discuss several options for the matrix \( X \) .
3.1 Word-Document Matrix
The original solution was based on a word-document co-occurrence matrix. We conjecture that related words often appear in the same documents:
- For example, “banks”, “bonds”, “stocks”, “money”, etc., are likely to appear together
- But “banks”, “octopus”, “banana”, and “hockey” are unlikely to appear together
Based on this, we build a Word-Document Matrix \( X \) as follows: iterate over hundreds of millions of documents, and whenever word \( i \) appears in document \( j \), increment \( X_{ij} \) by one.
This is obviously a very large matrix \( \mathbb{R}^{\left| V \right| \times M} \), and its size scales with the number of documents \( M \). So we can try something better.
3.2 Word co-occurrence matrix based on a sliding window
Counting over full documents is very time-consuming. Instead, we can count within a text window, recording how often each word appears within a window of a given size, to obtain the co-occurrence matrix \( X \).
Below is a simple example: we build the co-occurrence matrix for the text using a sliding window of size 1 on each side.
- I enjoy flying.
- I like NLP.
- I like deep learning.
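The three sentences above are enough to build the window-1 co-occurrence matrix by hand; the sketch below does the same with NumPy (the tokenization and sorted vocabulary order are assumptions made for illustration):

```python
import numpy as np

sentences = [
    "I enjoy flying .".split(),
    "I like NLP .".split(),
    "I like deep learning .".split(),
]

# Vocabulary in sorted order; each word gets a row/column index.
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a window of 1 word on each side.
X = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                X[idx[w], idx[s[j]]] += 1
```

For example, `X[idx['I'], idx['like']]` equals 2, because “I” and “like” are adjacent in two of the three sentences; counting both directions makes \( X \) symmetric.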
❐ Using a word co-occurrence matrix:
- Generate a co-occurrence matrix \( X \) of dimension \( \left| V \right| \times \left| V \right| \)
- Apply SVD on \( X \) to get \( X = USV^T \)
- Take the first \( k \) columns of \( U \) to get \( k \)-dimensional word vectors
- \( \frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{\left| V \right|} \sigma_i} \) indicates the amount of variance captured by the first \( k \) dimensions
3.3 Using SVD to reduce the dimensionality of the co-occurrence matrix
- We apply SVD to the matrix \( X \), observe the singular values (the diagonal entries of the matrix \( S \)), and truncate at an index \( k \) based on the percentage of variance captured, keeping the first \( k \) elements:
$$
\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{\left| V \right|} \sigma_i}
$$
- Then take the submatrix \( U_{1:\left| V \right|, 1:k} \) as the word embedding matrix. This gives a \( k \)-dimensional representation for every word in the vocabulary.
Applying SVD to the matrix \( X \):
Dimensionality reduction by selecting the first \( k \) singular vectors：
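The SVD-and-truncate step can be sketched with NumPy (the matrix here is a small random symmetric stand-in for a real co-occurrence matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(8, 8)).astype(float)
X = X + X.T                      # co-occurrence matrices are symmetric

# Full SVD: X = U S V^T, with singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X)

k = 3
word_vectors = U[:, :k]          # first k columns of U: one k-dim vector per word
variance_kept = S[:k].sum() / S.sum()   # fraction of "variance" retained
```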
The methods above give us word vectors that encode sufficient semantic and syntactic (part-of-speech) information, but they come with some problems:
- The dimensions of the matrix change frequently (new words are added often, and the corpus size changes)
- The matrix is extremely sparse, since most words never co-occur
- The matrix is generally very high-dimensional ( \( \approx 10^6 \times 10^6 \) )
- Some tricks must be applied to \( X \) to handle the extreme imbalance of word frequencies
❐ SVD-based methods have high computational complexity (SVD of an \( m \times n \) matrix costs \( O(mn^2) \) ), and it is hard to incorporate new words or documents.
Several remedies exist for the problems discussed above:
- Ignore function words such as “the”, “he”, “has”, etc.
- Use a ramp window, i.e., weight co-occurrence counts by the distance between the words in the document
- Use Pearson correlation and set negative counts to \( 0 \), rather than using raw counts alone
❐ Count-based methods make efficient use of statistics, but the iteration-based methods below can efficiently build word embeddings on large corpora while keeping complexity under control.
4. Iterative optimization algorithm – Word2Vec
(This section can also refer to ShowMeAI’s summary of Andrew Ng’s course: “Deep Learning Tutorial: Natural Language Processing and Word Embeddings”.)
Word2Vec is an iterative model that learns from text step by step, eventually encoding into the word vectors the probability of a word given its context, rather than computing and storing global statistics over some huge dataset (possibly billions of sentences).
The idea is to design a model whose parameters are the word vectors, then train the model against an objective function: at each iteration we compute the error, adjust the model parameters (the word vectors) with an optimization algorithm to reduce the loss, and ultimately learn the word vectors.
In neural networks the corresponding idea is called “backpropagation”; the simpler the model and the task, the faster it trains.
❐ Iteration-based methods capture word co-occurrences one window at a time, rather than directly capturing all co-occurrence counts as the SVD method does.
Many researchers have explored this line of thinking. The model of [Collobert et al., 2011] first converts each word into a vector. For each specific task (named entity recognition, part-of-speech tagging, etc.), they trained not only the model parameters but also the word vectors, achieving strong task performance while also computing high-quality word vectors.
A very efficient method is Word2Vec. Word2Vec is an open source software package from Google, which includes the following core content:

Two algorithms: continuous bag-of-words (CBOW) and Skip-gram
- CBOW predicts the center word from the surrounding context words
- Skip-gram does the opposite: given the center word, it predicts the probability distribution of the surrounding context words

Two training methods: negative sampling and hierarchical softmax
- Negative sampling defines the objective by sampling negative examples
- Hierarchical softmax defines the objective by computing the probabilities of all words with an efficient tree structure
❐ Word2Vec relies on an important linguistic assumption, distributional similarity: similar words have similar contexts.
4.1 Language Model
Let’s first look at language models, starting with an example:
I love learning about natural language processing techniques
A good language model gives this sentence a high probability, because it is a perfectly valid sentence both syntactically and semantically. Similarly, the sentence “natural learning love processing language i technology” should get a very low probability, because it is nonsensical.
Mathematically, we can say that the probability for a given sequence of \( n \) words is:
$$
P(w_1, w_2, \cdots, w_n)
$$
In the Unigram model approach, we assume that word occurrences are completely independent, thus decomposing the probabilities
$$
P(w_1, w_2, \cdots, w_n)=\prod_{i=1}^{n} P(w_i)
$$
Strictly speaking, this assumption is unreasonable, because the next word depends heavily on the preceding sequence; under such a model, a meaningless sentence might receive a high probability. So we instead let the probability of a sequence depend on the pairwise probabilities of each word and the word next to it. This is the bigram model:
$$
P(w_1, w_2, \cdots, w_n)=\prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$
Admittedly, caring only about adjacent words is a bit simplistic; considering the co-occurrence of \( n \) consecutive words yields n-grams. But even bigrams bring a significant improvement over unigrams. In the word-word co-occurrence matrix with a co-occurrence window of \( 1 \), we essentially obtain such pairwise probabilities. However, this again requires computing and storing global information for a large dataset.
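A minimal maximum-likelihood bigram model can be sketched as follows (the toy corpus is just the example sentence from above; a real model needs far more data plus smoothing):

```python
from collections import Counter

corpus = "i love learning about natural language processing".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    # P(w1..wn) ~ product of P(w_i | w_{i-1}),
    # with MLE estimate count(w_{i-1}, w_i) / count(w_{i-1})
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return p
```

On this corpus `bigram_prob("i love learning")` is 1.0, while the scrambled `bigram_prob("learning love i")` is 0.0, matching the intuition that a language model should score valid sentences far higher than nonsensical ones.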
Now that we understand how to consider sequences of words with probabilities, let’s look at some example models that are able to learn these probabilities.
4.2 CBOW Continuous Bag of Words Model
This method takes {"I", "love", "learning", "natural language", "technology"} as the context and tries to predict or generate the center word “processing” from those words. We call such a model the continuous bag-of-words (CBOW) model.
❐ CBOW predicts the center word from the context. For each word, the model learns two vectors:
- \( v \) (input vector, used when the word is a context word)
- \( u \) (output vector, used when the word is the center word)
The model input is the one-hot representation of words. The input one-hot context vectors are denoted \( x^{(c)} \), and the output is denoted \( y^{(c)} \). In the CBOW model there is only one output, so we call \( y \) the one-hot vector of the known center word.
Next we define the model’s unknown parameters.
We create two matrices, \( \mathcal{V}\in \mathbb{R}^{n\times \left| V \right|} \) and \( \mathcal{U}\in \mathbb{R}^{\left| V \right| \times n} \), where:
- \( n \) is an arbitrary size for the embedding space
- \( \mathcal{V} \) is the input word matrix: when word \( w_i \) is an input to the model, the \( i \)-th column of \( \mathcal{V} \) is its \( n \)-dimensional embedding vector; we denote this \( n \times 1 \) vector by \( v_i \)
- Similarly, \( \mathcal{U} \) is the output word matrix: when word \( w_j \) is an output of the model, the \( j \)-th row of \( \mathcal{U} \) is its \( n \)-dimensional embedding vector; we denote this row of \( \mathcal{U} \) by \( u_j \)
- Note that for each word \( w_i \) we actually learn two word vectors (the input word vector \( v_i \) and the output word vector \( u_i \))
❐ First, we make the following definitions for the CBOW model:
- \( w_i \) : word \( i \) in the vocabulary \( V \)
- \( \mathcal{V}\in \mathbb{R}^{n\times \left| V \right|} \) : input word matrix
- \( v_i \) : column \( i \) of \( \mathcal{V} \), the input vector representation of word \( w_i \)
- \( \mathcal{U}\in \mathbb{R}^{\left| V \right| \times n} \) : output word matrix
- \( u_i \) : row \( i \) of \( \mathcal{U} \), the output vector representation of word \( w_i \)
We break the model into the following steps:
- ① Generate the one-hot context word vectors \( (x^{(c-m)}, \cdots ,x^{(c-1)},x^{(c+1)}, \cdots ,x^{(c+m)}\in \mathbb{R}^{\left| V \right|}) \)
- ② Compute the embedded word vectors from these one-hot inputs: \( (v_{c-m}=\mathcal{V}x^{(c-m)},v_{c-m+1}=\mathcal{V}x^{(c-m+1)},\cdots ,v_{c+m}=\mathcal{V}x^{(c+m)}\in \mathbb{R}^{n}) \)
- ③ Average these word vectors: \( \widehat{v}=\frac{v_{c-m}+v_{c-m+1}+ \cdots +v_{c+m}}{2m}\in \mathbb{R}^{n} \)
- ④ Compute the score vector \( z = \mathcal{U}\widehat{v}\in \mathbb{R}^{\left| V \right|} \). Dot products between similar vectors are large, which pushes similar words close together and yields high scores
- ⑤ Convert the scores to probabilities via softmax: \( \widehat{y}=\operatorname{softmax}(z)\in \mathbb{R}^{\left| V \right|} \)
- ⑥ We want the generated probabilities \( \widehat{y} \in \mathbb{R}^{\left| V \right|} \) to be as close as possible to the true probabilities \( y \in \mathbb{R}^{\left| V \right|} \) (actually a one-hot vector); we will build a cross-entropy loss for this and optimize it iteratively below
❐ Here softmax is a commonly used function that maps a vector to another vector whose \( i \)-th element is \( \frac{e^{\widehat{y}_i}}{\sum_{k=1}^{\left| V \right|}e^{\widehat{y}_k}} \).
Because of the exponential, every element is positive.
Dividing by \( \sum_{k=1}^{\left| V \right|}e^{\widehat{y}_k} \) normalizes the vector (so that \( \sum_{k=1}^{\left| V \right|}\widehat{y}_k=1 \)), yielding a probability distribution.
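The six CBOW steps above can be sketched in NumPy; the matrix sizes and context indices below are arbitrary assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                          # |V| and the embedding dimension
Vmat = rng.normal(size=(n, V_size))        # input word matrix  (n x |V|)
Umat = rng.normal(size=(V_size, n))        # output word matrix (|V| x n)

context_ids = [1, 2, 4, 5]                 # 2m = 4 context word indices
# steps 1-2: multiplying by a one-hot vector just selects a column of Vmat
v_hat = Vmat[:, context_ids].mean(axis=1)  # step 3: average the context embeddings
z = Umat @ v_hat                           # step 4: score vector, shape (|V|,)
y_hat = np.exp(z - z.max())                # step 5: softmax (max subtracted for stability)
y_hat /= y_hat.sum()
# step 6: y_hat would now be compared to the center word's one-hot vector
```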
The figure below is a calculation diagram of the CBOW model：
Given \( \mathcal{V} \) and \( \mathcal{U} \), and knowing how the model works, how do we update the parameters and learn these two matrices? As in all machine learning tasks, we build an objective function. Here we use cross-entropy \( H(\widehat{y}, y) \) as the loss, an information-theoretic measure of the distance between two probability distributions.
(For cross-entropy, see the ShowMeAI article “Graphical AI Mathematical Foundations: Information Theory” in the Graphical AI Math Basics course.)
In the discrete case, the loss function can be formulated intuitively using crossentropy
$$
H(\widehat{y}, y)=-\sum_{j=1}^{\left| V \right|} y_{j} \log (\widehat{y}_{j})
$$
In the above formula, \( y \) is a onehot vector. So the above loss function can be simplified as:
$$
H(\widehat{y}, y)= - y_{c}\log(\widehat{y}_{c})
$$
\( c \) is the index of the correct word in the one-hot vector. If the prediction is perfect, \( \widehat{y}_{c}=1 \), then \( H(\widehat{y}, y)=-1 \cdot \log(1)=0 \): a completely accurate prediction incurs no penalty or loss.
Now consider the opposite case, a very poor prediction that assigns the correct answer only \( \widehat{y}_{c}=0.01 \). A similar calculation gives \( H(\widehat{y}, y)=-1 \cdot \log(0.01) \approx 4.605 \): the loss is large, and the prediction is far from the true answer.
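The two cases above are simple arithmetic (natural logarithm assumed) and can be checked directly:

```python
import math

# Cross-entropy against a one-hot target reduces to -log(y_hat_c).
perfect = -math.log(1.0)    # perfect prediction: zero loss
poor = -math.log(0.01)      # poor prediction: large loss, about 4.605
```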
As can be seen from the above example, for probability distributions, crossentropy provides us with a good measure of distance. Therefore, our optimization objective function formula is:
$$
\begin{aligned}
minimize J &=-\log P(w_{c} \mid w_{c-m}, \cdots, w_{c-1}, w_{c+1}, \cdots, w_{c+m}) \\
&=-\log P(u_{c} \mid \widehat{v}) \\
&=-\log \frac{\exp (u_{c}^{T} \widehat{v})}{\sum_{j=1}^{\left| V \right|} \exp (u_{j}^{T} \widehat{v})} \\
&=-u_{c}^{T} \widehat{v}+\log \sum_{j=1}^{\left| V \right|} \exp (u_{j}^{T} \widehat{v})
\end{aligned}
$$
We use SGD (Stochastic Gradient Descent) to update all relevant word vectors \( u_{c} \) and \( v_j \) .
(For optimization algorithms such as SGD, please refer toShowMeAIA summary article of Teacher Wu Enda’s courseDeep Learning Tutorial Neural Network Optimization Algorithm）
❐ When \( \widehat{y} = y \), \( \widehat{y} \mapsto H(\widehat{y}, y) \) attains its minimum. If we find a \( \widehat{y} \) such that \( H(\widehat{y}, y) \) is close to the minimum, then \( \widehat{y} \approx y \), meaning the model is good at predicting the center word from its context!
❐ To learn the vectors (the matrices \( U \) and \( V \) ), CBOW defines a loss function that measures how well it predicts the center word; we then optimize this loss by updating the matrices \( U \) and \( V \) with stochastic gradient descent.
❐ SGD calculates gradients and updates parameters for a window:
\( \mathcal{U}_{new} \leftarrow \mathcal{U}_{old} - \alpha \nabla_{\mathcal{U}} J \)
\( \mathcal{V}_{new} \leftarrow \mathcal{V}_{old} - \alpha \nabla_{\mathcal{V}} J \)
4.3 Skip-Gram Model
The Skip-Gram model is largely the same as CBOW, but with input and output swapped: the \( x \) of CBOW is now \( y \), and its \( y \) is now \( x \). The input one-hot vector (the center word) is denoted \( x \), and the output vectors are \( y^{(j)} \). \( \mathcal{V} \) and \( \mathcal{U} \) are defined as in CBOW.
❐ SkipGram model: Predict surrounding context words given a center word.
Let’s walk through the Skip-Gram model in detail. First we define some notation:
- \( w_i \) : word \( i \) in the vocabulary \( V \)
- \( \mathcal{V}\in \mathbb{R}^{n\times \left| V \right|} \) : input word matrix
- \( v_i \) : column \( i \) of \( \mathcal{V} \), the input vector representation of word \( w_i \)
- \( \mathcal{U}\in \mathbb{R}^{\left| V \right| \times n} \) : output word matrix
- \( u_i \) : row \( i \) of \( \mathcal{U} \), the output vector representation of word \( w_i \)
The Skip-Gram model works in the following steps:
- ① Generate the one-hot vector \( x\in \mathbb{R}^{\left| V \right|} \) of the center word
- ② Compute the word embedding vector of the center word, \( v_{c}=\mathcal{V}x\in \mathbb{R}^{n} \)
- ③ Generate the score vector \( z = \mathcal{U}v_{c} \)
- ④ Convert the scores to probabilities, \( \widehat{y}=\operatorname{softmax}(z) \); note that \( \widehat{y}_{c-m},\cdots,\widehat{y}_{c-1},\widehat{y}_{c+1},\cdots,\widehat{y}_{c+m} \) are the probabilities of observing each context word
- ⑤ We want the generated probability vector to match the true probabilities \( y^{(c-m)}, \cdots ,y^{(c-1)},y^{(c+1)}, \cdots ,y^{(c+m)} \), the one-hot vectors of the actual output
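The Skip-Gram forward pass above can be sketched analogously to CBOW (matrix sizes and the center-word index are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4
Vmat = rng.normal(size=(n, V_size))   # input word matrix
Umat = rng.normal(size=(V_size, n))   # output word matrix

c = 3                                  # step 1: index of the center word (one-hot implicit)
v_c = Vmat[:, c]                       # step 2: embedding of the center word
z = Umat @ v_c                         # step 3: score vector
y_hat = np.exp(z - z.max())            # step 4: one softmax distribution over the vocabulary
y_hat /= y_hat.sum()
# step 5: the same y_hat is compared against every context word's one-hot vector
```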
As with the CBOW model, we need an objective function to evaluate the model. A key difference from CBOW is that we invoke a naive Bayes assumption to factor the probability. This is a strong (naive) conditional independence assumption: given the center word, all output words are completely independent (i.e., lines 1 to 2 of the formula below).
$$
\begin{aligned}
minimize J &= -\log P(w_{c-m}, \cdots, w_{c-1}, w_{c+1}, \cdots, w_{c+m} \mid w_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} P(w_{c-m+j} \mid w_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} P(u_{c-m+j} \mid v_{c}) \\
&=-\log \prod_{j=0, j \neq m}^{2 m} \frac{\exp (u_{c-m+j}^{T} v_{c})}{\sum_{k=1}^{\left| V \right|} \exp (u_{k}^{T} v_{c})} \\
&=-\sum_{j=0, j \neq m}^{2 m} u_{c-m+j}^{T} v_{c}+2 m \log \sum_{k=1}^{\left| V \right|} \exp (u_{k}^{T} v_{c})
\end{aligned}
$$
From this objective function (loss function), we can compute the gradients associated with the unknown parameters and update them via SGD at each iteration.
Notice:
$$
J =-\sum_{j=0, j \neq m}^{2 m} \log P(u_{c-m+j} \mid v_{c}) =\sum_{j=0, j \neq m}^{2 m} H(\widehat{y}, y_{c-m+j})
$$
- where \( H(\widehat{y},y_{c-m+j}) \) is the cross-entropy between the probability vector \( \widehat{y} \) and the one-hot vector \( y_{c-m+j} \).
❐ Only one probability vector \( \widehat{y} \) is computed. Skip-Gram treats every context word equally: the model computes the probability of each word appearing in the context independently of its distance from the center word.
The figure below is the computation diagram of the Skip-Gram model:
4.4 Negative sampling
Let’s return to the objective function we need to optimize. With a large vocabulary, the sum over \( \left| V \right| \) is enormous: any update or evaluation of the objective takes \( O(\left| V \right|) \) time. A simple idea is to approximate it rather than compute it directly.
❐ The loss function \( J \) of CBOW and Skip-Gram is expensive to compute because the softmax normalization sums over all scores!
At each training step, instead of iterating over the entire vocabulary, we sample only a few negative examples, drawn from a noise distribution \( P_n(w) \) whose probabilities match the ordering of word frequencies.
Mikolov proposed negative sampling in the paper “Distributed Representations of Words and Phrases and their Compositionality”. Although negative sampling builds on the Skip-Gram model, it actually optimizes a different objective.
Consider a word-context pair \( (w,c) \). Did this pair come from the training corpus? We write \( P(D=1\mid w,c) \) for the probability that \( (w,c) \) appeared in the corpus, and \( P(D=0\mid w,c) \) for the probability that it did not.
This is a binary classification problem, which we model based on the sigmoid function:
$$
P(D=1 \mid w, c, \theta)=\sigma(v_c^{T} v_w)=\frac{1}{1+e^{-v_c^{T} v_w}}
$$
❐ The sigmoid function is a binary version of softmax and can be used to build probability models: \( \sigma(x)=\frac{1}{1+e^{-x}} \)
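The “binary softmax” remark can be verified numerically: \( \sigma(x) \) equals a two-class softmax over the scores \( [x, 0] \). A quick sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(a, b):
    # Two-class softmax: probability of the first class
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb)

# sigmoid(x) and softmax over [x, 0] agree for any x
diff = abs(sigmoid(1.3) - softmax2(1.3, 0.0))
```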
Now we build a new objective function: if the center word and context word are indeed in the corpus, maximize \( P(D=1\mid w,c) \); if they are indeed not in the corpus, maximize \( P(D=0\mid w,c) \).
(This is the logistic regression approach to binary classification; see the ShowMeAI Machine Learning Tutorial article “Graphical Machine Learning: Logistic Regression Explained”.)
We take a simple maximum likelihood approach to these two probabilities (here \( \theta \) denotes the parameters of the model, in our case \( \mathcal{V} \) and \( \mathcal{U} \)).
$$
\begin{aligned}
\theta &=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \widetilde{D}} P(D=0 \mid w, c, \theta) \\
&=\underset{\theta}{\operatorname{argmax}} \prod_{(w, c) \in D} P(D=1 \mid w, c, \theta) \prod_{(w, c) \in \widetilde{D}}(1-P(D=1 \mid w, c, \theta)) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log P(D=1 \mid w, c, \theta)+\sum_{(w, c) \in \widetilde{D}} \log (1-P(D=1 \mid w, c, \theta)) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}+\sum_{(w, c) \in \widetilde{D}} \log (1-\frac{1}{1+\exp (-u_{w}^{T} v_{c})}) \\
&=\underset{\theta}{\operatorname{argmax}} \sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}+\sum_{(w, c) \in \widetilde{D}} \log (\frac{1}{1+\exp (u_{w}^{T} v_{c})})
\end{aligned}
$$
Here maximizing the likelihood function is equivalent to minimizing the negative loglikelihood:
$$
J=-\sum_{(w, c) \in D} \log \frac{1}{1+\exp (-u_{w}^{T} v_{c})}-\sum_{(w, c) \in \widetilde{D}} \log (\frac{1}{1+\exp (u_{w}^{T} v_{c})})
$$
Note that \( \widetilde{D} \) is a “false” or “negative” corpus: it contains nonsensical sentences such as “natural learning love processing language i technology”, which should receive a very low probability. We can construct \( \widetilde{D} \) by randomly sampling words from the corpus.
For the Skip-Gram model, the new objective for observing the context word \( c-m+j \) given the center word \( c \) is:
$$
-\log \sigma(u_{c-m+j}^{T} \cdot v_{c})-\sum_{k=1}^{K} \log \sigma(-\tilde{u}_{k}^{T} \cdot v_{c})
$$
For the CBOW model, the new objective for observing the center word \( u_{c} \) given the context vector \( \widehat{v}=\frac{v_{c-m}+v_{c-m+1}+ \cdots +v_{c+m}}{2m} \) is:
$$
-\log \sigma(u_{c}^{T}\cdot \widehat{v})-\sum_{k=1}^{K}\log \sigma(-\widetilde{u}_{k}^{T}\cdot \widehat{v})
$$
In the formulas above, \( \{\widetilde{u}_{k}\mid k=1, \cdots ,K\} \) are words sampled from \( P_{n}(w) \). As for the probability of selecting a word as a negative sample, one could sample uniformly at random, but the paper’s authors give a better formula:
$$
p(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum^m_{j=0}f(w_j)^{\frac{3}{4}}}
$$
Here \( f(w_i) \) denotes the frequency of word \( w_i \) in the corpus. The \( 3/4 \) power smooths the distribution and increases the chance of selecting low-frequency words.
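The effect of the \( 3/4 \) exponent is easy to see on hypothetical counts (the frequencies below are made up for illustration):

```python
import numpy as np

freqs = np.array([100.0, 50.0, 10.0, 1.0])   # hypothetical word counts

p_raw = freqs / freqs.sum()                  # plain frequency sampling
p = freqs ** 0.75                            # unigram counts raised to the 3/4 power
p /= p.sum()

# The smoothed distribution gives rare words a larger sampling probability
# and frequent words a smaller one than raw frequencies would.
```

Here the rarest word's probability rises from about 0.006 under raw frequencies to about 0.018 after smoothing.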
4.5 Hierarchical Softmax
Mikolov also proposed hierarchical softmax in the paper “Distributed Representations of Words and Phrases and their Compositionality”, an efficient alternative to ordinary softmax. In practice, hierarchical softmax tends to work better for low-frequency words, while negative sampling works better for high-frequency words and lower-dimensional vectors.
❐ Hierarchical softmax represents all words in the vocabulary with a binary tree. Each leaf of the tree is a word, and there is exactly one path from the root to each leaf. In this model there is no output representation of words; instead, every node of the tree (except the root and the leaves) is associated with a vector that the model learns. The probability of a word being the output word is defined as the probability of a random walk from the root to that word’s leaf. The computational cost becomes \( O(\log \left| V \right|) \) instead of \( O(\left| V \right|) \).
In this model, the probability \( p(w\mid w_i) \) of a word \( w \) given a vector \( w_i \) equals the probability of a random walk that starts at the root and ends at the leaf node corresponding to \( w \). The main advantage of this method is that computing the probability costs only \( O(\log \left| V \right|) \), the length of the path.
❐ The figure below is a schematic of the hierarchical softmax binary tree:
Let’s introduce some notation. Let \( L(w) \) be the number of nodes on the path from the root to the leaf \( w \); for example, \( L(w_2) \) in the figure above is 3. Let \( n(w,i) \) be the \( i \)-th node on this path, with associated vector \( v_{n(w,i)} \). So \( n(w,1) \) is the root and \( n(w,L(w)) \) is the parent of \( w \). For each internal node \( n \), we arbitrarily fix one of its children and call it \( ch(n) \) (say, the left child). Then we can compute the probability as
$$
p(w \mid w_i)=\prod_{j=1}^{L(w)-1} \sigma([n(w, j+1)=ch(n(w, j))] \cdot v_{n(w, j)}^{T} v_{w_i})
$$
in
$$
[x]=\left\{\begin{array}{ll}{1} & {\text { if } x \text { is true }} \\ {-1} & {\text { otherwise }}\end{array}\right.
$$
This formula looks very complicated, let’s explain it in detail.
- First, we compute each factor of the product based on the shape of the path (its left and right branches) from the root \( (n(w,1)) \) to the leaf \( (w) \). If we assume \( ch(n) \) is always the left child of \( n \), then \( [n(w,j+1)=ch(n(w,j))] \) evaluates to 1 when the path goes left and to -1 when it goes right.
- Moreover, the term \( [n(w,j+1)=ch(n(w,j))] \) provides normalization. At a node \( n \), summing the probabilities of going left and going right gives, for any value of \( v_{n}^{T}v_{w_i} \), \( \sigma(v_{n}^{T} v_{w_i})+\sigma(-v_{n}^{T} v_{w_i})=1 \). The normalization also guarantees \( \sum_{w=1}^{\left| V \right|}P(w\mid w_i)=1 \), just as in ordinary softmax.
- Finally, we compute dot products to compare the similarity of the input vector \( v_{w_i} \) to each internal node vector \( v_{n(w,j)} \). Let’s work an example: taking \( w_2 \) in the figure above, we must take two left edges and one right edge to reach \( w_2 \) from the root, so
$$
\begin{aligned}
p(w_2 \mid w_i) &=p(n(w_2, 1), \text {left}) \cdot p(n(w_2, 2), \text {left}) \cdot p(n(w_2, 3), \text { right }) \\
&=\sigma(v_{n(w_2, 1)}^{T} v_{w_i}) \cdot \sigma(v_{n(w_2, 2)}^{T} v_{w_i}) \cdot \sigma(-v_{n(w_2, 3)}^{T} v_{w_i})
\end{aligned}
$$
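The path product above can be sketched directly; the dot products below are made-up values, and a right turn flips the sign inside the sigmoid:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical dot products v_n^T v_{w_i} at the three internal nodes
# on the path to w_2, taken left, left, then right.
dots = [0.5, -0.2, 1.1]
signs = [+1, +1, -1]        # +1 = left branch, -1 = right branch

p = 1.0
for s, d in zip(signs, dots):
    p *= sigmoid(s * d)     # multiply the branch probabilities along the path
```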
The training objective is to minimize the negative log-likelihood \( -\log P(w\mid w_i) \). Instead of updating an output vector for every word, we update only the vectors of the nodes on the root-to-leaf path in the binary tree.
The speed of this method depends on how the binary tree is built and how words are assigned to leaves. Mikolov’s paper “Distributed Representations of Words and Phrases and their Compositionality” uses a Huffman tree, which assigns high-frequency words shorter paths in the tree.
5. Building word vectors with the Python gensim tool library
In Python, it is easy to build and use word vectors with the gensim tool library, which provides application APIs such as most_similar and doesnt_match. We can wrap most_similar to output the analogy result for a triplet; the code is as follows:
from gensim.models import KeyedVectors

# word2vec_glove_file is the path to a GloVe file converted to word2vec format
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
model.most_similar('banana')

def analogy(x1, x2, y1):
    # x1 : x2 :: y1 : ?  (return the top-ranked answer)
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

analogy('japan', 'japanese', 'australia')
model.doesnt_match("breakfast cereal dinner lunch".split())
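Under the hood, most_similar ranks words by cosine similarity to the combined query vector (positive vectors added, negative vectors subtracted). A minimal self-contained sketch with made-up toy vectors (the real vectors would come from a trained model; the function below is an illustration, not gensim's implementation):

```python
import numpy as np

# Toy vectors, invented for illustration only.
vocab = {
    "king":   np.array([0.90, 0.80, 0.10]),
    "queen":  np.array([0.85, 0.82, 0.15]),
    "man":    np.array([0.70, 0.30, 0.20]),
    "woman":  np.array([0.68, 0.35, 0.25]),
    "banana": np.array([0.10, 0.20, 0.95]),
}

def most_similar(positive, negative=(), topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(vocab[w] for w in positive) \
        - sum((vocab[w] for w in negative), np.zeros(3))
    query /= np.linalg.norm(query)
    sims = {w: float(v @ query / np.linalg.norm(v))
            for w, v in vocab.items()
            if w not in positive and w not in negative}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]

# The classic analogy: king - man + woman ≈ queen
print(most_similar(positive=["king", "woman"], negative=["man"]))
```

This is the same computation the analogy wrapper above performs via gensim's most_similar, just spelled out with explicit cosine similarities.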
6. Extended reading
 Word2Vec Tutorial – The Skip-Gram Model
6.1 Task example
We are going to train a simple neural network with a single hidden layer to perform a certain task, but we are not actually going to use the network for the task it was trained on. Instead, the goal is really just to learn the weights of the hidden layer, which are in fact the “word vectors” we are trying to learn. This technique is also commonly used in unsupervised feature learning, for example when training an autoencoder: the input vector is compressed in the hidden layer, and the hidden layer vector is decompressed in the output layer to recover the input. After training, the output layer is removed and only the hidden layer is used.
The picture below shows the process of drawing samples from the source text.
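The sampling process in the figure can be sketched as sliding a window over the text and emitting (center, context) pairs. A minimal illustration (hypothetical helper name, window size 2):

```python
def skipgram_pairs(tokens, window=2):
    """Emit (center, context) training pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions: up to `window` tokens on each side, skipping i itself.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for pair in skipgram_pairs(sentence, window=2)[:4]:
    print(pair)
# first pairs: ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')
```

Each emitted pair becomes one training example: the center word is the network input and the context word is the prediction target.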
The picture below shows the network architecture diagram.
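The “trick” described above, where a one-hot input simply selects one row of the hidden-layer weight matrix, can be sketched with toy sizes and random weights (for illustration only; in practice these weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
# Input-to-hidden weight matrix: one row per vocabulary word.
W_hidden = rng.normal(size=(vocab_size, dim))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying a one-hot vector by W_hidden just selects row `word_index`.
hidden = one_hot @ W_hidden
assert np.allclose(hidden, W_hidden[word_index])
# After training, W_hidden[word_index] is the word vector for that word;
# the output layer is discarded.
```

This is why the hidden-layer weights themselves are the word vectors: the "forward pass" for a word is nothing more than a row lookup.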
If two different words have very similar “contexts” (i.e., similar words tend to appear around them), then our model needs to produce very similar outputs for these two words. One way for the network to output similar context predictions for two words is for their word vectors to be similar. So, if two words have similar contexts, our network is pushed to learn similar word vectors for them!
 Efficient Estimation of Word Representations in Vector Space (original word2vec paper)
 Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)
7. References
 Online version of this tutorial
 “Stanford CS224n Deep Learning and Natural Language Processing”Course Study Guide
 “Stanford CS224n Deep Learning and Natural Language Processing”Analysis of course assignments
 【Bilingual subtitled video】Stanford CS224n  Deep Learning and Natural Language Processing (2019 · 20 lectures)
 Stanford official website  CS224n: Natural Language Processing with Deep Learning
ShowMeAI Deep Learning and Natural Language Processing Tutorial (Full Version)
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (1) – Word Vector, SVD Decomposition and Word2vec
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (2) – Training and Evaluation of GloVe and Word Vectors
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (3) – Neural Network and Backpropagation
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (4) – Syntactic Analysis and Dependency Analysis
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (5) – Language Model, RNN, GRU and LSTM
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (6) – Neural Machine Translation, seq2seq and Attention Mechanism
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (7) – Question Answering System
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (8) – Convolutional Neural Networks in NLP
 ShowMeAI Deep Learning and Natural Language Processing Tutorial (9) – Syntax Analysis and Tree Recurrent Neural Network
ShowMeAI Stanford NLP famous course CS224n with detailed explanation (20 lectures · full version)
 Stanford NLP famous course with detailed explanation  CS224n Lecture 1 – Introduction to NLP and Preliminary Word Vectors
 Stanford NLP famous course with detailed explanation  CS224n Lecture 2 – Word vector advanced
 Stanford NLP famous course with detailed explanation  CS224n Lecture 3 – Neural Network Knowledge Review
 Stanford NLP famous course with detailed explanation  CS224n Lecture 4 – Neural Network Backpropagation and Calculation Graph
 Stanford NLP famous course with detailed explanation  CS224n Lecture 5 – Syntax Analysis and Dependency Analysis
 Stanford NLP famous course with detailed explanation  CS224n Lecture 6 – Recurrent Neural Network and Language Model
 Stanford NLP famous course with detailed explanation  CS224n Lecture 7 – Gradient disappearance problem and RNN variants
 Stanford NLP famous course with detailed explanation  CS224n Lecture 8 – Machine translation, seq2seq and attention mechanism
 Stanford NLP famous course with detailed explanation  CS224n Lecture 9 – Practical skills and experience in large projects
 Stanford NLP famous course with detailed explanation  CS224n Lecture 10 – Question Answering System in NLP
 Stanford NLP famous course with detailed explanation  CS224n Lecture 11 – Convolutional Neural Network in NLP
 Stanford NLP famous course with detailed explanation  CS224n Lecture 12 – Subword Model
 Stanford NLP famous course with detailed explanation  CS224n Lecture 13 – Contextbased representation and NLP pretraining model
 Stanford NLP famous course with detailed explanation  CS224n Lecture 14 – Transformers selfattention and generation model
 Stanford NLP famous course with detailed explanation  CS224n Lecture 15 – NLP text generation task
 Stanford NLP famous course with detailed explanation  CS224n Lecture 16 – Reference resolution problem and neural network method
 Stanford NLP famous course with detailed explanation  CS224n Lecture 17 – Multitask learning (taking question answering system as an example)
 Stanford NLP famous course with detailed explanation  CS224n Lecture 18 – Syntax Analysis and Tree Recurrent Neural Network
 Stanford NLP famous course with detailed explanation  CS224n Lecture 19 – AI safety bias and fairness
 Stanford NLP famous course with detailed explanation  CS224n Lecture 20 – NLP and the future of deep learning