## Preface

Word embedding is one of the most widely used techniques in natural language processing (NLP) and appears throughout enterprise modeling practice. With word embeddings, we can map natural-language text into a numerical representation a computer can work with, and then feed it into a neural network for learning and computation. How can we better understand word embeddings and start generating them quickly? This article explains the principle behind word embeddings and how to generate them.

## 1. On word embedding

### What is word embedding

In one sentence: a word embedding is a word vector, i.e. a function that maps words to vectors. We know that in machine learning, features are passed around as numerical values; likewise, in NLP, text features must be mapped to numerical vectors. For example, after embedding the word "hello", we might map it to a 5-dimensional vector: hello → (0.1, 0.5, 0.3, 0.2, 0.2).
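As a minimal illustration, an embedding can be thought of as a lookup table from words to dense vectors (the numbers below are made up for illustration, not from a trained model):

```python
# A word embedding maps each word to a dense numerical vector.
# These values are illustrative only, not learned by any model.
embedding = {
    "hello": [0.1, 0.5, 0.3, 0.2, 0.2],
    "world": [0.4, 0.1, 0.0, 0.7, 0.3],
}

vec = embedding["hello"]
print(len(vec))  # a 5-dimensional vector
```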

### Mapping process of word vector

Generally speaking, we realize text word vectorization through the mapping chain "word → vector space 1 → vector space 2". The whole mapping process can be divided into two steps:

#### 1. Word → vector space 1

This step solves the problem of converting a word into a vector, for example converting a text word into a one-hot vector.

#### 2. Vector space 1 → vector space 2

This step solves the problem of optimizing the vector: given an initial vector representation, find a better one.

## 2. Finding word embeddings with one-hot encoding and SVD

### One-hot (word → vector space 1)

One-hot encoding is one of the most common methods for extracting text features. We use it here to complete the first step of the mapping process, word → vector space 1.

We treat each word in the corpus as a feature column. If the corpus contains V words, there are V feature columns, and each word is represented by a V-dimensional vector with a 1 in its own column and 0 everywhere else.

This mapping has the following disadvantages: 1) it easily produces sparse features; 2) it easily causes a dimensional explosion; 3) it loses the semantic relationships between words.

For example, common sense says that "hotel" and "motel" should be somewhat similar, but under this mapping their dot product is 0. The similarity between "hotel" and "motel" is thus equal to that between "hotel" and "cat", which is obviously unreasonable.
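A small sketch of this problem, using a toy three-word vocabulary built from the example above (the code is illustrative only):

```python
import numpy as np

# Toy vocabulary; each word becomes a V-dimensional one-hot vector.
vocab = ["hotel", "motel", "cat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Dot products between distinct one-hot vectors are always 0,
# so "hotel" is no more similar to "motel" than it is to "cat".
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
```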

**Improvement direction:**

1) Map the word vectors into a lower-dimensional space;

2) At the same time, preserve the semantic similarity of the word vectors in that low-dimensional space, so that the more related two words are, the closer their vectors lie.

### SVD (vector space 1 → vector space 2)

#### 1. How to express the relationship between words

SVD (singular value decomposition) is an algorithm widely used across machine learning. It serves as the feature decomposition step in dimensionality-reduction algorithms and is also widely applied in recommendation systems, natural language processing and other fields; it is a cornerstone of many machine learning algorithms. Here we use SVD to solve the vector-optimization problem.

First, we construct an affinity matrix so that the relationships between words are captured before any dimensionality reduction. There are many ways to construct an affinity matrix; here are two common ones.

✦ **Mode 1**

Suppose we have n articles containing m deduplicated words; we can then construct an m × n affinity matrix of word counts.

Each value in this matrix represents the number of occurrences of a word in an article. Such a matrix reflects some properties of words: the word "sowing", for instance, will appear more often in agronomy articles, while the word "film" will appear more often in art articles.
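A minimal sketch of this construction with two toy "articles" (the documents and vocabulary are invented for illustration):

```python
import numpy as np

# n documents, m deduplicated words -> an m x n count matrix.
docs = [
    "sowing and harvest in agronomy",
    "the film and the art of film",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for d in docs] for w in vocab])

# "film" appears twice in the second (art) document, never in the first.
print(X[vocab.index("film")])  # [0 2]
```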

✦ **Mode 2**

Assuming we have m deduplicated words, we can instead construct an m × m matrix in which each value represents the number of times the corresponding pair of words appears together in an article.
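A sketch of such a co-occurrence matrix, counting in how many toy documents each pair of words appears together (the documents are invented for illustration):

```python
import numpy as np

docs = [
    ["hotel", "motel", "room"],
    ["hotel", "room", "price"],
]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# m x m matrix: how many documents each pair of distinct words shares.
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for d in docs:
    uniq = sorted(set(d))
    for a in uniq:
        for b in uniq:
            if a != b:
                C[idx[a], idx[b]] += 1

print(C[idx["hotel"], idx["room"]])  # "hotel" and "room" co-occur in 2 documents
```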

#### 2. Decompose the affinity matrix

With the affinity matrix in hand, we can apply SVD to it; the purpose is dimensionality reduction. The decomposition produces three factors.

We decompose the original affinity matrix X into three parts, which can be understood from left to right as follows:

✦ **U matrix**: a transformation from the old high-dimensional vector space to the low-dimensional vector space;

✦ **Σ matrix**: the variance (singular value) matrix. Each diagonal value represents the information content of one coordinate axis in the low-dimensional space: the larger the variance, the richer the information, indicating that the data fluctuates significantly along that axis. In dimensionality reduction, we prioritize keeping the axes with the largest variance;

✦ **V matrix**: a new representation of each word vector. Multiplying by the first two matrices yields the final word-vector representation.

At this point the matrices on the right are still V-dimensional and no dimensionality reduction has taken place. So, as mentioned earlier, we keep the top-k columns with the largest variance, arranging the U, Σ and V matrices in order of decreasing variance, and thereby obtain the final reduced result.
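A minimal numpy sketch of this truncation, using an invented 3 × 3 affinity matrix (`np.linalg.svd` already returns the singular values in descending order):

```python
import numpy as np

# A toy symmetric affinity (co-occurrence) matrix X, for illustration.
X = np.array([
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# Full SVD: X = U @ diag(s) @ Vt, with singular values sorted descending.
U, s, Vt = np.linalg.svd(X)

# Keep only the top-k axes with the largest singular values.
k = 2
word_vectors = U[:, :k] * s[:k]  # each row: a k-dimensional word vector

print(word_vectors.shape)  # (3, 2): three words, two dimensions kept
```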

#### 3. SVD disadvantages

1) The dimension of the affinity matrix changes frequently, because new words keep being added and the SVD must be recomputed each time, so the method does not generalize well; 2) the affinity matrix can be sparse, because many word pairs never appear together.

**Improvement ideas:**

1) To reduce sparsity, focus only on the words that have a contextual relationship with a given word; 2) for words the model has never seen, consider inferring their information from context to improve generality.

Following these two ideas, we can introduce CBOW and skip-gram to learn word embeddings.

## 3. CBOW and skip-gram for word embedding

The full name of CBOW is continuous bag of words. Its essence is to predict the center word from its context words. The skip-gram algorithm does the reverse: given a center word, it predicts which words form its context.

The theme of this article is embedding, so it is worth noting that the ultimate purpose of predicting center words and contexts is to train the semantic relationships between words while reducing dimensionality, and thus obtain the embedding we are after.

### CBOW

**Idea:**

Suppose we know a center word and the string of context words around it.

We try to train a matrix V that maps words into a new vector space (this is the embedding we want!).

At the same time, we also train a matrix U that maps the embedded vectors into a probability space, computing for each word the probability that it is the center word.

**Training process:**

(1) Assume x^(c) is the center word and the context length is m; the context sample can then be expressed as (x^(c−m), …, x^(c−1), x^(c+1), …, x^(c+m)), where each element is a one-hot vector.

(2) We hope to map these one-hot vectors into a lower-dimensional space via word embedding: a function that maps words into a lower-dimensional space so as to reduce sparsity while preserving the semantic relationships between words.

(3) After obtaining the embeddings, we take the average of the context vectors. The reason for averaging is that these words are contextually connected; for convenience of training, we can express them in this more compact way.

(4) In this way, we obtain an average embedding of the text in the low-dimensional space.

Next, we need to train a parameter matrix that operates on the average embedding and outputs, for each word in the vocabulary, the probability that it is the center word.
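The steps above can be sketched as a CBOW forward pass (the vocabulary size, dimensions and random matrices here are illustrative; a real model would learn `W_in` and `W_out` by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3  # vocabulary size, embedding dimension

# Two trainable matrices (randomly initialised here for illustration):
# W_in  maps one-hot words into the embedding space,
# W_out maps the averaged embedding back to vocabulary scores.
W_in = rng.normal(size=(V, d))
W_out = rng.normal(size=(d, V))

context_ids = [0, 2, 4]                        # indices of the context words
h = W_in[context_ids].mean(axis=0)             # average context embedding
scores = h @ W_out                             # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax -> center-word probabilities

print(probs.sum())  # the probabilities sum to 1
```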

To review, the one-stop CBOW training process runs: one-hot context vectors → embedding matrix V → averaged embedding → scoring matrix U → softmax probabilities for the center word.

### Training the scoring parameter matrix with softmax

The scoring parameter matrix is trained with a softmax output and a cross-entropy loss.
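Since the original formula is not reproduced here, the standard cross-entropy loss between the predicted distribution ŷ and the one-hot true distribution y would read:

```latex
H(\hat{y}, y) = -\sum_{j=1}^{V} y_j \log \hat{y}_j
```

Because y is one-hot at the true center word c, this reduces to minimizing −log ŷ_c, i.e. maximizing the predicted probability of the true center word.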

### skip-gram

Skip-gram takes the known center word and predicts its context words; its derivation mirrors CBOW's, so it is not repeated here.

## Summary

This article has explained the principle of word embeddings and how to generate them, and answered common questions that arise along the way, in the hope of helping readers use word embeddings more efficiently in practice.

Today, machine learning is developing rapidly and being applied in many industry scenarios. As a data intelligence enterprise, getui continues to explore large-scale machine learning and natural language processing, and also applies word embedding to label modeling. Getui has now built a three-dimensional portrait system covering thousands of labels, continuously helping customers in mobile Internet, brand marketing, public services and other fields to carry out user insight, population analysis and data operations.

Follow-up posts will continue to share practical content on algorithm modeling and machine learning. Stay tuned.