Text Representation in NLP

Time: 2019-10-3

Introduction

When we train a model, we must convert words, sentences, and texts into vectors or matrices, because we cannot pass raw text directly to the computer for computation. How to convert text into vectors is the subject of this article.

Before diving in, you should be familiar with a few concepts.

  1. Lexicon: the set of all words in the training data; for Chinese text it can be built with the Jieba word segmenter.
  2. Confusion Matrix: a summary table used in data science, data analysis, and machine learning to analyze the predictions of a classification model. In matrix form, it tallies the records in a data set along two dimensions: the true category and the category predicted by the model.

Representation of Words – one-hot

The position corresponding to the word is 1; every other position is 0.

Each word is represented as a vector whose dimension is the size of the lexicon; exactly one entry is 1 and all the others are 0.

For example:

Dictionary: [We, again, go, climb mountains, today, you, yesterday, run]

The vector for “We” is: [1, 0, 0, 0, 0, 0, 0, 0]
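
Below is a minimal Python sketch of this encoding (the English lexicon and the one_hot helper are illustrative assumptions, not code from the original post):

    # One-hot encoding over the example lexicon (illustrative sketch).
    lexicon = ["We", "again", "go", "climb mountains", "today", "you", "yesterday", "run"]

    def one_hot(word, lexicon):
        """Return a vector with a 1 at the word's index and 0 everywhere else."""
        vec = [0] * len(lexicon)
        vec[lexicon.index(word)] = 1
        return vec

    print(one_hot("We", lexicon))  # [1, 0, 0, 0, 0, 0, 0, 0]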

Shortcomings

  1. It wastes space: the vectors are high-dimensional and almost entirely zeros.
  2. It cannot express relationships between words.

For example, under the one-hot representation, “boat” and “ship” are completely unrelated.

Representation of Sentences – boolean

The vector has the dimension of the lexicon. Each sentence is mapped to a vector that records whether each lexicon word occurs in it: 1 if it occurs, 0 if it does not. For example:

Dictionary: [We, again, go, climb mountains, today, you, yesterday, run]

“We” appears, so 1; “again” does not appear, so 0; “go” appears, so 1; “climb mountains” appears, so 1; “today” appears, so 1; “you” does not appear, so 0; “yesterday” does not appear, so 0; “run” appears, so 1.

Example: “We go climb mountains today, and we go running tomorrow” (1, 0, 1, 1, 1, 0, 0, 1). Note that “tomorrow” is not in the lexicon, so it is ignored.
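
A minimal sketch of this boolean encoding in Python (the tokenized sentence below is an assumed English rendering of the example):

    # Boolean sentence vector: 1 if the lexicon word occurs, 0 otherwise.
    lexicon = ["We", "again", "go", "climb mountains", "today", "you", "yesterday", "run"]

    def boolean_vector(tokens, lexicon):
        # Words outside the lexicon (e.g. "tomorrow") are simply ignored.
        return [1 if word in tokens else 0 for word in lexicon]

    tokens = ["We", "go", "climb mountains", "today", "We", "go", "run", "tomorrow"]
    print(boolean_vector(tokens, lexicon))  # [1, 0, 1, 1, 1, 0, 0, 1]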

Representation of Sentences – count

The vector has the dimension of the lexicon, and each sentence is mapped to the number of times each lexicon word occurs in it, as in the following example:

Dictionary: [We, again, go, climb mountains, today, you, yesterday, run]

“We” occurs 2 times, “again” occurs 0 times, “go” occurs 2 times, “climb mountains” occurs 1 time, “today” occurs 1 time, “you” occurs 0 times, “yesterday” occurs 0 times, and “run” occurs 1 time.

That is, we count the number of times each word appears.

Example: “We go climb mountains today, and we go running tomorrow” (2, 0, 2, 1, 1, 0, 0, 1)
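
A sketch of the count encoding, reusing the same assumed tokenization:

    from collections import Counter

    # Count vector: how many times each lexicon word occurs in the sentence.
    lexicon = ["We", "again", "go", "climb mountains", "today", "you", "yesterday", "run"]

    def count_vector(tokens, lexicon):
        counts = Counter(tokens)          # Counter returns 0 for absent words
        return [counts[word] for word in lexicon]

    tokens = ["We", "go", "climb mountains", "today", "We", "go", "run", "tomorrow"]
    print(count_vector(tokens, lexicon))  # [2, 0, 2, 1, 1, 0, 0, 1]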

Representation of Sentences – TF-IDF

The idea behind TF-IDF is that the more often a word appears in its own document, and the less often it appears in other documents, the better that word characterizes the document it belongs to.

TF-IDF = TF * IDF

TF = term frequency, the number of times the word occurs in the document

IDF = inverse document frequency, a measure of how important the word is:

IDF = log(N / N(w))

where:

N: the total number of documents

N(w): the number of documents in which the word w occurs; the more documents contain w, the less important w is

TF-IDF = Word Frequency * Importance

Example: compute the TF-IDF vector representations of three sentences.

For example, for “today” in the first sentence: TF = 1 (the number of times “today” occurs in that sentence).

IDF = log(N/N(w))

N, the total number of documents, is 3; N(w), the number of documents in which “today” occurs, is 2 (once in the first sentence and once in the second).

So IDF = log(3/2). The other entries are computed in the same way.

[Figure: the three example sentences and their TF-IDF vectors]
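
A sketch of the computation in Python. The three documents below are placeholders (the originals appeared only in the figure); they are chosen so that “today” occurs in two of the three documents, matching the worked numbers above:

    import math

    # Placeholder documents, tokenized; "today" appears in two of them.
    docs = [
        ["We", "go", "climb mountains", "today"],
        ["you", "run", "today"],
        ["We", "go", "run", "yesterday"],
    ]

    def tf_idf(word, doc, docs):
        tf = doc.count(word)                     # TF: occurrences in this document
        n_w = sum(1 for d in docs if word in d)  # N(w): documents containing the word
        return tf * math.log(len(docs) / n_w)    # TF * log(N / N(w))

    print(tf_idf("today", docs[0], docs))  # 1 * log(3/2) ≈ 0.405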

Deficiencies of TF-IDF

The TF-IDF algorithm rests on the assumption that the most meaningful words for distinguishing documents are those that occur frequently in a given document but rarely in the other documents of the collection. If TF (term frequency) is used as the measure along the axes of the feature space, it can therefore reflect characteristics shared by similar texts.

In essence, however, IDF is a weighting scheme that tries to suppress noise: it simply assumes that words with low document frequency are more important and words with high document frequency are less useful. This is clearly not entirely correct. The simple structure of IDF cannot adequately reflect how important a word is or how feature words are distributed across documents, so it cannot fully perform its weight-adjustment role. As a result, the precision of the TF-IDF method is not very high.

Supplementary concepts:

Confusion Matrix: the table from which metrics such as precision and recall are computed.

True positive (TP): actually positive, predicted positive
False positive (FP): actually negative, predicted positive
False negative (FN): actually positive, predicted negative
True negative (TN): actually negative, predicted negative

Take search (information retrieval) as an example:

Precision: the proportion of correct results among everything returned (the returned results include both correct and incorrect ones).

Recall: the proportion of all correct (positive) results that were retrieved, counting both those returned and those not returned.

precision = TP / (TP + FP)

recall = TP / (TP + FN)
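
A small sketch that computes both metrics from confusion-matrix counts (the numbers are hypothetical):

    def precision_recall(tp, fp, fn):
        """Precision and recall from confusion-matrix counts."""
        return tp / (tp + fp), tp / (tp + fn)

    # Hypothetical search run: 8 relevant results returned (TP),
    # 2 irrelevant results returned (FP), 4 relevant results missed (FN).
    print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)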

Reference: https://www.greedyai.com/

Andrew Ng – “Machine Learning”: https://www.coursera.org/lear…