Introduction to NLP (1): The Bag-of-Words Model and Sentence Similarity

Time: 2019-08-13

This article is the first in the author's NLP introductory series; more NLP articles will follow.
In this article, we will introduce the bag-of-words model and show how to use it to calculate the similarity between two sentences (via cosine similarity).
First, let's look at what a bag-of-words model is, taking the following two simple sentences as examples:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

Usually, NLP cannot process a complete paragraph or document at once, so the first steps are typically sentence splitting and word tokenization. Since we already have individual sentences here, only tokenization is needed. For English sentences, you can use the word_tokenize function from NLTK; for Chinese sentences, you can use the jieba module. So the first step is tokenization; the code is as follows:

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

The output is as follows:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
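If NLTK is not installed, a rough approximation of this tokenization can be sketched with the standard-library re module (this helper, simple_tokenize, is not part of the original code and does not cover all the cases word_tokenize handles):

```python
import re

def simple_tokenize(sent):
    # Match runs of word characters, or any single non-space,
    # non-word character (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", sent)

print(simple_tokenize("I love sky, I love sea."))
# ['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.']
```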

Tokenization is done. The next step is to build a corpus of all the words and punctuation marks in the sentences. The code is as follows:

all_list = []
for text in texts:
    all_list += text
corpus = set(all_list)
print(corpus)

The output is as follows:

{'love', 'running', 'reading', 'sky', '.', 'I', 'like', 'sea', ','}

As you can see, the corpus contains nine words and punctuation marks. Next, we establish a numeric mapping for the words and punctuation in the corpus, to make the subsequent vector representation of sentences easier. The code is as follows:

corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

The output is as follows:

{'running': 1, 'reading': 2, 'love': 0, 'sky': 3, '.': 4, 'I': 5, 'like': 6, 'sea': 7, ',': 8}
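Because Python sets are unordered, this mapping can differ between runs. If a reproducible mapping is wanted, one option (not in the original code) is to sort the corpus before numbering it:

```python
# The corpus set built in the previous step
corpus = {'love', 'running', 'reading', 'sky', '.', 'I', 'like', 'sea', ','}

# Sorting first gives the same mapping on every run
corpus_dict = dict(zip(sorted(corpus), range(len(corpus))))
print(corpus_dict)
# {',': 0, '.': 1, 'I': 2, 'like': 3, 'love': 4,
#  'reading': 5, 'running': 6, 'sea': 7, 'sky': 8}
```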

Although the words and punctuation marks are not numbered in the order in which they appear, this does not affect the vector representation of the sentences or the similarity computed from it afterwards.
The next step, which is the key step of the bag-of-words model, is to build the vector representation of each sentence. This vector does not simply record the presence or absence of a word or punctuation mark as 0 or 1; instead, it uses the frequency of occurrence as the corresponding value. Combining this with the corpus dictionary above, the code for the vector representation of sentences is as follows:

# Establishing Vector Representation of Sentences
def vector_rep(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))

    vec = sorted(vec, key=lambda x: x[0])

    return vec

vec1 = vector_rep(texts[0], corpus_dict)
vec2 = vector_rep(texts[1], corpus_dict)
print(vec1)
print(vec2)

The output is as follows:

[(0, 2), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 0), (7, 1), (8, 1)]
[(0, 1), (1, 1), (2, 1), (3, 0), (4, 1), (5, 2), (6, 1), (7, 0), (8, 1)]

Let's pause for a moment and look at these vectors. In the first sentence, the word I appears twice, and in the corpus dictionary the number corresponding to I is 5; therefore the tuple (5, 2) in the first list means that I occurs twice in the first sentence. The output above may not be very intuitive; the actual vectors of the two sentences are:

[2, 0, 0, 1, 1, 2, 0, 1, 1]
[1, 1, 1, 0, 1, 2, 1, 0, 1]
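Since vector_rep already sorts the tuples by index, the dense form can be obtained by dropping the indices. A small helper for this (dense_vector, not part of the original code) might look like:

```python
# Turn the sorted (index, count) pairs into a plain frequency vector
def dense_vector(vec):
    return [count for _, count in vec]

vec1 = [(0, 2), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 0), (7, 1), (8, 1)]
print(dense_vector(vec1))
# [2, 0, 0, 1, 1, 2, 0, 1, 1]
```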

OK, that's the end of the bag-of-words model. Next, we will use the vector representations we just obtained to calculate the similarity of the two sentences.
In NLP, once the vector representations of two sentences are available, cosine similarity is usually chosen as their similarity measure; the cosine similarity of two vectors is the cosine of the angle between them. The Python code for the calculation is as follows:

from math import sqrt
def similarity_with_2_sents(vec1, vec2):
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner_product += tup1[1]*tup2[1]
        square_length_vec1 += tup1[1]**2
        square_length_vec2 += tup2[1]**2

    return (inner_product/sqrt(square_length_vec1*square_length_vec2))


cosine_sim = similarity_with_2_sents(vec1, vec2)
print('The cosine similarity of the two sentences is: %.4f.' % cosine_sim)

The output is as follows:

The cosine similarity of the two sentences is 0.7303.
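As a sanity check, we can plug the dense vectors shown earlier directly into the cosine similarity formula:

```python
from math import sqrt

# The dense frequency vectors of the two sentences from above
v1 = [2, 0, 0, 1, 1, 2, 0, 1, 1]
v2 = [1, 1, 1, 0, 1, 2, 1, 0, 1]

dot = sum(a * b for a, b in zip(v1, v2))   # inner product: 8
norm1 = sqrt(sum(a * a for a in v1))       # |v1| = sqrt(12)
norm2 = sqrt(sum(b * b for b in v2))       # |v2| = sqrt(10)
print(round(dot / (norm1 * norm2), 4))     # 0.7303
```

That is, cos(theta) = 8 / sqrt(12 * 10) ≈ 0.7303, matching the output of the function above.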

In this way, we can obtain the similarity of two sentences from their bag-of-words representations.
Of course, in actual NLP projects, if we need to calculate the similarity of two sentences, we can simply call the gensim module. gensim is a powerful NLP toolkit that can help us with many NLP tasks. Here is the code to calculate the similarity of the two sentences with gensim:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]
print(texts)

from gensim import corpora
from gensim.similarities import Similarity

# Corpus
dictionary = corpora.Dictionary(texts)

# Using doc2bow as Word Bag Model
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)
# Getting the Similarity of Sentences
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

cosine_sim = similarity[test_corpus_1][1]
print('Using gensim, the similarity of the two sentences is: %.4f.' % cosine_sim)

The output is as follows:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
Similarity index with 2 documents in 0 shards (stored under -Similarity-index)
Using gensim, the similarity of the two sentences is: 0.7303.

Note that the following warnings may appear when running the code:

gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

gensim\matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int32 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):

If you want to suppress these warnings, add the following code before importing the gensim module:

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
warnings.filterwarnings(action='ignore', category=FutureWarning, module='gensim')

This is the end of this article. Thank you for reading! If anything is wrong, please contact the author; comments and exchanges are welcome! Good luck to you~

Note: I have opened a WeChat public account: Python Crawler and Algorithms (WeChat ID: easy_web_scrape). You are welcome to follow it~
