Introduction to NLP (2): Exploring the Principle of TF-IDF

Time: 2019-8-12

Introduction to TF-IDF

TF-IDF is a commonly used statistical method in NLP. It evaluates how important a word is to a document within a document set or corpus. TF-IDF is usually used to extract the features of a text, i.e. its keywords. A word's importance increases with the number of times it appears in a document, but decreases with how frequently it appears across the whole corpus.
In NLP, the formula of TF-IDF is as follows:

            tfidf = tf*idf.

Here, TF is Term Frequency and IDF is Inverse Document Frequency.
TF is the word frequency, i.e. the frequency with which a word appears in a document. Suppose a word appears i times in a document that contains N words in total; then its TF value is i/N.
IDF is the inverse document frequency. Assuming the corpus contains n documents and a word appears in k of them, its IDF value is

            idf=log2(n/k).

Of course, the IDF formula varies slightly from place to place. For example, some implementations add 1 to the denominator k to prevent division by zero, and some add 1 to the logarithm term itself; these are smoothing techniques. This article uses the original IDF formula, because it is consistent with the formula used in gensim.
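As a quick sketch of these variants (the exact smoothed form differs from library to library, so treat idf_smooth below as just one illustrative choice, not a standard definition):

```python
import math

def idf_raw(n_docs, k):
    # the formula used in this article: log2(n / k)
    return math.log2(n_docs / k)

def idf_smooth(n_docs, k):
    # one common smoothed variant: adding 1 to the denominator
    # avoids division by zero for a word that appears in no document
    return math.log2(n_docs / (1 + k))

print(idf_raw(4, 1))     # 2.0
print(idf_smooth(4, 1))  # 1.0
```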
Assuming the corpus contains D documents, the TF-IDF value of word i in document j is

            tfidf(i, j) = tf(i, j) * idf(i).

That is how TF-IDF is calculated.
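The formulas above can be checked with a tiny hand-rolled example (a hypothetical three-document toy corpus, not the sample texts used later in this article):

```python
import math

# toy corpus: three "documents", each a list of words
docs = [["a", "b", "a"], ["a", "c"], ["b", "c", "c"]]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    k = sum(1 for d in docs if word in d)  # documents containing the word
    return math.log2(len(docs) / k)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "a" has tf = 2/3 in the first document and appears in 2 of the 3 documents
print(round(tfidf("a", docs[0], docs), 5))  # 2/3 * log2(3/2) ≈ 0.38998
```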

Text Introduction and Preprocessing

We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

These three passages are about football, basketball and volleyball respectively; together they form our document set.
Next comes the text preprocessing part.
First the line breaks are removed from the text; then it is split into sentences, tokenized, and stripped of punctuation. The complete Python code is as follows; the input parameter is text:

import nltk
import string

# Text preprocessing
# Function: split the text into sentences, tokenize, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

Next, remove the stop words from the article, then count the number of occurrences of each word. The complete Python code is as follows; the input parameter is text:

from nltk.corpus import stopwords  # stop words
from collections import Counter

# Remove the stop words from the original text
# and generate the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]  # remove stop words
    count = Counter(filtered)
    return count

Taking text3 as an example, the generated count dictionary is as follows:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})

TF-IDF in Gensim

After preprocessing, each of the three sample texts yields a count dictionary, i.e. the number of occurrences of each word in that text. Next, we use the TF-IDF model implemented in gensim to output the top three words by TF-IDF in each article along with their TF-IDF values. The complete code is as follows:

from nltk.corpus import stopwords  # stop words
from gensim import corpora, models, matutils

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

The output results are as follows:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

The output matches our expectations: the keywords football and rugby are extracted from the football article, play and cm from the basketball article, and net and teammate from the volleyball article.

Practice TF-IDF Model by Yourself

With the above understanding of the TF-IDF model, we can implement it ourselves, which is the best way to learn an algorithm!
The following is the author's own TF-IDF code (it builds on the text preprocessing code above):

import math

# Calculate TF
def tf(word, count):
    return count[word] / sum(count.values())

# Calculate how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Calculate IDF
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

# Calculate TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output results are as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113

It can be seen that the keywords extracted by my own TF-IDF implementation are consistent with gensim's. As for why the last two words for the basketball article differ: several words there have exactly the same TF-IDF value, so which of them is listed is arbitrary. But there is a problem: the computed TF-IDF values themselves are different. What is the reason?
See the source code for calculating TF-IDF values in gensim (https://github.com/RaRe-Techn…):

In other words, gensim normalizes the resulting TF-IDF vectors, transforming them into unit vectors. Therefore, we need to add a normalization step to the code above. The code is as follows:

import numpy as np

# Normalize vectors
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output results are as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

The output results are consistent with those of gensim.
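A quick sanity check on the normalization step: after dividing by the L2 norm, the vector of TF-IDF scores has length 1, which is why the per-word values change while their ranking does not (the three values below are the document-1 scores from the un-normalized run, used purely for illustration):

```python
import math

scores = [0.30677, 0.07669, 0.05113]  # un-normalized TF-IDF values
norm = math.sqrt(sum(s * s for s in scores))
unit = [s / norm for s in scores]

# the normalized vector has L2 norm 1
print(round(math.sqrt(sum(u * u for u in unit)), 6))  # 1.0
```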

Summary

Gensim is a well-known Python module for NLP; read more of its source code when you have time. In future posts we will continue to introduce applications of TF-IDF in other fields. You are welcome to share your thoughts.~

Note: I have opened a WeChat public account: Python Crawler and Algorithms (WeChat ID: easy_web_scrape). You are welcome to follow it.~~

The complete code of this article is as follows:

import nltk
import math
import string
from nltk.corpus import stopwords  # stop words
from collections import Counter
from gensim import corpora, models, matutils

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Function: split the text into sentences, tokenize, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

# Remove the stop words from the original text
# and generate the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]  # remove stop words
    count = Counter(filtered)
    return count

# Calculating TF
def tf(word, count):
    return count[word] / sum(count.values())
# Calculate how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Calculate IDF
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

# Calculate TF-IDF
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# Normalize vectors
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
        
"""
Output results:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""