Python jieba Library

Time: 2022-8-2

The jieba library is an excellent third-party Python library for Chinese word segmentation. Chinese text has no spaces between words, so it must be run through word segmentation before individual words can be extracted.

Jieba library installation

Open a CMD window as administrator and run the command: pip install jieba

Introduction to jieba library functions

Features

  • Supports three word segmentation modes
    • Precise mode: cuts the sentence as accurately as possible; suitable for text analysis
    • Full mode: scans out every possible word in the sentence; very fast, but cannot resolve ambiguity
    • Search engine mode: based on precise mode, long words are segmented again to improve recall; suitable for search engine segmentation
  • Supports traditional Chinese word segmentation
  • Supports custom dictionaries

Word segmentation function

  • The jieba.cut and jieba.lcut methods accept two parameters
    • The first parameter is the string to be segmented
    • The cut_all parameter controls whether full mode is used

jieba.cut returns a generator, while jieba.lcut converts the result into a list and returns it

  • The jieba.cut_for_search and jieba.lcut_for_search methods accept one parameter
    • The string to be segmented

This method produces a finer-grained segmentation, suitable for building the inverted index of a search engine.
jieba.lcut_for_search returns the result as a list.
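
A minimal sketch of the two calling styles, assuming jieba is installed; the sample sentence is only an illustration:

import jieba

# jieba.cut returns a generator that yields one word at a time
for word in jieba.cut("我来到北京清华大学"):
    print(word)

# jieba.lcut and jieba.lcut_for_search return plain Python lists
print(jieba.lcut("我来到北京清华大学"))
print(jieba.lcut_for_search("我来到北京清华大学"))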

Adding a custom dictionary

Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words on its own, adding them explicitly ensures higher accuracy.

Usage

  1. Use a custom dictionary file (see the sketch after this list)
    • jieba.load_userdict(file_name)  # file_name is the path of the custom dictionary file
  2. Modify the dictionary dynamically in a program
    • jieba.add_word(new_words)  # new_words is the new word to add
    • jieba.del_word(words)  # delete a word
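
A minimal sketch of both approaches; the file name userdict.txt and the sample words are only illustrative. Each line of a custom dictionary file contains one word, optionally followed by its frequency and part-of-speech tag, separated by spaces:

import jieba

# userdict.txt (UTF-8), one entry per line: word [frequency] [POS tag], e.g.
#   云计算 5
#   创新办 3 i
jieba.load_userdict("userdict.txt")  # assumes this file exists next to the script

# or adjust the dictionary at runtime
jieba.add_word("石墨烯")    # add a single word
jieba.del_word("自定义词")  # remove a word

print(jieba.lcut("李小福是创新办主任也是云计算方面的专家"))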

Keyword extraction

  • jieba.analyse.extract_tags(sentence, topK)  # you need to import jieba.analyse first

sentence is the text from which keywords are extracted
topK is the number of keywords with the highest TF-IDF weights to return; the default is 20
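
A minimal sketch of keyword extraction; the sample text is only an illustration:

import jieba.analyse

text = "自然语言处理是人工智能领域的一个重要方向，中文分词是自然语言处理的基础"
print(jieba.analyse.extract_tags(text, topK=5))  # the 5 keywords with the highest TF-IDF weight

# withWeight=True also returns each keyword's weight
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)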

Part of speech tagging

  • jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom part-of-speech tokenizer; the tokenizer parameter specifies the jieba.Tokenizer used internally

jieba.posseg.dt is the default part-of-speech tokenizer.
It tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS.
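
A minimal sketch of a custom part-of-speech tokenizer, assuming the default dictionary (case 5 below shows the more common module-level usage):

import jieba
import jieba.posseg as pseg

tk = jieba.Tokenizer()              # an independent tokenizer with the default dictionary
pos_tk = pseg.POSTokenizer(tokenizer=tk)

for pair in pos_tk.cut("我爱北京天安门"):
    print(pair.word, pair.flag)     # each word with its ICTCLAS-compatible POS tag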

Cases

1、 Precise mode

import jieba
list1 = jieba.lcut("中华人民共和国是一个伟大的国家")  # "The People's Republic of China is a great country"
print(list1)
print("Precise mode: " + "/".join(list1))

2、 Full mode

list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print(list2)
print("Full mode: " + "/".join(list2))

3、 Search engine mode

list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print(list3)
print("Search engine mode: " + "/".join(list3))

4、 Modify dictionary

import jieba
Text = "CITIC construction investment company invested in a game, and CITIC also invested in a game company"
word = jieba.lcut(text)
print(word)

# Add words
jieba.add_word("中信建投")  # "CSC"
jieba.add_word("投资公司")  # "investment company"
word1 = jieba.lcut(text)
print(word1)

# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)

5、 Part of speech tagging

import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")  # "I love Tiananmen in Beijing"
for i in words:
    print(i.word, i.flag)

6、 Count the number of appearances of each character in Romance of the Three Kingdoms

First download the text of Romance of the Three Kingdoms

import  jieba

txt = open("file path", "r", encoding='utf-8').read()  # open and read the file (replace "file path" with the actual path)
words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # store each word and its count as key-value pairs

for word in words:
    if len(word) == 1:  # single characters are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

items = list(counts.items())  # convert the key-value pairs into a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count in descending order

for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

The complete version below additionally merges different names used for the same character and excludes common words that are not character names:

import jieba

excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}  # frequent words that are not character names
txt = open("三国演义.txt", "r", encoding='utf-8').read()
words  = jieba.lcut(txt)
counts = {}

for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":   # Zhuge Liang / "Kongming said"
        rword = "孔明"                           # Kongming
    elif word == "关公" or word == "云长":       # Lord Guan / Yunchang
        rword = "关羽"                           # Guan Yu
    elif word == "玄德" or word == "玄德曰":     # Xuande / "Xuande said"
        rword = "刘备"                           # Liu Bei
    elif word == "孟德" or word == "丞相":       # Mengde / the prime minister
        rword = "曹操"                           # Cao Cao
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1     # count outside the else branch so every alias is counted
    
for i in excludes:
    del counts[i]
    
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True) 

for i in range(10):
    word, count = items[i]
    print ("{0:<10}{1:>5}".format(word, count))