The jieba library is an excellent third-party Python library for Chinese word segmentation. Chinese text must be segmented before individual words can be obtained from it.
Jieba library installation
Run a CMD window as administrator and enter the command: pip install jieba
Jieba library function introduction
Features
- Supports three word segmentation modes
  - Precise mode: cuts the sentence as accurately as possible; suitable for text analysis
  - Full mode: scans out every word in the sentence that can form a word; very fast, but it cannot resolve ambiguity
  - Search engine mode: on the basis of precise mode, long words are segmented again to improve recall; suitable for search engine segmentation
- Supports traditional Chinese word segmentation
- Supports custom dictionaries
Word segmentation function
- jieba.cut and jieba.lcut accept two parameters
  - The first parameter is the string to be segmented
  - The cut_all parameter controls whether full mode is used
  - lcut returns the result as a list instead of the generator returned by cut
- jieba.cut_for_search and jieba.lcut_for_search accept one parameter
  - the string to be segmented
  - This mode is suitable for segmentation when building a search engine's inverted index; the granularity is fine
  - jieba.lcut_for_search returns a list (see the sketch below)
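A minimal sketch of the difference between these functions; the example sentence is only an assumption, and any Chinese text behaves the same way:

import jieba

sentence = "中华人民共和国是一个伟大的国家"  # example sentence (assumption)

gen = jieba.cut(sentence)                   # precise mode; returns a generator
print("/".join(gen))
print(jieba.lcut(sentence, cut_all=True))   # full mode; returns a list
print(jieba.lcut_for_search(sentence))      # search engine mode; returns a list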
Adding a custom dictionary
Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage
- Use a custom dictionary file (a sketch of the file format follows this list)
  - jieba.load_userdict(file_name)  # file_name is the path of the custom dictionary file
- Modify the dictionary dynamically inside a program
  - jieba.add_word(new_word)  # new_word is the new word you want to add
  - jieba.del_word(word)  # delete a word
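A minimal sketch of the dictionary file, assuming it is named userdict.txt: each line holds one word, an optional frequency, and an optional part-of-speech tag, separated by spaces. It could be loaded like this:

# userdict.txt might contain, one entry per line (word [frequency] [POS tag]):
#   中信建投 10 nt
#   投资公司 5 n
import jieba

jieba.load_userdict("userdict.txt")          # load the custom dictionary (file name is an assumption)
print(jieba.lcut("中信建投投资了一款游戏"))   # the custom words are now kept intact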
Keyword extraction
- jieba.analyse.extract_tags(sentence, topK)  # you need to import jieba.analyse first
  - sentence is the text from which keywords are extracted
  - topK is the number of keywords with the highest TF-IDF weight to return; the default is 20 (a short sketch follows)
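A minimal sketch of keyword extraction; the sample sentence and the topK value are assumptions:

import jieba.analyse  # the keyword-extraction submodule must be imported explicitly

text = "中华人民共和国是一个伟大的国家"                # sample text (assumption)
keywords = jieba.analyse.extract_tags(text, topK=5)    # top 5 keywords by TF-IDF weight
print(keywords)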
Part of speech tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom part-of-speech tokenizer; the tokenizer parameter specifies the jieba.Tokenizer used internally
- jieba.posseg.dt is the default part-of-speech tagger
- It tags the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS (a short sketch follows)
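A minimal sketch of a custom part-of-speech tokenizer; the sample sentence is an assumption, and the default dictionary is used:

import jieba
import jieba.posseg as pseg

tok = jieba.Tokenizer()                      # a separate jieba tokenizer instance
pos_tok = pseg.POSTokenizer(tokenizer=tok)   # POS tagger backed by that tokenizer

for word, flag in pos_tok.cut("我爱北京天安门"):
    print(word, flag)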
Examples
1、 Precise mode
import jieba
list1 = jieba.lcut("中华人民共和国是一个伟大的国家")
print(list1)
print("Precise mode: " + "/".join(list1))
2、 Full mode
list2 = jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)
print(list2, end=",")
print("Full mode: " + "/".join(list2))
3、 Search engine mode
list3 = jieba.lcut_for_search("中华人民共和国是一个伟大的国家")
print(list3)
print("Search engine mode: " + "".join(list3))
4、 Modify dictionary
import jieba
text = "中信建投投资了一款游戏,中信也投资了一个游戏公司"
word = jieba.lcut(text)
print(word)
# Add words
jieba.add_word("中信建投")
jieba.add_word("投资公司")
word1 = jieba.lcut(text)
print(word1)
# Delete a word
jieba.del_word("中信建投")
word2 = jieba.lcut(text)
print(word2)
5、 Part of speech tagging
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for i in words:
    print(i.word, i.flag)
6、 Count the number of appearances of characters in Romance of the Three Kingdoms
Download the text of Romance of the Three Kingdoms first.
import jieba
txt = open("file path", "r", encoding="utf-8").read()  # open and read the file
words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # store words and their occurrence counts as key-value pairs
for word in words:
    if len(word) == 1:  # single characters are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())  # convert the key-value pairs into a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count in descending order
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
import jieba
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何"}
txt = open("三国演义.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for i in excludes:
    del counts[i]
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))