Notes on jieba word segmentation learning (2)

Time: 2019-9-11

<!-- toc -->

Segmentation modes

jieba offers several segmentation modes to choose from:

  • Full mode

  • Precise mode

  • Search engine mode

A switch for the HMM model is also provided.

Full mode outputs every possible word contained in a string.

Precise mode produces the most probable segmentation of a sentence.

Search engine mode re-segments the result of precise mode, splitting long words into shorter words again.

The example code is as follows:

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is precise mode
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

The output is:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

Among these results, new word recognition relies on the Viterbi algorithm of the HMM model.

The precise mode, the HMM model, and the Viterbi algorithm used to recognize new words are all worth studying in detail.
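As a quick illustration (a minimal sketch; the exact output depends on the jieba version and its dictionary), the HMM switch is exposed as the HMM parameter of jieba.cut, and turning it off disables new word recognition:

    import jieba

    # With the HMM enabled (the default), the out-of-dictionary word "杭研"
    # can be recognized by the Viterbi algorithm.
    print("/ ".join(jieba.cut("他来到了网易杭研大厦", HMM=True)))

    # With the HMM disabled, the unrecognized characters fall back to
    # single-character words, e.g. .../杭/研/...
    print("/ ".join(jieba.cut("他来到了网易杭研大厦", HMM=False)))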

jieba.cut()

After loading the dictionary, jieba has to perform the actual segmentation. The core function in the code is jieba.cut(), whose source is as follows:

 def cut(self, sentence, cut_all=False, HMM=True):
        '''
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.
        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''
        sentence = strdecode(sentence)

        if cut_all:
            re_han = re_han_cut_all
            re_skip = re_skip_cut_all
        else:
            re_han = re_han_default
            re_skip = re_skip_default
        if cut_all:
            cut_block = self.__cut_all
        elif HMM:
            cut_block = self.__cut_DAG
        else:
            cut_block = self.__cut_DAG_NO_HMM
        blocks = re_han.split(sentence)
        for blk in blocks:
            if not blk:
                continue
            if re_han.match(blk):
                for word in cut_block(blk):
                    yield word
            else:
                tmp = re_skip.split(blk)
                for x in tmp:
                    if re_skip.match(x):
                        yield x
                    elif not cut_all:
                        for xx in x:
                            yield xx
                    else:
                        yield x

A few points about this function:

The docstring gives the default mode: precise segmentation with the HMM model enabled.

Lines 12-23 select the regular expressions and the cut_block function according to the chosen mode.
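For reference (copied from the version of the jieba source these notes are based on; the exact patterns may differ between versions), the full-mode counterparts and the default skip pattern are defined as:

    re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
    re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)
    re_skip_default = re.compile("(\r\n|\s)", re.U)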

Line 24 splits the sentence into blocks that contain only processable characters, discarding special characters, because characters not covered by the dictionary could affect the segmentation.

In line 24, the default value of re_han is re_han_default, a regular expression defined as follows:

# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)

You can see that special characters such as spaces, tabs, and newlines do not match this regular expression and are therefore filtered out here.
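To see what this split produces (a small sketch; the sample sentence is made up), re_han_default.split partitions a sentence into blocks, and because the pattern contains a capturing group, the matching blocks are kept in the result:

    import re

    re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)

    # Full-width punctuation does not match the pattern, so it separates blocks.
    print(re_han_default.split("我来到北京,清华大学!"))
    # ['', '我来到北京', ',', '清华大学', '!']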

Lines 25-40 use yield to return the result as a generator, which is described in the documentation:

The structure returned by jieba.cut and jieba.cut_for_search is an iterable generator; a for loop can be used to obtain each word (unicode) produced by the segmentation.
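For example (a usage sketch), the generator can be consumed with a for loop or turned into a list; jieba also ships jieba.lcut and jieba.lcut_for_search, which return lists directly:

    import jieba

    for word in jieba.cut("我来到北京清华大学"):
        print(word)

    words = jieba.lcut("我来到北京清华大学")  # list instead of generator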

Lines 31-40 handle blocks that consist of non-segmentable characters: such a block is output directly as a segmentation result. Punctuation marks, for example, appear in the result as single-character words; these ten lines of code take care of that.
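For instance (a small sketch; the exact tokens depend on the dictionary), punctuation in the input comes out as single-character tokens:

    import jieba

    print(list(jieba.cut("我来到北京。你好!")))
    # e.g. ['我', '来到', '北京', '。', '你好', '!']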

The key is in lines 28-30: if a block is segmentable, the function cut_block is called on it. The default is cut_block = self.__cut_DAG, which performs the segmentation.

jieba.__cut_DAG()

The job of __cut_DAG is to segment words according to a DAG, that is, a directed acyclic graph. The code is as follows:

def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if y - x == 1:
                buf += l_word
            else:
                if buf:
                    if len(buf) == 1:
                        yield buf
                        buf = ''
                    else:
                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
                        else:
                            for elem in buf:
                                yield elem
                        buf = ''
                yield l_word
            x = y

        if buf:
            if len(buf) == 1:
                yield buf
            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
                    yield elem

For a sentence, the DAG is obtained first, and then the maximum-probability path through the DAG is computed by dynamic programming.
After the maximum-probability path has been computed, the function iterates over it: a word that is in the dictionary is output directly, while consecutive single characters are collected into a buffer, since they may form unrecognized words; the HMM model is then used to segment the buffer, and the pieces are output once that segmentation finishes.
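To get a feel for these data structures (a sketch using the default tokenizer instance jieba.dt; the exact numbers depend on the dictionary), the DAG maps each character position to the end positions of the dictionary words starting there, and route records the best path:

    import jieba

    jieba.initialize()  # make sure the dictionary is loaded
    dt = jieba.dt       # the default Tokenizer instance

    sentence = "我来到北京"
    DAG = dt.get_DAG(sentence)
    print(DAG)    # e.g. {0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4]}

    route = {}
    dt.calc(sentence, DAG, route)
    print(route)  # position -> (max log probability, best word end index)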

At this point, the segmentation is finished.

Among the pieces above, what deserves follow-up study is line 2 (getting the DAG), line 4 (computing the maximum-probability path), and lines 20 and 34 (using the HMM model to segment unrecognized words). These will be interpreted in the following articles:

DAG = self.get_DAG(sentence)

    ...

self.calc(sentence, DAG, route)

    ...

recognized = finalseg.cut(buf)