The Principle of Jieba Word Segmentation

Time: 2019-8-11

Introduction

Jieba ("结巴", literally "to stutter") is a popular Chinese word segmentation library; its source code is hosted on GitHub. In this post we will walk through the source code and look at how jieba's segmentation works.

Principle

    def cut(self, sentence, cut_all=False, HMM=True):
        '''
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''

There are three modes when using jieba. The dispatch between the three modes looks like this:

        if cut_all:
            cut_block = self.__cut_all
        elif HMM:
            cut_block = self.__cut_DAG
        else:
            cut_block = self.__cut_DAG_NO_HMM

First, let's look at what each of the three modes produces (a usage sketch follows the list).

  • __cut_all (full mode):

    1. 我来到北京清华大学 ("I came to Beijing's Tsinghua University") → 我 / 来到 / 北京 / 清华 / 清华大学 / 华大 / 大学
    2. 他来到了网易杭研大厦 ("He came to the NetEase Hangyan Building") → 他 / 来到 / 了 / 网易 / 杭 / 研 / 大厦
  • __cut_DAG (accurate mode, with HMM):

    1. 我来到北京清华大学 → 我 / 来到 / 北京 / 清华大学
    2. 他来到了网易杭研大厦 → 他 / 来到 / 了 / 网易 / 杭研 / 大厦
  • __cut_DAG_NO_HMM (accurate mode, without HMM):

    1. 我来到北京清华大学 → 我 / 来到 / 北京 / 清华大学
    2. 他来到了网易杭研大厦 → 他 / 来到 / 了 / 网易 / 杭 / 研 / 大厦
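
These are the standard examples from the jieba README. A minimal usage sketch that reproduces them through the public jieba.cut API (assuming jieba is installed) looks like this:

    # A minimal usage sketch that exercises the three modes listed above.
    import jieba

    sent1 = '我来到北京清华大学'
    sent2 = '他来到了网易杭研大厦'

    print('/'.join(jieba.cut(sent1, cut_all=True)))    # full mode        -> __cut_all
    print('/'.join(jieba.cut(sent2, HMM=True)))        # accurate + HMM   -> __cut_DAG
    print('/'.join(jieba.cut(sent2, HMM=False)))       # accurate, no HMM -> __cut_DAG_NO_HMM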

Next we will analyze these three modes.
They share a common first step: constructing the DAG, i.e. the directed acyclic graph of all possible words in the sentence.
The source code is as follows:

    def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []                      # possible end positions of words starting at k
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ:
                if self.FREQ[frag]:           # non-zero count: frag is a real word
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]      # extend the fragment by one character
            if not tmplist:                   # every character can at least stand alone
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

If sentence is '我来到北京清华大学' ("I came to Beijing's Tsinghua University"), then the DAG is

{0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4], 5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}

Intuitively, DAG[5] = [5, 6, 8] means that a word starting at character 5 ('清') can end at index 5, 6 or 8, i.e. '清', '清华' or '清华大学'.
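
To make the construction concrete, here is a standalone sketch (not jieba's API) that reproduces this DAG with a small hand-made prefix dictionary; the words and counts below are invented for illustration, while jieba's real self.FREQ comes from dict.txt as described next.

    # A standalone sketch of get_DAG with a toy prefix dictionary:
    # real words have a positive count, pure prefixes have count 0.
    def get_dag(sentence, freq):
        dag = {}
        n = len(sentence)
        for k in range(n):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < n and frag in freq:
                if freq[frag]:               # non-zero count: frag is a real word
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:                  # every character can stand alone
                tmplist.append(k)
            dag[k] = tmplist
        return dag

    freq = {'我': 1, '来': 1, '来到': 2, '到': 1, '北': 1, '北京': 5, '京': 1,
            '清': 1, '清华': 3, '清华大': 0, '清华大学': 4, '华': 1, '华大': 1,
            '大': 1, '大学': 6, '学': 1}
    print(get_dag('我来到北京清华大学', freq))
    # {0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4], 5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}
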
self.FREQ is the key data structure used by get_DAG. Where does it come from?

In fact, self.FREQ is built from the dict.txt file in the jieba directory, as follows.
dict.txt has 349,046 lines, each of the form:

一 217830 m
一一 1670 m
一一二 11 m
一一例 3 m
一一分 8 m
一一列举 34 i

The first column is the word, the second is the word's frequency, and the third is its part of speech.
Take 一一列举 as an example. First self.FREQ['一一列举'] = 34 is executed; then each prefix of the word, i.e. '一', '一一' and '一一列', is checked: if a prefix is already stored in self.FREQ it is skipped, otherwise self.FREQ['一'] = 0, self.FREQ['一一'] = 0 and self.FREQ['一一列'] = 0 are executed.
So self.FREQ stores not only the real words together with their frequencies, but also every prefix of every word, with the prefix count set to 0 to distinguish prefixes from real words.
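
A sketch of this construction (a simplification, not jieba's exact dictionary-loading code) might look like this:

    # Building the prefix dictionary from dict.txt-style lines:
    # real words keep their count, previously unseen prefixes get count 0.
    def build_freq(lines):
        freq, total = {}, 0
        for line in lines:
            word, count = line.split()[:2]   # the POS tag column is ignored here
            count = int(count)
            freq[word] = count
            total += count
            for i in range(1, len(word)):    # register every proper prefix
                prefix = word[:i]
                if prefix not in freq:
                    freq[prefix] = 0
        return freq, total

    print(build_freq(['一一列举 34 i'])[0])
    # {'一一列举': 34, '一': 0, '一一': 0, '一一列': 0}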

Now that we have covered the DAG, let's look at each of the three modes in turn.

__cut_all

The source code is as follows:

    def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        for k, L in iteritems(dag):
            if len(L) == 1 and k > old_j:
                yield sentence[k:L[0] + 1]
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j

We won't dissect this traversal line by line here, but the sketch below reproduces the full-mode result using the toy get_dag and freq defined earlier.
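
The sketch simply re-emits every word recorded in the toy DAG (a standalone illustration, not jieba's API):

    # A standalone sketch of the full-mode traversal, reusing the toy
    # get_dag/freq from the earlier sketch.
    def cut_all(sentence, freq):
        dag = get_dag(sentence, freq)
        old_j = -1
        for k, L in dag.items():
            if len(L) == 1 and k > old_j:      # an isolated single character
                yield sentence[k:L[0] + 1]
                old_j = L[0]
            else:
                for j in L:
                    if j > k:                  # every multi-character word in the DAG
                        yield sentence[k:j + 1]
                        old_j = j

    print('/'.join(cut_all('我来到北京清华大学', freq)))
    # 我/来到/北京/清华/清华大学/华大/大学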

__cut_DAG

    def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        ......

First, let's look at the self.calc method:

    def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        logtotal = log(self.total)
        for idx in xrange(N - 1, -1, -1):
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

A small trick is used here: since log(a) + log(b) = log(ab), products of word probabilities become sums of log probabilities, which avoids the floating-point underflow that multiplying many small probabilities would cause.
The calc function is essentially a Viterbi-style dynamic programme over the DAG, computing the most probable segmentation path; readers who are not familiar with the Viterbi algorithm can look it up.
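
To see the recurrence in isolation, here is a standalone sketch of the same dynamic programme. It reuses the toy get_dag and freq from the earlier sketches and takes total as the sum of the toy counts, whereas jieba uses the total computed from dict.txt:

    from math import log

    # route[idx] = (best log-probability from idx to the end of the sentence,
    #               index of the last character of the best word starting at idx)
    def calc(sentence, dag, freq, total):
        n = len(sentence)
        route = {n: (0.0, 0)}
        logtotal = log(total)
        for idx in range(n - 1, -1, -1):       # walk the sentence right to left
            route[idx] = max(
                # `or 1` treats prefixes (count 0) and unseen words as count 1
                (log(freq.get(sentence[idx:x + 1]) or 1) - logtotal + route[x + 1][0], x)
                for x in dag[idx]
            )
        return route

    sentence = '我来到北京清华大学'
    route = calc(sentence, get_dag(sentence, freq), freq, sum(freq.values()))
    # route[0][1] + 1 is the end of the best first word, and so on; for this toy
    # dictionary the path spells out 我 / 来到 / 北京 / 清华大学, which is exactly
    # how __cut_DAG walks the sentence below.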

Here is the full __cut_DAG source code:

    def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if y - x == 1:
                buf += l_word
            else:
                if buf:
                    if len(buf) == 1:
                        yield buf
                        buf = ''
                    else:
                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
                        else:
                            for elem in buf:
                                yield elem
                        buf = ''
                yield l_word
            x = y

        if buf:
            if len(buf) == 1:
                yield buf
            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
                    yield elem

The key part is this:

                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t

When do we enter finalseg.cut(buf)? Only when buf, a run of consecutive single characters left over by the DAG step, is not itself a word in dict.txt.
Inside this function, an HMM is used to tag these unrecognized fragments. Let's first introduce the related files in the project:

prob_start.py stores the HMM's initial state probabilities; the numbers in the file are log probabilities:

P={'B': -0.26268660809250016,
 'E': -3.14e+100,
 'M': -3.14e+100,
 'S': -1.4652633398537678}

B stands for begin, E for end, M for middle, and S for single. At the beginning of a sequence the HMM state can only be B or S, so the initial probabilities of E and M are effectively negative infinity.
prob_trans.py stores the state transition matrix:

P={'B': {'E': -0.510825623765990, 'M': -0.916290731874155},
 'E': {'B': -0.5897149736854513, 'S': -0.8085250474669937},
 'M': {'E': -0.33344856811948514, 'M': -1.2603623820268226},
 'S': {'B': -0.7211965654669841, 'S': -0.6658631448798212}}

prob_emit.py stores the emission probabilities, i.e. the probability of a particular Chinese character appearing under a given state, such as p('刘' | S) = -0.916.

P={'B': {'\u4e00': -3.6544978750449433,
       '\u4e01': -8.125041941842026,
       '\u4e03': -7.817392401429855,
       '\u4e07': -6.3096425804013165,
       '\u4e08': -8.866689067453933,
       '\u4e09': -5.932085850549891,
       '\u4e0a': -5.739552583325728,
       '\u4e0b': -5.997089097239644,
       '\u4e0d': -4.274262055936421,
       '\u4e0e': -8.355569307500769,
       ......

With these three tables, the Viterbi algorithm can segment text that is not covered by the dictionary.
For example, the state sequence corresponding to 我 / 来到 / 北京 / 清华大学 is 'SBEBEBMME'.
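
To make the decoding step concrete, here is a simplified Viterbi sketch in the spirit of finalseg.cut; start_p, trans_p and emit_p stand for the three log-probability tables above, and the import paths in the usage comment assume the current jieba source layout:

    MIN_FLOAT = -3.14e100
    # legal predecessors of each state: a B can only follow E or S, and so on
    PREV = {'B': ('E', 'S'), 'M': ('M', 'B'), 'S': ('S', 'E'), 'E': ('B', 'M')}

    def viterbi(obs, start_p, trans_p, emit_p):
        states = 'BMES'
        V = [{s: start_p[s] + emit_p[s].get(obs[0], MIN_FLOAT) for s in states}]
        path = {s: [s] for s in states}
        for t in range(1, len(obs)):
            V.append({})
            newpath = {}
            for s in states:
                em = emit_p[s].get(obs[t], MIN_FLOAT)
                prob, prev = max(
                    (V[t - 1][p] + trans_p[p].get(s, MIN_FLOAT) + em, p)
                    for p in PREV[s]
                )
                V[t][s] = prob
                newpath[s] = path[prev] + [s]
            path = newpath
        _, last = max((V[-1][s], s) for s in ('E', 'S'))   # a word must end in E or S
        return path[last]

    def hmm_cut(text, start_p, trans_p, emit_p):
        # turn the BMES tag sequence back into words
        tags = viterbi(text, start_p, trans_p, emit_p)
        word, out = '', []
        for ch, tag in zip(text, tags):
            word += ch
            if tag in ('E', 'S'):
                out.append(word)
                word = ''
        if word:
            out.append(word)
        return out

    # usage sketch, assuming the tables live under jieba.finalseg:
    # from jieba.finalseg.prob_start import P as start_p
    # from jieba.finalseg.prob_trans import P as trans_p
    # from jieba.finalseg.prob_emit import P as emit_p
    # print(hmm_cut('杭研大厦', start_p, trans_p, emit_p))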

__cut_DAG_NO_HMM

The difference between __cut_DAG_NO_HMM and __cut_DAG is that __cut_DAG_NO_HMM does not run the HMM on fragments that the dictionary could not segment; it only buffers consecutive single characters matched by re_eng (ASCII letters and digits) so that English words and numbers are emitted as one token. The source code is as follows:

    def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if re_eng.match(l_word) and len(l_word) == 1:
                buf += l_word
                x = y
            else:
                if buf:
                    yield buf
                    buf = ''
                yield l_word
                x = y
        if buf:
            yield buf
            buf = ''
