Notes on Learning Jieba Word Segmentation (3)


DAG (Directed Acyclic Graph)

A directed acyclic graph, DAG for short, is a kind of graph data structure. The idea is simple: it is a directed graph with no cycles.

DAGs are widely used in word segmentation: they appear both in the maximum-probability-path approach and in approaches that add a neural network (NN) on top.

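Because a DAG is a directed graph, it could be represented with an adjacency matrix, but Jieba uses a Python dict, which is more convenient. Each key is the start index of a candidate word in the sentence, and its value is the list of end indices of every dictionary word that begins at that position.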


Take the sentence 国庆节我在研究结巴分词 ("I am studying Jieba word segmentation on National Day") as an example. The DAG dict generated for it is:

{0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6], 6: [6], 7: [7, 8], 8: [8], 9: [9, 10], 10: [10]}

where the character indices are:

国[0] 庆[1] 节[2] 我[3] 在[4] 研[5] 究[6] 结[7] 巴[8] 分[9] 词[10]

For example, 0: [0, 1, 2] means that 国, 国庆 and 国庆节 are all dictionary words starting at index 0.
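To make the dict easier to read, the following small sketch expands each entry back into the candidate words it stands for. The helper name candidate_words is hypothetical and is not part of Jieba; it only slices the example sentence with the start and end indices stored in the DAG.

```python
sentence = "国庆节我在研究结巴分词"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6],
       6: [6], 7: [7, 8], 8: [8], 9: [9, 10], 10: [10]}


def candidate_words(sentence, dag):
    """Map every start index to the words its DAG edges represent.

    Hypothetical helper for illustration; not a Jieba function.
    """
    return {start: [sentence[start:end + 1] for end in ends]
            for start, ends in dag.items()}


words = candidate_words(sentence, dag)
for start in sorted(words):
    print(start, words[start])
```

Each printed line shows one start index together with all the words that can begin there, which is exactly the set of outgoing edges of that position in the graph.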

The get_DAG() function code is as follows:

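def get_DAG(self, sentence):
    DAG = {}
    N = len(sentence)
    for k in range(N):  # the original source uses xrange
        tmplist = []
        i = k
        frag = sentence[k]
        # Extend the fragment while it is still a known word or prefix.
        while i < N and frag in self.FREQ:
            if self.FREQ[frag]:  # non-zero frequency: a real word
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            # No dictionary word starts at k: fall back to the single character.
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG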

frag is short for fragment: the loop takes successively longer slices of sentence, all starting at k. FREQ is the dictionary, a dict of the form {word: frequency}.

Every word and every prefix of every word is added to FREQ when the dictionary is loaded (a prefix that is not itself a word gets frequency 0). So once frag is not in FREQ, we can conclude that neither frag nor any longer word starting with frag is in the dictionary, and the loop can stop.
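The prefix idea can be tried out with a standalone sketch. Everything here is an assumption made for illustration: build_prefix_dict and get_dag are hypothetical free functions that mirror the logic described above, and the word frequencies are made up. Because this toy dictionary contains no single-character words, the resulting DAG is smaller than the one shown earlier.

```python
def build_prefix_dict(word_freqs):
    """Return (freq, total); every proper prefix of a word is stored with 0
    unless it is a word itself."""
    freq = {}
    total = 0
    for word, count in word_freqs.items():
        freq[word] = count
        total += count
        for end in range(1, len(word)):
            prefix = word[:end]
            if prefix not in freq:
                freq[prefix] = 0
    return freq, total


def get_dag(sentence, freq):
    """Standalone version of the DAG construction, taking freq as an argument."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        # Stop as soon as frag is neither a word nor a prefix of one.
        while i < n and frag in freq:
            if freq[frag]:
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        dag[k] = ends or [k]  # fall back to the single character
    return dag


# Made-up frequencies, for illustration only.
toy_words = {"国庆": 30, "国庆节": 20, "研究": 50, "结巴": 5, "分词": 10}
freq, total = build_prefix_dict(toy_words)
dag = get_dag("国庆节我在研究结巴分词", freq)
print(freq["国"], dag[0])
```

Here 国 is stored only as a zero-frequency prefix, so the loop walks past it to reach 国庆 and 国庆节 without recording 国 as a word. At index 3, 我 is not in the dictionary at all, so the loop exits immediately and the single-character fallback applies.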

The next step is to use dynamic programming (DP) to find the maximum-probability path.

Maximum Probability Path

Note that every node of the DAG is weighted. For a word in the dictionary, the weight is its word frequency, FREQ[word]. We want the route = w1, w2, w3, …, wn that maximizes Σ weight(wi).

Dynamic Programming Solution Method

A problem must satisfy two conditions to be solved with DP:

  • Overlapping subproblems

  • Optimal Substructure

Let’s check both conditions for the maximum-probability path problem.

Overlapping subproblems

For a node Wi and its possible successors Wj and Wk:

The weight of any path that reaches Wj through Wi is the weight of the path to Wi plus the weight of Wj: R(i→j) = Ri + weight(j);
The weight of any path that reaches Wk through Wi is the weight of the path to Wi plus the weight of Wk: R(i→k) = Ri + weight(k).

That is, for nodes Wj and Wk that share the predecessor Wi, the path to Wi would otherwise have to be computed repeatedly.

Optimal Substructure

Let Rmax be the optimal path for the whole sentence and Wx its terminal node. Wx may have several predecessors Wi, Wj, Wk, …; let the maximum-weight paths ending at them be Rmaxi, Rmaxj, Rmaxk, …. Then

Rmax = max(Rmaxi, Rmaxj, Rmaxk, …) + weight(Wx)

So the problem reduces to

finding Rmaxi, Rmaxj, Rmaxk, …

This is an optimal substructure: the optimal solution of each subproblem is part of the global optimal solution.

State transition equation

From the previous section, it is easy to write the state transition equation.

Rmax = max(Rmaxi, Rmaxj, Rmaxk, …) + weight(Wx)
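As a minimal sketch of this recurrence, the snippet below evaluates it on a tiny node-weighted graph using memoized recursion. The graph, its weights, and the function name r_max are all hypothetical and chosen only to show the two DP properties; this is not how Jieba stores its data.

```python
from functools import lru_cache

# Hypothetical toy graph: weight of each node and its list of predecessors.
weight = {"a": 1.0, "b": 5.0, "c": 2.0, "d": 3.0}
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}


@lru_cache(maxsize=None)
def r_max(node):
    """Maximum total weight of any path ending at node."""
    # Optimal substructure: best path to node = best path to a predecessor + own weight.
    best_prev = max((r_max(p) for p in preds[node]), default=0.0)
    return best_prev + weight[node]


best = r_max("d")
print(best)  # 9.0, via a -> b -> d
```

Both b and c need the result for a, and the cache ensures it is computed only once, which is the overlapping-subproblems property in action.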


With the analysis above, the code is straightforward. Note that total, the sum of all word frequencies, is computed when the dictionary is loaded. The code also uses a few tricks, such as working with logarithms: a word's probability is its frequency divided by total, and adding log-probabilities avoids multiplying many small numbers. Otherwise it is a typical DP solution.

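from math import log

def calc(self, sentence, DAG, route):
    N = len(sentence)
    # Base case: the empty suffix has log-probability 0.
    route[N] = (0, 0)
    logtotal = log(self.total)
    # Fill route from the end of the sentence back to the start.
    for idx in range(N - 1, -1, -1):  # the original source uses xrange
        # route[idx] = (best log-probability of sentence[idx:], end index of the first word)
        route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])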
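To see the DP produce an actual segmentation, here is a standalone sketch that runs the same computation on the example sentence and then follows route from index 0. The frequencies are made up for illustration, and both the free-function form of calc and the helper cut are assumptions of this sketch rather than Jieba's own API.

```python
from math import log


def calc(sentence, dag, freq, total):
    """Return route, where route[idx] = (best log-probability of sentence[idx:], end index)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    logtotal = log(total)
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - logtotal + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route


def cut(sentence, route):
    """Follow the stored end indices from the start of the sentence (hypothetical helper)."""
    words, idx = [], 0
    while idx < len(sentence):
        end = route[idx][1] + 1
        words.append(sentence[idx:end])
        idx = end
    return words


sentence = "国庆节我在研究结巴分词"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6],
       6: [6], 7: [7, 8], 8: [8], 9: [9, 10], 10: [10]}
# Made-up frequencies, for illustration only.
freq = {"国": 100, "庆": 20, "节": 50, "国庆": 30, "国庆节": 20, "我": 500,
        "在": 600, "研": 10, "究": 10, "研究": 50, "结": 40, "巴": 30,
        "结巴": 5, "分": 80, "词": 60, "分词": 10}
total = sum(freq.values())

route = calc(sentence, dag, freq, total)
segments = cut(sentence, route)
print(segments)  # ['国庆节', '我', '在', '研究', '结巴', '分词']
```

The loop runs from the end of the sentence towards the start, so route[x + 1] is always available by the time route[idx] needs it. The second element of each tuple records where the best first word ends, which is all cut needs to recover the segmentation.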