DAG (Directed Acyclic Graph)
A directed acyclic graph, DAG for short, is a kind of graph data structure. The idea is simple: it is a directed graph that contains no cycles.
DAGs appear throughout word segmentation: both the maximum probability path approach and approaches that put an NN stage after it operate on a DAG.
Since a DAG is a directed graph, it could be represented with an adjacency matrix, but Jieba uses a Python dict, which is more convenient. The representation looks like this:
{prior1: [next1, next2, ..., nextN], prior2: [next1', next2', ..., nextN'], ...}
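Take the sentence 国庆节我在研究结巴分词 (roughly, "On National Day I am studying Jieba word segmentation") as an example. The DAG dict generated for it is:
{0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6], 6: [6], 7: [7, 8], 8: [8], 9: [9, 10], 10: [10]}
where the characters are indexed as follows:
国[0] 庆[1] 节[2] 我[3] 在[4] 研[5] 究[6] 结[7] 巴[8] 分[9] 词[10]
Each key is the index at which a candidate word starts, and each value lists the indices at which a dictionary word starting there may end. For example, 0: [0, 1, 2] stands for the candidates 国, 国庆 and 国庆节.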
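To make the dict easier to read, here is a small sketch that turns the start/end indices back into candidate words. It assumes the example sentence is 国庆节我在研究结巴分词 (reconstructed from the index listing), and candidate_words is a helper name invented for this illustration, not part of Jieba.

```python
# Assumption: the example sentence, reconstructed from the index listing above.
sentence = "国庆节我在研究结巴分词"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6],
       6: [6], 7: [7, 8], 8: [8], 9: [9, 10], 10: [10]}


def candidate_words(sentence, dag):
    """Map each start index to the words that the DAG says may begin there."""
    return {start: [sentence[start:end + 1] for end in ends]
            for start, ends in dag.items()}


words = candidate_words(sentence, dag)
for start in sorted(words):
    print(start, words[start])
```

Every value list contains at least one index, so each character can always fall back to being a single-character word.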
The get_DAG() function code is as follows:
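def get_DAG(self, sentence):
    self.check_initialized()
    DAG = {}
    N = len(sentence)
    # xrange is the Python 2 name; on Python 3 Jieba aliases it to range.
    for k in xrange(N):
        tmplist = []  # end indices of dictionary words that start at k
        i = k
        frag = sentence[k]
        # Extend the fragment while it is still a known word or word prefix.
        while i < N and frag in self.FREQ:
            # Keep the end index only if the fragment has a non-zero frequency.
            if self.FREQ[frag]:
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        # Fall back to the single character if no dictionary word starts at k.
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG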
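Here frag is short for fragment: the loop slices the sentence into progressively longer fragments that start at position k. FREQ is a dict of the form {word: frequency}.
When the dictionary is loaded, every word and every prefix of every word is added to FREQ. So as soon as frag is not in FREQ, neither frag nor any word that starts with frag can be in the dictionary, and the loop can stop. The check if self.FREQ[frag] keeps only fragments with a non-zero frequency, so a fragment that is in FREQ merely as a prefix of a longer word is not treated as a word.
The next step is to find the maximum probability path with dynamic programming (DP).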
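The early exit can be tried out with a standalone sketch. It is not Jieba's code: build_prefix_dict is a hypothetical helper that mimics the loading step described above by storing prefixes with frequency 0 (an assumption of this sketch), get_dag is a plain-function version of the method, and the counts are invented.

```python
def build_prefix_dict(word_counts):
    """Hypothetical loader: keep the word counts and add each proper prefix with count 0."""
    freq = dict(word_counts)
    for word in word_counts:
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)  # do not overwrite a real word
    return freq


def get_dag(sentence, freq):
    """Plain-function version of get_DAG that takes the dictionary as an argument."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag]:  # skip pure prefixes (count 0)
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        dag[k] = ends or [k]
    return dag


# Toy counts, invented for illustration only.
toy_counts = {"国庆节": 5, "国庆": 3, "研究": 8, "分词": 4, "我": 10, "在": 10}
freq = build_prefix_dict(toy_counts)
dag = get_dag("国庆节我在研究", freq)
print(dag)
```

In this toy dictionary 国 and 研 exist only as prefixes, so the loop keeps extending past them but does not record them as words. 庆 is not in the dictionary at all, so the loop for that position stops immediately and falls back to the single character.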
Maximum Probability Path
Note that every node of the DAG carries a weight. For a word in the dictionary, the weight comes from its word frequency, FREQ[word]. We are looking for the route = w1, w2, w3, …, wn that maximizes the total weight, weight(w1) + weight(w2) + … + weight(wn).
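Before turning to DP, it may help to see the objective spelled out by brute force. The sketch below enumerates every route through a small DAG and scores each one. It assumes the weight of a word is the log of its relative frequency, and the counts and the helper names all_routes and route_weight are invented for illustration.

```python
import math


def all_routes(dag, n, start=0):
    """Yield every route through the DAG as a list of (start, end) word spans."""
    if start == n:
        yield []
        return
    for end in dag[start]:
        for rest in all_routes(dag, n, end + 1):
            yield [(start, end)] + rest


sentence = "国庆节我在"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4]}
# Toy counts, invented for illustration only.
freq = {"国": 20, "庆": 5, "节": 10, "国庆": 15, "国庆节": 30, "我": 100, "在": 120}
total = sum(freq.values())


def route_weight(route):
    """Sum of the log relative frequencies of the words on the route."""
    return sum(math.log(freq[sentence[s:e + 1]] / total) for s, e in route)


routes = list(all_routes(dag, len(sentence)))
best = max(routes, key=route_weight)
best_words = [sentence[s:e + 1] for s, e in best]
print(len(routes), best_words)
```

Enumeration is fine for five characters, but the number of routes can grow exponentially with sentence length, which is why a DP formulation is attractive.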
Dynamic Programming Solution Method
A problem must satisfy two conditions for DP to apply:

Repetitive subproblems

Optimal Substructure
Let's check both conditions for the maximum probability path problem.
Repetitive subproblems
For a node Wi and its possible successors Wj and Wk:
The weight of any path that reaches Wj via Wi is the weight of the path to Wi plus the weight of Wj: R(i→j) = R(i) + weight(j).
The weight of any path that reaches Wk via Wi is the weight of the path to Wi plus the weight of Wk: R(i→k) = R(i) + weight(k).
In other words, for nodes Wj and Wk that share the predecessor Wi, the path to Wi would be computed more than once unless its result is stored.
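The amount of repeated work can be made visible with a small experiment. The sketch below counts routes instead of scoring them, once with plain recursion and once with memoization via functools.lru_cache. The DAG is synthetic: it assumes that a one-character or a two-character word may start at every position, which is not taken from any real dictionary.

```python
from functools import lru_cache

# Synthetic DAG: from each index a word may end at the same or at the next index.
n = 20
dag = {k: [e for e in (k, k + 1) if e < n] for k in range(n)}

calls = {"naive": 0, "memo": 0}


def count_naive(start):
    """Count routes from start to the end of the sentence, recomputing every subproblem."""
    calls["naive"] += 1
    if start == n:
        return 1
    return sum(count_naive(end + 1) for end in dag[start])


@lru_cache(maxsize=None)
def count_memo(start):
    """Same recursion, but each start index is solved only once."""
    calls["memo"] += 1
    if start == n:
        return 1
    return sum(count_memo(end + 1) for end in dag[start])


total_routes = count_naive(0)
memo_routes = count_memo(0)
print(total_routes, calls)
```

Both versions return the same count, but the memoized one evaluates each of the n + 1 start positions exactly once, while the plain recursion revisits the same positions over and over.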
Optimal Substructure
Consider the optimal path Rmax for the whole sentence and its final node Wx. Wx may have several predecessors Wi, Wj, Wk, …; let the best paths to them have weights Rmaxi, Rmaxj, Rmaxk, … respectively. Then:
Rmax = max(Rmaxi, Rmaxj, Rmaxk, …) + weight(Wx)
So the problem reduces to finding Rmaxi, Rmaxj, Rmaxk, …
This is the optimal substructure: the optimal solution of each subproblem is part of the global optimal solution.
State transition equation
From the previous section, it is easy to write the state transition equation.
Rmax = max(Rmaxi, Rmaxj, Rmaxk, …) + weight(Wx)
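The equation can be applied directly by sweeping over the sentence from left to right. The following is a minimal sketch, not Jieba's implementation: best_scores is a name invented here, the weight is assumed to be the log of the relative frequency, and the counts are toy values.

```python
import math


def best_scores(sentence, dag, freq, total):
    """Forward DP: best[j] is the highest score of any segmentation of sentence[:j]."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0  # the empty prefix contains no words, so its score is 0
    for start in range(n):
        for end in dag[start]:
            word = sentence[start:end + 1]
            weight = math.log((freq.get(word) or 1) / total)
            # Rmax(x) = max over predecessors of Rmax(predecessor) + weight(Wx)
            best[end + 1] = max(best[end + 1], best[start] + weight)
    return best


sentence = "国庆节我在"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4]}
# Toy counts, invented for illustration only.
freq = {"国": 20, "庆": 5, "节": 10, "国庆": 15, "国庆节": 30, "我": 100, "在": 120}
total = sum(freq.values())

best = best_scores(sentence, dag, freq, total)
print(best[-1])
```

Each best[j] is final by the time position j is used as a start, because every word that ends before j starts at an earlier index that the loop has already processed.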
Code
With the analysis above, the code is straightforward. Note that self.total, the sum of all word frequencies, is computed when the dictionary is loaded. The code also uses a logarithm trick: it adds log probabilities instead of multiplying probabilities, with log(FREQ[word]) - log(total) as the weight of each word. Otherwise it is a typical DP solution.
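from math import log

def calc(self, sentence, DAG, route):
    N = len(sentence)
    # Base case: the empty suffix after the last character has score 0.
    route[N] = (0, 0)
    logtotal = log(self.total)
    # Walk backwards so that route[x + 1] is already known for every x in DAG[idx].
    for idx in xrange(N - 1, -1, -1):
        # Each candidate is (log P(word) + best score of the rest, end index x).
        route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])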
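To see what route contains after calc() has run, here is a standalone sketch with the same backward recurrence written as a plain function, followed by a hypothetical helper, cut_by_route, that reads the segmentation back out of route. The counts are toy values, and the function signatures are assumptions of this sketch rather than Jieba's API.

```python
import math


def calc(sentence, dag, freq, total):
    """Backward DP: route[idx] = (best score of sentence[idx:], end index of its first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(total)
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (math.log(freq.get(sentence[idx:x + 1]) or 1) - log_total + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route


def cut_by_route(sentence, route):
    """Hypothetical helper: follow the stored end indices from left to right."""
    words = []
    idx = 0
    while idx < len(sentence):
        end = route[idx][1]
        words.append(sentence[idx:end + 1])
        idx = end + 1
    return words


sentence = "国庆节我在"
dag = {0: [0, 1, 2], 1: [1], 2: [2], 3: [3], 4: [4]}
# Toy counts, invented for illustration only.
freq = {"国": 20, "庆": 5, "节": 10, "国庆": 15, "国庆节": 30, "我": 100, "在": 120}
total = sum(freq.values())

route = calc(sentence, dag, freq, total)
segmentation = cut_by_route(sentence, route)
print(segmentation)
```

Because max compares tuples element by element, the score decides first and the end index only breaks ties. Summing logs also avoids the floating-point underflow that multiplying many small probabilities would cause on long sentences.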