[Pointer Network Series 1] – GlobalPointer NER


Preface

This series mainly presents what pointer networks have achieved in NER and relation extraction, and summarizes their advantages, disadvantages and theoretical analysis based on notes from experts in the field.


In previous work, our NER used the traditional LSTM+CRF, and the metrics for each field were good. For simple fields such as education background, the F1 values are all above 95, and for more complex ones, such as industry and major, they are above 90. The final complete accuracy (a document counts as correct only if all of its fields are extracted correctly) is above 85. There is no good way left to optimize the current model beyond targeted data supplementation, post-processing and the like. I therefore surveyed better current NER approaches. Using BERT+CRF directly would of course give a good improvement over the current results, but its inference performance problem is hard to solve. What can be done is to distill and prune BERT, but the results were not satisfactory.

I happened to come across Su Jianlin's GlobalPointer. With this model, the complete accuracy rose from 85 to 91, and inference on CPU took less than 17 ms, so both speed and metrics met the requirements. (Note: the BERT used here is a 4-layer BERT obtained from online teacher-student training. The full BERT reaches a complete accuracy of 96, but I did not measure its latency. I'm lazy.)

Here is a brief introduction to how GlobalPointer combines elegance and performance.

GlobalPointer uses the idea of global normalization for entity recognition. It handles nested and non-nested entities in the same way, and its design is theoretically more reasonable than CRF. Training does not require recursively computing the denominator as CRF does, and prediction requires no dynamic programming. It is fully parallel; in the ideal case, the time complexity is \(\color{blue}{O(1)}\).


\(\color{red}{\text{Difference between pointer networks and GlobalPointer}}\)

  • A pointer network generally uses two modules to identify the head and the tail of an entity separately, while GlobalPointer treats the head and tail as a whole when making its judgment, as shown in the figure above, so it has a "global view".

Basic ideas

Suppose the text to be recognized has length n. In theory, each entity we recognize is a continuous span whose length can be up to n, so the maximum number of candidate entities is \(\color{blue}{n(n+1)/2}\). The task is to pick the real entities out of these candidates, which makes it a multi-label classification problem of choosing k items from m candidates, with \(\color{blue}{m=n(n+1)/2}\).
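To make the candidate count concrete, here is a minimal sketch that enumerates every span of a length-n sequence. The function name `candidate_spans` and the 1-based inclusive indexing are assumptions made for illustration only.

```python
def candidate_spans(n):
    """Return all (i, j) with 1 <= i <= j <= n (1-based, inclusive)."""
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]


n = 5
spans = candidate_spans(n)
# The number of enumerated spans matches the closed form n(n+1)/2.
print(len(spans), n * (n + 1) // 2)  # 15 15
```

Each pair (i, j) is one candidate entity, so picking the real entities amounts to labeling each pair as positive or negative.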

One concern here is that the complexity is \(\color{blue}{O(n^2)}\). This is really the space complexity, though: the computation can be fully parallelized, so the time cost can be reduced to \(\color{blue}{O(n)}\).

Mathematical form

Suppose an input t of length n is encoded into the vector sequence \(\color{blue}{h_1,h_2,…,h_n}\). Through the transformations \(\color{blue}{q_{i,\alpha}=W_{q,\alpha}h_i+b_{q,\alpha}}\) and \(\color{blue}{k_{i,\alpha}=W_{k,\alpha}h_i+b_{k,\alpha}}\), we obtain the vector sequences \(\color{blue}{[q_{1,\alpha},q_{2,\alpha},…,q_{n,\alpha}]}\) and \(\color{blue}{[k_{1,\alpha},k_{2,\alpha},…,k_{n,\alpha}]}\), which are the vector sequences used to identify entities of type \(\color{blue}{\alpha}\). We can then define:

\[\color{blue}{s_\alpha(i,j)=q_{i,\alpha}^Tk_{j,\alpha} \qquad (1)}\]


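Here \(\color{blue}{s_\alpha(i,j)}\) is the score that the span \(\color{blue}{t_{[i:j]}}\), running from position i to position j, is an entity of type \(\color{blue}{\alpha}\). This is in fact a simplified version of multi-head attention, with one head per entity type and the V-related operations removed.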
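The scoring step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the sizes, the random stand-in for encoder outputs, and the names `W_q`, `W_k`, `b_q`, `b_k` are all hypothetical, and position information is not included here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, num_types, d = 6, 16, 3, 8  # illustrative sizes, not from the post

h = rng.normal(size=(n, hidden))       # stand-in for encoder outputs h_1..h_n
W_q = rng.normal(size=(num_types, hidden, d))
W_k = rng.normal(size=(num_types, hidden, d))
b_q = np.zeros((num_types, d))
b_k = np.zeros((num_types, d))

# q[a, i] and k[a, i]: one linear projection per entity type a (one "head" each).
q = np.einsum("nh,ahd->and", h, W_q) + b_q[:, None, :]
k = np.einsum("nh,ahd->and", h, W_k) + b_k[:, None, :]

# scores[a, i, j] = q[a, i] . k[a, j], the score of span (i, j) for type a.
scores = np.einsum("aid,ajd->aij", q, k)

# Only spans with i <= j are valid candidates, so mask out the lower triangle.
valid = np.triu(np.ones((n, n), dtype=bool))
scores = np.where(valid, scores, -np.inf)

print(scores.shape)  # (3, 6, 6)
```

All spans and all entity types are scored in one batched matrix product, which is why the method parallelizes so well.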

Position encoding

In theory, formula (1) is enough. In actual training, however, the training corpus is limited and the performance is often unsatisfactory, because relative position information is missing. In Su's experiments, the metric gap between having and not having relative position information is nearly 30%.

What happens if there is no position information? Take a simple example: "Beijing blah blah, Shanghai blah blah, Guangdong", and suppose we want to recognize place names. Since GlobalPointer by itself is insensitive to span length and position, it may well identify "Beijing blah blah Shanghai" as a location. Adding position information solves this problem and lets the model pick out the real entities.

I've always wanted to study the differences and trade-offs between position encodings, but I haven't started. Let me set another flag: I'll write about it in the next post. Back to the point: Su uses RoPE (rotary position embedding), which is in effect a transformation matrix \(\color{blue}{R_i}\) satisfying \(\color{blue}{R_i^TR_j=R_{j-i}}\). Applying it to q and k gives:

\[\color{blue}{\begin{aligned} s_\alpha(i,j) &= (R_iq_{i,\alpha})^T(R_jk_{j,\alpha}) \\ &= q_{i,\alpha}^TR_i^TR_jk_{j,\alpha} \\ &= q_{i,\alpha}^TR_{j-i}k_{j,\alpha} \end{aligned} \qquad (2)}\]

In this way, relative position information is explicitly injected into \(\color{blue}{s_\alpha(i,j)}\).
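The identity \(R_i^TR_j=R_{j-i}\) can be checked numerically. The sketch below builds a block-diagonal rotation matrix in the usual RoPE style; the helper name `rope_matrix` and the frequency base of 10000 are assumptions for illustration, not details taken from this post.

```python
import numpy as np


def rope_matrix(pos, d, base=10000.0):
    """Block-diagonal rotation matrix R_pos of size d x d (d must be even).

    Each 2x2 block rotates by pos * base**(-2m/d); the base is an assumed default.
    """
    R = np.zeros((d, d))
    for m in range(d // 2):
        theta = pos * base ** (-2.0 * m / d)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * m:2 * m + 2, 2 * m:2 * m + 2] = [[c, -s], [s, c]]
    return R


d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
i, j = 3, 7

# Left: rotate q and k by their absolute positions, then take the dot product.
lhs = (rope_matrix(i, d) @ q) @ (rope_matrix(j, d) @ k)
# Right: a single rotation by the relative offset j - i.
rhs = q @ rope_matrix(j - i, d) @ k

print(np.allclose(lhs, rhs))  # True
```

The score therefore depends on positions only through the offset j - i, which is exactly the relative position information that the plain dot product lacks.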

That covers the principle and model representation of GlobalPointer, from the basic idea to the mathematical form and the position encoding. Next, let's talk about optimization.

Loss function

As shown above, the scoring function turns the recognition of entities of type \(\color{blue}{\alpha}\) into a multi-label classification problem over \(\color{blue}{n(n+1)/2}\) classes. A naive approach is to treat each of the \(\color{blue}{n(n+1)/2}\) candidates as its own binary classification problem. Since only a handful of candidates in a sentence are real entities, this obviously leads to severe class imbalance.

Here we follow reference 2, "Extending 'softmax + cross entropy' to multi-label classification", which generalizes the cross entropy of single-label multi-class classification and suits scenarios where the total number of classes is large but the number of target classes is small. The formula is as follows:

\[\color{blue}{\log\left(1+\sum_{(i,j)\in P_\alpha}e^{-s_\alpha(i,j)}\right)+\log\left(1+\sum_{(i,j)\in Q_\alpha}e^{s_\alpha(i,j)}\right) \qquad (3)}\]

where \(\color{blue}{P_\alpha}\) is the set of (head, tail) index pairs of all entities of type \(\color{blue}{\alpha}\) in the sample, and \(\color{blue}{Q_\alpha}\) is the set of (head, tail) index pairs of all spans in the sample that are not entities or are entities of a type other than \(\color{blue}{\alpha}\). Note that we only need to consider combinations with \(\color{blue}{i\leq j}\):

\[\color{blue}{\begin{aligned} \Omega &= \{(i,j) \mid 1\leq i\leq j\leq n\} \\ P_\alpha &= \{(i,j) \mid t_{[i:j]} \text{ is an entity of type } \alpha\} \\ Q_\alpha &= \Omega - P_\alpha \end{aligned} \qquad (4)}\]
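The loss for a single entity type can be written down directly from the two sums. The sketch below is a plain-Python illustration; the function name `multilabel_ce` and the dictionary layout are assumptions, and a real implementation would use a numerically stable log-sum-exp on tensors instead.

```python
import math


def multilabel_ce(scores, positives):
    """Loss for one entity type.

    scores: dict mapping every span (i, j) with i <= j to its score s(i, j)
    positives: set of spans that are real entities of this type (P);
               every other span in `scores` is treated as negative (Q).
    """
    pos_term = sum(math.exp(-s) for span, s in scores.items() if span in positives)
    neg_term = sum(math.exp(s) for span, s in scores.items() if span not in positives)
    return math.log1p(pos_term) + math.log1p(neg_term)


n = 3
omega = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
positives = {(1, 2)}

# A good scorer gives positives high scores and negatives low scores.
good = {span: (5.0 if span in positives else -5.0) for span in omega}
# A bad scorer does the opposite.
bad = {span: -s for span, s in good.items()}

print(multilabel_ce(good, positives) < multilabel_ce(bad, positives))  # True
```

The loss is small only when every positive score is well above 0 and every negative score is well below 0, which is why 0 serves as the natural decision threshold.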

In the decoding phase, every span \(\color{blue}{t_{[i:j]}}\) with \(\color{blue}{s_\alpha(i,j)\gt0}\) is output as an entity of type \(\color{blue}{\alpha}\). The decoding process is therefore extremely simple, and with full parallelism the decoding efficiency is \(\color{blue}{O(1)}\).
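Decoding by thresholding at 0 might look like the following sketch. The function name `decode`, the nested-dictionary layout and the example scores are all made up for illustration.

```python
def decode(scores_by_type, threshold=0.0):
    """Return every (type, i, j) whose span score exceeds the threshold."""
    entities = []
    for alpha, table in scores_by_type.items():
        for (i, j), s in table.items():
            if i <= j and s > threshold:
                entities.append((alpha, i, j))
    return sorted(entities)


# Hypothetical scores for two entity types over a 5-token sentence.
scores_by_type = {
    "LOC": {(1, 2): 3.1, (1, 5): -2.0, (4, 5): 1.7},
    "ORG": {(1, 5): 0.9, (2, 3): -4.2},
}

print(decode(scores_by_type))
# [('LOC', 1, 2), ('LOC', 4, 5), ('ORG', 1, 5)]
```

Every span is judged independently, so nested entities such as the ORG span (1, 5) containing the LOC span (1, 2) come out without any extra machinery or dynamic programming.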

Experimental results

Below are the experimental results on the CMeEE dataset (a nested NER task).

Model   Validation F1   Test F1   Training speed   Prediction speed
CRF     63.81%          64.39%    1x               1x
GP      64.84%          65.98%    1.52x            1.13x

Comparison with CRF [copied verbatim]

Assume that the number of sequence labels is k. The difference between per-token softmax and CRF is then: \(\color{red}{\text{the former treats sequence labeling as } n\ k\text{-class classification problems, while the latter treats it as one } k^n\text{-class classification problem}}\). This also reveals the theoretical shortcomings of both per-token softmax and CRF when used for NER. Per-token softmax treats sequence labeling as n k-class classification problems, which is too loose: predicting the label at one position correctly does not mean the entity is extracted correctly, because an entity is right only when the labels of its whole span are right. Conversely, CRF treats sequence labeling as one \(\color{blue}{k^n}\)-class classification problem, which is too strict: it requires all entities to be predicted correctly, and getting only some of them right earns no credit. In practice CRF does produce partially correct predictions, but that only shows that the model itself generalizes well. The CRF design itself does carry the meaning of "credit only when everything is right".

Therefore, CRF is not entirely reasonable in theory. In contrast, GlobalPointer is closer to how NER is actually used and evaluated: it is entity-based by design and framed as a multi-label classification problem, so its loss function and evaluation metric both work at entity granularity, and getting only some entities right still earns reasonable credit. It is therefore "reasonable" that GlobalPointer can outperform CRF even in non-nested NER scenarios.

Shoulders Of Giants

1. Su Jianlin (May 01, 2021). GlobalPointer: a unified approach to nested and non-nested NER

2. Extending "softmax + cross entropy" to multi-label classification problems
