Preface
This series covers what pointer networks have achieved in NER and relation extraction, and summarizes their advantages, disadvantages, and theoretical analysis, drawing on notes written by experts in the field.
GlobalPointer
In our previous work, we did NER with the traditional LSTM+CRF, and the metrics for every field were good. For simple fields such as education background, F1 is above 95; for more complex ones such as industry and major, it is also above 90. The final complete accuracy (a document counts as correct only when all of its fields are extracted correctly) is above 85. The current model leaves little room for optimization beyond targeted data supplementation, post-processing, and the like. With that in mind, we surveyed current NER approaches for something better. The obvious choice is to use BERT+CRF directly, which would clearly improve on the current results, but its inference cost is hard to bring down. What we could do was distill and prune BERT, and the results were not satisfactory.
I then happened to come across Su Jianlin's GlobalPointer. With Su's model, the complete accuracy rose from 85 to 91, and inference on CPU took less than 17 ms, so both performance and metrics met the bar. (Note: the BERT here is a four-layer BERT, taken from the four-layer model produced by our online teacher-student training. BERT itself reaches a complete accuracy as high as 96, but I did not measure its latency; I was lazy.)
Here is a brief introduction to how GlobalPointer, which combines elegance with performance, actually works.
GlobalPointer uses the idea of global normalization for entity recognition and handles nested and non-nested entities in the same way. In theory, its design is more reasonable than CRF. During training it does not need to compute the normalizing denominator recursively as CRF does, and prediction does not need dynamic programming. It is fully parallel, and in the ideal case the time complexity is \(\color{blue}{O(1)}\).
\(\color{red}{\text{Difference between pointer networks and GlobalPointer}}\)
- A pointer network generally uses two modules to identify the head and the tail of an entity separately, while GlobalPointer judges the head and tail together as a whole, as shown in the figure above, so it has a "global view".
Basic ideas
Suppose the text sequence to be recognized has length \(\color{blue}{n}\). In theory, every entity is a contiguous span, and its length can be at most \(\color{blue}{n}\). It follows that there are at most \(\color{blue}{n(n+1)/2}\) candidate entities. The task is to pick the real entities out of these candidates, so it becomes a multi-label classification problem of selecting \(\color{blue}{k}\) out of \(\color{blue}{m}\) candidates, where \(\color{blue}{m}\) is the number of candidates above.
One issue here is that the complexity is \(\color{blue}{O(n^2)}\). This, however, is really the space complexity. The computation can be fully parallelized, so the time cost can be reduced to \(\color{blue}{O(n)}\).
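The candidate count can be checked with a short sketch. The helper name `candidate_spans` is made up for illustration, and spans use 0-based inclusive indices, which is an assumption of this example rather than something fixed by the text.

```python
def candidate_spans(n):
    """Enumerate every contiguous span (i, j) with 0 <= i <= j < n."""
    return [(i, j) for i in range(n) for j in range(i, n)]


tokens = ["I", "live", "in", "New", "York"]
n = len(tokens)
spans = candidate_spans(n)

# A sequence of length n has n(n+1)/2 contiguous spans.
print(len(spans), n * (n + 1) // 2)  # 15 15
```

Every span is scored independently of the others, which is why the quadratic number of candidates costs memory rather than sequential time.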
Mathematical form
Suppose an input \(\color{blue}{t}\) of length \(\color{blue}{n}\) is encoded into the vector sequence \(\color{blue}{[h_1,h_2,\dots,h_n]}\). Through the transformations \(\color{blue}{q_{i,\alpha}=W_{q,\alpha}h_i+b_{q,\alpha}}\) and \(\color{blue}{k_{i,\alpha}=W_{k,\alpha}h_i+b_{k,\alpha}}\), we obtain the vector sequences \(\color{blue}{[q_{1,\alpha},q_{2,\alpha},\dots,q_{n,\alpha}]}\) and \(\color{blue}{[k_{1,\alpha},k_{2,\alpha},\dots,k_{n,\alpha}]}\), which are the sequences used to identify entities of type \(\color{blue}{\alpha}\). We can then define:
\[\color{blue}{s_\alpha(i,j)=q_{i,\alpha}^{T}k_{j,\alpha}}\tag{1}\]
Here \(\color{blue}{s_\alpha(i,j)}\) is the score that the span \(\color{blue}{t_{[i:j]}}\), running from \(\color{blue}{i}\) to \(\color{blue}{j}\), is an entity of type \(\color{blue}{\alpha}\). As can be seen, this is in effect a simplified version of multi-head attention with the \(\color{blue}{V}\)-related operations removed.
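To make the scoring concrete, here is a minimal pure-Python sketch for a single entity type. The helper names `linear` and `span_scores` and all numeric values are hypothetical; a real implementation would use a tensor library and learned weights.

```python
def linear(W, b, h):
    """Compute W h + b for a matrix W (list of rows), bias b and vector h."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def span_scores(H, Wq, bq, Wk, bk):
    """Return s[i][j] = q_i . k_j for one entity type."""
    Q = [linear(Wq, bq, h) for h in H]
    K = [linear(Wk, bk, h) for h in H]
    return [[sum(a * c for a, c in zip(q, k)) for k in K] for q in Q]


# Toy encoder output: n = 3 tokens, hidden size 2 (values are made up).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq, bq = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # identity projection
Wk, bk = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]  # scale by 2

s = span_scores(H, Wq, bq, Wk, bk)
print(s[0][2])  # score of the span from token 0 to token 2: 2.0
```

Only the entries with i <= j are meaningful as span scores; with several entity types, each type gets its own pair of projections, much like one attention head per type.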
Position encoding
In theory, formula (1) is enough, but in actual training, when the training data is limited, its performance is often unsatisfactory. The reason is that relative position information is missing. In Su's experiments, the gap in metrics between using and not using relative position information is nearly 30%.
What happens without position information? Take a simple example: for the sentence "Beijing blah blah, Shanghai blah blah, Guangdong", suppose we want to recognize place names. Since GlobalPointer on its own is insensitive to span length and position, it may identify the whole span "Beijing blah blah, Shanghai" as a location. Adding position information solves this problem and lets the model pick out the real entities.
I have long wanted to study the differences and trade-offs among position encodings but have not started. Let me set another flag: I will write it up in the next post. Enough digression. Su uses rotary position embedding (RoPE) here, which is a transformation matrix \(\color{blue}{R_i}\) satisfying \(\color{blue}{R_i^TR_j=R_{j-i}}\). Applying it to \(\color{blue}{q}\) and \(\color{blue}{k}\) gives:
\[\color{blue}{s_\alpha(i,j)=(R_iq_{i,\alpha})^{T}(R_jk_{j,\alpha})=q_{i,\alpha}^{T}R_i^{T}R_jk_{j,\alpha}=q_{i,\alpha}^{T}R_{j-i}k_{j,\alpha}}\]
In this way, relative position information is explicitly injected into \(\color{blue}{s_\alpha(i,j)}\).
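The property that the rotated score depends only on the offset j - i can be checked numerically. The sketch below uses a single 2-D rotation block with an arbitrary angle step `theta=0.5`; real RoPE applies many such blocks with different frequencies, so this is a simplified illustration, and the helper names are hypothetical.

```python
import math


def rotate(vec, pos, theta=0.5):
    """Apply the 2-D rotation R_pos (angle pos * theta) to vec."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = [1.0, 2.0], [0.5, -1.0]

# (R_i q)^T (R_j k) should equal q^T R_{j-i} k for any i, j.
for i, j in [(0, 3), (2, 5), (7, 10)]:
    lhs = dot(rotate(q, i), rotate(k, j))
    rhs = dot(q, rotate(k, j - i))
    print(i, j, round(lhs, 6), round(rhs, 6))
```

All three pairs share the offset 3, so every printed row shows the same value: the absolute positions drop out and only the relative distance remains.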
That covers the principle and model formulation of GlobalPointer, from the basic idea through the mathematical form to the position encoding. Next, let's talk about optimization.
Loss function
As can be seen from the above, for each entity type \(\color{blue}{\alpha}\) the final scoring function amounts to \(\color{blue}{n(n+1)/2}\) binary classification problems: there are \(\color{blue}{n(n+1)/2}\) candidates, and each candidate is one binary classification. Since real entities are only a small fraction of the candidates, this obviously leads to severe class imbalance.
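Here we follow reference 2, "Extending 'softmax + cross entropy' to multi-label classification", which generalizes the cross entropy of single-label multi-class classification and suits scenarios where the total number of classes is large but the number of target classes is small. The formula is as follows:
\[\color{blue}{\log\left(1+\sum_{(i,j)\in P_\alpha}e^{-s_\alpha(i,j)}\right)+\log\left(1+\sum_{(i,j)\in Q_\alpha}e^{s_\alpha(i,j)}\right)}\]
where \(\color{blue}{P_\alpha}\) is the set of head-tail pairs of all entities of type \(\color{blue}{\alpha}\) in the sample, and \(\color{blue}{Q_\alpha}\) is the set of head-tail pairs of all spans in the sample that are non-entities or entities of a type other than \(\color{blue}{\alpha}\). Note that we only need to consider combinations with \(\color{blue}{i\leq j}\):
\[\color{blue}{\begin{aligned}\Omega&=\{(i,j)\mid 1\leq i\leq j\leq n\}\\P_\alpha&=\{(i,j)\mid t_{[i:j]}\text{ is an entity of type }\alpha\}\\Q_\alpha&=\Omega-P_\alpha\end{aligned}}\]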
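As an illustration of this loss, the sketch below evaluates it for one entity type on a tiny score matrix. The function name `multilabel_ce` and the scores are hypothetical, and the direct use of `exp` is for readability only; a production version would normally use a numerically stable log-sum-exp.

```python
import math


def multilabel_ce(scores, positives):
    """log(1 + sum_{P} e^{-s}) + log(1 + sum_{Q} e^{s}) over spans with i <= j.

    scores: n x n list, scores[i][j] is s(i, j) for one entity type.
    positives: set of (i, j) pairs that are real entities of that type.
    """
    n = len(scores)
    pos_sum, neg_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i, n):  # only the upper triangle, i <= j
            if (i, j) in positives:
                pos_sum += math.exp(-scores[i][j])
            else:
                neg_sum += math.exp(scores[i][j])
    return math.log1p(pos_sum) + math.log1p(neg_sum)


good = [[-5.0, 5.0], [0.0, -5.0]]  # the true span (0, 1) scores high
bad = [[5.0, -5.0], [0.0, 5.0]]    # the true span scores low
print(multilabel_ce(good, {(0, 1)}) < multilabel_ce(bad, {(0, 1)}))  # True
```

The loss is small when positive spans score above zero and all other spans score below zero, which is why zero works as the decision threshold.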
In the decoding phase, every span \(\color{blue}{t_{[i:j]}}\) with \(\color{blue}{s_\alpha(i,j)\gt0}\) is output as an entity of type \(\color{blue}{\alpha}\). The decoding process is extremely simple, and in the ideal, fully parallel case the decoding efficiency is \(\color{blue}{O(1)}\)!
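The decoding rule fits in a few lines. In this sketch the function name `decode`, the type labels, and the scores are all made up for illustration; the loops are written sequentially for clarity, although each comparison is independent and could run in parallel.

```python
def decode(scores_by_type, threshold=0.0):
    """Return (type, i, j) for every span with i <= j and score > threshold."""
    entities = []
    for etype, scores in scores_by_type.items():
        n = len(scores)
        for i in range(n):
            for j in range(i, n):
                if scores[i][j] > threshold:
                    entities.append((etype, i, j))
    return entities


tokens = ["New", "York", "University"]
scores_by_type = {
    "LOC": [[-2.0, 3.1, -1.0], [0.0, -4.0, -3.0], [0.0, 0.0, -5.0]],
    "ORG": [[-3.0, -1.0, 2.4], [0.0, -2.0, -2.5], [0.0, 0.0, -1.5]],
}
for etype, i, j in decode(scores_by_type):
    print(etype, tokens[i : j + 1])
```

In this toy input the LOC span "New York" sits inside the ORG span "New York University", and both are returned, which shows how nested entities come out without any special handling.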
Experimental results
Experimental results on the CMeEE (nested NER) dataset:
Model | Validation set F1 | Test set F1 | Training speed | Prediction speed |
---|---|---|---|---|
CRF | 63.81% | 64.39% | 1x | 1x |
GP | 64.84% | 65.98% | 1.52x | 1.13x |
Comparison with CRF [copied from the source]
Suppose the number of sequence labels is \(\color{blue}{k}\). The difference between frame-by-frame softmax and CRF is: \(\color{red}{\text{the former treats sequence labeling as }n\text{ separate }k\text{-class classification problems, while the latter treats it as a single }k^n\text{-class classification problem}}\). This also reveals the theoretical shortcomings of frame-by-frame softmax and CRF when used for NER. Frame-by-frame softmax treats sequence labeling as \(\color{blue}{n}\) separate \(\color{blue}{k}\)-class problems, which is too loose: predicting the label at one position correctly does not mean the entity is extracted correctly, since an entity is right only when the labels across its whole span are right. Conversely, CRF treats sequence labeling as a single \(\color{blue}{k^n}\)-class problem, which is too strict: it requires all entities to be predicted correctly and gives no credit when only some of them are. Although in practice CRF does produce partially correct predictions, that only shows the model itself generalizes well; the design of CRF does carry the meaning of "credit only when everything is right".
Therefore, CRF is also not entirely reasonable in theory. In contrast, GlobalPointer is closer to how NER is actually used and evaluated: it is entity-based by construction and is framed as a multi-label classification problem. Its loss function and evaluation metrics are therefore both at entity granularity, and it still receives reasonable credit when only some entities are predicted correctly. So even in non-nested NER scenarios, it is "reasonable" that GlobalPointer can outperform CRF.
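A little arithmetic makes the three framings easier to compare. The function `output_space_sizes` and the chosen values n = 10, k = 5 are illustrative assumptions only.

```python
def output_space_sizes(n, k):
    """Sizes of the decision problems for a sequence of length n with k tags."""
    return {
        "softmax": (n, k),                  # n independent k-way decisions
        "crf": k ** n,                      # one decision over k^n tag paths
        "globalpointer": n * (n + 1) // 2,  # binary decisions per entity type
    }


print(output_space_sizes(n=10, k=5))
# {'softmax': (10, 5), 'crf': 9765625, 'globalpointer': 55}
```

Frame-by-frame softmax scores positions, CRF scores whole tag paths, and GlobalPointer scores candidate spans, which is the unit that NER evaluation actually counts.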
Shoulders Of Giants
1. Su Jianlin (May 01, 2021). GlobalPointer: a unified approach to nested and non-nested NER
2. Extending "softmax + cross entropy" to multi-label classification problems