NLP (27): An attempt at open-domain triple extraction

Time: 2020-03-29

As I write this article, I am excited. Last June I wrote an article on building a knowledge graph with relation extraction, and ever since then I have been looking for a way to extract triples in the open domain. Many readers have asked me about this, and today I can finally reply. Although it is not the right answer (there is no right answer at present), at least I have written down my own answer.
It has not been long since I came up with this extraction system, but my mood has shifted from the initial excitement and ecstasy to plainness and, later, dissatisfaction. It turns out that open-domain triple extraction is too difficult for my personal effort and intelligence to solve perfectly. That is why the title says "attempt": this article is only an attempt, not a solution to the problem. Still, I want to write something down, in the hope that it gives readers a little inspiration; it is also a summary of my exploration over the past six months.
This article introduces an attempt at open-domain triple extraction. The project structure is as follows:
(Figure: project structure)
The project is divided into four parts. The main modules are as follows:

  • extract_example: use the trained models to extract triples from several novels and news articles, forming knowledge graph examples;
  • sequence labeling: train the sequence-labeling algorithm on the annotated entity data;
  • SPO tagging platform: the annotation platform, for tagging subjects, predicates and objects, and for marking whether each triple is valid;
  • text classification: text classification, used to decide whether an extracted triple is valid.

The flow chart of the extraction system of this project is as follows:
(Figure: flow chart of the extraction system)
Next, I will introduce them one by one.

Annotation platform

The author uses Tornado to build a simple annotation platform. On the annotation page, the annotator enters the sentence to be annotated (annotation is done at the sentence level). We adopt this annotation method because we can mark subjects, predicates and objects within a sentence; these entities form every possible combination of triples, and a 0/1 label then marks whether each triple is valid. In this way, we can do triple extraction in the open domain.
A simple example of annotation is as follows:
(Figure: annotation example)
Some explanation of the above annotation result. Annotation is done at the sentence level, and the different elements are marked distinctly. Two subjects, one predicate and two objects are marked, where the predicate is shared by the subjects and objects, so it only needs to be annotated once. Clicking "show s p o" then displays 4 triples in total, with s, p and o separated by #; 0/1 indicates whether the triple is valid, with 0 as the default.
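The combination step described above (2 subjects × 1 shared predicate × 2 objects → 4 candidate triples, each defaulting to label 0) can be sketched as follows; the function name and the "#" delimiter follow the description above, but this is an illustrative sketch, not the platform's actual code:

```python
from itertools import product

def build_candidates(subjects, predicates, objects, sep="#"):
    """Combine every marked subject, predicate and object into a
    candidate triple string 's#p#o', defaulting its label to 0
    (invalid) until the annotator flips it to 1."""
    return [(sep.join(combo), 0) for combo in product(subjects, predicates, objects)]

# the example above: two subjects, one shared predicate, two objects
candidates = build_candidates(["CDC", "NIAID"], ["director"], ["Redfield", "Fauci"])
print(len(candidates))  # 4 candidate triples
```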
In my spare time, I annotated more than 3,200 samples in total. For sequence labeling this yields more than 3,200 samples; for text classification it yields more than 9,000 samples.

Sequence labeling

The above annotation example yields the following character-level tag sequence. The tags are over the characters of the original Chinese sentence, which reads "US CDC director Redfield (left circle) and director Fauci of the Institute of Allergy and Infectious Diseases, US National Institutes of Health (right circle)":

美 B-SUBJ
国 I-SUBJ
疾 I-SUBJ
控 I-SUBJ
中 I-SUBJ
心 I-SUBJ
主 B-PRED
任 I-PRED
雷 B-OBJ
德 I-OBJ
菲 I-OBJ
尔 I-OBJ
德 I-OBJ
（ O
左 O
圈 O
） O
和 O
美 B-SUBJ
国 I-SUBJ
国 I-SUBJ
立 I-SUBJ
卫 I-SUBJ
生 I-SUBJ
研 I-SUBJ
究 I-SUBJ
院 I-SUBJ
过 I-SUBJ
敏 I-SUBJ
和 I-SUBJ
传 I-SUBJ
染 I-SUBJ
病 I-SUBJ
研 I-SUBJ
究 I-SUBJ
所 I-SUBJ
主 B-PRED
任 I-PRED
福 B-OBJ
西 I-OBJ
（ O
右 O
圈 O
） O
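Generating such character-level BIO tags from span annotations can be sketched as below; `to_bio` and its span format are hypothetical helpers for illustration, not the project's actual code (an English sentence is used here purely so the character indices are easy to follow):

```python
def to_bio(text, spans):
    """Convert (start, end, role) character spans into per-character
    BIO tags; role is one of 'SUBJ', 'PRED', 'OBJ', everything else
    is tagged O. Returns a list of (character, tag) pairs."""
    tags = ["O"] * len(text)
    for start, end, role in spans:
        tags[start] = f"B-{role}"          # first character of the span
        for i in range(start + 1, end):    # remaining characters
            tags[i] = f"I-{role}"
    return list(zip(text, tags))

pairs = to_bio("ACME CEO Smith", [(0, 4, "SUBJ"), (5, 8, "PRED"), (9, 14, "OBJ")])
```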

The model used is ALBERT + Bi-LSTM + CRF. For an introduction to this model, see NLP (25): implementing the ALBERT + Bi-LSTM + CRF model.
The training results on the test set are as follows:

accuracy:  93.69%; precision:  76.26%; recall:  82.33%; FB1:  79.18
OBJ: precision:  80.47%; recall:  88.81%; FB1:  84.44  927
PRED: precision:  76.89%; recall:  83.69%; FB1:  80.14  1021
SUBJ: precision:  71.72%; recall:  75.32%; FB1:  73.48  983

The overall F1 value on the test set is close to 80%.
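As a sanity check, each FB1 value above is simply the harmonic mean of the corresponding precision and recall (small discrepancies in the last digit come from the rounded inputs):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(76.26, 82.33), 2))  # the overall FB1, approximately 79.18
```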

Text classification

Although the title of this article is about open-domain triple extraction, in practice my annotations still mostly cover people's titles, relationships between people, relationships between companies and people, the stars and directors of films and TV series, and so on. This produced more than 9,000 valid text-classification samples covering 1,365 relations in total. The 20 most frequent relations are as follows:
(Figure: the 20 most frequent relations)
                   

A text-classification training sample pairs the sentence with one candidate triple. For the example sentence above ("US CDC director Redfield (left circle) and NIH Institute of Allergy and Infectious Diseases director Fauci (right circle)"), the candidates include (CDC, director, Redfield) and (NIAID, director, Fauci), which are labeled 1, and crossed combinations such as (CDC, director, Fauci), which are labeled 0.

In the actual model training, the subject in the original text is replaced with 'S' repeated len(subject) times, the predicate with 'P', and the object with 'O'.
The model used is the classic deep-learning stack ALBERT + Bi-GRU + Attention + FC, with a maximum text length of 128, trained for 30 epochs with early stopping. The loss and accuracy curves during training are as follows:
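The replacement described above can be sketched as follows. One plausible reading, assumed here, is character-for-character replacement (S/P/O repeated to the span's length), which keeps the sentence length unchanged; `mask_triple` is an illustrative name, not the project's actual function:

```python
def mask_triple(sentence, subject, predicate, obj):
    """Replace the triple's spans with placeholder characters so the
    classifier sees the positions of s/p/o rather than their surface forms."""
    masked = sentence.replace(subject, "S" * len(subject))
    masked = masked.replace(predicate, "P" * len(predicate))
    masked = masked.replace(obj, "O" * len(obj))
    return masked

print(mask_triple("ACME CEO Smith retired", "ACME", "CEO", "Smith"))
# SSSS PPP OOOOO retired
```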
(Figure: training loss and accuracy curves)
Finally, the accuracy on the test set is about 96%.

Triple extraction of new data

After the above models are trained, we can wrap them as an HTTP service. For a new input sentence, we first use the sequence-labeling model to predict the subjects, predicates and objects; we then combine them into candidate triples, splice each triple with the sentence, and feed the result to the text-classification model to decide whether the triple is valid: 0 is invalid, 1 is valid.
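The two-stage pipeline can be sketched as below, with `tag_sentence` and `classify` standing in for calls to the trained sequence-labeling and text-classification services; both names, and the stub models, are placeholders illustrating only the control flow:

```python
from itertools import product

def extract_triples(sentence, tag_sentence, classify):
    """Stage 1: predict subject/predicate/object spans.
    Stage 2: score every s-p-o combination; keep the valid ones (label 1)."""
    spans = tag_sentence(sentence)  # e.g. {'SUBJ': [...], 'PRED': [...], 'OBJ': [...]}
    candidates = product(spans["SUBJ"], spans["PRED"], spans["OBJ"])
    return [t for t in candidates if classify(sentence, *t) == 1]

# stub models in place of the real HTTP services
fake_tagger = lambda s: {"SUBJ": ["CDC", "NIAID"], "PRED": ["director"],
                         "OBJ": ["Redfield", "Fauci"]}
fake_clf = lambda s, subj, p, o: 1 if (subj, o) in {("CDC", "Redfield"),
                                                    ("NIAID", "Fauci")} else 0
triples = extract_triples("...", fake_tagger, fake_clf)
```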
                  
(Figures: extraction examples on new sentences)
The extract_example module shows the extraction results, covering several novels and some news articles; see also another project, https://github.com/percent4/knowledge_graph_demo, as well as the knowledge-graph construction examples given in the earlier articles.

Summary

This project is relatively large and not suitable for a detailed description here; the author only gives the ideas and the approximate processing flow. For the concrete implementation, see the GitHub address below.
In the actual extraction process, a large number of useless triples are extracted from some sentences, so precision is not high. Because the project targets open-domain triple extraction, the effect is not as good as hoped. Possible ways to improve the extraction are as follows:

  • Data: at present there are only some 3,200 annotated samples, which is far from enough;
  • Model: the system currently takes the pipeline form; each stage works reasonably well on its own, but overall a pipeline is generally inferior to a joint extraction model;
  • If there are other kinds of triples you want to extract, it is suggested to add annotations for them;
  • Text prediction used to take a long time (this problem has been solved).

As an attempt at open-domain triple extraction, and with few prior articles or projects in this area, this work can be regarded as being at the exploration stage.

The source code and data are available in the GitHub project at https://github.com/percent4/spo_extract_platform.

My WeChat official account is "Python crawler and algorithm"; you are welcome to follow it.

Reference

  1. An attempt to build a knowledge graph with relation extraction: https://www.cnblogs.com/jclia
  2. NLP (26): an attempt at in-domain triple extraction: https://blog.csdn.net/jclian9
  3. NLP (25): implementing the ALBERT + Bi-LSTM + CRF model: https://blog.csdn.net/jclian9
  4. For example: https://blog.csdn.net/jclian9
  5. NLP (21) in practice: character relationship extraction: https://blog.csdn.net/jclian9
  6. Knowledge Graph: Methods, Practice and Applications, by Wang Haofen, Qi Guilin and Chen Huajun, Publishing House of Electronics Industry (China Industry and Information Publishing Group).