As I write this article, I am excited. Last June I wrote an article on building a knowledge graph with relation extraction, and since then I have been looking for a way to extract triples in the open domain. Many readers have asked me about this, and today I will give a reply. It is not the right answer (there is no right answer at present), but at least I have written down my own.
It has not been long since I came up with this extraction system, but my mood has shifted from initial excitement and ecstasy to plainness and, later, dissatisfaction. It turns out that triple extraction in the open domain is too difficult for one person's effort and intelligence to answer perfectly. That is why the title of this article says "attempt": it is only an attempt, not a solution. Still, I want to write something down, hoping it can offer readers a little inspiration; it is also a summary of my exploration over the past six months.
This article introduces an attempt at open-domain triple extraction.
The structure of the project is as follows:
The project is divided into four parts. The main modules are as follows:
- extract_example: uses the trained models to extract triples from novels and news, forming a small knowledge-graph example;
- sequence labeling: trains the sequence-labeling model on the annotated entity data;
- SPO annotation platform: the annotation platform, used to mark the subject, predicate and object, and whether each candidate triple is valid;
- text classification: trains the text-classification model used to judge whether an extracted triple is valid.
The flow chart of the extraction system is as follows:
Next, I will introduce them one by one.
The author uses Tornado to build a simple annotation platform. On the annotation page, the annotator inputs the sentence to be annotated (annotation is at the sentence level). The reason for adopting this annotation scheme is that we can mark the subjects, predicates and objects in a sentence; these entities then form all possible candidate triples, and we label each candidate 0 or 1 according to whether the triple is valid. In this way, we can do triple extraction in the open domain.
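The candidate-generation step described above is just a cross product over the marked elements. A minimal sketch (function and variable names are my own, not taken from the project):

```python
from itertools import product

def candidate_triples(subjects, predicates, objects):
    """Combine every annotated subject, predicate and object in a
    sentence into candidate triples; each candidate defaults to
    label 0 (invalid) until the annotator marks it valid."""
    return [{"spo": (s, p, o), "label": 0}
            for s, p, o in product(subjects, predicates, objects)]

# 2 subjects x 1 shared predicate x 2 objects -> 4 candidates
cands = candidate_triples(["US CDC", "NIAID"], ["director"], ["Redfield", "Fauci"])
print(len(cands))  # 4
```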
A simple annotation example is as follows:
Some explanation of the annotation result above. Annotation is at the sentence level, and the different elements are marked separately. Here, two subjects, one predicate and two objects are marked; the predicate is shared by the subjects and objects, so it only needs to be annotated once. Clicking "show s p o" displays a total of 4 candidate triples (2 subjects × 1 predicate × 2 objects). Subject, predicate and object are joined by a delimiter character, and 0/1 indicates whether the triple is valid; the default is 0.
In my spare time, I annotated more than 3,200 sentences in total. For sequence labeling this yields more than 3,200 samples; for text classification, where each candidate triple is one sample, it yields more than 9,000 samples.
For the annotation example above, the following labeled sequence is formed:
(In the original Chinese, tagging is done per character; an English word-level equivalent of the same sentence is shown below.)

US CDC → B-SUBJ I-SUBJ … I-SUBJ
director → B-PRED I-PRED
Redfield → B-OBJ I-OBJ … I-OBJ
( left circle ) and → O O O O O
US NIH NIAID → B-SUBJ I-SUBJ … I-SUBJ
director → B-PRED I-PRED
Fauci → B-OBJ I-OBJ
( right circle ) → O O O O
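Entities can be recovered from such a BIO sequence with a small decoding routine. A sketch in Python (not the project's actual code; the `sep` parameter is my own addition so the same routine works for Chinese characters, joined with `""`, and English words, joined with `" "`):

```python
def decode_bio(tokens, tags, sep=""):
    """Collect (entity, type) spans from a BIO-tagged sequence.
    A B-X tag opens a span; consecutive I-X tags extend it."""
    spans, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                spans.append((sep.join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == cur_type:
            cur.append(tok)
        else:  # "O" tag or inconsistent I- tag closes any open span
            if cur:
                spans.append((sep.join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        spans.append((sep.join(cur), cur_type))
    return spans

toks = ["CDC", "director", "Redfield", "(", "left", "circle", ")"]
tags = ["B-SUBJ", "B-PRED", "B-OBJ", "O", "O", "O", "O"]
print(decode_bio(toks, tags, sep=" "))
# [('CDC', 'SUBJ'), ('director', 'PRED'), ('Redfield', 'OBJ')]
```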
For an introduction to this model, refer to the earlier article NLP (25): Implementing the ALBERT + Bi-LSTM + CRF model.
The training results on the test set are as follows:
accuracy: 93.69%; precision: 76.26%; recall: 82.33%; FB1: 79.18
OBJ: precision: 80.47%; recall: 88.81%; FB1: 84.44 (927)
PRED: precision: 76.89%; recall: 83.69%; FB1: 80.14 (1021)
SUBJ: precision: 71.72%; recall: 75.32%; FB1: 73.48 (983)
The overall F1 score on the test set is close to 80%.
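As a sanity check, the FB1 values above are just the harmonic mean of precision and recall, F1 = 2PR / (P + R):

```python
def f1(p, r):
    # harmonic mean of precision and recall, rounded to 2 decimals
    return round(2 * p * r / (p + r), 2)

print(f1(76.26, 82.33))  # 79.18  (overall)
print(f1(71.72, 75.32))  # 73.48  (SUBJ)
```

(The OBJ and PRED rows agree to within a hundredth of a point; the small differences come from the evaluation script rounding precision and recall before printing.)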
Although the title of this article says open-domain triple extraction, in practice my annotations concentrate on a few areas: people's titles, relationships between people, relationships between companies and people, the leading actors of films and TV series, director information, and so on. These annotations produce more than 9,000 valid text-classification samples covering 1,365 distinct relations. The 20 relations with the largest sample counts are as follows:
To illustrate how the classification samples are formed: the annotation example above, "Redfield, director of the US CDC (left circle), and Fauci, director of NIAID at the NIH (right circle)", yields four classification samples, one per candidate triple. The matched combinations (US CDC, director, Redfield) and (NIAID, director, Fauci) are labeled 1; the crossed combinations (US CDC, director, Fauci) and (NIAID, director, Redfield) are labeled 0.
In the actual model training, the subject in the original text is replaced with the character "s" repeated len(subject) times, the predicate with "p" * len(predicate), and the object with "o" * len(object).
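The replacement can be sketched as follows (a minimal sketch; the project's real preprocessing code may differ):

```python
def mask_spo(sentence, subj, pred, obj):
    """Replace each triple element with a same-length run of a marker
    character, so the classifier sees the positions of the candidate
    subject/predicate/object rather than their surface forms."""
    for text, marker in ((subj, "s"), (pred, "p"), (obj, "o")):
        sentence = sentence.replace(text, marker * len(text))
    return sentence

print(mask_spo("Fauci is director of NIAID", "Fauci", "director", "NIAID"))
# sssss is pppppppp of ooooo
```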
The classic deep-learning combination ALBERT + Bi-GRU + Attention + FC is used, with a maximum text length of 128, trained for 30 epochs with an early-stopping mechanism. The loss and accuracy curves during training are as follows:
Finally, the accuracy on the test set is about 96%.
Triple extraction on new data
After the above models are trained, they can be wrapped as HTTP services. For a new input sentence, the sequence-labeling model first predicts the subjects, predicates and objects; these are combined into candidate triples, each of which is concatenated with the sentence and fed into the text-classification model to judge whether the triple is valid (0 = invalid, 1 = valid).
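The end-to-end flow can be sketched with stub models standing in for the two trained networks (hypothetical function names of my own; in the project the two models are called over HTTP):

```python
from itertools import product

def extract_triples(sentence, tag_model, clf_model):
    """Pipeline: sequence labeling -> candidate combination ->
    binary classification. tag_model returns (subjects, predicates,
    objects) for a sentence; clf_model returns 1 if a candidate
    triple is valid in that sentence, 0 otherwise."""
    subjects, predicates, objects = tag_model(sentence)
    triples = []
    for s, p, o in product(subjects, predicates, objects):
        if clf_model(sentence, (s, p, o)) == 1:
            triples.append((s, p, o))
    return triples

# Stub models mimicking the annotation example:
tag_stub = lambda sent: (["CDC", "NIAID"], ["director"], ["Redfield", "Fauci"])
valid = {("CDC", "director", "Redfield"), ("NIAID", "director", "Fauci")}
clf_stub = lambda sent, spo: 1 if spo in valid else 0

print(extract_triples("some sentence", tag_stub, clf_stub))
# [('CDC', 'director', 'Redfield'), ('NIAID', 'director', 'Fauci')]
```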
extract_example: for the extraction results, covering several novels and some news articles, refer to another project: https://github.com/percent4/knowledge_graph_demo. You can also refer to the knowledge-graph construction examples given in that article.
Moreover, this project is relatively large and not suitable for a detailed walkthrough here; the author only gives the ideas and the approximate processing flow. For the concrete implementation code, please refer to the GitHub address below.
In the actual extraction process, some sentences yield a large number of useless triples: recall is high, but precision suffers. This is because the project targets open-domain triple extraction, so the effect is not as good as hoped. Possible ways to improve the extraction results:
- Data: at present there are only a little more than 3,200 annotated samples; more annotation should help;
- Model: the system is currently a pipeline, and each stage performs reasonably well on its own, but in general pipelines underperform joint extraction models;
- If there are other kinds of triples you want to extract, it is suggested to add annotations of that kind;
- Text prediction used to take a long time (this problem has since been solved).
As an attempt at open-domain triple extraction, with few prior articles or projects in this area, this work is still at the exploration stage.
The source code and data are available in the GitHub project at https://github.com/percent4/spo_extract_platform.
My WeChat official account is "Python crawler and algorithm"; welcome to follow~
- An attempt to build a knowledge graph with relation extraction: https://www.cnblogs.com/jclia
- NLP (26): an attempt at open-domain triple extraction: https://blog.csdn.net/jclian9
- NLP (25): implementing the ALBERT + Bi-LSTM + CRF model: https://blog.csdn.net/jclian9
- For example: https://blog.csdn.net/jclian9
- NLP (21): a practical example of character relation extraction: https://blog.csdn.net/jclian9
- Knowledge Graph: Methods, Practice and Applications, by Wang Haofen, Qi Guilin and Chen Huajun, Publishing House of Electronics Industry (China Industry and Information Publishing Group).