The maximum sequence length that the pre-trained BERT model can process is 512 tokens. When facing long text (document level), truncation or a sliding window is usually used to make the input fit this limit, but this processing loses task-relevant global information. As the figure below shows, in a QA task the answer to the question may lie outside the span BERT can handle, so it is difficult for the model to answer correctly. To solve this problem, I will share a NeurIPS 2020 paper, "CogLTX: Applying BERT to Long Texts", whose authors are from Tsinghua University and the Alibaba team. The core idea of this paper is to imitate the human mechanism of understanding and memory: use BERT to process long text via a proposed MemRecall mechanism that identifies the key sentences in the long text and then, through ranking, extraction, and fusion, assembles a new short text for downstream tasks. The code is open source at https://github.com/Sleepychord/CogLTX.
The figure above shows the CogLTX processing flow for three different tasks:
(1) Dynamically cut the input long text x into contiguous short blocks [x[0], …, x[n]];
(2) Through the MemRecall module, identify the task-relevant key text among [x[0], …, x[n]] to form the key blocks z;
(3) According to the task type, assemble the key blocks z into the input format the task requires, then feed them into the BERT model (hereafter called the reasoner) for downstream processing.
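Step (1) above can be sketched in a few lines. This is a minimal simplification assuming a whitespace tokenizer and a greedy packing of whole sentences; the paper itself splits on sentence boundaries under a per-block token budget, and `block_size` here is an illustrative parameter, not the paper's exact setting.

```python
# Step (1) sketch: split a long text into contiguous short blocks
# [x[0], ..., x[n]], each within a per-block token budget.
# Whitespace tokenization and greedy packing are simplifying assumptions.

def split_into_blocks(text, block_size=63):
    """Greedily pack whole sentences into blocks of <= block_size tokens."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    blocks, current = [], []
    for sent in sentences:
        tokens = sent.split()
        if current and len(current) + len(tokens) > block_size:
            blocks.append(current)  # current block is full, start a new one
            current = []
        # A sentence longer than the budget becomes its own truncated block.
        current.extend(tokens[:block_size])
    if current:
        blocks.append(current)
    return blocks

doc = ("BERT caps input at 512 tokens. Long documents exceed this. "
       "CogLTX splits them into blocks. Each block fits the budget.")
blocks = split_into_blocks(doc, block_size=8)
print(len(blocks))                      # number of blocks produced
print(all(len(b) <= 8 for b in blocks))  # every block respects the budget
```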
The figure above is easy to follow and matches the normal BERT pipeline. The key part is MemRecall: how to extract task-relevant key text from the long text, and substitute that key text for the long text as input, is the core of this paper.
Regarding MemRecall, the author starts from the following assumption:
Meaning of the above formula: feeding the long text x into the task model and feeding the key text blocks z into the task model produce similar, consistent results. Put differently, a long text contains many sentences irrelevant to the task, and deleting them does not affect task performance. This assumption is very reasonable.
Based on this assumption, the author introduces a judge() model to decide whether a block is a key block; the judgment is computed as follows:
Traverse the blocks with judge(); if the output is 1, the block is a key, task-relevant sentence, so add it to z and continue judging the remaining blocks in turn. When len(z) exceeds the maximum input length of BERT, that serves as the condition for terminating the judgment.
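The retrieval loop just described can be sketched as below. This is a single-round simplification: `toy_judge` is a hypothetical stand-in for the real BERT-based judge() (which in the paper also involves multi-step rehearsal and decay), and word overlap with the query is used only to make the example self-contained.

```python
# MemRecall loop sketch: score every candidate block with a judge, then
# greedily admit the highest-scoring blocks into z until BERT's length
# budget is exhausted. `toy_judge` is a stand-in for the real judge().

def mem_recall(query, blocks, judge, max_len=512):
    """Select key blocks z under a total token budget."""
    scored = sorted(blocks, key=lambda b: judge(query, b), reverse=True)
    z, used = [], len(query)  # the query itself occupies part of the budget
    for block in scored:
        if used + len(block) > max_len:  # stop once z would overflow BERT
            break
        z.append(block)
        used += len(block)
    return z

def toy_judge(query, block):
    # Hypothetical relevance score: word overlap with the query.
    return len(set(query) & set(block))

query = ["who", "proposed", "cogltx"]
blocks = [["cogltx", "was", "proposed", "by", "tsinghua"],
          ["bert", "handles", "512", "tokens"],
          ["unrelated", "filler", "sentence"]]
z = mem_recall(query, blocks, toy_judge, max_len=12)
print(z[0])  # the most relevant block is admitted first
```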
This raises the following problem: extracting key blocks from long text with judge() is a new task, so what loss function should be used to train it?
To solve this problem, the paper designs two kinds of loss functions, one for supervised learning and the other for unsupervised learning, as follows:
Formulas (3) and (4) give the cross entropy between the judged blocks and the ground-truth key blocks; formulas (5) and (6) mean removing or adding a sentence and deciding whether that fragment is a key block by observing the change in the original task's loss. This is clearly a step-by-step iterative process. The figure below shows the detailed computation of MemRecall.
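The unsupervised intervention idea behind formulas (5) and (6) can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `task_loss` is a hypothetical stand-in for the reasoner's downstream loss, and `threshold` is an assumed hyperparameter.

```python
# Intervention sketch: drop one block of z at a time and watch how the
# task loss changes; blocks whose removal barely hurts the loss are
# relabeled as non-key. `task_loss` stands in for the reasoner's loss.

def intervene(z, task_loss, threshold=0.1):
    """Return per-block key/non-key labels by ablating each block."""
    base = task_loss(z)
    labels = []
    for i in range(len(z)):
        ablated = z[:i] + z[i + 1:]         # z with block i removed
        delta = task_loss(ablated) - base   # loss increase without block i
        labels.append(delta > threshold)    # key iff removal hurts the task
    return labels

# Toy loss: pretend only blocks containing "answer" matter to the task.
def task_loss(z):
    return 0.0 if any("answer" in b for b in z) else 1.0

z = [["the", "answer", "is", "42"], ["some", "filler", "text"]]
print(intervene(z, task_loss))
```

The same ablation signal is what lets the judge be trained without human-annotated key blocks: the relabeled outputs become its training targets in the next iteration.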
The paper evaluates its method on multiple tasks, including reading comprehension, multi-hop question answering, text classification, and multi-label classification; below are the experimental results for some of these tasks.
The results show a particularly clear improvement: the method gains nearly 4%~5% on the dataset of each task. This also shows that capturing more of the useful information in the long text is certainly beneficial to the task.
There are many papers on processing long text; their common practice is to turn long text into short text, and different ideas arise on how to do so. I once used k-means clustering to identify key chunks, but that method depends heavily on the task at hand, because in some tasks it is not obvious which sentences are key, and clustering on vector representations is not very effective. After reading this paper's idea, I think it is worth learning from, so I am sharing it here~
More articles can be found on the author's official account: Natural Language Processing Algorithms and Practice.