# [Paper interpretation] A sharp tool for text classification: BERT fine-tuning

Time: 2022-01-21

Paper title: How to Fine-Tune BERT for Text Classification?
Authors: Xipeng Qiu's research group, Fudan University
Experiment code: https://github.com/xuyige/BERT4doc-Classification

## preface

Pre-trained models are now the go-to choice in competitions, basically the default baseline for NLP contests (there are pre-trained models for image classification too). Pre-trained models are very strong, and simple fine-tuning can already bring large gains. But in the later stage of a competition, once you have tuned a model such as BERT to a certain point, further improvement becomes difficult and the score hits a bottleneck. At that point you need task-specific fine-tuning tricks accumulated through experimental experience. This paper ran very large and thorough experiments, providing valuable BERT fine-tuning experience and methodology. When you need to apply BERT to a concrete task, you can follow the tuning route in this paper. I have tried it repeatedly in NLP competitions; there is always a trick that works for you. I recommend reading this paper!

## Abstract

The main purpose of this paper is to explore different BERT fine-tuning methods for text classification and to provide a general BERT fine-tuning solution. The paper explores three routes: (1) BERT's own fine-tuning strategies, including long-text processing, learning rates, selection of different layers, and so on; (2) further pre-training BERT within the target task, in-domain, and cross-domain; (3) multi-task learning. The fine-tuned BERT achieved state-of-the-art results on seven English datasets and the Sogou Chinese dataset. Interested readers can click the experiment code link above and try it out~

## Thesis background and research motivation

Text classification is a very classic NLP task: judging the category of a given text, for example whether its sentiment is positive or negative. There is already substantial research showing that pre-trained models based on large corpora bring very good results on text classification and other NLP tasks. One big benefit is that we do not need to train a new model from scratch, which saves a lot of resources and time. One common kind of pre-trained model is word embeddings, such as word2vec and GloVe vectors, or the context-aware word vector models CoVe and ELMo that handle polysemy; these word vectors are often used as additional features for NLP tasks. Another kind is sentence-level vectorized representation, such as ULMFiT. Others include OpenAI GPT and BERT.

Although Bert has made amazing achievements in many natural language understanding tasks, its potential has not been fully explored. There are few studies to further improve the performance of Bert on target tasks. The main purpose of this paper is to explore various ways to maximize the use of Bert to enhance its performance in text classification tasks.
The main contributions of this paper are as follows:

(1) A general solution for fine-tuning the pre-trained BERT model is proposed, consisting of three steps: (a) further pre-train BERT on the task training data or on in-domain data; (b) if multiple related tasks are available, optionally fine-tune BERT with multi-task learning; (c) fine-tune BERT for the target task.

(2) The paper studies fine-tuning methods for BERT on the target task, including long-text preprocessing, layer selection, layer-wise learning rates, and catastrophic forgetting.

(3) The paper achieves SOTA results on seven widely studied English text classification datasets and one Chinese news classification dataset.

## Paper core

• Fine-tuning strategies: there are many ways to use BERT for a target task. For example, different layers of BERT capture different levels of semantic and syntactic information. Which layer is best suited to the target task? How do we choose a better optimization algorithm and learning rate?

• Further pre-training: BERT is trained on a general-domain corpus whose data distribution differs from that of the target domain. A natural idea is to further pre-train BERT on target-domain data, and this really works. After fine-tuning hits a bottleneck, you can try ITPT on the competition corpus, i.e. continue pre-training. This has been successfully verified in the Haihua reading comprehension competition and in the "Quality analysis model of enterprise hidden danger investigation based on text mining" competition~

• Multi-task fine-tuning: even without a pre-trained LM, multi-task learning has shown its effectiveness in exploiting knowledge shared among multiple tasks. When multiple tasks are available in the target domain, an interesting question is whether fine-tuning BERT on all tasks simultaneously is still beneficial.

## Fine tuning strategy

1. Processing long text
We know that BERT's maximum sequence length is 512 tokens. The first problem in applying BERT to text classification is how to handle text longer than 512 tokens. The paper tries the following approaches.

Truncation methods
The key information of an article is usually at its beginning and end, so we can fine-tune BERT with three different ways of truncating the text.

1. head-only: keep the first 510 tokens; with the two special tokens ([CLS] and [SEP]) that is exactly 512;
2. tail-only: keep the last 510 tokens; likewise exactly 512 with the two special tokens;
3. head+tail: empirically, keep the first 128 and the last 382 tokens.
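The three truncation strategies can be sketched as follows. This is a minimal sketch over plain token lists; the `"[CLS]"`/`"[SEP]"` strings stand in for BERT's special tokens, and the 128/382 split follows the paper:

```python
def truncate(tokens, mode="head+tail", max_len=512):
    """Truncate a token list to fit BERT's 512-token limit.

    Two slots are reserved for the [CLS] and [SEP] special tokens,
    leaving 510 tokens of actual content.
    """
    budget = max_len - 2  # reserve [CLS] and [SEP]
    if len(tokens) <= budget:
        kept = tokens
    elif mode == "head-only":
        kept = tokens[:budget]           # first 510 tokens
    elif mode == "tail-only":
        kept = tokens[-budget:]          # last 510 tokens
    elif mode == "head+tail":
        # empirically best in the paper: first 128 + last 382 tokens
        kept = tokens[:128] + tokens[-382:]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["[CLS]"] + kept + ["[SEP]"]

tokens = [f"tok{i}" for i in range(1000)]
print(len(truncate(tokens, "head+tail")))  # 512
```

In practice the same slicing is applied to the tokenizer's id sequences before padding; the point is that every mode ends at exactly 512 positions.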

Hierarchical methods
The input text is first divided into k = L/510 segments, each fed into BERT to obtain k segment representations. The representation of each segment is the hidden state of the [CLS] token at the last layer; mean pooling, max pooling, or self-attention is then used to combine the representations of all segments.

The results in the table above show that head+tail truncation performs best on the IMDb and Sogou datasets, so subsequent experiments use this method.
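The hierarchical method above can be sketched as follows. To keep it runnable without BERT weights, a toy stand-in encoder maps each segment to a small vector; in reality `encoder` would return the last-layer [CLS] hidden state for that segment:

```python
from typing import Callable, List

def split_segments(tokens: List[str], seg_len: int = 510) -> List[List[str]]:
    """Divide a long text into k = ceil(L / 510) segments."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]

def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise average of the k segment-level [CLS] vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def max_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise maximum over the k segment vectors."""
    return [max(col) for col in zip(*vectors)]

def encode_long_text(tokens, encoder: Callable, pool=mean_pool):
    """encoder maps one segment to its [CLS] vector; pool combines them."""
    return pool([encoder(seg) for seg in split_segments(tokens)])

# toy stand-in for BERT: "embeds" a segment as [segment length, 1.0]
toy_encoder = lambda seg: [float(len(seg)), 1.0]
doc = ["w"] * 1200  # 1200 tokens -> 3 segments of 510, 510, 180
print(encode_long_text(doc, toy_encoder, pool=max_pool))  # [510.0, 1.0]
```

The paper's third option, self-attention over segment vectors, would replace `pool` with a small learned attention layer.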

2. Features from different layers
Each layer of BERT captures different features of the input text. The paper studies the effectiveness of features from different layers: the model is fine-tuned with each choice and the test error rate is recorded.

We can see that the last layer gives the best representation, and among the combination methods, max pooling over the last four layers works best.
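The two best-performing choices can be sketched as follows, assuming we already have the per-layer [CLS] vectors (e.g. a 12-entry list for BERT-base); the fake 3-dimensional vectors here are only for illustration:

```python
def last_layer(cls_per_layer):
    """Use the [CLS] vector from the final transformer layer."""
    return cls_per_layer[-1]

def max_pool_last4(cls_per_layer):
    """Element-wise max over the [CLS] vectors of the last four layers."""
    last4 = cls_per_layer[-4:]
    return [max(vals) for vals in zip(*last4)]

# fake per-layer [CLS] vectors for a 12-layer encoder (dim 3 for brevity)
layers = [[float(l), float(l) / 2, -float(l)] for l in range(1, 13)]
print(last_layer(layers))      # [12.0, 6.0, -12.0]
print(max_pool_last4(layers))  # [12.0, 6.0, -9.0]
```

With the `transformers` library, the per-layer vectors would come from running the model with hidden-state output enabled and slicing out the [CLS] position of each layer.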
3. Catastrophic forgetting
Catastrophic forgetting is usually a common criticism in transfer learning, which means that the pre trained knowledge will be forgotten in the process of learning new knowledge.
Therefore, the paper also studies whether BERT suffers from catastrophic forgetting. Fine-tuning BERT with different learning rates shows that a lower learning rate, such as 2e-5, is needed for BERT to overcome catastrophic forgetting; with an aggressive learning rate such as 4e-4, the training set fails to converge.

A practical takeaway: when a pre-trained model fails to converge, first check whether the hyperparameters, especially the learning rate, are set properly.
4. Layer-wise decreasing learning rate
The table below shows the performance of different base learning rates and decay factors on the IMDb dataset. We find that assigning a lower learning rate to lower layers is effective for fine-tuning BERT; an appropriate setting is ξ = 0.95 and lr = 2.0e-5.

How does BERT perform when different layers get different learning rates and decay factors? Split the parameters $\theta$ into $\{\theta^1, \dots, \theta^L\}$, where $\theta^l$ contains the parameters of the $l$-th layer. Each layer is updated with its own learning rate $\eta^l$, and the rates decrease from top to bottom as $\eta^{l-1} = \xi \cdot \eta^l$, where $\xi \le 1$ is the decay factor ($\xi = 1$ means all layers share the same learning rate, i.e. ordinary fine-tuning).

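The layer-wise rates are easy to compute; a minimal sketch with the paper's recommended base rate 2e-5 and decay factor 0.95 (the top layer gets the base rate, and each lower layer is scaled down by ξ):

```python
def layerwise_lrs(base_lr=2e-5, decay=0.95, num_layers=12):
    """Per-layer learning rates: eta^{l-1} = decay * eta^l.

    Index 0 is the lowest layer, index num_layers-1 is the top layer,
    which receives base_lr unchanged.
    """
    return [base_lr * decay ** (num_layers - 1 - l) for l in range(num_layers)]

lrs = layerwise_lrs()
print(f"top layer lr:    {lrs[-1]:.2e}")  # 2.00e-05
print(f"bottom layer lr: {lrs[0]:.2e}")
```

To apply this with a real optimizer, each layer's parameters would go into their own parameter group with the corresponding rate (e.g. AdamW accepts a list of per-group learning rates).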
## ITPT: continued pre-training

BERT is pre-trained on a general corpus. If we want to do text classification in a specific field, there will be some gap in data distribution, and further pre-training can be considered.

Within-task pre-training: further pre-train BERT on the training corpus of the target task
In-domain pre-training: further pre-train on a corpus from the same domain as the task
Cross-domain pre-training: further pre-train on corpora from other domains
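All three settings reuse BERT's original masked-language-model objective on the new corpus. A minimal pure-Python sketch of the standard 80/10/10 masking rule is below; the `[MASK]` id and vocabulary size are illustrative assumptions (they happen to match bert-base-uncased), and a real run would use a library data collator instead:

```python
import random

MASK_ID = 103       # assumed [MASK] token id
VOCAB_SIZE = 30522  # assumed vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking: pick ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted inputs, labels with -100 at non-target positions)."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)  # this position is a prediction target
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token (model must still predict it)
        else:
            labels.append(-100)  # ignored by the cross-entropy loss
    return inputs, labels

ids = list(range(1000, 1100))
corrupted, labels = mask_tokens(ids, rng=random.Random(0))
```

Continued pre-training then minimizes the MLM loss over `(corrupted, labels)` pairs drawn from the task or domain corpus before the usual fine-tuning step.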

BERT-ITPT-FiT means "BERT + withIn-Task Pre-Training + Fine-Tuning". The figure above shows that continuing pre-training for different numbers of steps on the IMDb dataset is beneficial.
2. In-domain and cross-domain further pre-training

We find that almost all further pre-trained models perform better than the original BERT base model on all seven datasets. In general, in-domain pre-training brings better performance than within-task pre-training. On the small-sentence TREC dataset, within-task pre-training hurts performance, while in-domain pre-training with the Yah. A. (Yahoo! Answers) corpus achieves better results on TREC.

This paper is compared with other models, and the results are shown in the table below:

We can see that the error rates of itpt, idpt and cdpt are reduced to varying degrees in different data sets compared with other models.

## Multi-task fine-tuning

All tasks share the BERT layers and the embedding layer; the only unshared layer is the final classification layer, with each task having its own.

The table above shows that multi-task fine-tuning improves over plain BERT fine-tuning, but multi-task fine-tuning on top of CDPT performs worse. Therefore, multi-task learning may not be necessary for improving generalization on related text classification subtasks.
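The sharing scheme above can be sketched with plain classes: one encoder object is shared by every task, while each task owns a private classification head. The toy encoder and heads are stand-ins; in a real model both would be neural modules:

```python
class SharedEncoder:
    """Stand-in for the shared BERT + embedding layers."""
    def __init__(self, dim=4):
        self.dim = dim

    def __call__(self, tokens):
        # toy "encoding": accumulate token lengths into a fixed-size vector
        v = [0.0] * self.dim
        for i, t in enumerate(tokens):
            v[i % self.dim] += len(t)
        return v

class TaskHead:
    """Per-task classification layer (here: a trivial deterministic rule)."""
    def __init__(self, num_classes):
        self.num_classes = num_classes

    def __call__(self, features):
        return int(sum(features)) % self.num_classes

class MultiTaskModel:
    """One shared encoder, one private head per task."""
    def __init__(self, task_classes):
        self.encoder = SharedEncoder()  # shared by all tasks
        self.heads = {name: TaskHead(n) for name, n in task_classes.items()}

    def predict(self, task, tokens):
        return self.heads[task](self.encoder(tokens))

model = MultiTaskModel({"sentiment": 2, "topic": 5})
```

During multi-task fine-tuning, gradients from every task's loss update `encoder`, while each `TaskHead` only sees its own task's batches.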

## Few-shot learning

Experiments show that BERT can significantly improve performance on small-scale data.

## Further pre training on Bert large model

The experimental results show that further pre-training and fine-tuning the BERT-large model on a specific task obtains the current optimal results.

Next comes the practical part: how to use different learning rate schedules.

## Different learning rate strategies

1. Constant Schedule

2. Constant Schedule with Warmup

3. Cosine with Warmup

4. Cosine With Hard Restarts

5. Linear Schedule with Warmup

6. Polynomial Decay with Warmup
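The schedules listed above (named after the corresponding `transformers` helpers such as `get_constant_schedule_with_warmup` and `get_linear_schedule_with_warmup`) can be sketched as multiplier functions of the training step; the learning rate at a step is the base rate times the multiplier. This is a simplified sketch of four of the six, not the library implementation:

```python
import math

def constant(step, warmup=0, total=1000):
    """1. Constant schedule: the learning rate never changes."""
    return 1.0

def constant_with_warmup(step, warmup, total=1000):
    """2. Linear warmup to the base rate, then constant."""
    return min(1.0, step / max(1, warmup))

def linear_with_warmup(step, warmup, total):
    """5. Linear warmup, then linear decay to zero at `total` steps."""
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

def cosine_with_warmup(step, warmup, total):
    """3. Linear warmup, then cosine decay to zero at `total` steps."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 2e-5
print(base_lr * linear_with_warmup(500, warmup=100, total=1000))
```

Cosine-with-hard-restarts repeats the cosine decay in several cycles, and polynomial decay replaces the linear decay with a power curve; both follow the same warmup-then-decay shape.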
