Reading the BERT paper paragraph by paragraph with Mu Li (notes)


Links: paper, Chinese translation, code, video. Most of the content of these notes comes from Mu Li’s video; I have only organized and supplemented it. I recommend watching the original video, which is really good.

Suggested learning order: a 5-minute overview -> Mu Li’s walkthrough of the paper -> an illustrated or hand-derived BERT -> a code walkthrough. All strongly recommended.

The illustrated BERT series is well suited to going through after finishing the paper, to get a feel for each small part. I really like this uploader.

1 – title + author

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Pre-training: train a model on a large dataset; the main purpose of that model is to be used when training on other tasks
  • Deep bidirectional Transformers: a deep Transformer that uses context from both directions
  • Language understanding: a broader goal; the original Transformer was mainly used for machine translation (MT)
  • Conclusion: BERT is a deep, bidirectional Transformer that is pre-trained and then used for language understanding tasks.

2 – Abstract

First paragraph of the abstract: which two works are related and how BERT differs. BERT is a modification built on GPT and ELMo.

BERT is a new language representation model: Bidirectional Encoder Representations from Transformers. Unlike ELMo and the Generative Pre-trained Transformer (GPT):

  • BERT learns deep bidirectional representations from unlabeled text through pre-training (joint conditioning combines left and right context)
    • GPT is unidirectional: it uses the left context to predict the future
    • BERT is bidirectional: it uses both the left and the right context
  • A pre-trained BERT can be fine-tuned by adding just one output layer, and reaches state-of-the-art results on many tasks (question answering, inference) without task-specific architecture changes
    • ELMo is based on RNNs; the architecture has to be adjusted for each downstream task
    • GPT is based on Transformers; for a downstream task only the top layer changes
    • BERT is based on Transformers; for a downstream task only the top layer changes

In short, BERT’s characteristics are: pre-trained, deep, bidirectional.

Second paragraph of the abstract: benefits of BERT.

Simple and empirically powerful

  • Better results on 11 different NLP tasks, with specific comparisons (both the absolute accuracy and the improvement relative to earlier work)


3 – Introduction

First paragraph of the introduction: background on the research direction this paper belongs to.

  1. Language model pre-training can improve the performance of NLP tasks
  2. NLP tasks fall mainly into two categories: sentence-level tasks, such as sentiment classification; and token-level tasks, such as NER (person names, street names), which need fine-grained output

Note that pre-training in NLP existed long before BERT; BERT is what made it break into the mainstream.

Second paragraph of the introduction: an expansion of the first paragraph of the abstract.

  • There are two common strategies for applying pre-trained language representations:
    • Feature-based
      • For each downstream task, build a neural network specific to that task
      • The pre-trained representations (such as word embeddings) are fed into the model as additional features alongside the input, serving as a good feature representation
      • Example: ELMo, which uses RNNs
    • Fine-tuning based
      • When the pre-trained model is applied to a downstream task, only the top layer needs to change
      • The pre-trained parameters are fine-tuned on the downstream data
      • Example: GPT

The purpose of introducing others’ work: to set the stage for your own.

  • ELMo and GPT both use unidirectional language models with the same objective function during pre-training
    • After all, a language model is unidirectional: it predicts the future. It is not given the first and third sentences and asked to predict the second.

Third paragraph of the introduction: limitations of current techniques

  • A standard language model is unidirectional, which limits the choice of model architecture
    • GPT’s left-to-right architecture can only read a sentence from left to right.
    • For sentence-level sentiment classification, however, reading both left to right and right to left should be allowed, since the input is a complete sentence.
    • Even for token-level tasks such as question answering (QA), the model should be able to look at the whole sentence when choosing the answer, rather than going step by step from left to right.
    • If context from both directions can be incorporated, task performance should improve.
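
The difference between the two settings can be made concrete with attention masks. The sketch below is a minimal illustration, assuming NumPy and toy all-zero scores; the function name `attention_weights` is hypothetical and nothing here is taken from the paper’s code.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over attention scores, optionally left-to-right only."""
    n = scores.shape[0]
    if causal:
        # Position i may only attend to positions j <= i (GPT-style).
        allowed = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax; masked entries become exactly 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.zeros((4, 4))   # toy scores: every pair of tokens equally relevant
left_to_right = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
print(left_to_right)   # row 0 puts all its weight on token 0
print(bidirectional)   # every row spreads weight over all 4 tokens
```

With the left-to-right mask the first token sees only itself, while without it every token sees the full sentence, which is the property the points above argue for.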

Knowing the limitations of related work, plus an idea for overcoming them, leads to the fourth paragraph of the introduction: how is it solved?

BERT uses a masked language model (MLM) as its pre-training objective to lift the unidirectionality constraint of language models.

  • What does the MLM do?
    • Each time, it randomly selects some of the input tokens and masks them. The objective is to predict the masked words, much like a fill-in-the-blank (cloze) exercise.
  • How does the MLM differ from a standard language model?
    • The MLM can see both left and right context, which is the basis for pre-training a deep bidirectional Transformer
  • What else does BERT have besides MLM?
    • NSP: next sentence prediction
    • Judge whether two sentences are adjacent in the original text or randomly sampled, which teaches sentence-level information

Fifth paragraph of the introduction: contributions of the paper

  1. It shows the importance of bidirectional information
    1. GPT uses only unidirectional information
    2. Peters et al. (2018) also tried bidirectionality, but only as a shallow concatenation of independently trained left-to-right and right-to-left models
    3. BERT makes better use of bidirectional pre-training
  2. If the pre-trained model is good enough, there is no need for heavy task-specific architecture changes
    1. BERT is the first fine-tuning based model to achieve strong results across a range of sentence-level and token-level tasks
  3. BERT is open source and easy to use

4 – Conclusion

Recent experiments show that unsupervised pre-training works very well, so that even low-resource tasks can be trained

This paper extends those results to deep bidirectional architectures, so that the same pre-trained model can handle a large number of NLP tasks

Recent research on transfer learning with language models shows that:

  • Unsupervised pre-training is an important part of many language understanding systems
  • These findings let even low-resource tasks benefit from deep unidirectional architectures
  • The paper generalizes these findings further to deep bidirectional architectures, which allows the same pre-trained model to successfully handle a wide range of NLP tasks

A short summary

  • ELMo uses bidirectional information, but its RNN architecture is relatively old
  • GPT uses the newer Transformer architecture, but only unidirectional information
  • BERT = ELMo’s bidirectional information + GPT’s newer Transformer architecture. After all, many language tasks are not about predicting the future but about filling in blanks
    • Combine the two and show that bidirectionality is useful

Mu Li’s comment: for “A + B” stitching work, or work that uses technique C to solve a problem in field D, do not think the idea is too small to be worth writing up. Write it plainly and simply; if it is easy to use, it may well catch on.

5 – Related work

2.1 Unsupervised Feature-based approaches

Unsupervised, feature-based work: word embeddings, ELMo, etc.

Learning word representations has been an active research area for many years. Pre-trained word embeddings are considered an integral part of modern NLP systems, offering a significant improvement over embeddings learned from scratch.

These methods have been extended to coarser granularities, such as sentence embeddings or paragraph embeddings. As with traditional word embeddings, these learned representations are typically used as features in downstream models.

ELMo generalizes traditional word embedding research along a different dimension: it extracts context-sensitive features from language models. When these contextual word embeddings are integrated into existing task-specific architectures, ELMo advances the state of the art on several major NLP benchmarks.

2.2 Unsupervised Fine-tuning approaches

Unsupervised pre-training -> fine-tuning: GPT and others, a more recent trend of transfer learning from language models (LMs)

Sentence or document encoders that produce contextual token representations are pre-trained on unlabeled text and then fine-tuned on a supervised downstream task.

The advantage of these methods is that few parameters need to be learned from scratch. Partly because of this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark.

2.3 Transfer Learning from Supervised Data

Transfer learning from supervised, labeled data

Large supervised NLP datasets: natural language inference and machine translation

This works well in CV: train on ImageNet, then transfer

It works less well in NLP: the tasks may differ too much from one another, and the labeled NLP data is far from enough

What BERT found:

  • In NLP, training on a large amount of unlabeled data works better than training on a smaller labeled dataset
  • Many CV tasks have also started to adopt this idea, training models on large numbers of unlabeled images

6 – BERT model

BERT has two steps: pre-training + fine-tuning

  • Pre-training: train on unlabeled data
  • Fine-tuning: the same BERT model is used, with its weights initialized from the pre-trained ones
    • All weights take part in training during fine-tuning, using labeled data
    • Every downstream task creates a new BERT model (initialized from the pre-trained parameters) and fine-tunes its own copy on its own labeled data

Pre-training + fine-tuning is used a lot in CV

Mu Li’s comment: it is good to include an introduction to pre-training and fine-tuning, even though the technique has been used before.
Assuming that readers already know a technique and only giving a reference will confuse them. A paper should be self-contained and simple, so that a reader unfamiliar with pre-training and fine-tuning is not blocked from understanding it.

Schematic diagram of pre-training + fine-tuning:

Downstream task: create an identical BERT model whose weights are initialized from the pre-trained weights.

MNLI, NER, and SQuAD are downstream tasks, each with its own labeled data. Training continues on that data, so each downstream task ends up with its own version of BERT.

Model architecture

Multi-layer bidirectional Transformer encoder: the encoder of a multi-layer bidirectional Transformer, based on the original Transformer paper and code

Three hyperparameters are adjusted:

  • L: number of Transformer blocks
  • H: hidden size
  • A: number of heads in multi-head self-attention

There are two main sizes:

  • BERT_BASE (110 million learnable parameters, L = 12, H = 768, A = 12)
  • BERT_LARGE (340 million learnable parameters, L = 24, H = 1024, A = 16)

The complexity of the BERT model is linear in the number of layers L and quadratic in the width H.

Because the depth doubles (L: 12 -> 24), the width is chosen (H: 768 -> 1024) so that its square also grows by roughly a factor of two.

A = 16 for BERT_LARGE because the dimension of each head is fixed at 64: as the width increases, so does the number of heads.
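
As a quick check of the arithmetic behind these choices, using only the sizes listed above and the fixed per-head dimension of 64:

$$\left(\frac{1024}{768}\right)^2 = \left(\frac{4}{3}\right)^2 \approx 1.78, \qquad A_{\text{BASE}} = \frac{768}{64} = 12, \qquad A_{\text{LARGE}} = \frac{1024}{64} = 16$$

So the squared width grows by a little under 2x while the depth grows by exactly 2x.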

The hyperparameters of BERT_BASE are similar to GPT’s, so the two models can be compared fairly; BERT_LARGE is for topping the leaderboards

From hyperparameters to the number of learnable parameters + a review of the Transformer architecture

This article explains it clearly: “Understanding the parameter calculation of the BERT base model”

A note on word vectors: “Word vector”
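
The conversion from hyperparameters to a parameter count can be approximated in a few lines. This is a rough sketch, not the exact accounting: it assumes a 30,000-token vocabulary and a feed-forward inner size of 4H, and it ignores biases, layer normalization, position and segment embeddings, and any output head. The function name `approx_bert_params` is hypothetical.

```python
def approx_bert_params(vocab_size, hidden, layers):
    """Rough learnable-parameter count from the three main size settings."""
    embedding = vocab_size * hidden            # token embedding table
    attention = 4 * hidden * hidden            # Q, K, V and output projections
    feed_forward = 2 * hidden * (4 * hidden)   # H -> 4H -> H
    return embedding + layers * (attention + feed_forward)

base = approx_bert_params(30_000, 768, 12)
large = approx_bert_params(30_000, 1024, 24)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M")    # ~108M, close to the quoted 110M
print(f"BERT_LARGE ~ {large / 1e6:.0f}M")   # ~333M, close to the quoted 340M
```

The per-layer term is 12H², which is why the count is linear in L and quadratic in H.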

Input and output (the part shared by pre-training and fine-tuning)

  • Some downstream tasks take one sentence as input and some take two. To handle both, BERT’s input can be either a single sentence or a sentence pair (e.g. question and answer)

    • A “sentence” here is a contiguous span of text, not necessarily a linguistic sentence
    • The input is called a sequence, which can be one sentence or two
  • This differs from the Transformer

    • The Transformer is trained on a sequence pair: the encoder and the decoder each take one sequence as input
    • BERT has only an encoder, so to process two sentences it has to combine them into a single sequence

BERT tokenization

  • WordPiece
    • If text is split on spaces, each word is one token. With a lot of data the vocabulary becomes very large, on the order of millions, and most of the learnable parameters end up in the embedding layer
    • WordPiece instead: if a word occurs rarely, split it and keep only its frequently occurring subsequences (often a word root)
      • The result is a vocabulary of 30K tokens consisting of frequently occurring words and subsequences
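
To see how such a vocabulary is applied to a word, here is a minimal sketch of greedy longest-match-first splitting, the way WordPiece vocabularies are commonly applied at tokenization time. The toy vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration; building the vocabulary itself is not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: treat the word as unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"flight", "##less", "bird", "##s"}      # hypothetical toy vocabulary
print(wordpiece_tokenize("flightless", toy_vocab))   # ['flight', '##less']
print(wordpiece_tokenize("birds", toy_vocab))        # ['bird', '##s']
```

A rare word is thus covered by a few frequent pieces instead of needing its own vocabulary entry.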

Input sequence composition of BERT

  • [CLS] + [SEP]
  • The sequence always begins with [CLS]; its output represents sequence-level information for the whole input
    • BERT uses the Transformer encoder, whose self-attention layer relates every input token to all the other tokens
    • So even though [CLS] is placed first, it can still see all the tokens after it. It does not matter that it comes first; it does not have to be placed last
  • Two ways to distinguish the two sentences:
    • Add [SEP] after each sentence as a separator
    • Learn an embedding that marks whether a token belongs to the first or the second sentence (rather extravagant)

The figure above shows the whole pre-training process

  • Each token entering BERT first gets its embedding
  • The whole sequence goes through BERT, which outputs a sequence of results
    • The output of the last Transformer block is BERT’s representation of each token. Additional output layers are added afterwards to get the desired result

The figure above shows the input embeddings, i.e. how the embedding of each token is obtained:

  • For a given token
  • BERT input representation = token embedding + segment embedding + position embedding
    • Token embedding: each token has a corresponding vector
    • Segment embedding: whether the token belongs to sentence A or sentence B
    • Position embedding: the input is the position of the token in the sequence, starting from 0 (0, 1, 2, 3, 4, ... up to the maximum length). The embedding values are learned (in the Transformer they are fixed in advance)
    • The three vectors are added together, combining their information
  • A sequence of tokens becomes a sequence of vectors, which then enters the Transformer blocks
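
The sum of the three embeddings can be sketched with plain lookup tables. This is a toy illustration assuming NumPy, random (untrained) tables, and made-up sizes and token ids; in BERT all three tables are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8   # toy sizes, not BERT's

# Three lookup tables; here they are random stand-ins for learned weights.
token_table = rng.normal(size=(vocab_size, hidden))
segment_table = rng.normal(size=(2, hidden))        # sentence A = 0, sentence B = 1
position_table = rng.normal(size=(max_len, hidden))

def input_representation(token_ids, segment_ids):
    """Sum token, segment and position embeddings for one sequence."""
    positions = np.arange(len(token_ids))            # 0, 1, 2, ...
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

# Hypothetical ids standing for: [CLS] a a [SEP] b [SEP]
token_ids = np.array([1, 7, 8, 2, 9, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
x = input_representation(token_ids, segment_ids)
print(x.shape)   # (6, 8): one vector per token
```

Note that the two [SEP] tokens share a token id but still get different vectors, because their segment and position differ.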

1.Pre-training BERT

Key factors in pre-training: the objective function and the pre-training data

Task 1: Masked LM

Why is bidirectional good? What is MLM? A cloze test

See the earlier sections for details; the repetition is omitted here

  • Each token in the sequence produced by WordPiece has a 15% probability of being selected for masking
    • Special tokens are never replaced: the first token [CLS] and the separator token [SEP]
    • If the input sequence has length 1000, about 150 tokens are predicted.
  • A problem introduced by MLM: the data seen in pre-training and in fine-tuning differ. About 15% of the pre-training input is [MASK], but [MASK] never appears during fine-tuning
    • Solution: for the 15% of tokens selected for masking:
      • 80% of the time, replace the token with [MASK]
      • 10% of the time, replace it with a random token
      • 10% of the time, leave the original token unchanged. Its output $T_i$ is still used for prediction (this case resembles fine-tuning, where the input contains no [MASK])
      • The 80/10/10 split was chosen through an ablation study
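
The selection and replacement rule can be sketched as follows. This is a simplified illustration, assuming each token is selected independently with probability 0.15 and using a toy vocabulary; the function name `mask_tokens` is hypothetical and this is not the paper’s data pipeline.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token to predict})."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue                          # special or not selected: untouched
        targets[i] = tok                      # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return corrupted, targets

rng = random.Random(0)
vocab = ["the", "man", "went", "to", "store", "milk"]   # hypothetical toy vocabulary
tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"]
corrupted, targets = mask_tokens(tokens, vocab, rng)
print(corrupted, targets)
```

Whatever happens to the input token, the prediction target stored in `targets` is always the original token.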

Task 2: Next Sentence Prediction (NSP)

  • In question answering and natural language inference, the input is a sentence pair
    • It would be great if BERT could learn sentence-level information
  • The input sequence contains two sentences, A and B; 50% of the examples are positive and 50% negative
    • That is, 50% of the time B really follows A, and 50% of the time B is a randomly sampled sentence
      • Positive example: “the man went to the store”, then “he bought a gallon of milk”. IsNext
      • Negative example: “the man went to the store”, then “penguins are flightless birds”. NotNext
      • Incidentally, flightless is split into two tokens by WordPiece: flight and ##less
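
Constructing such training pairs is straightforward. The sketch below is a minimal illustration under simple assumptions: a document is a list of sentences, negatives are drawn from a separate pool, and the function name `make_nsp_example` is hypothetical.

```python
import random

def make_nsp_example(document, corpus_sentences, index, rng):
    """Build one (sentence A, sentence B, label) triple for NSP."""
    sentence_a = document[index]
    if rng.random() < 0.5:
        # Positive: B is the sentence that really follows A.
        return sentence_a, document[index + 1], "IsNext"
    # Negative: B is a sentence sampled at random from elsewhere.
    return sentence_a, rng.choice(corpus_sentences), "NotNext"

document = ["the man went to the store", "he bought a gallon of milk"]
other_sentences = ["penguins are flightless birds"]   # stand-in for the rest of the corpus
rng = random.Random(0)
print(make_nsp_example(document, other_sentences, 0, rng))
```

The labels come for free from the document order, so no manual annotation is needed.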

These two tasks are not really useful in real scenarios by themselves. What we want is the by-product: a “feel” for the language and good representations of the words in a sentence. After all, we do not want to label that much data, so it is good to learn from the relationships within the sentences themselves

Pre-training data

2 datasets: BooksCorpus (800M words) + English Wikipedia (2,500M words)

Use whole documents, one after another, rather than randomly shuffled sentences: a document-level corpus rather than a shuffled sentence-level corpus

The Transformer can indeed handle long sequences, so feeding in long contiguous text naturally works better

2.Fine-tuning BERT


  • How BERT differs from encoder-decoder architectures (the Transformer is encoder-decoder)
    • The whole sentence pair is fed into BERT together, so self-attention lets the two sentences look at each other
    • In the Transformer, the encoder generally cannot see what is on the decoder side
    • BERT does better here, but the price is that it cannot do machine translation the way the Transformer does
      • The Transformer decoder uses a mask so that each position only sees what comes before it, which is what makes machine translation possible

Doing downstream tasks

Design the task-specific input and output according to the downstream task

  • The model itself barely changes: add an output layer with softmax to get the label
  • The key question is how to turn the task input into the sentence pair that BERT expects
    • We simply plug the task-specific inputs and outputs into BERT and fine-tune all parameters end to end
    • If there are two sentences, they are sentences A and B: a sentence pair for paraphrasing, hypothesis and premise for entailment, or question and passage for question answering
    • If there is only one sentence, for example for sentence classification, there is simply no B
      • For example, text classification or sequence tagging
      • Depending on the downstream task, either the [CLS] representation is fed into an output layer for classification (take the output of the first token, [CLS]), as in entailment or sentiment analysis
      • Or the token representations are fed into an output layer for token-level tasks: take the outputs of the relevant tokens and produce the sequence tagging or question answering output. It is nearly the same either way
  • Remember to add a softmax at the end
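
Packing one or two sentences into a single input sequence can be sketched as below. It is a minimal illustration on already-tokenized text; the function name `pack_input` is hypothetical.

```python
def pack_input(tokens_a, tokens_b=None):
    """Build the token sequence and segment ids for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:                  # sentence pair: append B and its [SEP]
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Single sentence (e.g. sentiment classification): there is no B.
print(pack_input(["great", "movie"]))
# Sentence pair (e.g. question answering: question = A, passage = B).
print(pack_input(["who", "won"], ["alice", "won", "the", "race"]))
```

The model is the same in both cases; only the packed input and the output layer on top change per task.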

Fine-tuning is also much cheaper than pre-training: about 1 hour on a TPU, or a few hours on a GPU.

7 – Experiments

One important thing is spelled out here: how to construct the input and output for each downstream task

1.GLUE

General Language Understanding Evaluation

  • Multiple datasets
  • Sentence-level tasks

Output:

  • Take BERT’s output vector for [CLS] -> learn an output layer W and use softmax to classify the label
  • $\log(\mathrm{softmax}(CW^T))$
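
The classification head is small enough to write out. This sketch assumes NumPy, a random vector standing in for BERT’s [CLS] output C, and toy sizes; W has one row per label and is the only newly introduced parameter.

```python
import numpy as np

def classify(cls_vector, W):
    """Log-probabilities over K labels: log(softmax(C W^T))."""
    logits = cls_vector @ W.T                  # shape (K,)
    logits = logits - logits.max()             # stabilize the exponentials
    return logits - np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
hidden, num_labels = 8, 3                      # toy sizes
C = rng.normal(size=hidden)                    # stand-in for BERT's [CLS] output
W = rng.normal(size=(num_labels, hidden))      # the new output layer
log_probs = classify(C, W)
print(log_probs.argmax(), np.exp(log_probs).sum())   # predicted label; probabilities sum to 1 up to rounding
```

During fine-tuning the loss is the negative log-probability of the correct label, and both W and all of BERT’s weights are updated.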

Table 1 shows the performance of Bert in classification tasks

2.SQuAD v1.1

Stanford Question Answering Dataset

  • Given a passage and a question, the answer is a span within the passage
    • The task is to find where the answer begins and ends
    • That is, for each token, judge whether it is the start or the end of the answer
  • Concretely, learn two vectors S and E, used to compute for each token the probability that it is the start of the answer and the probability that it is the end

The probability that each token is the start of the answer is computed as follows; the computation for the end with E is analogous:

  • Take the dot product of S with the output $T_i$ of each token in the second sentence, then apply softmax to get a normalized probability
    $$P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$$

Fine-tuning in the paper uses epochs = 3 (three passes over the data), learning rate = 5e-5, batch_size = 32

  • Experiments found that fine-tuning BERT with these settings is very unstable: with the same hyperparameters and the same dataset, the variance over 10 training runs is especially large.
  • The reason turns out to be simple: 3 epochs is too few, and training for more epochs may work better
  • In addition, the optimizer is an incomplete version of Adam. That is fine when BERT is trained for a long time, but when training is short the full version of Adam is needed
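
One way to read “incomplete Adam” is Adam without the bias-correction step; that reading is an assumption here, not something these notes state. Under that assumption, and with the common defaults beta1 = 0.9 and beta2 = 0.999, the sketch below compares the size of the very first update for a single scalar parameter.

```python
import math

def first_adam_step(grad, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8,
                    bias_correction=True):
    """Size of the very first Adam update for a single scalar parameter."""
    m = (1 - beta1) * grad           # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2      # second-moment estimate after one step
    if bias_correction:
        m /= 1 - beta1               # undo the bias toward the zero initialization
        v /= 1 - beta2
    return lr * m / (math.sqrt(v) + eps)

full = first_adam_step(1.0, bias_correction=True)
incomplete = first_adam_step(1.0, bias_correction=False)
print(incomplete / full)             # roughly 3.16: early steps are larger without correction
```

Over a long run the correction factors approach 1 and the two variants behave alike, which fits the remark that the difference matters mainly when training is short.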

3.SQuAD v2.0 (omitted): BERT also did well


4.SWAG

Situations With Adversarial Generations: judge the relationship between two sentences. BERT is trained much the same way as before, and the results are good.

Conclusion: BERT is very convenient to use and works well across different datasets.

Express the input as “a pair of sentences”, take the corresponding BERT output, add an output layer with softmax, and it is done.

BERT made a great contribution to the whole NLP field: with one relatively simple model, changing only the input format and adding a final output layer, it achieves very good results on a large number of tasks.

8 – Ablation studies

Look at the contribution of each component of BERT.

  • No NSP
  • LTR (left-to-right, no MLM) & No NSP
  • LTR (left-to-right, no MLM) & No NSP + BiLSTM (idea from ELMo)

Removing any component hurts BERT’s results, most noticeably on MRPC.

Effect of Model Size

  • BERT_BASE: 110M learnable parameters
  • BERT_LARGE: 340M learnable parameters

BERT was the first to show that brute-force scale works wonders, and it set off the race for ever “bigger” models

Now: GPT-3 has on the order of 100 billion learnable parameters (175 billion)

Feature-based Approach with BERT

Using the pre-trained BERT features as static inputs, without fine-tuning, does not work as well as fine-tuning

Using BERT calls for fine-tuning


Mu Li’s comment: the writing is fine

  • First describe how BERT differs from ELMo (bidirectional + RNN) and GPT (unidirectional + Transformer)
  • Introduce the BERT model
  • BERT’s experimental setup and its strong results
  • The conclusion highlights the contribution of “bidirectional”

One paper, one selling point: easy to remember.

But does BERT have to pick “bidirectional” as its selling point?

It can, but the paper should also say what is lost by being bidirectional.

  • GPT uses a decoder
  • BERT uses an encoder, which makes generative tasks such as machine translation and text summarization hard to do
  • However, classification problems are more common in NLP. NLP researchers like BERT because it is easy to apply to the problems they want to solve.

BERT presents a complete recipe for solving problems, one that matches what everyone expects from deep learning:

  • Train a very deep and wide model, pre-training it well on a large dataset; the trained parameters can then solve many small problems, improving performance on small datasets through fine-tuning
  • Once trained, the model can be applied to many small problems, and fine-tuning improves performance across all of them. This has been used in computer vision for many years

BERT brought the CV playbook to NLP with a 300-million-parameter model, and showed that the larger the model, the better the results. Brute force works wonders.

Why is BERT remembered?

BERT used a larger training dataset than ELMo and GPT and got better results; BERT in turn has been surpassed by even larger training datasets and larger models.

BERT’s citation count is about 10 times that of GPT, a sign of its great influence