Baidu NLP pre-training model ERNIE 2.0: the ultimate hands-on course is here! [with tutorial]

Time: 2019-11-08

In March 2019, Baidu officially released its NLP model ERNIE, which attracted extensive attention and discussion in the industry by comprehensively surpassing BERT on Chinese tasks. Just a few months later, Baidu upgraded ERNIE and released ERNIE 2.0, a continual-learning framework for semantic understanding, together with pre-trained models built on that framework. Building on 1.0, ERNIE also made new breakthroughs on English tasks, surpassing BERT and XLNet on a total of 16 Chinese and English tasks and achieving state-of-the-art (SOTA) results.

This article may well be the most practical ERNIE course to date: it takes you through ERNIE step by step, from the basics to advanced usage. You can fork the code on AI Studio (https://aistudio.baidu.com/aistudio/projectdetail/117030) and get 12 hours of free GPU compute every day when you run it.

I. Basics

1.1 Preparing the code, data, and model

Step 1: Download the ERNIE code. Tip: if the download is slow, pause and try again.

!git clone --depth 1 https://github.com/PaddlePaddle/ERNIE.git

Step 2: Download and extract the finetune data

!wget --no-check-certificate https://ernie.bj.bcebos.com/task_data_zh.tgz 
!tar xf task_data_zh.tgz

Step 3: Download the pre-trained model

!wget --no-check-certificate https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz 
!mkdir -p ERNIE1.0 
!tar zxf ERNIE_1.0_max-len-512.tar.gz -C ERNIE1.0
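
You can then list the extracted directory to confirm the download; the ERNIE 1.0 package should contain the vocabulary, the ernie_config.json configuration, and the parameter files (exact contents may vary by release):

!ls ERNIE1.0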

If the downloads are slow, you can use the code and data that have been downloaded in advance (available under work/):

%cd ~ 
!cp -r work/ERNIE1.0 ERNIE1.0 
!cp -r work/task_data task_data 
!cp -r work/lesson/ERNIE ERNIE

With the ERNIE code, data, and model prepared, let’s take a sequence labeling task as an example.

What is a sequence labeling task?

The following figure gives a general idea of what sequence labeling is:

What can sequence labeling do?

Quite a lot: information extraction, turning text into structured data, helping search engines return more accurate results, and more…

Let’s see what the data for a sequence labeling task looks like.

The input data for the sequence labeling task consists of two parts:
1) A label mapping file, which stores the label-to-ID mapping.
2) Training/test data with two columns, text and label; the characters within the text are separated by the invisible character \2, and the labels are separated the same way (see the sketch below).
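
To make the row format concrete, here is a minimal sketch (illustrative only) of how one training line is assembled with the invisible \2 separator and a tab between the two columns:

# Minimal sketch: assemble one row-format training line.
# Characters and labels are joined with the invisible separator "\2" (chr(2));
# the two columns, text_a and label, are separated by a tab.
chars  = ['海', '钓', '比', '赛']
labels = ['O', 'O', 'O', 'O']
line = '\2'.join(chars) + '\t' + '\2'.join(labels)
print(repr(line))  # '海\x02钓\x02比\x02赛\tO\x02O\x02O\x02O'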

#Label mapping file
!cat task_data/msra_ner/label_map.json
{ 
"B-PER": 0, 
"I-PER": 1, 
"B-ORG": 2, 
"I-ORG": 3, 
"B-LOC": 4, 
"I-LOC": 5, 
"O": 6 
}

 

#Test data
!head task_data/msra_ner/dev.tsv

B: Begin
I: Inside
O: Outside
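
With this B/I/O scheme, entity spans can be recovered from a predicted tag sequence. A minimal sketch (not part of the ERNIE code) of that decoding step:

# Minimal sketch: recover (entity_type, start, end) spans from a B/I/O tag sequence.
def bio_to_spans(tags):
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags + ['O']):          # trailing 'O' flushes the last span
        if tag == 'O' or tag.startswith('B-'):
            if start is not None:
                spans.append((ent_type, start, i))  # end index is exclusive
                start, ent_type = None, None
            if tag.startswith('B-'):
                start, ent_type = i, tag[2:]
    return spans

print(bio_to_spans(['O', 'B-LOC', 'I-LOC', 'O', 'B-PER', 'I-PER']))
# [('LOC', 1, 3), ('PER', 4, 6)]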

ERNIE applied to sequence labeling

 

1.2 Fine-tuning with ERNIE

Step 1: Set environment variables

%cd ERNIE 
!ln -s ../task_data 
!ln -s ../ERNIE1.0
%env TASK_DATA_PATH=task_data
%env MODEL_PATH=ERNIE1.0 
!echo "task_data_path: ${TASK_DATA_PATH}"
!echo "model_path: ${MODEL_PATH}"

Step 2: Run the finetune script

!sh script/zh_task/ernie_base/run_msra_ner.sh

1.3 Viewing the finetune results

During finetuning, predictions on the test set are saved automatically, so we can check whether they meet expectations.

Since finetuning takes a while to finish, you can directly view the test-set predictions of a previously finetuned, converged model:

%cd ~
show_ner_prediction('work/lesson/test_result.5.final')

II. Advanced topics

2.1 What if GPU memory is too small for ERNIE?

Script tuning: how do you finetune with only the first three layers’ parameters when the full model is too large to fit into GPU memory?

In other words, what if only a few layers of the model can be loaded?

Method: you only need to modify one line of the configuration file ernie_config.json to finetune with just the first three layers’ parameters.

Tip: ernie_config.json is included in the pre-trained model package released with ERNIE 1.0.

TODO: make the change from the “Terminal” tab and run it yourself.
Tip: the sed and pwd commands come in handy; see the sketch below.
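
For example, the following cell keeps only the first three Transformer layers (a sketch: it assumes the layer count in ernie_config.json is stored as "num_hidden_layers" with a value of 12, as in the ERNIE 1.0 base release):

!sed -i 's/"num_hidden_layers": *12/"num_hidden_layers": 3/' ERNIE1.0/ernie_config.json
!cat ERNIE1.0/ernie_config.json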

Step 1: Set environment variables and run the finetune script

%cd ~
%cd ERNIE
!ln -s ../task_data
!ln -s ../ERNIE1.0
%env TASK_DATA_PATH=task_data
%env MODEL_PATH=ERNIE1.0
!echo "task_data_path: ${TASK_DATA_PATH}"
!echo "model_path: ${MODEL_PATH}"
!pwd
!sh script/zh_task/ernie_base/run_msra_ner.sh

2.2 How do I adapt ERNIE to my own business data?

Data tuning: how do you handle a different input format?

Suppose the input data format of the MSRA-NER task has changed: each sample is now stored column-wise instead of row-wise. In the column-wise format, each sample spans multiple lines, each line contains one character and its label, and samples are separated by blank lines. For example:

text_a label
海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O
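
If your data is still in the original row format, a small script like the following sketch (the helper name and paths are illustrative, not part of the ERNIE repo) can convert it into this column-wise format:

# Sketch: convert row-format TSV (text<TAB>label, fields joined by "\2")
# into the column-wise format (one character and its label per line,
# with a blank line between samples).
def rows_to_columns(src_path, dst_path):
    with open(src_path, encoding='utf8') as fin, \
         open(dst_path, 'w', encoding='utf8') as fout:
        next(fin)                          # skip the original header line
        fout.write('text_a\tlabel\n')
        for row in fin:
            text, label = row.rstrip('\n').split('\t')
            for ch, tag in zip(text.split('\2'), label.split('\2')):
                fout.write('%s\t%s\n' % (ch, tag))
            fout.write('\n')               # blank line separates samples

rows_to_columns('task_data/msra_ner/dev.tsv', 'dev_columnwise.tsv')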

When the input data is column-wise like this, how do we modify ERNIE’s data processing code to adapt to the new format?
First, let’s get a general idea of ERNIE’s data processing flow:

  • ERNIE’s data processing code for finetune tasks lives in reader/task_reader.py, which already provides reader classes for several different task types. ERNIE reads and processes data through these readers before feeding it to the model.
  • A reader class abstracts the data processing flow into the following steps:

Step 1. Read samples one by one from the file. Files of different formats are read through methods such as _read_tsv, and each sample is stored in a list.

Step 2. Convert the samples one by one into records. A record contains all the features the model needs for one sample after preprocessing. Producing a record generally involves the following steps:

1. Tokenize the text and truncate it if it exceeds the maximum length;
2. Add markers such as ‘[CLS]’ and ‘[SEP]’, then convert the text into token IDs;
3. Generate the position and token-type information for each token.

Step 3. Assemble multiple records into a batch. When feature lengths within a batch differ, pad them to the maximum feature length in the batch (see the sketch below).
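
As a rough illustration of step 3 (the actual batching code lives in the ERNIE repo; this sketch only shows the padding idea):

import numpy as np

# Sketch: pad every record's token id list to the longest length in the batch.
def pad_batch(batch_token_ids, pad_id=0):
    max_len = max(len(ids) for ids in batch_token_ids)
    padded = np.full((len(batch_token_ids), max_len), pad_id, dtype='int64')
    for i, ids in enumerate(batch_token_ids):
        padded[i, :len(ids)] = ids
    return padded

print(pad_batch([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]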

After understanding ERNIE’s data processing flow, we can see that when the input format changes, we only need to modify the code in step 1 and keep everything else unchanged. Specifically, add the following _read_tsv function (overriding _read_tsv of the base class BaseReader) to the SequenceLabelReader class in reader/task_reader.py:

def _read_tsv(self, input_file, quotechar=None):
    with open(input_file, 'r', encoding='utf8') as f:
        reader = csv_reader(f)
        headers = next(reader)
        text_indices = [
            index for index, h in enumerate(headers) if h != 'label'
        ]
        Example = namedtuple('Example', headers)

        examples = []
        buf_t, buf_l = [], []   # character / label buffers for the current sample
        for line in reader:
            if len(line) != 2:
                # A blank line marks the end of a sample:
                # flush the buffers into one example, joining fields with "\2".
                assert len(buf_t) == len(buf_l)
                example = Example(u'\2'.join(buf_t), u'\2'.join(buf_l))
                examples.append(example)
                buf_t, buf_l = [], []
                continue
            if line[0].strip() == '':
                continue
            buf_t.append(line[0])
            buf_l.append(line[1])
        if len(buf_t) > 0:
            # Flush the last sample if the file does not end with a blank line.
            assert len(buf_t) == len(buf_l)
            example = Example(u'\2'.join(buf_t), u'\2'.join(buf_l))
            examples.append(example)
            buf_t, buf_l = [], []
        return examples

We have placed the modified data and code in the work/lesson/2 directory in advance. Copy them over the corresponding files in the ERNIE project and give it a run:

%cd ~
!cp -r work/lesson/2/msra_ner_columnwise task_data/msra_ner_columnwise
!cp -r work/lesson/2/task_reader.py ERNIE/reader/task_reader.py
!cp -r work/lesson/2/run_msra_ner.sh ERNIE/script/zh_task/ernie_base/run_msra_ner_columnwise.sh
%cd ERNIE
!ln -s ../task_data
!ln -s ../ERNIE1.0
%env TASK_DATA_PATH=task_data
%env MODEL_PATH=ERNIE1.0
!sh script/zh_task/ernie_base/run_msra_ner_columnwise.sh

2.3 Where do I change the model structure?

Model tuning: how do you replace the loss function of the sequence labeling task with a CRF?
Currently, the finetune code for the sequence labeling task uses a softmax cross-entropy loss, which is relatively simple and does not model the dependencies between adjacent labels in the sequence. How do we swap in a better loss function?

We only need to modify the create_model function and replace the softmax cross-entropy loss with linear_chain_crf. The code is as follows:

def create_model(args, pyreader_name, ernie_config, is_prediction=False):
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1],
                [-1, args.max_seq_len, 1], [-1, args.max_seq_len, 1], [-1, 1]],
        dtypes=[
            'int64', 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'
        ],
        lod_levels=[0, 0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

    (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie = ErnieModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
        task_ids=task_ids,
        input_mask=input_mask,
        config=ernie_config,
        use_fp16=args.use_fp16)

    enc_out = ernie.get_sequence_output()
    enc_out = fluid.layers.dropout(
        x=enc_out, dropout_prob=0.1, dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=enc_out,
        size=args.num_labels,
        num_flatten_dims=2,
        param_attr=fluid.ParamAttr(
            name="cls_seq_label_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_seq_label_out_b",
            initializer=fluid.initializer.Constant(0.)))
    infers = fluid.layers.argmax(logits, axis=2)

    ret_infers = fluid.layers.reshape(x=infers, shape=[-1, 1])
    lod_labels = fluid.layers.sequence_unpad(labels, seq_lens)
    lod_infers = fluid.layers.sequence_unpad(infers, seq_lens)
    lod_logits = fluid.layers.sequence_unpad(logits, seq_lens)

    (_, _, _, num_infer, num_label, num_correct) = fluid.layers.chunk_eval(
        input=lod_infers,
        label=lod_labels,
        chunk_scheme=args.chunk_scheme,
        num_chunk_types=((args.num_labels - 1) // (len(args.chunk_scheme) - 1)))

    probs = fluid.layers.softmax(logits)
    crf_loss = fluid.layers.linear_chain_crf(
        input=lod_logits,
        label=lod_labels,
        param_attr=fluid.ParamAttr(
            name='crf_w',
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)))
    loss = fluid.layers.mean(x=crf_loss)

    graph_vars = {
        "inputs": src_ids,
        "loss": loss,
        "probs": probs,
        "seqlen": seq_lens,
        "num_infer": num_infer,
        "num_label": num_label,
        "num_correct": num_correct,
    }

    for k, v in graph_vars.items():
        v.persistable = True

    return pyreader, graph_vars
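
Note that the code above still takes the argmax of the logits for prediction. With a CRF loss, Viterbi decoding usually gives better results; a sketch of the extra line, reusing the crf_w transition parameters learned above, would look like this:

# Sketch: Viterbi decoding that shares the CRF transition parameters (crf_w).
crf_decode = fluid.layers.crf_decoding(
    input=lod_logits,
    param_attr=fluid.ParamAttr(name='crf_w'))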

We have placed the modified code in the work/lesson/3 directory in advance. Copy it over the corresponding file in the ERNIE project and run it again:

%cd ~
!cp -r work/lesson/3/sequence_label.py ERNIE/finetune/sequence_label.py
%cd ERNIE
!ln -s ../task_data
!ln -s ../ERNIE1.0
%env TASK_DATA_PATH=task_data
%env MODEL_PATH=ERNIE1.0
!sh script/zh_task/ernie_base/run_msra_ner_columnwise.sh

Rerun the finetune script after the modification.

Once the run finishes, take the final evaluation result. The comparison is as follows:

That’s the whole hands-on course. You can click the link below to fork the project directly:
https://aistudio.baidu.com/aistudio/projectdetail/117030

Key takeaway!
To view the complete documentation and tutorials for the ERNIE model, click the link below. We suggest starring the repo so you can easily find it again later.
GitHub: https://github.com/PaddlePaddle/ERNIE

Version updates and the latest progress will always be released on GitHub first, so stay tuned!

You are also welcome to join ERNIE’s official technical exchange QQ group (760439550), where you can discuss technical questions and ERNIE’s R&D engineers will answer them promptly.