The more you learn, the more interesting it becomes: "Take You to Learn NLP" series, Project 07 – Machine Translation

Time: 2021-07-30

Course introduction
"Take You to Learn NLP" is a series of practical projects based on PaddlePaddle's PaddleNLP. Carefully created by senior engineers at Baidu, the series walks through the whole process, from word vectors, pre-trained language models, information extraction, sentiment analysis, text question answering, structured-data question answering, text translation, simultaneous machine interpretation, dialogue systems and other practical projects. It aims to help developers grasp the use of the Baidu PaddlePaddle framework in the NLP field more comprehensively and clearly, so that they can draw inferences from these examples and flexibly apply PaddlePaddle and PaddleNLP to deep learning practice in NLP.

In June, Baidu's PaddlePaddle team and Natural Language Processing department jointly launched a 12-lesson NLP video course explaining the practical projects in detail.

To watch the course replay, please visit: https://aistudio.baidu.com/ai…

You are welcome to join the course QQ group (group number: 758287592) to discuss and exchange ideas.

Background introduction
Machine translation is the process of using computers to convert one natural language (source language) into another natural language (target language).

This project is a PaddlePaddle implementation of the Transformer, the mainstream model in machine translation. Based on this project, we will build our own translation model.

Transformer is a network architecture proposed in the paper "Attention Is All You Need" for sequence-to-sequence (seq2seq) learning tasks such as machine translation. It relies entirely on the attention mechanism to perform sequence-to-sequence modeling.

Figure 1: Transformer network structure
Compared with the recurrent neural network (RNN) widely used in earlier seq2seq models, using self-attention to map the input sequence to the output sequence has the following main advantages:

Lower computational complexity
For a sequence of length n with feature dimension d, the per-layer complexity of an RNN is O(n·d²) (n time steps, each computing a d-dimensional matrix-vector multiplication), while the complexity of self-attention in the Transformer is O(n²·d) (pairwise d-dimensional dot products, or other compatibility functions, over the n positions). In practice n is usually smaller than d.
Higher parallelism
In an RNN, the computation at the current time step depends on the result of the previous time step. In self-attention, the computation at each position depends only on the input, not on the output of earlier positions, so all positions can be computed fully in parallel.
Easier learning of long-range dependencies
In an RNN, establishing a dependency between two positions that are n steps apart requires n computation steps; in self-attention, any two positions are directly connected, and the shorter the path, the easier it is for the signal to propagate. The Transformer architecture has also been widely used in semantic representation models such as BERT and has achieved remarkable results.
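To make the pairwise-dot-product view concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (illustrative only, not the PaddleNLP implementation); the n×n score matrix is where the O(n²·d) cost comes from:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X has shape (n, d); project it into queries, keys and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (n, n) pairwise dot products between all positions -> O(n^2 * d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax over the attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values, back to shape (n, d)
    return weights @ V

n, d = 8, 16                                # toy sizes: sequence length n, feature dimension d
X = np.random.rand(n, d)
Wq = Wk = Wv = np.random.rand(d, d)
print(self_attention(X, Wq, Wk, Wv).shape)  # (8, 16)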

Rapid practice
This example shows how pre-trained models, represented here by the Transformer, can complete a machine translation task through fine-tuning.
The project is built on PaddlePaddle's PaddleNLP. GitHub address:
https://github.com/PaddlePaddle/PaddleNLP
PaddleNLP official documentation:
https://paddlenlp.readthedocs.io
Full code:
https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer
Deep learning task pipeline

Figure 2: Pipeline of a deep learning task

2.1 data preprocessing
This tutorial uses the Chinese-English data from the CWMT dataset as the training corpus. The CWMT dataset contains more than 9 million high-quality samples and is well suited for training a Transformer machine translation model.
The Chinese side requires Jieba word segmentation followed by BPE; the English side requires BPE only (a command-level sketch follows Figure 4 below).
BPE (Byte Pair Encoding) brings two benefits:
It compresses the vocabulary;
It alleviates the OOV (out-of-vocabulary) problem to some extent.

Figure 3: Learn BPE

Figure 4: Apply BPE
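As a concrete reference for the two steps in Figures 3 and 4, here is a minimal sketch using the third-party jieba and subword-nmt packages (file names such as train.zh.tok and bpe.zh.codes are placeholders; the project itself wraps all of this in preprocess.sh):

import jieba
from subword_nmt import learn_bpe, apply_bpe

# 1. Jieba word segmentation for the Chinese side
sentence = "机器翻译是人工智能的重要方向。"
tokenized = " ".join(jieba.cut(sentence))

# 2. Learn a BPE code table from a tokenized corpus (placeholder file names)
with open("train.zh.tok", encoding="utf8") as fin, open("bpe.zh.codes", "w", encoding="utf8") as fout:
    learn_bpe.learn_bpe(fin, fout, num_symbols=32000)

# 3. Apply the learned BPE codes to split words into subword units
with open("bpe.zh.codes", encoding="utf8") as codes:
    bpe = apply_bpe.BPE(codes)
print(bpe.process_line(tokenized))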

# Customized method for reading local data
def read(src_path, tgt_path, is_predict=False):
    # For the test set (prediction), tgt is left empty
    if is_predict:
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src': src_line, 'tgt': ''}
    else:
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src': src_line, 'tgt': tgt_line}

# Filter out samples whose length falls outside [min_len, max_len]
def min_max_filer(data, max_len, min_len=0):
    # Take the minimum and maximum length over src and tgt (+1 accounts for the added <bos>/<eos> token)
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)

# Data preprocessing: Jieba segmentation, BPE, and vocabulary construction
!bash preprocess.sh

2.2 construct dataloader
We define a create_data_loader function to build the dataloader objects required for the training and validation sets.
A dataloader generates data batch by batch. The PaddleNLP built-in utilities called inside this function are briefly described below:
paddlenlp.data.Vocab.load_vocabulary: Vocab is the vocabulary class; it collects a series of methods for mapping between text tokens and ids, and supports building a vocabulary from files, dictionaries, JSON and other sources.
paddlenlp.datasets.load_dataset: to create a dataset from local files, it is recommended to write a read function tailored to the format of the local data and pass it to load_dataset().
paddlenlp.data.Pad: padding operation used to align the lengths of sentences within the same batch.
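For example, paddlenlp.data.Pad pads a batch of variable-length id sequences up to the longest length in the batch (a small illustrative snippet; the pad value 0 stands for the pad token id):

from paddlenlp.data import Pad

# Pad three id sequences of different lengths with 0 up to the longest length in the batch
pad = Pad(pad_val=0)
print(pad([[1, 2, 3], [4, 5], [6]]))
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]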

Figure 6: Process of constructing the dataloader

Figure 7: Dataloader details

# Create dataloaders for the training and validation sets; the test set dataloader is similar.
def create_data_loader(args):
    # Create datasets from local files via paddlenlp.datasets.load_dataset:
    # write a read function for the local data format and pass it to load_dataset()
    train_dataset = load_dataset(
        read,
        src_path=args.training_file.split(',')[0],
        tgt_path=args.training_file.split(',')[1],
        lazy=False)
    dev_dataset = load_dataset(
        read,
        src_path=args.validation_file.split(',')[0],
        tgt_path=args.validation_file.split(',')[1],
        lazy=False)
    # Build vocabularies from local files via paddlenlp.data.Vocab.load_vocabulary
    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    # Pad the vocabulary size up to a multiple of pad_factor to speed up the Transformer
    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        # Convert tokens into their ids in the vocabulary
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    # Build the training-set and validation-set dataloaders
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        # Convert sample tokens to ids via the dataset's map method;
        # filter out unqualified samples via the dataset's filter method
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))
        # Batch sampler: groups samples into batches
        batch_sampler = BatchSampler(
            dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)
        # Construct the dataloader used to iterate over batches for training / validation / testing
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)
    return data_loaders

def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    # Pad samples in the same batch to equal length via paddlenlp.data.Pad
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    # Expand the label dimension for the subsequent loss computation
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)

    data_inputs = [src_word, trg_word, lbl_word]
    return data_inputs

2.3 modeling
PaddleNLP provides a Transformer API that can be called directly:
paddlenlp.transformers.TransformerModel: implementation of the Transformer model
paddlenlp.transformers.InferTransformerModel: the Transformer model for generation (prediction)
paddlenlp.transformers.CrossEntropyCriterion: computes the cross-entropy loss
paddlenlp.transformers.position_encoding_init: initialization of the Transformer positional encoding
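For prediction, InferTransformerModel wraps the same network with beam-search decoding. A rough sketch of how it is typically instantiated is shown below; the arguments mirror those of TransformerModel in the training code of section 2.4, plus assumed decoding parameters beam_size and max_out_len (exact parameter names may differ between PaddleNLP versions, so check the documentation of the release you use):

from paddlenlp.transformers import InferTransformerModel

# Hedged sketch: build the inference model for beam-search decoding.
# `args` holds the same configuration fields used by do_train below;
# beam_size and max_out_len are assumed decoding parameters.
transformer = InferTransformerModel(
    src_vocab_size=args.src_vocab_size,
    trg_vocab_size=args.trg_vocab_size,
    max_length=args.max_length + 1,
    n_layer=args.n_layer,
    n_head=args.n_head,
    d_model=args.d_model,
    d_inner_hid=args.d_inner_hid,
    dropout=args.dropout,
    weight_sharing=args.weight_sharing,
    bos_id=args.bos_idx,
    eos_id=args.eos_idx,
    beam_size=args.beam_size,
    max_out_len=args.max_out_len)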

Figure 8: Model construction

Figure 9: Encoder-decoder diagram

2.4 training model
Run the do_train function; inside do_train, configure the optimizer, the loss function, and the evaluation metric, perplexity.
Perplexity is commonly used to measure the quality of a language model, i.e. the fluency of its sentences, and is widely used in machine translation and text generation. The smaller the perplexity, the more fluent the sentences and the better the language model.
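Concretely, perplexity is the exponential of the average per-token cross-entropy loss, which is why the avg loss and ppl columns in the training log below line up. A quick check:

import math

avg_loss = 10.513082        # average cross-entropy loss per token (first line of the log below)
ppl = math.exp(avg_loss)    # perplexity = exp(average cross-entropy loss)
print(ppl)                  # ~36793.7, matching the logged ppl for that step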

Figure 10: Training the model

def do_train(args):
    random_seed = eval(str(args.random_seed))
    if random_seed is not None:
        paddle.seed(random_seed)
    # Get the dataloaders
    (train_loader), (eval_loader) = create_data_loader(args)

    # Declare the model
    transformer = TransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        n_layer=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define the loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)
    # Define the learning-rate decay strategy
    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define the optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0

    # Train epoch by epoch
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:
            # Fetch a batch of data from the training-set dataloader
            (src_word, trg_word, lbl_word) = input_data
            # Obtain the logits output by the model
            logits = transformer(src_word=src_word, trg_word=trg_word)
            # Compute the loss
            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)

            # Compute the gradients
            avg_cost.backward()
            # Update the parameters
            optimizer.step()
            # Clear the gradients
            optimizer.clear_grad()

            batch_id += 1
            step_idx += 1
            scheduler.step()

# Start training
do_train(args)

The training log looks like this:
[2021-06-18 22:38:55,597] [    INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.513082,  ppl: 36793.687500
[2021-06-18 22:38:56,783] [    INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.506249,  ppl: 36543.164062
[2021-06-18 22:38:58,032] [    INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.464736,  ppl: 35057.187500
[2021-06-18 22:38:59,032] [    INFO] - validation, step_idx: 19, avg loss: 10.454649,  ppl: 34705.347656

2.5 prediction and evaluation
The final quality of the trained model is usually assessed on the test set; in machine translation, the BLEU score is the standard metric.
In the prediction output, each line is the highest-scoring translation for the corresponding input line. Since the data use BPE, the predicted translations are also in BPE form and must be restored to the original (tokenized) text before they can be evaluated correctly.
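As an illustration, the BPE subwords can be merged back and BLEU computed with, for example, the third-party sacrebleu package (a hedged sketch; the official example uses its own evaluation scripts, and predict.txt / ref.txt are placeholder file names):

import re
import sacrebleu

def remove_bpe(line):
    # Merge subwords back into words by deleting the "@@ " continuation markers
    return re.sub(r'(@@ )|(@@ ?$)', '', line.strip())

with open('predict.txt', encoding='utf8') as f_hyp, open('ref.txt', encoding='utf8') as f_ref:
    hyps = [remove_bpe(line) for line in f_hyp]
    refs = [line.strip() for line in f_ref]

# Corpus-level BLEU over the restored hypotheses against the references
print(sacrebleu.corpus_bleu(hyps, [refs]).score)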

Figure 11: Prediction and evaluation

Try it yourself
Isn't that interesting? We strongly recommend that beginners refer to the code above and type it out by hand, because only in this way can you deepen your understanding of the code.
Code corresponding to this project:
https://aistudio.baidu.com/ai…
Let’s customize our own translation system.
To explore more of PaddleNLP, please visit
https://github.com/PaddlePadd…