ERNIE | the best semantic understanding framework, naturally supported by AWS

Time: 2021-2-26


Readers who follow machine learning may have heard that at SemEval 2020, the world's largest semantic evaluation competition held recently, the semantic understanding framework ERNIE took five world championships, including key text segment mining for visual media, multilingual offensive language detection, and sentiment analysis of code-mixed language!

ERNIE (Enhanced Representation through kNowledge IntEgration) aims to capture semantic patterns from plain text and, through knowledge enhancement that incorporates semantic knowledge relations, to provide richer structured representations. The proposed knowledge-enhanced semantic representation model, together with the continual-learning semantic understanding framework built in version 2.0, surpasses the best industry models on many Chinese and English tasks. In particular, on a number of Chinese NLP tasks, ERNIE's results match or exceed those of BERT.

Some readers may not know that when ERNIE 1.0 was first released it adopted the word-based modeling approach of BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding), but this approach cannot learn complete semantic representations of knowledge units. It was later proposed to learn complete, concept-level semantic representations from knowledge entities and to introduce multi-source data for training. Therefore, starting from version 2.0, ERNIE proposed a pre-training framework that trains one model on a sequence of different tasks through continual learning. In this way, each new task makes use of what was learned from the previous tasks and continuously accumulates new knowledge; when facing a new task, the parameters learned from historical tasks can directly initialize the model and yield better training results.

And there is even better news: AWS, and in particular the Amazon SageMaker service, provides full support for ERNIE 2.0!

Amazon SageMaker is Amazon Web Services' fully managed machine learning platform service. Algorithm engineers and data scientists can quickly build, train, and deploy machine learning (ML) models on this platform without having to manage or operate the underlying resources. As a tool set, it provides all the end-to-end components for machine learning, including data labeling, data processing, algorithm design, model training, training debugging, hyperparameter tuning, model deployment, and model monitoring, making machine learning ever easier. At the same time, relying on AWS's powerful underlying infrastructure, it provides abundant, high-performance computing resources such as CPUs, GPUs, and Elastic Inference accelerators, making model development and deployment easier and more efficient.

ERNIE is designed and implemented on top of the open-source deep learning platform PaddlePaddle. This article focuses on how to use Amazon SageMaker to carry out machine learning tasks such as model pre-training, incremental training, and inference deployment based on this kind of third-party framework and user-defined algorithm; the technical details and procedures also apply to other similar scenarios.

This article first looks at how to use ERNIE on Amazon SageMaker for the model pre-training task.

Introduction to Amazon SageMaker model training

Amazon SageMaker provides the underlying compute capacity and runtime platform for machine learning tasks. Starting a model training task on the platform involves the following steps: launching instances with the specified compute capacity from Amazon SageMaker's compute cluster, loading the container image that contains the algorithm and framework, pulling the data from external storage (e.g. S3), running the training script, completing the model iterations, and saving the model. Users only need to supply the corresponding configuration, which can be done with a few clicks in the console or through API calls.
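To make this flow concrete, the snippet below is a minimal, illustrative sketch of the low-level API behind a training job, using the boto3 create_training_job call; every name, URI, and the role ARN in it is a placeholder rather than a value from this article. The SageMaker Python SDK used later in this article wraps this call, so in practice you rarely need to use it directly.

import boto3

sm = boto3.client('sagemaker')

#Illustrative only: all names, URIs and the role ARN are placeholders
sm.create_training_job(
    TrainingJobName='my-training-job',
    AlgorithmSpecification={
        'TrainingImage': '<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>',
        'TrainingInputMode': 'File'},
    RoleArn='arn:aws:iam::<account>:role/<sagemaker-execution-role>',
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://<bucket>/<prefix>/train',
            'S3DataDistributionType': 'FullyReplicated'}}}],
    OutputDataConfig={'S3OutputPath': 's3://<bucket>/<prefix>/output'},
    ResourceConfig={'InstanceType': 'ml.p3.2xlarge',
                    'InstanceCount': 1,
                    'VolumeSizeInGB': 50},
    StoppingCondition={'MaxRuntimeInSeconds': 86400})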

The core of a training task is the choice of the container image that contains the algorithm and framework. For the source of this training image, you can either pick one of the many built-in algorithm images provided by Amazon SageMaker, or use one of Amazon SageMaker's built-in framework images (TensorFlow, Apache MXNet, PyTorch, scikit-learn, XGBoost, Chainer) combined with your own code. If you need to use your own training code and train the model on another third-party framework (or on a built-in framework with your own environment), you can also bring your own container to Amazon SageMaker.
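As an illustration of the built-in framework route, the sketch below shows how a training job is typically launched with one of the SageMaker Python SDK's framework estimators (here PyTorch, with SDK v1 parameter names); the entry script name, framework version, and S3 path are assumptions for illustration and are not part of the ERNIE project.

from sagemaker.pytorch import PyTorch

#Illustrative only: a built-in framework image running a user-supplied entry script
pt_estimator = PyTorch(entry_point='train.py',            # your own training script
                       role=role,                         # SageMaker execution role
                       framework_version='1.4.0',
                       py_version='py3',
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',
                       hyperparameters={'epochs': 10})
pt_estimator.fit({'train': 's3://<bucket>/<prefix>/train'})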

This article focuses on the last case: how to complete the pre-training task of the ERNIE model, which is built on the PaddlePaddle framework with a user-defined algorithm, on Amazon SageMaker. The same method applies to similar tasks with other custom frameworks and algorithms.

Pre-training a user-defined algorithm with the bring-your-own-container approach

For the pre-training task of the ERNIE model, please refer to the "Pre-training (ERNIE 1.0)" section at this link.

If the pre-training task is run on a local machine, the following two tasks need to be completed:

  • Install the PaddlePaddle framework locally
  • Run the training script locally

In order to migrate this work from a local machine to Amazon SageMaker, the two tasks need to be adjusted as follows:
1. Install the PaddlePaddle framework locally → build a PaddlePaddle container image for SageMaker
1) Amazon SageMaker runs machine learning tasks on a container mechanism. For bring-your-own framework and algorithm scenarios, you need to build a container image that includes the target framework and its dependencies and push it to the Amazon Elastic Container Registry (ECR) for SageMaker to pull and run.
2) In this example, the project directory for this part is as follows; it can be prepared on a SageMaker Jupyter notebook (recommended) or on other compute resources.

/<for-docker-directory>/
├── ernie
│   ├── …
├── Dockerfile
├── requirements.txt
└── docker-actions.ipynb

Notes:

  • ernie: the code that implements the ERNIE framework in the source project, with the following adjustments to the directory contents:

    ~ Copy the config folder from the project root directory into this directory;
    ~ To adapt the interfaces, modify some of the code in the three files pretrain_args.py, train.py and pretraining.py, as described in detail below.

  • Dockerfile: the Docker description file
  • requirements.txt: the dependency description file from the source project root directory, with paddlepaddle-gpu==1.6.3.post107 removed here (the base image already provides PaddlePaddle);
  • docker-actions.ipynb: the container build and upload code, run from a notebook.

3) Write the Dockerfile. The Dockerfile for this example, with explanations, is as follows:

#Pull the image with PaddlePaddle pre-installed; see
#https://www.paddlepaddle.org.cn/documentation/docs/zh/1.7/install/install_Docker.html#docker
FROM paddlepaddle/paddle:1.6.3-gpu-cuda10.0-cudnn7

#Maintainer information
MAINTAINER Amazon AI <[email protected]>

#Copy requirements.txt to the container code directory
COPY requirements.txt /opt/ml/code/requirements.txt

#Install dependencies
RUN pip install -r /opt/ml/code/requirements.txt

#Install sagemaker-containers
RUN pip install sagemaker-containers==2.6.1

#Copy the ernie folder from the Git project to the /opt/ml/code directory of the container
COPY ernie /opt/ml/code

#Define train.py as the training entry point script
ENV SAGEMAKER_PROGRAM train.py

Notes:

  • sagemaker-containers is a library for running scripts, training algorithms, or deploying models compatible with Amazon SageMaker. It defines the locations in the container where our code, data, and other resources are installed: the training code must be placed where the SageMaker container expects it (/opt/ml/code). Please refer to here and here for details.
  • The ernie directory is therefore also copied to /opt/ml/code, following the sagemaker-containers directory layout;
  • The environment variable SAGEMAKER_PROGRAM is set to define train.py as the training entry point script (a minimal sketch of how the entry script receives its inputs follows these notes).
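To make this concrete, when the training job starts, sagemaker-containers runs the script named in SAGEMAKER_PROGRAM, passes the hyperparameters (introduced later in this article) to it as command-line arguments, and exposes each data channel through an SM_CHANNEL_<NAME> environment variable. The following minimal entry script is an illustrative sketch of how those inputs arrive; the argument names are examples, not ERNIE's actual ones.

import argparse
import os

if __name__ == '__main__':
    #Hyperparameters arrive as --key value command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=4096)
    parser.add_argument('--learning_rate', type=float, default=1e-4)
    args, _ = parser.parse_known_args()

    #Data channels and the model output directory arrive as environment variables
    train_dir = os.environ.get('SM_CHANNEL_TRAIN')   # e.g. /opt/ml/input/data/train
    model_dir = os.environ.get('SM_MODEL_DIR')       # /opt/ml/model
    print(args.batch_size, args.learning_rate, train_dir, model_dir)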

4) Run the container build and upload code in docker-actions.ipynb:

#Set permissions and build the container image
!chmod +x train.py
!docker build -t pd-ernie-pretrain:v1 .

#Get the ECR login
!$(aws ecr get-login --no-include-email --region us-west-2)

#Tag the container image
!docker tag pd-ernie-pretrain:v1 <your-aws-account>.dkr.ecr.<region>.amazonaws.com/<repository>:pd-ernie-pretrain

#Push the container image to ECR
!docker push "<your-aws-account>.dkr.ecr.<region>.amazonaws.com/<repository>:pd-ernie-pretrain"

In this way, a model training container image that bundles the PaddlePaddle framework and the ERNIE project now sits in the specified repository of the ECR service in our AWS account. Note the image name "<your-aws-account>.dkr.ecr.<region>.amazonaws.com/<repository>:pd-ernie-pretrain" for subsequent use by Amazon SageMaker.

2. Run the training script locally → start the training job on Amazon SageMaker through the API
The script for running ERNIE pre-training locally is script/zh_task/pretrain.sh; its core training command is as follows:

python ./ernie/train.py --use_cuda True \
                --is_distributed False \
                --use_fast_executor True \
                --weight_sharing True \
                --in_tokens true \
                --batch_size 8192 \
                --vocab_path ./config/vocab.txt \
                --train_filelist ./data/train_filelist \
                --valid_filelist ./data/valid_filelist \
                --validation_steps 100 \
                --num_train_steps 1000000 \
                --checkpoints ./checkpoints \
                --save_steps 10000 \
                --ernie_config_path ./config/ernie_config.json \
                --learning_rate 1e-4 \
                --use_fp16 false \
                --weight_decay 0.01 \
                --max_seq_len 512 \
                --skip_steps 10

This command form is also how we would normally run the training task on a local machine. Now let's look at how to turn this command into API parameters so that the training task can be started on Amazon SageMaker.

The command's parameters fall into three groups: basic configuration options (such as use_cuda), input data paths (such as vocab_path), and algorithm hyperparameters (such as learning_rate). The input data paths further split into basic configuration data paths (vocab_path, ernie_config_path) and dataset paths (train_filelist, valid_filelist). When training models with Amazon SageMaker, the best practice is to store datasets in an external storage service such as Amazon Simple Storage Service (S3), Amazon Elastic File System (EFS), or Amazon FSx, and let SageMaker pull the corresponding dataset for computation. Therefore, train_filelist and valid_filelist are not passed to the training script as local paths through these parameters; instead, the location of the datasets in S3 is specified through SageMaker's data channel mechanism. After removing the train_filelist and valid_filelist parameters, the remaining parameters are assembled into the following _hyperparameters dictionary object:

_hyperparameters = {
                            "use_cuda": True,
                            "is_distributed":False,
                            "use_fast_executor":True,
                            "weight_sharing":True,
                            "in_tokens":True,
                            "batch_size":8192,
                            "vocab_path":"./config/vocab.txt",
                            "num_train_steps":10000,
                            "checkpoints":"./checkpoints",
                            "save_steps":1000,
                            "ernie_config_path":"./config/ernie_config.json",
                            "learning_rate":"0.0001",
                            "use_fp16":False,
                            "weight_decay":0.01,
                            "max_seq_len":512,
                            "skip_steps":10,
                            "validation_steps": 100
              }

As for the two file-list files train_filelist and valid_filelist, let's first look at the contents of the data directory:

ERNIE/data/
├── demo_train_set.gz
├── demo_valid_set.gz
├── train_filelist
└── valid_filelist

Here demo_train_set.gz and demo_valid_set.gz are the encoded data files, and train_filelist is the list describing the data files:

./data/demo_train_set.gz        1.1

In this example, we expand the data set and modify the file list as follows:

ERNIE/data/
├── demo_train_set.gz
├── demo_train_set2.gz
├── demo_train_set3.gz
├── demo_train_set4.gz
├── demo_train_set5.gz
├── demo_valid_set.gz
├── demo_valid_set2.gz
├── train_filelist
└── valid_filelist

train_filelist:
demo_train_set.gz      1.1
demo_train_set2.gz     1.1
demo_train_set3.gz     1.1
demo_train_set4.gz     1.1
demo_train_set5.gz     1.1

valid_filelist:
demo_valid_set.gz      1.1
demo_valid_set2.gz     1.1

After these changes, upload the dataset files and file lists to S3:


<your-S3-bucket>/ernie/
├── train
│   ├── demo_train_set.gz
│   ├── demo_train_set2.gz
│   ├── demo_train_set3.gz
│   ├── demo_train_set4.gz
│   ├── demo_train_set5.gz
│   └── train_filelist
└── valid
    ├── demo_valid_set.gz
    ├── demo_valid_set2.gz
    └── valid_filelist

After uploading, we build a data channel dictionary to describe where each dataset is stored in S3:

_train_data = 's3://{}/{}/{}'.format(bucket, folder, 'train')
_valid_data = 's3://{}/{}/{}'.format(bucket, folder, 'valid')
_data_channels = {'train': sagemaker.session.s3_input(_train_data),
                  'valid': sagemaker.session.s3_input(_valid_data)}

After SageMaker receives this data channel definition, it automatically pulls the data from the corresponding S3 locations and downloads it into the container under /opt/ml/input/data/<channel_name>/, where <channel_name> corresponds to the two keys train and valid.
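For example, inside the running training container the channel locations can be inspected directly; this small snippet is illustrative and not part of the ERNIE code.

import os

#Each data channel is exposed through an SM_CHANNEL_<NAME> environment variable
train_dir = os.environ['SM_CHANNEL_TRAIN']   # /opt/ml/input/data/train
valid_dir = os.environ['SM_CHANNEL_VALID']   # /opt/ml/input/data/valid
print(os.listdir(train_dir))                 # the demo_train_set*.gz files plus train_filelist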

Therefore, to make the training script read the training data from /opt/ml/input/data/<channel_name>/, some changes are needed in pretrain_args.py, train.py and pretraining.py (the source project reads data from the paths given by the --train_filelist and --valid_filelist parameters). The changes are as follows:

pretrain_args.py
 
data_g.add_arg("train_filelist",           str,  "",  "Path to training filelist.")
data_g.add_arg("valid_filelist",           str,  "",  "Path to valid filelist.")

Amend to read

data_g.add_arg("train_filelist",           str,  os.environ['SM_CHANNEL_TRAIN'] + "/train_filelist",  "Path to training filelist.")
data_g.add_arg("valid_filelist",           str,  os.environ['SM_CHANNEL_VALID'] + "/valid_filelist",  "Path to valid filelist.")

Here the environment variable os.environ['SM_CHANNEL_TRAIN'] is /opt/ml/input/data/train and os.environ['SM_CHANNEL_VALID'] is /opt/ml/input/data/valid, i.e. the paths where the datasets are stored inside the container.

pretraining.py

class ErnieDataReader(object):

    def __init__(self,
                 filelist,
                 vocab_path,
                 batch_size=4096,
                 in_tokens=True,
                 max_seq_len=512,
                 shuffle_files=True,
                 random_seed=1,
                 epoch=100,
                 voc_size=0,
                 is_test=False,
                 generate_neg_sample=False):

Add a data_tag parameter, changing the signature to:

class ErnieDataReader(object):

    def __init__(self,
                 data_tag,
                 filelist,
                 vocab_path,
                 batch_size=4096,
                 in_tokens=True,
                 max_seq_len=512,
                 shuffle_files=True,
                 random_seed=1,
                 epoch=100,
                 voc_size=0,
                 is_test=False,
                 generate_neg_sample=False):

Add a member variable in the class:

self.data_tag = data_tag

In the method data_generator, add:

def data_generator(self):
    """
    data_generator
    """
    def wrapper():
        def reader():
            for epoch in range(self.epoch):
                self.current_epoch = epoch + 1
                files = self.files
                #during training, data are sliced by trainers
                if self.shuffle_files:
                    start = epoch * self.total_file
                    end = start + self.total_file
                    files = [file_ for index, file_ in enumerate(self.files[start:end]) \
                        if index % self.trainer_nums == self.trainer_id]

                for index, file_ in enumerate(files):
                    file_, mask_word_prob = file_.strip().split("\t")
                    mask_word = (np.random.random() < float(mask_word_prob))
                    self.current_file_index = (index + 1) * self.trainer_nums
                    self.current_file = file_
                    ############ Modify - Start ############
                    env_str = 'SM_CHANNEL_' + self.data_tag.upper()
                    file_ = os.environ[env_str] + '/' + file_
                    ############ Modify - End ############
                    if mask_word:
                        self.mask_type = "mask_word"
                    else:
                        self.mask_type = "mask_char"

Changes in train.py
In the method predict_wrapper:

filelist = args.test_filelist if args.do_test else args.valid_filelist
############ Modify - Start ############
tag = 'test' if args.do_test else 'valid'
data_reader = ErnieDataReader(
        tag,
        filelist,
        vocab_path=args.vocab_path,
        batch_size=args.batch_size,
        voc_size=ernie_config['vocab_size'],
        shuffle_files=False,
        epoch=1,
        max_seq_len=args.max_seq_len,
        is_test=True)
############ Modify - End ############

In the method train:

############ Modify - Start ############
data_reader = ErnieDataReader(
        data_tag='train',
        filelist=args.train_filelist,
        batch_size=args.batch_size,
        vocab_path=args.vocab_path,
        voc_size=ernie_config['vocab_size'],
        epoch=args.epoch,
        max_seq_len=args.max_seq_len,
        generate_neg_sample=args.generate_neg_sample)
############ Modify - End ############

After training, Amazon SageMaker automatically reclaims the compute resources it launched, so the model needs to be saved into the directory given by the environment variable SM_MODEL_DIR (/opt/ml/model/); before reclaiming the resources, SageMaker automatically uploads the model files in this directory to the specified S3 location.

The code is modified as follows. In the method train of train.py, add:

def train(args):
    …
    fluid.io.save_inference_model(dirname=os.environ['SM_MODEL_DIR'],
                                  feeded_var_names=['1','2','3','4'],
                                  target_vars=[next_sent_acc],
                                  executor=exe,
                                  main_program=train_program)
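A model saved this way can later be reloaded with PaddlePaddle's matching load API, for example when preparing inference deployment. The following is a minimal sketch, assuming the model artifact has already been downloaded from S3 and extracted into a local directory named model_dir.

import paddle.fluid as fluid

#Illustrative: reload a model saved with fluid.io.save_inference_model
place = fluid.CPUPlace()
exe = fluid.Executor(place)
inference_program, feed_target_names, fetch_targets = \
    fluid.io.load_inference_model(dirname='model_dir', executor=exe)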

During model training, in order to observe how the training metrics change, the metric extraction rules can be configured through API parameters. SageMaker automatically filters the metric values and writes them into the Amazon CloudWatch monitoring platform, so we can watch the metrics change during and after training.

An example of the metric definition list is as follows:

_metric_definitions = [{'Name': 'Training-loss' , 'Regex': 'loss: ([0-9\.]+)'}]
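This works because SageMaker applies the regular expression to the training job's log output, so the training script has to print matching lines. The line below is purely illustrative and not a quote from the ERNIE code:

#A log line that the Regex 'loss: ([0-9\.]+)' above would capture
print('loss: {:.4f}'.format(2.3517))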

After preparing the above parameters, you can start the SageMaker training job through the API. It is recommended to do this part of the work in a SageMaker Jupyter notebook:

#Import sagemaker library and python SDK
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
 
#Get session and related information
sess = sage.Session()
client = boto3.client('sts')
account = client.get_caller_identity()['Account']
role = get_execution_role()
 
#Get S3 bucket and path
bucket = sess.default_bucket()
folder = 'ernie'
 
#Get the region name
my_session = boto3.session.Session()
region = my_session.region_name
 
#Get the name of the container image stored in ECR
ecr_image = account + ".dkr.ecr." + region + ".amazonaws.com/sm-byo:pd-ernie-pretrain"
 
from sagemaker.estimator import Estimator
#Input parameter dictionary
_hyperparameters = {"use_cuda": True,
                    "is_distributed":False,
                    "use_fast_executor":True,
                    "weight_sharing":True,
                    "in_tokens":True,
                    "batch_size":8192,
                    "vocab_path":"./config/vocab.txt",
                    "num_train_steps":30,
                    "checkpoints":"./checkpoints",
                    "save_steps":10,
                    "ernie_config_path":"./config/ernie_config.json",
                    "learning_rate":"0.0001",
                    "use_fp16":False,
                    "weight_decay":0.01,
                    "max_seq_len":512,
                    "skip_steps":5,
                    "validation_steps": 20}
#Metric extraction list
_metric_definitions = [{'Name': 'Training-loss','Regex': 'loss: ([0-9\.]+)'}]
#Build the SageMaker Estimator
estimator = Estimator(image_name=ecr_image,                  # container image
                      role=role,                             # execution role
                      train_instance_type='ml.p3.2xlarge',   # training instance type
                      train_instance_count=1,                # number of training instances
                      hyperparameters=_hyperparameters,      # hyperparameters passed to the script
                      metric_definitions=_metric_definitions,# metric extraction rules
                      output_path='s3://{}/{}/'.format(bucket, folder))  # S3 path where the output model is stored
 
#S3 paths of the input data; build the data channel dictionary
train_data = 's3://{}/{}/{}'.format(bucket, folder, 'train')
valid_data = 's3://{}/{}/{}'.format(bucket, folder, 'valid')
data_channels = {'train': sage.session.s3_input(train_data),
                 'valid': sage.session.s3_input(valid_data)}
 
#Start training
estimator.fit(inputs=data_channels,  logs=True)
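Once fit returns, the S3 location of the packaged model artifact (model.tar.gz) can be read from the estimator; the attribute below is assumed to be available on the estimator after the job has finished.

#Illustrative: S3 path of the trained model artifact under the output_path set above
print(estimator.model_data)   # s3://<bucket>/ernie/<training-job-name>/output/model.tar.gz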

Notes

In this example, a single ml.p3.2xlarge instance is chosen to run the training task; it provides one NVIDIA V100 GPU, 8 vCPUs and 61 GB of memory. See here for more compute instance types.

During training, you can use Amazon CloudWatch to visualize the metric changes in real time: in the SageMaker console, open the corresponding training job.


Scroll down to the monitoring section, where you can click through to view algorithm metrics, output logs, and instance metrics; the Training-loss metric configured above is the one we explicitly exported to Amazon CloudWatch.


After the training job completes, SageMaker automatically saves the model files to the specified S3 path for subsequent deployment or further iteration.


The above is a basic introduction to running a model pre-training task with a custom algorithm on Amazon SageMaker using the bring-your-own-container approach. The method also applies to custom tasks built on other third-party or built-in frameworks. Please watch for the follow-up articles in this series on model optimization and deployment with the bring-your-own-container approach.
