For data scientists, the R language is like a finely honed tool: it makes a large range of data analysis challenges easy and intuitive to solve. How much more powerful does it become when combined with Amazon SageMaker's machine learning capabilities? Many Amazon Web Services (AWS) customers have already begun bringing R, the popular open-source software for statistical computing and graphics, fully into big data analytics and data science.
Amazon SageMaker is a fully managed service that helps users quickly build, train, and deploy machine learning (ML) models. Amazon SageMaker removes the heavy lifting from each step of the machine learning process and significantly lowers the barrier to developing high-quality models. In August 2019, Amazon SageMaker announced the availability of a pre-installed R kernel in all Regions. The capability works out of the box and comes with the reticulate library pre-installed, which provides an R interface to the Amazon SageMaker Python SDK so that users can call Python modules directly from R scripts.
In this article, we use R to train, deploy, and retrieve predictions from a machine learning model on an Amazon SageMaker notebook instance. The model predicts the age of abalone from the number of rings on the abalone shell. The reticulate package translates between R and Python objects, while Amazon SageMaker provides a serverless environment for training and deploying machine learning models at scale.
To follow this article, you need a basic understanding of R and familiarity with the following tidyverse packages: dplyr, readr, stringr, and ggplot2.
Create an Amazon SageMaker notebook instance with an R kernel
To create an Amazon SageMaker notebook instance with the R kernel pre-installed, complete the following steps:
Create a notebook instance, selecting the instance type and storage size as needed. Also note the choice of AWS Identity and Access Management (IAM) role, which ensures that Amazon SageMaker can run and has access to the Amazon Simple Storage Service (Amazon S3) buckets used in the project. Optionally, you can also choose a VPC, a subnet, and a Git repository. For more details, see Create an IAM Role.
- After confirming that the notebook's status is InService, choose Open Jupyter.
- In the Jupyter environment, choose R from the New drop-down menu.
The R kernel in Amazon SageMaker is built on the IRkernel package and contains more than 140 standard packages. For more information about creating a custom R environment for an Amazon SageMaker Jupyter notebook instance, see Creating a persistent custom R environment for Amazon SageMaker.
When you create a new notebook, you should see the R logo in the upper-right corner of the notebook environment, with the R kernel shown below it. This indicates that Amazon SageMaker has successfully started the R kernel for the notebook.
End-to-end machine learning with R on Amazon SageMaker
The sample notebook used in this article is available in the Using R with Amazon SageMaker GitHub repo.
First, load the reticulate library and import the sagemaker Python module. See the following code:
library(reticulate)
sagemaker <- import('sagemaker')
After the module is loaded, use $ in R wherever you would use . in Python.
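As a quick illustration of this convention, consider calling one of Python's built-in functions from R via reticulate's import_builtins helper (this snippet is an aside, not part of the article's workflow, and assumes a Python runtime is available):

```r
library(reticulate)

# Python's built-in functions imported as a module object; attributes that
# Python would reach with "." are reached with "$" in R
builtins <- import_builtins()
builtins$len(list(1, 2, 3))   # same as Python's len([1, 2, 3])
```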
Create and access data stores
The Session class provides operations for working with the following boto3 resources from Amazon SageMaker:
For this use case, we use the default S3 bucket for Amazon SageMaker. The default_bucket function creates an S3 bucket named sagemaker-<aws-region-name>-<aws-account-number>. See the following code:
session <- sagemaker$Session()
bucket <- session$default_bucket()
Specify the ARN of the IAM role that allows Amazon SageMaker to access the S3 bucket. You can use the same IAM role used when creating the notebook. See the following code:
role_arn <- sagemaker$get_execution_role()
Download and process datasets
The model uses the Abalone dataset from the UCI Machine Learning Repository. Download the data and start the exploratory data analysis. Use tidyverse packages to read the data, plot it, and convert it into an ML format suitable for Amazon SageMaker. See the following code:
library(readr)
data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone <- read_csv(file = data_file, col_names = FALSE)
names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight',
                    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(abalone)
The following table shows the output results.
In the output, sex should be a factor data type, but it is currently a character data type (F is female, M is male, and I is infant). Convert sex to a factor and view a statistical summary of the dataset with the following code:
abalone$sex <- as.factor(abalone$sex)
summary(abalone)
The following screenshot shows the output of the above code fragment, which provides a statistical summary of the abalone data frame.
The summary shows that the minimum value of height is 0. To see which abalone have a height of 0, plot the relationship between rings (the number of shell rings mentioned at the beginning of the article) and height for each value of sex, using the following code and library:
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4)
ggplot(abalone, aes(x = height, y = rings, color = sex)) +
  geom_point() +
  geom_jitter()
The following figure shows the resulting plot.
The plot shows several outliers: two infant abalone with a height of 0, and a few female and male abalone with heights far above the rest. To filter out the two infant abalone with a height of 0, use the following code:
library(dplyr)
abalone <- abalone %>%
  filter(height != 0)
Prepare data sets for model training
The model needs three datasets: one for training, one for testing, and one for validation. Complete the following steps:
- Convert the variable sex into a dummy variable, then move the target, rings, to the first column:
abalone <- abalone %>%
  mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),
         male = as.integer(ifelse(sex == 'M', 1, 0)),
         infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%
  select(-sex)
abalone <- abalone %>%
  select(rings:infant, length:shell_weight)
head(abalone)
The Amazon SageMaker algorithm requires the target to be in the first column of the dataset. The following table shows the output.
- Sample 70% of the data for ML algorithm training, then split the remaining 30% in half, one part for testing and the other for validation:
abalone_train <- abalone %>%
  sample_frac(size = 0.7)
abalone <- anti_join(abalone, abalone_train)
abalone_test <- abalone %>%
  sample_frac(size = 0.5)
abalone_valid <- anti_join(abalone, abalone_test)
Now you can upload the training and validation data to Amazon S3 for model training. Note that for CSV training, the XGBoost algorithm assumes the target variable is in the first column and that the CSV has no header row. For CSV inference, the algorithm likewise assumes the input has no header row. The following code therefore does not save column names in the CSV files.
- Write the training and validation datasets to the local file system in .csv format:
write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)
write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)
- Upload the two datasets to the data “directory” in the S3 bucket:
s3_train <- session$upload_data(path = 'abalone_train.csv',
                                bucket = bucket,
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'abalone_valid.csv',
                                bucket = bucket,
                                key_prefix = 'data')
- Define the Amazon S3 input types for the Amazon SageMaker algorithm:
s3_train_input <- sagemaker$s3_input(s3_data = s3_train, content_type = 'csv')
s3_valid_input <- sagemaker$s3_input(s3_data = s3_valid, content_type = 'csv')
Train the model

Amazon SageMaker trains models in Docker containers. To train the XGBoost model, complete the following steps:
- Specify the training container image in Amazon Elastic Container Registry (Amazon ECR) for your Region. See the following code:
registry <- sagemaker$amazon$amazon_estimator$registry(session$boto_region_name,
                                                       algorithm='xgboost')
container <- paste(registry, '/xgboost:latest', sep='')
container
'811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest'
- Define an Amazon SageMaker Estimator, which can train any containerized algorithm. When creating the Estimator, use the following parameters:
- image_name – the container image to use for training
- role – the Amazon SageMaker service role
- train_instance_count – the number of EC2 instances to use for training
- train_instance_type – the type of EC2 instance to use for training
- train_volume_size – the size, in GB, of the Amazon Elastic Block Store (Amazon EBS) volume used to store input data during training
- train_max_run – the training timeout, in seconds
- input_mode – the input mode the algorithm supports
- output_path – the Amazon S3 location for saving training results (model artifacts and output files)
- output_kms_key – the Amazon Key Management Service (Amazon KMS) key used to encrypt the training output
- base_job_name – the prefix for the training job name
- sagemaker_session – the Session object that manages interactions with the Amazon SageMaker API
s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_name = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.m5.large',
                                           train_volume_size = 30L,
                                           train_max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = NULL)
Note that the equivalent of Python's None is NULL in R.
- Specify the XGBoost hyperparameters and fit the model:
- Set the number of training rounds to 100, which is also the default value when using the XGBoost library outside Amazon SageMaker.
- Specify the input data and job name based on the current timestamp.
estimator$set_hyperparameters(num_round = 100L)
job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')
input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)
estimator$fit(inputs = input_data, job_name = job_name)
When training completes, Amazon SageMaker copies the model binary (a gzip-compressed tarball) to the specified Amazon S3 output location. Use the following code to get the full Amazon S3 path:
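The snippet itself is missing from the article; a minimal sketch, assuming the fitted estimator object from the training step above, would read the artifact path from the estimator's model_data attribute:

```r
# The S3 path of the trained model artifact (model.tar.gz) is exposed
# on the fitted estimator
model_artifact <- estimator$model_data
model_artifact
```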
Deploy the model

Amazon SageMaker provides an endpoint that users can reach through secure, simple HTTPS API calls to obtain predictions from the deployed model. To deploy the trained model on an ml.t2.medium instance, enter the following code:
model_endpoint <- estimator$deploy(initial_instance_count = 1L, instance_type = 'ml.t2.medium')
Generate predictions with the model
Now you can use the test data to make predictions. Complete the following steps:
- Pass comma-separated text to be serialized into JSON format by specifying text/csv as the content type and csv_serializer as the serializer for the endpoint. See the following code:
model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer
- Delete the target column and convert the first 500 rows of data to a matrix without column names:
abalone_test <- abalone_test[-1]
num_predict_rows <- 500
test_sample <- as.matrix(abalone_test[1:num_predict_rows, ])
dimnames(test_sample)[] <- NULL
This article uses only 500 rows of data to avoid exceeding the payload limit of the prediction endpoint.
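To score a larger test set despite the per-request limit, one could split the data into chunks. The helper below is a hypothetical sketch (predict_in_batches is not part of the article or the SDK), assuming the endpoint and serializer configured above:

```r
library(stringr)

# Hypothetical helper: send rows to the endpoint in chunks so that each
# request stays under the payload limit
predict_in_batches <- function(endpoint, data, batch_size = 500) {
  predictions <- numeric(0)
  for (start in seq(1, nrow(data), by = batch_size)) {
    end <- min(start + batch_size - 1, nrow(data))
    # Each call returns a comma-separated string of predictions
    response <- endpoint$predict(data[start:end, , drop = FALSE])
    predictions <- c(predictions,
                     as.numeric(str_split(response, pattern = ',', simplify = TRUE)))
  }
  predictions
}
```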
- Predict and convert to comma separated strings:
library(stringr)
predictions <- model_endpoint$predict(test_sample)
predictions <- str_split(predictions, pattern = ',', simplify = TRUE)
predictions <- as.numeric(predictions)
- Convert the predicted ring counts to integers and bind them to the columns of the test dataset:

abalone_test <- cbind(predicted_rings = as.integer(predictions),
                      abalone_test[1:num_predict_rows, ])
head(abalone_test)
The following table shows the output of the code, which adds predicted_rings to the abalone_test table. Note that your actual output may differ, because the training/validation/test split in the “Prepare data sets for model training” step is random, so your split will likely differ from this example.
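As an optional extension not covered in the article, one could quantify prediction quality against the true ring counts, provided those are saved before the target column is dropped from abalone_test in the earlier step:

```r
# Hypothetical evaluation sketch. Assumes true_rings was captured *before*
# the target column was removed, e.g. with:
#   true_rings <- abalone_test$rings[1:num_predict_rows]
rmse <- sqrt(mean((true_rings - as.numeric(predictions))^2))
mae  <- mean(abs(true_rings - as.numeric(predictions)))
rmse
mae
```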
Delete the prediction endpoint
When you are done with the model, delete the prediction endpoint to avoid incurring unnecessary deployment costs. See the following code:
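The snippet is missing from the article; based on the version of the SageMaker Python SDK used above (where the predictor exposes an endpoint attribute), the call would likely be:

```r
# Delete the hosted endpoint to stop incurring charges
session$delete_endpoint(model_endpoint$endpoint)
```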
This article walked through an end-to-end machine learning project, covering data collection, data processing, model training, deployment of the model as an endpoint, and inference with the deployed model. For more information about creating a custom R environment for an Amazon SageMaker Jupyter notebook instance, see Creating a persistent custom R environment for Amazon SageMaker. For more examples of R notebooks on Amazon SageMaker, see the Amazon SageMaker examples GitHub repo.
You can also refer to the Amazon SageMaker R User Guide in the developer guide to learn more about using Amazon SageMaker features through R.