N ways to access data sources in the SageMaker R environment

Time: 2022-01-04

Amazon SageMaker notebooks support R out of the box, so there is no need to manually install an R kernel on the instance. These notebooks also come with the reticulate library preinstalled, which provides an R interface to the Amazon SageMaker Python SDK and lets us call Python modules from R scripts. Using the Amazon SageMaker R kernel, you can easily access multiple data sources and run machine learning (ML) models. The R kernel is currently available by default in all Regions where Amazon SageMaker is offered.

R is a programming language built specifically for statistical analysis, and it is currently very popular in the field of data science. In this article, we'll learn how to use Java Database Connectivity (JDBC) to access the following data sources from the Amazon SageMaker R kernel: Apache Hive and Presto on Amazon EMR, Amazon Athena, Amazon Redshift, and MySQL-compatible Amazon Aurora.

For more details on using Amazon SageMaker features through R, see the R User Guide to Amazon SageMaker.

Solution overview

To build this solution, we first create a VPC with public and private subnets to ensure secure communication between the different resources and data sources in an isolated network. Next, we create the data sources in the custom VPC together with a notebook instance that has the necessary configuration, and use R to access each data source.

To ensure the data sources are not exposed to the public internet, each data source must reside entirely within the private subnets of the VPC. We need to create the following resources:

Use Amazon Systems Manager Session Manager to connect to the Amazon EMR cluster in the private subnet and create a Hive table.

To run code using the R kernel in Amazon SageMaker, you also need to create an Amazon SageMaker notebook instance and download the JDBC drivers for the data sources. Create a lifecycle configuration for the notebook containing the R package installation script, and attach the lifecycle configuration to the notebook at creation and startup to ensure the installation completes smoothly.

Finally, we can use the Amazon Web Services Management Console to navigate to the notebook, run code using the R kernel, and access data from the various sources. The complete solution is available in the GitHub repo.

Solution architecture

The following architecture diagram shows how Amazon SageMaker establishes connections with the various data sources and runs code through the R kernel. You can also use the Amazon Redshift query editor or the Amazon Athena query editor to create data resources. To create the Hive resources, we use Session Manager in Amazon Systems Manager to connect to the Amazon EMR cluster over SSH.

Launch the Amazon CloudFormation template

To create the resources automatically, we can run a set of Amazon CloudFormation templates. The template provisions the Amazon EMR cluster, Amazon Redshift cluster, or MySQL-compatible Amazon Aurora cluster automatically, without having to perform each step manually. All resources can be created in a few minutes.

  • Choose the link below to launch the CloudFormation stack, which creates all the resources needed for this solution:
  • On the Create stack page, choose Next.
  • Enter the stack name.
  • We can adjust the default values of the following stack details:

    Stack details                                              Default value
    Second octet of the class B VPC address (10.xxx.0.0/16)    0
    SageMaker Jupyter notebook instance type                   ml.t2.medium
    Create EMR cluster automatically?                          "Yes"
    Create Redshift cluster automatically?                     "Yes"
    Create Aurora MySQL DB cluster automatically?              "Yes"

  • Choose Next.
  • On the Configure stack options page, choose Next.
  • Select I acknowledge that Amazon CloudFormation might create IAM resources.
  • Choose Create stack.

Now we can see the stack being created, as shown in the following screenshot.

After the stack is created, its status is displayed as CREATE_COMPLETE.

  • On the Outputs tab, record each key and its corresponding value.
    In this article, we will use the following keys:
  • AuroraClusterDBName – Aurora cluster database name
  • AuroraClusterEndpointWithPort – Aurora cluster endpoint address and port number
  • AuroraClusterSecret – Aurora cluster credential secret ARN
  • EMRClusterDNSAddress – EMR cluster DNS name
  • EMRMasterInstanceId – EMR cluster primary instance ID
  • PrivateSubnets – Private subnets
  • PublicSubnets – Public subnets
  • RedshiftClusterDBName – Amazon Redshift cluster database name
  • RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address and port number
  • RedshiftClusterSecret – Amazon Redshift cluster credential secret ARN
  • SageMakerNotebookName – Amazon SageMaker notebook instance name
  • SageMakerRS3BucketName – Amazon SageMaker S3 data bucket
  • VPCandCIDR – VPC ID and CIDR address block

Create a notebook instance with the necessary R packages and jar files

JDBC is an application programming interface (API) for the Java programming language that defines how a client may access a database. RJDBC is an R package that lets you access various data sources through the JDBC interface. The notebook instance created by the CloudFormation template ensures that the jar files required for Hive, Presto, Amazon Athena, Amazon Redshift, and MySQL are available so that JDBC connections can be established.
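As a rough sketch of how RJDBC is used (the jar path, driver class, and URL below are illustrative placeholders, not values from this solution), every connection in the notebooks that follow takes the same shape: load a driver class from its jar, connect by URL, then query.

```r
# The lifecycle configuration normally installs these packages already;
# the install line is shown only for completeness.
# install.packages(c("rJava", "RJDBC"))
library(RJDBC)

# Generic RJDBC pattern -- substitute the driver class, jar path, URL,
# and credentials for your specific data source.
drv <- JDBC(driverClass = "some.vendor.jdbc.Driver",
            classPath   = "/home/ec2-user/SageMaker/jdbc/driver.jar")
conn <- dbConnect(drv, "jdbc:scheme://host:port/database", "user", "password")
dbGetQuery(conn, "SELECT 1")
dbDisconnect(conn)
```

Each data source section below fills in this template with its own driver class and endpoint.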

  • In the Notebook section of the Amazon SageMaker console, choose Notebook instances.
  • Search for the notebook matching the SageMakerNotebookName key recorded earlier.
  • Select the notebook instance.
  • Choose Open Jupyter under Actions and navigate to the JDBC directory.

The CloudFormation template downloads the jar files for Hive, Presto, Athena, Amazon Redshift, and MySQL-compatible Amazon Aurora into the JDBC directory.

Through lifecycle configurations, we can install packages or sample notebooks on the notebook instance, configure its networking and security, or apply other customizations with shell scripts. A lifecycle configuration provides the shell scripts that run when we create a notebook instance or start the notebook.

  • In the Lifecycle configuration section, choose View script to view the lifecycle configuration script that sets up the R kernel in Amazon SageMaker to make JDBC connections to the data sources through R.

This lifecycle configuration installs the RJDBC package and its dependencies in the Anaconda environment of the Amazon SageMaker notebook.

Access Hive and Presto

Amazon EMR is an industry-leading cloud big data platform for processing large amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

You can use the Session Manager feature of Systems Manager to log in to the EMR master node from the Amazon Web Services console and create a test table in Hive. Systems Manager lets you view and control your infrastructure on Amazon Web Services. It also provides a unified user interface for viewing operational data from multiple Amazon services and automating management tasks across resources. Session Manager is a fully managed Systems Manager capability that lets you manage Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises instances, and virtual machines through a browser-based one-click interactive shell or the Amazon Web Services Command Line Interface (Amazon CLI).

In this step, we use the following values from the Amazon CloudFormation Outputs tab:

  • EMRClusterDNSAddress– EMR cluster DNS name
  • EMRMasterInstanceId– EMR cluster primary instance ID
  • SageMakerNotebookName– Amazon sagemaker notebook instance name

Then do the following:

  • In the Systems Manager console, under Instances & Nodes, choose Session Manager.
  • Choose Start Session.
  • Use the value of the EMRMasterInstanceId key as the instance ID to SSH into the EMR master node.

This launches a browser-based shell.

  • Run the following commands in the shell:
# change user to hadoop
whoami
sudo su - hadoop
  • As the hadoop user on the EMR master node, create a test table in Hive:
# Run on the EMR master node to create a table called students in Hive
hive -e "CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));"
# Run on the EMR master node to insert data to students created above
hive -e "INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);"
# Verify
hive -e "SELECT * from students;"
exit
exit

The following screenshot shows an example of a view in a browser based shell.

  • After exiting the shell, close the browser.

To query the data in Amazon EMR from Amazon SageMaker R, open the notebook created earlier by the CloudFormation template.

  • In the Amazon SageMaker console, under Notebook, choose Notebook instances.
  • Locate the notebook specified by the value of the SageMakerNotebookName key.
  • Choose Open Jupyter.

  • To demonstrate connecting to EMR from the Amazon SageMaker R kernel, choose Upload and upload the hive_connect.ipynb notebook.

  • Alternatively, from the New drop-down menu, choose R to open a new notebook.
  • Enter the code from hive_connect.ipynb, replacing the EMR_DNS value with the value of the EMRClusterDNSAddress key:

  • Run all cells in the notebook to access Hive on Amazon EMR from the Amazon SageMaker R console.
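The connection in hive_connect.ipynb can be expected to follow a sketch like the one below. The jar file name and path are assumptions; port 10000 and the hadoop user are standard Hive-on-EMR defaults, and the students table is the one created earlier.

```r
library(RJDBC)

# From the CloudFormation Outputs tab (EMRClusterDNSAddress key)
emr_dns <- "<EMRClusterDNSAddress value>"

# Hive JDBC driver jar downloaded to the notebook's JDBC directory
hive_drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
                 classPath = "/home/ec2-user/SageMaker/jdbc/hive-jdbc.jar")

# Hive's Thrift server listens on port 10000 by default
hive_conn <- dbConnect(hive_drv,
                       paste0("jdbc:hive2://", emr_dns, ":10000/default"),
                       "hadoop", "")
dbGetQuery(hive_conn, "SELECT * FROM students")
dbDisconnect(hive_conn)
```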

We can access Presto through similar steps:

  • Alternatively, from the New drop-down menu, choose R to open a new notebook.
  • Enter the code from presto_connect.ipynb, replacing the EMR_DNS value with the value of the EMRClusterDNSAddress key:

  • Run all cells in the notebook to access PrestoDB on Amazon EMR from the Amazon SageMaker R console.
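presto_connect.ipynb presumably differs from the Hive notebook mainly in the driver class, port, and URL scheme; a sketch under those assumptions (the jar path is hypothetical; 8889 is the Presto coordinator port on EMR, and Presto reaches the Hive tables through its hive catalog):

```r
library(RJDBC)

emr_dns <- "<EMRClusterDNSAddress value>"  # from the CloudFormation Outputs tab

presto_drv <- JDBC("com.facebook.presto.jdbc.PrestoDriver",
                   classPath = "/home/ec2-user/SageMaker/jdbc/presto-jdbc.jar")

# catalog/schema in the URL: hive catalog, default schema
presto_conn <- dbConnect(presto_drv,
                         paste0("jdbc:presto://", emr_dns, ":8889/hive/default"),
                         user = "hadoop")
dbGetQuery(presto_conn, "SELECT * FROM students")
dbDisconnect(presto_conn)
```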

Access Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless: you don't need to manage any infrastructure, and you pay only for the queries you actually run. To access Amazon Athena from the Amazon SageMaker R kernel using RJDBC, we need the Amazon Athena JDBC driver. This driver has already been downloaded to the notebook instance by the lifecycle configuration script.

You also need to set a query result location in Amazon S3. For more details, see Working with Query Results, Output Files, and Query History.

  • On the Amazon Athena console, choose Get Started.
  • Choose Set up a query result location in Amazon S3.
  • In the Query result location section, enter the Amazon S3 location specified by the value of the SageMakerRS3BucketName key.
  • Optionally, add a prefix, such as results.
  • Choose Save.

  • Alternatively, open a new notebook and enter the code from athena_connect.ipynb, replacing the s3_bucket value with the value of the SageMakerRS3BucketName key:

  • Run all cells in the notebook to access Amazon Athena from the Amazon SageMaker R console.
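A sketch of what the athena_connect.ipynb connection looks like with RJDBC. The driver class, jar file name, Region, and property names here are assumptions tied to the Simba-based Athena JDBC driver; check the documentation for the driver version the lifecycle script downloaded.

```r
library(RJDBC)

s3_bucket <- "<SageMakerRS3BucketName value>"  # from the CloudFormation Outputs tab

athena_drv <- JDBC("com.simba.athena.jdbc.Driver",
                   classPath = "/home/ec2-user/SageMaker/jdbc/AthenaJDBC42.jar")

# Athena is reached through a regional endpoint over port 443;
# query results are staged in the S3 location configured above.
athena_conn <- dbConnect(athena_drv,
                         "jdbc:awsathena://athena.us-east-1.amazonaws.com:443",
                         S3OutputLocation = paste0("s3://", s3_bucket, "/results/"),
                         AwsCredentialsProviderClass =
                           "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider")
dbGetQuery(athena_conn, "SELECT 1")
dbDisconnect(athena_conn)
```

Using the instance profile credentials provider lets the notebook's IAM role authenticate the connection without embedding keys in the notebook.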

Access Amazon Redshift

Amazon Redshift is a fast, fully managed cloud data warehouse. It enables simple, cost-effective data analysis using standard SQL and your existing business intelligence (BI) tools. Redshift can query terabytes to petabytes of structured data, optimize complex queries, use columnar storage on high-performance disks, and execute queries with massive parallelism. To access Amazon Redshift from the Amazon SageMaker R kernel using RJDBC, we can use the Amazon Redshift JDBC driver, which has already been downloaded to the notebook instance by the lifecycle configuration script.

We need the following keys and their corresponding values from the Amazon CloudFormation Outputs tab:

  • RedshiftClusterDBName – Amazon Redshift cluster database name
  • RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address and port number
  • RedshiftClusterSecret – Amazon Redshift cluster credential secret ARN

The CloudFormation template creates a secret for the Amazon Redshift cluster in Amazon Secrets Manager, which protects the secrets we use to access applications, services, and IT resources. Secrets Manager also makes it easy to rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.

  • In the Secret value section, choose Retrieve secret value to get the user name and password for the Amazon Redshift cluster.


  • In the Amazon Redshift query editor, run the following SQL statements to create a table and insert a few records:
CREATE TABLE public.students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO public.students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

  • Alternatively, open a new notebook and enter the code from redshift_connect.ipynb, replacing the values of RedshiftClusterEndpointWithPort, RedshiftClusterDBName, and RedshiftClusterSecret:

  • Run all cells in the notebook to access Amazon Redshift from the Amazon SageMaker R console.
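redshift_connect.ipynb can be expected to resemble the following sketch. The jar file name is an assumption; com.amazon.redshift.jdbc.Driver is the standard Redshift JDBC driver class, and the user name and password are the values retrieved from the secret above.

```r
library(RJDBC)

# From the CloudFormation Outputs tab
redshift_endpoint <- "<RedshiftClusterEndpointWithPort value>"  # host:port
redshift_db       <- "<RedshiftClusterDBName value>"

rs_drv <- JDBC("com.amazon.redshift.jdbc.Driver",
               classPath = "/home/ec2-user/SageMaker/jdbc/RedshiftJDBC42.jar")

rs_conn <- dbConnect(rs_drv,
                     paste0("jdbc:redshift://", redshift_endpoint, "/", redshift_db),
                     "<user from secret>", "<password from secret>")
dbGetQuery(rs_conn, "SELECT * FROM public.students")
dbDisconnect(rs_conn)
```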

Access MySQL-compatible Amazon Aurora

Amazon Aurora is a MySQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. To access Amazon Aurora from the Amazon SageMaker R kernel using RJDBC, we need the MariaDB JDBC driver, which has already been downloaded to the notebook instance by the lifecycle configuration script.

You need the following keys and their corresponding values from the Amazon CloudFormation Outputs tab:

  • AuroraClusterDBName– Aurora cluster database name
  • AuroraClusterEndpointWithPort– Aurora cluster endpoint address and its port number
  • AuroraClusterSecret– Aurora cluster credential secret ARN

The CloudFormation template creates a secret for the Aurora cluster in Secrets Manager.

  • In the Secret value section, choose Retrieve secret value to get the user name and password for the Aurora cluster.

To access the cluster, follow steps similar to those for the other services.

  • Alternatively, open a new notebook and enter the code from aurora_connect.ipynb, replacing the values of AuroraClusterEndpointWithPort, AuroraClusterDBName, and AuroraClusterSecret:

  • Run all cells in the notebook to access Amazon Aurora from the Amazon SageMaker R console.
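With the MariaDB driver, aurora_connect.ipynb would look roughly like this sketch. The jar name is assumed; org.mariadb.jdbc.Driver is the MariaDB Connector/J driver class, the port is part of the AuroraClusterEndpointWithPort value, and depending on the driver version you may need the jdbc:mariadb:// scheme instead of jdbc:mysql://.

```r
library(RJDBC)

# From the CloudFormation Outputs tab
aurora_endpoint <- "<AuroraClusterEndpointWithPort value>"  # host:port
aurora_db       <- "<AuroraClusterDBName value>"

aurora_drv <- JDBC("org.mariadb.jdbc.Driver",
                   classPath = "/home/ec2-user/SageMaker/jdbc/mariadb-java-client.jar")

aurora_conn <- dbConnect(aurora_drv,
                         paste0("jdbc:mysql://", aurora_endpoint, "/", aurora_db),
                         "<user from secret>", "<password from secret>")
dbGetQuery(aurora_conn, "SELECT 1")
dbDisconnect(aurora_conn)
```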

Summary

In this article, we demonstrated how to access various data sources from the Amazon SageMaker R environment, including Hive and PrestoDB on Amazon EMR, Amazon Athena, Amazon Redshift, and a MySQL-compatible Amazon Aurora cluster, so you can analyze data and run statistical computations through Amazon SageMaker. You can extend the same approach to other data sources over JDBC.