Amazon SageMaker notebooks support R out of the box, so there is no need to manually install the R kernel on the instance. These notebooks also come pre-installed with the reticulate library, which provides an R interface to the Amazon SageMaker Python SDK and lets you call Python modules from R scripts. Using the Amazon SageMaker R kernel, you can easily access multiple data sources and run machine learning (ML) models. The R kernel is available by default in all Regions where Amazon SageMaker is offered.
R is a programming language built for statistical analysis and is very popular in data science. In this post, we learn how to use Java Database Connectivity (JDBC) to access the following data sources from the Amazon SageMaker R kernel:
- Hive and Presto on Amazon EMR
- Amazon Athena
- Amazon Redshift
- MySQL-compatible Amazon Aurora
For more details on using Amazon SageMaker features through R, see the R User Guide to Amazon SageMaker.
Solution overview
To build this solution, we first create a VPC with public and private subnets so that the different resources and data sources can communicate securely within an isolated network. Next, we create the data sources and the notebook instance in the custom VPC with the necessary configuration, and then access each data source with R.
To keep the data sources off the public internet, each data source resides entirely in the private subnets of the VPC. We create the following resources:
- An Amazon EMR cluster in a private subnet with Hive and Presto installed. For instructions, see Getting Started: Analyze Big Data with Amazon EMR.
- Amazon Athena resources. For instructions, see the Getting Started guide.
- An Amazon Redshift cluster in a private subnet. For instructions, see Create a sample Amazon Redshift cluster.
- A MySQL-compatible Amazon Aurora cluster in a private subnet. For instructions, see Creating an Amazon Aurora DB cluster.
We use Amazon Systems Manager Session Manager to connect to the Amazon EMR cluster in the private subnet and create the Hive tables.
To run code using the R kernel in Amazon SageMaker, we also create an Amazon SageMaker notebook instance. Because the JDBC drivers for the data sources need to be downloaded, we create a lifecycle configuration containing the driver download and R package installation scripts, and attach it to the notebook instance at creation and at every start so that the installation completes smoothly.
Finally, we use the Amazon Web Services Management Console to navigate to the notebook, run code using the R kernel, and access data from the various sources. The complete solution is available in the GitHub repo.
Solution architecture
The following architecture diagram shows how Amazon SageMaker connects to the various data sources and runs code through the R kernel. You can also use the Amazon Redshift query editor or the Amazon Athena query editor to create data resources. We use Session Manager in Amazon Systems Manager to connect to the Amazon EMR cluster and create the Hive resources.
Launch the Amazon CloudFormation template
To create the resources automatically, we run an Amazon CloudFormation template. The template provisions the Amazon EMR cluster, Amazon Redshift cluster, and MySQL-compatible Amazon Aurora cluster for us, so we don't have to perform each step manually. All resources are created within a few minutes.
- Choose the following link to launch the CloudFormation stack. The stack creates all the resources needed for this solution:
- On the Create stack page, choose Next.
- Enter the stack name.
- Optionally, adjust the default values for the following stack details:
| Stack detail | Default value |
| --- | --- |
| Class B network address for the VPC IP range (10.xxx.0.0/16) | 0 |
| SageMaker Jupyter notebook instance type | ml.t2.medium |
| Create the EMR cluster automatically? | "Yes" |
| Create the Redshift cluster automatically? | "Yes" |
| Create the Aurora MySQL DB cluster automatically? | "Yes" |
- Choose Next.
- On the Configure stack options page, choose Next.
- Select I acknowledge that Amazon CloudFormation might create IAM resources.
- Choose Create stack.
Now we can see the stack being created, as shown in the following screenshot.
When stack creation is complete, the status shows as CREATE_COMPLETE.
- On the Outputs tab, record each key and its corresponding value.
In this post, we use the following keys:
- AuroraClusterDBName – Aurora cluster database name
- AuroraClusterEndpointWithPort – Aurora cluster endpoint address and port number
- AuroraClusterSecret – Aurora cluster credentials secret ARN
- EMRClusterDNSAddress – EMR cluster DNS name
- EMRMasterInstanceId – EMR cluster primary instance ID
- PrivateSubnets – private subnets
- PublicSubnets – public subnets
- RedshiftClusterDBName – Amazon Redshift cluster database name
- RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address and port number
- RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN
- SageMakerNotebookName – Amazon SageMaker notebook instance name
- SageMakerRS3BucketName – Amazon SageMaker S3 data bucket
- VPCandCIDR – VPC ID and CIDR address block
Create a notebook instance with the necessary R packages and JAR files
JDBC is an application programming interface (API) for the Java programming language that defines how a client accesses a database. RJDBC is an R package that lets you access various data sources through the JDBC interface. The notebook instance created by the CloudFormation template ensures that the JAR files needed to establish JDBC connections to Hive, Presto, Amazon Athena, Amazon Redshift, and MySQL are available.
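The connection pattern used throughout this post, shown here as a minimal sketch rather than the exact notebook code from the repo, looks roughly like the following. The driver class, JAR path, and URL are placeholders; each data source section later in the post assumes its own concrete values.

```r
# Minimal RJDBC pattern (placeholders only): load a JDBC driver from a JAR,
# open a connection, run SQL, and clean up. Classes and paths vary per data source.
library(RJDBC)

drv <- JDBC(driverClass = "fully.qualified.DriverClass",                # e.g. a Hive, Presto, Athena, Redshift, or MariaDB driver
            classPath   = "/home/ec2-user/SageMaker/jdbc/driver.jar")   # hypothetical location of a downloaded JAR

conn <- dbConnect(drv, "jdbc:<subprotocol>://<host>:<port>/<database>", "<user>", "<password>")

dbListTables(conn)             # list tables visible through this connection
dbGetQuery(conn, "SELECT 1")   # run an ad hoc SQL query and return a data.frame
dbDisconnect(conn)
```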
- On the Amazon SageMaker console, under Notebook, choose Notebook instances.
- Search for the notebook matching the SageMakerNotebookName key recorded earlier.
- Select the notebook instance.
- Choose Open Jupyter under Actions and navigate to the jdbc directory.
The CloudFormation template downloads the JDBC JAR files for Hive, Presto, Athena, Amazon Redshift, and MySQL-compatible Amazon Aurora into the jdbc directory.
- Find the Lifecycle configuration of the notebook instance.
A lifecycle configuration lets us install packages or sample notebooks on the notebook instance, configure its networking and security, or apply other customizations with shell scripts. The lifecycle configuration provides the shell scripts that run when we create the notebook instance or start it.
- In the Lifecycle configuration section, choose View script to view the lifecycle configuration script that sets up the R kernel in Amazon SageMaker to make JDBC connections to the data sources through R.
The lifecycle configuration installs the RJDBC package and its dependencies in the notebook's Anaconda environment.
Access Hive and Presto
Amazon EMR is an industry-leading cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
You can use Session Manager, a capability of Systems Manager, to log in to the EMR master node from the Amazon Web Services console and create a test table in Hive. Systems Manager gives you visibility into and control of your infrastructure on Amazon Web Services. It also provides a unified user interface so you can view operational data from multiple Amazon services and automate operational tasks across your resources. Session Manager is a fully managed Systems Manager capability that lets you manage Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises instances, and virtual machines through an interactive, browser-based one-click shell or the Amazon Command Line Interface (Amazon CLI).
In this step, we use the following values from the CloudFormation Outputs tab:
- EMRClusterDNSAddress – EMR cluster DNS name
- EMRMasterInstanceId – EMR cluster primary instance ID
- SageMakerNotebookName – Amazon SageMaker notebook instance name
Then do the following:
- On the Systems Manager console, under Instances & Nodes, choose Session Manager.
- Choose Start session.
- Use the value of the EMRMasterInstanceId key as the instance ID to connect to the EMR master node.
This launches the browser-based shell.
- Run the following commands in the shell:
# change user to hadoop
whoami
sudo su - hadoop
- On the EMR master node, create a test table in Hive:
# Run on the EMR master node to create a table called students in Hive
hive -e "CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));"
# Run on the EMR master node to insert data to students created above
hive -e "INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);"
# Verify
hive -e "SELECT * from students;"
exit
exit
The following screenshot shows an example of a view in a browser based shell.
- After exiting the shell, close the browser.
To query the data in Amazon EMR from the Amazon SageMaker R kernel, open the notebook created earlier by the CloudFormation template.
- On the Amazon SageMaker console, under Notebook, choose Notebook instances.
- Locate the notebook specified by the value of the SageMakerNotebookName key.
- Choose Open Jupyter.
- To demonstrate connecting to EMR from the Amazon SageMaker R kernel, choose Upload and upload the hive_connect.ipynb notebook.
- Alternatively, on the New drop-down menu, choose R to open a new notebook.
- Enter the code from hive_connect.ipynb, replacing the EMR DNS value with the value provided by the EMRClusterDNSAddress key.
- Run all cells in the notebook to access Hive on Amazon EMR from the Amazon SageMaker R kernel. A hedged sketch of such a connection follows these steps.
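For reference, a minimal sketch of a Hive connection from the R kernel might look like the following. The JAR path, the HiveServer2 port (10000), and the empty credentials are assumptions; the actual hive_connect.ipynb in the repo may differ.

```r
# Hedged sketch: connect to HiveServer2 on the EMR master node with RJDBC.
# The JAR path, port 10000, and credentials are assumptions; adjust to your environment.
library(RJDBC)

emr_dns  <- "<EMRClusterDNSAddress value>"                             # from the CloudFormation Outputs tab
hive_jar <- "/home/ec2-user/SageMaker/jdbc/hive-jdbc-standalone.jar"   # hypothetical path set up by the lifecycle script

drv  <- JDBC("org.apache.hive.jdbc.HiveDriver", classPath = hive_jar)
conn <- dbConnect(drv, paste0("jdbc:hive2://", emr_dns, ":10000/default"), "hive", "")

dbGetQuery(conn, "SELECT * FROM students")   # the table created earlier on the EMR master node
dbDisconnect(conn)
```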
We can access Presto through similar steps:
- On the Amazon SageMaker console, open the notebook we created earlier.
- Choose Open Jupyter.
- Choose Upload to upload the presto_connect.ipynb notebook.
- Alternatively, choose R on the New drop-down menu to open a new notebook.
- Enter the code from presto_connect.ipynb, replacing the EMR DNS value with the value provided by the EMRClusterDNSAddress key.
- Run all cells in the notebook to access PrestoDB on Amazon EMR from the Amazon SageMaker R kernel. A sketch of such a connection follows these steps.
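A comparable sketch for Presto, again with assumed values (the EMR Presto coordinator port 8889, the JAR path, and the hive/default catalog and schema), might look like this:

```r
# Hedged sketch: connect to Presto on the EMR cluster with RJDBC.
# Port 8889, the JAR path, and the catalog/schema are assumptions.
library(RJDBC)

emr_dns    <- "<EMRClusterDNSAddress value>"
presto_jar <- "/home/ec2-user/SageMaker/jdbc/presto-jdbc.jar"   # hypothetical path

drv  <- JDBC("com.facebook.presto.jdbc.PrestoDriver", classPath = presto_jar)
conn <- dbConnect(drv, paste0("jdbc:presto://", emr_dns, ":8889/hive/default"), "hadoop", "")

dbGetQuery(conn, "SELECT * FROM students")   # the same Hive table, now queried through Presto
dbDisconnect(conn)
```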
Access Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. To access Athena from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Athena JDBC driver, which the lifecycle configuration script has already downloaded to the notebook instance.
You also need to set a query result location in Amazon S3. For more details, see Working with query results, output files, and query history.
- On the Amazon Athena console, choose Get Started.
- Choose Set up a query result location in Amazon S3.
- In the Query result location section, enter the Amazon S3 location specified by the value of the SageMakerRS3BucketName key.
- Optionally, add a prefix, such as results.
- Choose Save.
- Create a database or schema and a corresponding table in Athena using sample data from Amazon S3.
- As with Hive and Presto, upload the athena_connect.ipynb notebook to connect to Athena from Amazon SageMaker through the R kernel.
- Alternatively, open a new notebook, enter the code from athena_connect.ipynb, and replace the S3 bucket value with the value of the SageMakerRS3BucketName key.
- Run all cells in the notebook to access Amazon Athena from the Amazon SageMaker R kernel. A sketch of such a connection follows these steps.
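As a rough illustration (not the exact notebook code), an Athena connection through the Simba Athena JDBC driver could look like the following. The JAR path and file name, the Region in the endpoint, the S3 output location, and the credentials provider class are assumptions to verify against the driver documentation for your environment.

```r
# Hedged sketch: connect to Amazon Athena with RJDBC via the Simba Athena JDBC driver.
# The JAR path, Region, S3 output location, and credentials provider class are assumptions.
library(RJDBC)

athena_jar <- "/home/ec2-user/SageMaker/jdbc/AthenaJDBC42.jar"   # hypothetical path
s3_bucket  <- "<SageMakerRS3BucketName value>"

# Connection properties are appended to the URL with semicolons.
url <- paste0("jdbc:awsathena://athena.us-east-1.amazonaws.com:443;",
              "S3OutputLocation=s3://", s3_bucket, "/results/;",
              "AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider")

drv  <- JDBC("com.simba.athena.jdbc.Driver", classPath = athena_jar)
conn <- dbConnect(drv, url)

dbGetQuery(conn, "SELECT * FROM <your_database>.<your_table> LIMIT 10")
dbDisconnect(conn)
```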
Access Amazon Redshift
Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze data using standard SQL and your existing business intelligence (BI) tools. It can query terabytes to petabytes of structured data, optimize complex queries, use columnar storage on high-performance disks, and run massively parallel queries. To access Amazon Redshift from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Redshift JDBC driver, which the lifecycle configuration script has already downloaded to the notebook instance.
We need the following keys and their values from the CloudFormation Outputs tab:
- RedshiftClusterDBName – Amazon Redshift cluster database name
- RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address and port number
- RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN
The CloudFormation template creates a secret for the Amazon Redshift cluster in Amazon Secrets Manager, which helps protect the secrets needed to access applications, services, and IT resources. Secrets Manager also makes it easy to rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.
- On the Amazon Secrets Manager console, choose Secrets.
- Choose the secret represented by the RedshiftClusterSecret key value.
- In the Secret value section, choose Retrieve secret value to get the user name and password for the Amazon Redshift cluster.
- On the Amazon Redshift console, choose Editor (which opens the Amazon Redshift query editor).
- In the Database name field, enter redshiftdb.
- In the Database password field, enter the password.
- Choose Connect to database.
- Run the following SQL statements to create a table and insert a few records:
CREATE TABLE public.students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO public.students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
- On the Amazon SageMaker console, open the notebook.
- Choose Open Jupyter.
- Upload the redshift_connect.ipynb notebook.
- Alternatively, open a new notebook and enter the code from redshift_connect.ipynb, replacing the RedshiftClusterEndpointWithPort, RedshiftClusterDBName, and RedshiftClusterSecret values.
- Run all cells in the notebook to access Amazon Redshift from the Amazon SageMaker R kernel. A sketch of such a connection follows these steps.
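To give a sense of what the connection involves, here is a hedged sketch that reads the cluster credentials from Secrets Manager (via reticulate and boto3, which the notebook environment provides) and connects with the Redshift JDBC driver. The JAR path, the driver class version, the availability of the jsonlite package, and the secret's JSON field names are assumptions.

```r
# Hedged sketch: fetch Redshift credentials from Secrets Manager and connect with RJDBC.
# The JAR path, driver class, and secret JSON field names are assumptions.
library(RJDBC)
library(reticulate)
library(jsonlite)

boto3   <- import("boto3")
secrets <- boto3$client("secretsmanager")
secret  <- secrets$get_secret_value(SecretId = "<RedshiftClusterSecret ARN>")
creds   <- fromJSON(secret$SecretString)   # assumed to contain username and password fields

redshift_jar <- "/home/ec2-user/SageMaker/jdbc/RedshiftJDBC42.jar"   # hypothetical path
drv  <- JDBC("com.amazon.redshift.jdbc42.Driver", classPath = redshift_jar)
conn <- dbConnect(drv,
                  paste0("jdbc:redshift://", "<RedshiftClusterEndpointWithPort>", "/", "<RedshiftClusterDBName>"),
                  creds$username, creds$password)

dbGetQuery(conn, "SELECT * FROM public.students")   # the table created in the query editor
dbDisconnect(conn)
```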
Access MySQL-compatible Amazon Aurora
Amazon Aurora is a MySQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. To access Amazon Aurora from the Amazon SageMaker R kernel using RJDBC, we use the MariaDB JDBC driver, which the lifecycle configuration script has already downloaded to the notebook instance.
You need the following keys and their values from the CloudFormation Outputs tab:
- AuroraClusterDBName – Aurora cluster database name
- AuroraClusterEndpointWithPort – Aurora cluster endpoint address and port number
- AuroraClusterSecret – Aurora cluster credentials secret ARN
The CloudFormation template also creates a secret for the Aurora cluster in Secrets Manager.
- On the Amazon Secrets Manager console, find the secret represented by the AuroraClusterSecret key value.
- In the Secret value section, choose Retrieve secret value to get the user name and password for the Aurora cluster.
To access the cluster, follow steps similar to those for the other services:
- On the Amazon SageMaker console, open the notebook.
- Choose Open Jupyter.
- Upload the aurora_connect.ipynb notebook.
- Alternatively, open a new notebook and enter the code from aurora_connect.ipynb, replacing the AuroraClusterEndpointWithPort, AuroraClusterDBName, and AuroraClusterSecret values.
- Run all cells in the notebook to access Amazon Aurora from the Amazon SageMaker R kernel. A sketch of such a connection follows these steps.
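A similar hedged sketch for Aurora with the MariaDB JDBC driver might look like the following. The JAR path, the jdbc:mariadb:// URL scheme, and the secret's JSON field names are assumptions; the actual aurora_connect.ipynb in the repo may differ.

```r
# Hedged sketch: fetch Aurora credentials from Secrets Manager and connect with the MariaDB JDBC driver.
# The JAR path, URL scheme, and secret JSON field names are assumptions.
library(RJDBC)
library(reticulate)
library(jsonlite)

boto3   <- import("boto3")
secrets <- boto3$client("secretsmanager")
secret  <- secrets$get_secret_value(SecretId = "<AuroraClusterSecret ARN>")
creds   <- fromJSON(secret$SecretString)

mariadb_jar <- "/home/ec2-user/SageMaker/jdbc/mariadb-java-client.jar"   # hypothetical path
drv  <- JDBC("org.mariadb.jdbc.Driver", classPath = mariadb_jar)
conn <- dbConnect(drv,
                  paste0("jdbc:mariadb://", "<AuroraClusterEndpointWithPort>", "/", "<AuroraClusterDBName>"),
                  creds$username, creds$password)

dbListTables(conn)   # verify the connection by listing tables
dbDisconnect(conn)
```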
Summary
In this post, we demonstrated how to access a variety of data sources, including Hive and PrestoDB on Amazon EMR, Amazon Athena, Amazon Redshift, and a MySQL-compatible Amazon Aurora cluster, to analyze, profile, and run statistical computations through Amazon SageMaker. You can extend the same method to other data sources via JDBC.