Connecting External Tools to the SaaS-Mode Cloud Data Warehouse MaxCompute: ETL Tools

Time: 2021-11-25

Introduction: This article explains how to connect open source ETL tools to MaxCompute.

To watch the video version, please see the live broadcast.

This talk is organized into five parts:

01 Entering the MaxCompute ecosystem

02 Open source ETL tools

03 Client introduction

04 Introduction to PyODPS

05 Hands-on demonstrations

1. Entering the MaxCompute Ecosystem

First, let’s look at the external tools supported by MaxCompute. They can be roughly divided into four categories: business intelligence, development management, transmission and scheduling, and programming interfaces. Business intelligence (BI) tools were the focus of an earlier session in this series. MaxCompute officially integrates with Tableau, FineReport, FineBI, and Quick BI. Specific versions of Tableau, FineBI, and FineReport have a built-in MaxCompute driver; other versions connect through JDBC by manually loading the MaxCompute JDBC driver. Quick BI, as an Alibaba Cloud product, can connect directly with an Alibaba Cloud account and AccessKey (AK) information. Yonghong Desktop 8.6 and above can also connect to MaxCompute through a built-in driver. On the open source side, the BI tools Superset and Davinci can connect to MaxCompute as well. A typical JDBC connection string is sketched below.
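For tools that connect through JDBC, the MaxCompute JDBC connection string generally takes the form below; the endpoint is region-specific, and the project name is a placeholder:

jdbc:odps:https://service.cn-hangzhou.maxcompute.aliyun.com/api?project=your-project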

The development management tools, which were covered in the second session of this series, include DBeaver, DataGrip, and SQL Workbench/J.

MaxCompute also integrates the open source engines Kafka and Flink. The supported open source ETL tools, which are the focus of this talk, include Kettle, Airflow, and Azkaban. The supported programming interfaces are Python, JDBC, and SQLAlchemy.

Beyond support for external tools, MaxCompute has its own open ecosystem, including the built-in open source engine Spark, the migration tool MMA, and a development ecosystem of PyODPS, Mars, a web console, and more. Meanwhile, MaxCompute and other Alibaba Cloud products together form a rich ecosystem of solutions and data applications.

2. Open Source ETL Tools

This section introduces how to connect open source ETL tools to MaxCompute, covering Airflow, Azkaban, and Kettle.

First, Airflow. Airflow is a scheduling tool written in Python; it ships with a Python operator, a bash operator, and other operators, and also supports custom plug-ins. Through the bash (command) operator, Airflow can drive the MaxCompute client from the command line to submit SQL tasks; Python SDK tasks can be submitted as .py files, and Java SDK tasks as jars. Because Airflow supports the Python operator, PyODPS can be integrated directly, with Python code written inline. The second tool is Azkaban, which mainly submits tasks through its command job type, so SQL tasks can be submitted through any of the programming interfaces MaxCompute provides. Kettle can connect to MaxCompute directly through JDBC. A sketch of the Airflow and Azkaban integrations follows.
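As a minimal sketch of the Airflow integration, the DAG below submits SQL in both of the ways described above: through the odpscmd client via a bash operator, and through PyODPS via a Python operator. The DAG id, task ids, credentials, and table name are hypothetical placeholders, and the operator import paths assume Airflow 2.x.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_pyodps_task():
    # Submit SQL to MaxCompute directly through the PyODPS SDK.
    from odps import ODPS
    o = ODPS('your-access-id', 'your-secret-access-key',
             project='your-project', endpoint='your-end-point')
    o.execute_sql('SELECT COUNT(*) FROM your_table;')  # synchronous execution


with DAG(
    dag_id='maxcompute_etl_demo',
    start_date=datetime(2021, 11, 25),
    schedule_interval=None,
) as dag:
    # Option 1: drive the odpscmd client from a bash operator.
    sql_via_client = BashOperator(
        task_id='odpscmd_sql',
        bash_command='odpscmd -e "SELECT COUNT(*) FROM your_table;"',
    )

    # Option 2: call PyODPS directly from a Python operator.
    sql_via_pyodps = PythonOperator(
        task_id='pyodps_sql',
        python_callable=run_pyodps_task,
    )

    sql_via_client >> sql_via_pyodps

For Azkaban, a job file of the command type can invoke the same client; again a sketch with a placeholder path:

type=command
command=odpscmd -f /path/to/your_script.sql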


3. Introduction to the MaxCompute CLI Client

The MaxCompute client (odpscmd) runs on Linux, macOS, and Windows.

Installation

• JDK 1.8 or later.

• A MaxCompute project has been created, and you have an account with permissions on that project.

Configuration

• Edit the odps_config.ini file under the conf folder.

• Fill in the AccessKey (AK) information, project name, and endpoint, as in the sketch below.
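For reference, a minimal odps_config.ini might look like the following; all values are placeholders, and the endpoint is region-specific:

project_name=your-project
access_id=your-access-id
access_key=your-secret-access-key
end_point=https://service.cn-hangzhou.maxcompute.aliyun.com/api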

Usage

• On Linux/macOS, run odpscmd under the bin directory; on Windows, run odpscmd.bat under the bin directory.

• Supports executing a single SQL statement, executing SQL files, uploading resources, uploading and downloading data (Tunnel), authorization, and other operations; examples follow this list.
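For example, assuming the client’s bin directory is on the PATH (the SQL statement and file name are placeholders), -e executes a single inline statement and -f runs a SQL script file:

odpscmd -e "SELECT COUNT(*) FROM your_table;"
odpscmd -f /path/to/your_script.sql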


4. Introduction to the MaxCompute Python SDK (PyODPS)

Installation

• Installing on a local machine requires a Python environment; run pip install pyodps.

• DataWorks has built-in PyODPS support; Python tasks can be submitted by creating a PyODPS node.

PyODPS initialization

from odps import ODPS

o = ODPS('your-access-id', 'your-secret-access-key',
         project='your-project', endpoint='your-end-point')

PyODPS interfaces

• Table interface: o.get_table, o.create_table

• SQL interface: o.run_sql (asynchronous execution), o.execute_sql (synchronous execution)

• PyODPS DataFrame: DataFrame(o.get_table(...)), o.get_table(...).to_df()

• Data upload and download: create_upload_session(), create_download_session(); a combined example follows this list.
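Putting these interfaces together, below is a minimal sketch; the table name, column values, and SQL are hypothetical, and it assumes the target table already exists with a matching schema.

from odps import ODPS
from odps.tunnel import TableTunnel

o = ODPS('your-access-id', 'your-secret-access-key',
         project='your-project', endpoint='your-end-point')

# Synchronous SQL execution; open_reader iterates over the result set.
with o.execute_sql('SELECT * FROM your_table LIMIT 10;').open_reader() as reader:
    for record in reader:
        print(record)

# Work with a table as a PyODPS DataFrame.
df = o.get_table('your_table').to_df()
print(df.head(5))

# Upload data through a Tunnel session.
tunnel = TableTunnel(o)
upload = tunnel.create_upload_session('your_table')
with upload.open_record_writer(0) as writer:    # block id 0
    record = upload.new_record(['value1', 42])  # values must match the table schema
    writer.write(record)
upload.commit([0])                              # commit the written block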


5. Hands-on Demonstrations

Airflow demonstration

Please see the video for the hands-on walkthrough.

Azkaban demonstration

Please see the video for the hands-on walkthrough.

Kettle demonstration

Please see the video for the hands-on walkthrough.

