DataWorks migration solution: migrating Azkaban jobs to DataWorks

Time: 2021-09-15

Introduction: The DataWorks Migration Assistant provides task migration and supports quickly migrating tasks from the open source scheduling engines Oozie, Azkaban, and Airflow to DataWorks. This article mainly introduces how to migrate jobs from the open source Azkaban workflow scheduling engine to DataWorks.

Supported Azkaban versions

Migration from all versions of Azkaban is supported.

Overall migration process

The Migration Assistant supports migrating big data development tasks from open source workflow scheduling engines to DataWorks. The basic process is shown in the figure below.

(Figure: overall migration process)

For each open source scheduling engine, the DataWorks Migration Assistant provides a corresponding task export scheme.

The overall migration process is as follows: export the jobs from the open source scheduling engine using the Migration Assistant's job export capability; then upload the job export package to the Migration Assistant and import the jobs into DataWorks through task type mapping. During import, you can choose to convert tasks to MaxCompute, EMR, or CDH job types.

Azkaban job export

The Azkaban tool can export workflows and has its own web console, as shown in the following figure:

(Figure: Azkaban web console)

The Azkaban interface supports directly downloading a flow. The flow export process:

(Figure: flow export process)

Operation steps:

1. Go to the project page

2. Click Flows to list all workflows under the project

3. Click Download to download the project's export file

Azkaban export package format: native Azkaban exports all tasks (jobs) of a project and their dependency information in a zip file.
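For reference, the exported project zip typically contains one `.job` properties file per task; a minimal sketch (the file name and command contents are illustrative):

```properties
# query.job -- a command-type task that runs a Hive query
type=command
command=hive -e "SELECT COUNT(*) FROM dwd_orders"
# runs after the task defined in load_data.job
dependencies=load_data
```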

Azkaban job import

After obtaining the exported task package of the open source scheduling engine, users can upload the zip package on the Migration Assistant > Cloud Migration Task > Scheduling Engine Job Import page for package analysis.

(Figure: scheduling engine job import page)

After the package is analyzed successfully, click Confirm to enter the import task settings page, where the analyzed scheduling task information is displayed.

Open source scheduling import settings

Users can click Advanced Settings to configure the conversion relationship between Azkaban tasks and DataWorks tasks. The advanced settings interfaces of the different open source scheduling engines are basically the same, as shown in the following figure:

(Figure: advanced settings page)

Introduction to advanced settings:

  • spark-submit conversion: the import process analyzes whether a task is a spark-submit task. If so, the spark-submit task is converted to the corresponding DataWorks task type, such as ODPS_SPARK, EMR_SPARK, or CDH_SPARK.
  • Command line SQL task conversion: many open source engine task types run SQL from the command line, such as hive -e, beeline -e, and impala-shell. The Migration Assistant converts them according to the target type selected by the user, for example to ODPS_SQL, EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_PRESTO, or CDH_IMPALA.
  • Target computing engine type: this mainly affects the destination write configuration of sqoop synchronization. Sqoop commands are converted to Data Integration tasks by default, and the computing engine type determines which computing engine project the destination data source of the Data Integration task uses.
  • Shell type conversion: DataWorks has several shell node types for different computing engines, such as EMR_SHELL, CDH_SHELL, and DataWorks' own shell node.
  • Unknown task conversion: tasks that the Migration Assistant cannot handle yet are converted to a default task type; you can select shell or virtual node.
  • SQL node conversion: DataWorks has many SQL node types because of the different bound computing engines, e.g. EMR_HIVE, EMR_IMPALA, EMR_PRESTO, CDH_HIVE, CDH_IMPALA, CDH_PRESTO, ODPS_SQL, EMR_SPARK_SQL, and CDH_SPARK_SQL. You can choose which task type to convert to.
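The conversion logic described above can be illustrated with a small sketch. The real Migration Assistant implementation is not public, so the function names and the mapping chosen here are hypothetical; the sketch only shows the idea of classifying an Azkaban `command` line and mapping it to a target type selected in Advanced Settings, assuming an EMR target:

```python
# Hypothetical sketch of the task classification and conversion idea;
# not the actual Migration Assistant implementation.

def classify_command(command: str) -> str:
    """Classify an Azkaban `command=` line into a coarse task family."""
    first = command.strip().split()[0]
    if first == "spark-submit":
        return "spark_submit"
    if first in ("hive", "beeline", "impala-shell"):
        return "command_line_sql"
    return "unknown"

# One possible choice of target types a user might make in Advanced
# Settings when importing into DataWorks + EMR.
ADVANCED_SETTINGS = {
    "spark_submit": "EMR_SPARK",
    "command_line_sql": "EMR_HIVE",
    "unknown": "DIDE_SHELL",
}

def convert(command: str) -> str:
    """Map a command line to a DataWorks task type via the settings."""
    return ADVANCED_SETTINGS[classify_command(command)]

print(convert("spark-submit --class Main app.jar"))  # EMR_SPARK
print(convert('hive -e "SELECT 1"'))                 # EMR_HIVE
print(convert("python etl.py"))                      # DIDE_SHELL
```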

Note: the optional values of these import mappings change dynamically and depend on the computing engines bound to the current workspace. The conversion relationships are as follows.

### Import to DataWorks + MaxCompute

| Setting item | Optional values |
| --- | --- |
| Convert spark-submit to | ODPS_SPARK |
| Convert command line SQL tasks to | ODPS_SQL, ODPS_SPARK_SQL |
| Target computing engine type | ODPS |
| Convert shell type to | DIDE_SHELL |
| Convert unknown tasks to | DIDE_SHELL, VIRTUAL |
| Convert SQL nodes to | ODPS_SQL, ODPS_SPARK_SQL |

### Import to DataWorks + EMR

| Setting item | Optional values |
| --- | --- |
| Convert spark-submit to | EMR_SPARK |
| Convert command line SQL tasks to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |
| Target computing engine type | EMR |
| Convert shell type to | DIDE_SHELL, EMR_SHELL |
| Convert unknown tasks to | DIDE_SHELL, VIRTUAL |
| Convert SQL nodes to | EMR_HIVE, EMR_IMPALA, EMR_PRESTO, EMR_SPARK_SQL |

### Import to DataWorks + CDH

| Setting item | Optional values |
| --- | --- |
| Convert spark-submit to | CDH_SPARK |
| Convert command line SQL tasks to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
| Target computing engine type | CDH |
| Convert shell type to | DIDE_SHELL |
| Convert unknown tasks to | DIDE_SHELL, VIRTUAL |
| Convert SQL nodes to | CDH_HIVE, CDH_IMPALA, CDH_PRESTO, CDH_SPARK_SQL |
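The three tables above can be collected into a single lookup structure; a minimal sketch (the key names are illustrative, the values are exactly the optional values from the tables):

```python
# Optional conversion values per target engine, from the tables above.
CONVERSION_OPTIONS = {
    "ODPS": {
        "spark_submit": ["ODPS_SPARK"],
        "command_line_sql": ["ODPS_SQL", "ODPS_SPARK_SQL"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["ODPS_SQL", "ODPS_SPARK_SQL"],
    },
    "EMR": {
        "spark_submit": ["EMR_SPARK"],
        "command_line_sql": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
        "shell": ["DIDE_SHELL", "EMR_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["EMR_HIVE", "EMR_IMPALA", "EMR_PRESTO", "EMR_SPARK_SQL"],
    },
    "CDH": {
        "spark_submit": ["CDH_SPARK"],
        "command_line_sql": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
        "shell": ["DIDE_SHELL"],
        "unknown": ["DIDE_SHELL", "VIRTUAL"],
        "sql_node": ["CDH_HIVE", "CDH_IMPALA", "CDH_PRESTO", "CDH_SPARK_SQL"],
    },
}

def options(engine: str, setting: str) -> list[str]:
    """Return the selectable DataWorks task types for a given setting."""
    return CONVERSION_OPTIONS[engine][setting]

print(options("EMR", "shell"))  # ['DIDE_SHELL', 'EMR_SHELL']
```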

## Execute import

After completing the settings, execute the import. For data migration, see DataWorks Data Integration or MMA: https://help.aliyun.com/document_detail/181296.html

> Copyright notice: The content of this article is contributed by real-name registered users of Alibaba Cloud, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not bear the corresponding legal liability. For specific rules, please refer to the Alibaba Cloud Developer Community User Service Agreement and the Alibaba Cloud Developer Community Intellectual Property Protection Guidelines. If you find content suspected of plagiarism in the community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the content suspected of infringement.