How Magic Box (a big data collaboration platform) implements workflow scheduling for offline computing tasks

Time: 2021-5-4
  1. Magic Box is a big data development and collaboration platform developed by Xiyun. The previous article introduced how Magic Box improves RabbitMQ consumption speed during offline task packaging;
  2. With Magic Box, data developers can not only package, test, and launch offline tasks, but also set up serial and parallel workflow scheduling for them;
  3. Taking "creating a workflow that relies on multiple parallel jobs" as an example, this article introduces the idea and process behind Magic Box's integration of Azkaban for offline task workflow scheduling.

1、 Offline computing

Managing offline tasks with Magic Box

  • Xiyun's offline computing supports Hive, Spark, and other computing frameworks;
  • After data developers write analysis code with Spark, they can package, test, and launch it using Magic Box (see the previous article for details).

2、 Task scheduling

Why a workflow scheduler?

  • A complete data analysis system is usually composed of a large number of task units: shell scripts, Java programs, MapReduce programs, Hive scripts, and so on;
  • Xiyun runs hundreds of offline analysis tasks every day, and tasks of different priorities are scheduled at different times;
  • To organize such a complex execution plan well, a workflow scheduling system is needed to drive the execution.

crontab+shell

To solve the above problems, we used crontab + shell in the early days, but this approach has the following drawbacks:

  • The dependencies between tasks are controlled entirely by scripts;
  • When there are many tasks, management and maintenance become troublesome;
  • Problems are also difficult to troubleshoot.

Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. Its advantages are as follows:

  • It can run a group of jobs and processes in a specific order within a workflow;
  • Dependencies between jobs are declared in a simple key-value (KV) file format (a minimal example follows this list);
  • It provides an easy-to-use web UI for maintenance, through which you can track your workflows.
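
For readers unfamiliar with Azkaban's job format, here is a minimal sketch of how a flow with two parallel jobs and one final job could be described; the job names and commands below are made up for illustration only:

# job_a.job — first parallel job (hypothetical)
type=command
command=echo "job a"

# job_b.job — second parallel job (hypothetical)
type=command
command=echo "job b"

# job_final.job — runs only after job_a and job_b both succeed
type=command
command=echo "done"
dependencies=job_a,job_b

Because job_final is the job that nothing else depends on, Azkaban names the flow after it, which matches the remark later in this article.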

3、 Workflow scheduling based on Magic Box

The offline computing part of Magic Box integrates Azkaban and interacts with it by calling Azkaban's Ajax API. Users do not need to log in to Azkaban's Web UI; they can complete the following tasks directly through Magic Box:

  • Create workflows;
  • Delete workflows;
  • Execute workflows;
  • Cancel workflow executions;
  • View workflow execution records, execution status, execution duration, etc.

Now I will introduce how to create a workflow with multiple parallel jobs in Magic Box. The task dependency graph of the workflow to be created is as follows:
(figure: task dependency graph of the workflow)

remarks

  • The Azkaban flow name is taken from the final job, i.e., the job that no other job depends on.

1. Create Spark tasks

A workflow depends on one or more tasks, so before creating a workflow you need to prepare the tasks:

  • In Magic Box, a Spark task is created by specifying the task's processing class and setting the parameters needed to execute the task;
  • After the task is created successfully, the jar package needed to run it is automatically uploaded to HDFS once the Magic Box project is built and tested.

(screenshot: creating a Spark task in Magic Box)

The execution parameters are stored in the database table as a JSON string, roughly in the following format:

{
    "type": "spark",
    "conf.spark.yarn.am.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.history.fs.logDirectory": "hdfs://dconline/spark/eventlog",
    "conf.spark.driver.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.eventLog.enabled": "true",
    "master": "yarn",
    "conf.spark.dynamicAllocation.executorIdleTimeout": "60",
    "deploy-mode": "cluster",
    "queue": "develop",
    // Task processing class filled in on the form when the Spark task was created
    "class": "...Main",
    // Empty by default; assigned when the workflow is created
    "name": "",
    // Empty by default; assigned when the workflow is created
    "execution-jar": "",
    // Empty by default; assigned when the workflow is created
    "dependencies": ""
}
Explanation of the main execution parameters
  • name: the application name used when the Spark task is submitted to the YARN platform;
  • execution-jar: the jar package that the task depends on;
  • dependencies: the dependencies of this job (for example, task_a,task_b indicates that the job depends on task_a and task_b).

These three parameters default to empty values. When a workflow is created, they are assigned dynamically according to the configured workflow dependencies.

remarks

  • In the figure above, I created a Spark task named Offline task test 001 (this is the task name recorded in Magic Box; the corresponding Azkaban job name is spark_task_10046);
  • Following the same steps, you also need to create two more Spark tasks, corresponding to the jobs spark_task_10006 and spark_task_10008; no more screenshots are shown here.

2. Authentication

All Azkaban API calls need to be authenticated; the call essentially simulates a user login. Therefore, before interacting with Azkaban, you need to authenticate first.

Request parameters
  • action=login: the fixed parameter indicating the login action.
  • username: the Azkaban username.
  • password: the corresponding password.
Main code
/**
 * Log in to Azkaban and return the session id.
 *
 * @return String
 * @throws Exception
 */
@Override
public String login() throws Exception {
    SSLUtil.turnOffSslChecking();
    HttpHeaders hs = new HttpHeaders();
    hs.add("Content-Type", CONTENT_TYPE);
    hs.add("X-Requested-With", X_REQUESTED_WITH);
    LinkedMultiValueMap<String, String> linkedMultiValueMap = new LinkedMultiValueMap<String, String>();
    linkedMultiValueMap.add("action", "login");
    linkedMultiValueMap.add("username", username);
    linkedMultiValueMap.add("password", password);
    HttpEntity<MultiValueMap<String, String>> httpEntity = new HttpEntity<>(linkedMultiValueMap, hs);
    RestTemplate client = new RestTemplate();
    String result = client.postForObject(azkabanUrl, httpEntity, String.class);
    log.info("--- Azkaban login response: " + result);
    return new Gson().fromJson(result, JsonObject.class).get("session.id").getAsString();
}

remarks

  • After successful authentication, a session is provided to the user (a session.id is returned in the response);
  • Any API request can be executed before the session expires (the default lifetime is 24 hours);
  • The session also expires if you log out, change machines, change browsers, or restart the Azkaban service.

3. Create workflow

When creating a workflow, you can select the task units it depends on, arranged either serially or in parallel.

Here is how to create the workflow from the requirement above (relying on multiple parallel jobs) in Magic Box.

3.1 Create the workflow (multiple jobs in parallel)

Select the task:
(screenshot)

Select the dependent tasks:
(screenshot)

Click the [Add] button:
(screenshot)

remarks

  • Magic Box uses vis.js to configure and display the workflow topology (vis.js itself is not the focus of this article).
3.2 Front-end processing: workflow dependency payload

The front end assembles the created workflow (containing the task-unit dependency data) into a dependList and submits it to the server:

[
    {
        "id": "task_100046",
        "depId": "task_100008"
    },
    {
        "id": "task_100046",
        "depId": "task_100006"
    }
]


remarks

  • If a child task also has its own dependencies, the data structure becomes a typical tree (the dependency shown in this demo has only two levels).
3.3 Server-side processing: get the dependencies

After the server receives the workflow data submitted by the front end (dependList), it processes the data into dependencies, a structure that stores the dependencies between the workflow's task units:

[
    {
        "task_100046": [
            "spark_task_10008",
            "spark_task_10006"
        ]
    }
]

remarks

  • If a child task also has its own dependencies, the array contains multiple elements in a similar format (the dependency shown in this demo has only two levels);
  • When dealing with the dependency data of a tree-structured workflow, recursion is needed to collect all the dependencies (see the sketch below).
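
The article does not include the server-side code for this step; below is a minimal sketch, with class and method names of my own, of how the dependList edges could be grouped and then walked recursively:

import java.util.*;

/**
 * Rough sketch (names are mine, not from the source) of turning the front-end
 * dependList into the dependency structure used later to build the .job files.
 */
public class DependencyCollector {

    /** One edge from the front-end dependList: task `id` depends on task `depId`. */
    public static class Edge {
        final String id;
        final String depId;
        Edge(String id, String depId) { this.id = id; this.depId = depId; }
    }

    /**
     * Group edges into: task id -> ids of the tasks it directly depends on.
     * The Azkaban job name for a task id is obtained by prefixing "spark_"
     * (e.g. task_100046 -> spark_task_100046).
     */
    public static Map<String, List<String>> directDependencies(List<Edge> dependList) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        for (Edge e : dependList) {
            deps.computeIfAbsent(e.id, k -> new ArrayList<>()).add(e.depId);
        }
        return deps;
    }

    /** Recursively collect every task reachable from taskId (for deeper, tree-shaped workflows). */
    public static void collectAll(String taskId, Map<String, List<String>> deps, Set<String> visited) {
        if (!visited.add(taskId)) {
            return;                                   // already visited, guards against cycles
        }
        for (String child : deps.getOrDefault(taskId, Collections.emptyList())) {
            collectAll(child, deps, visited);
        }
    }
}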
3.4 Server-side processing: update the execution parameters

Loop through dependencies, the structure that stores the dependency relations; for each job, the main logic is as follows (a sketch follows this list):

  1. Extract the task ID from the job name, e.g. spark_task_10008 yields task ID 10008;
  2. Use the task ID to retrieve from the table the JSON-formatted execution parameters (config_params) saved when the Spark task was created;
  3. Update config_params by assigning values to some of its parameters (name, execution-jar, dependencies).
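
As a rough sketch of this step (the class, method, and parameter names are mine; the caller is assumed to have loaded config_params from the database and to know the jar's HDFS path):

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import java.util.List;

/**
 * Rough sketch of step 3.4; not the actual Magic Box code.
 */
public class ConfigParamsUpdater {

    private static final Gson GSON = new Gson();

    /** Extract the numeric task ID from an Azkaban job name, e.g. spark_task_10008 -> 10008. */
    public static String taskIdOf(String jobName) {
        return jobName.substring(jobName.lastIndexOf('_') + 1);
    }

    /** Fill in the three parameters (name, execution-jar, dependencies) that default to "". */
    public static String updateConfigParams(String configJson, String taskId,
                                            String jarHdfsPath, List<String> dependentJobs) {
        JsonObject config = GSON.fromJson(configJson, JsonObject.class);
        config.addProperty("name", "DataCube-SparkTask[" + taskId + "]");
        config.addProperty("execution-jar", jarHdfsPath);
        config.addProperty("dependencies", String.join(",", dependentJobs)); // empty list -> ""
        return GSON.toJson(config);
    }
}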

The final execution parameters (config_params) for Offline task test 001:

{
    "type": "spark",
    "conf.spark.yarn.am.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.history.fs.logDirectory": "hdfs://******/eventlog",
    "conf.spark.driver.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.eventLog.enabled": "true",
    "master": "yarn",
    "conf.spark.dynamicAllocation.executorIdleTimeout": "60",
    "deploy-mode": "cluster",
    "queue": "develop",
    // Task processing class filled in on the form when the Spark task was created
    "class": "...Main",
    // Application name used when the Spark task is submitted to YARN
    "name": "DataCube-SparkTask[100046]",
    // Storage path of the Spark task's jar package in HDFS
    "execution-jar": "hdfs://******/20190801/prod-***_feature_***_20190724165218.jar",
    // Jobs that this job depends on
    "dependencies": "spark_task_100008,spark_task_100006"
}

The execution parameters (config_params) for Lyf create task 1:

{
    "type": "spark",
    "conf.spark.yarn.am.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.history.fs.logDirectory": "hdfs://******/eventlog",
    "conf.spark.driver.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.eventLog.enabled": "true",
    "master": "yarn",
    "conf.spark.dynamicAllocation.executorIdleTimeout": "60",
    "deploy-mode": "cluster",
    "queue": "develop",
    // Task processing class filled in on the form when the Spark task was created
    "class": "...Main",
    // Application name used when the Spark task is submitted to YARN
    "name": "DataCube-SparkTask[100006]",
    // Storage path of the Spark task's jar package in HDFS
    "execution-jar": "hdfs://******/20190801/prod-***_feature_***_20190724165218.jar",
    // Jobs that this job depends on
    "dependencies": ""
}

The execution parameters (config_params) for Lyf create task 2:

{
    "type": "spark",
    "conf.spark.yarn.am.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.history.fs.logDirectory": "hdfs://******/eventlog",
    "conf.spark.driver.extraJavaOptions": "-Dhdp.version=3.1.0.0-78",
    "conf.spark.eventLog.enabled": "true",
    "master": "yarn",
    "conf.spark.dynamicAllocation.executorIdleTimeout": "60",
    "deploy-mode": "cluster",
    "queue": "develop",
    // Task processing class filled in on the form when the Spark task was created
    "class": "...Main",
    // Application name used when the Spark task is submitted to YARN
    "name": "DataCube-SparkTask[100008]",
    // Storage path of the Spark task's jar package in HDFS
    "execution-jar": "hdfs://******/20190801/prod-***_feature_***_20190724165218.jar",
    // Jobs that this job depends on
    "dependencies": ""
}
3.5 Prepare the data required for the job resource files

After the execution parameters of each Spark task have been updated, jobLists is assembled to store the data required by the job resource files (a sketch of this assembly step follows the example below). The format is as follows:

[
    {
        "newId": "spark_task_100046",
        "config": <the task's updated config_params value>
    },
    {
        "newId": "spark_task_100008",
        "config": <the task's updated config_params value>
    },
    {
        "newId": "spark_task_100006",
        "config": <the task's updated config_params value>
    }
]
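
A minimal sketch of assembling these entries might look like this (the class and parameter names are illustrative, not from the source):

import java.util.*;

/**
 * Rough sketch of step 3.5: build the jobLists entries that zipJobFile()
 * later writes out as <newId>.job files.
 */
public class JobListBuilder {

    public static List<Map<String, String>> buildJobLists(Map<String, String> configByJobName) {
        List<Map<String, String>> jobLists = new ArrayList<>();
        for (Map.Entry<String, String> entry : configByJobName.entrySet()) {
            Map<String, String> job = new HashMap<>();
            job.put("newId", entry.getKey());      // e.g. "spark_task_100046"
            job.put("config", entry.getValue());   // the task's updated config_params JSON
            jobLists.add(job);
        }
        return jobLists;
    }
}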
3.6 Create the Azkaban project

Request parameters

  • session.id: the user session id.
  • action=create: the fixed parameter indicating the create-project action.
  • name: the name of the project to create.
  • description: the description of the project. This field cannot be empty.

Main code

/**
 * Create an Azkaban project.
 *
 * @param projectName the project name
 * @param description the project description
 * @throws Exception
 */
@Override
public void createProject(String projectName, String description) throws Exception {
    SSLUtil.turnOffSslChecking();
    HttpHeaders hs = new HttpHeaders();
    hs.add("Content-Type", CONTENT_TYPE);
    hs.add("X-Requested-With", X_REQUESTED_WITH);
    LinkedMultiValueMap<String, String> linkedMultiValueMap = new LinkedMultiValueMap<String, String>();
    linkedMultiValueMap.add("session.id", login());
    linkedMultiValueMap.add("action", "create");
    linkedMultiValueMap.add("name", projectName);
    linkedMultiValueMap.add("description", description);
    HttpEntity<MultiValueMap<String, String>> httpEntity = new HttpEntity<>(linkedMultiValueMap, hs);
    String result = restTemplate.postForObject(azkabanUrl + "/manager", httpEntity, String.class);
    log.info("--- Azkaban create-project response: " + result);
    // Both a fresh creation and an already-existing project are treated as success
    JsonObject jsonObject = new Gson().fromJson(result, JsonObject.class);
    String status = jsonObject.get("status").getAsString();
    if (!AZK_SUCCESS.equals(status)) {
        String message = jsonObject.get("message").getAsString();
        if (!"Project already exists.".equals(message)) {
            throw new Exception("Failed to create the Azkaban project");
        }
    }
}

remarks

  • In this method, if a project with the given projectName already exists, the call is still treated as successful.

Effect display

(screenshot)

3.7 Generate the zip package

Loop over jobLists, generate the one or more .job files needed by the workflow, and package the files. The main logic is as follows:

  1. Generate a file named after newId with the .job suffix;
  2. Write the value of config into the file;
  3. Compress all files in the current directory into a .zip file named with a non-repeating random number.

Main code

/**
 * Loop over the execution parameters in jobLists, write each to a file (<newId>.job),
 * and then add the files to a zip package.
 *
 * @param jobLists
 * @return
 * @author liuyongfei
 * @date 2019/04/03
 */
Map<String, Object> zipJobFile(List<Map<String, String>> jobLists) {
    int randomNumber = (int) Math.round(Math.random() * (9999 - 1000) + 1000);
    String zipName = "jobList" + randomNumber + ".zip";
    // Path of the zip file to generate
    String zipFilePath = CommonUtils.handleMultiDirectory("data/zip") + "/" + zipName;
    // Create the output stream
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(zipFilePath);
        ZipOutputStream zipOut = new ZipOutputStream(fos);
        for (Map<String, String> map : jobLists) {
            try {
                // Write the config content to the <newId>.job file
                File file = new File(map.get("newId") + ".job");
                if (!file.exists()) {
                    file.createNewFile();
                }
                // Get the execution parameters
                String configParams = map.get("config");
                // Write the execution parameters to the file
                FileWriter fw = new FileWriter(file.getAbsoluteFile());
                BufferedWriter bw = new BufferedWriter(fw);
                bw.write(configParams);
                bw.close();
                // Add the file to the zip package
                FileInputStream fis = new FileInputStream(file);
                ZipEntry zipEntry = new ZipEntry(file.getName());
                zipOut.putNextEntry(zipEntry);
                byte[] bytes = new byte[1024];
                int length;
                while ((length = fis.read(bytes)) >= 0) {
                    zipOut.write(bytes, 0, length);
                }
                fis.close();
                file.delete();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        zipOut.close();
        fos.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    Map<String, Object> zipFileMap = new HashMap<>();
    zipFileMap.put("zipFilePath", zipFilePath);
    zipFileMap.put("zipFile", new File(zipFilePath));
    return zipFileMap;
}

/**
 * Create a directory.
 * Multi-level directory creation is supported.
 *
 * @date 2019/04/03
 * @return String
 */
public static String handleMultiDirectory(String multiDirectory) {
    File savePath = null;
    try {
        savePath = new File(getJarRootPath(), multiDirectory);
        // Check whether the target directory already exists
        if (!savePath.exists() && !savePath.isDirectory()) {
            log.info(savePath + " does not exist and needs to be created");
            // Create the directory
            boolean created = savePath.mkdirs();
            if (!created) {
                log.error("Path '" + savePath.getAbsolutePath() + "' could not be created");
                throw new RuntimeException("Path '" + savePath.getAbsolutePath() + "' could not be created");
            }
        }
        log.info("File storage path: {}", savePath.getAbsolutePath());
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    return savePath.getAbsolutePath();
}

Generated job files and zip package:
(screenshots)

At this point, the zip package containing the workflow's task dependencies is ready.

4. Upload the compressed package to Azkaban

Request parameters
  • session.id: the user session id.
  • ajax=upload: the fixed parameter for the upload action.
  • project: the name of the project to upload to.
  • file: the project zip file. The content type should be set to application/zip or application/x-zip-compressed.
Main code
/**
 * Upload the zip file to Azkaban.
 *
 * @param projectName
 * @param file
 * @return
 * @throws Exception
 */
@Override
public String uploadZip(String projectName, File file) throws Exception {
    SSLUtil.turnOffSslChecking();
    FileSystemResource resource = new FileSystemResource(file);
    LinkedMultiValueMap<String, Object> linkedMultiValueMap = new LinkedMultiValueMap<String, Object>();
    linkedMultiValueMap.add("session.id", login());
    linkedMultiValueMap.add("ajax", "upload");
    linkedMultiValueMap.add("project", projectName);
    linkedMultiValueMap.add("file", resource);
    String result = restTemplate.postForObject(azkabanUrl + "/manager", linkedMultiValueMap, String.class);
    if (result.length() < 10) {
        throw new BusinessException("Failed to upload the zip package to Azkaban, please check whether the workflow is correct!");
    }
    log.info("--- Azkaban upload response: " + result);
    String projectId = new Gson().fromJson(result, JsonObject.class).get("projectId").getAsString();
    if (StringUtils.isEmpty(projectId)) {
        throw new Exception("Failed to upload the file to Azkaban");
    }
    return projectId;
}

remarks

  • After the zip package is uploaded successfully, the ID of the project created in Azkaban is returned (a sketch of how these calls fit together is shown below).
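
Putting the pieces together, the overall creation path might be wired roughly as follows, assuming the method lives in the same service class as login(), createProject(), zipJobFile(), and uploadZip() above (this is an illustration, not the actual Magic Box code):

/**
 * Illustrative wiring of the steps above: create the Azkaban project,
 * zip the .job files, and upload them.
 */
public String createWorkflow(String projectName, String description,
                             List<Map<String, String>> jobLists) throws Exception {
    // 3.6 create (or reuse) the Azkaban project
    createProject(projectName, description);

    // 3.7 write one <newId>.job file per task and zip them up
    Map<String, Object> zipFileMap = zipJobFile(jobLists);
    File zipFile = (File) zipFileMap.get("zipFile");

    // 4. upload the zip package; Azkaban returns the project id on success
    return uploadZip(projectName, zipFile);
}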

5. View the created workflow

5.1 View in Magic Box

You can see the newly created workflow in the workflow list. Click a workflow to enter its details page:
(screenshot)

remarks

  1. On the details page you can set and clear cron scheduling rules for the workflow;
  2. On the details page you can also execute the workflow and view its execution records and status;
  3. Back in the workflow list you can delete the created workflow (the corresponding data in Azkaban is deleted at the same time).

The main everyday functions provided by the Azkaban Web UI are integrated into Magic Box; for less common operations you can still go to the Azkaban Web UI.

5.2 View in Azkaban's Web UI

After the zip package is uploaded successfully, we can view the created workflow in Azkaban's Web UI.
In the Projects column, you can see the workflow that was just created through Magic Box:
(screenshots)

4、 Summary

Benefits of integrating Azkaban into Magic Box

  • On Magic Box, the big data collaborative development platform, data developers can conveniently package and launch Spark tasks;
  • Data developers can easily set up serial, parallel, and other complex workflows, making Xiyun's offline computing task management more orderly and reliable;
  • Data developers no longer have to switch frequently between Magic Box and the Azkaban Web UI; with many tasks, management becomes more convenient and developer efficiency improves;
  • Combined with Magic Box's flexible and comprehensive anomaly monitoring and alerting mechanism, the stability of data quality assurance is greatly improved, so that the big data platform's supporting systems can deliver more value.

More

By calling the Azkaban API and querying the Azkaban metadata database, we can also complete the following operations in Magic Box (a sketch of executing a workflow is shown after this list):

  • Delete workflows;
  • Set cron-scheduled tasks;
  • Execute and cancel workflows;
  • View workflow execution logs;
  • Obtain a workflow's dependent tasks and task information;
  • Monitor and alert on workflow execution.
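
As an example of one of these operations, executing a workflow goes through Azkaban's /executor endpoint with ajax=executeFlow. A sketch in the same style as the code above (not taken from the article, and assumed to live in the same service class) could look like this:

/**
 * Sketch of executing a flow through Azkaban's Ajax API (ajax=executeFlow).
 */
public String executeFlow(String projectName, String flowName) throws Exception {
    SSLUtil.turnOffSslChecking();
    HttpHeaders hs = new HttpHeaders();
    hs.add("Content-Type", CONTENT_TYPE);
    hs.add("X-Requested-With", X_REQUESTED_WITH);
    LinkedMultiValueMap<String, String> params = new LinkedMultiValueMap<>();
    params.add("session.id", login());
    params.add("ajax", "executeFlow");
    params.add("project", projectName);
    params.add("flow", flowName);
    HttpEntity<MultiValueMap<String, String>> httpEntity = new HttpEntity<>(params, hs);
    String result = restTemplate.postForObject(azkabanUrl + "/executor", httpEntity, String.class);
    log.info("--- Azkaban executeFlow response: " + result);
    // On success Azkaban returns the execution id of the triggered run
    return new Gson().fromJson(result, JsonObject.class).get("execid").getAsString();
}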

WeChat official account

Welcome to follow my WeChat official account for more articles.
