A brief analysis of DAGScheduleX, the data stack's million-scale distributed scheduling engine

Time:2021-12-30

Buses are everywhere in our daily life. Buses on different routes leave in an orderly way according to their timetables, arrive at a stop, pick up the passengers on the platform, and then drive on to the next stop. During the morning peak there are extra short-range buses and shorter departure intervals, while in the middle of the night the intervals are longer. All of this is coordinated by the dispatchers at the bus terminal.

A big data platform likewise has many tasks that must run in a defined order and at defined intervals, and the scheduling engine manages all of them. It not only has to start each task on time, but also has to handle a variety of complex scenarios, such as:

A periodic task that is scheduled every 10 minutes has already been running for 11 minutes. Should the run for the next cycle start right away?

Task B must run after task A completes. It has been waiting for a day, but task A has still not finished. Should it keep waiting?

100,000 tasks are submitted at the same time. In what order should they be executed?

Problems like these come in many forms. Without a robust and intelligent scheduling engine, a big data platform cannot run its tasks as smoothly as an orderly bus system.

There are many scheduling frameworks on the market, such as Quartz, Elastic-Job and XXL-JOB, but they only support submitting tasks on a fixed schedule, like buses running fixed shifts: they arrive on time, yet they struggle with the morning and evening rush hours. Such a single scheduling mode falls far short of business scenarios that are complex and constantly changing. This is where DAGScheduleX, a self-developed, million-scale distributed scheduling engine, comes into play. Beyond timed scheduling, it has rich built-in strategies for different scenarios, such as resource limits, fail-fast, dynamic priority adjustment, fast expiration, and dependencies on upstream and downstream scheduling states.

The data stack supports both basic timed scheduling and complex cross-cycle dependency strategies.

In the overall data stack architecture, DAGScheduleX is the link between the platform applications above it and the big data clusters below it. Within the limits of the cluster resources, DAGScheduleX coordinates resource allocation for tasks and arranges their submission, execution and periodic scheduling.

1、 Main processes of DAGScheduleX

2、 Multi-cluster configuration and multi-tenant isolation

In real data development there are usually several environments, such as development and testing. To submit tasks to the right cluster, we only need to configure the different cluster environments in the data stack console and bind tenants to them. Task submission is then isolated per cluster according to the tenant.

1. The console can bind different types of clusters, such as a Hadoop cluster for production environment A and a Libra cluster for production environment B

2. Multiple tenants can be bound to a cluster

3. When a task is submitted, the target cluster is determined by its tenantId
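The binding rules above can be pictured as a simple lookup from tenant to cluster. The following is only a minimal sketch: `Cluster`, `ClusterRegistry`, `bind` and `resolve` are hypothetical names, and the assumption that each tenant is bound to exactly one cluster is ours, not something the article states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str         # hypothetical label for the cluster in the console
    engine_type: str  # e.g. "hadoop" or "libra"


class ClusterRegistry:
    """Illustrative tenant-to-cluster binding table (not the real console)."""

    def __init__(self):
        self._tenant_to_cluster = {}

    def bind(self, tenant_id, cluster):
        # Many tenants may point at the same cluster.
        # Assumption: a tenant is bound to a single cluster at a time.
        self._tenant_to_cluster[tenant_id] = cluster

    def resolve(self, tenant_id):
        # Pick the target cluster for a submission from the tenant id alone.
        try:
            return self._tenant_to_cluster[tenant_id]
        except KeyError:
            raise LookupError(f"tenant {tenant_id} is not bound to a cluster") from None


registry = ClusterRegistry()
prod_a = Cluster("production-a", "hadoop")
prod_b = Cluster("production-b", "libra")
registry.bind(1, prod_a)
registry.bind(2, prod_a)  # a second tenant on the same cluster
registry.bind(3, prod_b)

print(registry.resolve(3).engine_type)
```

Because the lookup key is the tenant, a task never has to name its cluster explicitly, which is what makes the isolation automatic in this sketch.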

3、 Instance generation and submission

DAGScheduleX currently supports a variety of computing components, such as Flink, Spark, TensorFlow, Python, Shell, Hadoop MR, Kylin, ODPS and RDBMS (multiple relational databases). Any task submitted by an upper-layer application can be executed as long as the corresponding plug-in type is found.

DAGScheduleX supports custom task types, and new plug-ins are easy to add: define the plug-in's typeName and implement the interface methods defined in IClient. The interface methods are as follows:

init (initialization) method

judgeSlots (resource judgment) method

submitJob (submit task) method

getJobStatus (get task status) method

getJobLog (get task execution log) method

cancelJob (cancel task) method
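To make the plug-in contract more concrete, here is a rough Python stand-in for the six methods listed above. The article does not give the real signatures, so every parameter, return value and status string below is an assumption, and `Client`, `EchoClient`, `PLUGINS` and `submit` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class Client(ABC):
    """Python stand-in for the plug-in interface; all signatures are assumed."""

    @abstractmethod
    def init(self, config):
        """Prepare the plug-in with its configuration."""

    @abstractmethod
    def judge_slots(self, job):
        """Return True if there are enough resources to run the job now."""

    @abstractmethod
    def submit_job(self, job):
        """Submit the job and return an identifier for it."""

    @abstractmethod
    def get_job_status(self, job_id):
        """Return the current status of a submitted job."""

    @abstractmethod
    def get_job_log(self, job_id):
        """Return the execution log of a submitted job."""

    @abstractmethod
    def cancel_job(self, job_id):
        """Try to cancel a job; return True if it was cancelled."""


class EchoClient(Client):
    """Toy plug-in that finishes every job instantly, in memory."""

    type_name = "echo"  # hypothetical plug-in type name

    def init(self, config):
        self.max_slots = config.get("max_slots", 1)
        self.jobs = {}

    def judge_slots(self, job):
        running = sum(1 for j in self.jobs.values() if j["status"] == "RUNNING")
        return running < self.max_slots

    def submit_job(self, job):
        job_id = f"{self.type_name}-{len(self.jobs) + 1}"
        self.jobs[job_id] = {"status": "FINISHED", "log": f"echo: {job['command']}"}
        return job_id

    def get_job_status(self, job_id):
        return self.jobs[job_id]["status"]

    def get_job_log(self, job_id):
        return self.jobs[job_id]["log"]

    def cancel_job(self, job_id):
        # Only a running job can be cancelled; finished jobs are left alone.
        if self.jobs[job_id]["status"] != "RUNNING":
            return False
        self.jobs[job_id]["status"] = "CANCELED"
        return True


# The engine side: find the plug-in by type name, check resources, then submit.
PLUGINS = {EchoClient.type_name: EchoClient}


def submit(type_name, config, job):
    client = PLUGINS[type_name]()
    client.init(config)
    if not client.judge_slots(job):
        raise RuntimeError("not enough resources, retry later")
    return client, client.submit_job(job)


client, job_id = submit("echo", {"max_slots": 2}, {"command": "hello"})
print(job_id, client.get_job_status(job_id), client.get_job_log(job_id))
```

The point of the sketch is the division of labour: the engine only knows the type name and the six operations, while everything specific to a computing component lives inside the plug-in.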

Once a task has been submitted to DAGScheduleX, the engine generates its jobs (instances) for the next day one day in advance. On the day of execution, each instance runs at its scheduled time and the engine then collects the execution result. Backfill runs (data replenishment) and run-immediately requests are not bound by this one-day-ahead generation. DAGScheduleX also supports cross-tenant upstream and downstream task dependencies, task self-dependency, task priority adjustment, task queue management in the console, task monitoring in the operation and maintenance center, and more.
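The one-day-ahead generation can be sketched as follows. This is an illustration under our own assumptions, not the engine's actual logic: the task is described only by a first run time and a fixed interval, and `generate_instances` and the job key format are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def generate_instances(task_name, first_run, interval, today):
    """Return (job_key, scheduled_time) pairs for the day after `today`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    next_day = today + timedelta(days=1)
    current = datetime.combine(next_day, first_run)
    end = datetime.combine(next_day + timedelta(days=1), time.min)
    instances = []
    while current < end:
        # Hypothetical key format: task name plus the scheduled minute.
        job_key = f"{task_name}_{current:%Y%m%d%H%M}"
        instances.append((job_key, current))
        current += interval
    return instances


# Generated on 2021-12-30 for execution on 2021-12-31, every 8 hours from 02:00.
jobs = generate_instances("daily_report", time(2, 0), timedelta(hours=8), date(2021, 12, 30))
for key, scheduled_at in jobs:
    print(key, scheduled_at)
```

Generating the instances ahead of time means that on the execution day the scheduler only has to pick up instances whose scheduled time has arrived, rather than evaluate every task's schedule on the fly.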

4、 Task alarm

When a chain of upstream and downstream dependencies is long, the failure of one upstream job (instance) can cause data problems downstream. To catch this early, DAGScheduleX supports monitoring alarms for several scenarios:

Execution exceeds the specified time

Execution failed

Task not running

Task stopped
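As a rough illustration of how the four scenarios above could be checked for a single instance, consider the sketch below. The status strings, the thresholds and the names `JobInstance` and `triggered_alarms` are all assumptions made for this example; the article does not describe how the engine evaluates alarms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobInstance:
    status: str  # assumed values: WAITING, RUNNING, FAILED, STOPPED, FINISHED
    scheduled_at: datetime
    started_at: Optional[datetime] = None


def triggered_alarms(job, now, max_runtime, start_grace):
    """Return the names of the alarm scenarios that apply to this instance."""
    alarms = []
    if job.status == "RUNNING" and job.started_at is not None:
        if now - job.started_at > max_runtime:
            alarms.append("timeout")      # execution exceeds the specified time
    if job.status == "FAILED":
        alarms.append("failed")           # execution failed
    if job.status == "WAITING" and now - job.scheduled_at > start_grace:
        alarms.append("not_running")      # task not running after its scheduled time
    if job.status == "STOPPED":
        alarms.append("stopped")          # task stopped
    return alarms


now = datetime(2021, 12, 30, 12, 0)
slow = JobInstance("RUNNING", now - timedelta(hours=2), started_at=now - timedelta(hours=2))
print(triggered_alarms(slow, now, timedelta(hours=1), timedelta(minutes=30)))
```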

The console's alarm channels support not only common alarm methods such as DingTalk, SMS and email, but also user-defined alarm channels. To add a custom channel:

Add the DAGScheduleX alarm SDK to your project

Implement the custom alarm logic in ICustomizeChannel

Upload the packaged jar to the console's alarm channel

Configure the corresponding alarm scenario in the application
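The steps above follow a familiar extension pattern: the platform defines a channel interface, users supply an implementation, and alarm scenarios are routed to it. The sketch below shows only that pattern in Python. It is not the alarm SDK: the real channel is packaged as a jar and its method signatures are not given in the article, so `CustomizeChannel`, `send`, `AlarmDispatcher`, `configure` and `notify` are hypothetical.

```python
from abc import ABC, abstractmethod


class CustomizeChannel(ABC):
    """Stand-in for a custom alarm channel interface (signature assumed)."""

    @abstractmethod
    def send(self, scenario, message):
        """Deliver one alarm; return True on success."""


class InMemoryChannel(CustomizeChannel):
    """Toy channel that records alarms instead of sending them anywhere."""

    def __init__(self):
        self.sent = []

    def send(self, scenario, message):
        self.sent.append(f"[{scenario}] {message}")
        return True


class AlarmDispatcher:
    """Routes each alarm scenario to the channels configured for it."""

    def __init__(self):
        self._routes = {}

    def configure(self, scenario, channel):
        self._routes.setdefault(scenario, []).append(channel)

    def notify(self, scenario, message):
        # Returns how many channels accepted the alarm.
        channels = self._routes.get(scenario, [])
        return sum(1 for channel in channels if channel.send(scenario, message))


dispatcher = AlarmDispatcher()
channel = InMemoryChannel()
dispatcher.configure("failed", channel)
dispatcher.notify("failed", "job daily_report failed")
print(channel.sent)
```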

5、 Summary

DAGScheduleX is a distributed task scheduling engine that handles instance generation, scheduling, submission, operation and maintenance, and alarms. The data stack's offline computing, stream computing, algorithm development and other suites all rely on it to run their tasks, which makes it a very important hub.

————————————————

This article was first published in: Digital stack Institute

The data stack is a cloud-native, one-stop data platform PaaS. We have an interesting open source project on GitHub: FlinkX

FlinkX is a unified batch and stream data synchronization tool based on Flink. It can collect both static data, such as MySQL and HDFS, and data that changes in real time, such as MySQL binlog and Kafka. It is a global, heterogeneous data synchronization engine that integrates batch and stream processing. If you are interested, please come and join us in the GitHub community~