From open source users to Apache PPMC


Recently, Wu Baoqi, co-founder and chief architect of Guanyuan Data and a PPMC member of Apache DolphinScheduler, took part in the first Apache DolphinScheduler user conference, where he talked about his path from open source user to Apache PPMC member. The main content of his talk follows.

Table of Contents
• 1. Part 1. Origin
O 1.1. Phase 1: Airflow itself is very powerful, and we built many operator extensions
O 1.2. Phase 2: Apache NiFi and StreamSets Data Collector (SDC)
O 1.3. Phase 2.5: Kettle and Talend DI
O 1.4. Phase 3: Surveying open source scheduling projects and choosing DolphinScheduler
• 2. Part 2. Commencement
O 2.1. Contributions to the project
O 2.2. Why contribute to open source
O 2.3. Benefits of open source
• 3. Part 3. Future
O 3.1. Some features to explore

1 Part 1. Origin

Guanyuan Data is a BI + AI data technology company. BI (Business Intelligence) is not just attractive visualization: it involves integrating with a large number of external systems and fusing their data, which in turn requires complex data cleaning and task scheduling. Our BI product has a built-in lightweight data processing module, but we were also looking for a more suitable open source tool for more complex task scheduling and data backfill, and for the data cleaning, feature engineering, and scheduling in our AI products.

1.1 Phase 1: Airflow itself is very powerful, and we built many operator extensions


However, Airflow has one major problem for us: it relies too heavily on Python programming. Many extensions have to be written in Python, and task dependencies have to be orchestrated by writing Python code.
Our scheduling tool has to be usable by implementation consultants. Many of them cannot program, so asking everyone to write Python is too much. We concluded that we need a web tool with a good visual interface, and that we cannot assume users can program.
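
To make the objection concrete, here is a minimal sketch of what "orchestrating task dependencies by writing Python" feels like. It deliberately uses only the standard library's `graphlib` as a stand-in and is not Airflow's API; the task names are hypothetical. Even a three-step pipeline requires its author to read and write code:

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "clean_orders": {"extract_orders"},
    "build_report": {"clean_orders"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
```

A consultant who cannot program would have to edit the `dependencies` mapping by hand, which is exactly the step a visual drag-and-drop editor is meant to replace.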

1.2 Phase 2: Apache NiFi and StreamSets Data Collector (SDC)


Main conclusions:
• NiFi supports unstructured data and has more features than SDC
• However, StreamSets SDC is easier to use and better looking (looks matter too), especially in these three respects:
O real-time metrics support (you can watch a pipeline's running information live, displayed graphically)
O great code!
O great plug-in design, which makes custom plug-ins easier to write
Although SDC is very attractive, its main use case is real-time data extraction and transformation. BI workloads are still mainly offline scheduled tasks, so it is not an exact match.

1.3 Phase 2.5: Kettle and Talend DI

Main conclusions:
• Both are traditional ETL tools with weak scheduling (note: we evaluated the open source edition of Talend DI; the commercial edition has more sophisticated scheduling, but it is not cheap)
• Plug-in extensions are somewhat complicated
• Talend can translate jobs into Java projects, which we like: you can run the JAR packages directly without installing Talend on every machine. There are drawbacks too: many errors surface as Java exceptions, and many custom extensions require users to know basic Java

However, the code of both projects is very complex, which makes it hard to read and master. An important criterion for adopting an open source project is whether your own company's engineers can master it.

1.4 Phase 3: Surveying open source scheduling projects and choosing DolphinScheduler


Main conclusion: DolphinScheduler (named EasyScheduler before joining Apache) fits our scenario better. Main reasons:
• Apache License
• Process/task definitions and instances are separate, backfill is supported, and the concepts are clear. If the direction is right, you need not fear a long road
• It has a good graphical configuration interface, instead of requiring JSON configuration for everything or defining DAGs in Python
• It runs on the JVM, which makes it convenient for a Java shop to extend later
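
The separation of definitions from instances is what makes backfill straightforward. The following is a minimal sketch of the idea only, not DolphinScheduler's actual data model: the class names, the `{day}` placeholder syntax, and the `load_sales` command are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ProcessDefinition:
    """The reusable template: what to run, independent of any particular run."""
    name: str
    command_template: str  # hypothetical placeholder syntax: {day}


@dataclass(frozen=True)
class ProcessInstance:
    """One concrete run of a definition for a specific business date."""
    definition: ProcessDefinition
    business_date: date

    def command(self) -> str:
        day = self.business_date.isoformat()
        return self.definition.command_template.format(day=day)


def backfill(definition, start, end):
    """Create one instance per day in the inclusive range [start, end]."""
    days = (end - start).days + 1
    return [ProcessInstance(definition, start + timedelta(days=i)) for i in range(days)]


daily_sales = ProcessDefinition("daily_sales", "load_sales --day {day}")
runs = backfill(daily_sales, date(2020, 1, 1), date(2020, 1, 3))
for run in runs:
    print(run.command())
```

Because the definition never stores run state, re-running past dates only means creating more instances of it; nothing about the definition itself has to change.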

2 Part 2. Commencement

As the company's chief architect, part of my job is to think about and try out possible future directions for the company, so I started contributing to DolphinScheduler. At first I was the only one at the company doing this (though I was never really alone, since I worked alongside many peers in the open source community), but now other colleagues are contributing to open source with me.

2.1 Contributions to the project

The main approach: go from simple to complex and gradually integrate into the community.

At the very beginning:
• get familiar with the project code and set up a local environment
• fix some minor bugs

Next, implement some simple features:
• add ClickHouse support
• add Oracle support
• add SQL Server support

Then move on to more complex features:
• add pre/post statement support for SQL tasks
• support MinIO/S3 as file storage for the "resource center"
• support CombinedServer: start multiple servers together to simplify local development
• use a sifting appender to fix interleaved task logs
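
The sifting idea in the last item is to route each log event to a destination chosen by a discriminator, such as the task ID, so that output from concurrent tasks no longer interleaves in one file. The sketch below is a rough Python analogue of that technique, not the project's actual logging configuration; the `task_id` attribute and the file naming are assumptions made for illustration.

```python
import logging
import os
import tempfile


class PerTaskFileHandler(logging.Handler):
    """Route each record to a file chosen by the record's `task_id` attribute."""

    def __init__(self, log_dir):
        super().__init__()
        self.log_dir = log_dir
        self._handlers = {}  # task_id -> FileHandler, created lazily

    def emit(self, record):
        task_id = getattr(record, "task_id", "unknown")
        handler = self._handlers.get(task_id)
        if handler is None:
            path = os.path.join(self.log_dir, f"task-{task_id}.log")
            handler = logging.FileHandler(path, encoding="utf-8")
            handler.setFormatter(self.formatter)
            self._handlers[task_id] = handler
        handler.emit(record)

    def close(self):
        for handler in self._handlers.values():
            handler.close()
        super().close()


log_dir = tempfile.mkdtemp()
logger = logging.getLogger("worker_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
sifter = PerTaskFileHandler(log_dir)
logger.addHandler(sifter)

# Interleaved output from two tasks ends up in two separate files.
logger.info("start", extra={"task_id": "a"})
logger.info("start", extra={"task_id": "b"})
logger.info("done", extra={"task_id": "a"})
sifter.close()
```

Each task's log can then be read or served on its own, which is what a per-task log viewer in a scheduler UI needs.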

Of course, contributing is not limited to merged pull requests. It also includes:
• reviewing pull requests
• answering questions in the community
• speaking at user meetups, promoting DolphinScheduler, and so on :)

2.2 Why contribute to open source

Contributing to open source is also an inevitable choice. I ran into a project like this at a previous company (the company and project names are withheld):

At first, everything was fine:
• we implemented a feature on top of an open source package
• we made a large number of extensions and modifications to it

Until one day, the PM came and said:
• this software has been upgraded from 1.x to 2.0
• 2.0 refactors a lot of code and greatly improves performance
• it supports new international standards and adds many new scripting functions!
• let's upgrade and support it

As a result, the colleague in charge of the upgrade spent six months on it. His daily work looked like this:
• compile the C++ code for various platforms and fix assorted build errors
• study modern C++ design closely to understand the unusual C++ template idioms and patterns
• go through version control, review each commit to understand why it was made and what it changed, and then try to apply it to the new version

Looking back, apart from the C++ and cross-platform compilation issues, another important conclusion was this: for any open source software we use, we must find a way to get our extensions and bug fixes merged into the official repository. That greatly reduces maintenance cost in the long run.

2.3 Benefits of open source

Open source software today:
• it is no longer just geeks' personal projects
• open source has grown extremely fast over the past decade, and it is the future trend
• open source does not conflict with business; it is simply one form of business
• code is only a small part of open source; more important is the community around the project
By contributing to open source you get not only an Apache email address, but also:
• better code quality and more comments (knowing that thousands of people may read the code I write)
• solutions that have to be more general
• like-minded friends
• more users, so problems are found faster
• for the company, an easier time attracting talent

3 Part 3. Future

3.1 Some features to explore
• plug-in support
• an Airflow-like pool feature that limits how many of a specified set of tasks run at the same time
• more workflow triggers: time-based triggers, complex rules, webhooks, and other mechanisms
• task metrics support, to view each component's metrics in real time (for example, number of input records, number of output records, execution time, and the trend over the last 30 runs)
• simple version management for workflow definitions, resource files, and so on (view history, roll back to a specified version)
• a data lineage reporting component
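
Of the items above, the pool feature is the easiest to illustrate. Below is a minimal sketch of the general technique, a semaphore that caps how many tasks assigned to the same pool run concurrently. It is an assumption about how such a feature could work, not an existing or planned DolphinScheduler API; `Pool`, `slots`, and `query_database` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """Limit how many tasks assigned to this pool run at the same time."""

    def __init__(self, slots):
        self._semaphore = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest concurrency observed, for demonstration only

    def run(self, task, *args):
        with self._semaphore:  # blocks while all slots are taken
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self._running -= 1


def query_database(n):
    time.sleep(0.05)  # stand-in for real work against a shared resource
    return n * 2


db_pool = Pool(slots=2)
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda n: db_pool.run(query_database, n), range(6)))

print(results, db_pool.peak)
```

Eight worker threads are available, but at most two `query_database` calls are ever in flight, which is how a pool protects a shared resource such as a database from being overloaded by the scheduler.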

The future architecture of DolphinScheduler that I have in mind is only a hypothesis. DolphinScheduler's current architecture does not look like this, and its future architecture will not necessarily look like this either. It is just my personal idea of a general-purpose data scheduling platform.


The above is a brief account of my experience participating in the DolphinScheduler open source project. Because the audience is mainly technical, I did not dwell on abstract topics such as project background, project significance, or future direction. Depending on the feedback to this article, I will introduce some future products built on DolphinScheduler at a suitable time. Stay tuned!