China Unicom Transformed the Apache DolphinScheduler Resource Center to Enable One-Stop Access to Data Scripts and Cross-Cluster Calls in Its Billing Environment

Time: 2022-05-14

By 2022, China Unicom had reached 460 million users, about 30% of China's population. With the popularization of 5G, operators' IT systems are generally facing the impact of massive user bases, massive bills, diversified services, new networking modes, and other changes.

At present, China Unicom processes more than 40 billion call records a day. On this basis, improving service levels and providing more targeted services to customers has become the ultimate goal of the China Unicom brand. China Unicom has distinguished itself in the technology and application of massive data collection, processing, desensitization, and encryption, holds a certain first-mover advantage in the industry, and is bound to become an important promoter of big data in the development of the digital economy.

At the Apache DolphinScheduler meetup in April, we invited Bai Xuesong from the China Unicom Software Research Institute, who shared with us the application of DolphinScheduler in China Unicom's billing environment.

This talk consists of three parts:

  • Overall usage of DolphinScheduler
  • Special sharing on China Unicom's billing business
  • Next-step plans

Bai Xuesong, Big Data Engineer at the China Unicom Software Research Institute

A graduate of China Agricultural University, he works on the construction of big data and AI platforms. He contributed the Apache SeaTunnel (Incubating) plugin to Apache DolphinScheduler and the Alluxio plugin to Apache SeaTunnel (Incubating).

01 Overall Usage

First of all, let me outline China Unicom's overall use of DolphinScheduler:

  • Our business mainly runs on four clusters across three locations
  • There are about 300 task flows in total
  • About 5,000 task runs are executed per day

The DolphinScheduler task types we use include Spark, Flink, SeaTunnel (formerly Waterdrop), Presto, stored procedures, and shell scripts. The businesses covered include auditing, revenue allocation, billing, and other businesses that need automation.

02 Business Topic Sharing

01 Cross-cluster active-active business calls

As mentioned above, our business runs on four clusters across three locations, so data exchange and business calls between clusters are unavoidable, and uniformly managing and scheduling these cross-cluster data transmission tasks is an important problem. The data in our production clusters is very sensitive to cluster network bandwidth, so data transmission must be managed in an orderly way.

On the other hand, some of our businesses need to make cross-cluster calls; for example, cluster A needs to start statistical tasks only after the data on cluster B is in place. We chose Apache DolphinScheduler as the scheduling and control layer to solve these two problems.

First, let me explain our cross-cluster data transmission process, which runs between the A and B clusters. We use HDFS for underlying data storage. For cross-cluster HDFS data exchange, we divide the data by size and purpose into small-batch data and large-batch data, such as structured tables, configuration tables, and so on.

For small-batch data, we mount it directly into the same Alluxio instance for sharing, so version problems caused by delayed data synchronization do not occur.

  • For large files such as schedules, we process them with a mix of DistCp and Spark;
  • For structured table data, we use SeaTunnel on Spark;
  • Transmission speed is limited through YARN queues;
  • Unstructured data is transmitted by DistCp, with speed capped by its built-in bandwidth parameter (see the sketch after this list).
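Below is a minimal sketch of what these two transfer styles can look like when wrapped as DolphinScheduler shell tasks. All paths, queue names, and file names here are illustrative, not our actual configuration:

```bash
#!/bin/bash
# Unstructured data: DistCp, with -bandwidth capping each map task (MB/s)
# and the job pinned to a throttled YARN queue.
hadoop distcp \
  -Dmapreduce.job.queuename=transfer_limited \
  -bandwidth 20 \
  -m 10 \
  hdfs://clusterA/data/bills/20220514 \
  hdfs://clusterB/data/bills/20220514

# Structured table data: SeaTunnel on Spark (SeaTunnel 2.x script name;
# older Waterdrop releases shipped start-waterdrop.sh instead).
# The YARN queue, and thus the effective bandwidth cap, is set in the
# config file's env block, e.g. spark.yarn.queue = "transfer_limited".
./bin/start-seatunnel-spark.sh \
  --master yarn \
  --deploy-mode client \
  --config ./config/table_sync.conf
```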

These transmission tasks all run on the DolphinScheduler platform. Our overall data flow mainly includes: data arrival detection on cluster A, data integrity verification on cluster A, data transmission from cluster A to cluster B, and data auditing and arrival notification on cluster B.
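As a concrete illustration of the first step, an arrival-detection task can simply poll HDFS for a completion flag. This is a hedged sketch with made-up paths and retry limits, not our production script:

```bash
#!/bin/bash
# Poll for the _SUCCESS flag that marks a partition as fully written.
DATA_PATH="hdfs://clusterA/data/bills/20220514"
for i in $(seq 1 30); do
  if hdfs dfs -test -e "${DATA_PATH}/_SUCCESS"; then
    echo "data arrived: ${DATA_PATH}"
    exit 0   # success lets downstream transfer tasks start
  fi
  sleep 60
done
echo "data not in place after 30 minutes" >&2
exit 1       # a non-zero exit fails the task; a complement run can repair it later
```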

One thing to emphasize: we mainly use DolphinScheduler's complement (backfill) runs to repair failed tasks or incomplete data.

After completing cross-cluster data synchronization and access, we also use DolphinScheduler to call tasks across regions and clusters.

We have two clusters in location A, the A1 test cluster and the A2 production cluster, plus the B1 production cluster in location B. From each cluster we take two machines with intranet IPs as interface machines, and by deploying DolphinScheduler on these six interface machines we form a virtual cluster, so that we can operate all three clusters from one unified page.

Q: How do we move from test to production?

A: Develop the task on the A1 test cluster, and after it passes testing, directly switch its worker group to A2 production.

Q: What should we do if there is a problem with A2 production and the data is not in place?

A: We can directly switch to B1 production, achieving a manual active-active disaster recovery switch.
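A hedged sketch of how this switching can be wired up: in DolphinScheduler 1.3.x each worker declares its group in conf/worker.properties, and the group names below are illustrative:

```bash
# On the A1 test interface machines:
echo "worker.groups=a1_test" >> conf/worker.properties
# On the A2 production interface machines:
echo "worker.groups=a2_prod" >> conf/worker.properties
# On the B1 production interface machines:
echo "worker.groups=b1_prod" >> conf/worker.properties
```

Promoting a workflow from test to production, or failing over from A2 to B1, is then just re-pointing the tasks' worker group (for example, a1_test to a2_prod) in the UI.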

Finally, we have some large tasks that, to meet their timeliness requirements, must be computed on two clusters at once. We split the data into two parts, place them on A2 and B1 respectively, run the task on both clusters simultaneously, and finally send the results back to the same cluster for merging. These task flows are essentially all orchestrated through DolphinScheduler.
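The final merge step can be as simple as copying one half back and reading both halves as one dataset. A minimal sketch with illustrative paths:

```bash
#!/bin/bash
# Copy the B1 half of the output back to A2.
hadoop distcp \
  hdfs://clusterB1/output/bigjob/part_b \
  hdfs://clusterA2/output/bigjob/part_b

# Both halves now live under one directory on A2; for example, pull them
# into a single local file for downstream processing.
hdfs dfs -getmerge hdfs://clusterA2/output/bigjob result_merged.txt
```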

Note that in this process, we used DolphinScheduler to solve several problems:

  • Verifying task dependencies across cluster projects;
  • Controlling task environment variables at the node level;

02 AI development and synchronized task running

1. Unified data access

We now have a simple AI development platform that mainly provides users with TensorFlow and Spark ML computing environments. Our business requires connecting the file models that users train locally with the cluster file system, and providing a unified access method and deployment method. To solve this problem, we use two tools: Alluxio FUSE and DolphinScheduler.

  • Alluxio FUSE connects local and cluster storage
  • DolphinScheduler shares local and cluster storage

Because the AI platform cluster and the data cluster we built are two separate clusters, we store data in the data cluster, preprocess some of it with Spark SQL or Hive, mount the processed data into Alluxio, and finally map it to local files across clusters through Alluxio FUSE. A conda-based development environment can then access this data directly. In this way, the data access mode is unified: cluster data is accessed as if it were local data.
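For reference, a minimal Alluxio FUSE mount sketch, assuming the Alluxio 2.x standalone FUSE client and illustrative paths:

```bash
# Expose the Alluxio path /train_data as the local directory /mnt/train_data;
# after this, conda/TensorFlow jobs read cluster data like ordinary local files.
integration/fuse/bin/alluxio-fuse mount /mnt/train_data /train_data

mount | grep alluxio-fuse   # verify the mount is live

# Unmount when the environment is torn down.
integration/fuse/bin/alluxio-fuse umount /mnt/train_data
```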

2. One-stop access to data scripts

After separating the resources, big data content is preprocessed on the data cluster, while training and prediction models are processed on the AI cluster. Here, we used Alluxio FUSE to make a secondary modification to DolphinScheduler's Resource Center: we connected the Resource Center to Alluxio and mounted local files and cluster files simultaneously through Alluxio FUSE. In this way, local training and inference scripts, as well as the training and inference data stored on HDFS, can both be accessed from DolphinScheduler, realizing one-stop access to data and scripts.
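Conceptually, the layout after this change can be sketched as two parallel FUSE mounts, with all paths illustrative:

```bash
# Resource Center files (scripts) backed by Alluxio:
integration/fuse/bin/alluxio-fuse mount /mnt/ds_resources /dolphinscheduler/resources
# HDFS-backed training and inference data, also via Alluxio:
integration/fuse/bin/alluxio-fuse mount /mnt/train_data /train_data
```

A DolphinScheduler task then sees scripts and data side by side as local directories.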

03 Business query logic persistence

In the third scenario, we use Presto and Hue to provide users with a front-end real-time query interface. Some users write SQL through the front end and, after testing is complete, need to run certain processing logic and stored procedures on a schedule, so we needed to open up the path from front-end SQL to back-end scheduled tasks.

Another problem is that native Presto has no resource isolation between tenants. After weighing several Presto deployment schemes against our actual situation, we finally settled on a combination of them.

Because we are a multi-tenant platform, the initial solution we provided used the Hue interface on the front end and native Presto running directly on a physical cluster at the back end. This led to resource contention between users: when some tenants ran large queries, other tenants and businesses faced long waits.

Therefore, we compared Presto on YARN and Presto on Spark. After a comprehensive performance comparison, we found that Presto on Spark uses resources more efficiently. You can likewise choose the scheme that fits your own needs.

On the other hand, we let native Presto and Presto on Spark coexist. SQL with small data volumes and simple processing logic runs directly on native Presto, while SQL with complex logic and long running times runs on Presto on Spark, so users can switch between the underlying engines with a single set of SQL.
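A hedged sketch of how one SQL file can be routed to either engine, using PrestoDB's standard tooling (server addresses, versions, and file names are illustrative):

```bash
# Small, simple queries: native Presto via the CLI.
presto --server http://presto-coordinator:8080 \
  --catalog hive --schema default \
  --file query.sql

# Heavy, long-running queries: Presto on Spark via spark-submit.
spark-submit \
  --master yarn \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-0.272.jar \
  --package presto-spark-package-0.272.tar.gz \
  --config ./etc/config.properties \
  --catalogs ./etc/catalogs \
  --catalog hive \
  --schema default \
  --file query.sql
```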

In addition, we opened up the scheduling path from Hue to DolphinScheduler. After SQL is developed and tuned on Hue, it is stored in a local server file and connected to Git for version control.

We mount the local files through Alluxio FUSE as a synchronized mount of the SQL. Finally, Hue creates tasks and scheduled runs through the DolphinScheduler API, realizing process control from SQL development to scheduled execution.
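A hedged sketch of the Hue-side API calls; endpoint paths differ across DolphinScheduler versions, and this assumes the 1.3.x-style REST API with an illustrative project name, token, and IDs:

```bash
DS_HOST="http://ds-api:12345/dolphinscheduler"
TOKEN="<api-token>"

# 1. Create a process definition that wraps the SQL task
#    (workflow JSON prepared in process.json).
curl -X POST "${DS_HOST}/projects/billing/process/save" \
  -H "token: ${TOKEN}" \
  --data-urlencode "name=hue_sql_daily" \
  --data-urlencode "processDefinitionJson@process.json"

# 2. Attach a cron schedule to the new definition.
curl -X POST "${DS_HOST}/projects/billing/schedule/create" \
  -H "token: ${TOKEN}" \
  --data-urlencode "processDefinitionId=123" \
  --data-urlencode 'schedule={"crontab":"0 0 2 * * ?"}'
```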

04 Unified management of data lake data

The last scenario is unified management of data lake data. On our self-developed data integration platform, we apply hierarchical governance to uniformly manage and access the data in the lake, with DolphinScheduler serving as the lake's scheduling and monitoring engine.

On the data integration platform, we use DolphinScheduler to schedule batch and real-time tasks such as data integration, data ingestion into the lake, and data distribution.

The bottom layer runs on Spark and Flink. For business needs that require immediate feedback, such as data query and data exploration, we embed Hue over Spark and Presto to explore and query the data. For data asset registration synchronization and data auditing, we directly query the data source file information and synchronize the underlying data information.

At present, our integration platform manages the quality of about 460 data tables, providing unified management of data accuracy and timeliness.

03 Next-Step Plans and Requirements

01 Resource Center

At the Resource Center level, to facilitate file sharing among users, we plan to provide resource authorization for all users and allocate tenant-level shared files according to the tenant each user belongs to, making the platform friendlier for multi-tenancy.

02 User Management

Second, regarding user permissions: we will provide only tenant-level administrator accounts, and subsequent user accounts will be created by the tenant administrator. User management within a tenant group will also be controlled by the tenant administrator, facilitating internal management within each tenant.

03 Task Nodes

Finally, work on our task nodes is in progress. On the one hand, we are optimizing the SQL node so that users can select an SQL file from the Resource Center instead of manually copying SQL; on the other hand, the HTTP node will support custom parsing and field extraction of returned JSON, handling complex return values more gracefully.

04 Participation and Contribution

With the rapid rise of open source in China, the Apache DolphinScheduler community is booming. To build a better and easier-to-use scheduler, we sincerely welcome everyone who loves open source to join the community, contribute to the rise of open source in China, and help local open source go global.

There are many ways to participate in and contribute to the DolphinScheduler community, including reporting issues, improving documentation, and submitting code.

We also hope your first PR (documentation or code) is simple; its purpose is to get familiar with the submission process and community collaboration, and to feel the friendliness of the community.

The community has summarized the following list of issues for newcomers: https://github.com/apache/dolphinscheduler/issues/5689

List of issues beyond the newcomer level: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22

How to participate and contribute: https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html

Come on, the DolphinScheduler open source community needs your participation to contribute to the rise of open source in China. Even a single small tile, gathered together, forms enormous power.

If you want to contribute, we have a contributor seed incubation group. You can add the community assistant, Leonard DS, on WeChat for hands-on guidance (contributors of any level can have their questions answered; the key is a heart willing to contribute).

When adding the assistant on WeChat, please note that you want to participate in contributing.

Come on, the open source community is looking forward to your participation.

05 Event Recommendation

When data resources become an essential element of production, development, and even survival, how can data integration help enterprises implement data services across the whole life cycle? On May 14, the data integration framework Apache SeaTunnel (Incubating) will invite technical experts and open source contributors from the one-stop data integration platform Apache InLong (Incubating) to the live studio to talk with you about their practical experience with Apache SeaTunnel (Incubating) and Apache InLong (Incubating).

Affected by the epidemic, this event will still be held as an online live broadcast. Free registration is now open; please scan the QR code below or click "Read the original" to register!

Live link: https://www.slidestalk.com/m/777