Fuxi 2.0 – the Feitian big data platform scheduling system upgrade that debuted at the 2019 Double 11

Time: 2020-10-24

Fuxi is one of the three major services (distributed storage Pangu, distributed computing ODPS, distributed scheduling Fuxi) from when the Feitian platform was founded ten years ago. The original design intention was to solve the scheduling of large-scale distributed resources (essentially an optimal multi-objective matching problem).

With the continuous enrichment of the business requirements of the Alibaba economy and Alibaba Cloud (especially Double 11), the scope of Fuxi has kept expanding: from a single resource scheduler (comparable to YARN in open-source systems) to the core scheduling service of the big data platform, covering data placement, resource management, application scheduling (Application Manager), and local micro (autonomous) scheduling. In each sub-area we are committed to building differentiated capabilities beyond the industry mainstream.

Over the past ten years, Fuxi has made new technical progress and breakthroughs every year, including 5K in 2013, the Sort Benchmark world championship in 2015, ultra-large-scale online/offline co-location capability in 2017, and the release of Yugong in 2019 with its paper accepted by VLDB 2019. With the debut of Fuxi 2.0 at the 2019 Double 11, the Feitian big data platform successfully met its goals this year in both co-location support and baseline support. In co-location, it supported 60% of the peak online transaction traffic on Double 11, and ultra-large-scale hybrid scheduling performed as expected. In baseline support, 970 PB of data were processed in a single day, an increase of more than 60% over last year. Across nearly ten million jobs, no additional user tuning was required, essentially achieving system automation without human intervention.

New challenges

With the continuous, high-speed growth of business and data, MaxCompute's Double 11 workload and computed data volume grow by more than 60% per year.
For Double 11 2019, MaxCompute's daily computed data volume approached the EB level, and the number of jobs reached ten million. At such a scale and with tight resources, there was considerable pressure to keep Double 11 running stably and to complete all important baseline jobs on time.


In the unique promotion scenario of Double 11, the challenges of 2019 mainly came from the following aspects:

  1. How to further improve overall platform performance to cope with sustained high-speed business growth under ultra-large-scale computing and tight resources.
  2. The Double 11 promotion subjects MaxCompute to extreme, all-round overload scenarios, such as hot keys occurring hundreds of millions of times and data expansion by factors of thousands. These challenge the stability of cluster disk IO, the read/write performance of data files, long-tail job reruns, and so on.
  3. At a scale of nearly ten million jobs, how to achieve agile, reliable, and efficient distributed job scheduling and execution.
  4. Resource guarantee measures for high-priority jobs (such as important business baselines).
  5. This year was also the first time cloud clusters participated in the Double 11 campaign and began to support co-location (hybrid deployment).

How to deal with challenges

To meet the above challenges, in addition to routine adjustments such as HBO, this year the Feitian big data platform accelerated the rollout of the technical achievements accumulated over the past one to two years, above all the debut of Fuxi 2.0 at Double 11. In the end, under the pressure of nearly ten million tasks per day and nearly 1000 PB of computation per day, all baselines were produced on time.

  • Platform performance optimization (challenges 1 and 2): StreamlineX + Shuffle Service automatically and intelligently matches efficient processing modes and algorithms to the characteristics of the actual data, exploits hardware features, and deeply optimizes IO, memory, and CPU efficiency while reducing resource usage. It increased the average processing speed of all SQL jobs by nearly 20% and reduced the error-retry rate to one tenth of its previous level, greatly improving the overall efficiency of the MaxCompute platform.
  • Distributed job scheduling and execution (challenge 3): DAG 2.0 provides more agile scheduling and execution and removes blocking across the whole pipeline, bringing nearly 50% performance improvement to large-scale MR jobs. The upgraded dynamic DAG framework also makes the scheduling and execution of distributed jobs more flexible, dynamically adjusting the execution process according to the characteristics of the data.
  • Resource guarantees (challenge 4): Fuxi adopted stricter and finer-grained resource guarantee measures for high-priority jobs (mainly baseline jobs), such as the interactive preemption function of resource scheduling and job-priority guarantee controls. Currently the highest-priority jobs online can seize resources within 90 seconds.
  • Others, such as business tuning support, including business data pressure testing and job tuning.

StreamlineX + Shuffle Service

Challenge

As mentioned above, this year's Double 11 data volume approached the EB level with nearly ten million jobs, and overall resources were relatively tight. Analysis of past experience shows that the module most critical to Double 11 is Streamline (called shuffle or exchange in other data-processing engines), where extreme scenarios emerge endlessly: tasks with more than 50,000-way concurrency, hot keys occurring hundreds of millions of times, single workers whose data expands by factors of thousands, and so on. These greatly affect the stable operation of the Streamline module and hence the stability of cluster disk IO and the read/write performance of data files. If any such situation is not handled automatically and in time, it can cause baseline jobs to miss their deadlines.

Overview of streamline and shuffle service

  • Streamline

In other OLAP or MPP systems there are similar components called shuffle or exchange. In MaxCompute SQL this component has richer functionality and better performance, covering (but not limited to) data serialization, compression, read/write transmission, grouping and merging, and sorting between distributed tasks. The distributed implementations of most time-consuming SQL operators, such as join, groupby, and window, depend on this module, so it is unquestionably a heavy consumer of CPU, memory, and IO. In most jobs its running time accounts for more than 30% of total SQL running time, and in some large-scale jobs for more than 60%. For a service like MaxCompute SQL, with nearly ten million tasks per day and close to an EB of data processed daily, every performance improvement of one percentage point saves more than 1,000 machines. Continuously restructuring and optimizing this component has therefore always been a top priority of the MaxCompute SQL team's performance work. The SLX applied in this year's Double 11 is a completely rewritten high-performance Streamline architecture.
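
To make the role of this component concrete, here is a minimal, hedged sketch (purely illustrative, not MaxCompute's implementation) of the contract a streamline/shuffle stage provides to a distributed groupby: route every record to the downstream task that owns its key, so that a local aggregation becomes globally correct. All function names below are hypothetical.

```python
# Minimal sketch of why a distributed group-by needs a shuffle stage.
# Hypothetical names; not MaxCompute's actual implementation.
from collections import defaultdict

def shuffle_write(records, num_reducers):
    """Writer side: route every record to the reducer that owns its key."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

def shuffle_read_groupby(buckets, reducer_id):
    """Reader side: all rows for a key arrive at one reducer, so a local
    aggregation is a correct global aggregation for those keys."""
    totals = defaultdict(int)
    for key, value in buckets.get(reducer_id, []):
        totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    buckets = shuffle_write(rows, num_reducers=2)
    for r in range(2):
        print(r, shuffle_read_groupby(buckets, r))
```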

  • Shuffle Service 

In MaxCompute SQL, Shuffle Service manages the underlying transmission mode and physical distribution of the shuffle data of all jobs in the cluster. Thousands of workers scheduled onto different machines depend on accurate data transmission to complete a job cooperatively. For a big data service like MaxCompute, the efficiency and stability of shuffling hundreds of petabytes of data per day between tens of millions of workers on 100,000 machines determines the overall performance and resource efficiency of the cluster. Shuffle Service abandons the mainstream disk-file-based shuffle storage mode, breaking through the performance and stability limitations of random file reads on mechanical disks. Based on the idea of dynamically scheduling shuffle data, the shuffle process becomes a real-time, dynamically optimized decision about data flow, arrangement, and reading during job execution. By decoupling upstream and downstream scheduling in the DAG, it matches the performance of network shuffle while reducing resource consumption by 50%+.

Key technologies of streamline + shuffle service

  • StreamlineX (SLX) architecture and optimization design

The logical function architecture of SLX is shown in the figure below. It mainly covers the reconstruction of data-processing logic at the SQL runtime level, including optimization of data-processing modes and algorithm performance.

In addition, it combines with the newly designed Fuxi ShuffleService, which is responsible for data transmission, to optimize IO read/write and worker fault tolerance, so that SLX delivers good performance and runs efficiently and stably across various data patterns and data scales.

[Figure: SLX logical function architecture]

The SQL runtime part of SLX mainly consists of two parts, a writer and a reader. The following briefly introduces some of their optimization design:

  1. Reasonable framework division: the runtime Streamline and the Fuxi SDK are decoupled; the runtime handles data-processing logic while the Fuxi SDK handles underlying data transmission. Each can be extended and evolved independently.
  2. GraySort mode support: the Streamline writer only groups data and does not sort it, so its simple logic saves memory-copy overhead and other time-consuming operations, while the reader sorts the full data. The overall processing pipeline is more efficient and performance improves significantly (see the sketch after this list).
  3. Adaptive mode support: the Streamline reader can switch between non-sorting and sorting modes to support certain adaptive operators, with no extra IO overhead and a low fallback cost. The optimization effect in adaptive scenarios is remarkable.
  4. CPU efficiency optimization: the data structures and algorithms of time-consuming computing modules were redesigned to be CPU-cache friendly, improving computing efficiency by reducing cache misses, function-call overhead, and CPU cache thrashing, and by raising effective cache utilization.
  5. IO optimization: multiple compression algorithms and adaptive compression are supported, and the storage format of shuffled data was redesigned, effectively reducing the amount of IO transferred.
  6. Memory optimization: memory allocation for the Streamline writer and reader is more reasonable; memory is allocated on demand according to the actual data volume, minimizing potential dump operations.
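
The following minimal sketch illustrates the GraySort-style division of labor from item 2, under our simplifying assumption (this is not the SLX code) that the writer only buckets records by hash while the reader performs the single full sort. Names are hypothetical.

```python
# GraySort-style split: group-only writer, sorting reader. Illustrative only.
from collections import defaultdict

def slx_writer(records, num_reducers):
    """Group-only writer: one hash per record and no comparison sort,
    so the map side avoids sort-run memory copies entirely."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[hash(rec[0]) % num_reducers].append(rec)
    return buckets

def slx_reader(fragments):
    """Reader concatenates the fragments it receives and sorts once.
    (If fragments arrived pre-sorted, a k-way merge would do instead.)"""
    merged = [rec for frag in fragments for rec in frag]
    merged.sort(key=lambda rec: rec[0])
    return merged

if __name__ == "__main__":
    data = [("k3", 1), ("k1", 2), ("k2", 3), ("k1", 4)]
    buckets = slx_writer(data, num_reducers=2)
    for rid in sorted(buckets):
        print(rid, slx_reader([buckets[rid]]))
```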

Experience from past Double 11s shows that the volume of baseline task data increases sharply on November 12, putting great pressure on the shuffle process, which is usually the main cause of baseline delays that day. The main problems of traditional disk-based shuffle are as follows:

  • Fragmented reads: for a typical 2K × 1K shuffle pipe, when each upstream mapper processes 256 MB of data, the average amount a mapper writes for a single reducer is only 256 KB. Reading less than 256 KB per request from an HDD is uneconomical: high IOPS demand and low throughput severely hurt job performance (see the arithmetic sketch after this list).
  • Stability: severe fragmented reads on HDDs cause a high error rate in the reduce input stage, triggering upstream reruns to regenerate shuffle data, which can multiply job execution time.
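
A back-of-envelope check of the fragmented-read numbers above (illustrative arithmetic only):

```python
mappers, reducers = 2048, 1024          # the "2K x 1K" shuffle pipe
bytes_per_mapper = 256 * 1024 * 1024    # each mapper processes 256 MB

# Data one mapper writes for one reducer:
fragment = bytes_per_mapper // reducers
print(fragment // 1024, "KB per fragment")       # -> 256 KB

# Each reducer must fetch one fragment from every mapper:
print(mappers, "random HDD reads per reducer")

# An HDD sustains only on the order of 100-200 random IOPS (rough figure),
# so thousands of small seeks per reducer mean high IOPS demand and low
# effective throughput: exactly the bottleneck agent merge (below) removes.
```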

Shuffle Service completely solves the above problems. The Double 11 test showed that the pressure bottleneck of the online clusters has shifted from disk to CPU. On November 11, baseline jobs ran smoothly and the clusters remained stable overall.


The main functions of Shuffle Service include:

  • Agent merge: thoroughly solves the fragmented-read problem of traditional disk shuffle (see the sketch after this list);
  • Flexible exception handling: more stable and efficient than traditional disk shuffle when the environment misbehaves;
  • Dynamic data scheduling: at run time, the most appropriate shuffle data source is selected for each task;
  • Memory & PreLaunch real-time alignment support: lower resource consumption while performance matches network shuffle.
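
Here is a hedged sketch of the agent merge idea: an agent aggregates the many small fragments destined for one reducer into a single contiguous buffer, so the reducer does one sequential read instead of thousands of seeks. Class and method names are hypothetical; the real Shuffle Service is far more involved.

```python
# Agent merge in miniature: coalesce per-reducer fragments so the reducer
# performs one sequential read. Illustrative names only.
from collections import defaultdict

class ShuffleAgent:
    def __init__(self):
        self._merged = defaultdict(bytearray)  # reducer_id -> contiguous buffer

    def accept(self, reducer_id: int, fragment: bytes) -> None:
        """Mappers push fragments; the agent appends them per reducer."""
        self._merged[reducer_id] += fragment

    def read(self, reducer_id: int) -> bytes:
        """A reducer does one sequential read instead of N random seeks."""
        return bytes(self._merged[reducer_id])

if __name__ == "__main__":
    agent = ShuffleAgent()
    for mapper in range(4):                    # 4 mappers, 2 reducers
        for reducer in range(2):
            agent.accept(reducer, f"m{mapper}->r{reducer};".encode())
    print(agent.read(0).decode())
```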

StreamlineX + Shuffle Service achievements on Double 11

To meet the above challenges and break through existing resource bottlenecks, the MaxCompute SQL team started a continuous extreme performance optimization project more than a year ago. One of its most critical parts is the StreamlineX (SLX) project, a complete reconstruction of the existing Streamline framework: a highly extensible architecture, intelligent matching of data-processing patterns, dynamic memory allocation and performance optimization, adaptive algorithm matching, CPU-cache-friendly data structure design, and, based on the Fuxi Shuffle Service, reconstructed and optimized data reading, writing, IO, and transmission. Before Double 11, more than 95% of daily internal SQL jobs and more than 90% of shuffle traffic already ran on SLX, and the transparent hot upgrade was completed with zero failures, zero fallbacks, and no user awareness. Beyond guaranteeing smooth online operation, the performance and stability of SLX exceeded the expected improvement, which was fully reflected on Double 11. Statistical analysis based on the full set of online high-priority baseline jobs shows:

  • Performance: the end-to-end running speed of all quasi-real-time SQL jobs increased by 15%+, and the throughput (GB/h) of the Streamline module across all offline jobs increased by 100%;
  • IO read/write stability: based on the Fuxi ShuffleService, the average effective IO read/write size across the cluster increased by 100%+ and disk pressure dropped by 20%+;
  • Long tail and fault tolerance: the probability of PVC occurring in job workers dropped to one tenth of its previous level;
  • Resource priority preemption: ShuffleService keeps shuffle data transmission for high-priority jobs 25% lower (faster) than for low-priority jobs;

It is precisely these beyond-expectation optimizations that let MaxCompute, with nearly ten million jobs on Double 11 spanning nearly 100,000 servers, save about 20% of computing power, and that intelligently match the optimal processing mode to all kinds of extreme scenarios, giving full confidence in the stable output of ultra-large-scale jobs as data volumes keep growing. The performance figures above are averages over the full set of high-priority jobs; in many large-data scenarios SLX is even more effective. For example, under very tight resources on the TPC-H and TPC-DS 10 TB test sets, end-to-end SQL running speed nearly doubled and the speed of the shuffle module doubled.

StreamlineX + Shuffle Service outlook

After this year's Double 11, the high-performance SLX framework is not an end but a beginning. We will continue to improve its functionality and polish its performance: introducing efficient sorting, encoding, and compression algorithms; adaptively matching all kinds of data distributions; intelligently selecting efficient data read/write and transmission modes for different data scales in combination with ShuffleService; optimizing range partition; controlling memory precisely; and deeply exploiting hardware performance in hot modules, continuously saving costs for the company and keeping the technology ahead of the industry.

DAG 2.0

Challenge

In the Double 11 promotion scenario, besides the data flood peak and a scale far beyond daily operation, the distribution and characteristics of the data also differ greatly from usual. This special scenario poses multiple challenges to the distributed job scheduling and execution framework:

  • Processing data at Double 11 scale, a single job can exceed several hundred thousand compute nodes with more than 10 billion physical edge connections. To keep scheduling agile at this scale, the overhead of the whole scheduling link must be reduced and scheduling must not block.
  • During the baseline period the cluster is extremely busy, and the network, disk, CPU, and memory of every machine are under far more pressure than usual, causing a large number of compute-node exceptions. The distributed scheduling framework must not only detect abnormal logical compute nodes in time and retry them most effectively, but also intelligently judge, isolate, and predict the physical machines likely to have problems, so that jobs still complete correctly under heavy cluster pressure.
  • Facing data with unusual characteristics, many everyday execution plans may not apply in the Double 11 scenario. The scheduling and execution framework must be intelligent enough to select a reasonable physical execution plan, and dynamic enough to make timely adjustments to every aspect of the job according to the characteristics of the actual data. Only then can heavy manual intervention and ad-hoc manual operations be avoided.

This year's Double 11 coincided with the architecture upgrade of the computing platform's core scheduling and execution framework: DAG 2.0, which was being fully rolled out online and which addresses the above challenges well.

Overview of DAG 2.0

The workflow of a modern distributed system is usually described by a DAG (directed acyclic graph). The DAG scheduling engine is the only component in a distributed system that must interact with almost every upstream and downstream component (resource management, machine management, compute engines, shuffle components, etc.), playing an important coordinating and managing role. As the foundation of the computing platform's upper-layer engines (MaxCompute, PAI, etc.), Fuxi's DAG component has supported millions of distributed jobs and hundreds of petabytes of data processing daily over the past decade. Against the background of ever-growing engine capabilities and job-type diversity, higher requirements are placed on the dynamism, flexibility, and stability of the DAG architecture. In this context the Fuxi team launched the DAG 2.0 architecture upgrade, introducing a new DAG engine, in both code and functionality, to better support the computing platform's development over the next decade.

The new architecture gives the DAG more agile scheduling and execution capability and brings a qualitative improvement in the dynamism and flexibility of distributed job execution. Closely combined with the upper-layer compute engine, it can adjust execution plans dynamically and precisely, providing a better guarantee for all kinds of large-scale jobs. For example, in the simplest MR job test, the agility of DAG 2.0 scheduling and its end-to-end removal of blocking bring nearly 50% performance improvement to large-scale MR jobs (100,000-way concurrency). For medium-sized jobs closer to online SQL workload characteristics (1 TB TPC-DS), the improved scheduling capability reduces end-to-end time by 20%+.
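
As an illustration of what barrier-free ("de-blocked") scheduling means, here is a minimal event-driven DAG scheduler sketch: a vertex is dispatched the moment its last dependency completes, with no stage-level barrier. This is our simplification, not the DAG 2.0 code.

```python
# Event-driven DAG dispatch without stage barriers. Illustrative only.
from collections import deque

def schedule(dag):
    """dag: {vertex: [downstream vertices]}; every vertex must be a key.
    Returns the dispatch order."""
    indegree = {v: 0 for v in dag}
    for outs in dag.values():
        for d in outs:
            indegree[d] += 1
    ready = deque(v for v, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        v = ready.popleft()       # dispatch immediately; no stage barrier
        order.append(v)
        for d in dag[v]:          # a completion event unblocks downstreams
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return order

if __name__ == "__main__":
    # M1/M2 are mappers feeding a join J, which feeds a reducer R
    print(schedule({"M1": ["J"], "M2": ["J"], "J": ["R"], "R": []}))
```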

The DAG 2.0 architecture design draws on ten years of experience supporting all kinds of computing tasks inside and outside the group. The system's real-time machine management framework, backup-instance strategy, and fault-tolerance mechanisms were considered and designed from the start, laying an important foundation for supporting a variety of real online cluster environments. Another challenge was that the 2.0 architecture had to go fully online at the volume of millions of daily distributed jobs, changing engines in flight. Since the beginning of fiscal year FY20, DAG 2.0 has been rolled out online; it now fully covers millions of jobs per day, including MaxCompute offline jobs and TensorFlow CPU/GPU jobs on the PAI platform. Through the joint efforts of the whole project team, it delivered a satisfying result on Double 11.

Key technologies of DAG 2.0

These online results are closely tied to DAG 2.0's many technical innovations. For reasons of space, this article mainly introduces two of them related to Double 11 operational stability.

  • Comprehensive error-handling capability

In a distributed environment with a huge number of machines, the probability of single-machine failure is high, so fault tolerance is a key capability of a scheduling system. To manage machine state better, discover faulty machines in advance, and proactively avoid them, DAG 2.0 improves machine error handling through complete machine state management:

[Figure: machine state management in DAG 2.0]

As shown in the figure above, DAG 2.0 divides machines into multiple states and triggers transitions between them according to a series of indicators. Depending on the health status implied by each state, it proactively avoids machines, migrates computing tasks, or proactively re-runs computing tasks, minimizing prolonged job run times caused by machine problems and even the possibility of job failure.
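
A minimal sketch of such threshold-driven machine state management follows. The state names, indicators, and thresholds are all illustrative assumptions, not Fuxi's actual ones.

```python
# Threshold-driven machine states and scheduler-side avoidance. Illustrative.
HEALTHY, SUSPECTED, BLACKLISTED = "healthy", "suspected", "blacklisted"

def next_state(state: str, error_rate: float, disk_ok: bool) -> str:
    """Transition on health indicators; recovered indicators move a
    machine back toward HEALTHY. (Current state kept for extensibility.)"""
    if not disk_ok or error_rate > 0.5:
        return BLACKLISTED            # isolate: schedule nothing here
    if error_rate > 0.1:
        return SUSPECTED              # avoid: prefer other machines
    return HEALTHY

def placement_penalty(state: str) -> float:
    """How strongly the scheduler avoids a machine in each state."""
    return {HEALTHY: 0.0, SUSPECTED: 10.0, BLACKLISTED: float("inf")}[state]

if __name__ == "__main__":
    s = HEALTHY
    for err, disk in [(0.05, True), (0.2, True), (0.02, True), (0.2, False)]:
        s = next_state(s, err, disk)
        print(s, placement_penalty(s))
```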

Separately, within a DAG, when a downstream task fails to read data, the upstream task must be rerun. When a serious machine problem occurs, the chained rerun of multiple tasks causes long job delays and seriously affects the timely output of baseline jobs. DAG 2.0 implements an active fault-tolerance strategy based on lineage backtracking (shown in the figure below). Such intelligent lineage backtracking avoids layer-by-layer trial and layer-by-layer rerun, and under heavy cluster pressure it effectively saves running time and avoids wasted resources.

[Figure: lineage-backtracking fault tolerance]
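
The following sketch shows the lineage-backtracking idea in miniature: on a read failure, walk the lineage upward in one pass to collect every ancestor whose output is lost, then rerun that whole set at once instead of discovering failures layer by layer. Names and structure are hypothetical.

```python
# One-pass lineage backtracking instead of layer-by-layer reruns.
def backtrack(failed_vertex, parents, output_alive):
    """parents: {vertex: [upstream vertices]};
    output_alive(v): is v's shuffle output still readable?"""
    to_rerun, stack = set(), [failed_vertex]
    while stack:
        v = stack.pop()
        if v in to_rerun:
            continue
        to_rerun.add(v)
        for p in parents.get(v, []):
            if not output_alive(p):   # lost upstream output: rerun it too
                stack.append(p)
    return to_rerun                   # the whole chain, decided in one pass

if __name__ == "__main__":
    parents = {"R": ["J"], "J": ["M1", "M2"], "M1": [], "M2": []}
    lost = {"J", "M1"}                # outputs lost with a faulty machine
    print(backtrack("R", parents, lambda v: v not in lost))
```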

  • Flexible dynamic logic-graph execution strategy: conditional join

In distributed SQL, map join is a common optimization. Instead of shuffling the small table, it broadcasts the small table's full data to every distributed compute node that processes the large table and completes the join by building a hash table directly in memory. Map join greatly reduces the extra shuffle and sort overhead and avoids the data skew that shuffling may introduce, improving job performance. But its limitation is equally obvious: if the small table does not fit into a single node's memory, the whole distributed job fails with an OOM while building the hash table. So although map join brings great performance gains when used correctly, in practice the optimizer must be conservative when generating map join plans, missing many optimization opportunities; and even so, map join OOMs cannot be completely avoided.
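
A minimal map (broadcast hash) join sketch matching the description above: the small table is held in an in-memory hash table and each node streams its slice of the big table through it. The memory-limit check marks where the OOM discussed above would occur. Purely illustrative.

```python
# Broadcast hash join on one node's slice of the big table. Illustrative.
from collections import defaultdict

def map_join(big_table_slice, small_table, mem_limit_rows=1_000_000):
    if len(small_table) > mem_limit_rows:
        # This is where the real job would OOM while building the table.
        raise MemoryError("small table does not fit in a single node")
    hash_table = defaultdict(list)
    for key, value in small_table:           # build side (broadcast copy)
        hash_table[key].append(value)
    for key, value in big_table_slice:       # probe side, no shuffle needed
        for v in hash_table.get(key, []):
            yield (key, value, v)

if __name__ == "__main__":
    small = [("a", "x"), ("b", "y")]
    big = [("a", 1), ("c", 2), ("b", 3)]
    print(list(map_join(big, small)))
```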

Based on DAG 2.0's dynamic logic-graph execution capability, MaxCompute developed the conditional join feature: when the join algorithm cannot be determined in advance, the optimizer is allowed to emit a conditional DAG containing execution-plan branches for the two different join methods. At execution time, the AM dynamically selects a branch (plan A or plan B) according to the actual amount of upstream output data. This dynamic logic-graph execution ensures that every run of the job selects the optimal execution plan for the actual data characteristics. See the figure below for details:

[Figure: conditional join with dynamically selected plan A / plan B]
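
And a hedged sketch of the runtime branch choice itself: the optimizer emits both plans, and the AM picks one from the actual build-side output size, so a wrong compile-time estimate can no longer cause an OOM. The threshold value is an invented placeholder.

```python
# Runtime selection between the two conditional-join branches. Illustrative.
def choose_join_branch(build_side_bytes, broadcast_limit=512 * 1024 * 1024):
    if build_side_bytes <= broadcast_limit:
        return "plan A: broadcast map join"       # fast path
    return "plan B: shuffle (sort-merge) join"    # safe fallback, no OOM risk

if __name__ == "__main__":
    for size in (64 * 1024 * 1024, 8 * 1024 ** 3):
        print(size, "->", choose_join_branch(size))
```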

Due to online rollout pacing, conditional join did not yet cover high-priority jobs during Double 11. During that period we saw map join hints fail on important baselines because of data expansion, requiring jobs to be temporarily adjusted after OOMs, and jobs that could not select the optimized execution plan because map join was not correctly chosen, delaying their completion. Once conditional join is online for the important baselines, such situations can be effectively avoided and baseline output will be smoother.

DAG 2.0 achievements on Double 11

As a test of all of Alibaba Group's technical lines, Double 11 was also an important test of DAG 2.0 as a brand-new component, and an important milestone in the DAG 2.0 online upgrade:

  • On Double 11, DAG 2.0 supported 80%+ of online projects; it has since reached full coverage, supporting millions of offline jobs per day. For baseline jobs with the same signature, the instance-running overhead under DAG 2.0 is reduced by 1x to 2x.
  • On Double 11, the high-priority baselines running on DAG 2.0 needed no manual intervention and saw no job failures or reruns. DAG 2.0's real-time machine management, backup-instance strategy, and other intelligent fault-tolerance mechanisms played an important role in this.
  • It supports nearly one million jobs per day in the development environment. Even with a larger average job size, the proportion of millisecond-level distributed jobs (execution time under 1 s) during Double 11 was 20% higher than under 1.0. The more efficient resource turnover of the new framework also markedly improved resource utilization: the proportion of online jobs that timed out waiting for resources and fell back decreased by nearly 50%.
  • DAG 2.0 also supports the PAI engine, strongly supporting model training for search, recommendation, and other services during Double 11. All TensorFlow CPU/GPU jobs on the PAI platform were migrated to DAG 2.0 before Double 11; its more effective fault tolerance and improved resource utilization ensured the timely output of models for every business line.

Beyond the performance dividend from better scheduling, DAG 2.0 has also made new breakthroughs in dynamic-graph highlight features, including dynamic parallelism, LIMIT optimization, and conditional join, which are completed or being rolled out. Conditional join ensures the optimized execution plan is chosen whenever possible, while guaranteeing that a wrong choice cannot cause an OOM job failure: runtime data statistics decide dynamically whether to use map join, making the decision more accurate. Before Double 11 we gray-released the condition node within the group, applying on average to 100,000+ nodes per day; the proportion of nodes applying map join exceeds 90%, with zero OOM occurrences. During the rollout we also received real feedback from users across business units: with conditional join, the optimal execution plan is selected, and the running time of jobs in several scenarios dropped from a few hours to under 30 minutes.

DAG 2.0 outlook

During the Double 11 shifts we still saw data skew and data expansion, driven by unusual data distributions, having a large impact on the overall completion time of distributed jobs. These problems can be well solved by DAG 2.0's complete dynamic-graph scheduling and execution capability, and the related features are being scheduled for release.

A typical example is the dynamic partition insert scenario. In one high-priority job, an important business table imported data directly via dynamic partitions, producing an excessive number of table files; subsequent baselines frequently accessed the table to read data, keeping Pangu master overloaded and the cluster in an unavailable state. After adopting DAG 2.0's adaptive shuffle feature, offline verification showed the job's running time dropped from 30+ hours to less than 30 minutes, and the number of files generated was an order of magnitude lower than with reshuffle turned off. While ensuring timely output of business data, this greatly relieves pressure on Pangu master. Dynamic partition scenarios are widely used in both internal and public cloud production; with the launch of adaptive shuffle, dynamic partition insert will be the first data skew scenario to be solved relatively thoroughly. DAG 2.0 also continues to explore the handling of other data skew, such as join skew. We believe that as more optimizations land on 2.0, our execution engine will become more dynamic and intelligent, and many online pain points, including data skew, will be solved better. Today's best performance is tomorrow's lowest requirement. We believe that for next year's Double 11, facing an even larger data volume, the computing platform's Double 11 support can be more automated, dynamically adjusting distributed jobs in flight with less manual intervention.

Interactive preemption in resource scheduling

Challenge

FuxiMaster is Fuxi's resource scheduler, responsible for allocating computing resources among different computing tasks. Given the diverse resource requirements of different applications in the MaxCompute context, over the past few years the resource scheduling team has pushed extreme performance optimization of the core scheduling logic, keeping scheduling delay at the level of tens of microseconds. The efficient flow of cluster resources has provided a strong guarantee for the stable operation of MaxCompute during the Double 11 campaigns of the past few years.

Among these, the on-time completion of high-priority baseline jobs is an important marker of Double 11 success and the top priority of resource guarantees. Besides allocating idle resources preferentially, it is also necessary to vacate resources occupied by low-priority jobs and quickly hand them to high-priority baseline jobs, without hurting overall cluster utilization.

Overview of interactive preemption

In a heavily loaded cluster, if high-priority jobs cannot obtain resources in time, the traditional approach is preemption: kill low-priority jobs outright and hand their resources to the high-priority jobs. Such "violent" preemption grabs resources quickly, but killing user jobs midway wastes the computation they have already done. Interactive preemption means that, while resources still flow from low priority to high priority, low-priority jobs are not killed immediately; instead, a protocol lets them finish quickly within an acceptable time (currently 90 seconds). This neither wastes the cluster's computation nor compromises the resource supply of high-priority jobs.
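
A minimal sketch of this protocol, with a simulated clock so it runs instantly: the scheduler asks a low-priority worker to wind down, and only force-kills it once the grace period (90 seconds online) expires. All names are hypothetical.

```python
# Interactive preemption with a grace period. Illustrative only.
import dataclasses

GRACE_SECONDS = 90  # current online grace period

@dataclasses.dataclass
class FakeWorker:
    """Stand-in for a low-priority worker that finishes after finish_at s."""
    finish_at: int
    t: int = 0
    def released(self) -> bool:
        return self.t >= self.finish_at
    def kill(self) -> None:
        print("force-killed; partial work wasted")

def interactive_preempt(worker: FakeWorker) -> str:
    # The revoke request is implicit here; we tick a simulated clock
    # instead of sleeping for real.
    for t in range(GRACE_SECONDS + 1):
        worker.t = t
        if worker.released():
            return f"released gracefully at t={t}s"   # no work lost
    worker.kill()
    return "killed at grace-period expiry"

if __name__ == "__main__":
    print(interactive_preempt(FakeWorker(finish_at=40)))   # cooperates in time
    print(interactive_preempt(FakeWorker(finish_at=200)))  # exceeds the grace period
```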

Currently, in-group interactive preemption for high-priority SUs (schedule units, the basic unit of resource management) can ensure the resource supply of baseline jobs in most cases. However, some jobs still cannot obtain resources in time even through interactive preemption. For example, high-priority interactive preemption triggers every 30 seconds, and the number of high-priority SUs processed per round is constrained by a global configuration; if a large number of other high-priority SUs have already been submitted in that window, a job's SU may come up empty. In addition, after the interactive preemption instruction is issued, resources are only returned once the corresponding instance finishes; if that instance runs for a very long time, interactive preemption cannot reclaim the resources in time. Based on these problems, we further optimized the interactive preemption strategy.

Key technologies of interactive preemption

To address the problems above, interactive preemption was optimized as follows (a combined sketch follows the list):

  • Through performance optimization, relax the limit on the number of SUs processed in each round of high-priority scheduling.
  • After an interactive preemption times out, forcibly reclaim the reserved low-priority resources. For low-priority jobs that started earlier, occupy large amounts of resources, and whose instances run for a long time, forced reclamation is necessary.
  • Supply high priority with resources beyond the reserved ones: keep allocating resources to the interactively preempting SU from other sources, offsetting the corresponding interactive preemption portion.
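
A combined sketch of the three optimizations above, under invented names and limits: each round processes more SUs, force-reclaims reservations whose revoke grace period has expired, and offsets pending preemptions with free resources elsewhere.

```python
# One high-priority scheduling round with the optimized strategy. Illustrative.
GRACE_SECONDS = 90

def schedule_round(pending_sus, reservations, free_slots, now,
                   per_round_limit=1000):
    granted = []
    for su in pending_sus[:per_round_limit]:          # relaxed per-round SU cap
        if free_slots > 0:                            # offset preemption with
            free_slots -= 1                           # free resources first
            granted.append((su, "from free pool"))
            continue
        for res in reservations:                      # force-reclaim reserved
            expired = now - res["revoke_sent_at"] >= GRACE_SECONDS
            if res["state"] == "pending" and expired:
                res["state"] = "force-reclaimed"
                granted.append((su, "reclaimed from " + res["holder"]))
                break
    return granted

if __name__ == "__main__":
    reservations = [{"holder": "low-pri-job", "revoke_sent_at": 0,
                     "state": "pending"}]
    print(schedule_round(["su-1", "su-2"], reservations,
                         free_slots=1, now=120))
```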

Achievements on the Double 11 front line

During Double 11 2019, facing far more data than ever before, all high-priority jobs were produced smoothly and on schedule. Resource scheduling guaranteed the baseline resource supply so smoothly that, throughout the baseline-support process, one could barely feel its presence. The interactive preemption and acceleration features provided effective resource guarantees for baseline jobs, seizing the required resources promptly and effectively. The following shows the resource supply of one cloud cluster.

  • Interactive preemption acceleration quickly provides available resources for baseline jobs

As the figures below show, during the baseline period (00:00 to 09:00) the frequency of interactive preemption revokes was significantly higher than at other times, meaning baseline jobs smoothly obtained the resources they needed through interactive preemption acceleration. Online operation during Double 11 also proved that under heavy resource pressure, high-priority baseline jobs obtained resources through interactive preemption revokes.

[Figure: frequency of interactive preemption revokes over the day]

  • Distribution of the time for baseline jobs' SUs to acquire resources

Below is the distribution of the time for SUs (Fuxi's basic scheduling unit) of the main clusters to acquire resources. The 90th-percentile time for these clusters is about one minute, which satisfies the online configuration of preempting after a baseline job has waited 90 seconds for resources.

[Figure: distribution of SU resource-acquisition time across main clusters]


Author: Jin Heng


This article is Alibaba Cloud content and may not be reproduced without permission.
