Taken from open source, given back to open source: an in-depth analysis of Alibaba's optimizations and improvements to Apache Flink


Apache Flink overview

Apache Flink (hereafter Flink) is a big data project that grew out of Stratosphere, a European research project at the Technical University of Berlin that initially focused on batch computing. In 2014, core members of the Stratosphere project created Flink and donated it to Apache in the same year. Flink later became a top-level Apache big data project. At the same time, its main direction was set as stream computing, that is, using stream computing as the basis for all big data computing. This is the background to the birth of Flink.

In 2014, Flink began to emerge in the open source big data community as an engine focused on stream computing. Unlike Storm, Spark Streaming and other streaming engines, it is not only a high-throughput, low-latency computing engine but also provides many advanced features: stateful computation with state management, strongly consistent data semantics, event time, and watermarks for handling out-of-order messages.
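
To make the idea of event time and watermarks concrete, here is a minimal sketch in plain Python. It is not Flink's API: the function name `tumbling_counts`, the bounded out-of-orderness rule (watermark = largest timestamp seen minus a fixed delay) and the choice to drop late events are assumptions made for illustration.

```python
def tumbling_counts(timestamps, window_size, max_out_of_orderness):
    """Count events per event-time tumbling window.

    timestamps: event timestamps in ARRIVAL order (possibly out of order).
    A window [start, start + window_size) fires once the watermark passes its end.
    Returns (fired, still_open): fired is a list of (window_start, count).
    """
    open_windows = {}
    fired = []
    watermark = float("-inf")
    for ts in timestamps:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            continue  # window already fired; this sketch simply drops late events
        open_windows[start] = open_windows.get(start, 0) + 1
        # Assumed rule: the watermark trails the largest timestamp by a fixed delay.
        watermark = max(watermark, ts - max_out_of_orderness)
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in ready:
            fired.append((s, open_windows.pop(s)))
    return fired, open_windows


# The event with timestamp 9 arrives after 11, yet is still counted in window [0, 10).
fired, still_open = tumbling_counts([1, 4, 3, 11, 9, 12, 25], 10, 2)
```

The delay trades latency for completeness: a larger `max_out_of_orderness` tolerates more disorder but makes each window fire later.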

Flink’s popularity also owes much to the qualities it is known for: excellent performance (especially in stream computing), high scalability and fault tolerance. It is a pure in-memory computing engine with heavily optimized memory management. It also supports event-time processing, jobs with very large state (at Alibaba it is common for the state of a single job to exceed a terabyte) and exactly-once processing.

Alibaba and Flink

With the arrival of the AI era and the explosion of data volumes, the most common approach in typical big data scenarios is to use batch processing for full data sets and stream computing for real-time incremental data. In most business scenarios, the user’s business logic is the same for batch and stream processing, yet the two run on different computing engines.

Users therefore usually have to write two sets of code, which undoubtedly adds burden and cost. Alibaba’s commodity data processing, for example, often has to maintain two separate pipelines, one incremental and one full. Alibaba therefore asked whether there could be a single, unified big data engine, so that users would write only one set of code for their business logic and have it cover every scenario: full data, incremental data and real-time processing. This was the background to, and the original motivation for, Alibaba’s choice of Flink.

The Flink-based platform at Alibaba officially went live in 2016, starting with Alibaba’s search and recommendation scenarios. Today all Alibaba businesses, including all Alibaba subsidiaries, use the Flink-based real-time computing platform. The platform runs on open source Hadoop clusters: Hadoop YARN handles resource management and scheduling, and HDFS provides data storage. Flink therefore integrates seamlessly with the open source Hadoop ecosystem.

The Flink-based real-time computing platform now serves not only Alibaba Group but also the wider developer ecosystem, as a Flink-based cloud product offered through Alibaba Cloud’s product APIs.

When Alibaba adopted it, Flink had not yet been proven in production in terms of scale or stability, and its maturity was open to question. Alibaba’s real-time computing team therefore created an internal Flink branch, Blink, and made many modifications and improvements to adapt Flink to Alibaba’s very large-scale business scenarios. In the process, the team improved not only Flink’s performance and stability but also its core architecture and features, and these changes are gradually being contributed back to the community. Examples include Flink’s new distributed architecture, the incremental checkpoint mechanism, credit-based network flow control and streaming SQL. The following sections look at Alibaba’s optimizations of Flink from two angles: the SQL layer and the runtime layer.

Taken from open source, given back to open source

1. SQL layer

For users to truly write one set of code for their business logic and run it in many different scenarios, Flink first needs to offer a unified API. After some investigation, Alibaba’s real-time computing team concluded that SQL is a very suitable choice. In batch processing, SQL has been tested for decades and is recognized as a classic. In stream computing, theories such as stream-table duality and the stream as the changelog of a table have emerged in recent years. Building on these theories, Alibaba proposed the concept of the dynamic table, so that stream computing can be described in SQL just like batch processing, with logically equivalent results. Users can thus describe their business logic in SQL, and the same query can run as a batch job or as a high-throughput, low-latency streaming job. It can even process historical data with batch techniques and then automatically switch to a streaming job to handle the latest real-time data. A declarative API of this kind gives the engine more choices and more room for optimization. Next, we introduce some of the more important optimizations.
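
Before turning to those optimizations, the stream-table duality behind dynamic tables can be made concrete with a small sketch in plain Python. It is not Flink SQL or Blink code: the function names and the ("+" / "-") changelog encoding are assumptions made for illustration. A stream of changes is folded into a table, and a count-per-key query over that stream gives the same answer as a batch count over the final rows.

```python
from collections import Counter


def apply_changelog(changelog):
    """Fold a changelog stream of (op, key) records into a count-per-key table.

    op is "+" for an insert and "-" for a retraction (hypothetical encoding).
    """
    table = Counter()
    for op, key in changelog:
        table[key] += 1 if op == "+" else -1
        if table[key] == 0:
            del table[key]  # a fully retracted key disappears from the table
    return dict(table)


def batch_count(rows):
    """The same query evaluated as a batch job over a finite set of rows."""
    return dict(Counter(rows))


# Stream view: records arrive one by one, and one of them is later retracted.
changelog = [("+", "alice"), ("+", "bob"), ("+", "alice"), ("-", "bob"), ("+", "carol")]
# Batch view: the rows that remain once every change has been applied.
final_rows = ["alice", "alice", "carol"]

stream_result = apply_changelog(changelog)
batch_result = batch_count(final_rows)
```

Both results are the same table, which is the sense in which a single SQL query can be read either as a batch job or as a continuously updated streaming job.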

The first is the upgrade and replacement of the SQL layer’s technical architecture. Developers who have evaluated or used Flink will know that it has two basic APIs, DataStream and DataSet. The DataStream API is for stream processing users and the DataSet API is for batch processing users, but the two have completely different execution paths and even generate different kinds of tasks. After a series of optimization steps, Flink’s native SQL layer calls either the DataSet or the DataStream API, depending on whether the user wants batch or stream processing. As a result, users face two almost completely independent technology stacks in daily development and tuning, much work has to be done twice, and an optimization made on one stack is often missing from the other. Alibaba therefore proposed a new query processor for the SQL layer. It consists mainly of an optimization layer (the query optimizer) that is shared between stream and batch as far as possible, and an operator layer (the query executor) built on a common interface. In this way more than 80% of the work, such as common optimization rules and basic data structures, can be reused on both sides, while stream and batch each retain their own specific optimizations and operators to suit their different job behaviors.

Once the SQL layer’s architecture was unified, Alibaba looked for a more efficient basic data structure to speed up Blink’s execution in the SQL layer. Native Flink SQL uniformly uses a data structure called Row, which represents a row of a relational table as a collection of Java objects. If a row consists of an integer, a floating-point number and a string, the Row contains a Java Integer, a Double and a String. These Java objects carry considerable extra overhead on the heap, and accessing the data introduces unnecessary boxing and unboxing. To address these problems, Alibaba proposed a new data structure, BinaryRow. Like Row, it represents a row of relational data, but it stores the data entirely in binary form. In the example above, the three fields of different types are all stored in a single Java byte[]. This has several benefits:

  • First, in terms of storage space, a great deal of unnecessary overhead is removed, so objects are stored more compactly;
  • Second, when data goes over the network or into state storage, much unnecessary serialization and deserialization can be skipped;
  • Finally, with the unnecessary boxing and unboxing removed, the execution code as a whole is friendlier to the garbage collector.

With this efficient basic data structure in place, the execution efficiency of the whole SQL layer more than doubled.
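
The idea of a binary row can be illustrated with a minimal sketch in plain Python, using the same three fields as the example above (an integer, a floating-point number and a string). The layout below is an assumption chosen for illustration and is not Blink's actual BinaryRow format: a fixed-width header holds the int, the double and the (offset, length) of the string, followed by the string's UTF-8 bytes.

```python
import struct

# Assumed layout, little-endian, no padding:
#   bytes 0-3   int32 field
#   bytes 4-11  float64 field
#   bytes 12-15 offset of the string payload
#   bytes 16-19 length of the string payload
_HEADER = struct.Struct("<idii")


def pack_row(i, d, s):
    """Encode (int, double, string) into one contiguous bytes object."""
    payload = s.encode("utf-8")
    return _HEADER.pack(i, d, _HEADER.size, len(payload)) + payload


def read_int(row):
    """Read the int field directly from its fixed offset."""
    return struct.unpack_from("<i", row, 0)[0]


def read_double(row):
    """Read the double field directly from its fixed offset."""
    return struct.unpack_from("<d", row, 4)[0]


def read_string(row):
    """Follow the (offset, length) pair to the variable-length payload."""
    offset, length = struct.unpack_from("<ii", row, 12)
    return row[offset:offset + length].decode("utf-8")


row = pack_row(42, 3.5, "flink")
```

Because the row is already a flat byte sequence, it could be written to the network or to state storage as-is, and a single field can be read without materializing the others.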

At the operator implementation level, Alibaba introduced code generation on a much wider scale. Thanks to the unified architecture and basic data structure, much of the code generation work can be reused more broadly. And because SQL is strongly typed, the types of the data an operator will process are known in advance, so more targeted and efficient execution code can be generated. In native Flink SQL, code generation is applied only to simple expressions such as a > 2 or c + d. After Alibaba’s optimization, some operators, such as sort and aggregation, are generated as a whole. This allows more flexible control over an operator’s logic, and the final code can be embedded directly into the class, eliminating expensive function call overhead. Some basic data structures and algorithms that use code generation, such as the sorting algorithm and a HashMap based on binary data, can also be shared between stream and batch operators, so users really do benefit from the unified technology and architecture: when a data structure or algorithm is optimized for a batch scenario, stream computing performance can improve as well. Next, let’s look at the substantial improvements Alibaba made to Flink at the runtime layer.
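
Before moving on to the runtime layer, the expression-level code generation described above can be sketched in plain Python. This is a toy illustration only, not Flink's or Blink's code generator: the function name `generate_filter` and the schema format are assumptions, and the predicate text is passed to `exec`, so it must come from a trusted source.

```python
def generate_filter(schema, predicate_src):
    """Generate a specialized filter function for a predicate such as "a > 2".

    schema maps each column name to its position in the row tuple; because the
    schema is known up front, column access is resolved once, at generation time.
    """
    bindings = "\n".join(
        f"    {name} = row[{pos}]" for name, pos in schema.items()
    )
    source = (
        "def generated_filter(row):\n"
        f"{bindings}\n"
        f"    return {predicate_src}\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated_filter"], source


schema = {"a": 0, "c": 1, "d": 2}
keep, generated_source = generate_filter(schema, "a > 2 and c + d < 10")

rows = [(1, 1, 1), (3, 4, 5), (5, 6, 7)]
kept = [r for r in rows if keep(r)]
```

The generated function contains only positional accesses and the predicate itself, with no per-row interpretation of an expression tree. Generating a whole operator, as the text describes for sort and aggregation, extends the same idea to larger units of code.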

2. Runtime layer

As expected, the real-time computing team ran into many challenges in getting Flink to take root in Alibaba’s large-scale production environment. The first and most pressing was how to integrate Flink with other cluster management systems. Flink’s native cluster management was immature, and it could not natively use other, more mature cluster management systems. This raised a series of thorny questions: how to coordinate resources among multiple tenants, how to request and release resources dynamically, and how to specify different resource types.

After a great deal of investigation and analysis, the team settled on reworking Flink’s resource scheduling so that Flink could run on a YARN cluster, and on restructuring the master so that each job has its own master. From then on, the master was no longer the bottleneck of the cluster. Taking this opportunity, Alibaba and the community jointly introduced the new FLIP-6 architecture, which makes Flink’s resource management pluggable and lays a solid foundation for Flink’s sustainable development. Today Flink runs seamlessly on YARN, Mesos and Kubernetes, which shows how important this architecture is.

With large-scale deployment of Flink clusters solved, the next concern was reliability and stability. To ensure high availability in production, Alibaba focused on improving Flink’s failover mechanisms. The first is master failover: in native Flink a master failover restarts all jobs, whereas after the improvement a master failover no longer affects running jobs. The second is region-based task failover, which minimizes the impact of any single task failure on users. With these improvements in place, a large number of Alibaba’s business teams began migrating their real-time computing to Flink.
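
The intent of region-based task failover can be sketched in plain Python. This is a simplified illustration and not Flink's scheduler code: it assumes that a region is the set of tasks connected to one another by pipelined data exchanges, and that blocking exchanges, whose output is fully materialized, separate regions. The task names and the function `failover_region` are hypothetical.

```python
from collections import defaultdict


def failover_region(pipelined_edges, failed_task):
    """Return the set of tasks to restart when failed_task fails.

    pipelined_edges: (upstream, downstream) pairs linked by pipelined exchanges.
    Tasks reachable from the failed task through such edges, in either
    direction, form its region; everything else keeps running.
    """
    neighbours = defaultdict(set)
    for upstream, downstream in pipelined_edges:
        neighbours[upstream].add(downstream)
        neighbours[downstream].add(upstream)

    region = {failed_task}
    stack = [failed_task]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in region:
                region.add(nxt)
                stack.append(nxt)
    return region


# Two parallel pipelines; their edges into "sink" are assumed to be blocking
# exchanges and are therefore not listed here.
edges = [("source-1", "map-1"), ("source-2", "map-2")]
to_restart = failover_region(edges, "map-1")
```

Only `source-1` and `map-1` are restarted; the second pipeline and the sink are left untouched, instead of the whole job being restarted.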

Stateful streaming is Flink’s biggest highlight. Its checkpoint mechanism, based on the Chandy-Lamport algorithm, gives Flink exactly-once consistency. In early versions of Flink, however, checkpoint performance became a bottleneck with large amounts of data. Alibaba made many improvements to checkpointing, such as:

  • Incremental checkpoint mechanism: in Alibaba’s production environment, large jobs with tens of terabytes of state are common, and taking a full checkpoint of such a job is very expensive. Alibaba therefore developed an incremental checkpoint mechanism, which turned checkpointing from a sudden downpour into a steady trickle;
  • Checkpoint small file merging: this too is a problem of scale. As the number of Flink jobs in the cluster grew, so did the number of checkpoint files, until the load overwhelmed the HDFS NameNode. By merging many small checkpoint files into one large file, Alibaba reduced the pressure on the NameNode by a factor of several dozen.
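
The principle behind the incremental checkpoint can be shown with a toy model in plain Python. It only illustrates the idea of persisting what changed since the previous checkpoint and is not how Flink implements it; the class `IncrementalState` and the function `restore` are hypothetical, and deletions are left out for brevity.

```python
class IncrementalState:
    """Key-value state that remembers which keys changed since the last checkpoint."""

    def __init__(self):
        self._data = {}
        self._dirty = set()

    def put(self, key, value):
        self._data[key] = value
        self._dirty.add(key)

    def checkpoint(self):
        """Return only the entries modified since the previous checkpoint."""
        delta = {key: self._data[key] for key in self._dirty}
        self._dirty.clear()
        return delta


def restore(deltas):
    """Rebuild the full state by replaying the deltas from oldest to newest."""
    state = {}
    for delta in deltas:
        state.update(delta)
    return state


state = IncrementalState()
state.put("a", 1)
state.put("b", 2)
cp1 = state.checkpoint()   # first checkpoint contains everything written so far
state.put("b", 3)
cp2 = state.checkpoint()   # second checkpoint contains only the changed key
recovered = restore([cp1, cp2])
```

With terabytes of state and only a small fraction changing between checkpoints, each checkpoint writes a small delta instead of the full state, at the cost of having to combine several deltas on recovery.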

Although all data can be kept in state, for historical reasons users still have some data in external key-value stores such as HBase and need to access it from Flink jobs. Because Flink has always used a synchronous, single-threaded processing model, the latency of accessing external data became the bottleneck of the whole system. Asynchronous access is the obvious way to solve this, but it is not easy for users to write multithreaded code inside a UDF while preserving exactly-once semantics. Alibaba therefore proposed AsyncOperator for Flink, which makes writing asynchronous calls in a Flink job as simple as writing “Hello World” and has greatly improved the throughput of Flink jobs.
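
Why asynchronous access helps can be seen in a minimal sketch using Python's asyncio. It is not Flink's async operator: `lookup` is a hypothetical stand-in for a remote key-value read (it only sleeps), and the `capacity` parameter is an assumption. Many lookups are in flight at once, yet results are emitted in input order.

```python
import asyncio


async def lookup(key):
    """Stand-in for a remote key-value read (hypothetical; sleeps instead of doing I/O)."""
    await asyncio.sleep(0.01)
    return f"profile-of-{key}"


async def enrich(records, capacity=100):
    """Enrich each record with a looked-up value.

    At most `capacity` lookups are in flight at a time; asyncio.gather returns
    the results in the same order as the input records.
    """
    limit = asyncio.Semaphore(capacity)

    async def enrich_one(record):
        async with limit:
            return record, await lookup(record)

    return await asyncio.gather(*(enrich_one(r) for r in records))


results = asyncio.run(enrich(["u1", "u2", "u3"], capacity=2))
```

With synchronous access the total latency is the sum of all lookups; here the waits overlap, so throughput is bounded by `capacity` rather than by the latency of a single request. The sketch does not address checkpointing of in-flight requests, which a real operator needs in order to keep exactly-once semantics.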

Flink is designed as a unified batch and stream computing engine. Having experienced its lightning-fast stream computing, batch users have also become interested in moving over to Flink. Batch computing brings new challenges, however. The first is task scheduling: Alibaba introduced a more flexible scheduling mechanism that schedules more efficiently based on the dependencies between tasks. The second is data shuffle. Flink’s native shuffle service is bound to the TaskManager (TM), so the TM cannot release its resources after its tasks finish. In addition, the original batch shuffle does not merge files, which makes it almost unusable in production. Alibaba solved both problems while developing a YARN shuffle service. In doing so, it found that building a new shuffle service was very inconvenient, because it required intrusive changes in many places in Flink’s code. To let other developers add different shuffle implementations easily, Alibaba also reworked Flink’s shuffle architecture to make it pluggable. Alibaba’s search business already runs Flink batch jobs in production.

After more than three years of polishing, Blink has begun to thrive at Alibaba, but runtime optimization never ends, and a large wave of further improvements is on the way.

Flink’s future direction

Flink is already a mainstream stream computing engine. The community’s next major task is to make a breakthrough in batch computing, land in more scenarios and become a mainstream batch computing engine as well. Beyond that, it will work on seamless switching between stream and batch, so that the boundary between the two becomes increasingly blurred and a single Flink computation can contain both stream and batch processing.

Next, Alibaba will work to extend Flink’s ecosystem to more languages: not only Java and Scala, but also Python and Go, which are used in machine learning.

AI also has to be mentioned, because much of today’s big data computing demand and data volume serves popular AI scenarios. While improving the stream and batch ecosystem, Flink will therefore continue to improve its upper-layer machine learning algorithm library, and it will also integrate with more mature machine learning and deep learning frameworks. For example, TensorFlow on Flink can be used to integrate big data ETL processing with machine learning feature computation and training, so that developers can enjoy the benefits of several ecosystems at once.

Finally, in terms of ecosystem and community activity, Alibaba is preparing the first Flink Forward China summit, an event on the scale of a thousand attendees, to be held at the National Convention Center on December 20 and 21, 2018. Participants will have the opportunity to learn why Alibaba, Tencent, Huawei, Didi, Meituan, ByteDance and other companies have chosen Flink as their preferred stream processing engine.