Author: Wang Feng
This article is shared by Wang Feng, founder of the Apache Flink Chinese community and head of the Real-Time Computing and Open Platform department of Alibaba Cloud's Computing Platform division. It introduces the current state and future roadmap of Flink as a unified stream-batch engine, organized as follows:
- 2020: a year of accelerated growth for the Apache Flink ecosystem
- Technological innovation: the core driving force of the Apache Flink community
- The status quo and future of Flink at Alibaba
2020: a year of accelerated growth for the Apache Flink ecosystem
1. Flink is the most active project in the Apache community
Let's first look at the Flink community's ecosystem trends in 2020. Overall, the community is developing in a very healthy and rapid way, and 2020 in particular produced excellent results. The Apache Software Foundation's fiscal year 2020 report shows several key data points:
- Flink ranked No. 1 in user and developer mailing-list activity
- Flink ranked No. 2 in number of commits on GitHub
- Flink ranked No. 2 in user visits on GitHub
Based on these figures, we can conclude that Flink is one of the most active open source projects in the Apache Software Foundation. The project's GitHub star count and number of contributors are also very encouraging: both have grown by more than 30% per year in recent years, reflecting the prosperity and rapid development of the entire Flink ecosystem.
2. Apache Flink annual release summary
Let's review the community's technical achievements in 2020. The Flink community released three major versions during the year: Flink 1.10, Flink 1.11 and, in December, the latest version, Flink 1.12. All three made great progress over Flink 1.9, the final release of the previous year.
In Flink 1.9 we completed the merge of the Blink code contribution into the Flink community, which officially launched Flink's unified stream-batch architecture. Over versions 1.10, 1.11 and 1.12 we then substantially upgraded that architecture and brought it to production. In Flink SQL development scenarios, we not only support unified stream-batch SQL, but also support CDC for reading database binlogs, and integrate with the new generation of data lake architectures. Because Flink is used more and more widely in AI scenarios, we have also invested heavily in Python support: PyFlink can now fully support Flink development. We have done a lot of work on the Kubernetes ecosystem as well.
After this year's three releases, Flink can run in a fully cloud-native way on Kubernetes, removing its dependence on Hadoop. In the future, Flink deployments based on the Kubernetes ecosystem can be better co-located with other online services.
3. The Apache Flink Chinese community continues to thrive
Next, let me share the development of the Flink Chinese community.
First, judging from the mailing lists, Flink may be the only Apache top-level project with a Chinese-language user mailing list. Apache is an international software foundation and basically uses English as its main communication channel, but since Flink is unprecedentedly active in China, we opened a Chinese mailing list as well. It is now even more active than the English one, making China the most active Flink region in the world.
Second, the community runs an official Chinese public account (shown on the left of the figure), which sends weekly digests, event announcements and best practices to developers as a window into community progress. More than 30,000 active developers have subscribed, and over 200 articles on Flink technology, ecosystem and practice were pushed during the year.
Some time ago we also launched the Flink community's official Chinese learning website (https://flink-learning.org.cn/), which we hope will help more developers learn Flink conveniently and understand its industry practice. The Flink community's DingTalk group also provides a platform for technical exchange; everyone is welcome to join.
4. Apache Flink has become the de facto standard for real-time computing
Flink is now the de facto standard for real-time computing: mainstream IT and technology-driven companies at home and abroad have all adopted Flink for real-time computing. Flink Forward Asia 2020 invited more than 40 first-class companies from home and abroad to share their Flink technology and practice, and we sincerely thank the speakers and experts from those companies. I believe that in the future, more companies across all industries will adopt Flink to solve their real-time data problems.
Technological innovation: the core driving force of Apache Flink community development
1. Kernel innovation in the stream computing engine
Next, I will introduce the technical progress of the Flink community in 2020. We believe technological innovation is the core driving force behind the sustainable development of open source projects and their communities. This part covers three directions, starting with some kernel-level innovations in Flink's stream computing engine.
Unaligned checkpoints: faster checkpoints under backpressure
The first example is unaligned checkpoints. Checkpointing, one of Flink's most fundamental mechanisms, continuously inserts barriers into the real-time data stream and takes periodic snapshots. In the existing checkpoint mode, barriers must be aligned across input channels, so under heavy backpressure or computational load a checkpoint may never complete. This year the community therefore implemented unaligned checkpoints, which complete much faster under backpressure.
Aligned and unaligned checkpoints can be switched automatically via an alignment timeout: a checkpoint starts in aligned mode, and switches to unaligned mode when backpressure delays alignment beyond the timeout.
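To make the difference concrete, here is a deliberately tiny Python model (my own illustration, not Flink's implementation): each input channel's barrier sits behind some number of queued records, and the two modes differ in how many records must drain before the operator can snapshot.

```python
def records_before_snapshot(barrier_positions, aligned):
    """Toy model of checkpoint barrier handling (not the Flink implementation).
    barrier_positions[i] = number of records queued ahead of the barrier
    on input channel i.
    aligned=True:  wait until the barrier has arrived on every channel,
                   so the slowest channel dictates the delay.
    aligned=False: snapshot on the first barrier; records still queued on
                   other channels are persisted with the checkpoint instead.
    """
    return max(barrier_positions) if aligned else min(barrier_positions)

# Channel 1 is backpressured: its barrier sits behind 500 records.
positions = [3, 500]
print(records_before_snapshot(positions, aligned=True))   # 500
print(records_before_snapshot(positions, aligned=False))  # 3
```

Under backpressure the aligned mode must wait for the slow channel, which is exactly why the real feature can unblock checkpoints by falling back to the unaligned mode.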
Approximate failover: a more flexible fault-tolerance mode
The second innovation is in fault tolerance. As we all know, Flink supports strong data consistency, but guaranteeing it trades away some availability: if any node fails, every node rolls back to the last checkpoint and the entire DAG restarts, briefly interrupting and rolling back the business. In many scenarios, however, strong consistency is unnecessary and losing a small amount of data is acceptable, for example statistics over sampled data, or feature computation in machine learning, where availability matters more than losing a single record.
So the community introduced a new, more flexible fault-tolerance mode: approximate failover. When a node fails, only that node is restarted and recovered; the rest of the graph keeps running and the data flow is never interrupted.
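A toy cost model (my own sketch, not Flink code) shows why single-node recovery is attractive in availability-sensitive scenarios: only the failed node's progress since the last checkpoint is sacrificed, rather than everyone's.

```python
def progress_lost(done_since_checkpoint, failed_node, global_restart):
    """Toy recovery-cost model (illustrative only, not Flink's implementation).
    done_since_checkpoint: records processed per node since the checkpoint.
    Global restart: all nodes roll back, so all of that work is redone.
    Approximate failover: only the failed node restarts, so at most its
    own work since the checkpoint is lost (some records may be dropped,
    trading exactly-once consistency for availability).
    """
    if global_restart:
        return sum(done_since_checkpoint.values())
    return done_since_checkpoint[failed_node]

progress = {"source": 1000, "map": 980, "sink": 950}
print(progress_lost(progress, "map", global_restart=True))   # 2930
print(progress_lost(progress, "map", global_restart=False))  # 980
```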
Nexmark: a streaming benchmark
At the same time, we found that stream computing lacked a standard benchmark tool. Traditional batch computing has a family of TPC benchmarks that covers batch scenarios well, but real-time stream computing had no standard benchmark. Based on the Nexmark research paper, we launched the first version of Nexmark, a benchmark suite of 16 SQL queries. Nexmark has three characteristics:
First, comprehensive scenario coverage:
- Business model based on an online auction system
- 16 queries covering common stream computing scenarios
- ANSI SQL: standardized and easy to extend
Second, convenient and easy to use:
- Pure in-memory data-source generator with flexible load control
- No external system dependencies
- Automated collection of performance metrics
Third, open source and openness:
Nexmark is open source (https://github.com/nexmark/ne…). If you want to compare different versions of Flink, or compare different stream computing engines, you can use this tool.
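As an illustration of the "pure in-memory generator" idea, here is a sketch in plain Python (hypothetical, not Nexmark's actual implementation): an endless generator of auction-bid events whose event time is derived from a target rate, so load can be controlled without any external system.

```python
import itertools
import random

def bid_source(rate_per_sec, seed=42):
    """Sketch of a Nexmark-style in-memory source (hypothetical, not the
    real generator): yields an unbounded stream of bid events with event
    time derived from the target rate, so load is controlled in-process."""
    rng = random.Random(seed)  # fixed seed -> reproducible benchmark input
    for i in itertools.count():
        yield {
            "auction": rng.randrange(1_000),
            "bidder": rng.randrange(10_000),
            "price": rng.randrange(1, 100) * 100,
            "ts_ms": i * 1000 // rate_per_sec,  # event time, no wall clock
        }

# A query under test reads a bounded slice of the unbounded stream.
sample = list(itertools.islice(bid_source(rate_per_sec=100), 5))
print(len(sample))         # 5
print(sample[0]["ts_ms"])  # 0
```

Because the generator is deterministic and purely in-memory, two engines (or two Flink versions) can be fed byte-identical input, which is the property a benchmark needs.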
2. The evolution of the Flink architecture
A new unified stream-batch architecture
Let's talk about the evolution of Flink's architecture. Flink is a stream-computing-driven engine whose core is streaming, but on top of that streaming kernel it can realize a more versatile, unified stream-batch architecture.
In 2020, Flink took a solid step toward unifying stream and batch. Versions 1.10 and 1.11 can roughly be summarized as completing stream-batch unification at the SQL layer and making it production-ready: unified stream-batch expressiveness in SQL and the Table API, a unified query processor, and a unified runtime.
In the newly released 1.12, we also implemented stream-batch unification for the DataStream API, adding batch operators alongside the original stream operators. In other words, DataStream now has two execution modes, and batch and stream operators can be mixed in both batch mode and stream mode.
In the planned 1.13, stream-batch unification of the DataStream operators will be completed, so that the whole computing framework, like SQL, offers unified stream-batch computing power. The legacy DataSet API can then be removed, fully realizing the unified stream-batch architecture.
Under the new unified architecture, Flink's mechanism is also clearer. There are two kinds of APIs: the relational APIs (Table and SQL), and DataStream, which controls physical execution more flexibly. Both the high-level API (Table/SQL) and the low-level API (DataStream) can express unified stream-batch programs, and the user's program is translated into a single unified execution DAG. That DAG can consume either bounded or unbounded streams, i.e. finite and infinite streams. The unified connector framework is likewise stream-batch unified: it can read both streaming storage and batch storage. The whole architecture truly unifies streaming and batch.
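The idea that one program runs over both finite and infinite inputs can be illustrated with a trivial pure-Python sketch (mine, not the DataStream API): the same counting logic is applied unchanged to a bounded input playing the batch role and to a slice of an unbounded generator playing the stream role.

```python
import itertools

def windowed_count(stream, window_size):
    """One piece of logic for both execution modes (toy illustration,
    not the DataStream API): count records per fixed-size window,
    keyed by the window index."""
    counts = {}
    for i, _ in enumerate(stream):
        window = i // window_size
        counts[window] = counts.get(window, 0) + 1
    return counts

bounded = list(range(10))        # "batch" input: a finite data set
unbounded = itertools.count()    # "stream" input: an infinite source
print(windowed_count(bounded, 4))                          # {0: 4, 1: 4, 2: 2}
print(windowed_count(itertools.islice(unbounded, 10), 4))  # {0: 4, 1: 4, 2: 2}
```

The point of unification is exactly this: identical logic and identical results, whether the engine executes it in batch mode over a bounded input or in stream mode over an unbounded one.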
Stream-batch unification is also realized in the core runtime layer, whose two central parts are scheduling and shuffle. The scheduling layer supports a pluggable mechanism, so different scheduling strategies can handle stream, batch, and even mixed stream-batch scenarios. The shuffle service layer likewise supports both streaming and batch shuffles.
We are also working on a new generation of shuffle framework: a remote shuffle service. It can be deployed on Kubernetes to separate storage from computation, so that Flink's compute layer and the shuffle service resemble compute plus a storage service layer. This fully decoupled deployment makes Flink more flexible.
Batch performance is naturally a key concern. After three versions of effort, Flink 1.12 is three times faster than Flink 1.9 (last year's version): with 10 TB of data on 20 machines, the TPC-DS running time has converged to under 10,000 seconds. Flink's batch processing performance has therefore reached production quality, second to no mainstream batch engine in the industry.
Data integration with unified stream-batch processing
Stream-batch unification is not only a technical problem. Let me explain in more detail how the unified architecture changes data processing and data analysis in several typical scenarios.
The first is data synchronization, or data integration: synchronizing data from databases into a data warehouse or other big data storage. The left side of the figure above shows the classic data integration pattern, in which full synchronization and incremental synchronization are two separate technology stacks, and the fully synchronized data must be merged with the incrementally synchronized data periodically, iterating continuously to keep the warehouse in sync with the database.
With Flink's unified stream-batch processing, the data integration architecture changes. Since Flink SQL supports CDC semantics for databases such as MySQL and PostgreSQL, you can use Flink SQL to synchronize database data into open source systems such as Hive, ClickHouse and TiDB, or into open source key-value stores. On top of the unified architecture, Flink's connectors are themselves stream-batch hybrids: a connector can first read the full data set from the database and synchronize it to the warehouse, then automatically switch to incremental mode and read the binlog via CDC for incremental synchronization. Flink coordinates the full and incremental phases internally and automatically, which is exactly the value of stream-batch unification.
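The full-then-incremental handover can be sketched with a small pure-Python model (illustrative only, not the Flink CDC connector's API): the target table is first populated from a full snapshot, then kept up to date by replaying binlog-style change events.

```python
def apply_change(table, event):
    """Apply one binlog-style change event to an in-memory target table.
    Toy model of CDC semantics (not the Flink CDC connector):
    '+I' insert, '+U' update, '-D' delete, keyed by primary key."""
    op, key, row = event
    if op == "-D":
        table.pop(key, None)
    else:  # '+I' and '+U' both upsert the latest row image
        table[key] = row

# Phase 1: full synchronization -- read a snapshot of the source database.
table = {1: "alice", 2: "bob"}
# Phase 2: automatic switch to incremental mode -- replay the binlog.
binlog = [("+U", 2, "bobby"), ("+I", 3, "carol"), ("-D", 1, None)]
for event in binlog:
    apply_change(table, event)
print(table)  # {2: 'bobby', 3: 'carol'}
```

In the real architecture the connector performs both phases and the handover point inside one job, so the user sees a single continuously synchronized table.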
A unified stream-batch data warehouse architecture based on Flink
The second change is in data warehouse architecture. The mainstream setup today is a typical offline warehouse plus a separate new real-time warehouse, built on two disjoint technology stacks: Hive or Spark for the offline warehouse, and Flink plus Kafka for the real-time one. This leaves three problems to solve:
- Two development processes, hence high cost.
- Redundant data links. In the classic warehouse layering of ODS, DWD and DWS, the real-time and offline pipelines often do exactly the same work at the DWD layer, such as data cleansing, data completion and data filtering; the two links compute the same things twice.
- Data consistency is hard to guarantee. Real-time reports are viewed in real time, while offline reports are recomputed every night for the next day's analysis, and the two may disagree on the time dimension because they are produced by two engines: possibly two sets of user code, two sets of UDFs, two sets of SQL, and two warehouse modeling conventions. This causes great confusion for the business and is hard to compensate for with resources or manpower.
With the new unified stream-batch architecture, these problems are greatly reduced.
- First, development is a single set of Flink SQL; there are no duplicate development costs. One team and one technology stack can handle all offline and real-time business statistics.
- Second, there is no redundancy in the data link: the detail (DWD) layer is computed once, with no need to recompute it offline.
- Third, data consistency comes naturally. Both the offline and real-time pipelines use one engine, one set of SQL, one set of UDFs and one team of developers, so real-time and offline results cannot diverge.
A unified stream-batch data lake architecture based on Flink
Going one step further: data is usually landed in a Hive storage layer, but as data volumes grow, bottlenecks appear. For example, as the number of data files grows, metadata management can become a bottleneck; more importantly, Hive does not support real-time data updates, so it cannot provide real-time or near-real-time warehouse capability. The newer data lake architectures solve these problems of Hive-as-warehouse to a certain extent: a data lake provides more scalable metadata, and data lake storage supports updates, making it a unified stream-batch storage layer. Combining data lake storage with Flink turns the unified real-time/offline warehouse architecture into a unified real-time/offline data lake architecture. For example:
Flink + Iceberg:
- Universal design: decoupled from compute engines, open data formats
- Basic ACID guarantees and snapshot functionality
- Unified stream-batch storage, supporting batch reads and fine-grained updates
- Low-cost metadata management
- Iceberg 0.10 released Flink real-time write and batch read/analysis support
- Iceberg 0.11 plans automatic small-file compaction and upsert support
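To illustrate what snapshot-based, updatable table storage buys (a hand-drawn sketch of the general idea, not Iceberg's actual API or format): every upsert commit creates a new immutable snapshot, so streaming writers keep committing while batch readers can still read, or time-travel to, older snapshots.

```python
class ToyLakeTable:
    """Sketch of data-lake table semantics: snapshots plus upserts.
    Purely illustrative; not Iceberg's API or file format."""

    def __init__(self):
        self.snapshots = [{}]  # snapshot 0: empty table

    def commit_upserts(self, rows):
        """Each commit copies the latest state, applies the upserts, and
        appends a new immutable snapshot; returns its snapshot id."""
        snap = dict(self.snapshots[-1])
        snap.update(rows)
        self.snapshots.append(snap)
        return len(self.snapshots) - 1

    def read(self, snapshot_id=-1):
        """Read the latest snapshot, or time-travel to an older one."""
        return self.snapshots[snapshot_id]

t = ToyLakeTable()
s1 = t.commit_upserts({1: "a", 2: "b"})   # streaming write, commit 1
s2 = t.commit_upserts({2: "b2", 3: "c"})  # upsert key 2, insert key 3
print(t.read())    # {1: 'a', 2: 'b2', 3: 'c'}
print(t.read(s1))  # {1: 'a', 2: 'b'}  -- a batch reader time-travels
```

This immutable-snapshot model is what lets a single storage layer serve both the streaming write path and batch analytical reads, which Hive's overwrite-oriented model cannot do.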
In addition, we are working closely with the Hudi community on integrating Flink with Hudi, and in the next few months we will launch a complete Flink + Hudi solution.
Flink + Hudi:
- Relatively mature upsert support
- Flexible table organization (choose copy-on-write or merge-on-read depending on the scenario)
- Flink-Hudi integration is under active development
3. Integration of big data and AI
The last mainstream technical direction is AI. AI is an extremely popular field, and it places heavy computing demands on big data. Next I'd like to share what Flink has done for AI scenarios, and the future plans.
PyFlink is gradually maturing
First, the language layer. Because AI developers are heavy users of Python, Flink provides Python language support. In 2020 the community did a lot of work here, and the PyFlink project achieved a great deal.
Python versions of the Table and DataStream APIs:
- Python UDXs support logging, metrics and other features, making jobs easier to debug and monitor
- Users can develop Flink programs in pure Python
Python UDX support in SQL:
- Including Python UDFs, UDTFs and UDAFs
- SQL developers can use Python libraries directly
Pandas library support:
- Pandas UDFs, Pandas UDAFs and related features
- Conversion between PyFlink Tables and Pandas DataFrames
- Users can use the Pandas library inside Flink programs
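Conceptually, a Python scalar UDF lets SQL call an ordinary Python function on each row. The sketch below models that idea in plain Python (my own toy, not the PyFlink API): a registry of named functions that the "engine" applies column-wise.

```python
def make_udf_applier(udfs):
    """Toy model of SQL calling registered Python scalar UDFs
    (conceptual only; not the PyFlink API)."""
    def apply_udf(rows, udf_name, column):
        fn = udfs[udf_name]
        # Evaluate udf_name(column) on every row, like SELECT f(col) ...
        return [dict(row, **{column: fn(row[column])}) for row in rows]
    return apply_udf

udfs = {"upper": str.upper}              # "register" a plain Python function
apply_udf = make_udf_applier(udfs)
rows = [{"name": "alice"}, {"name": "bob"}]
print(apply_udf(rows, "upper", "name"))  # [{'name': 'ALICE'}, {'name': 'BOB'}]
```

This is the appeal for SQL developers: the query stays declarative while arbitrary Python (including third-party libraries) supplies the per-row logic.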
Alink adds dozens of open source algorithms
At the algorithm level, Alibaba open-sourced Alink last year (2019), a suite of traditional machine learning algorithms built on Flink's unified stream-batch runtime. This year Alibaba's machine learning team continued to open-source ten new Alink algorithms, solving algorithm-component needs in more scenarios and further improving the machine learning development experience. In the future, once Flink's new DataStream API also supports unified stream-batch iteration, we hope to contribute Alink's iteration capability, rebuilt on the new DataStream API, back to Flink's machine learning library, so that standard Flink machine learning can make a major breakthrough.
Unified process management for big data and AI
The integration of big data and AI is one of the topics most worth discussing right now. The two are complementary: by combining their core technologies, we can solve complete online problems such as real-time recommendation and other end-to-end online machine learning processes. In such a process, big data handles data processing, data validation and data analysis, while AI handles model training, model prediction and so on.
But the whole process only solves real business problems when these pieces work together, and Alibaba has strong genes here: Flink was born in the search and recommendation scenario, so our online search and online recommendation are powered by an online machine learning pipeline built with Flink and TensorFlow. We abstracted the process Alibaba accumulated, removed all business-specific parts, kept only the purely technical open source system, distilled it into a set of standard templates and solutions, and open-sourced it as Flink AI Extended. The project has two main parts.
First, Deep Learning on Flink: integrating the Flink compute engine with deep learning engines
- TensorFlow / PyTorch on Flink
- Seamless connection between big data computing tasks and machine learning tasks
Second, Flink AI Flow: a real-time machine learning workflow based on Flink
- Event-based hybrid stream-batch workflow
- End-to-end integration of big data and machine learning
We hope that by open-sourcing this mainstream big data plus AI technology stack, teams can quickly apply it to business scenarios and build online machine learning services such as real-time recommendation. The project is also very flexible in deployment: it can run standalone, or on Hadoop YARN or Kubernetes.
Flink Native on Kubernetes
Kubernetes is now the standard for cloud-native deployment, and we believe its future is even broader, so Flink must at least support running natively on Kubernetes in a cloud-native deployment mode. After three versions of effort this year, Flink can be deployed natively on Kubernetes: Flink's JobManager talks directly to the Kubernetes master to request resources dynamically, scaling in and out according to the running load. We have also fully integrated with Kubernetes HA, and support scheduling of both GPUs and CPUs. The Flink Native on Kubernetes solution is now quite mature; enterprises that need to deploy Flink on Kubernetes can use Flink 1.12.
The status quo and future of Flink at Alibaba
The innovation and value of a technology must be tested by business, and business value is the final criterion. Alibaba is not only the biggest promoter and supporter of Apache Flink, but also its largest user. The following introduces the current state of Flink applications at Alibaba and the follow-up plans.
1. The development of Flink at Alibaba
First, a look at Flink's growth path at Alibaba, which has followed a very steady rhythm:
- In 2016 we ran Flink at large scale in the Double 11 scenario, first landing in search and recommendation, supporting full-link real-time search and recommendation as well as real-time online learning.
- In 2017 we made Flink the group-wide real-time data processing engine, supporting the businesses of the entire Alibaba Group.
- In 2018 Flink went to the cloud. By pushing Flink onto the cloud for the first time, we accumulated technology and served more small and medium-sized enterprises.
- In 2019 we took a step toward internationalization by acquiring the founding company of Flink, and Alibaba invested more resources and manpower in promoting the development of the Flink community.
This year, Flink has become a de facto international standard for real-time computing: around the world, many cloud vendors and big data software vendors have built Flink into their products as a standard cloud offering.
2. Full-link real-time data for Double 11
In this year's Double 11, the Flink-based real-time computing platform fully supported real-time data services for all scenarios at Alibaba. In terms of scale, millions of CPU cores were running, and with no increase in resources, computing power doubled compared with last year. At the same time, through technical optimization, full-link real-time data was achieved across the entire Alibaba economy.
3. From "full-link real-time data" to "real-time/offline integration"
The next step is real-time/offline integration. In e-commerce promotion scenarios, real-time data must be compared against offline data; if the two are inconsistent, or their consistency is unknown, the business is badly disrupted, because it cannot tell whether an unexpected result is caused by a technical error or by business performance genuinely falling short of expectations. So in this year's Double 11, Alibaba implemented unified stream-batch and real-time/offline-integrated business scenarios at scale for the first time.
This year's Double 11 landing scenario for stream-batch unification was the marketing analysis dashboard for Tmall's Double 11. The dashboard shows data across different dimensions, and compares users' transaction volume on Double 11 with that of a month earlier, or even the previous year, to check whether growth meets expectations, and we can guarantee that the stream and batch results are consistent.
In addition, we combined the unified stream-batch storage capability of Hologres, developed by Alibaba itself, with Flink's unified stream-batch compute capability to realize a full-link unified stream-batch data architecture and business architecture. Under this architecture, not only is data naturally consistent, with no interference to the business, but the efficiency of Taobao's operations staff in developing data reports also improved 4 to 10 times.
On the other hand, Flink's stream jobs and batch jobs run in one cluster. The huge Double 11 traffic ebbs at night, and at that point we run a large number of offline batch analysis jobs to prepare the next day's reports. This peak-shaving and valley-filling doubled our effective resources, a very considerable gain.
Besides Alibaba, many close partners in the community, such as ByteDance, Xiaomi, NetEase and Zhihu, are exploring Flink as a unified stream-batch architecture. I believe 2020 will prove to be the first year of large-scale adoption of Flink's new-generation data architecture, moving from full-link real-time data to real-time/offline integration, and Alibaba has already landed it in the core Double 11 business scenario.
Next year, more enterprises will try this new architecture and contribute back to improve it, pushing the community to evolve in new directions: unified stream and batch, integrated offline and real-time, and converged big data and AI. Only then will technological innovation truly serve the business, transform big data processing architectures and the way big data and AI integrate, and release value across all industries.