On Thursday, at the Flink Forward Asia 2019 conference, Alibaba announced details of the upcoming release of Apache Flink, the big data processing engine, and officially open-sourced Alink, a machine learning platform built on Flink.
According to the official introduction, the new Flink 1.10 release incorporates all of the functionality of Blink, Alibaba's internal real-time computing platform built on Flink, and is expected to ship officially in January next year. Beyond completing the merge, Flink 1.10 also brings improved Hive integration and compatibility, better Python support, native Kubernetes integration, and the addition of several mainstream machine learning algorithm libraries.
Alibaba began building Blink, its internal platform based on Flink, in 2015; it has long powered core real-time businesses inside Alibaba such as search, recommendation, and advertising. After three years of practice and polishing, and with the gap between Blink and open-source Flink growing, Alibaba announced at the Flink Forward China summit last December that it would open-source Blink.
In fact, this is the second time Blink's functionality has been merged into an official Flink release since Blink was open-sourced in January this year; the first was Flink 1.9, released three months ago. Alibaba has invested substantial manpower and resources in a short period, with Apache project management committee members and committers contributing more than 1.5 million lines of code.
Apache Flink is a distributed big data processing engine that performs stateful computations over bounded and unbounded data streams. It can be deployed in all common cluster environments and performs computations quickly at any scale.
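To make "stateful computation over an unbounded stream" concrete, here is a toy Python sketch (not Flink's actual API): the operator keeps running per-key state and emits an updated result for every incoming event, instead of waiting for the input to end as a batch job would.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    state = defaultdict(int)      # the operator's keyed state
    for key in events:            # the stream may never terminate
        state[key] += 1
        yield key, state[key]     # emit an incremental update per event

# A bounded input is just the special case of a stream that ends.
updates = list(running_counts(["a", "b", "a", "a"]))
print(updates)  # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
```

In a real Flink job this state would be partitioned across the cluster and checkpointed for fault tolerance; the sketch only shows the event-at-a-time, state-carrying computation model.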
Apache Flink originated in a big data batch computing research project called "Stratosphere: Information Management on the Cloud," jointly initiated by the Technical University of Berlin, Humboldt University of Berlin, and the Hasso Plattner Institute. The core developers later split Flink off from Stratosphere with the aim of handling all big data computation through stream processing. Flink entered the Apache incubator in March 2014 and became a top-level Apache project in December of the same year. By now, a large number of enterprises, including Tencent, Huawei, Netease, Xiaomi, Didi, and Shunfeng, have become Flink users.
Flink's core is a stream-processing engine that provides data distribution, communication, fault tolerance, and other facilities for distributed computation over data streams, supporting both stream and batch processing. On top of this engine, Flink offers developers more computing power and easier-to-use programming interfaces for building distributed jobs. Flink also ships libraries for specific application areas, such as the machine learning library FlinkML, which provides scalable machine learning algorithms along with intuitive APIs and tools.
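The "one engine for both stream and batch" idea above can be sketched in plain Python (again, not Flink's API): the same transformation pipeline runs unchanged over a bounded collection and over a potentially unbounded generator, since a batch is just a stream that happens to end.

```python
import itertools
from typing import Iterable, Iterator

def pipeline(records: Iterable[str]) -> Iterator[int]:
    # A tiny dataflow: filter, then map. A real engine would distribute
    # these operators across a cluster and checkpoint their state.
    cleaned = (r.strip() for r in records if r.strip())
    return (len(r) for r in cleaned)

# Batch: a bounded input, fully materialized.
batch_result = list(pipeline(["flink ", "", " blink"]))
print(batch_result)  # [5, 5]

def sensor() -> Iterator[str]:
    # An unbounded source; itertools.count never terminates.
    for i in itertools.count():
        yield f"reading-{i}"

# Streaming: consume only a prefix of an endless source.
stream_result = list(itertools.islice(pipeline(sensor()), 3))
print(stream_result)  # [9, 9, 9]
```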
Alink, the machine learning platform officially open-sourced at this conference, is distinct from FlinkML. It is a general-purpose algorithm library developed on the new version of Flink by the PAI team of Alibaba's computing platform division; it forms part of the PAI algorithm platform and supports a series of open-source data storage systems such as Kafka, HDFS, and HBase. Alink may also be merged into FlinkML in the future.
As a machine learning algorithm platform supporting both stream and batch computation, Alink provides more than 200 commonly used algorithms for machine learning, statistics, and related tasks, along with a convenient operating framework, and its algorithm implementations are optimized for higher runtime efficiency. Alink is now officially available on GitHub. Developers can complete the whole workflow from data processing through model training to real-time prediction and visualization without deep knowledge of Flink, and can use Alink for tasks such as statistical analysis, machine learning, real-time prediction, personalized recommendation, and anomaly detection.
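The data-processing, model-training, and prediction flow described above can be illustrated with a minimal self-contained sketch. The stage names (`Standardize`, `NearestCentroid`) are hypothetical illustrations of the fit/transform/predict pattern, not Alink's API:

```python
from statistics import mean, stdev

class Standardize:
    """Preprocessing stage: scale each feature to zero mean, unit variance."""
    def fit(self, rows):
        cols = list(zip(*rows))
        self.mu = [mean(c) for c in cols]
        self.sigma = [stdev(c) or 1.0 for c in cols]  # guard constant columns
        return self
    def transform(self, rows):
        return [[(x - m) / s for x, m, s in zip(r, self.mu, self.sigma)]
                for r in rows]

class NearestCentroid:
    """Training stage: classify a point by its closest class centroid."""
    def fit(self, rows, labels):
        self.centroids = {}
        for lbl in set(labels):
            members = [r for r, l in zip(rows, labels) if l == lbl]
            self.centroids[lbl] = [mean(c) for c in zip(*members)]
        return self
    def predict(self, rows):
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return [min(self.centroids, key=lambda l: dist2(r, self.centroids[l]))
                for r in rows]

# Process -> train -> predict, here in batch form on toy data.
X = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]]
y = ["low", "low", "high", "high"]
scaler = Standardize().fit(X)
model = NearestCentroid().fit(scaler.transform(X), y)
pred = model.predict(scaler.transform([[8.5, 9.0]]))
print(pred)  # ['high']
```

A platform like Alink additionally runs such stages distributed on Flink and can serve the trained model for real-time prediction over a stream; the sketch only shows the chained-stage programming pattern.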
Alink is reportedly also used within Alibaba's core real-time businesses such as search, recommendation, and advertising. During this year's "Double 11" shopping festival, Alink withstood the pressure of large-scale real-time data training, processing 970 PB of data per day with a peak of more than 2.5 billion records per second, and ultimately contributed a 4% increase in commodity click conversion rate.
To date, Alibaba has open-sourced 283 repositories on GitHub and Alibaba Cloud another 278, making Alibaba Group the largest open-source enterprise in China.
GitHub related project address:
General-purpose algorithm platform Alink