Introduction: Alibaba Cloud Realtime Compute for Apache Flink (powered by Ververica) is an enterprise-grade, high-performance real-time big data processing system that Alibaba Cloud built on Apache Flink. It is developed by the founding team of Apache Flink, is offered under a single global commercial brand, is fully compatible with the open source Flink API, and adds a rich set of enterprise-grade features.
This article is compiled from the live broadcast “General Introduction to the Flink Version of Real-Time Computing”.
Apache Flink technology development
Big data has been developing rapidly for more than ten years, and it is evolving from large-scale computing toward real-time computing.
For example, during Double 11, Alibaba's shopping festival, a large real-time dashboard displays the transaction count and turnover for the whole event, updated at millisecond-level latency. CCTV's Spring Festival Gala, watched by Chinese audiences all over the world, uses a large-screen dashboard to report national ratings and audience profiles in real time. Many cities now run City Brain projects that use IoT camera data to capture traffic, vehicle, and pedestrian flows in real time for traffic monitoring and management. In the financial industry, banks, stock exchanges, and other institutions use real-time big data computing in their core business scenarios to monitor trading behavior and detect fraud, money laundering, and similar activity. Across Taobao's e-commerce transactions, personalized recommendations are made in real time from user behavior: based on the items a user browsed in the previous minute or 30 seconds, the system updates the user's profile algorithmically and recommends items the user may like as they keep browsing. Behind many everyday scenarios like these, real-time computing is at work around the clock.
Real-time computing requires very powerful big data computing capability in the back end, and Apache Flink, an open source real-time big data computing technology, emerged to meet that need. It was designed for stream computing from the start. Traditional engines such as Hadoop and Spark are batch engines in essence: they process bounded data sets, so their processing latency cannot be guaranteed. As a stream computing engine, Apache Flink subscribes to data as it is generated, analyzes and processes it in real time, and produces results, so that the data becomes valuable as early as possible.
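The difference between batch and stream processing described above can be illustrated with a minimal sketch in plain Python. It does not use the Flink API; the function names and the sample order amounts are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, List


def batch_total(orders: List[float]) -> float:
    """Batch style: the whole bounded data set must exist before computing."""
    return sum(orders)


def streaming_totals(orders: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result as each event arrives."""
    running = 0.0
    for amount in orders:
        running += amount
        yield running


if __name__ == "__main__":
    events = [10.0, 5.0, 7.5]  # hypothetical order amounts
    print(batch_total(events))             # one result, after all data is in
    print(list(streaming_totals(events)))  # one result per incoming event
```

The batch function can only answer once all input is available, while the streaming function produces a fresh total after every event, which is the property a real-time dashboard relies on.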
Apache Flink has since grown from a stream computing engine into one with unified stream and batch computing. It can perform streaming analysis on log streams, clickstreams, and IoT data streams, and it can also run batch processing on bounded data sets, such as tables in databases and files in file systems, to produce results quickly.
Apache Flink is now a very popular open source big data technology and has been one of the most active Apache open source projects worldwide for three consecutive years. It offers strongly consistent computation, large-scale scalability, and excellent overall performance; it supports SQL, Java, Python, and other languages; and it provides rich APIs that make it convenient to use across business scenarios. Among Internet companies in China and abroad, Flink has become the mainstream real-time big data computing technology and the de facto standard in the field of real-time computing.
Alibaba Cloud's Realtime Compute for Apache Flink has been hardened and validated inside Alibaba Group for many years, accumulating a rich body of technology and product experience. It is now offered on the cloud to serve small and medium-sized enterprises across industries. As early as 2016, the third year after Flink was donated to the Apache Software Foundation, Alibaba began running real-time computing products in production at scale. The product was first deployed in Alibaba's core search, recommendation, and advertising business, which needs a large amount of real-time data processing, such as real-time recommendation, real-time ranking, and real-time advertising. This greatly improved the core e-commerce business.
In 2017, the Flink-based real-time computing platform began to serve the whole Alibaba Group. During Double 11 that year, it handled real-time data for the entire group, including the core Double 11 dashboard. In 2018, the product was officially launched on the cloud, serving not only the group but also small and medium-sized enterprises on the cloud. This was the first time a Flink-based real-time computing product was offered as a public cloud service.
In early 2019, Alibaba acquired Ververica, the company founded by the creators of Flink. Alibaba's Flink real-time computing team joined forces with the Flink founding team at its German headquarters to form the strongest Flink technology team in the world, and together they drive the development of the Apache Flink open source community and contribute to it. At present, more than 200,000 developers in China take part in the Apache Flink community, and Flink has become one of the most active big data projects of the Apache Software Foundation.
Last year, mainstream cloud computing and big data companies around the world launched their own products based on Flink. For example, Cloudera, which started with Hadoop, has released CDP / CDH with Flink fully integrated, and big data companies in China have also launched Flink-based real-time computing products.
Realtime Compute for Apache Flink product architecture
Compared with the open source version, Alibaba Cloud's real-time computing product adds significant improvements and value in its architecture. Many developers today build their own real-time computing platform with open source Apache Flink, running jobs in their own data centers or on cloud virtual machines. So what does the Realtime Compute for Apache Flink product officially launched by Alibaba Cloud offer?
As the product architecture diagram shows, the bottom layer is Alibaba Cloud's complete cloud native infrastructure, on which the product is built with containers. All Flink computing tasks run in the Kubernetes ecosystem, and tenants are isolated from one another through containers to ensure security. The product is delivered as a fully managed cloud service with a high SLA guarantee, which removes the operation and maintenance burden from users. With this service architecture, users can flexibly decide the proportion of each resource type, size resources to match their own business volume, and need not worry about machine planning. Realtime Compute for Apache Flink is cloud native by design.
In the core computing engine, Alibaba Cloud has optimized several core functions relative to open source Apache Flink, and these optimizations have been hardened by Alibaba's internal workloads. Realtime Compute for Apache Flink currently supports the real-time data services of nearly 100 business units in Alibaba Group. Through extensive production practice, the product has been tuned for storage, scheduling, network transmission, and related areas.
For plug-ins, the product has dozens of enhanced connectors built in, which connect to mainstream open source data stores and cloud services, including MySQL, HBase, HDFS, and Alibaba Cloud SLS. They are integrated natively and work out of the box. For development, the product provides an enterprise-grade one-stop platform with built-in development and operations capabilities, which spares users the trouble of building their own and improves the overall experience for enterprise users.
Realtime Compute for Apache Flink supports development in SQL, Java, Python, and other languages, manages the full life cycle of development tasks, supports enterprise-grade security mechanisms based on OIDC and RBAC, and offers full-link monitoring and alerting based on the Prometheus protocol. It also provides its own Autopilot intelligent tuning system, which helps users tune the parameters of Flink tasks, including resources and parallelism. The product adapts to business traffic without manual tuning (intelligent tuning is a core advantage of Realtime Compute for Apache Flink).
Differences between Realtime Compute for Apache Flink and open source Apache Flink
Compared with open source Flink, Realtime Compute for Apache Flink has 10 advantages, which fall under development, operation and maintenance, cost, and security.
In development, it offers rich data connectivity, a one-stop multilingual development environment, and multiple built-in function libraries, and it supports code debugging, multi-tenant development, task debugging, and test simulation. In operation and maintenance, it supports monitoring and alerting across the whole link: users can receive automatic alerts for data delay, data anomalies, service interruptions, and similar events.
In intelligent operation and maintenance, it supports automatic diagnosis and tuning: it can diagnose problems and automatically tune performance, jobs, parameters, and resources according to business traffic. At the resource level, it goes beyond open source with finer-grained resource allocation: CPU and memory can be configured for each operator of each job, which greatly improves resource utilization, helps users save costs, improves service stability, and reduces the probability of out-of-memory (OOM) errors. Together with the product's own operation and maintenance service, the SLA is 99.9%, and fault tolerance across the whole link and system stability are guaranteed, which relieves users of these concerns.
At the cost level, cloud cost optimization lowers users' total cost of ownership (TCO) while improving performance, a benefit that follows from the product's core performance advantage.
In the Nexmark-based standard stream computing benchmark, Realtime Compute for Apache Flink performs about three times as well as open source Flink. Drawing on the optimizations that Alibaba Group's R&D team has accumulated in internal core business scenarios, the product delivers this advantage while lowering users' base costs.
Realtime Compute for Apache Flink also has the elastic scaling capability of the cloud, which helps users save resources and improve resource utilization. Billing supports both subscription (yearly or monthly) and pay-as-you-go to suit different needs.
At the security level, the product isolates tasks through containers and supports tenant isolation, security isolation, VPC isolation, and similar requirements. It is also connected directly to the Alibaba Cloud account system, so users can manage access between products seamlessly with their Alibaba Cloud accounts. In addition, it supports role-based access control and open identity authentication protocols such as OIDC, which greatly improves business security.
Overall, compared with the open source version, the enterprise version has advantages in functionality and stability. Beyond the operation and maintenance advantages, it works out of the box, which makes it more convenient for users.
As a streaming engine for real-time computing, Flink can process many kinds of real-time data, including online service logs from ECS and sensor data in IoT scenarios. It can also subscribe to binlog updates from relational databases on the cloud, such as RDS and PolarDB. Real-time data is then ingested through the DataHub data bus, the SLS log service, open source Kafka message queues, and similar products, and fed into the real-time computing product for analysis and processing. Finally, the results are written to different data services, such as MaxCompute, Hologres interactive analytics, PAI machine learning, and Elasticsearch. Users choose the data service that best fits their business needs to get more out of their data.
Flink's main application scenario is to subscribe to, process, and analyze data from different real-time sources as it arrives, and to write the results to online storage for direct use in production. The system is fast, accurate, cloud native, and intelligent, which makes it a highly competitive enterprise-grade product. It runs on Alibaba Cloud Container Service, ECS, and other IaaS systems, and it integrates natively with Alibaba Cloud services so that customers can apply it to more scenarios.
Product application scenarios
Realtime Compute for Apache Flink groups its uses into four application scenarios, so that users can build a real-time computing solution for their own business according to their needs.
1. Real-time data warehouse
The real-time data warehouse is mainly used for transaction data scenarios such as website PV / UV statistics, product sales statistics, and transaction statistics. By subscribing to real-time business data sources, it analyzes information at second-level latency and presents it on a dashboard for decision-makers, who can then judge the state of the business and the effect of promotions. Decisions are made on real-time operational data, which is what data intelligence means in practice. In this scenario, data freshness matters most: in fast-changing business interactions, decisions must rest on data from the last minute or even the last second. Real-time computing is the best fit for this scenario.
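The PV / UV statistics mentioned above can be sketched as a tumbling-window count. This is a minimal plain-Python illustration, not Flink code; the event schema `(timestamp, user_id)`, the function name, and the 60-second window are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, user id); hypothetical schema


def pv_uv_per_window(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[int, Tuple[int, int]]:
    """Return {window_start: (page_views, unique_visitors)} per tumbling window."""
    page_views: Dict[int, int] = defaultdict(int)
    visitors: Dict[int, Set[str]] = defaultdict(set)
    for ts, user_id in events:
        # Align the event to the start of the fixed-size window it falls in.
        window_start = ts - ts % window_seconds
        page_views[window_start] += 1
        visitors[window_start].add(user_id)
    return {w: (page_views[w], len(visitors[w])) for w in sorted(page_views)}


if __name__ == "__main__":
    clicks = [(0, "a"), (10, "b"), (59, "a"), (60, "a"), (125, "c")]
    print(pv_uv_per_window(clicks))
```

PV counts every event in a window, while UV counts each user once per window, which is why the per-window visitor set is kept separately from the counter.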
2. Real-time recommendation
Real-time recommendation mainly means personalized recommendation based on user preferences, or recommendation based on AI technology, and it is a mainstream product form. It is common in short video, e-commerce shopping, and content feed scenarios. By looking at a user's recent clicks, the system can infer preferences in real time, make targeted recommendations, and increase user stickiness. This scenario is highly time-sensitive, and combining Flink with AI technology makes real-time recommendation possible.
3. Real-time ETL
Real-time ETL is common in data synchronization, where data must also be computed and processed as it is synchronized. Examples include synchronizing and transforming different tables in a database, synchronizing different databases, and pre-aggregating data. The results are finally written to a data warehouse or data lake for archiving and accumulation, in preparation for later in-depth analysis such as log analysis. Across the whole synchronization and processing link, doing this real-time synchronization and preprocessing with Flink is very efficient.
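The transform and pre-aggregate steps described above can be sketched as follows. This is a plain-Python illustration under assumed inputs, not Flink code: the `category,amount` line format and the function names `parse` and `etl` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def parse(line: str) -> Optional[Tuple[str, float]]:
    """Transform one raw 'category,amount' line; return None if malformed."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    category = parts[0].strip().lower()  # normalize the key
    if not category:
        return None
    try:
        amount = float(parts[1])
    except ValueError:
        return None
    return category, amount


def etl(lines: Iterable[str]) -> Dict[str, float]:
    """Clean incoming records and pre-aggregate amounts per category."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # drop malformed input instead of failing the pipeline
        category, amount = record
        totals[category] += amount
    return dict(totals)


if __name__ == "__main__":
    raw = ["Books, 10.5", "books,4.5", "bad line", "toys,abc", "Toys,3"]
    print(etl(raw))
```

In a real pipeline the aggregated result would be written to a warehouse or data lake sink; here it is simply returned so the cleaning and aggregation logic stays visible.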
4. Real-time monitoring
Real-time monitoring is common in financial and trading scenarios. These industries require anti-fraud supervision: the system must decide from a user's behavior over a short period whether the user is fraudulent, so that losses can be stopped in time. This scenario demands low latency. By detecting abnormal data, abnormal conditions can be found in real time and loss-stopping actions taken. Collecting metrics or logs from various systems and observing and monitoring those metrics in real time can likewise be handled with Realtime Compute for Apache Flink.
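Judging a user by behavior within a short time, as described above, can be sketched as a sliding-window rate check. This is a simplified plain-Python illustration, not Flink code and not a production fraud rule; the function name, the thresholds, and the assumption that events arrive in timestamp order are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Tuple


def detect_bursts(
    events: Iterable[Tuple[int, str]],
    window_seconds: int = 10,
    max_events: int = 3,
) -> List[Tuple[int, str]]:
    """Return (timestamp, user) alerts when a user exceeds max_events
    within the last window_seconds. Assumes events arrive in time order."""
    recent: Dict[str, Deque[int]] = defaultdict(deque)
    alerts: List[Tuple[int, str]] = []
    for ts, user in events:
        timestamps = recent[user]
        timestamps.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while timestamps and timestamps[0] <= ts - window_seconds:
            timestamps.popleft()
        if len(timestamps) > max_events:
            alerts.append((ts, user))
    return alerts


if __name__ == "__main__":
    trades = [(0, "u1"), (1, "u1"), (2, "u2"), (3, "u1"), (4, "u1"), (20, "u1")]
    print(detect_bursts(trades))
```

Because state is kept per user and old timestamps are evicted as time advances, the check can run continuously on an unbounded stream and raise an alert as soon as the threshold is crossed.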
Product official website: https://www.aliyun.com/product/bigdata/sc