Introduction: This article was shared by Sun Xiaoguang, head of the technology platform at Zhihu, and introduces how Zhihu built its data integration platform on Flink. It covers four topics: 1. business scenarios; 2. historical design; 3. the design after fully switching to Flink; 4. plans for future Flink application scenarios.
1、 Business scenario
I'm glad to share some of what we learned while Zhihu recently rebuilt its previous-generation data integration platform on Flink. As the link that connects heterogeneous data, a data integration platform has to connect to many kinds of storage systems, and different technology stacks and business scenarios impose different design requirements on it.
Let's first look at the business scenarios for data integration inside Zhihu. As at many Internet companies, Zhihu's online storage used to be mainly MySQL and Redis, with HBase used for some businesses with very large data volumes. In recent years, as the technology evolved, we began migrating from MySQL to TiDB. Similarly, we began moving from HBase to Zetta, which is built on the TiKV technology stack. For offline storage, the vast majority of scenarios are served by Hive tables.
There is strong demand for data synchronization from online storage to offline storage. There is also a large amount of streaming data, such as the data in messaging systems, which we want to connect with the various online and offline storage systems. In the past, Zhihu mainly used Kafka for streaming data, and Pulsar was introduced recently. There is strong demand for data exchange between these two messaging systems and the storage systems.
Given Zhihu's business scenarios and current stage of development, data integration faces challenges in both technology and process management.
- First, from a technical point of view, the diversity of data sources places higher demands on the connectivity and extensibility of the data integration system. Moreover, the next-generation storage systems not only give the business stronger capabilities but also relieve pressure on it, which accelerates the growth of data volume. This rapid growth in turn places higher demands on the throughput and timeliness of the data integration platform. And of course, for a foundational data system, accuracy is the most basic requirement, and one we must get right.
- Second, from the perspective of process management, we need to understand and integrate data scattered across different business teams, manage it well, and keep data access secure, so the overall data integration process is relatively complex. Although a platform can automate complex processes, it cannot completely eliminate the high cost inherent in data integration. Maximizing the reusability and manageability of the process is therefore a challenge the data integration system has to keep addressing.
Based on the challenges in these two areas, we set the design goals of the data integration platform.
- From the technical perspective, we need to support the storage systems that are already in use as well as those that will be rolled out in the future, and be able to integrate the diverse data held in them. We also need to guarantee the reliability and accuracy of data integration while delivering high throughput and low scheduling latency.
- From the process perspective, we need to integrate the metadata of the various internal storage systems with the scheduling systems so that existing infrastructure can be reused, the data access process is simplified, and the cost of onboarding is lower for users. We also want to give users a self-service, platform-based way to meet their data needs, improving the overall efficiency of data integration.
- From the perspective of task manageability, we also need to maintain data lineage, so that the business can better understand the relationships between data outputs, evaluate their business value more effectively, and avoid low-quality or duplicate data integration. Finally, we need to provide systematic monitoring and alerting for all tasks to keep data output stable.
2、 Historical design
Before the first generation of Zhihu's data integration platform took shape, a large number of tasks were scattered across crontabs maintained by individual business teams or across scheduling systems those teams had built themselves. In such an unmanaged state, it was difficult to guarantee the reliability and data quality of the integration tasks. The most urgent problem at this stage was therefore management: making the data integration process manageable and monitorable.
We therefore integrated the metadata systems of the various storage systems, so that all of the company's data assets could be seen in one place. The synchronization tasks for this data were then managed centrally in the scheduling center, which is responsible for task dependency management. The scheduling center also monitors the key metrics of each task and provides alerting on anomalies. At this stage, we used Sqoop, which was widely used at the time, to synchronize data between MySQL and Hive. Late in the construction of the platform, as requirements for streaming data synchronization emerged, we introduced Flink to synchronize Kafka data to HDFS.
When building the first-generation integration platform, we had to make a technology choice: continue with the widely proven Sqoop, or migrate to another available solution. Alibaba's open source DataX was a very strong competitor to Sqoop in this field. A side-by-side comparison shows that each product has advantages in different areas.
- Sqoop offers MapReduce-level scalability and native Hive support. However, it supports too few data sources and lacks some important features.
- DataX offers very rich data source support, built-in rate limiting, which a data integration system needs, and a good design that makes it easy to customize and extend. However, it has no support for cluster resource management and no native support for the Hive catalog.
At that time, neither product had an absolute advantage over the other. We therefore chose to keep using Sqoop, which also saved us a great deal of verification effort. As a result, the first-generation data integration platform was developed, verified and launched in a very short time.
As the first-generation data integration platform launched and matured, it supported the company's data integration needs well and delivered significant benefits. So far, there are about 4,000 tasks on the platform, running more than 6,000 task instances every day and synchronizing about 8.2 billion records totaling 124 TB.
With the help of the platform, the data access process has been greatly simplified, and users can meet their data integration needs by themselves. The platform can also add the necessary regulatory constraints and security reviews at key points in the process, which both raises the level of management and significantly improves overall security and data quality.
Thanks to the elasticity of YARN and k8s, the ability of integration tasks to scale has also improved greatly. Of course, as a first-generation system that solved the problem from 0 to 1, it inevitably came with a series of problems. For example:
- High scheduling latency inherent in Sqoop's MapReduce mode
- Data skew caused by the uneven distribution of business data
- Issues that remained unresolved for a long time because the community is inactive
- Poor extensibility and manageability caused by Sqoop's poor code design
3、 Turn to Flink
Compared with Sqoop, Flink, which we used for the tasks that integrate Kafka messages into HDFS, earned more trust thanks to its excellent reliability and flexible customizability. With the confidence in Flink that the streaming data integration tasks had established, we began to try building the next-generation data integration platform on Flink.
Although Flink was the leading candidate for this evolution of the platform, we again surveyed the technical solutions available on the market at that time. This time we compared the Apache NiFi project with Flink in many respects. From a functional point of view:
- Apache NiFi is very powerful and fully covers our current data integration requirements. But precisely because it is so powerful and self-contained, it also raises the barrier to integration. Moreover, it cannot use the existing YARN and k8s resource pools, which would add the cost of building and maintaining a separate resource pool.
- In contrast, Flink has a very active and open community. It already had very rich data source support when our project was approved, and its data source coverage can be expected to become even more comprehensive. As a general-purpose computing engine, Flink has a powerful and easy-to-use API design on which secondary development is very easy, so it has an outstanding advantage in extensibility.
Finally, because we share the goal of unified batch and stream processing, unifying Zhihu's big data computing engine technology stack in the future is also a very attractive goal.
Based on these considerations, in this iteration we chose to replace Sqoop with Flink completely. On top of Flink, we reimplemented all the functionality previously provided by Sqoop and rebuilt the integration platform.
As shown in the following figure, the orange parts are the ones that changed in this iteration. In addition to Flink, the protagonist, we also developed data integration for the TiDB, Redis and Zetta storage systems during this iteration. On the messaging side, Pulsar support came directly from the community. By the time we started development, Flink had reached a relatively mature stage and had built-in native support for Hive. The entire migration ran very smoothly and met few technical difficulties.
The migration to Flink has brought us many benefits.
1. First, maintainability is significantly better than with Sqoop. As shown in the following figure, the left side is a task definition from the Sqoop era: a long, unstructured and error-prone raw command. With Flink, a task only needs a source table and a target table defined in SQL, plus a write (INSERT) statement. Tasks are much easier to understand and debug than before, and have become something end users can understand. Many problems no longer need help from platform developers, and users can resolve many common task failures by themselves.
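To make the contrast concrete, here is a minimal sketch of how such a SQL-defined task could be assembled. It is not Zhihu's actual task format: the helpers `create_table` and `build_sync_task`, the table and column names, the JDBC URL and the HDFS path are all hypothetical, and a filesystem table is used here as a stand-in for the Hive target. The `connector`, `url`, `table-name`, `path` and `format` keys are options of Flink's JDBC and filesystem SQL connectors.

```python
def create_table(name, columns, options):
    """Render a Flink SQL CREATE TABLE statement (hypothetical helper)."""
    cols = ",\n".join(f"  `{col}` {sql_type}" for col, sql_type in columns)
    opts = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE {name} (\n{cols}\n) WITH (\n{opts}\n)"


def build_sync_task(columns, source_options, sink_options):
    """A task is a source table, a target table and one INSERT statement."""
    names = ", ".join(f"`{col}`" for col, _ in columns)
    return [
        create_table("source_table", columns, source_options),
        create_table("target_table", columns, sink_options),
        f"INSERT INTO target_table SELECT {names} FROM source_table",
    ]


# Illustrative values only; none of them come from the article.
statements = build_sync_task(
    columns=[("id", "BIGINT"), ("title", "STRING")],
    source_options={
        "connector": "jdbc",
        "url": "jdbc:mysql://mysql.example.internal:3306/demo",
        "table-name": "answers",
    },
    sink_options={
        "connector": "filesystem",
        "path": "hdfs:///tmp/demo/answers",
        "format": "parquet",
    },
)
print(";\n\n".join(statements) + ";")
```

Because the whole task is three declarative statements, a user can read it top to bottom and see which columns move from where to where, which is the maintainability gain described above.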
2. In terms of performance, we have also made many targeted optimizations.
2.1 Scheduling strategy
The first is the optimization of the scheduling strategy. In the first-generation integration platform we only used Flink to synchronize streaming data, so tasks were scheduled entirely in per-job mode. The platform now supports a mix of session and per-job scheduling: streaming tasks that ingest data from the messaging systems continue to run in per-job mode, while batch synchronization tasks reuse a cluster in session mode, which avoids the time-consuming cluster startup and improves synchronization efficiency.
Of course, using session clusters in this scenario brings its own challenges, such as resource requirements that change as the workload varies with task submissions. We therefore built an automatic scale-out and scale-in mechanism to help session clusters cope with changing load. In addition, to simplify billing and isolate risk, we created private session clusters for different business lines, each serving the data integration tasks of its own business line.
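The mixed scheduling policy described above can be sketched as follows. This is an illustration under stated assumptions, not the platform's real scheduler: the `Task` type, the `choose_deployment` and `desired_task_managers` helpers, the cluster naming scheme, and the slots-per-TaskManager and min/max figures are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    kind: str            # "streaming" or "batch"
    business_line: str
    slots: int           # parallel slots the task asks for


def choose_deployment(task):
    """Streaming tasks get a dedicated per-job cluster; batch tasks share
    the private session cluster of their business line."""
    if task.kind == "streaming":
        return ("per-job", f"job-{task.name}")
    return ("session", f"session-{task.business_line}")


def desired_task_managers(tasks, slots_per_tm=4, min_tms=1, max_tms=50):
    """Size a session cluster to the slots its tasks request, clamped to
    an assumed [min_tms, max_tms] range."""
    needed = math.ceil(sum(t.slots for t in tasks) / slots_per_tm)
    return max(min_tms, min(max_tms, needed))


tasks = [
    Task("kafka_to_hdfs", "streaming", "search", 2),
    Task("mysql_answers", "batch", "community", 6),
    Task("tidb_orders", "batch", "community", 5),
]
for task in tasks:
    print(task.name, choose_deployment(task))

community = [t for t in tasks
             if choose_deployment(t) == ("session", "session-community")]
print(desired_task_managers(community))  # ceil(11 / 4) = 3
```

Keeping one session cluster per business line means the sizing function only ever sees that line's tasks, which is what makes per-line billing and risk isolation straightforward in this model.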
For relational databases, we synchronize MySQL data through the common JDBC approach, but this approach has some inherent problems that are hard to solve.
- For example, business data that is unevenly distributed along the primary key dimension causes data skew.
- As another example, the dedicated replicas built for synchronization, in order to isolate online and offline workloads, waste resources and add management cost.
- Moreover, because there are many MySQL instances with different specifications, it is very difficult to coordinate multiple concurrent task instances sensibly with the MySQL instances and the hosts they run on.
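The data skew problem in the first bullet can be reproduced in a few lines. The sketch below assumes the common approach of cutting a table into equal-width ranges between the minimum and maximum primary key; the key distribution is synthetic and the function name is hypothetical.

```python
def split_by_key_range(keys, num_splits):
    """Count rows per split when [min, max] is cut into equal-width ranges."""
    lo, hi = min(keys), max(keys)
    width = (hi - lo + 1) / num_splits
    counts = [0] * num_splits
    for key in keys:
        index = min(int((key - lo) / width), num_splits - 1)
        counts[index] += 1
    return counts


# Synthetic, skewed keys: 9,000 dense ids followed by 1,000 ids spread
# sparsely over a much larger range.
keys = list(range(1, 9001)) + list(range(1_000_000, 2_000_000, 1000))
counts = split_by_key_range(keys, 4)
print(counts)  # [9000, 0, 500, 500]
```

One split carries 90% of the rows while another is empty, so the job finishes only when the overloaded split does, regardless of how much parallelism is configured.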
In contrast, given the trend of migrating data wholesale from MySQL to TiDB, we developed a native TiDB connector for Flink to take full advantage of the TiDB architecture.
- First, the region-level load balancing strategy ensures that, for any table structure and any data distribution, a synchronization task can be split at region granularity, which avoids data skew.
- Second, by configuring the replica placement policy, we can consistently place one follower replica of the data in the offline data center. Then, with the target number of replicas unchanged and at no extra resource cost, follower read isolates the load of online transactions from that of data extraction.
- Finally, we introduced a distributed data commit method to improve write throughput.
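The region-based splitting in the first bullet can be illustrated as below. This is a simplified model, not the connector's implementation: regions are built so that each covers at most a fixed number of rows, which stands in for TiKV splitting regions by size, and the `Region`, `build_regions` and `assign_regions` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start_key: int   # inclusive
    end_key: int     # exclusive (simplified: last key in the region + 1)
    rows: int


def build_regions(sorted_keys, max_rows_per_region):
    """Cut the key space wherever a region would exceed its row budget."""
    regions = []
    for i in range(0, len(sorted_keys), max_rows_per_region):
        chunk = sorted_keys[i:i + max_rows_per_region]
        regions.append(Region(chunk[0], chunk[-1] + 1, len(chunk)))
    return regions


def assign_regions(regions, parallelism):
    """Round-robin regions over subtasks; one list of regions per subtask."""
    assignment = [[] for _ in range(parallelism)]
    for i, region in enumerate(regions):
        assignment[i % parallelism].append(region)
    return assignment


# Synthetic keys: dense at first, then sparse over a much wider range.
keys = list(range(1, 9001)) + list(range(1_000_000, 2_000_000, 1000))
regions = build_regions(keys, max_rows_per_region=500)
rows_per_subtask = [sum(r.rows for r in rs) for rs in assign_regions(regions, 4)]
print(rows_per_subtask)  # [2500, 2500, 2500, 2500]
```

Because the boundaries follow the data rather than the key range, the split is balanced even though the keys themselves are heavily skewed, and no knowledge of the business table is needed.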
3. Finally, we provide data integration capabilities for Redis, which is widely used at Zhihu. The Flink community already has a Redis connector, but it can only write, and it is hard to customize the keys used when writing. We therefore redeveloped a Redis connector around our own needs, with support for Redis as both source and sink.
Similarly, to keep data extraction from affecting online transactions, the read path uses Redis's native master/slave replication mechanism to obtain the RDB file and parse it to extract data, which achieves a single-instance extraction throughput of about 150 MB per second. Moreover, thanks to the metadata of the internal storage systems, we can not only extract data from Redis clusters in sharded mode but also select only the slave node of each shard as the extraction source, avoiding pressure on the master nodes.
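Choosing the extraction source for each shard can be sketched as follows. The metadata layout, the addresses and the `pick_extraction_sources` helper are hypothetical; the real schema of the internal metadata system is not described in this article.

```python
def pick_extraction_sources(cluster):
    """Return one slave address per shard, never a master.

    `cluster` maps a shard name to a list of nodes, each a dict with
    "addr" and "role" keys (assumed layout). Raises ValueError if a shard
    has no slave node to read from.
    """
    sources = {}
    for shard, nodes in cluster.items():
        slaves = sorted(n["addr"] for n in nodes if n["role"] == "slave")
        if not slaves:
            raise ValueError(f"shard {shard} has no slave node to extract from")
        sources[shard] = slaves[0]
    return sources


cluster = {
    "shard-0": [
        {"addr": "10.0.0.1:6379", "role": "master"},
        {"addr": "10.0.0.2:6379", "role": "slave"},
    ],
    "shard-1": [
        {"addr": "10.0.1.1:6379", "role": "master"},
        {"addr": "10.0.1.3:6379", "role": "slave"},
        {"addr": "10.0.1.2:6379", "role": "slave"},
    ],
}
sources = pick_extraction_sources(cluster)
print(sources)
```

Failing loudly when a shard has no slave, instead of falling back to the master, keeps the guarantee that extraction never touches a node serving online traffic.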
This full move to Flink solved many problems of the previous-generation data integration platform and brought significant benefits.
- In terms of performance, replacing MapReduce mode with Flink reduces the end-to-end scheduling latency from the minute level to about 10 seconds. With the same amount of data and the same Flink resources, the native TiDB connector increases throughput fourfold compared with JDBC.
- From a functional point of view, the new platform not only natively supports data integration tasks over sharded databases and tables, but also avoids data skew in a way that is independent of the business.
- In terms of data source support, we gained support for TiDB, Zetta, Redis and Pulsar at very low cost. Moreover, as the Flink ecosystem matures, more out-of-the-box connectors will become available to us.
- Finally, in terms of cost, taking the dedicated MySQL nodes offline and unifying on the k8s resource pool have brought us significant benefits in both cost and management.
4、 Flink is the future
Looking back, the return on investment of this full move to Flink was very high, which further strengthens our confidence that "Flink is the future". At present, besides data integration, Flink is also used at Zhihu for timeliness analysis of search queries, processing of commercial advertisement click data, and the real-time data warehouse for key business metrics.
In the future, we hope to further expand the use of Flink at Zhihu and build a more comprehensive real-time data warehouse and a systematic online machine learning platform. Beyond that, we hope unified batch and stream processing will allow large batch tasks such as reports and ETL to run on the Flink platform as well.
Given how Zhihu's big data systems are built and the overall resources invested, converging the technology stack on Flink is a very suitable choice for Zhihu. As users, we look forward to seeing Flink achieve its goal of unified batch and stream processing. As members of the community, we also hope to contribute to that goal in our own way.