Building a data warehouse is an indispensable part of “data intelligence” and an unavoidable challenge once data applications reach scale. A Flink-based real-time data warehouse plays an important role in the data pipeline. In this article, Lu Hao, a senior technical expert at Meituan-Dianping, shares Meituan-Dianping's practice in building a real-time data warehouse platform on Apache Flink.
Author: Lu Hao, Meituan-Dianping
1. Evolution of real-time computing at Meituan-Dianping
Timeline of real-time computing at Meituan-Dianping
In 2016, Meituan-Dianping built its first platform on the Storm real-time computing engine. In early 2017, we introduced Spark Streaming to support specific scenarios, mainly data synchronization. At the end of 2017, the real-time computing platform introduced Flink, which has many advantages over Storm and Spark Streaming. At that stage we invested heavily in turning the system into a platform, focusing on security, stability and ease of use. Since 2019, we have been building solutions for specific scenarios, including the real-time data warehouse and machine learning, to better support the business.
Real-time computing platform
At present, the real-time computing platform of Meituan-Dianping has 10,000 daily active jobs. At peak, jobs process 150 million messages per second. The cluster has reached thousands of machines, and thousands of users rely on the real-time computing service.
Real-time computing platform architecture
The figure below shows the architecture of the Meituan-Dianping real-time computing platform.
- At the bottom is the collection layer. This layer collects users' real-time data, including binlog, back-end service logs and IoT data. After processing by the log collection team and the DB collection team, the data is written to Kafka. The data feeds both real-time and offline computation.
- Above the collection layer is the storage layer. Besides using Kafka as the message channel, this layer stores state data on HDFS and dimension data on HBase.
- Above the storage layer is the engine layer, which includes Storm and Flink. At this layer the real-time computing platform provides users with framework wrappers and support for common packages and components.
- Above the engine layer is the platform layer, which manages data, tasks and resources.
- At the top of the architecture is the application layer, which includes the real-time data warehouse, machine learning, data synchronization and event-driven applications.
This talk focuses on the construction of the real-time data warehouse.
From a functional point of view, the real-time computing platform of Meituan-Dianping provides two main functions: job management and resource management. Job management covers job configuration, job publishing and job status.
- Job configuration includes job settings, runtime settings and topology settings.
- Job publishing includes version management, compilation, release and rollback.
- Job status includes runtime status, custom metrics and alarms, and command and runtime logs.
Resource management provides users with multi-tenant resource isolation, as well as resource delivery and deployment.
Business data warehouse practice
As mentioned earlier, the Meituan-Dianping real-time computing platform focuses on security, ease of use and stability, and one of its major application scenarios is the business data warehouse. Below are some examples of business data warehouses.
The first example is traffic. The traffic data warehouse is the basic service of the traffic business. From the perspective of business channels, there are tracking points for different channels and tracking data for different pages. After passing through the log collection channel, the data enters the basic detail layer and is split by business dimension into different business channels, such as the Meituan channel and the food delivery channel.
Within each business channel there is a finer-grained split, for example into exposure logs, "guess you like" and recommendation. The resulting data is used in two ways: it is provided as streams to downstream business teams, and it is used for real-time traffic analysis.
In the figure below, the right side shows the architecture of the traffic data warehouse, which is divided into four layers from bottom to top. The lowest is the SDK layer, which includes tracking points in the front end, mini programs and the app. Above it is the collection layer, where tracking logs land in Nginx and are delivered to Kafka through the log collection channel. In the computing layer, the traffic team built an upper-level SQL encapsulation on top of Storm and implemented dynamic SQL updates, so a job does not need to be restarted when its SQL changes.
Real-time verification of advertising effect
Here is another example based on the traffic data warehouse: real-time verification of advertising effect. The left side of the figure below compares real-time advertising effect. Ad tracking generally includes request (PV) tracking, SPV (server PV) tracking, CPV (client PV) exposure tracking and CPV click tracking. Every tracking event carries a traffic request ID and the experiment path that was hit. Using the request ID and the experiment path, all logs can be joined to obtain all the data for one request. The joined data is then stored in Druid for analysis, supporting effect verification such as comparing actual CTR with estimated CTR.
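The join described above can be sketched in plain Python. This is only an illustration under stated assumptions: the event field names (`type`, `request_id`, `exp_path`, `estimated_ctr`) and the sample data are hypothetical, and the real pipeline runs on a streaming engine with Druid as the store, not on in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical tracking events; field names are assumptions for illustration.
events = [
    {"type": "request", "request_id": "r1", "exp_path": "A", "estimated_ctr": 0.30},
    {"type": "exposure", "request_id": "r1", "exp_path": "A"},
    {"type": "click", "request_id": "r1", "exp_path": "A"},
    {"type": "request", "request_id": "r2", "exp_path": "A", "estimated_ctr": 0.10},
    {"type": "exposure", "request_id": "r2", "exp_path": "A"},
    {"type": "request", "request_id": "r3", "exp_path": "B", "estimated_ctr": 0.20},
    {"type": "exposure", "request_id": "r3", "exp_path": "B"},
]


def join_by_request(events):
    """Merge all events sharing (request_id, exp_path) into one record."""
    joined = defaultdict(lambda: {"exposures": 0, "clicks": 0, "estimated_ctr": None})
    for e in events:
        rec = joined[(e["request_id"], e["exp_path"])]
        if e["type"] == "request":
            rec["estimated_ctr"] = e["estimated_ctr"]
        elif e["type"] == "exposure":
            rec["exposures"] += 1
        elif e["type"] == "click":
            rec["clicks"] += 1
    return dict(joined)


def ctr_by_experiment(joined):
    """Aggregate joined records per experiment path: actual vs. estimated CTR."""
    agg = defaultdict(lambda: {"exposures": 0, "clicks": 0, "est_sum": 0.0, "n": 0})
    for (_, exp_path), rec in joined.items():
        a = agg[exp_path]
        a["exposures"] += rec["exposures"]
        a["clicks"] += rec["clicks"]
        if rec["estimated_ctr"] is not None:
            a["est_sum"] += rec["estimated_ctr"]
            a["n"] += 1
    return {
        path: {
            "actual_ctr": a["clicks"] / a["exposures"] if a["exposures"] else 0.0,
            "estimated_ctr": a["est_sum"] / a["n"] if a["n"] else None,
        }
        for path, a in agg.items()
    }


report = ctr_by_experiment(join_by_request(events))
```

Keying the join on both the request ID and the experiment path keeps the records of different experiments separate, so the actual and estimated CTR can be compared per experiment.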
Another example of business data warehouse practice is instant delivery. Real-time data plays an important role in the operating strategy of instant delivery. Take delivery time estimation as an example: the delivery time measures how difficult a delivery is for the rider. The whole fulfillment time is divided into multiple segments. The delivery data warehouse cleans and extracts feature data on Storm, and the algorithm team uses it for training to produce time estimates. This process involves merchants, riders and users, so there are many features and a large volume of data.
Business real-time data warehouses fall roughly into three types of scenario: traffic, business and feature. They differ in the following ways.
- In terms of data model, traffic data uses flat wide tables, the business data warehouse relies more on normal-form modeling, and feature data uses KV storage.
- In terms of data sources, the traffic data warehouse draws on log data, the business data warehouse draws on business binlog data, and the feature data warehouse draws on a variety of sources.
- In terms of data volume, traffic and feature data warehouses hold massive data, at the level of 10 billion or more per day, while business data warehouses generally see millions to tens of millions per day.
- In terms of update frequency, traffic data is rarely updated, while business and feature data are updated more often. Traffic data is generally concerned with time series and trends, while business and feature data are concerned with state changes.
- In terms of data accuracy, the requirements for traffic data are relatively low, while those for business and feature data are high.
- In terms of model adjustment frequency, business data models are adjusted often, while traffic and feature data models are adjusted rarely.
2. Flink-based real-time data warehouse platform
The previous section introduced the business scenarios of the real-time data warehouse. This section introduces the evolution of the real-time data warehouse and the thinking behind Meituan-Dianping's real-time data warehouse platform.
Traditional data warehouse model
To organize and manage data more effectively, data warehouses are usually built in layers, generally four from bottom to top: ODS (operational data layer), DWD (data detail layer), DWS (summary layer) and the application layer. Ad hoc queries are mainly served by Presto, Hive and Spark.
Real-time data warehouse model
The layering of a real-time data warehouse generally follows the traditional model, with an ODS operational data layer, a DWD detail layer, a DWS summary layer and an application layer. However, the data is handled differently. For example, detail-layer and summary-layer data is generally placed on Kafka, dimension data is generally placed in KV storage such as HBase or Tair for performance reasons, and ad hoc queries can be completed with Flink.
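As a rough illustration of this layering, the sketch below pushes a few raw records through a detail step that looks up dimension data and a summary step that aggregates. Everything here is an assumption for illustration: the dictionary stands in for a KV store such as HBase or Tair, the lists stand in for Kafka topics, and the field names are hypothetical.

```python
from collections import Counter

# Stand-in for a KV dimension store (assumption: keyed by merchant_id).
DIM_MERCHANT = {"m1": {"city": "Beijing"}, "m2": {"city": "Shanghai"}}


def ods_to_dwd(ods_records, dim):
    """Detail layer: enrich each raw record with dimension attributes."""
    for rec in ods_records:
        city = dim.get(rec["merchant_id"], {}).get("city", "unknown")
        yield {**rec, "city": city}


def dwd_to_dws(dwd_records):
    """Summary layer: order count and total amount per city."""
    orders, amount = Counter(), Counter()
    for rec in dwd_records:
        orders[rec["city"]] += 1
        amount[rec["city"]] += rec["amount"]
    return {city: {"orders": orders[city], "amount": amount[city]} for city in orders}


# Hypothetical ODS records, as they might arrive from a message queue.
ods = [
    {"order_id": 1, "merchant_id": "m1", "amount": 30},
    {"order_id": 2, "merchant_id": "m2", "amount": 20},
    {"order_id": 3, "merchant_id": "m1", "amount": 15},
]
summary = dwd_to_dws(ods_to_dwd(ods, DIM_MERCHANT))
```

Keeping the dimension lookup as a point read against a KV store is what makes the detail step cheap enough to run per record.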
Quasi-real-time data warehouse model
Besides these two models, we found a quasi-real-time data warehouse model in the practice of business teams. It is not built entirely on streams. Instead, detail-layer data is imported into OLAP storage, and summarization and further processing rely on the computing power of the OLAP engine.
Comparison between real-time and traditional data warehouses
The comparison between the real-time data warehouse and the traditional data warehouse can be considered from four aspects:
- The first is the layering mode. For efficiency, an offline data warehouse usually trades space for time and has more layers. A real-time data warehouse, for the sake of timeliness, has fewer layers, which also reduces the chance of errors in intermediate steps.
- The second is fact data storage. An offline data warehouse stores fact data on HDFS, while a real-time data warehouse stores it on a message queue such as Kafka.
- The third is dimension data storage. A real-time data warehouse places dimension data in KV storage.
- The fourth is data processing. An offline data warehouse generally relies on batch processing with Hive and Spark, while a real-time data warehouse relies on real-time computing engines such as Storm and Flink and is mainly stream-based.
Comparison of real-time data warehouse construction schemes
The figure below compares two ways of building a real-time data warehouse: the quasi-real-time data warehouse and the real-time data warehouse. They are implemented on an OLAP engine and a stream computing engine respectively, and their latency is on the order of minutes and seconds respectively.
- In terms of scheduling overhead, a quasi-real-time data warehouse is a batch process, so it still needs a scheduling system. Its scheduling overhead is lower than that of an offline data warehouse but still exists, whereas a real-time data warehouse has no scheduling overhead.
- In terms of business flexibility, the quasi-real-time data warehouse is built on an OLAP engine, so it is more flexible than the stream-computing approach.
- In terms of tolerance for late-arriving data, the quasi-real-time data warehouse can run a full computation over the data in a cycle, so its tolerance is relatively high. The real-time data warehouse uses incremental computation, so its tolerance is lower.
- In terms of scalability, computation and storage are integrated in the quasi-real-time data warehouse, so it scales less well than the real-time data warehouse.
- In terms of applicable scenarios, the quasi-real-time data warehouse suits cases with moderate real-time requirements, small data volume, complex multi-table joins and frequent business changes, such as real-time analysis of transactions. The real-time data warehouse suits cases with high real-time requirements and large data volume, such as real-time features, traffic distribution and real-time traffic analysis.
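The difference in tolerance for late-arriving data can be made concrete with a small sketch. It is a deliberately simplified model, not how any particular engine works: `full_recompute` stands in for an OLAP query over stored detail data, and `IncrementalSum` stands in for a streaming aggregation that treats a window as final once it is closed. All names and the sample events are hypothetical.

```python
def full_recompute(events, window_start, window_end):
    """Quasi-real-time style: recompute the whole window from stored detail data."""
    return sum(e["value"] for e in events
               if window_start <= e["event_time"] < window_end)


class IncrementalSum:
    """Streaming style: add each event once and drop events for closed windows."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.sums = {}          # window start -> running sum
        self.closed_before = 0  # windows starting before this time are final
        self.dropped = 0

    def on_event(self, e):
        start = e["event_time"] // self.window_size * self.window_size
        if start < self.closed_before:
            self.dropped += 1   # too late: the result was already emitted
            return
        self.sums[start] = self.sums.get(start, 0) + e["value"]

    def close_up_to(self, time):
        self.closed_before = max(self.closed_before, time)


# Two on-time events, then one event that arrives after its window was closed.
on_time = [{"event_time": 1, "value": 10}, {"event_time": 5, "value": 20}]
late = {"event_time": 3, "value": 5}

stream = IncrementalSum(window_size=10)
for e in on_time:
    stream.on_event(e)
stream.close_up_to(10)   # window [0, 10) is emitted and considered final
stream.on_event(late)

incremental_result = stream.sums[0]
recomputed_result = full_recompute(on_time + [late], 0, 10)
```

In this model the recomputation picks up the late event on its next cycle, while the incremental result stays at the value emitted when the window closed. Stream engines offer ways to soften this, but they add state and complexity.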
To sum up, building on an OLAP engine is a compromise that improves timeliness and development efficiency when the data volume is not too large and business traffic is not too high. Looking at future trends, the real-time data warehouse based on stream computing has better prospects.
One-stop solution
In business practice we saw common needs across teams building real-time data warehouses: metadata of different businesses is fragmented, business developers prefer to use SQL for both offline and real-time data warehouses, and more operation and maintenance tooling is required. So we planned a one-stop solution that connects the whole process.
The one-stop solution mainly provides users with a data development platform and metadata management. In addition, considering the problems a business faces from production to application, our OLAP production platform addresses OLAP production in terms of modeling method, production task management and resources. On the left are our existing data security system, resource system and data governance, which are shared by the offline and real-time data warehouses.
Flink was chosen for the real-time data warehouse platform based on the following four considerations, which are also the core concerns of a real-time data warehouse.
- The first is state management. A real-time data warehouse involves a lot of aggregation, which requires accessing and managing state. Flink is mature in this respect.
- The second is table semantics. Flink provides a rich multi-level API, including the DataStream API, the Table API and Flink SQL.
- The third is a mature ecosystem. Real-time data warehouses are used widely, and users need access to many kinds of storage. Flink's support here is relatively complete.
- Finally, Flink offers the possibility of unifying stream and batch processing.
Real-time data warehouse platform
The construction of the real-time data warehouse platform is divided into four levels from the outside in. We believe the platform should provide users with abstract expression capabilities: message expression, data expression, computing expression, and stream-batch unification.
Real-time data warehouse platform architecture
The figure below shows the architecture of the Meituan-Dianping real-time data warehouse platform. From the bottom up, the resource layer and storage layer reuse the capabilities of the real-time computing platform. The engine layer implements some extensions on top of Flink Streaming, including the integration of UDFs and connectors. Above that is a separate SQL layer based on Flink SQL, which is mainly responsible for parsing, validation and optimization. Above that is the platform layer, including the development workbench, metadata, the UDF platform and the OLAP platform. The top layer consists of the real-time data warehouse applications supported by the platform, including real-time reports, real-time OLAP, real-time dashboards and real-time features.
Message expression – data access
At the message expression level, the data formats of binlog, tracking logs, back-end logs and IoT data are inconsistent, so the Meituan-Dianping real-time data warehouse platform provides a data access process that synchronizes the data to the ODS layer. Two things are implemented here: a unified message protocol and the hiding of processing details.
An example of the access process is shown on the left side of the figure below. For binlog data, the platform also supports sharded databases and tables: data from different shards that belong to the same business can be collected into the same ODS table according to business rules.
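One way to picture the shard merging and the unified message format is the sketch below. It is only an illustration under assumptions: the routing rules, the shard naming scheme (`order_db_03.orders_0017`), the ODS table names and the message fields are all hypothetical, since the source does not describe the actual protocol.

```python
import re

# Hypothetical business rules: which physical shards feed which ODS table.
ROUTING_RULES = [
    (re.compile(r"^order_db_\d+\.orders_\d+$"), "ods_orders"),
    (re.compile(r"^user_db_\d+\.users_\d+$"), "ods_users"),
]


def route_to_ods(source_table):
    """Return the ODS table for a physical 'db.table' name, or None if unmatched."""
    for pattern, ods_table in ROUTING_RULES:
        if pattern.match(source_table):
            return ods_table
    return None


def to_unified_message(binlog_event):
    """Wrap a binlog change in one common envelope (field names are assumptions)."""
    source = f'{binlog_event["db"]}.{binlog_event["table"]}'
    ods_table = route_to_ods(source)
    if ods_table is None:
        raise ValueError(f"no routing rule for {source}")
    return {
        "ods_table": ods_table,
        "source": source,
        "op": binlog_event["op"],
        "ts": binlog_event["ts"],
        "payload": binlog_event["row"],
    }


msg = to_unified_message({
    "db": "order_db_03", "table": "orders_0017",
    "op": "INSERT", "ts": 1570000000000,
    "row": {"order_id": 42, "amount": 30},
})
```

Downstream jobs then read one logical table and do not need to know how many shards exist or how they are named.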
Computing expression – extended DDL
Building on Flink, the Meituan-Dianping real-time data warehouse platform has extended DDL. The main purposes are to build the metadata system and to connect the mainstream internal real-time storage systems, including KV data and OLAP data. Because the development workbench and the metadata system are connected, many data details need not be declared explicitly in DDL. It is enough to write the data name in the declaration together with some runtime settings, such as whether MQ consumption starts from the latest position, the earliest position, or a given timestamp. Other kinds of data are accessed in the same way.
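The idea that a declaration only needs a data name plus a few runtime settings can be sketched as a metadata lookup. This is a hypothetical model, not the platform's DDL syntax or API: the metadata entries, property names and the `resolve_source` helper are assumptions made for illustration.

```python
# Hypothetical metadata service content: details the user no longer declares.
METADATA = {
    "ods_orders": {"type": "mq", "topic": "ods_orders_topic", "format": "json"},
}

STARTUP_MODES = {"latest", "earliest", "timestamp"}


def resolve_source(name, startup="latest", timestamp_ms=None):
    """Expand a short declaration into a full source configuration."""
    if name not in METADATA:
        raise KeyError(f"unknown data name: {name}")
    if startup not in STARTUP_MODES:
        raise ValueError(f"unsupported startup mode: {startup}")
    if startup == "timestamp" and timestamp_ms is None:
        raise ValueError("timestamp mode requires timestamp_ms")

    conf = dict(METADATA[name])  # copy so the metadata entry is not mutated
    conf["name"] = name
    conf["startup"] = startup
    if startup == "timestamp":
        conf["startup_timestamp_ms"] = timestamp_ms
    return conf


latest_conf = resolve_source("ods_orders")
replay_conf = resolve_source("ods_orders", startup="timestamp",
                             timestamp_ms=1570000000000)
```

The user supplies only what varies per job, and the connection details come from a single place, which is also what keeps metadata from fragmenting across businesses.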
Computing expression – UDF platform
For the UDF platform, there are three aspects to consider:
- The first is data security. Previously, users could upload JAR packages and reference UDFs directly, which is dangerous because we cannot know where the data flows. For data security, the platform performs code audits and lineage analysis, and can consolidate components with historical risks or known problems.
- The second is the runtime quality of UDFs. The platform provides management of templates, use cases and tests, hides compilation, packaging and JAR management from users, and builds metric logging and exception handling into the UDF templates.
- The third is UDF reusability. UDFs developed by one business team are likely to be used by others, but incompatibilities may arise during upgrades. Therefore the platform provides project management, function management and version management.
UDFs are in fact widely used. The UDF platform supports not only the real-time data warehouse but also the offline data warehouse, machine learning, query services and other scenarios. In the figure below, the right side shows how a UDF is used and the left side shows how a UDF is developed. During development, the user only needs to handle registration; compilation, packaging, testing and uploading are completed by the platform. During use, the user only needs to declare the UDF; the platform performs parsing and validation, path resolution and integration when the job is submitted.
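The declare-then-resolve flow, together with version management, might look roughly like the sketch below. The registry layout, artifact paths and the `resolve_udf` helper are hypothetical; the source does not describe the platform's actual interfaces.

```python
# Hypothetical registry: (project, function) -> version -> artifact path.
REGISTRY = {
    ("geo", "distance"): {
        "1.0": "udf/geo/distance-1.0.jar",
        "1.1": "udf/geo/distance-1.1.jar",
        "1.10": "udf/geo/distance-1.10.jar",
    },
}


def version_key(version):
    """Compare versions numerically, so that 1.10 sorts after 1.9."""
    return tuple(int(part) for part in version.split("."))


def resolve_udf(project, function, version=None):
    """Validate a UDF declaration and return the artifact path to integrate."""
    versions = REGISTRY.get((project, function))
    if not versions:
        raise LookupError(f"UDF {project}.{function} is not registered")
    if version is None:
        version = max(versions, key=version_key)  # default to the newest version
    if version not in versions:
        raise LookupError(f"UDF {project}.{function} has no version {version}")
    return versions[version]


pinned = resolve_udf("geo", "distance", "1.0")
newest = resolve_udf("geo", "distance")
```

Letting a job pin a version is one way to shield it from incompatible upgrades made by the team that owns the function.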
Real-time data warehouse platform web IDE
Finally, the development workbench of the real-time data warehouse platform integrates model, job and UDF management in the form of a web IDE. Users develop in SQL in the web IDE. The platform manages SQL versions and lets users roll back to a previously deployed version.
3. Future development and thinking
Automatic resource tuning
Looking at real-time computing as a whole, the real-time computing platform of Meituan-Dianping has already reached thousands of nodes and is likely to reach tens of thousands, so resource optimization will soon be on the agenda. Business traffic has peaks and troughs, so a real-time job may need a lot of resources at peak but far fewer in the trough.
In addition, the peak itself changes: as the business grows, the resources originally allocated may no longer be enough. Automatic resource tuning therefore has two meanings. First, a job adapts as its peak traffic grows, and its maximum is raised automatically. Second, a job adapts to the drop in traffic after the peak and scales in quickly. By fitting the historical runs of each job, or even each operator, we can obtain a function relating operators, traffic and resources, and adjust the amount of resources in step when traffic changes.
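A minimal sketch of fitting such a function from history is shown below. It assumes, purely for illustration, that resources grow linearly with traffic and uses ordinary least squares. The source does not say which model is used, and the sample history, the "slots" unit and the bounds are hypothetical.

```python
import math


def fit_linear(traffic, resources):
    """Ordinary least squares fit of resources = slope * traffic + intercept."""
    n = len(traffic)
    mean_x = sum(traffic) / n
    mean_y = sum(resources) / n
    sxx = sum((x - mean_x) ** 2 for x in traffic)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(traffic, resources))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recommend_slots(traffic, slope, intercept, min_slots=1, max_slots=64):
    """Predict the slots needed for the given traffic, clamped to safe bounds."""
    predicted = math.ceil(slope * traffic + intercept)
    return max(min_slots, min(max_slots, predicted))


# Hypothetical history of one operator: messages per second vs. slots used.
history_traffic = [1000, 2000, 3000, 4000]
history_slots = [2, 4, 6, 8]

slope, intercept = fit_linear(history_traffic, history_slots)
peak_slots = recommend_slots(5200, slope, intercept)    # scale out for a higher peak
trough_slots = recommend_slots(100, slope, intercept)   # scale in after the peak
```

The same fitted function serves both meanings of tuning: it raises the allocation when the peak grows and lowers it when traffic falls. The lower and upper bounds guard against a bad fit.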
That is the idea behind resource optimization. We also need to consider how to use resources once they are optimized. To ensure availability, real-time and offline tasks are generally deployed separately. Otherwise, bandwidth and IO may be saturated by offline computing, which delays real-time tasks. To raise resource utilization, we need to consider deploying real-time and offline tasks together, or handling some real-time tasks in a streaming fashion. This requires finer-grained resource isolation and faster resource release.
Upgrading the way real-time data warehouses are built
Building a real-time data warehouse generally involves several steps:
- First, the business raises requirements, followed by design and modeling, business logic development and the underlying technical implementation. Meituan-Dianping's approach is to unify the technical expression so that business teams can focus on logic development, and logic development itself can be built automatically from configuration.
- The next level is intelligent modeling driven by business requirements, which automates the design and modeling process.
At present, the Meituan-Dianping real-time data warehouse platform is still focused on the unified-expression level, and there is a long way to go to reach the ideal state.