The purpose of real-time data warehouse construction
Solving the problems a traditional data warehouse cannot
"Real-time data warehouse" is an easily confusing concept: on its face the term sounds like a contradiction, like making the black background of a PPT slide whiter. From traditional experience, one very important function of a data warehouse is to record history. A data warehouse normally aims to hold data from the first day the business went online and to keep recording up to the present.
Real-time processing, on the other hand, emphasizes the current state of the data. When two such opposing approaches are combined, the result is unlikely to solve a very wide range of problems. Therefore, the purpose of building a real-time data warehouse is narrow: to solve only those problems that the traditional data warehouse cannot solve because of its low timeliness.
Because of this characteristic, we follow two principles:
- Problems that the traditional data warehouse can solve should not be solved with the real-time data warehouse. For example, historical statistics for last month will not be built on the real-time data warehouse.
- Problems that are not suited to a data warehouse in the first place should not be solved with the real-time data warehouse either. For example, requirements that are highly business-specific, or that demand extremely high timeliness; we do not recommend serving these through the real-time data warehouse.
Of course, for the whole system to still qualify as a data warehouse, we impose on ourselves the same requirements as for an offline data warehouse: the real-time data warehouse must be subject-oriented, integrated, and relatively stable.
The difference is that an offline data warehouse preserves all historically accumulated data, while the real-time data warehouse keeps only the data from the previous offline batch up to the present. That statement sounds awkward, but it is quite easy to operate.
Generally speaking, the solution is to retain about three days of data: keeping three days reliably guarantees two complete days, so you can still serve complete data for yesterday even when the offline batch job has not finished processing yesterday's data.
Application scenarios of the real-time data warehouse
- Real-time OLAP analysis
OLAP analysis is exactly the class of problems a data warehouse is suited to, and extending it into a real-time data warehouse improves its timeliness. Even at the analysis layer, existing OLAP tools may be able to analyze real-time data without much modification.
- Real-time dashboards
This scenario is easy to grasp; think of the large screen at Tmall's Double 11 that scrolls through changes in core metrics in real time. Meituan has not only sales-promotion businesses but also major in-store businesses, and store owners may be just as concerned about the daily, real-time sales of each of their business lines.
- Real-time features
A real-time feature is an aggregate indicator computed online and attached to a merchant or user as a tag. For example, a user who purchases goods many times may be flagged in the backend as a high-quality user, or a merchant may be flagged as having rising popularity. Such stores or merchants can then be given priority during real-time precision operations.
- Real-time business monitoring
Meituan-Dianping also monitors some core business indicators. For example, when a problem occurs online, certain business indicators may decline; monitoring lets us discover such problems as early as possible and reduce losses.
How to build a real-time data warehouse
Concept mapping for the real-time data warehouse
A correspondence table between offline and real-time data warehouse development can help you quickly and clearly understand the concepts of the real-time data warehouse.
- Programming mode
The most common solution for offline development is Hive SQL plus some extended UDFs. Mapped to the real-time data warehouse, this becomes Flink SQL, likewise extended with UDFs.
- Operation execution level
Offline processing executes as MapReduce or Spark jobs; the real-time counterpart is a continuously running Flink streaming program.
- Data warehouse object level
The offline data warehouse uses Hive tables; in the real-time data warehouse we use the stream table as the table abstraction.
- Physical storage
The offline data warehouse stores data on HDFS in most cases; the real-time data warehouse usually stores data in a message queue such as Kafka.
Overall architecture of real time data warehouse
We previously published a share on why we chose Flink for the real-time data warehouse, which focused on the reasons and ideas behind the technology selection; for details, see "Meituan-Dianping's Real-Time Data Warehouse Construction Practice Based on Flink". This article instead centers on the data itself. The following is the data architecture diagram of our current real-time data warehouse.
Meituan-Dianping's real-time data warehouse construction practice based on Flink
As the diagram shows, the data architecture of the real-time data warehouse resembles that of the offline one: it has a hierarchical structure with an ODS layer, a detail layer, a summary layer, and even an application layer, and the naming patterns may be the same. There are, however, many differences between the two.
The main difference from the offline data warehouse is that the real-time data warehouse has fewer layers.
In our experience building the offline data warehouse, the summary layer is far from that simple: there are usually concepts such as a light summary layer, so the summary layer in fact contains multiple layers. The other difference is the application layer. In offline construction the application layer lives inside the warehouse, and after it is built, synchronization tasks copy its data into the databases of the application systems.
In the real-time data warehouse, the so-called app-layer tables already live in the application systems' databases. Although the app layer is drawn in the figure above, it is not really a set of tables inside the warehouse; in essence the data is already stored in the serving systems.
Why are there fewer subject layers? Because in real-time processing, every layer that data passes through inevitably adds a certain delay.
Why should the summary layer be kept as thin as possible? Because, to tolerate the late arrival of some data, artificial delays may be introduced to guarantee accuracy.
For example, to count data up to 10:00, you may wait until 10:00:05 or 10:00:10 to make sure all data before 10:00 has been received. If the summary layer has too many levels, each level adds to this man-made delay.
We therefore suggest keeping the number of layers as small as possible; the summary layers in particular had better not exceed two. The detail layer may have more layers, and it is fine to carry that design concept over from the offline warehouse.
The second big difference is the storage of data sources.
When building the offline data warehouse, the whole warehouse may sit on Hive tables running on Hadoop. When building the real-time data warehouse, even the same table may be stored in different ways.
For example, in common cases most detail data is stored in Kafka, while dimension data is stored in KV storage systems such as Tair or HBase; in fact, some summary data may also be stored in Kafka. The specific reasons will be analyzed in detail later. Besides the overall architecture, we also share the key points of building each layer.
Construction of the ODS layer
Unify data sources as far as possible, and use partitioning to guarantee the local ordering of data.
The first key point of construction is the ODS layer. In fact, building an ODS layer is not necessarily tied to the warehouse: as soon as you develop with Flink you must have a real-time data source, which today usually means a message queue such as Kafka. The data sources we mainly deal with are binlogs, traffic logs, and system logs.
I want to make two main points here.
First, with so many possible data sources, how do we choose? From data warehouse experience, the sources should be unified as far as possible, and this unification has two meanings.
- The first unification is that the real-time data sources should be internally consistent. If you access a certain kind of data from a certain system, access it either entirely from the binlog or entirely from the system log; it is better not to mix the two. Without understanding how the data is produced, accessing part of it through the binlog and part through the system log easily causes data disorder.
- The second unification is between real time and offline, which may matter even more. Although we are building a real-time warehouse, it is still a data warehouse in essence: across the team, an indicator's calculation logic and data source should be completely consistent, so that people using the data are not misled. If more than one source can provide the data, we suggest choosing the same data source as the offline team. Our company is also doing work to guarantee the consistency of offline and real-time data sources.
The second key point is data disorder. When collecting data we face a big problem: because of partitioning, data that occurred later may be consumed before data that occurred earlier. To solve this we use a data component built inside Meituan.
The main idea for guaranteeing order is to use Kafka partitions to guarantee the local ordering of data within each partition. For the concrete operation, refer to "Meituan-Dianping's Real-Time Data Warehouse Construction Practice Based on Flink". This is a solution built by Meituan's data synchronization team; it provides rich strategies to guarantee that the same piece of data is consumed in its production order, solving disorder at the source.
Construction of the DW layer
Resolve the noise, incompleteness, and inconsistent form of the raw data, forming a standard, unified data source that is as consistent with offline as possible.
The idea for building the detail layer is basically the same as for the offline warehouse: resolve the noise, incompleteness, and non-uniform form that may exist in the ODS-layer data, so that the warehouse holds a uniform data source that meets the specification. Our suggestion is that, where possible, what enters the warehouse and how it enters should be kept consistent with offline.
In particular, some data sources are relatively uniform but their development logic changes frequently. In this case we adopt a set of configuration-based warehousing rules. The offline team may already have a warehousing system: once rules are configured, it knows which data in a table should enter the real-time warehouse and which fields to take. If the real-time and offline teams then use the same configuration, the offline and real-time warehouses can stay consistent at the DW layer over the long term.
In fact, the main work of building the DW layer consists of the following four parts.
The only part marked in red is model standardization. Model standardization is a commonplace topic; nearly every team writes its own specification before building a warehouse, but in practice not every team manages to implement it in the end.
In real-time warehouse construction, model standardization deserves special attention, because real-time jobs run and are scheduled 7×24; the operations cost of modifying a field can therefore be very high. In the offline warehouse you can change a table, and as long as the downstream jobs are adjusted within the day there is no problem. The real-time warehouse is different: the moment an upstream table structure changes, downstream jobs must already be able to parse the new upstream data correctly.
Moreover, a system like Kafka is not structured storage and has no metadata concept; you cannot simply rename a non-standard table or change a column type after the fact as you would with a database table, so governing later is very costly. We therefore suggest enforcing model standardization from the very beginning of construction, to avoid huge governance costs later.
- Duplicate data processing
In addition to the data itself, we attach some extra information to each record to handle common problems in real-time data production.
- Unique key and primary key
We attach both a unique key and a primary key to each piece of data; the two form a pair. The primary key identifies a logical row of data, which may change many times while keeping the same primary key; the unique key identifies one specific change, so each change carries its own unique key. The unique key solves data duplication: data is produced outside the warehouse, and it is hard to guarantee that data outside the warehouse is never duplicated.
Someone may deliver data to you and warn that it may contain duplicates. Generating a unique key gives the DW layer an identifier with which to eliminate the double counting caused by upstream duplicates. The most important use of the primary key is partitioning in Kafka: following the same ordered-partition principle as the ODS connection described above, partitioning by primary key guarantees that the changes of a single row are consumed in order.
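To make the pair concrete, here is a minimal Python sketch (the record fields and partition count are illustrative assumptions, not our actual component): the primary key routes every change of a row to the same Kafka partition, and the unique key filters out upstream duplicates.

```python
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    primary_key: str  # identifies the logical row; stable across changes
    unique_key: str   # identifies one specific change of that row
    payload: str

def partition_for(record: Record, num_partitions: int) -> int:
    # Hash the primary key so every change of the same row lands in the
    # same partition and is therefore consumed in production order.
    return zlib.crc32(record.primary_key.encode()) % num_partitions

def dedup(records):
    # Drop duplicates reissued upstream, keyed by the unique key.
    seen, out = set(), []
    for r in records:
        if r.unique_key not in seen:
            seen.add(r.unique_key)
            out.append(r)
    return out
```

All changes of `order-1` map to one partition, while a re-delivered change is dropped exactly once by its unique key.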
- Version and batch
Version and batch form another pair. The names themselves can be chosen freely; what matters is the logic.
First, version. A version corresponds to a table structure, i.e. to the data of one schema. When processing real-time data, downstream scripts are developed against the schema of the upstream table. When that table structure changes there are two cases. If the added or removed fields are not used, nothing needs to be sensed and no action is required. If changed fields are used, there is a problem: the Kafka topic now effectively holds data with two different table structures. We then need a version marker to tell us which table structure to use when processing each consumed record; hence the version concept.
Second, batch. A batch covers a rarer scenario: data retransmission (backfill), which differs from a restart. A restart resumes, with possibly changed code, from the last consumed position; a retransmission moves the consumption position itself.
For example, today's data was miscalculated and the leader urgently wants it fixed, so I need to recalculate today's data. After modifying the program I have to start a second copy that begins, say, from midnight today. Because the whole pipeline runs 7×24, the original program cannot simply be stopped; only after the retransmitting program catches up with the live data can the original one be stopped, and the retransmitted data then updates the result layer.
In that window there are necessarily two sets of data, and batches are what distinguish them: each job consumes only the data of its specified batch. When a backfill job starts, only jobs consuming the backfill batch read the retransmitted data; once it catches up, all jobs on the original batch can be stopped. This solves the data re-import problem.
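The batch mechanism above can be sketched in a few lines of Python (the field names are illustrative): every job filters on its assigned batch, so live and backfill data can coexist in the same topic without double counting.

```python
def consume(messages, active_batch):
    # A job started for a given batch ignores every other batch's data,
    # so a backfill batch can run alongside the original one.
    return [m for m in messages if m["batch"] == active_batch]

messages = [
    {"batch": 1, "value": 10},  # original run
    {"batch": 2, "value": 11},  # corrected backfill of the same data
    {"batch": 1, "value": 20},
    {"batch": 2, "value": 21},
]
live_job = consume(messages, 1)
backfill_job = consume(messages, 2)
```

Once `backfill_job` has caught up with the live data, the jobs consuming batch 1 can be stopped and batch 2 becomes the active one.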
Dimension data construction
The second topic is dimension data. Our detail layer includes dimension data, which falls into two categories handled by different methods.
- Dimensions with low frequency of change
The first category is data that changes at a relatively low frequency and may essentially never change: geographic dimension information, holiday information, fixed code-to-name conversions, and so on.
We access this data by keeping the corresponding dimension table in the offline warehouse and loading it into a cache through a synchronization job. Some dimension data is created quickly and new entries keep appearing, but once created, an entry no longer changes.
For example, when Meituan opens a new store, fixed attributes such as the name of the city where the store is located may not change for a long time, so taking the latest value suffices. In this case we access the latest data directly through some public in-company services, and finally wrap everything in a "dimension service" concept that shields users from where the details are actually queried: they simply associate specific dimension information through the dimension service.
- Dimensions with high frequency of change
The second category is data that changes at a high frequency, for example frequently changing business states or the price of a commodity. These values tend to change often and quickly over time, so the processing is a bit more complex. For dimension data such as price that changes frequently, we monitor its changes: if price is treated as a dimension, we listen to the messages announcing price changes for that dimension and build a zipper table (a slowly-changing-dimension history) of the price changes.
Once the dimension zipper table exists, when a record arrives you can determine the exact dimension value that applied at the record's event time, avoiding the wrong associations caused by rapidly changing dimensions.
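A zipper-table lookup can be sketched as a point-in-time query: given a record's event time, return the dimension value that was effective then. This is a simplified illustration, not our production implementation.

```python
import bisect

class ZipperDimension:
    """Append-only change history of one dimension attribute."""
    def __init__(self):
        self.change_times = []  # ascending timestamps of changes
        self.values = []        # value effective from that timestamp on

    def record_change(self, ts, value):
        self.change_times.append(ts)
        self.values.append(value)

    def value_at(self, event_ts):
        # Find the last change at or before the event time.
        i = bisect.bisect_right(self.change_times, event_ts) - 1
        return self.values[i] if i >= 0 else None

price = ZipperDimension()
price.record_change(0, 10.0)    # initial price
price.record_change(100, 12.0)  # price raised at t=100
```

An order with event time 50 joins against 10.0 even if it is processed after the price changed, which is exactly the wrong-association problem the zipper table prevents.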
Another kind of dimension, such as new-versus-old customer, is for us a derived dimension: it is not computed from the dimension itself but from whether the user has placed orders before, so it is in fact a dimension computed from order data.
For such order-derived dimensions, we build calculation models for derived dimensions in the DW layer, whose output is again a zipper table, recording, say, a user's daily change between new and old customer, or the evolution of a high-quality user. Since the zipper table itself must be associated as a dimension, the key-based partitioning described earlier can keep it in order, so it can be associated like an invariant dimension.
Building zipper tables this way is relatively troublesome, so we recommend leaning on external components. In practice we use HBase: HBase natively supports multiple versions of a cell, records the timestamp of each update, and can even use that timestamp as an index when fetching data.
So as long as the data is stored in HBase, combined with MIN_VERSIONS we can guarantee that needed historical versions are not expired away. As stated in the general principle above, the real-time warehouse does not handle what the offline warehouse can handle and only needs to process roughly the last three days of data, so we can also configure a TTL to evict old dimensions promptly. Dimensions from many days ago will no longer be associated, so this keeps dimension data from growing without bound and exploding storage.
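The HBase usage described here can be modeled with a toy versioned store (a sketch only; real HBase keeps versions per cell and applies TTL and MIN_VERSIONS server-side):

```python
class VersionedStore:
    """Toy model of HBase multi-version cells with a TTL."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.cells = {}  # key -> ascending list of (timestamp, value)

    def put(self, key, ts, value):
        self.cells.setdefault(key, []).append((ts, value))
        self.cells[key].sort()

    def get_as_of(self, key, ts):
        # Newest version written at or before ts, like a timestamped Get.
        versions = [v for t, v in self.cells.get(key, []) if t <= ts]
        return versions[-1] if versions else None

    def expire(self, now):
        # Drop versions older than the TTL so storage cannot grow forever.
        for key in list(self.cells):
            self.cells[key] = [(t, v) for t, v in self.cells[key]
                               if now - t <= self.ttl]

store = VersionedStore(ttl=3 * 24 * 3600)  # keep roughly three days
store.put("merchant-1:price", 1_000, 10.0)
store.put("merchant-1:price", 2_000, 12.0)
```

`get_as_of` gives the point-in-time lookup a zipper table needs, and `expire` mimics the TTL that keeps dimension storage bounded.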
Use of dimension data
How to use the dimension data after processing it?
The first and simplest solution is UDTF association: write a UDTF that queries the dimension service described above, and associate it in SQL with the LATERAL TABLE keyword. Both inner and left joins are supported.
The other solution is to parse the SQL, identify the associated dimension tables and the fields used from them, rewrite the original query into something like original_table.flatMap(dimension_table), and finally turn the result of the whole operation into a new table to complete the association.
But this approach requires a lot of surrounding systems: the platform must be able to parse SQL, recognize the dimension tables, remember the information of all dimension tables, and perform the SQL rewriting. It therefore suits teams that already have a mature SQL development framework on top of Flink SQL. If you are just writing encapsulated code, we recommend UDTF association: it is very simple, and the effect is the same.
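In plain Python terms, the UDTF-style association amounts to a per-record lookup against the dimension service (the store IDs and field names below are made up for illustration):

```python
# Stand-in for the dimension service facade; production code would query
# the cache / HBase behind the service instead of this dict.
DIM_STORE = {"s1": {"city": "Beijing"}, "s2": {"city": "Shanghai"}}

def dim_lookup(store_id):
    return DIM_STORE.get(store_id)

def enrich(stream, inner=True):
    # inner=True behaves like an inner join (drop unmatched records);
    # inner=False behaves like a left join (keep them without dims).
    out = []
    for rec in stream:
        dims = dim_lookup(rec["store_id"])
        if dims is None and inner:
            continue
        out.append({**rec, **(dims or {})})
    return out
```

The SQL-rewriting approach produces the same per-record flatMap, just generated automatically from the parsed query.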
Construction of the summary layer
When building the summary layer of real-time data warehouse, there are many things in common with the offline scheme.
The first point concerns common indicators such as PV, UV, and transaction volume: we compute them once, uniformly, in the summary layer. Computing them repeatedly in various scripts not only wastes compute but also risks inconsistent results; the indicator caliber must be unified in one fixed model. Flink SQL supports many of the needed calculations, including count distinct.
Note that count distinct by default stores all the data being de-duplicated in state, so when the distinct cardinality is large it may consume a lot of memory and crash the program. In that case, consider approximate algorithms such as Bloom-filter-based imprecise de-duplication or HyperLogLog's ultra-low-memory de-duplication, which can greatly reduce memory usage.
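As a sketch of the Bloom filter idea (the bit-array size and hash count here are arbitrary toy values): memory stays fixed regardless of cardinality, at the cost of rare false positives that can slightly undercount distinct values.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=8192, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        # Derive several bit positions from salted digests of the item.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def approx_count_distinct(items):
    # Count an item only the first time the filter has not seen it.
    bf, count = BloomFilter(), 0
    for item in items:
        if not bf.might_contain(item):
            bf.add(item)
            count += 1
    return count
```

Unlike exact count distinct, the state here is a fixed 1 KB bit array no matter how many distinct values flow through.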
The second Flink feature is its rich time windows: Flink SQL offers tumble, slide, and session windows. These are hard to express in offline SQL, so we can develop more focused models around them, even using small time windows that offline development rarely touches.
For example, computing the data of the last 10 minutes supports applications built on time-trend charts. One caution: once time windows are used, the corresponding state TTL parameters should be configured, which reduces memory usage and improves the program's efficiency; conversely, a TTL too short for the window may cause incorrect results.
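A tumbling window assigns each event to exactly one fixed-size, non-overlapping bucket; sketched in Python (event times in seconds, using a 10-minute = 600-second window):

```python
def tumble_sum(events, window_size):
    # events: iterable of (event_time, value); returns {window_start: sum},
    # mimicking a TUMBLE window aggregation in Flink SQL.
    windows = {}
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start] = windows.get(window_start, 0) + value
    return windows

# Three events; the first two fall in the [0, 600) window.
events = [(30, 10), (599, 20), (601, 5)]
```

In a streaming engine each finished window's state should then be expired (the TTL discussed above); here the dict simply keeps growing, which is exactly the memory problem TTL prevents.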
The third point is multi-dimensional, subject-oriented summarization in the summary layer. Since the real-time warehouse is subject-oriented and each subject cares about different dimensions, under each subject we aggregate the data along the dimensions that subject cares about and then compute the summary indicators mentioned earlier. There is one catch: a group by without a time window produces a retract stream, and Kafka's sink by default only supports append mode, so a transformation is needed.
To write such data to Kafka, the usual conversion is to drop the retraction (false) messages in the retract stream, keep the insertion (true) messages, convert the result into an append stream, and then write it to Kafka.
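Modeling retract messages as (is_insert, row) pairs, the conversion keeps only the insertions (a simplification; some pipelines instead keep the latest value per key downstream):

```python
def retract_to_append(retract_stream):
    # A retract stream encodes an update as a retraction (False, old_row)
    # followed by an insertion (True, new_row); an append-only sink such
    # as Kafka can accept only the True messages.
    return [row for is_insert, row in retract_stream if is_insert]

stream = [
    (True,  {"city": "beijing", "uv": 1}),
    (False, {"city": "beijing", "uv": 1}),  # retract the old aggregate
    (True,  {"city": "beijing", "uv": 2}),  # emit the updated aggregate
]
```

The consumer of the Kafka topic then sees a sequence of ever-newer aggregates and typically takes the last value per key.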
Fourth, the summary layer does one more important piece of work: processing derived dimensions. If HBase is used to store derived dimensions, its version mechanism makes it much easier to build a zipper table for them and to fetch the exact dimension value that applied to the real-time data at a given time.
Warehouse quality assurance
After the steps above, once the warehouse is up, you will find that keeping it running normally, let alone running with high quality, is very troublesome, and much more complex than offline operations. So after building the warehouse we need many peripheral systems to improve production efficiency.
The following is an introduction to some of the tool chain systems we currently use. The functional structure of the tool chain system is shown in the figure below.
First, the tool chain includes a real-time computing platform, whose main functions are submitting jobs, allocating resources, and monitoring and alerting. Such a platform is needed whether or not you build a data warehouse; it is the basic tool for Flink development.
For us, two tools relate mainly to the data warehouse:
- The system management module, which we share between real time and offline. Its knowledge-base management part records information about the tables and fields in the model, and it also maintains some work-order solutions. Flink management administers the Flink-related system components we developed in-house.
- The real focus is our development tooling for real-time data warehouse ETL. Its main functions are:
- SQL and UDF management: management of SQL scripts and UDFs, plus UDF configuration.
- Task log view and task monitoring.
- Scheduling management, which mainly manages task restarts and data retransmissions.
- Data asset management, real-time and offline metadata management, and task dependency information.
As a matter of fact, each tool in the whole tool chain has its own specific use scenarios. Here are two of them.
Metadata and lineage management
During Flink SQL development we would otherwise have to re-declare metadata for every task. Because Kafka and many cache components, such as Tair and Redis, do not manage metadata themselves, we must build a metadata management system as early as possible.
Lineage management
Lineage matters even more for the real-time warehouse. As mentioned above, in operating real-time jobs, once you modify your own job you must make sure the downstream can still parse the new data correctly. Relying on memory, or on word-of-mouth notification of whoever uses your table, is very inefficient, so a lineage management mechanism is essential. We need to know who consumes each table we produce, and who consumes theirs in turn, so that when we make a change we know whom to check, guaranteeing the stability of the whole real-time data warehouse.
The simplest way to implement a metadata and lineage management system is as follows:
- Generating catalog through metadata service
First, the metadata information in the metadata system is loaded into the program and used to generate the Flink catalog, so that the job knows which tables it can consume and use.
- Parsing DDL statements to create or update tables
When a job performs a series of operations and finally outputs a table, we parse the DDL code of the output part, create the new metadata information, and write it into the metadata system.
- Job information and running status are written into metadata
The metadata of the job itself and its running status are also synchronized into the metadata system; together, these pieces of information let us build the lineage.
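The three steps can feed a minimal lineage store like the sketch below (the table and job names are hypothetical): register each job's inputs and output, then ask which jobs sit downstream of a table before changing its schema.

```python
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        # table name -> list of (job, output_table) pairs reading from it
        self.consumers = defaultdict(list)

    def register(self, job, input_tables, output_table):
        for t in input_tables:
            self.consumers[t].append((job, output_table))

    def downstream_jobs(self, table):
        # Walk the graph transitively: every job that must be checked
        # before this table's schema may change.
        jobs, stack = set(), [table]
        while stack:
            t = stack.pop()
            for job, out in self.consumers[t]:
                if job not in jobs:
                    jobs.add(job)
                    stack.append(out)
        return jobs

lineage = LineageGraph()
lineage.register("dw_orders_job", ["ods_order_binlog"], "dw_orders")
lineage.register("sum_orders_job", ["dw_orders"], "sum_orders")
lineage.register("app_board_job", ["sum_orders"], "app_board")
```

With this in place, a planned change to `dw_orders` immediately yields the full list of affected downstream jobs instead of relying on memory.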
The final system can store this information in a database, or even in files if your design is not that complex. The key is to establish such a system early; otherwise later development, operations, and maintenance will be very painful.
Data quality verification
Write real-time data to Hive, and use offline data to continuously verify the accuracy of the real-time data.
After building the warehouse, especially for the first time, you will doubt whether your data is accurate. Our earlier verification method was to write ad-hoc programs that inspected the warehouse data and checked whether it was correct; as construction continued, we found this daily manual comparison exhausting.
We therefore adopted a scheme of writing the intermediate-layer tables to Hive, then using the offline side's rich data quality verification tools to compare the differences between the offline and real-time models, and finally monitoring and alerting against configured thresholds. Although this scheme cannot catch real-time data problems immediately, it lets you understand a real-time model's accuracy before going live, drive corrective tasks, and continuously improve data accuracy. It can also cross-check the accuracy of the offline data.
The above is Meituan-Dianping's experience of building a real-time data warehouse based on Flink; we hope it helps you!