Implementation of typical ETL scenarios based on Flink

Time: 2020-12-02

Author: Mai Rong, senior technical expert at Meituan-Dianping
Organized by: Zhao Yang (Flink community volunteer)
Proofread by: Miao Haochong (Flink community volunteer)

This article first reviews the development and evolution of the data warehouse: the background of its birth, its architecture, and the comparison between offline and real-time data warehouses. It then shares several solutions for implementing typical ETL scenarios based on Flink.

1. Overview of real-time data warehouse

1.1 background of the emergence of the real-time data warehouse

Let’s review the concept of data warehouse.

[Figure: the definition of a data warehouse]

The concept of the data warehouse was proposed by Bill Inmon in the 1990s. The background at that time was that traditional OLTP databases could not support long-term analysis and decision-making scenarios well, so the four core points of the data warehouse concept are best understood by contrasting them with the OLTP databases of that time.

  1. Subject oriented: the data organization of a data warehouse is different from that of OLTP. Because the data warehouse is oriented to analysis and decision-making, data is organized around analysis scenarios or analysis objects (subjects).
  2. Integrated: a data warehouse usually needs to gather multiple scattered, heterogeneous data sources, do ETL processing such as data cleaning, and integrate them into one warehouse. OLTP does not need this kind of integration.
  3. Relatively stable: an OLTP database is business oriented; its main job is to reflect the current business state accurately, so it has to support a large number of insert, delete and update operations. In a data warehouse, once data has been loaded, it is generally only queried, so the data is relatively stable.
  4. Reflecting historical changes: a data warehouse is a collection of data that reflects historical changes; it can be understood as keeping snapshots of historical data. An OLTP database only needs to reflect the latest state at the time.

The above four points are the core definition of a data warehouse. For real-time data warehouses, some parts of the traditional (offline) definition are weakened, for example the emphasis on reflecting historical changes. Having introduced the basic concept, note that data warehouse modeling uses several classical methods, including normal form modeling, dimensional modeling and Data Vault. In the context of Internet big data, dimensional modeling is the most commonly used.

Then, let’s take a look at the classic architecture of the offline data warehouse, as shown in the figure below:

[Figure: classic offline data warehouse architecture]

This data warehouse architecture is mainly geared towards Internet big data scenarios. As the figure shows, there are three core parts.

  1. The first link is the data source part. Internet companies generally have two types of data sources:
  • The first type is log data, such as user behavior logs collected through client-side instrumentation (buried points) and back-end logs. Behavior logs usually go through the following pipeline: the data is first reported to Nginx, collected by Flume, and written into a message queue such as Kafka; real-time or offline pulling tasks then pull it into HDFS, our offline data warehouse.
  • The second type of data source is the business database. Its binlog is collected through Canal, written into the message queue, and finally pulled to HDFS by Camus.

These two kinds of data sources eventually land in the ODS layer on HDFS, also called the source-aligned data layer, which stays consistent with the original data sources.

  2. The second link is the offline data warehouse, shown in the blue box in the figure. It has a layered structure, and the model design follows the idea of dimensional modeling.
  • The bottom layer is the ODS layer, which keeps the data on HDFS without information loss, basically keeping the original log data unchanged.
  • On top of the ODS layer, unified data cleaning and normalization are carried out to produce the DWD detail data layer. This layer also contains unified dimension data.
  • Then, based on the DWD detail data layer, the data is organized by analysis scenario or analysis entity into the subject-oriented summary data layer, DWS.
  • On top of DWS we build the application data layer, APP, which is closer to concrete application scenarios. This data is highly aggregated and can be imported directly into the data services.

In the production of the intermediate offline data warehouse, offline computing engines such as MapReduce, Hive and Spark are generally adopted, and the data is usually stored on HDFS.

  3. After the first two links, part of the application layer data is stored in data services such as HBase, Redis, Kylin and other KV stores. The data kept there is wrapped behind service interfaces and exposed to the outside. On top of that, we build business-facing reports, analysis-oriented data products and online business products. This layer is called the data application part, which is closest to the business side.

The above is an introduction to the classic architecture of offline data warehouse.

We all know that with the popularity of mobile devices, we have gradually moved from the manufacturing era to the Internet era. In the manufacturing era, the traditional data warehouse mainly supported business decision-makers and managers of enterprises in traditional industries; decision cycles were relatively long, and the data could comfortably live in databases such as Oracle.

However, with the development of distributed computing and intelligent technologies, the improvement of overall computing power and the growth of the Internet, the amount of data collected on the Internet has increased exponentially. Business decisions no longer depend only on people; most decisions are now made by computer algorithms, for example in intelligent recommendation scenarios, where the decision cycle can be as short as seconds. We also face more and more scenarios that require real-time data processing, such as real-time personalized recommendation and advertising; some traditional enterprises have even started to monitor product quality in real time, and the financial industry relies heavily on real-time anti-fraud. Against such a background, the real-time data warehouse was proposed.

1.2 real-time data warehouse architecture

First of all, let’s introduce the Lambda architecture, the classic real-time data warehouse architecture.

[Figure: Lambda architecture]

This architecture was proposed by the author of Storm. The main idea of the Lambda architecture is to add a real-time data warehouse part on top of the original offline architecture, and then merge the offline historical data with the T+0 real-time data to produce continuously updated results.

  • Compared with the offline data warehouse architecture diagram in 1.1 above, it can be clearly seen that the added real-time part is the yellow area in the figure. Generally, the real-time data warehouse is built on a message queue like Kafka, and dimensional modeling is still applied in layers. However, for the summary layers, the APP-layer data is usually not placed in the real-time data warehouse; instead, more of that computation is moved to the data service side.
  • In this way, relying on the efficient computation of the real-time engine, the timeliness of the data is improved from the day or hour level to the second or minute level.

You can also see in this architecture diagram that the intermediate data warehouse has two parts: the offline data warehouse and the real-time data warehouse. We have to maintain both real-time code and offline code, and when merging we have to ensure consistency between the real-time and the offline data. So whenever the code changes, a lot of real-time vs. offline data comparison and verification is needed. This is quite costly in both resources and operation and maintenance. This is the obvious, prominent problem of the Lambda architecture, and it is why the Kappa architecture appeared.

[Figure: Kappa architecture]

One of the main ideas of the Kappa architecture is to remove the offline part of the data warehouse and produce the warehouse entirely with the real-time pipeline. As can be seen from the figure above, the offline data warehouse module is gone.

Regarding the Kappa architecture, students familiar with real-time warehouse production may have a question. We often face business changes, and a lot of business logic needs to be iterated. If the calculation caliber of some previously produced data changes, the historical data needs to be recomputed or even replayed. For a real-time data warehouse, how do we solve the data recomputation problem?

The idea of the Kappa architecture here is: first, use a message queue that can store historical data, such as Kafka, and the message queue must support restarting consumption from a historical point. Then start a new job that consumes Kafka from an earlier point in time. When the progress of the new job catches up with the running job, switch the downstream of the current job over to the new job; the old job can then be stopped and its original result table deleted.

With the improvement of real-time OLAP technology, a new real-time architecture has been proposed, which is called real-time OLAP variant.

[Figure: real-time OLAP variant architecture]

The idea is to let the real-time OLAP engine take on most of the aggregation and analysis computation. The real-time data warehouse part then does not need to care much about aggregation logic, so we can flexibly respond to changing analysis requirements in the data application layer, and the whole architecture becomes more flexible.

Finally, let’s compare these real-time data warehouse architectures as a whole.

[Figure: comparison of the three real-time data warehouse architectures]

Here is an overall comparison of the three real-time data warehouse architectures:

  • From the perspective of the computing engine: the Lambda architecture needs to maintain both a batch and a stream computing engine; the Kappa architecture and the real-time OLAP variant only need to maintain a stream computing engine.
  • Development cost: for the Lambda architecture the development cost is higher, because it has to maintain both real-time and offline code; the Kappa architecture and the real-time OLAP variant only need to maintain one set of code.
  • Analytical flexibility: the real-time OLAP variant is the most flexible.
  • Dependence on a real-time OLAP engine: the real-time OLAP variant strongly depends on the capability of the real-time OLAP engine, while the other two do not.
  • Computing resources: the Lambda architecture needs both batch and streaming computing resources, the Kappa architecture only needs streaming resources, and the real-time OLAP variant additionally needs OLAP resources.
  • Recomputation after logic changes: the Lambda architecture recomputes through the batch pipeline; the Kappa architecture re-consumes the message queue as described above; the real-time OLAP variant also re-consumes the message queue and additionally has to re-import the data into the OLAP engine for computation.

1.3 traditional vs. real-time data warehouse

Then let’s look at the overall differences between traditional and real-time data warehouses.

[Figure: traditional vs. real-time data warehouse comparison]

  1. Timeliness: the offline data warehouse supports hour-level and day-level latency, while the real-time data warehouse can reach second or minute level, so its timeliness is much higher.
  2. Data storage: the offline data warehouse stores data on HDFS and RDS; the real-time data warehouse generally uses message queues and some KV stores, for example dimension data is mostly kept in KV storage.
  3. Production process: the offline data warehouse depends on an offline computing engine and offline scheduling, while the real-time data warehouse mainly depends on a real-time computing engine.

2. Implement typical ETL scenarios based on Flink

Here we mainly introduce two real-time ETL scenarios: dimension table join and double stream join.

  • Dimension table join

    • Preload dimension table
    • Hot storage correlation
    • Broadcast dimension table
    • Temporal table function join
  • Double stream join

    • Offline join vs. real time join
    • Regular join
    • Interval join
    • Window join

2.1 dimension table join

2.1.1 preload dimension table

Option 1:

The specific implementation is to define a class that extends RichFlatMapFunction, read the dimension database in the open function and load the full dimension data into memory, and then, in the operator on the probe stream, associate each record with the in-memory dimension data at runtime.

The advantage of this scheme is that it is simple to implement, but its disadvantages are also obvious. Because every piece of dimension data is loaded into memory, it only supports small dimension tables. Also, to update the dimension table, the job has to be restarted, so the cost of updating dimension data is high and it causes a period of delay. Preloading a dimension table is therefore suitable for small dimension tables that change infrequently and where the timeliness requirement on changes is low.

Let’s look at a simple code example:

[Figure: code snippet for preloading a dimension table]

This code shows the key fragment: a DimFlatMapFunction that extends RichFlatMapFunction. It has a Map-typed field dim, which stores the dimension data after it has been read from the DB. In the open function we connect to the DB and fetch the data; in this scenario we read the product id and name from a commodity table and store them in dim.

Next, dim is used in the flatMap function. For each record obtained from the probe stream we look up its product id in dim; if it is found, the corresponding product name is attached to the output tuple. This is the basic flow.

This is the basic version of the implementation. One improvement to this scheme is to create a background thread in the open function that reloads the dimension table periodically; then the dimension data can be refreshed regularly without manually restarting the job.
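Since the original code screenshot is not reproduced here, below is a minimal sketch of this approach, assuming a hypothetical JDBC table dim_product(id, name) and a probe stream of (productId, orderId) tuples; the connection string, field layout and refresh interval are illustrative only.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.sql.*;
import java.util.Map;
import java.util.concurrent.*;

// Sketch: preload a small dimension table into memory and refresh it periodically.
public class DimFlatMapFunction
        extends RichFlatMapFunction<Tuple2<Integer, Long>, Tuple2<Long, String>> {

    private volatile Map<Integer, String> dim = new ConcurrentHashMap<>();
    private transient ScheduledExecutorService reloader;

    @Override
    public void open(Configuration parameters) throws Exception {
        loadDim();                                    // initial full load
        reloader = Executors.newSingleThreadScheduledExecutor();
        reloader.scheduleAtFixedRate(() -> {          // periodic refresh, no job restart needed
            try { loadDim(); } catch (Exception ignored) { }
        }, 10, 10, TimeUnit.MINUTES);
    }

    private void loadDim() throws Exception {
        Map<Integer, String> newDim = new ConcurrentHashMap<>();
        // Hypothetical connection string and table; replace with the real dimension DB.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://dim-db:3306/dim", "user", "pwd");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM dim_product")) {
            while (rs.next()) {
                newDim.put(rs.getInt("id"), rs.getString("name"));
            }
        }
        dim = newDim;                                 // swap in the freshly loaded map
    }

    @Override
    public void flatMap(Tuple2<Integer, Long> value, Collector<Tuple2<Long, String>> out) {
        String name = dim.get(value.f0);              // value.f0 = product id
        if (name != null) {
            out.collect(Tuple2.of(value.f1, name));   // value.f1 = e.g. order id
        }
    }

    @Override
    public void close() {
        if (reloader != null) { reloader.shutdown(); }
    }
}
```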

Option 2:

Through Flink’s distributed cache mechanism, a local dimension file is distributed to the TaskManagers and then loaded into memory for the association. The implementation can be divided into three steps (a minimal sketch follows the list below):

  • The first step is to register the file via env.registerCachedFile.
  • The second step is to implement a RichFunction; in its open function, the cached file can be obtained through the RuntimeContext.
  • The third step is to parse the file and use its data.
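Here is a hedged sketch of these three steps; the file path, the logical name "dimProduct" and the CSV layout are assumptions for illustration only.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class DistributedCacheDimJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: register the dimension file under a logical name.
        env.registerCachedFile("hdfs:///dim/dim_product.csv", "dimProduct");

        env.fromElements(1, 2, 3)  // stands in for the real probe stream of product ids
           .map(new RichMapFunction<Integer, String>() {
               private final Map<Integer, String> dim = new HashMap<>();

               @Override
               public void open(Configuration parameters) throws Exception {
                   // Step 2: obtain the cached file from the runtime context.
                   File file = getRuntimeContext()
                           .getDistributedCache()
                           .getFile("dimProduct");
                   // Step 3: parse the file, assuming one "id,name" pair per line.
                   for (String line : Files.readAllLines(file.toPath())) {
                       String[] parts = line.split(",");
                       dim.put(Integer.parseInt(parts[0]), parts[1]);
                   }
               }

               @Override
               public String map(Integer productId) {
                   return dim.getOrDefault(productId, "unknown");
               }
           })
           .print();

        env.execute("distributed-cache-dim-join");
    }
}
```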

One advantage of this method is that no external database needs to be prepared or relied on. The disadvantage is that the data is still loaded into memory, so the supported dimension table size is limited, and if the dimension data needs to be updated, the job has to be restarted. It is not recommended for normal production use: from the data warehouse point of view, all data should be manageable through a schema, and putting dimension data into files is not conducive to overall data management and standardization. So this approach is mainly useful for small demos or tests.

It is applicable to scenarios where the dimension data exists in file form, the data volume is small and updates are infrequent, for example a static code table or a configuration file.

2.1.2 hot storage association

[Figure: hot storage association]

The second major implementation idea for dimension table joins is hot storage association. Specifically, we import the dimension data into hot storage such as Redis, Tair or HBase, query it through asynchronous I/O, and add a cache with an eviction mechanism on top, so that hot dimension data stays in memory and the overall access pressure on the hot storage is reduced.

The process is shown in the figure above. For the cache, Google’s Guava cache is recommended; it encapsulates cache loading and eviction mechanisms and is convenient to use.

There are two important points in this scheme. The first is that we need asynchronous I/O to access the storage. Let’s briefly review the difference between synchronous and asynchronous I/O:

  • With synchronous I/O, after sending a request you must wait for it to return before sending the next one, so the overall throughput is low. Since real-time data processing is particularly sensitive to latency, synchronous I/O is not acceptable in many scenarios.
  • Asynchronous I/O can send multiple requests in parallel, so the overall throughput is much higher and the latency much lower. However, higher throughput also puts more pressure on the external storage, which can then become the latency bottleneck of the whole pipeline. That is why the cache mechanism is introduced: the cache reduces the number of accesses to the external storage. (A minimal sketch of using Flink’s async I/O for the lookup follows below.)
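As a hedged illustration of the async I/O part (the storage lookup is a placeholder, not a real Redis/Tair/HBase client), the dimension lookup could be wrapped in a RichAsyncFunction roughly like this:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: enrich product ids with names kept in an external hot storage via async I/O.
public class DimAsyncFunction extends RichAsyncFunction<Integer, String> {

    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) {
        executor = Executors.newFixedThreadPool(10);
    }

    @Override
    public void asyncInvoke(Integer productId, ResultFuture<String> resultFuture) {
        CompletableFuture
            .supplyAsync(() -> queryExternalStore(productId), executor)
            .whenComplete((name, err) ->
                resultFuture.complete(Collections.singleton(name == null ? "unknown" : name)));
    }

    // Placeholder for a real hot storage lookup (Redis / Tair / HBase client call).
    private String queryExternalStore(Integer productId) {
        return "product-" + productId;
    }

    @Override
    public void close() {
        if (executor != null) { executor.shutdown(); }
    }
}

// Usage sketch: at most 100 in-flight requests, 1 second timeout per request.
// AsyncDataStream.unorderedWait(probeStream, new DimAsyncFunction(),
//         1000, TimeUnit.MILLISECONDS, 100);
```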

The Guava cache mentioned just now is very simple to use. The following shows an example of defining a cache:

[Figure: example of defining a Guava cache]
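Since the original screenshot is not included, here is a small, hedged example of defining such a cache with Guava; the loader logic used on a cache miss is a placeholder:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.util.concurrent.TimeUnit;

public class DimCacheExample {

    public static void main(String[] args) throws Exception {
        // Cache at most 10,000 dimension entries, expiring 10 minutes after write.
        LoadingCache<Integer, String> cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build(new CacheLoader<Integer, String>() {
                    @Override
                    public String load(Integer productId) {
                        // Placeholder: on a cache miss, query the hot storage here.
                        return "product-" + productId;
                    }
                });

        System.out.println(cache.get(42)); // first call loads, later calls hit the cache
    }
}
```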

As you can see, the interface is very simple and you can try it out. For the hot storage association scheme, the advantage is that the dimension data no longer has to be fully loaded into memory, so it is not limited by memory size and much larger dimension tables are possible; in Meituan-Dianping’s traffic scenarios, our dimension data can reach the order of a billion records. On the other hand, the disadvantages are also obvious: we have to rely on hot storage resources, and dimension updates are reflected in the results with some delay, because the data first has to be imported into the hot storage, and on top of that there is the cache expiration time.

Generally speaking, this method is applicable when the dimension data volume is relatively large and a certain delay in dimension updates is acceptable.

2.1.3 broadcast dimension table

The third idea is the broadcast dimension table, which mainly uses broadcast state to broadcast the dimension data stream to the downstream tasks for the join.

Implementation mode:

  • The dimension data is sent to Kafka and used as the broadcast raw stream S1.
  • Define the state descriptor MapStateDescriptor, and call S1.broadcast() to obtain the BroadcastStream S2.
  • Call the non-broadcast stream S3.connect(S2) to get the BroadcastConnectedStream S4.
  • Implement the association logic in a KeyedBroadcastProcessFunction / BroadcastProcessFunction and pass it to S4.process().

The advantage of this scheme is that dimension changes can be reflected in the results in time. The disadvantage is that the data still has to be kept in memory, because it lives in state, so the supported dimension table size is still not very large. The applicable scenario is when we need to perceive dimension changes promptly and the dimension data can be turned into a real-time stream.

Here is a small demo:

[Figure: broadcast dimension table demo code]

The broadcast stream pageStream used here defines a page id and a page name. The non-broadcast stream probeStream is a JSON-formatted string containing the device id, page id and timestamp; it can be understood as the page-view (PV) records of users on their devices.

The whole implementation follows the four steps above:

  • The first step is to define the broadcast state descriptor.
  • In step 2 we generate the BroadcastStream.
  • In step 3 we connect the two streams.
  • The most important part is step 4, implementing the BroadcastProcessFunction. Its first type parameter is our probeStream, the second is the broadcast stream data, and the third is the output data. The main data processing logic is in processElement.

In the data processing, we first obtain the broadcast state through the context using the state descriptor, then parse the probe stream data to get a pageId, and query the state for that pageId; if a match is found, the corresponding pageName is added to the JSON record and emitted.
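Since the demo code itself is only in the screenshot, here is a hedged, simplified sketch along the same lines; the tuple layout and the JSON parsing are illustrative placeholders:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastDimJoin {

    // Step 1: state descriptor, pageId -> pageName.
    static final MapStateDescriptor<String, String> DESC =
            new MapStateDescriptor<>("pages", Types.STRING, Types.STRING);

    public static DataStream<String> join(DataStream<Tuple2<String, String>> pageStream,
                                          DataStream<String> probeStream) {
        // Step 2: broadcast the dimension stream.
        BroadcastStream<Tuple2<String, String>> broadcast = pageStream.broadcast(DESC);

        // Steps 3 and 4: connect the streams and implement the processing logic.
        return probeStream
            .connect(broadcast)
            .process(new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {

                @Override
                public void processElement(String json, ReadOnlyContext ctx,
                                           Collector<String> out) throws Exception {
                    // Assume the JSON contains a "pageId" field; parsing is simplified here.
                    String pageId = extractPageId(json);
                    String pageName = ctx.getBroadcastState(DESC).get(pageId);
                    if (pageName != null) {
                        out.collect(json + ",pageName=" + pageName);
                    }
                }

                @Override
                public void processBroadcastElement(Tuple2<String, String> page, Context ctx,
                                                    Collector<String> out) throws Exception {
                    // Dimension update: refresh the broadcast state.
                    ctx.getBroadcastState(DESC).put(page.f0, page.f1);
                }
            });
    }

    // Placeholder for real JSON parsing.
    private static String extractPageId(String json) {
        return json.replaceAll(".*\"pageId\":\"([^\"]+)\".*", "$1");
    }
}
```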

2.1.4 Temporal table function join

After the methods above, another important approach is the temporal table function join. First of all, what is a temporal table? It is a view that returns the content of a continuously changing table as of a certain point in time. The continuously changing table, also called the changing table, can be a real-time changelog stream or a materialized dimension table placed on external storage.

The implementation joins the probe stream with the temporal table through a UDTF, which is why it is called a temporal table function join. This join method is suitable for scenarios where the dimension data exists as a changelog stream and we need to join against a specific time version of it.

First of all, let’s take the exchange rate and currency transaction example from the official website. Our dimension data, that is, the changelog stream just mentioned, is RatesHistory; it reflects the exchange rates of different currencies relative to the Japanese yen at different times.

[Figure: the RatesHistory changelog and the Orders probe table]

Its first field is the time, the second is the currency, and the third is the exchange rate relative to the Japanese yen. Our probe table records purchases made in different currencies; for example, buying 2 Euro at 10:15 is one currency transaction recorded in that table. What we want to compute in this example is the total transaction amount in yen. How can we achieve this with a temporal table function join?

  • In step 1, we define the temporal table function on the changelog stream. Two key parameters are required: the first is a time attribute that identifies the version, and the second is the key to associate on; in this case we choose currency.
  • In step 2, we register the temporal table function under a name in the TableEnvironment (a sketch of both steps follows below).
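As a hedged sketch of these two steps, roughly following the example on the Flink website (exact method names vary slightly between Flink versions, and the field names are assumptions):

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TemporalTableFunction;

import static org.apache.flink.table.api.Expressions.$;

// Assume ratesHistory is a Table with fields (r_rowtime, r_currency, r_rate),
// where r_rowtime is an event-time attribute.
public class TemporalTableRegistration {

    public static void register(StreamTableEnvironment tEnv, Table ratesHistory) {
        // Step 1: define the temporal table function; the first argument is the
        // time attribute (the version), the second is the primary key (currency).
        TemporalTableFunction rates =
                ratesHistory.createTemporalTableFunction($("r_rowtime"), $("r_currency"));

        // Step 2: register it under a name so it can be used in SQL / Table API.
        tEnv.registerFunction("Rates", rates);
    }
}
```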

Then let’s take a look at what the registered TemporalTableFunction can do.

[Figure: querying the temporal table function at a point in time]

For example, if we use the Rates function to get the state at 11:50: for the US Dollar, 11:50 falls into the validity interval of the most recent dollar rate recorded before that time, so that rate is returned; for the Euro, 11:50 falls between 11:15 and 12:10, so the value 119 is chosen. This is exactly the temporal table concept defined at the beginning: obtaining the valid data of the changelog at a certain point in time. After defining the temporal table function, we use it to implement the business logic.

[Figure: temporal table function join query]

Note that we need to specify the join key. Because both streams keep updating, the 11:00 record in the order table is joined with the Euro rate that was valid at that time, namely 116 from 10:45, so that record contributes 232 to the final result.
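As a hedged illustration of such a query (it continues the registration sketch above; the Orders field names are assumptions and may differ from the original slides):

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class TemporalJoinQuery {

    // Assumes an Orders table (o_rowtime, currency, amount) and the Rates function
    // registered as shown in the previous sketch.
    public static Table yenTotal(TableEnvironment tEnv) {
        // Each order is converted with the rate that was valid at its own event time.
        return tEnv.sqlQuery(
            "SELECT SUM(o.amount * r.r_rate) AS yen_amount " +
            "FROM Orders AS o, " +
            "     LATERAL TABLE (Rates(o.o_rowtime)) AS r " +
            "WHERE o.currency = r.r_currency");
    }
}
```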

This is the usage of the temporal table function join.

2.1.5 comparison of dimension table joins

Then, let’s review the differences in the dimension table join, so that we can better understand the scenarios in which each method is applicable.

[Figure: comparison of dimension table join approaches]

  • Implementation complexity: apart from hot storage association, which is slightly more complex, the other approaches are basically low in complexity.
  • Dimension table data volume: hot storage association and the temporal table function join can support larger amounts of data; the other approaches are limited by memory size because the dimension table has to be loaded into memory.
  • Dimension table update frequency: because preloading DB data into memory and the distributed cache both require a restart when the dimension data changes, they are not suitable for frequently changing dimension tables. The broadcast dimension table and the temporal table function join can update the dimension data in real time and reflect it in the results, so they support frequently updated dimension tables.
  • Timeliness of dimension updates: the broadcast dimension table and the temporal table function join achieve the fastest propagation of updates; hot storage association can also meet business requirements in most scenarios.
  • Form of the dimension data: the first approach mainly supports small amounts of data read from a DB, the distributed cache supports files, and hot storage association requires HBase, Tair and the like; both the broadcast dimension table and the temporal table function join require the dimension data to be turned into a real-time stream.
  • External storage: the first approach and hot storage association depend on external storage.

These are the common approaches to the dimension table join. Some students may have other solutions; feedback and exchange are welcome, and the schemes listed here are by no means exhaustive.

2.2 double stream join

First of all, let’s review how batch processing joins two tables. A batch engine generally adopts one of two ideas.

One is the sort merge join, based on sorting; the other is the hash join, where one side is turned into a hash table and loaded into memory. Are these two ideas applicable to the double stream join scenario? In a double stream join, the objects being processed are no longer batches of bounded data but unbounded data sets. We cannot sort an unbounded data set before processing it, nor can we load all of it into memory as a hash table, so neither approach really applies. At the same time, in the double stream join scenario the join operates on two streams into which data keeps arriving, so the join results also need to be updated continuously.

So what kind of solution do we have for implementing the double stream join? Flink’s basic idea is to persist the data of both streams in state and use it for matching. Because the join result has to keep updating, earlier data cannot simply be discarded without additional conditions. However, state cannot keep all data forever, so we need to localize the globally scoped join in some way, that is, to split the unbounded stream into bounded segments as far as possible and join within them.

That is the overall idea; let’s look at the concrete implementations.

2.2.1 offline join vs. real time join

Next, let’s take inner join as an example to see a simple implementation idea:

[Figure: inner join of two streams using state]

The left stream is drawn in black and the right stream in blue, and the two streams need an inner join. First, when an element of either stream arrives, it is stored in the corresponding state. Besides being stored, each element of the left stream is compared with the right state to see whether it can be matched; likewise, each arriving element of the right stream is compared with the left state. Whenever a match is found, it is emitted as an inner join result. The figure is fairly rough, but it shows the general mechanics of an inner join and one way to realize a double stream join.
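As a hedged, simplified sketch of this idea (the tuple layouts are illustrative, and a production version would also need some state cleanup such as TTL), the matching can be written with a KeyedCoProcessFunction:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Inner join of two keyed streams of (key, value) pairs; emits (key, leftValue, rightValue).
public class StreamInnerJoin extends
        KeyedCoProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>,
                               Tuple3<String, String, String>> {

    private transient ListState<String> leftState;
    private transient ListState<String> rightState;

    @Override
    public void open(Configuration parameters) {
        leftState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("left", String.class));
        rightState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(Tuple2<String, String> left, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        leftState.add(left.f1);                       // keep for future right-side arrivals
        for (String right : rightState.get()) {       // match against everything seen on the right
            out.collect(Tuple3.of(left.f0, left.f1, right));
        }
    }

    @Override
    public void processElement2(Tuple2<String, String> right, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        rightState.add(right.f1);                     // keep for future left-side arrivals
        for (String left : leftState.get()) {         // match against everything seen on the left
            out.collect(Tuple3.of(right.f0, left, right.f1));
        }
    }
}

// Usage sketch:
// leftStream.connect(rightStream)
//     .keyBy(l -> l.f0, r -> r.f0)
//     .process(new StreamInnerJoin());
```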

2.2.2 Regular join

Let’s take a look at the first type of double stream join, the regular join. This join keeps the state of both streams permanently and never clears it: the data of each side is visible to the other stream, so it has to be persisted in state. Since the state cannot grow without bound, this is only suitable for bounded data streams. Its syntax is similar to offline batch SQL:

[Figure: regular join SQL syntax supported by Flink]

The figure above shows the regular join syntax supported by Flink; it is basically the same as ordinary SQL (a hedged example follows below).
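For illustration only (the table and column names are hypothetical and assume both tables are registered in the TableEnvironment), a regular inner join in Flink SQL could be written like this:

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class RegularJoinExample {

    // Every newly arriving order or shipment may update previously emitted join results.
    public static Table regularJoin(TableEnvironment tEnv) {
        return tEnv.sqlQuery(
            "SELECT o.order_id, o.amount, s.ship_time " +
            "FROM Orders AS o " +
            "JOIN Shipments AS s ON o.order_id = s.order_id");
    }
}
```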

2.2.3 Interval join

The second type of join supported by Flink is the interval join. What does it mean? A time window is added to the join: an element of one stream must fall within a certain time range around the other stream’s timestamps, and the join keys must be equal, for the join to happen. With the time range restriction, data outside the range can be cleaned up, so we no longer need to keep all state.

Interval join supports both processing time and event time. With processing time, Flink uses the system time to divide the window and clean up the related state; with event time, the watermark mechanism is used to divide the window and clean the state.

Let’s look at some examples:

[Figure: interval join SQL example (Orders and Shipments)]

The example in the figure above uses two tables: an order table and a shipment table. The time constraint defined here is that the shipment must happen within 4 hours after the order is placed (a hedged sketch of such a query follows below).
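For illustration only (table and column names are assumptions; the two time columns are assumed to be event-time attributes), the interval join could look roughly like this:

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class IntervalJoinExample {

    // Each order is matched with shipments that occur within 4 hours after it.
    public static Table intervalJoin(TableEnvironment tEnv) {
        return tEnv.sqlQuery(
            "SELECT o.id, o.order_time, s.ship_time " +
            "FROM Orders AS o, Shipments AS s " +
            "WHERE o.id = s.order_id " +
            "  AND s.ship_time BETWEEN o.order_time AND o.order_time + INTERVAL '4' HOUR");
    }
}
```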

One of the Flink authors gave a very intuitive explanation of this before; here I directly reuse part of his example:

[Figure: interval join, cleaning up the Orders state]

We can see that the interval join defines a lower time bound, which allows data beyond that bound to be cleaned up. For example, in the SQL above, the join condition implies that the order time must be greater than the ship time minus 4 hours. So for the shipment stream, if a watermark of 12:00 is received, order data with timestamps earlier than 8:00 can no longer match anything and can be discarded from state.

[Figure: interval join, cleaning up the Shipments state]

Similarly, the ship time also has a lower bound: it must be greater than the order time. So for the orders stream, if a watermark of 10:15 is received, shipment state with timestamps before 10:15 can be discarded. Interval join therefore lets us clean up part of the historical state.

2.2.4 Window join

Finally, the third kind of double stream join, the window join: the idea is to join elements of the two streams that share the same key and fall into the same window. Its execution logic is similar to an inner join: only element pairs that have the same join key and lie in the same window are emitted in the final result. The usage is as follows:

[Figure: window join usage in the DataStream API]

At present window join is only supported in the DataStream API, so the usage here is DataStream-style: first join the two streams, then define the join keys with where and equalTo, then specify the window assignment with a WindowAssigner, and finally implement the processing logic for matched element pairs in a JoinFunction or FlatJoinFunction (a hedged sketch follows below).
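As a hedged sketch of this usage (the tuple layout and the 2 ms window are taken from the example below; it assumes event-time timestamps and watermarks have already been assigned upstream):

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowJoinExample {

    // Joins elements of the two streams that share a key within the same 2 ms window.
    public static DataStream<String> windowJoin(DataStream<Tuple2<String, Integer>> greenStream,
                                                DataStream<Tuple2<String, Integer>> yellowStream) {
        return greenStream
            .join(yellowStream)
            .where(e -> e.f0, Types.STRING)          // join key of the first stream
            .equalTo(e -> e.f0, Types.STRING)        // join key of the second stream
            .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
            .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                @Override
                public String join(Tuple2<String, Integer> green, Tuple2<String, Integer> yellow) {
                    // Processing logic for each matched pair in the same window.
                    return green.f0 + ": " + green.f1 + "," + yellow.f1;
                }
            });
    }
}
```

Switching to a sliding or session window join only changes the WindowAssigner passed to window().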

Because windows are actually divided into three categories, our window join is also divided into three categories:

  • The first type is the tumbling window join: the stream is divided into non-overlapping windows by a fixed time interval.

[Figure: tumbling window join]

There are two streams in this diagram, a green stream and a yellow stream, and in this example a 2-millisecond window is defined. Each circle is a single element of a stream, and the number above it is the element’s timestamp. The timeline is divided into 2-millisecond intervals, and the windows do not overlap. In the first window, the green stream has two elements and the yellow stream has two elements; they are combined pairwise and finally passed into the JoinFunction or FlatJoinFunction for the specific processing.

  • The second type is the sliding window join: here a sliding window is used.

[Figure: sliding window join]

A sliding window first defines the window size and then the slide interval. If the slide interval is smaller than the window size, windows overlap; as shown in the figure, the red window and the yellow window overlap. The element 0 of the green stream lies in both the red window and the yellow window at the same time, which shows that one element can belong to two windows. In the red window, the green 0 and the yellow 0 meet the condition; in the yellow window, the green 0 matches the yellow 0 and 1, so the pairs (0,0) and (1,0) are formed. Finally all qualified pairs go into the join function we defined for processing.

  • The third type is the session window join: the window used here is the session window.

[Figure: session window join]

The session window defines a gap interval: if no element arrives within that interval, a new window is opened. In the figure above the windows do not overlap, and the gap defined here is 1. For the first window, the green element 0 and the yellow elements 1 and 2 are in the same window, so the pairs (1,0) and (2,0) are formed. The rest is analogous; all qualified pairs go into the final join function for processing.

To review this section as a whole: it introduced Flink ETL implementations for the dimension table join and double stream join scenarios. For the dimension table join we covered four approaches: preloading the dimension table, hot storage association, broadcast dimension table and temporal table function join. For the double stream join we covered the regular join, the interval join and the window join.

About the author:

Mai Rong, senior technical expert at Meituan-Dianping, is responsible for building the user panorama data. He has led the construction of Meituan-Dianping’s traffic data warehouse and traffic operation analysis systems, focusing on massive data processing, data warehouse construction, and user and traffic analysis.