Introduction: the Real-time Data Warehouse introductory training camp was jointly organized by Alibaba Cloud researcher Wang Feng, Alibaba Cloud senior technical expert Jin Xiaojun, Alibaba Cloud senior product expert Liu Yiming, and several front-line technical and product experts from Realtime Compute for Apache Flink and Hologres. Together they built the camp's curriculum and carefully polished the course content to address the pain points students commonly encounter. Progressing from the basics to advanced topics, the camp comprehensively analyzes the architecture, scenarios, and practical application of the real-time data warehouse. Seven high-quality courses help you grow from beginner to expert in five days!
This article is compiled from the live broadcast "Realtime Compute Flink SQL in Practice" by Li Lin (Seal). It covers:
1. Introduction to Flink SQL in Realtime Compute
2. Getting-started examples of Realtime Compute Flink SQL
3. Common development problems and solutions
Introduction to Flink SQL in Realtime Compute
(1) About Realtime Compute Flink SQL
Realtime Compute for Apache Flink chooses SQL, a declarative language, as its top-level API because it is relatively stable and convenient for users. Flink SQL unifies stream and batch processing, giving users a single development experience with consistent semantics. In addition, Flink SQL can be optimized automatically: it hides the complexity of state management in stream computing, provides automatically optimized execution plans, and integrates the AutoPilot automatic tuning feature. Flink SQL also covers a wide range of application scenarios, including data integration, real-time reporting, real-time risk control, and online machine learning.
(2) Basic operations
You can see that the syntax is very similar to standard SQL. The example includes basic select and filter operations. You can use built-in functions, such as date formatting, or user-defined functions: for example, the exchange-rate conversion in the example is a user-defined function, which can be used directly once registered on the platform.
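As a minimal sketch of what such a query can look like (the table name `orders`, its columns, and the `ExchangeRate` UDF are all hypothetical names, not from the original demo):

```sql
-- Hypothetical source table `orders` with columns
-- (order_id, currency, amount, gmt_create).
SELECT
  order_id,
  DATE_FORMAT(gmt_create, 'yyyy-MM-dd HH:mm') AS create_time,  -- built-in function
  ExchangeRate(currency, amount) AS cny_amount                 -- user-defined function
FROM orders
WHERE amount > 0;
```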
(3) Dimension table lookup join
In the actual data processing process, the lookup join of dimension table is also a common example.
Here is an example of a dimension table inner join.
The source table shown in the example is a real-time order information table. It is enriched with dimension table information through an inner join; the dimension-table join syntax is highlighted in yellow. You can see that it differs from traditional batch processing: the extra FOR SYSTEM_TIME AS OF clause indicates that this is a dimension-table join. Every order message from the source table triggers the dimension-table operator to query the dimension table, which is why it is called a lookup join.
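A minimal sketch of the lookup-join syntax (table and column names are hypothetical; `proctime` is assumed to be a processing-time attribute declared on the source table):

```sql
-- Each order triggers a lookup into the `dim_product` dimension table.
SELECT o.order_id, o.product_id, d.product_name
FROM orders AS o
INNER JOIN dim_product FOR SYSTEM_TIME AS OF o.proctime AS d
  ON o.product_id = d.product_id;
```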
(4) Window aggregation
Window aggregation is also a common operation. Flink SQL supports several common window types, such as tumble window, session window, hop window, and the newly introduced cumulate window.
A tumble window can be understood as a fixed-size time window, also known as a tumbling (rolling) window: for example, windows at fixed intervals of 5 minutes, 10 minutes, or 1 hour, with no overlap between them.
A session window defines a range of contiguous events. One parameter in the window definition is the session gap: if the interval between two records exceeds the defined gap, the previous window is closed and a new window is started.
Unlike tumbling windows, which never overlap, hop (sliding) windows can overlap. A sliding window has two parameters: size and slide. Size is the length of the window, and slide is the step of each slide. If slide < size, windows overlap and the same record may be assigned to multiple windows; if slide = size, it is equivalent to a tumble window; if slide > size, windows do not overlap and there are gaps between them.
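For reference, a hop window in the window TVF syntax might look like the sketch below (the table `user_clicks` and column `click_time` are hypothetical; `click_time` is assumed to be an event-time attribute). Note that in the TVF the slide parameter comes before the size:

```sql
-- 10-minute windows sliding every 5 minutes:
-- slide < size, so each record can land in two windows.
SELECT window_start, window_end, COUNT(*) AS click_cnt
FROM TABLE(
  HOP(TABLE user_clicks, DESCRIPTOR(click_time),
      INTERVAL '5' MINUTE, INTERVAL '10' MINUTE))
GROUP BY window_start, window_end;
```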
The cumulate window was newly introduced in Flink 1.13. It can be understood by comparison with the hop window; the difference is that it keeps accumulating from the window start. In the example, windows 1, 2, and 3 keep growing, up to a maximum window length. For example, if we define the window size as one day and the step as one hour, then every hour of the day it emits the aggregate result accumulated up to the current hour.
Take a look at a specific example of window aggregation processing.
As shown in the figure above, suppose we need to count each user's clicks every 5 minutes.
The source data is the user click log. We want to compute the total number of clicks per user every 5 minutes. The SQL uses the community's latest window TVF syntax: first apply the window over the source table, then group by the window attributes window_start and window_end, with count(*) as the click count.
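A sketch of that query in the window TVF syntax (the source `user_clicks(user_name, click_time)` is a hypothetical name; `click_time` is assumed to be declared as the event-time attribute):

```sql
-- Per-user click count in 5-minute tumbling windows.
SELECT window_start, window_end, user_name, COUNT(*) AS click_cnt
FROM TABLE(
  TUMBLE(TABLE user_clicks, DESCRIPTOR(click_time), INTERVAL '5' MINUTE))
GROUP BY window_start, window_end, user_name;
```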
When processing the data from 12:00 to 12:04, two users generated 4 clicks in total, counted as Mary 3 times and Bob once. The next batch contains three records, which fall into the next window and are counted as one and two clicks respectively.
(5) Group aggregation
Compared with window aggregation, group aggregation triggers computation directly, without waiting for the window to end. One applicable scenario is computing cumulative values.
The example above counts each user's clicks accumulated so far. The query is simple: group by user and compute count(*), i.e. a cumulative count.
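A sketch of the group aggregation (same hypothetical `user_clicks` table; no window is involved, so every incoming record updates the result):

```sql
-- Cumulative per-user click count; the result is an update stream.
SELECT user_name, COUNT(*) AS click_cnt
FROM user_clicks
GROUP BY user_name;
```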
The results differ from the window output. Given the same first four input records as in the window example, the group aggregation outputs Mary's click count updated to 3; the intermediate results may go from 1 to 2 and then 3, while Bob's count is 1. As the next three records arrive, Bob's count is updated to 2. The result is a continuously updating process, unlike the window computation.
The data output by a window does not change after the window ends, whereas in group aggregation the result for the same group key is continuously updated.
(6) Window aggregation vs. group aggregation
Let's compare window and group aggregation more comprehensively.
In terms of output mode, window aggregation emits on time: results are output only after the defined window expires. For example, with a 5-minute window covering 00:00 to 00:05, the result is emitted with a delay, only once when the whole window's data is complete, and it does not change afterwards.
Group aggregation is data-triggered: the first record produces a result, and the second record with the same key updates it. Accordingly, the nature of the output stream also differs: window aggregation normally outputs an append stream, while group aggregation outputs an update stream.
Their state handling also differs greatly. Window aggregation automatically cleans up expired data, so users need not worry about state growth. Group aggregation accumulates over unbounded state, so users need to define a state TTL, i.e. how long the state is retained, according to their computation scenario.
For example, to count the accumulated PV and UV within a day, the state TTL must be at least one day (ignoring data delay) to guarantee accurate results. If the state TTL is set to half a day, the statistics may be inaccurate.
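In Flink SQL the state TTL can be set through a job configuration option; a sketch (the exact option name and value format may vary between Flink versions and platforms):

```sql
-- Retain idle state for at least one day so day-level
-- cumulative PV/UV stay accurate.
SET 'table.exec.state.ttl' = '1 d';
```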
The storage requirements for the output are likewise determined by the nature of the output stream. Since window aggregation produces an append stream, any sink type can be used. Group aggregation outputs an update stream, so the target storage must support updates; Hologres, MySQL, or HBase can be used.
Getting-started examples of Realtime Compute Flink SQL
Let's see how each SQL operation is used in real business scenarios through concrete examples, covering basic SQL syntax and some common aggregations.
(1) Example scenario: e-commerce transaction data – real-time data warehouse
The example here is the e-commerce transaction data scenario, which simulates the layered data processing in the real-time data warehouse.
In the data access layer, we simulate e-commerce transaction order data, including basic fields such as order ID, product ID, user ID, transaction amount, leaf category of the product, and transaction time. This is a simplified table.
Example 1 performs data cleaning from the access layer to the data detail layer, and also joins category information. Then, at the data summary layer, we demonstrate minute-level transaction statistics and hour-level real-time transaction statistics. Finally, we introduce quasi-real-time day-level cumulative transaction statistics.
- Example environment: beta version
The demo environment is the current beta version of the Realtime Compute Flink product. On this platform you can do one-stop job development, including debugging and online operation and maintenance.
- Access layer data
Use the SQL datagen connector to generate simulated e-commerce transaction data.
To simplify the demo pipeline, the built-in SQL datagen connector is used to simulate the generation of the e-commerce data.
The order_id is designed as an auto-incrementing sequence (the connector parameters are not shown in full). The datagen connector supports several generation modes: for example, sequence mode produces an auto-incrementing sequence, and random mode produces random values. Here, different generation strategies are chosen according to the business meaning of each field.
The order_id auto-increments; the product ID is randomly chosen between 10,000 and 100,000; the user ID between 10,000 and 10,000,000; the transaction amount is also simulated. cat_id is the leaf category ID: 100 leaf categories are simulated, generated directly as a computed column taking the product ID modulo 100. The order creation time is simulated with the current time, so the job can be debugged on the development platform without setting up Kafka or DataHub for the access layer.
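A sketch of what such a datagen source DDL can look like (table and column names are hypothetical reconstructions of the description above, not the original demo's DDL):

```sql
CREATE TEMPORARY TABLE order_source (
  order_id   BIGINT,
  product_id BIGINT,
  user_id    BIGINT,
  pay_amount DOUBLE,
  cat_id     AS MOD(product_id, 100),  -- computed column: simulated leaf category
  gmt_create AS LOCALTIMESTAMP,        -- simulated order creation time
  WATERMARK FOR gmt_create AS gmt_create - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',
  'fields.order_id.kind' = 'sequence',   -- auto-incrementing order id
  'fields.order_id.start' = '1',
  'fields.order_id.end' = '100000000',
  'fields.product_id.min' = '10000',     -- random product id
  'fields.product_id.max' = '100000',
  'fields.user_id.min' = '10000',        -- random user id
  'fields.user_id.max' = '10000000'
);
```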
(2) Example 1-1: data cleaning
- E-commerce transaction data – order filtering
This is a data cleaning scenario. Suppose the business needs order filtering: the business side may want to filter out abnormal maximum and minimum transaction amounts, keeping, say, only orders greater than 1 yuan and less than 10,000 yuan as valid data.
The transaction creation time must also be after a certain point. This logic can be implemented by combining filters in the WHERE clause.
The real business scenario may be much more complex. Let’s see how SQL runs.
This is debugging mode. Click the run button on the platform for local debugging. You can see that the amount column is filtered and the order creation time is greater than the required value.
Even from this simple cleaning scenario, you can see that real-time SQL differs little from traditional batch processing in how it is written, including the output results. The main difference is that a streaming job runs continuously, unlike a batch job that ends once the data is processed.
(3) Example 1-2: category information association
Next, let's see how to do dimension table association.
The order data from the access layer contains only leaf category information, so the business needs to join the category dimension table, which records the mapping from leaf category to first-level category, with their IDs and names. The goal of this cleaning step is to join the dimension table on the leaf category ID in the original table and supplement the first-level category ID and name. This is written as an inner join with the dimension table, selecting the needed fields from the joined dimension table.
The only difference from the batch way of writing this join is the special FOR SYSTEM_TIME AS OF syntax.
As shown above, you can upload your own data on the platform for debugging. Here a CSV test file maps 100 leaf categories to 10 first-level categories.
The ones digit of a leaf category ID is its first-level category ID, which is used to look up the corresponding first-level category information and return its name. The advantage of local debugging is speed: results are visible immediately. In local debugging mode, the terminal automatically pauses after receiving 1,000 records to keep the output manageable.
(4) Example 2-1: minute-level transaction statistics
Next, let’s take a look at window based statistics.
The first scenario is minute level transaction statistics, which is a commonly used calculation logic in the summary layer.
For minute-level statistics it is natural to use a tumble window, computing each minute separately. Several metrics are needed: the total number of orders, the total amount, the number of distinct products sold, the number of distinct buyers, and so on. Products and buyers must be deduplicated, so the query uses DISTINCT.
The window is the tumble window introduced earlier: a one-minute window over the order creation time, with each minute's transactions counted by first-level category.
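A sketch of that summary-layer query (the detail-layer table `dwd_orders` and its columns are hypothetical names following the earlier examples):

```sql
-- Per-minute trade statistics by first-level category.
SELECT
  window_start, window_end, cate_l1_id,
  COUNT(*)                   AS order_cnt,
  SUM(pay_amount)            AS pay_total,
  COUNT(DISTINCT product_id) AS product_cnt,  -- deduplicated products
  COUNT(DISTINCT user_id)    AS user_cnt      -- deduplicated buyers
FROM TABLE(
  TUMBLE(TABLE dwd_orders, DESCRIPTOR(gmt_create), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, cate_l1_id;
```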
The figure above differs a little from the debugging mode shown earlier: after going online, the job is actually submitted to the cluster. Its output uses a debug sink that prints directly to the log. Expanding the job topology, you can see that the two-phase local-global optimization is enabled automatically.
- Run log – view debug output
After running for a period of time, you can see the final output through the log in the task.
A print sink is used, which writes directly to the log. In a real scenario, with output written to Hologres or MySQL, you would check the result in the target database.
It can be seen that the output data lags behind the original time of the data.
At 19:46:05, the data for the 19:45:00 window is output, a delay of about 5 seconds, emitting the aggregate result of the first minute.
Those five seconds come from the watermark setting in the source table definition. The watermark is declared as the gmt_create field offset by 5 seconds: when data with timestamp 19:46:00 arrives, the watermark is considered to be 19:45:55. This 5-second delay is how out-of-order data is tolerated.
(5) Example 2-2: hourly real-time transaction statistics
The second example is to do hourly real-time transaction statistics.
As shown in the figure above, when real-time statistics are required, can we simply enlarge the tumble window to 1 hour? Judging from the output just shown, it has an inherent delay: with an hourly window, we must wait until all the data of the hour has arrived, and the result of the previous hour is only output at the beginning of the next hour. An hour-level delay cannot meet real-time requirements. Recalling the earlier comparison, group aggregation can meet them.
Specifically, suppose we need statistics along two dimensions: hour + category, and hour only, computed together. The GROUPING SETS feature, commonly used in traditional batch processing, is also supported in Realtime Compute for Flink.
We can write GROUP BY GROUPING SETS directly: the first set is the hour-only full aggregate, the second is category + hour, and then compute the number of orders, the total amount, the number of distinct products, and the number of distinct users.
A null-value conversion is added to the result for readability: in the hour-only full aggregate, the output first-level category is null, so it is converted to a placeholder.
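A sketch of such a GROUPING SETS query (again using the hypothetical `dwd_orders` table; the 'ALL' placeholder stands in for the null-value conversion):

```sql
SELECT
  DATE_FORMAT(gmt_create, 'yyyy-MM-dd HH:00')  AS stat_hour,
  COALESCE(CAST(cate_l1_id AS VARCHAR), 'ALL') AS cate_l1,  -- null-value conversion
  COUNT(*)                   AS order_cnt,
  SUM(pay_amount)            AS pay_total,
  COUNT(DISTINCT product_id) AS product_cnt,
  COUNT(DISTINCT user_id)    AS user_cnt
FROM dwd_orders
GROUP BY GROUPING SETS (
  (DATE_FORMAT(gmt_create, 'yyyy-MM-dd HH:00')),             -- hour only
  (DATE_FORMAT(gmt_create, 'yyyy-MM-dd HH:00'), cate_l1_id)  -- hour + category
);
```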
The above is the debugging run. You can see that the data produced by datagen updates the first-level categories and their corresponding hours in real time.
You can also see that the results of the two different group-bys are output together, with an 'ALL' column in the middle produced by the null-value conversion, representing the full-aggregate statistics. Local debugging is intuitive and convenient; if you are interested, you can apply for or purchase a trial on Alibaba Cloud's official website.
(6) Example 2-3: day-level cumulative transaction quasi-real-time statistics
The third example is day-level cumulative transaction statistics. The business requirement is quasi-real-time: a minute-level update delay is acceptable, for example.
Following the hourly group aggregation just shown, it is easy to see that this requirement could be met by simply changing the query to the day dimension; the real-time performance would be high, with second-level updates once data arrives.
But recall the difference between window and group aggregation in state handling: window aggregation cleans up state automatically, while group aggregation requires users to tune the TTL. Since the business requirement here is only quasi-real-time, there is an alternative: use the newly introduced cumulate window, accumulating at the day level with a minute-level step, to achieve quasi-real-time updates every minute.
Reviewing the cumulate window as shown above: for day-level accumulation, the maximum window size is one day and the window step is one minute, which expresses day-level cumulative statistics.
The specific query is above. It uses the new TVF syntax, wrapping the window definition with a TABLE keyword: the CUMULATE window references the input table, then defines its time attribute, step, and size parameters. The GROUP BY is written as usual. Because it emits early results, we print the window start and end time together.
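A sketch of that cumulate-window query (same hypothetical `dwd_orders` table; in the CUMULATE TVF the step comes before the maximum size):

```sql
SELECT
  window_start, window_end,
  COUNT(*)                   AS order_cnt,
  SUM(pay_amount)            AS pay_total,
  COUNT(DISTINCT product_id) AS product_cnt,
  COUNT(DISTINCT user_id)    AS user_cnt
FROM TABLE(
  CUMULATE(TABLE dwd_orders, DESCRIPTOR(gmt_create),
           INTERVAL '1' MINUTE,  -- step: emit an update every minute
           INTERVAL '1' DAY))    -- max size: accumulate over a day
GROUP BY window_start, window_end;
```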
This example also looks at the log output through online operation.
You can see its structure is similar to the earlier tumble window: pre-aggregation plus global aggregation. The difference from the tumble window is that it does not need to wait for all of the window's data to arrive.
- Run log – observe debugging results
As seen above, at 20:47:00 the result accumulated from 00:00:00 to 20:47:00 is output, with the four corresponding columns of statistics. The next output corresponds to the next cumulative window: 20:47:00 to 20:48:00 is one cumulative step, satisfying both the day-level cumulative statistics and the quasi-real-time requirement.
(7) Example summary: e-commerce transaction data – real-time data warehouse scenario
Let's summarize the examples above as a whole.
The cleaning from the access layer to the detail layer is relatively simple and clear: the business logic applies fixed filtering conditions and dimension enrichment, all very explicit and direct.
From the detail layer to the summary layer: for the minute-level statistics in the example we used the tumble window; the hour-level statistics switched to group aggregation because of the real-time requirement; and the day-level accumulation demonstrated both group aggregation and the newly introduced cumulate window.
From the computational characteristics of the summary layer, we need to pay attention to the business's real-time and data-accuracy requirements, and then choose window aggregation or group aggregation accordingly.
Why mention data accuracy here?
When comparing window aggregation and group aggregation at the beginning, we mentioned that group aggregation has very good real-time performance, but its data accuracy depends on the state TTL: when the statistical period is longer than the TTL, the data may be distorted.
Conversely, window aggregation has an upper bound on out-of-order tolerance, such as waiting at most one minute. In real business data, perhaps 99% of records meet this bound while 1% arrive an hour late. With watermark-based processing, the default strategy is to discard data beyond the maximum offset, excluding it from the statistics, so accuracy is also lost there. Accuracy is therefore a relative metric and must be weighed against the specific business scenario.
Common development problems and solutions
(1) Common problems in development
Above are the high-frequency problems encountered in real-time computing in real business.
The first is how to get started with real-time computing at all. For example, some students have a batch-processing background and have just begun using Flink SQL, and do not know where to begin.
Another kind of problem: the SQL is written and the scale of data input and processing is clear, but it is unclear how many resources the real-time job needs.
Another: the SQL is complex and needs debugging, for example to find out why the computed data does not meet expectations; many students do not know where to start.
How to tune a job once it is running is also a very frequent question.
(2) Solutions to common development problems
1. How to get started with real-time computing?
For getting started, the community has extensive official documentation with examples. You can start with simple examples and gradually understand how different SQL operators behave in stream computing.
You can also follow the Realtime Compute for Apache Flink community site ververica.cn and the Apache Flink official accounts.
Once you are familiar with SQL and want to apply it in production to solve real business problems, Alibaba Cloud's industry solutions also provide some typical architecture designs for reference.
2. How to debug complex operations?
If you encounter thousands of lines of complex SQL, even Flink developers cannot locate a problem at a glance. You still need to go from simple to complex, using debugging tools such as the platform's debug feature demonstrated earlier, and verify the SQL segment by segment: after validating the correctness of each small SQL fragment locally, assemble them step by step until the complex job finally meets the correctness requirements.
In addition, you can use SQL syntax features to organize the code more clearly. Realtime Compute Flink has a code-structure feature that makes it easy to locate specific statements in long SQL. These are all auxiliary tools.
3. How to tune the initial resource settings of a job?
In our experience, first run a small-concurrency test on the input data to observe its performance, then extrapolate. For large-concurrency stress testing, gradually approaching the required throughput is a direct but reliable way to arrive at the expected performance configuration.
Tuning is mainly based on how the job runs. Pay attention to key indicators: whether there is data skew, and whether the lookup join of dimension tables, which accesses external storage, has an I/O bottleneck. These are common bottlenecks affecting job performance.
The Realtime Compute Flink product integrates a feature called AutoPilot, which can be understood as something like autonomous driving. With it, initial resource settings are no longer a troublesome problem.
In the product, after setting the job's maximum resource limit, the engine can automatically adjust to the optimal state based on the actual data volume, scaling with the load.
This article is the original content of Alibaba cloud and cannot be reproduced without permission.