Stepping on the pit | Time zone problems in Flink's day-level window

Date: 2021-09-06

Each article in this series starts from real cases, analyzes problems often encountered in production environments, and throws out a brick to attract jade, hoping to help readers solve practical problems of their own. This article introduces time and time zones in Flink and analyzes the time zone problems encountered with day-level windows. If it helps you, a like + "looking" is welcome~
This article is divided into two parts:
The first part (sections 1-3) analyzes the time zone problems in Flink's day-level window and their solutions.
The second part (section 4) can serve as the analysis approach for time zone problems in general; taking the question of why the time zone offset in the solution is plus 8 hours as its case, it gives an in-depth general analysis.
To give readers an overall picture of the problem discussed here, the problematic SQL and its solution are given first; the detailed analysis follows~

1. Problems and Solutions

Problem SQL

The SQL is very simple: it counts the cumulative UV of the current day.

---------------Pseudo code---------------
INSERT INTO kafka_sink_table
SELECT
  -- window start time
  CAST(TUMBLE_START(proctime, INTERVAL '1' DAY) AS BIGINT) AS window_start,
  -- processing time of the current record
  CAST(MAX(proctime) AS BIGINT) AS current_ts,
  -- UV in each bucket
  COUNT(DISTINCT id) AS part_daily_full_uv
FROM kafka_source_table
GROUP BY
  mod(id, bucket_number), -- bucket_number is a constant; pick a concrete value for your scenario
  TUMBLE(proctime, INTERVAL '1' DAY)
---------------Pseudo code---------------

Can you see the problem with this SQL at a glance? (PS: the data source and data sink are in the East 8th zone, UTC+8)
Yes, the day-level window has a time zone problem: this code does not count the UV over a whole day in the East 8th zone (UTC+8) where the author is located. The "whole day" of this code actually runs from 8 a.m. on one day to 8 a.m. on the next.

Solution

The author's current time zone is the East 8th zone (Beijing time), and the solution is as follows:

---------------Pseudo code---------------
CREATE VIEW view_table AS
SELECT
  id,
  -- solve it via an injection time: add the 8-hour offset of the East 8th zone
  -- and use the injection time as the timestamp column
  CAST(CURRENT_TIMESTAMP AS BIGINT) * 1000 + 8 * 60 * 60 * 1000 AS ingest_time
FROM
  source_table;

INSERT INTO target_table
SELECT
  CAST(TUMBLE_START(ingest_time, INTERVAL '1' DAY) AS BIGINT) AS window_start,
  CAST(MAX(ingest_time) AS BIGINT) - 8 * 3600 * 1000 AS current_ts,
  COUNT(DISTINCT id) AS part_daily_full_uv
FROM
  view_table
GROUP BY
  mod(id, 1024),
  -- divide the day-level window by the (shifted) injection time
  TUMBLE(ingest_time, INTERVAL '1' DAY)
---------------Pseudo code---------------

With the scheme above, the statistics range is adjusted to run from 0:00 today to 0:00 tomorrow in the East 8th zone. The full requirement scenario, the implementation, and the analysis of the solution are described in detail below.

2. Requirement scenario and implementation scheme

Requirement scenario

Here it comes. The requirement scenario is fairly simple: consume an upstream tracking-log (buried-point) data source and, based on the id in the log, count the cumulative UV from 0:00 of the current day up to the current time, emitting results at minute granularity to a downstream OLAP engine for simple aggregation. The result is finally shown on a BI dashboard, without a single dimension field (moved to tears).

Data link and component selection

Client user-behavior tracking log -> log server -> Kafka -> Flink (SQL) -> Kafka -> Druid -> BI dashboard.
There are many possible implementation schemes; this time the SQL API is used.

flink sql schema

The source and sink table schemas are as follows (only key fields are kept):

---------------Pseudo code---------------
CREATE TABLE kafka_sink_table (
  -- start time of the day-level window
  window_start BIGINT,
  -- processing time of the current record
  current_ts BIGINT,
  -- UV in each bucket (ids are divided into buckets during processing)
  part_daily_full_uv BIGINT
) WITH (
  -- ...
);

CREATE TABLE kafka_source_table (
  -- ...
  -- id used for the UV calculation
  id BIGINT,
  -- processing time
  proctime AS PROCTIME()
) WITH (
  -- ...
);
---------------Pseudo code---------------

flink sql transform

---------------Pseudo code---------------
INSERT INTO kafka_sink_table
SELECT
  -- window start time
  CAST(TUMBLE_START(proctime, INTERVAL '1' DAY) AS BIGINT) AS window_start,
  -- processing time of the current record
  CAST(MAX(proctime) AS BIGINT) AS current_ts,
  -- UV in each bucket
  COUNT(DISTINCT id) AS part_daily_full_uv
FROM kafka_source_table
GROUP BY
  mod(id, bucket_number), -- bucket_number is a constant; pick a concrete value for your scenario
  TUMBLE(proctime, INTERVAL '1' DAY)
---------------Pseudo code---------------

We use the early-fire mechanism (the equivalent of ContinuousProcessingTimeTrigger in the DataStream API) and set the trigger interval to 60 s.
In the SQL implementation above, we divide the ids into buckets, so the number of records emitted per minute is bucket_number; Druid then sums the data of all buckets at minute granularity to obtain the cumulative UV from 0:00 of the day up to the current minute. A sketch of enabling early fire follows.
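For reference, below is a minimal sketch of how early fire can be switched on from the Table API side. It assumes the Blink planner's experimental emit options (table.exec.emit.early-fire.enabled / table.exec.emit.early-fire.delay), which are not part of the stable documented API, so verify them against your Flink version:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EarlyFireSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // experimental options (assumption: present in your Flink version):
        // periodically emit partial window results instead of waiting for window end
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.emit.early-fire.enabled", "true");
        // fire every 60 s, matching a ContinuousProcessingTimeTrigger of 60 s
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.emit.early-fire.delay", "60 s");
        // tEnv.executeSql("INSERT INTO kafka_sink_table SELECT ...");
    }
}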

Time zone problem

A passionate scene restoration:

Tech guy: with SQL this is an easy game; I can sit back and slack off

Tech guy: at 00:00 I find the metric is still rising. Is the SQL logic wrong? It shouldn't be; after trying for a few minutes, the hour-level window doesn't have this problem

Tech guy: scratching my head... forget it, I'll analyze it later; there is still business work to do now

Tech guy: the next morning I glance at the time-series report and find that at 08:00 the metric resets to zero and starts accumulating again. A thought flashes by: the East 8th zone! (Why didn't I just look at the formatted window_start in the sink data sooner...)

3. Locating the problem

Problem description

Flink's notion of time is based on the Java epoch (i.e., Greenwich 1970/01/01 00:00:00, i.e., UNIX timestamp 0), and window alignment and triggering are also based on the Java epoch. A small sketch of this alignment follows.
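To make the alignment concrete, here is a small Java sketch that mirrors the window-start formula used in Flink's TimeWindow#getWindowStartWithOffset and applies it to one of the timestamps from the tables below:

public class WindowAlignment {
    // same formula as Flink's TimeWindow#getWindowStartWithOffset
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        long oneDay = 24 * 3600 * 1000L;
        long ts = 1599091140000L; // 2020/09/02 23:59:00 UTC = 2020/09/03 07:59:00 Beijing
        // default offset 0: day windows are aligned to the epoch, i.e. UTC midnight,
        // so this window starts at 2020/09/02 00:00:00 UTC = 08:00:00 Beijing
        System.out.println(windowStart(ts, 0, oneDay));                 // 1599004800000
        // an offset of -8 hours aligns day windows to Beijing midnight instead
        System.out.println(windowStart(ts, -8 * 3600 * 1000L, oneDay)); // 1599062400000
    }
}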

Problem scenario reproduction

You can draw the conclusion above directly by looking at the window_start of the sink data.
However, to restore the whole process, we reproduce the problem with the following source and sink data:
The source data is as follows:

id  proctime       proctime formatted in UTC+0 (Greenwich)  proctime formatted in UTC+8 (Beijing)
1   1599091140000  2020/09/02 23:59:00                      2020/09/03 07:59:00
2   1599091140000  2020/09/02 23:59:00                      2020/09/03 07:59:00
3   1599091140000  2020/09/02 23:59:00                      2020/09/03 07:59:00
1   1599091200000  2020/09/03 00:00:00                      2020/09/03 08:00:00
2   1599091200000  2020/09/03 00:00:00                      2020/09/03 08:00:00
3   1599091260000  2020/09/03 00:01:00                      2020/09/03 08:01:00

The sink data is as follows (for ease of understanding, it is shown after Druid aggregation):

window_start   current_ts     part_daily_full_uv  window_start in UTC+8 (Beijing)  current_ts in UTC+8 (Beijing)
1599004800000  1599091140000  3                   2020/09/02 08:00:00              2020/09/03 07:59:00
1599091200000  1599091200000  2                   2020/09/03 08:00:00              2020/09/03 08:00:00
1599091200000  1599091260000  3                   2020/09/03 08:00:00              2020/09/03 08:01:00

From the data above, we can see that the start time of the day-level window is 8 a.m. every day in UTC+8 (Beijing), i.e., 0 a.m. in UTC+0 (Greenwich).
The solution is given below, followed by a detailed analysis of each time and time zone concept~

Solution

SQL-level solution

---------------Pseudo code---------------
CREATE VIEW view_table AS
SELECT
  id,
  -- solve it via an injection time: add the 8-hour offset of the East 8th zone
  -- and use the injection time as the timestamp column
  CAST(CURRENT_TIMESTAMP AS BIGINT) * 1000 + 8 * 60 * 60 * 1000 AS ingest_time
FROM
  source_table;

INSERT INTO target_table
SELECT
  CAST(TUMBLE_START(ingest_time, INTERVAL '1' DAY) AS BIGINT) AS window_start,
  CAST(MAX(ingest_time) AS BIGINT) - 8 * 3600 * 1000 AS current_ts,
  COUNT(DISTINCT id) AS part_daily_full_uv
FROM
  view_table
GROUP BY
  mod(id, 1024),
  -- divide the day-level window by the (shifted) injection time
  TUMBLE(ingest_time, INTERVAL '1' DAY)
---------------Pseudo code---------------

The time zone I currently belong to is the East 8th zone (Beijing time). The problem can be solved by setting an injection time as in the SQL above and dividing the day-level window after adding an 8-hour offset to the injection time (or, equivalently, by adding a corresponding injection-timestamp computed column to the schema in the CREATE TABLE statement). If you have a better SQL-level solution, discussion is welcome~

Notes:

  • For an East n zone (UTC+n), the fix is an offset of timestamp + n * 3600 seconds; for a West n zone (UTC-n), it is timestamp - n * 3600 seconds
  • The DataStream API has the same day-level window time zone problem; see the sketch below
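In the DataStream API, the same effect can be achieved directly through the window assigner's offset parameter. A minimal sketch, with a made-up in-memory source and a placeholder aggregation standing in for the real Kafka source and UV logic:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DayWindowWithOffset {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1L, 2L, 3L)          // hypothetical stand-in for the Kafka source
           .keyBy(id -> id % 1024)            // same bucketing idea as mod(id, 1024) in the SQL
           // Time.hours(-8) shifts the epoch-aligned day window so that it starts
           // at 00:00 Beijing time (UTC+8) instead of 00:00 UTC
           .window(TumblingProcessingTimeWindows.of(Time.days(1), Time.hours(-8)))
           .sum(0)                            // placeholder aggregation, not a real UV count
           .print();
        // with this toy bounded source the job ends before the day window fires;
        // the sketch only shows how the assigner is wired up
        env.execute();
    }
}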

Here is a question: why does the East 8th zone need to add an 8-hour offset to the timestamp for day-level window calculation, rather than subtract 8 hours or add 32 (24 + 8) hours? Have you analyzed it in detail?~
This question leads to the second part of this article: an in-depth analysis of time zone offsets, which can serve as the analysis approach for all time zone problems.

4. Why does the East 8th zone add 8 hours?

Basic concepts of time and time zone

Time zone: because countries and regions around the world lie at different longitudes and therefore have different local times, the world is divided into different time zones.
UNIX timestamp: UNIX time, or POSIX time, is a representation of time defined as the total number of seconds since 00:00:00 UTC/GMT on January 1, 1970 (midnight).
The UNIX timestamp is used not only on UNIX and UNIX-like systems but also widely on many other operating systems.
GMT: Greenwich Mean Time, the time obtained from astronomical observations at the Royal Greenwich Observatory in Britain. It is Greenwich local time and was once regarded as the world standard time.
UT: Universal Time, derived from astronomical observation of the Earth's rotation.
UTC: Coordinated Universal Time. Because the Earth rotates more and more slowly, astronomy-based time falls a few tenths of a second further behind atomic-clock time each year. Every few years the coordinating body inserts a leap second (+1 s) so that the atomic-clock-based time scale and the astronomy-based (humanly perceived) Greenwich time do not drift too far apart. The resulting time scale is called UTC, the world standard time in use today.
Coordinated Universal Time is not tied to any location and does not by itself represent the current local time anywhere, so a time zone must be attached when stating the time of a place. Strictly speaking, GMT is not equal to UTC but to UTC+0; Greenwich simply happens to lie in time zone 0.

Time and time zones in plain language

After reading that pile of time and time zone definitions, my brain actually went blank... With my current understanding, let me try to explain all of the above time-related concepts in plain language.

  • GMT: Greenwich Mean Time.
  • UTC: the world standard time after atomic-clock coordination. UTC can be treated as consistent with Greenwich Mean Time, i.e., GMT = UTC+0, where 0 denotes Greenwich's time zone 0.
  • Time zone: explained by thinking in reverse (purely at the technical level, leaving other complications aside). If there were no time zone division, the whole world would share one time zone, so everyone would see the same clock time at the same instant. For example, if the whole world used Greenwich Mean Time, then at 0:00 GMT a student in Beijing would perceive the sun just rising (early morning) while a student in Canada would perceive the sun just having set (evening).

However, without time zone division, both students' clocks would read 0 o'clock, which conflicts with how humans expect perceived time and clock time to agree. Once time zones are divided, the student in Beijing (East 8, UTC+8) sees 8 a.m. while the student in Canada (West 4, UTC-4) sees 8 p.m. Note that the division of time zones is bound to UTC: East 8 is UTC+8.

  • Flink time: the time Flink uses is based on the Java epoch (GMT 1970/01/01 00:00:00, i.e., UTC+0 1970/01/01 00:00:00).
  • UNIX timestamp: anywhere in the world, data received at the same instant carries the same UNIX timestamp; at any given moment the UNIX timestamp is the same worldwide, regardless of time zone.
  • UNIX timestamp 0: corresponds to Greenwich Mean Time 1970-01-01 00:00:00 and to Beijing time (East 8, UTC+8) 1970-01-01 08:00:00, as the Java check below confirms.
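This last point is easy to verify with the JDK's java.time API. A quick check:

import java.time.Instant;
import java.time.ZoneId;

public class EpochInZones {
    public static void main(String[] args) {
        // UNIX timestamp 0 rendered in two time zones
        System.out.println(Instant.EPOCH.atZone(ZoneId.of("UTC")));
        // -> 1970-01-01T00:00Z[UTC]
        System.out.println(Instant.EPOCH.atZone(ZoneId.of("Asia/Shanghai")));
        // -> 1970-01-01T08:00+08:00[Asia/Shanghai]
    }
}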

The conceptual relationship is shown in the figure:
[Figure: relationship among GMT, UTC, time zones, and the UNIX timestamp]

Why does the East 8th zone add 8 hours?

The following table lists only a few important instants:

[Table: a few important instants shown as Beijing time alongside the corresponding UNIX timestamps]

Take the first row as an example: it represents a record generated at 00:00:00 Beijing time on January 1, 1970; the UNIX timestamp carried by that record is -8 * 3600.
From the requirement, the figure above, and the table above, we can get the following derivation:

  • The requirement is to count the UV of a whole day, i.e., a day-level window. For example, to count data in the range 1970/01/01 00:00:00 - 1970/01/02 00:00:00 Beijing time, the UNIX timestamps carried by data in that date range run from -8 * 3600 to 16 * 3600
  • For Flink, by default, the full-day UNIX timestamp range it counts is 0 to 24 * 3600
  • Therefore, for Flink to correctly count the data in the Beijing-time range 1970/01/01 00:00:00 - 1970/01/02 00:00:00, i.e., the data whose UNIX timestamps run from -8 * 3600 to 16 * 3600, we need to map the timestamps
  • The mapping: translate the whole range along the time axis, so that -8 * 3600 maps to 0 and 16 * 3600 maps to 24 * 3600. This is equivalent to adding 8 * 3600 to the Beijing-time UNIX timestamp
  • Finally, subtract the added 8 hours from any timestamp that is emitted (because display time is formatted from the UNIX timestamp according to the local time zone). A Java sketch of this round trip follows the list
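Putting the derivation together, here is a minimal Java sketch of the full round trip (shift forward, window, shift back); the sample timestamp is taken from the tables in section 3:

public class ShiftRoundTrip {
    static final long HOUR = 3600 * 1000L;
    static final long DAY = 24 * HOUR;

    public static void main(String[] args) {
        long ts = 1599091140000L; // 2020/09/03 07:59:00 Beijing time
        // 1) shift forward by 8 hours so that Beijing midnight lands on a
        //    multiple of DAY (this is what the ingest_time column does)
        long shifted = ts + 8 * HOUR;
        // 2) epoch-aligned day window on the shifted timestamp
        long windowStart = shifted - shifted % DAY;
        // 3) shift back by 8 hours before emitting (as current_ts does in the SQL)
        long emitted = windowStart - 8 * HOUR;
        System.out.println(emitted); // 1599062400000 = 2020/09/03 00:00:00 Beijing
    }
}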

Notes:

  • Can you add 32 hours instead? Yes. For dividing day-level windows in the East 8th zone, adding 8 hours and adding 8 + n * 24 hours (n an integer) have exactly the same effect: Flink still puts a whole UTC+8 day's data into one day-level window. So adding 32 (8 + 24), 56 (8 + 48), or -16 (8 - 24) hours all work the same. The example above simply chooses the smallest translation along the time axis, i.e., 8 hours. Note that on some systems exceptions occur when the UNIX timestamp becomes negative
  • This reasoning process applies to every scenario with time zone problems; if you meet this problem in another application scenario, you can solve it in the same way

Appendix

Given a UNIX timestamp, find the UNIX timestamp of the corresponding day's 0:00 a.m. in the East 8th zone (UTC+8):

public static final long ONE_DAY_MILLS = 24 * 60 * 60 * 1000L;

// returns the UNIX timestamp (ms) of the most recent 00:00:00 Beijing time
public static long transform(long timestamp) {
    // shifting by +8 h aligns Beijing midnight with a multiple of ONE_DAY_MILLS
    return timestamp - (timestamp + 8 * 60 * 60 * 1000) % ONE_DAY_MILLS;
}
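One caveat with this helper: Java's % operator returns a negative remainder when its left operand is negative, so for timestamps far enough before the epoch the result lands on the next Beijing midnight instead of the previous one (the negative-timestamp exception mentioned in the notes above). A variant using Math.floorMod, which always returns a non-negative remainder, avoids this; the name transformSafe is mine, not from the original code:

public static long transformSafe(long timestamp) {
    // Math.floorMod never returns a negative remainder, so pre-1970
    // timestamps are floored to the previous Beijing midnight as well
    return timestamp - Math.floorMod(timestamp + 8 * 60 * 60 * 1000, ONE_DAY_MILLS);
}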

5. Summary

This article first introduced the SQL and gave the problem and solution directly.
Starting from the requirement scenario and the implementation of the whole data link, section 2 explained how we used Flink SQL to implement the requirement, which then surfaced the time zone problem of the day-level window in SQL.
Section 3 confirmed the cause of the day-level window's time zone problem, namely that Flink is based on the Java epoch, and gave the solution at the SQL level. It also raised a question: why does our solution add an 8-hour offset?
Section 4 analyzed why the 8-hour offset is added, and explained in detail the relationships among time zones, UTC, GMT, and the UNIX timestamp.
The last section summarized the article.
If you have a more intuitive way to understand the time zone offset, please leave a message~

Reply with the keyword "Flink" in the official account (mangodata) to get Flink learning materials and videos.
