Zipper table of data warehouse

Time:2021-4-14

Zipper WatchIt is defined according to the way of table storing data in data warehouse design. As the name suggests, the so-called zipper is to record history.Record all the changes of a thing from the beginning to the current state.

The following is a zipper table, which stores the most basic information of users and the life cycle of each record. We can use this table to get the latest data of the day and the previous historical data.

Date of registration User number phone number t_start_date t_end_date
2017-01-01 001 111111 2017-01-01 9999-12-31
2017-01-01 002 222222 2017-01-01 2017-01-01
2017-01-01 002 233333 2017-01-02 9999-12-31
2017-01-01 003 333333 2017-01-01 9999-12-31
2017-01-01 004 444444 2017-01-01 2017-01-01
2017-01-01 004 432432 2017-01-02 2017-01-02
2017-01-01 004 432432 2017-01-03 9999-12-31
2017-01-02 005 555555 2017-01-02 2017-01-02
2017-01-02 005 115115 2017-01-03 9999-12-31
2017-01-03 006 666666 2017-01-03 9999-12-31

explain:

  • t_start_dateIndicates the start time of the life cycle of the record,t_end_dateIndicates the end time of the life cycle of the record;
  • t_end_date=’9999-12-31′ indicates that the record is currently in a valid state;
  • If all currently valid records are queried, theselect * from user where t_end_date = '9999-12-31'
  • If query2017-01-01Thenselect * from user where t_start_date <= '2017-01-01' and end_date >= '2017-01-01', this statement will query the following records:

The use scene of zipper table

In the data model design process of data warehouse, we often encounter the following table design:

  1. Some tables have a large amount of data. For example, a user table has about 1 billion records and 50 fields. Even if Orc compression is used, the storage of a single table will exceed 100g. In HDFS, double backup or triple backup is even larger.
  2. Some fields in the table will be updated, such as user contact information, product description information, order status, etc.
  3. You need to view the historical snapshot information of a certain time point or period, for example, to view the status of an order at a certain time point in history.
  4. The proportion and frequency of changes recorded in the table are not very large. For example, there are a total of 1 billion users, about 2 million new users and changes occur every day, and the proportion of changes is very small.

For the design of this kind of table? There are several options:

  • Scheme 1: only keep the latest one every day. For example, we use dataX to extract the latest full data into hive every day.
  • Plan 2: keep a full amount of slice data every day.
  • Scheme 3: use zipper table.

Why use zipper watch

Plan 1: only keep the latest one every day

This kind of scheme is needless to say, and its implementation is very simple. Drop the data of the previous day every day, and extract the latest one again.
The advantages are obvious, saving space, some common use is also very convenient, do not need to add a time partition when selecting the table.
The disadvantages are also obvious. Without historical data, you can only turn over the old accounts by other means, such as drawing from the flow chart.

Plan 2: keep a full amount of slice data every day

One full slice per day is a safe plan, and the historical data are also in the process of development.
The disadvantage is that the amount of storage space is too large. If you keep a full amount of the table here every day, then a lot of unchanged information will be saved in each full amount, which is a great waste of storage.
Of course, we can also make some trade-offs, such as only keeping data for nearly a month? However, the demand is shameless, and the life cycle of data is not completely controlled by us.

Scheme 3: zipper table

Zipper watch in the use of the basic consideration of our needs.
First of all, it makes a trade-off in space. Although it does not occupy as much space as scheme 1, its daily increment may be only one thousandth or even one thousandth of scheme 2.
In fact, it can meet the needs of scheme 2. It can not only obtain the latest data, but also add filtering conditions and obtain historical data.
So we still need to use zipper watch.

Design of zipper Watch

Information changes in user table in MySQL relational database

The data in the table on January 1, 2017 are as follows:

Date of registration User number phone number
2017-01-01 001 111111
2017-01-01 002 222222
2017-01-01 003 333333
2017-01-01 004 444444

stay2017-01-02The data in the table is, users002and004The data has been modified,005Yes, new users:

Date of registration User number phone number remarks
2017-01-01 001 111111 nothing
2017-01-01 002 233333 (from 222222 to 2333333)
2017-01-01 003 333333 nothing
2017-01-01 004 432432 (from 444444 to 432432)
2017-01-02 005 555555 (new on January 2, 2017)

stay2017-01-03The data in the table is, users004and005The data has been modified,006Yes, new users:

Date of registration User number phone number remarks
2017-01-01 001 111111
2017-01-01 002 233333
2017-01-01 003 333333
2017-01-01 004 654321 (from 432432 to 654321)
2017-01-02 005 115115 (from 555555 to 115115)
2017-01-03 006 115115 (new on January 3, 2017)

If the history zipper table is designed in the data warehouse to save the table, there will be the following table, which is the data of the latest day (2017-01-03)

Date of registration User number phone number t_start_date t_end_date
2017-01-01 001 111111 2017-01-01 9999-12-31
2017-01-01 002 222222 2017-01-01 2017-01-01
2017-01-01 002 233333 2017-01-02 9999-12-31
2017-01-01 003 333333 2017-01-01 9999-12-31
2017-01-01 004 444444 2017-01-01 2017-01-01
2017-01-01 004 432432 2017-01-02 2017-01-02
2017-01-01 004 432432 2017-01-03 9999-12-31
2017-01-02 005 555555 2017-01-02 2017-01-02
2017-01-02 005 115115 2017-01-03 9999-12-31
2017-01-03 006 666666 2017-01-03 9999-12-31

explain:

  • t_start_dateIndicates the start time of the life cycle of the record,t_end_dateIndicates the end time of the life cycle of the record;
  • t_end_date = '9999-12-31'Indicates that the record is currently in a valid state;
  • If all currently valid records are queried, theselect * from user where t_end_date = '9999-12-31'
  • If query2017-01-01Thenselect * from user where t_start_date <= ‘2017-01-01′ and end_date >= '2017-01-01'

Implementation and update of zipper table

Realization of zipper table in hive

  1. We need oneODSThe user full scale of layer. At least it needs to be used to initialize.
  2. Daily user update table.

Moreover, we need to determine the time granularity of zipper table. For example, zipper table only takes one state every day. That is to say, if there are three state changes in a day, we only take the last state. This kind of day granularity table can actually solve most problems.

Get daily user increments

monitorMysqlChanges in data, such as usingCanalFinally, merge the daily changes to get the final status.
Suppose we get a slice of data every day, we can take the difference between the slice data of two days as the daily update table. In this case, we can update all the fields firstconcatAnd thenmd5That’s OK.
Flow chart, with daily change flow chart

Table structure

odsStratifiedusersurface

CREATE EXTERNAL TABLE ods.user (
  user_ Num string comment 'user number',
  Mobile string comment 'mobile number',
  reg_ Date string comment 'registration date'
Comment 'user profile'
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user';
)

odsStratifieduser_updatesurface

CREATE EXTERNAL TABLE ods.user_update (
  user_ Num string comment 'user number',
  Mobile string comment 'mobile number',
  reg_ Date string comment 'registration date'
Comment 'daily user profile update table'
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user_update';
)

Zipper Watch

CREATE EXTERNAL TABLE dws.user_his (
  user_ Num string comment 'user number',
  Mobile string comment 'mobile number',
  reg_ Date string comment 'user number',
  t_start_date ,
  t_end_date
Comment 'user profile'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/dws/user_his';
)

to update

Suppose it’s initialized2017-01-01And then need to be updated2017-01-02Data for that day

INSERT OVERWRITE TABLE dws.user_his
SELECT * FROM
(
    SELECT A.user_num,
           A.mobile,
           A.reg_date,
           A.t_start_time,
           CASE
                WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
                ELSE A.t_end_time
           END AS t_end_time
    FROM dws.user_his AS A
    LEFT JOIN ods.user_update AS B
    ON A.user_num = B.user_num
UNION
    SELECT C.user_num,
           C.mobile,
           C.reg_date,
           '2017-01-02' AS t_start_time,
           '9999-12-31' AS t_end_time
    FROM ods.user_update AS C
) AS T

supplement

Zipper meter and flow meter

A flow chart stores a user’s change record. For example, in a flow chart, each modification record of a user will be stored in the data of a day, but there is only one record in the zipper table.
This is a granularity problem that should be paid attention to when designing zipper table. Of course, we can also set the granularity smaller, which is generally enough by day.

Query performance

Of course, linked lists will encounter query performance problems. For example, if we store zipper data for five years, then this list will be relatively large. When querying, the performance will be relatively low. I think there are two ways to solve this problem

  1. In some query engines, we use start_ Date and end_ Date index, which can improve a lot of performance.
  2. Keep some historical data. For example, we store a full amount of zipper table data in a table, and then expose a zipper table that only provides data for nearly three months.

Recommended Today

CentOS makes document server based on nginx

CentOS makes document server based on nginx Basic environment for installation yum install gcc-c++ yum install -y pcre pcre-devel yum install -y zlib zlib-devel yum install -y openssl openssl-devel Find the latest stable version of nginx on the official website https://nginx.org/en/download.html Stable version: https://nginx.org/download/nginx-1.18.0.tar.gz cd /home wget -c https://nginx.org/download/nginx-1.18.0.tar.gz Decompress compressed package data tar -zxvf […]