Big data development – an overview of the data warehouse zipper table, and how to update and roll it back

Time:2021-2-9

1. Background

What is a zipper table? It is an important technique for processing table data when building a data warehouse: just as data structures go hand in hand with algorithms, zipper tables go hand in hand with data warehouses. Their purpose is to meet the SCD requirements that arise when building a warehouse. What is SCD, then? It stands for slowly changing dimension: dimension data that, relative to the fact table, changes slowly over time.

The common ways of handling SCD are as follows:

  • Keep the original value

  • Direct coverage

  • Add new attribute column

  • Snapshot table

  • Zipper table

This article focuses on using the zipper table to handle the SCD problem. A zipper table is a good fit when the following conditions hold:

1. The table holds a large amount of data, so keeping a full copy for every day would take up a lot of storage

2. The data in the table gets modified, and duplicated or modified records are hard to handle with an incremental table alone

3. You need to be able to reconstruct the full data as of any point in time in history

4. The data does get modified, but not frequently and not in large volume, for example only one record in a million changes

2. Zipper table processing theory

First of all, a zipper table is a full table, not a partitioned table. To achieve the effects described above, an intermediate table is needed as a springboard. This springboard table is a partitioned table holding incremental data, where the increments include both modified and newly added records; that is, its daily partition is usually made up of records whose create_time or update_time falls on the current day. The zipper table itself needs two extra fields, unrelated to the original data, to mark when each record becomes effective and when it expires; in this example the two dates are start_date and end_date. A zipper table involves three kinds of processing: initialization, daily updates, and data rollback.
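
As a concrete illustration with hypothetical values: a key that changes once is stored as two rows, the old one closed and the new one left open with end_date = 9999-12-31, and a full snapshot as of any date can then be pulled out with a simple range filter (zipper_table is a placeholder name here):

-- Hypothetical contents of a zipper table for one key
-- id   attr   start_date    end_date
-- 1    A      2020-11-20    2020-11-21   <- old version, closed when the record changed
-- 1    B      2020-11-22    9999-12-31   <- current version

-- Full snapshot of the table as it looked on 2020-11-21
SELECT *
FROM zipper_table
WHERE start_date <= '2020-11-21'
  AND end_date >= '2020-11-21';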

2.1 initialization and new data

The daily update flow is as follows:

(Figure: daily update flow of the zipper table)

The initialization step sets the starting point of the full zipper table, and it also fixes the earliest point in time that the zipper table can be rolled back to. The daily update logic is shown in the figure above: the incoming data splits into two parts. One part is the brand-new records added that day; for records in the day's partition whose keys already exist in the zipper table, whether changed or unchanged, the corresponding start_date and end_date are adjusted so that the data is brought up to date. A minimal, generic sketch of that daily merge is shown below.
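
The sketch assumes a simplified zipper table zip(id, attr, start_date, end_date) and an incremental partitioned table inc(id, attr, dt), both hypothetical; the concrete SQL for the merchant example appears in section 3.3.

-- Sketch only: '2020-11-21' stands for the day being loaded
INSERT OVERWRITE TABLE zip
-- part 1: every record in today's partition becomes an open record
SELECT id,
       attr,
       dt AS start_date,
       '9999-12-31' AS end_date
FROM inc
WHERE dt = '2020-11-21'
UNION ALL
-- part 2: keep the existing zipper rows, closing the open row of any key that reappears today
SELECT b.id,
       b.attr,
       b.start_date,
       CASE
           WHEN a.id IS NOT NULL AND b.end_date = '9999-12-31' THEN date_add('2020-11-21', -1)
           ELSE b.end_date
       END AS end_date
FROM (SELECT * FROM inc WHERE dt = '2020-11-21') a
RIGHT JOIN zip b ON a.id = b.id;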

2.2 data rollback

Given the update logic above, let’s consider how to roll the data back, that is, how to return to a certain point in history. Since a zipper table is a full table, a single rollback pass is enough. The rollback strategy is derived from the rollback time point together with each record’s start_date and end_date. Take a look at the following diagram:

(Figure: zipper table rollback logic)
Records whose end_date < rollback_date are kept as they are. For records where start_date ≤ rollback_date ≤ end_date, end_date is reset to 9999-12-31. As for the rollback result, to keep the data intact it is usually written into a new temporary zipper table rather than overwriting the original directly.
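
To make those rules concrete, take the hypothetical rows below and a rollback to 2020-11-21: rows whose validity ends before the rollback date stay untouched, the row covering the rollback date is reopened, and rows that only became effective afterwards are simply not carried over (which is what the SQL in section 3.4 does):

-- id   start_date    end_date      action when rolling back to 2020-11-21
-- 1    2020-11-10    2020-11-15    kept unchanged (end_date < rollback_date)
-- 2    2020-11-16    2020-11-21    end_date reset to 9999-12-31 (start_date <= rollback_date <= end_date)
-- 3    2020-11-22    9999-12-31    not carried over (start_date > rollback_date)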

3. Zipper table case study

The dim (dimension) layer, one of the common layers in a data warehouse, is a typical place to use a zipper table. The example below shows how to update a zipper table and how to roll it back.

This article uses a zipper table to implement the dim-layer merchant dimension table for core transaction analysis, and then demonstrates rolling the zipper table back.

3.1 creating tables and importing data

The structures of the ODS merchant table and the merchant dimension table are as follows:

--Create merchant information table (incremental, partitioned)
drop table if exists ods.ods_trade_shops;
create table ods.ods_trade_shops(
  `shopid` int comment 'shop ID',
  `userid` int comment 'shop manager',
  `areaid` int comment 'area ID',
  `shopname` string comment 'shop name',
  `shoplevel` int comment 'shop level',
  `status` int comment 'shop status',
  `createtime` string comment 'create date',
  `modifytime` string comment 'modify date'
) COMMENT 'merchant information table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';

--Create merchant information dimension table (zipper table)
drop table if exists dim.dim_trade_shops;
create table dim.dim_trade_shops(
  `shopid` int comment 'shop ID',
  `userid` int comment 'shop manager',
  `areaid` int comment 'area ID',
  `shopname` string comment 'shop name',
  `shoplevel` int comment 'shop level',
  `status` int comment 'shop status',
  `createtime` string comment 'create date',
  `modifytime` string comment 'modify date',
  `startdate` string comment 'effective start date',
  `enddate` string comment 'effective end date'
) COMMENT 'merchant information dimension table';
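
The DDL above assumes the ods, dim and tmp databases already exist; if they do not, a minimal preparatory step would be:

CREATE DATABASE IF NOT EXISTS ods;
CREATE DATABASE IF NOT EXISTS dim;
CREATE DATABASE IF NOT EXISTS tmp;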

Import the following test data:

/root/data/shop-2020-11-20.dat
100050,1,100225,wsxxx supermarket,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,fresh XXX flagship store,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,Huawei XXX flagship store,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,Xiaomi XXX flagship store,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,apple XXX flagship store,1,1,2020-06-28,2020-11-20 13:22:22

/root/data/shop-2020-11-21.dat
100057,7,100311,three XXX mice snacks,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,Liangzi XXX shop,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,Xiaomi XXX flagship store,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,apple XXX flagship store,2,1,2020-06-28,2020-11-21 13:22:22

/root/data/shop-2020-11-22.dat
100059,9,100225,Leju XXX daily necessities,1,1,2020-06-28,2020-11-22 13:22:22
100060,10,100211,Tongren XXX health,1,1,2020-06-28,2020-11-22 13:22:22
100052,2,100236,fresh XXX flagship store,1,2,2020-06-28,2020-11-22 13:22:22

load data local inpath '/root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-20');
load data local inpath '/root/data/shop-2020-11-21.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-21');
load data local inpath '/root/data/shop-2020-11-22.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-22');
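
As an optional sanity check, the row counts per partition should match the three files above (5, 4 and 3 rows respectively):

SELECT dt, count(*) AS cnt
FROM ods.ods_trade_shops
GROUP BY dt;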

3.2 zipper table initialization

Assume that the first day’s data (2020-11-20) represents all of the historical data:

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt ='2020-11-20';
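
Right after initialization every record is still valid, so an optional check should show the five shops from 2020-11-20, each with enddate = '9999-12-31':

SELECT shopid, startdate, enddate
FROM dim.dim_trade_shops
ORDER BY shopid;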

3.3 update zipper table

For the incremental table, the general logic is that a record lands in the day's partition dt according to its create_time or its modifytime, where modifytime is always greater than or equal to create_time; that convention is followed here.

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '2020-11-21'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('2020-11-21', -1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='2020-11-21') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
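
After this update, shops 100054 and 100055 (whose level changed on 2020-11-21) should each have a closed history row plus a new open row; an optional check:

SELECT shopid, shoplevel, startdate, enddate
FROM dim.dim_trade_shops
WHERE shopid IN (100054, 100055)
ORDER BY shopid, startdate;
-- Expected per shop: one row with shoplevel 1 and enddate = '2020-11-20',
-- and one row with shoplevel 2 and enddate = '9999-12-31'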

The script to load the zipper table is as follows:

dim_load_shops.sh

#!/bin/bash
 
source /etc/profile
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
 
sql="
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '$do_date'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('$do_date', -1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='$do_date') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
"
 
hive -e "$sql"

You can run this script to load a given day's incremental data by passing the date as an argument; if no argument is given, it defaults to the previous day.
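
For this walk-through, assuming the script sits in the current directory, the two incremental days are applied in order:

sh dim_load_shops.sh 2020-11-21
sh dim_load_shops.sh 2020-11-22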

3.4 rollback zipper table to a certain point in time

First, create a temporary table tmp.tmp_shops to hold the rolled-back data; here the rollback point is 2020-11-21.

DROP TABLE IF EXISTS tmp.tmp_shops;
CREATE TABLE IF NOT EXISTS tmp.tmp_shops AS
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       enddate
FROM dim.dim_trade_shops
WHERE enddate < '2020-11-21'
UNION ALL
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       '9999-12-31' AS enddate
FROM dim.dim_trade_shops
WHERE startdate <= '2020-11-21'
  AND enddate >= '2020-11-21';


INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT *
FROM tmp.tmp_shops;
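
As an optional check after the rollback, shop 100052 (whose status changed on 2020-11-22) should be back to a single open record:

SELECT shopid, status, startdate, enddate
FROM dim.dim_trade_shops
WHERE shopid = 100052;
-- Expected after rolling back to 2020-11-21: one row with status 1,
-- startdate = '2020-11-20' and enddate = '9999-12-31'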

The rollback script is similar to the update script. You only need to update the SQL in the script, which will not be repeated here.
Wu Xie, "Xiao San Ye", is a rookie in the fields of big data and artificial intelligence.
Please follow for more.