1. Background
What is zipper table? When building a data warehouse, it is an important way to process table data. Data structure can be compared with algorithm, and zipper table can be compared with data warehouse. The purpose is to solve the SCD requirements in building a data warehouse. Then what is SCD? It is a slowly changing dimension. As time goes by, the data changes slowly relative to the fact table.
The common processing methods of SCD are as follows:
-
Keep the original value
-
Direct coverage
-
Add new attribute column
-
Snapshot table
-
Zipper Watch
This paper mainly explains the zipper table to deal with the SCD problem, its characteristics are summarized as follows, there are several scenarios, you can use the zipper table.
1. There is a large amount of data in the scale, so using the full scale will take up a lot of storage
2. The table data will be modified. It is difficult to deal with duplicate and modified data by using incremental table
3. Need to know the full data of a certain time point in history
4. The data has been modified, but the frequency and quantity are not very large. For example, only one in a million has been modified
2. Zipper table processing theory
First of all, zipper table is a full scale and not a partition table. In order to achieve the various effects described above, an intermediate table must be used as the intermediate springboard. The intermediate springboard table is a partition table, and the data is incremental data. The incremental content includes modification and addition, that is, it is often a partition tablecreate_time or update_time
In the current day, for zipper table, you need to add two fields that have nothing to do with the original data to identify the data start time and effective end time. In the example, the two dates arestart_date
andend_date
Zipper table can be processed in three ways: initialization, updating data every day and rolling back data.
2.1 initialization and new data
The daily rolling mode is as follows:
The initialization part is the start time of the zipper full scale, and it also establishes the earliest time when the zipper can be rolled back. The daily update logic is as shown in the figure above. The new data will be divided into two parts. One part is the new data added every day. When there is the same changed or unchanged data in the partition of the day, modify the corresponding datastart_date
andend_date
The data can be updated.
2.1 data rollback
For the above update logic, let’s consider how to roll back the data, that is, to go back to a certain point in history. For zipper tables, it is a full scale, so only one roll back is enough. The rollback strategy can be generated according to the rollback time point and datastart_date
andend_date
Let’s take a look at the following diagram:
stayend_date < rollback_date
The data should be retained for processingend_date ≥ rollback_date ≥ start_date
set upend_date
by9999-12-31
For the result of rollback, generally, in order to maintain the integrity of the data, you can put the rolled back data in a new zipper temporary table.
3. Zipper table handling cases
For the common layered dim of data warehouse, that is, the dimension layer is a common scenario of zipper table. Here is an example to see how to add and roll back zipper table.
This paper uses zipper table to implement dim layer business dimension table in core transaction analysis, and realizes the rollback of zipper table.
3.1 creating tables and importing data
The structure of business dimension table is as follows:
--Create business information table (incremental table partition table)
drop table if exists ods.ods_trade_shops;
create table ods.ods_trade_shops(
`Shop ID ',
`Userid ` int comment 'shop manager',
`Areaid ` int comment 'area ID',
`Shopname ` string comment 'shop name',
`Shoplevel ` int comment 'shop level',
`Status ` int comment 'shop status',
`Create time ` string comment 'create date',
`Modifytime ` string comment 'modifydate'
)Comment 'merchant information table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ',';
--Create business information dimension table
drop table if exists dim.dim_trade_shops;
create table dim.dim_trade_shops(
`Shop ID ',
`Userid ` int comment 'shop manager',
`Areaid ` int comment 'area ID',
`Shopname ` string comment 'shop name',
`Shoplevel ` int comment 'shop level',
`Status ` int comment 'shop status',
`Create time ` string comment 'create date',
`Modifytime ` string comment 'modifydate',
`StartDate ` string comment 'effective start date',
`Enddate ` string comment 'expiration end date'
)Comment 'merchant information table';
Import the following test data:
/root/data/shop-2020-11-20.dat
100050, 1100225, wsxxx supermarket, 1,12020-06-282020-11-20 13:22:22
100052, 2100236, fresh XXX flagship store, 1,12020-06-282020-11-20 13:22:22
100053,3100011, Huawei XXX flagship store, 1,12020-06-282020-11-20 13:22:22
100054, 4100159, Xiaomi XXX flagship store, 1,12020-06-282020-11-20 13:22:22
100055,5100211, apple XXX flagship store, 1,12020-06-282020-11-20 13:22:22
/root/data/shop-2020-11-21.dat
100057, 7100311, three XXX mice snacks, 1,12020-06-282020-11-21 13:22:22
100058, 8100329, Liangzi XXX shop, 1,12020-06-282020-11-21 13:22:22
100054, 4100159, Xiaomi XXX flagship store, 2,12020-06-282020-11-21 13:22:22
100055,5100211, apple XXX flagship store, 2,12020-06-282020-11-21 13:22:22
/root/data/shop-2020-11-22.dat
100059, 9100225, Leju XXX daily necessities, 1,12020-06-282020-11-22 13:22:22
100060,10100211, Tongren XXX health, 1,12020-06-282020-11-22 13:22:22
100052, 2100236, fresh XXX flagship store, 1,22020-06-282020-11-22 13:22:22
load data local inpath '/root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-20');
load data local inpath '/root/data/shop-2020-11-21.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-21');
load data local inpath '/root/data/shop-2020-11-22.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-22');
3.2 zipper table initialization
Let’s assume that the first day’s data is all the data in history
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
ELSE substr(createtime, 0, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt ='2020-11-20';
3.3 update zipper table
For incremental tables, the general logic is,create_time
perhapsmodifytime
As the partition of the daydt
,modifytime
Greater than or equal tocreate_time
Here are the first two
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
ELSE substr(createtime, 0, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '2020-11-21'
UNION ALL
SELECT b.shopid,
b.userid,
b.areaid,
b.shopname,
b.shoplevel,
b.status,
b.createtime,
b.modifytime,
b.startdate,
CASE
WHEN a.shopid IS NOT NULL
AND b.enddate ='9999-12-31' THEN date_add('2020-11-21', -1)
ELSE b.enddate
END AS enddate
FROM
(SELECT *
FROM ods.ods_trade_shops
WHERE dt='2020-11-21') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
The script to load the zipper table is as follows:
dim_load_shops.sh
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
CASE
WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
ELSE substr(createtime, 0, 10)
END AS startdate,
'9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '$do_date'
UNION ALL
SELECT b.shopid,
b.userid,
b.areaid,
b.shopname,
b.shoplevel,
b.status,
b.createtime,
b.modifytime,
b.startdate,
CASE
WHEN a.shopid IS NOT NULL
AND b.enddate ='9999-12-31' THEN date_add('$do_date', -1)
ELSE b.enddate
END AS enddate
FROM
(SELECT *
FROM ods.ods_trade_shops
WHERE dt='$do_date') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
"
hive -e "$sql"
You can perform this script load2020-12-22
Data from,sh dim_load_shops.sh 2020-12-22
3.4 rollback zipper table to a certain point in time
First, create a temporary table,tmp.shops_tmp
Data used for rollback
DROP TABLE IF EXISTS tmp.shops_tmp;
CREATE TABLE IF NOT EXISTS tmp.tmp_shops AS
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
startdate,
enddate
FROM dim.dim_trade_shops
WHERE enddate < '2020-11-21'
UNION ALL
SELECT shopid,
userid,
areaid,
shopname,
shoplevel,
status,
createtime,
modifytime,
startdate,
'9999-12-31' AS enddate
FROM dim.dim_trade_shops
WHERE startdate <= '2020-11-21'
AND enddate >= '2020-11-21';
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT *
FROM tmp.tmp_shops;
The rollback script is similar to the update script. You only need to update the SQL in the script, which will not be repeated here.
Wu Xie, little third master, is a rookie in the field of big data and artificial intelligence.
Please pay more attention