Shang Silicon Valley Data Warehouse in Practice, Part 3: Data Warehouse Construction

Time: 2022-4-4


Data warehouse notes

Detailed explanation of data warehouse and data mart layers (ODS, DW, DWD, DWM, DWS, ADS): https://blog.csdn.net/weixin_42526326/article/details/121633372

Shang Silicon Valley Data Warehouse in Practice, Part 1: Project Requirements and Architecture Design: https://blog.csdn.net/weixin_42526326/article/details/121658388

Shang Silicon Valley Data Warehouse in Practice, Part 2: Warehouse Layering + Dimensional Modeling: https://blog.csdn.net/weixin_42526326/article/details/121658605

Shang Silicon Valley Data Warehouse in Practice, Part 3: Data Warehouse Construction: https://blog.csdn.net/weixin_42526326/article/details/121658635

Shang Silicon Valley Data Warehouse 4.0 video tutorial

Bilibili (direct link): 2021 e-commerce data warehouse V4.0 – big data warehouse project in practice
Baidu Netdisk: https://pan.baidu.com/s/1FGUb8X0Wx7IWAmKXBRwVFg (extraction code: yyds)
Aliyun Drive: https://www.aliyundrive.com/s/F2FuMVePj92 (extraction code: 335o)

Chapter 4 Data Warehouse Construction – ODS Layer

1) Keep the data exactly as it arrives, without any modification, so the layer also serves as a backup of the raw data.

2) Compress the data with LZO to reduce disk usage; 100 GB of raw data can typically be compressed to under 10 GB.

3) Create partitioned tables to avoid full table scans later; partitioned tables are used extensively in enterprise development.

4) Create external tables. In enterprise development, apart from temporary tables (which are usually managed/internal tables), the vast majority of tables are external.

4.2 ODS layer (business data)

The partition planning of the ODS layer business tables is as follows:

[Figure: ODS layer business table partition planning]

The data loading idea for the ODS layer business tables is as follows:

[Figure: ODS layer business table data loading]

4.2.1 activity information table

DROP TABLE IF EXISTS ods_activity_info;
CREATE EXTERNAL TABLE ods_activity_info(
  `id` STRING COMMENT 'id',
  `activity_name` STRING COMMENT 'activity name',
  `activity_type` STRING COMMENT 'activity type',
  `start_time` STRING COMMENT 'start time',
  `end_time` STRING COMMENT 'end time',
  `create_time` STRING COMMENT 'creation time'
) COMMENT 'activity information table'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_activity_info/';
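The loading statement itself is not reproduced in this excerpt (only the figure above); a minimal sketch of how one day's imported data could be moved into the ODS partition is given below. The staging directory /origin_data/gmall/db/activity_info/2020-06-14 is an assumption based on the project's usual Sqoop output naming, not something stated in this section.

-- Move the day's staged files into the corresponding ODS partition (assumed staging path).
LOAD DATA INPATH '/origin_data/gmall/db/activity_info/2020-06-14'
OVERWRITE INTO TABLE ods_activity_info
PARTITION (dt = '2020-06-14');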

Chapter 5 Data Warehouse Construction – DIM Layer

5.1 commodity dimension table (full load)

1. CREATE TABLE statement

DROP TABLE IF EXISTS dim_sku_info;
CREATE EXTERNAL TABLE dim_sku_info (
  `id` STRING COMMENT 'sku id',
  `price` DECIMAL(16,2) COMMENT 'price',
  `sku_name` STRING COMMENT 'sku name',
  `sku_desc` STRING COMMENT 'sku description',
  `weight` DECIMAL(16,2) COMMENT 'weight',
  `is_sale` BOOLEAN COMMENT 'whether the sku is on sale',
  `spu_id` STRING COMMENT 'spu id',
  `spu_name` STRING COMMENT 'spu name',
  `category3_id` STRING COMMENT 'level-3 category id',
  `category3_name` STRING COMMENT 'level-3 category name',
  `category2_id` STRING COMMENT 'level-2 category id',
  `category2_name` STRING COMMENT 'level-2 category name',
  `category1_id` STRING COMMENT 'level-1 category id',
  `category1_name` STRING COMMENT 'level-1 category name',
  `tm_id` STRING COMMENT 'brand id',
  `tm_name` STRING COMMENT 'brand name',
  `sku_attr_values` ARRAY<STRUCT<attr_id:STRING,value_id:STRING,attr_name:STRING,value_name:STRING>> COMMENT 'platform attributes',
  `sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id:STRING,sale_attr_value_id:STRING,sale_attr_name:STRING,sale_attr_value_name:STRING>> COMMENT 'sale attributes',
  `create_time` STRING COMMENT 'creation time'
) COMMENT 'commodity (sku) dimension table'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_sku_info/'
TBLPROPERTIES ("parquet.compression"="lzo");

2. Partition planning

[Figure: dim_sku_info partition planning]

3. Data loading

[Figure: dim_sku_info data loading]
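Since the loading figure is not reproduced here, the much-reduced sketch below illustrates the full-load idea: take the current day's ODS partitions and join the sku table with its related dimension sources. Only the brand join is shown, written as a plain SELECT, because the real statement is an INSERT OVERWRITE ... PARTITION that must supply every column of dim_sku_info; the ODS table and column names (ods_sku_info, ods_base_trademark) are assumed from the project and should be verified.

-- Sketch of the join used to assemble dim_sku_info for one day (brand join only).
SELECT
    sku.id,
    sku.price,
    sku.sku_name,
    sku.tm_id,
    tm.tm_name,        -- brand name pulled in from the trademark table
    sku.create_time
FROM
    (SELECT id, price, sku_name, tm_id, create_time
     FROM ods_sku_info WHERE dt = '2020-06-14') sku
LEFT JOIN
    (SELECT id, tm_name FROM ods_base_trademark WHERE dt = '2020-06-14') tm
ON sku.tm_id = tm.id;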

5.6 user dimension table (zipper table)

5.6.1 zipper table overview

1) What is a zipper table

[Figure: zipper table overview] A zipper table records the life cycle of each record: every row carries a start date and an end date, and rows that are still current use an open-ended end date such as 9999-99-99.

2) Why build a zipper table

[Figure: why build a zipper table] It suits dimensions, such as users, whose records change slowly but do change, where keeping a complete daily snapshot would waste a large amount of storage.

3) How to use a zipper table

[Figure: how to query a zipper table] The currently valid records live in the 9999-99-99 partition; a historical snapshot for a day d can be rebuilt by filtering start_date <= d and end_date >= d.

4) How a zipper table is formed

[Figure: zipper table formation process]

5.6.2 building the zipper table

1. CREATE TABLE statement

DROP TABLE IF EXISTS dim_user_info;
CREATE EXTERNAL TABLE dim_user_info(
  `id` STRING COMMENT 'user id',
  `login_name` STRING COMMENT 'login name',
  `nick_name` STRING COMMENT 'nickname',
  `name` STRING COMMENT 'real name',
  `phone_num` STRING COMMENT 'mobile number',
  `email` STRING COMMENT 'email',
  `user_level` STRING COMMENT 'user level',
  `birthday` STRING COMMENT 'birthday',
  `gender` STRING COMMENT 'gender',
  `create_time` STRING COMMENT 'creation time',
  `operate_time` STRING COMMENT 'operation time',
  `start_date` STRING COMMENT 'start date of validity',
  `end_date` STRING COMMENT 'end date of validity'
) COMMENT 'user dimension table (zipper table)'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dim/dim_user_info/'
TBLPROPERTIES ("parquet.compression"="lzo");

2. Partition planning

[Figure: dim_user_info partition planning]

3. Data loading

[Figure: dim_user_info data loading]

1) First day loading

On the first day the zipper table must be initialized: all historical users up to the initialization day are imported into the zipper table in one pass. The first partition of ods_user_info, i.e. the 2020-06-14 partition, currently holds all historical users, so after some light processing that partition's data can be loaded into the 9999-99-99 partition of the zipper table, as sketched below.
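A minimal sketch of that initialization load, assuming ods_user_info carries the same field names as the dimension table:

-- First-day load: every user becomes a record valid from 2020-06-14 onwards.
INSERT OVERWRITE TABLE dim_user_info PARTITION (dt = '9999-99-99')
SELECT
    id,
    login_name,
    nick_name,
    name,
    phone_num,
    email,
    user_level,
    birthday,
    gender,
    create_time,
    operate_time,
    '2020-06-14',   -- start_date: the initialization day
    '9999-99-99'    -- end_date: still valid
FROM ods_user_info
WHERE dt = '2020-06-14';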

2) Daily loading

(1) Implementation idea

[Figure: daily loading idea for the zipper table]
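A condensed sketch of that daily merge, taking 2020-06-15 as the load day and assuming dynamic partitioning is enabled: the day's changed users from ODS are unioned with the current 9999-99-99 partition, and for each user id the newest record stays open while any superseded record is closed out into the previous day's partition.

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dim_user_info PARTITION (dt)
SELECT
    id, login_name, nick_name, name, phone_num, email, user_level,
    birthday, gender, create_time, operate_time, start_date,
    if(rk = 1, '9999-99-99', date_sub('2020-06-15', 1)) end_date,
    if(rk = 1, '9999-99-99', date_sub('2020-06-15', 1)) dt
FROM
(
    SELECT
        t1.*,
        row_number() OVER (PARTITION BY id ORDER BY start_date DESC) rk
    FROM
    (
        -- today's changed/new users, opened with today's start date
        SELECT id, login_name, nick_name, name, phone_num, email, user_level,
               birthday, gender, create_time, operate_time,
               '2020-06-15' start_date, '9999-99-99' end_date
        FROM ods_user_info WHERE dt = '2020-06-15'
        UNION ALL
        -- records that were valid up to yesterday
        SELECT id, login_name, nick_name, name, phone_num, email, user_level,
               birthday, gender, create_time, operate_time,
               start_date, end_date
        FROM dim_user_info WHERE dt = '9999-99-99'
    ) t1
) t2;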

Chapter 6 Data Warehouse Construction – DWD Layer

1) Parse the user behavior (log) data.

2) Remodel the business data using dimensional modeling.

6.1 DWD layer (user behavior log)

6.1.1 log analysis ideas

1) Log structure review

(1) Page tracking (buried point) log

[Figure: page tracking log structure]

(2) Startup log

[Figure: startup log structure]

2) Log parsing idea

[Figure: log parsing idea]

6.1.2 Using the get_json_object function

MySQL also provides corresponding JSON processing functions, but their performance is not as good as Hive's.
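A couple of quick examples of how get_json_object is called; the second argument is a JSONPath-style expression rooted at $.

SELECT get_json_object('{"name":"zhangsan","age":18}', '$.name');   -- returns zhangsan
SELECT get_json_object('{"name":"zhangsan","age":18}', '$.age');    -- returns 18
SELECT get_json_object('[{"id":1},{"id":2}]', '$[0].id');           -- array subscript, returns 1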

6.1.3 startup log table

Parsing idea for the startup log: each row of the startup log table corresponds to one startup record, which should contain the common information and the startup information from the log. First keep only the log lines that contain the start field, then parse out each field with the get_json_object function.

[Figure: startup log table design]

1) Create table statement

DROP TABLE IF EXISTS dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
  `area_code` STRING COMMENT 'area code',
  `brand` STRING COMMENT 'phone brand',
  `channel` STRING COMMENT 'channel',
  `is_new` STRING COMMENT 'whether this is the first launch',
  `model` STRING COMMENT 'phone model',
  `mid_id` STRING COMMENT 'device id',
  `os` STRING COMMENT 'operating system',
  `user_id` STRING COMMENT 'member id',
  `version_code` STRING COMMENT 'app version code',
  `entry` STRING COMMENT 'entry: icon = app icon, notice = notification, install = launch right after installation',
  `loading_time` BIGINT COMMENT 'startup loading time',
  `open_ad_id` STRING COMMENT 'splash ad id',
  `open_ad_ms` BIGINT COMMENT 'total ad playing time',
  `open_ad_skip_ms` BIGINT COMMENT 'time point at which the user skipped the ad',
  `ts` BIGINT COMMENT 'timestamp'
) COMMENT 'startup log table'
PARTITIONED BY (`dt` STRING)                    -- partition by date
STORED AS PARQUET                               -- Parquet columnar storage
LOCATION '/warehouse/gmall/dwd/dwd_start_log/'  -- storage location on HDFS
TBLPROPERTIES ("parquet.compression"="lzo");    -- LZO compression
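The corresponding load is not shown in this excerpt. The sketch below follows the parsing idea above, assuming the raw logs sit in an ODS table named ods_log with one JSON record per row in a column named line, and that the JSON keys (common.ar, common.ba, ..., start.entry, ...) follow the project's log format; both the table name and the key names should be checked against the actual data.

-- Keep only records that contain a start block, then pull each field out of the JSON.
INSERT OVERWRITE TABLE dwd_start_log PARTITION (dt = '2020-06-14')
SELECT
    get_json_object(line, '$.common.ar'),     -- area_code
    get_json_object(line, '$.common.ba'),     -- brand
    get_json_object(line, '$.common.ch'),     -- channel
    get_json_object(line, '$.common.is_new'),
    get_json_object(line, '$.common.md'),     -- model
    get_json_object(line, '$.common.mid'),    -- mid_id
    get_json_object(line, '$.common.os'),
    get_json_object(line, '$.common.uid'),    -- user_id
    get_json_object(line, '$.common.vc'),     -- version_code
    get_json_object(line, '$.start.entry'),
    get_json_object(line, '$.start.loading_time'),
    get_json_object(line, '$.start.open_ad_id'),
    get_json_object(line, '$.start.open_ad_ms'),
    get_json_object(line, '$.start.open_ad_skip_ms'),
    get_json_object(line, '$.ts')
FROM ods_log
WHERE dt = '2020-06-14'
  AND get_json_object(line, '$.start') IS NOT NULL;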

6.1.4 page log table

Parsing idea for the page log: each row of the page log table corresponds to one page visit record, which should contain the common information and the page information from the log. First keep only the log lines that contain the page field, then parse out each field with the get_json_object function.

6.1.5 action log table

Parsing idea for the action log: each row of the action log table corresponds to one user action, which should contain the common information, the page information and the action information. First keep only the log lines that contain the action field, then "explode" the action array with a UDTF (similar in effect to the built-in explode function), and finally parse each field with get_json_object, as sketched below.
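A hedged sketch of that step. The UDTF name explode_json_array is a stand-in for the project's custom function that turns a JSON array string into one row per element (this section only says a UDTF similar to explode is used), and the per-action keys action_id, item and ts are assumptions; only a few output columns are shown.

SELECT
    get_json_object(line, '$.common.mid')   AS mid_id,
    get_json_object(action, '$.action_id')  AS action_id,
    get_json_object(action, '$.item')       AS item,
    get_json_object(action, '$.ts')         AS ts
FROM ods_log
LATERAL VIEW explode_json_array(get_json_object(line, '$.actions')) tmp AS action
WHERE dt = '2020-06-14'
  AND get_json_object(line, '$.actions') IS NOT NULL;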

6.1.7 error log table

Parsing idea for the error log: each row of the error log table corresponds to one error record. To make troubleshooting easier, an error record should contain the corresponding common information, page information, exposure information, action information, startup information and error information. First keep only the log lines that contain the err field, then parse out all fields with the get_json_object function.

6.2 DWD layer (business data)

For business data, the main focus of DWD layer construction is dimensional modeling. Only a few representative fact tables are shown here; for the full set, see the Shang Silicon Valley hands-on notes.

6.2.2 order detail fact table (transaction fact table)

1) Create table statement

DROP TABLE IF EXISTS dwd_order_detail;
CREATE EXTERNAL TABLE dwd_order_detail (
  `id` STRING COMMENT 'order detail id',
  `order_id` STRING COMMENT 'order id',
  `user_id` STRING COMMENT 'user id',
  `sku_id` STRING COMMENT 'sku id',
  `province_id` STRING COMMENT 'province id',
  `activity_id` STRING COMMENT 'activity id',
  `activity_rule_id` STRING COMMENT 'activity rule id',
  `coupon_id` STRING COMMENT 'coupon id',
  `create_time` STRING COMMENT 'creation time',
  `source_type` STRING COMMENT 'source type',
  `source_id` STRING COMMENT 'source id',
  `sku_num` BIGINT COMMENT 'quantity of goods',
  `original_amount` DECIMAL(16,2) COMMENT 'original amount',
  `split_activity_amount` DECIMAL(16,2) COMMENT 'activity discount allocation',
  `split_coupon_amount` DECIMAL(16,2) COMMENT 'coupon discount allocation',
  `split_final_amount` DECIMAL(16,2) COMMENT 'final amount allocation'
) COMMENT 'order detail fact table'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_order_detail/'
TBLPROPERTIES ("parquet.compression"="lzo");

2) Partition planning

[Figure: dwd_order_detail partition planning]

3) Data loading

[Figure: dwd_order_detail data loading]
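A reduced sketch of the loading idea shown in the figure: take the day's ods_order_detail rows and join ods_order_info to pick up user_id and province_id. The activity/coupon joins and the split-amount arithmetic are omitted, it is written as a plain SELECT rather than the full INSERT OVERWRITE, and the ODS column names (including order_price) are assumptions to verify against the source tables.

SELECT
    od.id,
    od.order_id,
    oi.user_id,
    od.sku_id,
    oi.province_id,
    od.create_time,
    od.sku_num,
    od.sku_num * od.order_price AS original_amount
FROM
    (SELECT * FROM ods_order_detail WHERE dt = '2020-06-14') od
LEFT JOIN
    (SELECT id, user_id, province_id FROM ods_order_info WHERE dt = '2020-06-14') oi
ON od.order_id = oi.id;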

6.2.4 cart (add-to-cart) fact table (periodic snapshot fact table, daily snapshot)

1) Create table statement

DROP TABLE IF EXISTS dwd_cart_info;
CREATE EXTERNAL TABLE dwd_cart_info(
  `id` STRING COMMENT 'id',
  `user_id` STRING COMMENT 'user id',
  `sku_id` STRING COMMENT 'sku id',
  `source_type` STRING COMMENT 'source type',
  `source_id` STRING COMMENT 'source id',
  `cart_price` DECIMAL(16,2) COMMENT 'price when added to the cart',
  `is_ordered` STRING COMMENT 'whether an order has been placed for it',
  `create_time` STRING COMMENT 'creation time',
  `operate_time` STRING COMMENT 'modification time',
  `order_time` STRING COMMENT 'order time',
  `sku_num` BIGINT COMMENT 'quantity added to the cart'
) COMMENT 'cart (add-to-cart) fact table'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_cart_info/'
TBLPROPERTIES ("parquet.compression"="lzo");

2) Partition planning

[Figure: dwd_cart_info partition planning]

3) Data loading

[Figure: dwd_cart_info data loading]

6.2.7 payment fact table (accumulating snapshot fact table)

1) Create table statement

DROP TABLE IF EXISTS dwd_payment_info;
CREATE EXTERNAL TABLE dwd_payment_info (
  `id` STRING COMMENT 'id',
  `order_id` STRING COMMENT 'order id',
  `user_id` STRING COMMENT 'user id',
  `province_id` STRING COMMENT 'province id',
  `trade_no` STRING COMMENT 'trade number',
  `out_trade_no` STRING COMMENT 'external trade number',
  `payment_type` STRING COMMENT 'payment type',
  `payment_amount` DECIMAL(16,2) COMMENT 'payment amount',
  `payment_status` STRING COMMENT 'payment status',
  `create_time` STRING COMMENT 'creation time',    -- when the third-party payment interface is called
  `callback_time` STRING COMMENT 'completion time' -- when the payment completes, i.e. the success callback time
) COMMENT 'payment fact table'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_payment_info/'
TBLPROPERTIES ("parquet.compression"="lzo");

2) Partition planning

[Figure: dwd_payment_info partition planning]

3) Data loading

[Figure: dwd_payment_info data loading]

Chapter 7 Data Warehouse Construction – DWS Layer

7.2 DWS layer

[Figures: DWS layer design overview]

7.2.1 visitor topic

1) Create table statement

DROP TABLE IF EXISTS dws_visitor_action_daycount;
CREATE EXTERNAL TABLE dws_visitor_action_daycount
(
  `mid_id` STRING COMMENT 'device id',
  `brand` STRING COMMENT 'device brand',
  `model` STRING COMMENT 'device model',
  `is_new` STRING COMMENT 'whether this is the first visit',
  `channel` ARRAY<STRING> COMMENT 'channel',
  `os` ARRAY<STRING> COMMENT 'operating system',
  `area_code` ARRAY<STRING> COMMENT 'area code',
  `version_code` ARRAY<STRING> COMMENT 'app version code',
  `visit_count` BIGINT COMMENT 'number of visits',
  `page_stats` ARRAY<STRUCT<page_id:STRING,page_count:BIGINT,during_time:BIGINT>> COMMENT 'page visit statistics'
) COMMENT 'daily device behavior table'
PARTITIONED BY(`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dws/dws_visitor_action_daycount'
TBLPROPERTIES ("parquet.compression"="lzo");

Chapter 8 Data Warehouse Construction – DWT Layer

Chapter 9 Data Warehouse Construction – ADS Layer

9.1 table creation notes

The ADS layer does not involve dimensional modeling; tables are created according to the specific requirements.

9.3 user topic

9.3.1 user statistics

The requirement is comprehensive user statistics covering several metrics, explained below.

| Metric | Description | Field |
| --- | --- | --- |
| Number of new users | Number of newly registered users | new_user_count |
| Number of new ordering users | Number of users who placed an order for the first time | new_order_user_count |
| Total order amount | Total amount of all orders | order_final_amount |
| Number of ordering users | Total number of users who placed orders | order_user_count |
| Number of users without orders | Number of users who were active but placed no order | no_order_user_count |

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_total;
CREATE EXTERNAL TABLE `ads_user_total` (
  `dt` STRING COMMENT 'statistics date',
  `recent_days` BIGINT COMMENT 'recent days: 0 = cumulative, 1 = last 1 day, 7 = last 7 days, 30 = last 30 days',
  `new_user_count` BIGINT COMMENT 'number of newly registered users',
  `new_order_user_count` BIGINT COMMENT 'number of users placing an order for the first time',
  `order_final_amount` DECIMAL(16,2) COMMENT 'total order amount',
  `order_user_count` BIGINT COMMENT 'number of ordering users',
  `no_order_user_count` BIGINT COMMENT 'number of users without orders (active users who did not place an order)'
) COMMENT 'user statistics'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_user_total/';

9.3.2 user change statistics

This requirement covers two metrics, the number of churned users and the number of returning users, explained below.

| Metric | Description | Field |
| --- | --- | --- |
| Number of churned users | Users who were active before but not recently. Here: users who were active on the day exactly 7 days before the statistics date (only that day counts) and have not been active during the last 7 days. | user_churn_count |
| Number of returning users | Users who were active before, went inactive (churned) for a while, and are active again today. Here the total number of such returning users is counted. | user_back_count |

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_change;
CREATE EXTERNAL TABLE `ads_user_change` (
  `dt` STRING COMMENT 'statistics date',
  `user_churn_count` BIGINT COMMENT 'number of churned users',
  `user_back_count` BIGINT COMMENT 'number of returning users'
) COMMENT 'user change statistics'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_user_change/';

9.3.3 user behavior funnel analysis

Funnel analysis is a data analysis model that reflects how users convert at each stage of a business process from start to finish. Because it shows every stage side by side, it makes it obvious at a glance which stage has a problem.

[Figure: user behavior funnel]

This requirement counts the number of users at each stage of a complete shopping flow (home page, product detail, cart, order, payment).

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_action;
CREATE EXTERNAL TABLE `ads_user_action` (
  `dt` STRING COMMENT 'statistics date',
  `recent_days` BIGINT COMMENT 'recent days: 1 = last 1 day, 7 = last 7 days, 30 = last 30 days',
  `home_count` BIGINT COMMENT 'number of home page visitors',
  `good_detail_count` BIGINT COMMENT 'number of product detail page visitors',
  `cart_count` BIGINT COMMENT 'number of users who added to the cart',
  `order_count` BIGINT COMMENT 'number of users who placed orders',
  `payment_count` BIGINT COMMENT 'number of users who paid'
) COMMENT 'funnel analysis'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_user_action/';

9.3.4 user retention rate

Retention analysis generally covers new-user retention and active-user retention.

New-user retention analysis looks at how many of the users added on a given day remain active afterwards; active-user retention analysis looks at how many of the users active on a given day remain active afterwards.

Retention is an important indicator of how much value a product delivers to its users.

Here the requirement is the new-user retention rate, i.e. the ratio of retained users to new users. For example, if 100 users are added on 2020-06-14 and 80 of them are active one day later (on 2020-06-15), then the 1-day retained-user count for 2020-06-14 is 80 and the 1-day retention rate is 80%.

Each day, the retention rates for 1 through 7 days after the new-user date are computed, as shown in the figure below.

[Figure: retention rate computation schedule]

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_user_retention;
CREATE EXTERNAL TABLE ads_user_retention (
  `dt` STRING COMMENT 'statistics date',
  `create_date` STRING COMMENT 'date the users were added',
  `retention_day` BIGINT COMMENT 'number of days retained as of the statistics date',
  `retention_count` BIGINT COMMENT 'number of retained users',
  `new_user_count` BIGINT COMMENT 'number of new users',
  `retention_rate` DECIMAL(16,2) COMMENT 'retention rate'
) COMMENT 'user retention rate'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_user_retention/';

9.5 order topic

9.5.1 order statistics

This requirement covers the total number of orders, the total order amount, and the total number of ordering users.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_order_total;
CREATE EXTERNAL TABLE `ads_order_total` (
  `dt` STRING COMMENT 'statistics date',
  `recent_days` BIGINT COMMENT 'recent days: 1 = last 1 day, 7 = last 7 days, 30 = last 30 days',
  `order_count` BIGINT COMMENT 'number of orders',
  `order_amount` DECIMAL(16,2) COMMENT 'order amount',
  `order_user_count` BIGINT COMMENT 'number of ordering users'
) COMMENT 'order statistics'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_order_total/';

9.5.2 order statistics by region

This requirement covers the total number of orders and the total order amount per province.

1. CREATE TABLE statement

DROP TABLE IF EXISTS ads_order_by_province;
CREATE EXTERNAL TABLE `ads_order_by_province` (
  `dt` STRING COMMENT 'statistics date',
  `recent_days` BIGINT COMMENT 'recent days: 1 = last 1 day, 7 = last 7 days, 30 = last 30 days',
  `province_id` STRING COMMENT 'province id',
  `province_name` STRING COMMENT 'province name',
  `area_code` STRING COMMENT 'area code',
  `iso_code` STRING COMMENT 'ISO region code',
  `iso_code_3166_2` STRING COMMENT 'ISO 3166-2 region code',
  `order_count` BIGINT COMMENT 'number of orders',
  `order_amount` DECIMAL(16,2) COMMENT 'order amount'
) COMMENT 'order statistics by region'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/ads/ads_order_by_province/';

Chapter 10 End-to-End Workflow Scheduling