Data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads

Time:2022-6-2

@TOC

Warehouse notes

Detailed explanation of data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads:https://blog.csdn.net/weixin_42526326/article/details/121633372

Project requirements and architecture design of shangsilicon Valley digital warehouse practice 1:https://blog.csdn.net/weixin_42526326/article/details/121658388

Shang Silicon Valley data warehouse practice 2 data warehouse layered + dimensional modeling:https://blog.csdn.net/weixin_42526326/article/details/121658605

Construction of 3 data warehouses of shangsilicon Valley:https://blog.csdn.net/weixin_42526326/article/details/121658635

Data flow direction

Data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads

Application examples

Data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads

What is Data Warehouse DW

Data warehouse (which can be abbreviated as DW or DWH) refers to a data warehouse that is aA complete theoretical system including ETL, scheduling and modeling

The purpose of data warehouse scheme construction is to serve as the basis for front-end query and analysis. It is mainly applied to OLAP (on-line analytical processing), supports complex analysis operations, focuses on decision support, and provides intuitive and understandable query results. At present, AWS redshift, Greenplum, hive, etc. are popular in the industry.

Data warehouse is not the final destination of data, but prepares for the final destination of data, including cleaning, escaping, classification, reorganization, consolidation, splitting, statistics, etc

main features

Subject oriented[appendix]

The operational database organization is oriented to transaction processing tasks, while the data in the data warehouse is organized according to a certain subject domain.

Topics refer to the key aspects that users care about when using data warehouse to make decisions. A topic is related to multiple operational information systems.

integrate

The source data needs to be processed, integrated, unified and integrated

In the process of processing, the inconsistency of source data must be eliminated to ensure that the information in the data warehouse is the consistent global information about the whole enterprise. (relationship)

Not modifiable

The data in DW is not up-to-date, but comes from other data sources

Data warehouse is mainly used to provide data for decision analysis, and the operations involved are mainly data query

Time dependent

The data in the data warehouse for decision-making needs to be marked with time attribute

Comparison with database

DW: specially designed for data analysis, involving reading a large amount of data to understand the relationship and trend between data

Databases: for capturing and storing data

Attribute data warehouse transaction database

Suitable workload analysis, reporting, big data transaction processing

Data sources data collected and standardized from multiple sources data captured from a single source, such as a transactional system

Data capture batch writes are optimized for continuous writes by executing according to a predetermined batch plan because new data maximizes transaction throughput

Data standardization non standardized schema, such as star schema or snowflake schema highly standardized static schema

Data storage is optimized using columnar storage for easy access and high-speed query performance optimized for high-throughput write operations in single row physical blocks

Data access optimizes a large number of small read operations to minimize i/o and maximize data throughput

Why layering

Problems involved in data warehouse:

Why do you want to be a data warehouse?

Why do we do data quality management?

Why metadata management?

What is the role of each layer in the data warehouse hierarchy?

……

In practical work, we all hope that our data can flow in order, and designers and users can clearly know the entire declaration cycle of data, such as the following figure on the left.

However, in practice, the data situation we are facing is likely to be highly complex and disordered. We may make a set of data system with disordered table dependency structure and circular dependency.

In order to solve the problems we may face, we need a set of effective data organization, management and processing methods to make our data system more orderly. This isData Tiering。 Benefits of data tiering:

Clear data structure: each data layer has its own role and responsibility, which can be more convenient and understandable in use and maintenance

Simplification of complex problems: a complex task is disassembled into multiple steps to be completed step by step, and each layer only solves specific problems

Unified data caliber: provide a unified data outlet and output caliber through data layering

Reduce repeated development: standardizing data layering and developing a common middle tier can greatly reduce the work of repeated calculation

Data Tiering

Each company’s business can be divided into different levels according to its own business needs; At present, there are mature data layers: data operation layer ODS, data warehouse layer DW, and data service layer ads (APP).

Data operation layer ODS

Data operation layer: operation data store data preparation area, also known as paste source layer. The data in the data source enters this layer after extraction, cleaning and transmission, that is, the ETL process. Main functions of this layer:

ODS is the preparation area of the later data warehouse layer

Provide raw data for DWD layer

Reduce impact on business systems

When the source data is loaded into this layer, a series of operations such as de-noising (for example, the age of the person in one piece of data is 300, which belongs to abnormal data, and some processing needs to be done in advance), de duplication (for example, in the personal data table, there are two duplicate data with the same ID, and one step of de duplication needs to be done when accessing), field naming conventions, and so on.

However, in order to consider the possible need to trace data in the future, it is not recommended to do too much data cleaning for this layer. It is also possible to access the original data intact according to the specific layered requirements of the business.

This layer of data is the source of subsequent data warehouse processing data. datasourceMethod:

Business log

Sqoop is often used to extract, for example, once a day.

In terms of real-time, you can consider using canal to listen to MySQL binlog and access it in real time.

Buried point log

Logs are generally saved in the form of files, and flume can be used for timing synchronization

You can use spark streaming or Flink to access in real time

Kafka OK

Third party crawler data

Data warehouse layer

From top to bottom, the data warehouse layer can be divided into three layers: Data detail layer DWD, data middle layer DWM, and data service layer DWS.

Data detail layer DWD

Data detail layer: data warehouse details, DWD (data cleaning /dwi)

This layer is the isolation layer between the business layer and the data warehouse, maintaining the same data granularity as the ODS layer; It is mainly used to clean and standardize the ODS data layer, such as removing empty data, dirty data, outliers, etc.

In order to improve the usability of the data detail layer, the layer usually adopts some dimension degradation methods to degrade the dimension into the fact table and reduce the association between the fact table and the dimension table.

Data m-server DWM

Data warehouse middle, DWM

Based on the data of DWD layer, this layer performs some slight aggregation operations on the data, generates some columns of intermediate result tables, improves the reusability of public indicators, and reduces repeated processing.

To be brief, aggregate the common core dimensions to calculate the corresponding statistical indicators

Data service layer DWS (DWT)

Data service layer: data warehouse service, DWS (wide table – user behavior, slight aggregation)

The total information aggregated by DWT on the basis of DWS, such as user information. The unit of DWS is day, and DWT is the total information

This layer is based on the basic data on DWM, which is integrated and summarized into a data service layer for analyzing a subject field. It is generally a wide table, which is used to provide subsequent business queries, OLAP analysis, data distribution, etc.

Generally speaking, there are relatively few data tables in this layer; A table covers a lot of business content. Because it has many fields, it is generally called a wide table.

User behavior, slight aggregation on DWD

It mainly summarizes the ods/dwd layer data.

Data application layer ads

Data application layer: application data service, ads (app/dal/df) – report results

This layer mainly provides data for data products and data analysis. It is generally stored in ES, redis, PostgreSQL and other systems for online systems; It may also be stored in hive or Druid for data analysis and data mining. For example, common data reports exist here.

Fact tablee

A fact table is a table that stores fact records, such as system logs and sales records. The record of the fact table is growing. For example, the commodity order table of e-commerce is a similar situation, so the volume of the fact table is usually much larger than that of other tables.

Dimension (DIM)

A dimension table or dimension table, sometimes called a lookup table, is a table corresponding to a fact table; It saves the attribute values of the dimension and can be associated with the fact table. It is equivalent to extracting and standardizing the attributes that frequently appear in the fact table and managing them in one table. The dimension table mainly contains two parts:

High cardinality dimension data: generally, it is a data table similar to user data table and commodity data table. The data volume may be tens of millions or hundreds of millions

Low cardinality dimension data: generally, it is a configuration table, such as the Chinese meaning corresponding to enumeration fields, or a date dimension table; The data volume may be single digits or tens of thousands.

Temporary table tmp

There are many temporary tables in each layer of calculation. A dwtmp layer is set up to store the temporary tables in our data warehouse

data mart

Ads layer in narrow sense; Broadly speaking, it refers to the data that Hadoop synchronizes from DWD DWS ads to RDS

Data mart (data mart), also known as data market, is to meet the needs of specific departments or usersMultidimensionalIt includes defining dimensions, indicators to be calculated, dimension levels, etc. to generate adata cube

In terms of scope, the data is enterprise widedatabasedata warehouse Or extracted from a more professional data warehouse. The focus of the data center is that it caters to the special needs of professional user groups in terms of analysis, content, performance and ease of use. Data center users want data to be represented by terms they are familiar with.

Data warehouse structure with data mart

Data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads

Differentiated data warehouse

Data mart is enterprise leveldata warehouse It is mainly oriented to department level business and only faces a specific topic. In order to solve the contradiction between flexibility and performance, data mart is a small department or workgroup level data warehouse added to the data warehouse architecture. Data marts store pre calculated data for specific users to meet their performance needs. Data mart can alleviate the bottleneck of accessing data warehouse to some extent.

Theoretically, there should be a general concept of data warehouse before there is a data mart. In the actual construction of data mart, this is rarely done in China. In China, we usually start with the data mart first, and then build a data warehouse for a specific topic (such as enterprise customer information). The order of establishing data warehouse and data mart is closely related to the design method. As an engineering discipline, data warehouse has no right or wrong.

In terms of data structure, data warehouse is a collection of subject oriented and integrated data. Data marts are usually defined as star or snowflake data structures. Data marts are generally composed of a fact table and several dimension tables.

Data warehouse and data mart: ODS, DW, DWD, DWM, DWS, ads

Problem summary

What is the difference between ODS and DWD?

Q: I still don’t quite understand the difference between ODS and DWD. With ODS, I feel DWD is useless.

A: Well, I understand it this way. From an ideal point of view, if the data in the ODS layer is very regular, it can basically meet most of our needs. Of course, this is good. At this time, the DWD layer is not necessary. However, in reality, it is difficult to guarantee the quality of ODS layer data. After all, there are various sources of data, and the push party will also have its own push logic. In this case, we need to shield some underlying differences through an additional layer of DWD.

Q: I understand. Does DWD mainly do some data cleaning and standardization for ODS layer, and DWS mainly does some light summary for ODS layer data?

Yes, it can be roughly understood in this way.

APP层干什么的?

问:感觉DWS层是不是没地方放了,各个业务的DWS表是应该在 DWD还是在 app?

答:这个问题不太好回答,我感觉主要就是明确一下DWS层是干什么的,如果你的DWS层放的就是一些可以供业务方使用的宽表表,放在 app 层就行。如果你说的数据集市是一个比较泛一点的概念,那么其实 dws、dwd、app 这些合起来都算是数据集市的内容。

问:那存到 Redis、ES 中的数据算是 app层吗?

答:算是的,我个人的理解,app 层主要存放一些相对成熟的表,能供业务侧使用的。这些表可以在 Hive 中,也可以是从 Hive 导入 Redis 或者 ES 这种查询性能比较好的系统中。

附录

ETL

ETL :Extract-Transform-Load,用于描述将数据从来源端经过抽取、转换、加载到目的端的过程。

宽表

含义:指字段比较多的数据库表。通常是指业务主体相关的指标、纬度、属性关联在一起的一张数据库表。

特点:

宽表由于把不同的内容都放在同一张表,宽表已经不符合三范式的模型设计规范:

坏处:数据有大量冗余

好处:查询性能的提高和便捷

宽表的设计广泛应用于数据挖掘模型训练前的数据准备,通过把相关字段放在同一张表中,可以大大提供数据挖掘模型训练过程中迭代计算的消息问题。

主题(Subject)

是在较高层次上将企业信息系统中的数据进行综合、归类和分析利用的一个抽象概念,每一个主题基本对应一个宏观的分析领域。在逻辑意义上,它是对应企业中某一宏观分析领域所涉及的分析对象。例如“销售分析”就是一个分析领域,因此这个数据仓库应用的主题就是“销售分析”。

参考:http://www.cnblogs.com/amyzhu

https://www.jianshu.com/p/1dd894e5bb62

https://blog.csdn.net/pmdream/article/details/113601956

百度百科

Recommended Today

P1851 Good friend (Todo)

good friend topic background Xiao Keke and all the other students have a RFID serial number plate on their wrist so that the teacher can easily count their number. Many students have a "good friend". if\(A\)The sum of the divisors of the serial numbers is exactly equal to\(B\)serial number, then\(A\)'s good friend is\(B\). Here, the […]