What is a data lakehouse?

Time: 2021-6-17

This article is reproduced from https://mp.weixin.qq.com/s/Il…

Background

The data lake and the data lakehouse seem to have become the hottest buzzwords in the field of big data. As technicians, when we come across such buzzwords we tend to ask: is this a new technology, or just a repackaged concept (old wine in a new bottle)? What problems does it solve, and what new capabilities does it bring? What is its current state, and what issues remain?

With these questions in mind, this article tries to lift the veil on the data lakehouse, based on the author's understanding, and to get at the essence of the technology.

The data lakehouse is a new data architecture that combines the advantages of the data warehouse and the data lake. Data analysts and data scientists can work on data in the same data store, and it also makes data governance easier for companies. So what exactly is a data lakehouse, and what are its features?

This article refers to https://www.xplenty.com/glossary/what-is-a-data-lakehouse/.

What are the characteristics of a data lakehouse?

For a long time, we have been using two kinds of storage architecture to organize data:

Data warehouse: a storage architecture that mainly holds structured data organized in relational databases. Data is transformed, integrated, and cleaned before being loaded into the target tables. In a data warehouse, the stored data is strictly matched to its defined schema (schema-on-write).
Data lake: a storage architecture that can hold any type of data, including unstructured data such as images and documents. Data lakes are usually larger and cheaper to store. The data in them does not have to satisfy a particular schema, and the lake makes no attempt to enforce one; instead, the data owner usually parses the schema when reading the data (schema-on-read) and applies transformations while processing it, as the short sketch below illustrates.
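To make schema-on-read concrete, here is a minimal PySpark sketch. It assumes raw JSON event files already sit in an object store; the path and the column names are placeholders chosen for illustration, not part of the original article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType
)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The reader, not the storage layer, decides the schema (schema-on-read).
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# "s3a://my-lake/raw/events/" is a hypothetical lake-layer location.
raw_events = (
    spark.read
    .schema(event_schema)   # schema applied at read time
    .json("s3a://my-lake/raw/events/")
)

raw_events.filter("event_type = 'purchase'").show()
```

The same files could be read tomorrow with a different schema; nothing about the storage layer would have to change.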

Nowadays, many companies tend to run both storage architectures at the same time: one big data warehouse and several small data lakes. As a result, the data held in the two kinds of storage is partly redundant.

The data lakehouse emerged as an attempt to reconcile the differences between the data warehouse and the data lake. By building the warehouse on top of the lake, storage becomes cheaper and more flexible; at the same time, the lakehouse can effectively improve data quality and reduce data redundancy. ETL plays a very important role in building a lakehouse: it turns the unorganized data of the lake layer into the structured data of the warehouse layer, as sketched below.
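A minimal lake-to-warehouse ETL sketch in PySpark, assuming the same hypothetical raw event files as above; the column names and paths are placeholders, not from the original article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-warehouse-etl").getOrCreate()

# Extract: loosely structured events from the lake layer (placeholder path).
raw = spark.read.json("s3a://my-lake/raw/events/")

# Transform: deduplicate, clean, and conform to the warehouse-layer schema.
cleaned = (
    raw
    .dropDuplicates(["event_id"])
    .filter(F.col("user_id").isNotNull())
    .withColumn("event_date", F.to_date("event_time"))
    .select("event_id", "user_id", "event_type", "amount", "event_date")
)

# Load: write a structured, partitioned table for analysts to query directly.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://my-lake/warehouse/events/")
)
```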

The concept of the data lakehouse was proposed by Databricks in this paper [1], which also lists its characteristics:

Transaction support: the lakehouse can serve multiple different data pipelines. This means it can support concurrent read and write transactions without compromising data integrity (see the sketch after this list).
Schema support: a data warehouse imposes a schema on all of the data stored in it, while a data lake does not. The lakehouse architecture can apply a schema to most of the data, as the application requires, to keep it standardized.
Support for reporting and analytics applications: both reporting and analytics applications can use this storage architecture. The data stored in the lakehouse has been cleaned and integrated, which speeds up analysis; at the same time, compared with a data warehouse, it can hold more data with better freshness, which significantly improves report quality.
Data type extension: a data warehouse can only hold structured data, while the lakehouse supports many more data types, including files, video, audio, and system logs.
End-to-end streaming support: the lakehouse supports streaming analytics, so it can serve real-time reports, whose importance is steadily growing in more and more enterprises.
Separation of compute and storage: data lakes are usually built from low-cost hardware and cluster architectures, which makes the standalone storage very cheap. Since the lakehouse is built on the data lake, it naturally adopts a compute/storage-separated architecture: data is stored in one cluster and processed in another.
Openness: lakehouse implementations usually rely on components such as Iceberg, Hudi, and Delta Lake. These components are open source, and they use open, widely compatible storage formats such as Parquet and ORC as the underlying data format, so different engines and languages can operate on the lakehouse.
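To make the transaction and schema points above concrete, here is a minimal sketch using the open-source delta-spark package (Delta Lake is one of the table formats named above); the table path and column names are placeholders. Every write goes through the table's transaction log, and an append whose schema does not match the table is rejected rather than silently accepted.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("lakehouse-acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "/tmp/lakehouse/events"  # placeholder location

# Each write is an atomic transaction recorded in the Delta log.
df = spark.createDataFrame(
    [("u1", "purchase", 9.99), ("u2", "view", 0.0)],
    ["user_id", "event_type", "amount"],
)
df.write.format("delta").mode("append").save(table_path)

# Schema enforcement: appending data with a mismatched schema fails
# instead of silently corrupting the table.
bad = spark.createDataFrame([("u3", 42)], ["user_id", "unknown_column"])
try:
    bad.write.format("delta").mode("append").save(table_path)
except Exception as e:
    print("Append rejected by schema enforcement:", type(e).__name__)

# The same table can also be consumed incrementally as a stream,
# which is what the end-to-end streaming point refers to.
stream_df = spark.readStream.format("delta").load(table_path)
```

This is only a sketch of one open table format; Iceberg and Hudi expose comparable transactional tables through their own APIs.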

The concept of the lakehouse was first proposed by Databricks, and other vendors offer similar products, such as Azure Synapse Analytics. Lakehouse technology is still evolving, so the features listed above will keep being revised and improved.

What problems does the data lakehouse solve?

Now that we have covered the features of the data lakehouse, what problems does it actually solve?

Over the years, data warehouses and data lakes have coexisted and evolved separately in many companies without running into serious trouble. Still, there is room for improvement in several areas:

Data duplication: if an organization maintains a data lake and several data warehouses at the same time, data redundancy is inevitable. At best this only makes data processing inefficient; at worst it leads to inconsistent data. The data lakehouse unifies everything: it eliminates the duplication and delivers a true single version of the truth.
High storage cost: both the data warehouse and the data lake aim to lower the cost of data storage. The warehouse usually does so by reducing redundancy and integrating heterogeneous sources; the lake usually relies on big-data file systems (such as Hadoop HDFS) and Spark to store and process data on cheap hardware. The cheapest approach is to combine these techniques, which is exactly the goal of the lakehouse architecture.
Gap between reporting and analytics applications: report analysts tend to work with integrated data such as a data warehouse or data mart, while data scientists tend to work with the data lake, applying various analytical techniques to raw data. In many organizations the two teams barely overlap, yet their work contains duplication and even contradictions. With a data lakehouse, both teams can work on the same data architecture and avoid unnecessary duplication.
Data staleness: stale data is the most serious problem in a data lake; data left unattended quickly turns the lake into a data swamp. It is easy to throw data into the lake, but without effective governance its freshness becomes harder and harder to trace over time. By cataloguing massive amounts of data, the lakehouse can effectively improve the timeliness of data analysis.
Risk of potential incompatibility: data analytics is still an emerging field, and new tools and techniques appear every year. Some may only be compatible with the data lake, others only with the data warehouse. The flexible architecture of the lakehouse means a company is prepared for the future on both fronts.

Problems with the data lakehouse

The existing lakehouse architecture still has some problems:

A single unified architecture: the unified architecture of the lakehouse has many advantages, but it also introduces problems. A unified architecture tends to be inflexible, hard to maintain, and unable to satisfy every user; architects often prefer a multi-mode architecture that tailors different paradigms to different scenarios.
Not a fundamental improvement over the existing architecture: there are still doubts about whether the lakehouse really adds value, and opinions differ: would combining an existing data warehouse and data lake with the right tools deliver similar results?
Immature technology: lakehouse technology is not yet mature, and there is still a long way to go before it delivers all of the capabilities described above.

References

[1] https://databricks.com/blog/2…