Introduction to MaxCompute Lake and Warehouse Integration


Introduction: This article introduces MaxCompute lake-warehouse integration. Shared by Meng Shuo, Alibaba Cloud MaxCompute product expert.

Video link:…


This article introduces MaxCompute lake-warehouse integration in two parts.

1. What is MaxCompute lake-warehouse integration?

2. Success stories of lake-warehouse integration

1. What is MaxCompute lake-warehouse integration?

The overall lake-warehouse architecture mainly serves data analysts, data scientists and big data engineers, for workloads such as machine learning, unstructured data analysis, ad-hoc queries/BI and reporting. Within this architecture, DataWorks acts as the unified platform for data development and management, covering data security, IDE-based development, task scheduling and data asset management to keep the platform running stably.

As shown in the figure above, we first connect the networks of the data lake cluster and the MaxCompute data warehouse cluster, and then open up the storage layer so that intelligent caching, hot/cold data tiering, storage optimization and performance acceleration work across both. At the computing layer, we provide a DB-level view of metadata across engines to avoid data silos.

DataWorks unifies data assets from sources such as E-MapReduce, CDH HBase, CDH Hive and AnalyticDB. You can not only see global data assets on the data map, but also extract metadata from these data sources.

Within Alibaba, we have achieved a degree of data democratization: today, employees across Alibaba Group can see the name and metadata of every table, as well as its security level. As the data middle platform, DataWorks collects metadata from the supported data sources and brings them under unified platform management and control.
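To make the idea concrete, here is a minimal sketch of such a unified data-asset catalog. This is an illustration only, not the DataWorks API; the class and field names (`TableMeta`, `Catalog`, `security_level`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    name: str
    source: str          # e.g. "MaxCompute", "CDH Hive", "E-MapReduce"
    security_level: str  # visibility label every employee can see
    columns: list = field(default_factory=list)

class Catalog:
    """Toy unified data-asset catalog (not the DataWorks implementation)."""
    def __init__(self):
        self._tables = {}

    def register(self, meta: TableMeta):
        # One namespace across all engines: "<source>.<table>".
        self._tables[f"{meta.source}.{meta.name}"] = meta

    def search(self, keyword: str):
        # Everyone may see names and metadata; access control applies on read.
        return [key for key in self._tables if keyword in key]

catalog = Catalog()
catalog.register(TableMeta("orders", "CDH Hive", "L2", ["id", "amount"]))
catalog.register(TableMeta("orders_agg", "MaxCompute", "L3"))
print(catalog.search("orders"))
```

The point of the sketch is the single namespace: tables from Hive, E-MapReduce and MaxCompute all land in one searchable map, which is what lets the data map show global assets.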

At present, the product's unified table-level and field-level data lineage is limited to lineage within a single engine; cross-engine data lineage is expected to be available next year.
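Table-level lineage of this kind is typically derived from the SQL that engines execute. The following is a rough sketch of the idea using a regex over a simple `INSERT ... SELECT` statement; a real lineage service parses the full SQL AST per engine, so this is an assumption-laden simplification, not how DataWorks does it.

```python
import re

def table_lineage(sql: str) -> dict:
    """Extract one table-level lineage edge from a simple INSERT ... SELECT.

    Only handles straightforward statements; subqueries, CTEs and views
    would need a proper SQL parser.
    """
    sql = sql.strip().rstrip(";")
    m = re.match(r"INSERT\s+(?:OVERWRITE\s+TABLE|INTO)\s+(\S+)", sql, re.I)
    if not m:
        raise ValueError("not an INSERT statement")
    target = m.group(1)
    # Source tables appear after FROM or JOIN keywords.
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {"target": target, "sources": sorted(set(sources))}

edge = table_lineage(
    "INSERT OVERWRITE TABLE dw.ads_report "
    "SELECT a.id, b.amount FROM ods.orders a JOIN ods.payments b ON a.id = b.order_id"
)
print(edge)
```

Collecting one such edge per executed statement and joining them by table name yields the lineage graph; the cross-engine difficulty mentioned above is that each engine has its own SQL dialect and log format for these statements.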

Within a single engine, multiple Hadoop clusters can be mounted, so they can all be connected to and managed through one unified engine interface.

As a unified data development platform, DataWorks can mix MaxCompute (MC) tasks and Hadoop tasks in one workflow. A single ad-hoc query entry can dispatch queries to different engines, and jobs on different engines, such as data integration jobs, MaxCompute jobs and Hive jobs, can be scheduled together.
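Mixed scheduling of this kind boils down to running a dependency DAG whose nodes carry an engine tag. A minimal sketch, assuming a hypothetical four-job pipeline (the job names and engine labels are invented for illustration; a real scheduler would submit each job to its engine rather than just record it):

```python
from collections import deque

# Hypothetical workflow: each job names its engine and its upstream dependencies.
jobs = {
    "sync_logs":    {"engine": "DataIntegration", "deps": []},
    "hive_clean":   {"engine": "Hive",            "deps": ["sync_logs"]},
    "mc_aggregate": {"engine": "MaxCompute",      "deps": ["hive_clean"]},
    "mc_report":    {"engine": "MaxCompute",      "deps": ["mc_aggregate"]},
}

def run_pipeline(jobs):
    """Run jobs in dependency order (Kahn's algorithm), dispatching by engine."""
    indeg = {name: len(spec["deps"]) for name, spec in jobs.items()}
    downstream = {name: [] for name in jobs}
    for name, spec in jobs.items():
        for dep in spec["deps"]:
            downstream[dep].append(name)
    ready = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append((name, jobs[name]["engine"]))  # a real scheduler submits here
        for nxt in downstream[name]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return order

for name, engine in run_pipeline(jobs):
    print(f"submit {name} -> {engine}")
```

The scheduler never cares which engine a node targets; that is what makes it possible to interleave data integration, Hive and MaxCompute jobs in one workflow.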

2. Success stories of lake-warehouse integration

The advertising algorithm team of an Internet game company is a typical customer of lake-warehouse integration. Its main application is machine learning, with an online model service built on DW+MC+PAI+EAS. On the one hand, the team is highly self-service and needs a one-stop machine learning platform. On the other hand, its Hadoop clusters are shared by multiple teams, cluster usage is strictly controlled, and the clusters cannot support innovative businesses that need large amounts of compute on short notice.

Based on these requirements, we connected the new business platform with the original data platform through lake-warehouse integration, namely PAI on MaxCompute + DataWorks. This gives the customer one-stop capabilities for machine learning, model development, model release and large-scale computing, which improves the team's work efficiency.

In another case, by introducing MaxCompute as the central computing engine, Shuhe not only lets data flow freely between the lake and the warehouse, but also resolves the fragmented storage management, metadata management and permission management of its previously heterogeneous computing engines. This improves overall work efficiency and reduces operation and maintenance costs, cutting cost while increasing efficiency.

The picture above shows the lake-warehouse architecture that Shuhe built on MaxCompute+DLF+EMR. The bottom layer is OSS data lake storage. Metadata management, data lineage management and data permission management are built on DLF. Hot/cold data tiering and local caching are implemented through JindoFS+MC. By combining MaxCompute and EMR, Shuhe successfully implemented intelligent data construction and data middle-platform management.
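The hot/cold tiering idea can be sketched as a small LRU cache sitting in front of slow lake storage. This is illustrative only; JindoFS implements caching and tiering at the file-system level, not as an in-memory map like this, and the table names here are invented.

```python
from collections import OrderedDict

class TieredStore:
    """Toy hot/cold tiering: an LRU local cache in front of 'lake' storage."""

    def __init__(self, cold_store, capacity=2):
        self.cold = cold_store    # stands in for objects living in OSS
        self.hot = OrderedDict()  # local cache, most recently used last
        self.capacity = capacity

    def read(self, key):
        if key in self.hot:           # cache hit: refresh recency
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]        # cache miss: fetch from the lake
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)  # evict the coldest entry
        return value

store = TieredStore({"t1": "rows-1", "t2": "rows-2", "t3": "rows-3"})
store.read("t1"); store.read("t2"); store.read("t3")  # capacity 2: t1 is evicted
print(list(store.hot))  # the keys currently held in the hot tier
```

Frequently read partitions stay in the hot tier close to the compute engine, while cold data remains only in OSS, which is the cost/performance trade-off the architecture above relies on.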

In the future, a unified lake-warehouse management platform will be developed to provide one-stop management and governance of lake and warehouse data. OSS object storage supports not only structured data but also unstructured data, and the platform will be able both to synchronize federated data sources and to unify metadata services and the metadata warehouse.

Original link
This article is the original content of Alibaba Cloud and may not be reproduced without permission.