Alibaba Cloud Cloud Native Integrated Data Warehouse – Interpretation of New Data Governance Capabilities

Time: 2022-8-17

Introduction: This article introduces the latest developments of DataWorks, the big data development and governance platform, in the field of data governance. It covers the core product functions built around the whole-link concept of pre-event, in-process and post-event governance, the quantitative evaluation mechanism for data governance, and best practices for cost management aimed at reducing costs and increasing efficiency.

Shared by: Tang Chen, Alibaba Cloud Smart Product Expert

Readers who did not have time to watch the live broadcast can watch the replay. Live playback: https://developer.aliyun.com/…

1. Data Governance Center Product Introduction

Alibaba Cloud DataWorks: One-stop Big Data Development and Governance Platform Architecture Big Picture

Alibaba Cloud DataWorks is positioned as a one-stop big data development and governance platform. It works closely with Hologres and other big data engines to provide rich product functions across the four key links of data acquisition, construction, management and use. It is the core platform product of Alibaba's internal data center and supports the digital construction and enterprise operation management of almost all business sectors, such as new e-commerce retail, advertising and marketing, local life and travel, smart logistics, and enterprise intelligent management.

As data construction deepens, we have become increasingly aware that data governance is an indispensable task in building data assets and accelerating the release of data value. Within the Alibaba Group, we set the goal of building a data asset system with "reliable quality, safety and stability, economical production, and convenient consumption", and we carry out data governance work around this goal. DataWorks provides corresponding product modules and capabilities to support it, such as "Data Quality Management", "Data Asset Map", "Data Security Management" and "Data Governance Center", as shown in the figure above.

Typical Pain Points of Enterprise Data Governance Implementation

Data governance work has been carried out, or is about to be carried out, in many enterprises. Its implementation has the following four typical pain points:

Data governance is difficult to start. Data governance work usually refers to theoretical systems such as DAMA or DCMM, from which it is clear that data governance covers a wide range of content. Where to start and what path to follow are the first questions an enterprise must answer. Unclear goals and execution paths are the first typical pain point.

Data governance is difficult to implement. Whether governance is carried out spontaneously within the enterprise or by a professional consulting agency, the resulting consulting plan, specifications and management methods often stay on paper when there is no appropriate governance platform or tool to support their implementation. This is the second typical challenge.

Data governance is difficult to evaluate. The question is how to evaluate governance objectively and how to quantify and visualize its effectiveness. When this is not done well, advancing governance becomes significantly harder.

Data governance is difficult to sustain. Data governance easily falls into "campaign-style governance", in which concentrated efforts produce visible results for a period of time. However, if governance is not integrated into the daily data development and production chain, the work will not continue, and governance problems cannot be fundamentally solved over the long term.

In addition to completing its data governance practice in sub-fields such as data quality management, metadata management, and data security management, Alibaba Group has built the following data governance system, which is common to the whole group. It governs multiple dimensions, such as storage, quality, security, model and cost, with unified methods and strategies, builds a quantitative evaluation model, and relies on a unified governance platform tool for implementation. It has achieved remarkable results.

This system has several key points. First, the core objects of governance are clearly defined: the tasks and tables related to ETL operations. Data governance governs objects, not people. However, a key premise of implementation is to determine the ownership of basic objects such as tasks and tables, and to identify the specific person in charge of each object so that governance issues are assigned and followed up. Results are attributed to individuals first and then aggregated to departments and the whole group. Second, the implementation path of data governance is "status quo analysis -> problem positioning -> optimization governance -> effect evaluation", which forms a closed-loop process. Finally, the core of data governance lies in quantification: quantify the problem and quantify the effect. Based on local details, the system then gives global decision-making suggestions, for example as a reference for resource allocation across the whole group, budget formulation for each department, and the setting of cost optimization goals. These quantitative assessments, and the discovery and repair of governance problems, are all handled through a unified platform tool.

This set of methods and capabilities, proven effective over many years of Alibaba's internal practice, is now officially offered to customers on the cloud as a product: the new Data Governance Center module of DataWorks.

Driven by governance problems, the Data Governance Center has built a closed-loop improvement mechanism of "quantitative governance evaluation, problem discovery and prevention, problem optimization". It combines prevention beforehand with remediation afterward and provides several core product functions. To be precise, the "event" in "before" and "after" refers to the formal data production of ETL jobs on the data platform.

Through the check item function, the Data Governance Center automatically scans SQL code for quality and performance consumption at key links such as task submission and release, to prevent new problems from being introduced. This is somewhat similar to the checks and optimization hints given at compile time.

A practical problem is that the construction of data warehouses and data centers may have been under way for a long time, so many existing problems need to be optimized and governed. The governance item function is designed for this purpose: it finds problems in the system that need optimization and provides corresponding solutions. Like check items, it works fully automatically.

The most distinctive feature of the Data Governance Center, and of Alibaba's internal data governance practice, is the quantitative evaluation mechanism. Based on the concept of a governance "health score", quantitative evaluation is carried out in the five basic dimensions of "computing", "storage", "quality", "security" and "R&D", and an overall governance health evaluation is then given. This makes it easy to understand the status quo before governance is implemented, and it provides an objective evaluation of governance effectiveness afterward.

In addition, for cost optimization and governance, the Data Governance Center provides a series of product capabilities such as resource usage analysis. These give a clear view of resource consumption, cost estimates and resource changes at the granularity of a single task or a single table, and help companies optimize and manage computing and storage in a targeted way to reduce costs and increase efficiency.

DataWorks Data Governance Center Product Architecture Full Diagram

The Data Governance Center is essentially a data application product driven by (meta)data. It can be roughly divided into a data layer, a governance application layer, and a management and operation layer.

Data layer: This is the key foundation of the entire product module. The Data Governance Center gathers the metadata of objects such as tasks, tables, models, and data service APIs, and builds a metadata warehouse for analysis and insight to support the governance applications in the upper layer.

Governance application layer: This is where the main functions of the Data Governance Center are located. Based on built-in solution templates, it provides functions for automatically preventing problems beforehand, automatically discovering existing problems afterward, and offering corresponding optimization guidelines. Resource usage analysis is a capability built for cost control; it includes resource details and transaction analysis, and intelligent resource optimization suggestions are being planned. Object 360 aggregates and displays the panoramic information of an object, especially the problems that need to be governed and optimized, and tracks the object's event changes over its whole life cycle. As an additional supporting system, the label system makes it easy to mark and distinguish tasks by type and then manage them centrally. Scenario-based governance is built on the PDCA concept: according to business needs, it helps users flexibly select the objects to be governed, assess the status quo, set governance goals, and monitor the progress of implementation, so that governance is actually carried through.

Management and operation layer: The Data Governance Center mainly serves two user groups, data governance administrators and the front-line staff who take part in governance work. This layer provides functions such as the governance evaluation report, governance health score, governance ranking list, and governance operation push.

DataWorks Data Governance Center Overview Usage Path

Using the Data Governance Center can be divided into three parts: status assessment, governance implementation, and governance operation and effectiveness review.

Data governance status assessment

The Data Governance Center provides built-in templates that package the best practices accumulated from Alibaba's internal practice and from serving external customers, so that they can be used out of the box. After selecting a template and activating the product module, you can use dozens of governance items and check items and view the overall governance evaluation report, that is, the governance health score evaluation.

The Data Governance Center provides reports from three perspectives (the tenant, individual workspaces, and specific individuals), covering the five dimensions of R&D, quality, security, computing, and storage, and giving specific quantitative assessments. Most importantly, the evaluation model applies the same set of standards to different workspaces and different individuals, which keeps the evaluation objective and consistent. The report can serve as a reference baseline before governance work formally begins.

Data Governance Health Score Evaluation Model

The data governance health score is calculated by a defined model from the problems found by the governance items. The model uses deduction logic: starting from a full score of 100 points, a built-in algorithm deducts points for the problems found, and the remainder is the health score. The Data Governance Center calculates separate health scores for the five dimensions of R&D, quality, security, computing, and storage, and then combines them into an overall health score. The logic may not seem complicated; the complexity lies in acquiring, processing, and modeling the underlying metadata and in gaining insight into governance issues.
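The deduction logic can be sketched as follows. This is a minimal illustration, not the built-in DataWorks algorithm: the per-issue deduction values, the floor at zero, and the equal-weight average across dimensions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The five dimensions named in the article.
DIMENSIONS = ("rd", "quality", "security", "computing", "storage")


@dataclass(frozen=True)
class Issue:
    dimension: str    # one of DIMENSIONS
    deduction: float  # hypothetical number of points deducted for this issue


def dimension_score(issues: List[Issue], dimension: str) -> float:
    """Start from 100 and subtract the deductions of this dimension (floor at 0)."""
    deducted = sum(i.deduction for i in issues if i.dimension == dimension)
    return max(0.0, 100.0 - deducted)


def overall_score(issues: List[Issue],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the five dimension scores (equal weights by default)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights.values())
    return sum(weights[d] * dimension_score(issues, d) for d in DIMENSIONS) / total


issues = [Issue("storage", 20), Issue("computing", 10), Issue("storage", 90)]
print(dimension_score(issues, "storage"))  # deductions exceed 100, so the floor applies
print(overall_score(issues))
```

Because every workspace and individual is scored by the same function, scores stay comparable, which is the property the evaluation model depends on.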

Data governance implementation: problem prevention, discovery, and optimization

This stage relies on check items and governance items. Check items are oriented to preventing problems beforehand. They hook into the submission and release of daily tasks, and if a check fails, the process is blocked. This function is not enabled by default; it is enabled on demand, and specific check items can be enabled for specific workspaces. Governance items are oriented to discovering problems afterward. This function needs no additional settings and takes effect once the template is enabled.

Processing optimization of governance problems – automatic prevention (check items)

After check items are turned on, they act on the specified workspaces and automatically trigger scans during task submission or release.
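A minimal sketch of how such a gate might behave is shown below. It is not the DataWorks implementation: the check name `no_select_star`, the workspace names, and the `submit` function are hypothetical, and the rule itself is only a stand-in for a real check item.

```python
import re
from typing import Callable, Dict, List, Set

# A check takes the SQL text of a task and returns a list of problem messages.
Check = Callable[[str], List[str]]


def no_select_star(sql: str) -> List[str]:
    """Hypothetical check item: flag `SELECT *` in the task code."""
    if re.search(r"\bselect\s+\*", sql, flags=re.IGNORECASE):
        return ["SELECT * found; list the required columns explicitly"]
    return []


# Registry of available check items (hypothetical names).
CHECKS: Dict[str, Check] = {"no_select_star": no_select_star}

# Check items are off by default and are enabled per workspace.
ENABLED: Dict[str, Set[str]] = {"ws_prod": {"no_select_star"}}


def submit(workspace: str, sql: str) -> List[str]:
    """Run the enabled checks; a non-empty result means the submission is blocked."""
    problems: List[str] = []
    for name in sorted(ENABLED.get(workspace, set())):
        problems.extend(CHECKS[name](sql))
    return problems


print(submit("ws_prod", "SELECT * FROM orders"))  # blocked: one problem reported
print(submit("ws_dev", "SELECT * FROM orders"))   # no check enabled: passes
```

Keeping the enablement map separate from the check registry mirrors the behavior described above: a check exists for everyone but only blocks the workspaces that opted in.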

Currently, the built-in templates of the Data Governance Center provide dozens of check items out of the box. More check items are being added gradually, based on experience accumulated within the Alibaba Group and on customer feedback.

If the built-in check items cannot fully meet your individual needs, we also provide a flexible extension mechanism based on the DataWorks development platform. Extending check items relies on open events, extension points, and extensions. With this mechanism, you can develop custom check items, register them with the Data Governance Center, and manage and use them together with the built-in check items.

Processing optimization of governance issues – automatic discovery (governance items)

Post-event governance relies on governance items. Unlike check items, governance items are enabled automatically once the template is enabled. The system automatically scans for problems that need to be governed and optimized, and provides corresponding processing guidelines.

Similar to the check items, the Data Governance Center provides a total of 43 built-in governance items through templates, across the five dimensions of storage, computing, security, quality, and R&D. They are accumulated from Alibaba's internal practices and customer needs and are available out of the box.
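As an illustration of what a governance item does, the sketch below scans table metadata for two kinds of problems. The `TableMeta` fields, the 90-day idle threshold, and the finding messages are assumptions for this example; the actual governance items work on the platform's own metadata warehouse.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TableMeta:
    name: str
    owner: str
    lifecycle_days: Optional[int]  # None means no lifecycle is set
    last_access: date


def scan_tables(tables: List[TableMeta], today: date,
                idle_days: int = 90) -> List[Tuple[str, str, str]]:
    """Return (table, owner, problem) findings for the assumed governance rules."""
    findings = []
    for t in tables:
        if t.lifecycle_days is None:
            findings.append((t.name, t.owner, "no lifecycle set"))
        if (today - t.last_access).days > idle_days:
            findings.append(
                (t.name, t.owner, f"not accessed for more than {idle_days} days"))
    return findings


tables = [
    TableMeta("dwd_orders", "alice", None, date(2022, 8, 1)),
    TableMeta("tmp_export", "bob", 30, date(2022, 1, 1)),
]
for finding in scan_tables(tables, today=date(2022, 8, 17)):
    print(finding)
```

Each finding carries the owner, so a problem can be assigned to a specific person and followed up.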

The long-term operation mechanism of data governance

In the evolution of Alibaba's internal data governance, three clear directions can be seen: organization, platform, and business. First of all, data governance is not simply a matter of a big data team working on technology and building a platform. It is more a problem of organizational coordination that reaches beyond a single technical team and affects the overall organizational design of the company. As shown on the left side of the figure below, there is a data platform team, business teams, and collaborating teams such as finance and risk control. Once work crosses teams, two questions become very difficult for the whole organization: how to measure the effect, and how to better mobilize the initiative of the organization. When doing governance within an enterprise, it is common to find that there is a good standard but no platform to implement it.

Within Alibaba, this was the starting point for designing the governance health score. A BU may, for example, set a goal this year of raising its health score from 70 to 80. It can work on any of the dimensions, such as computing, storage, R&D, quality, and security. Any requirements can be submitted to the data platform team, which builds the corresponding capabilities into the platform, and the goal is shared by everyone. In this way, every team has a unified assessment index to guide its data governance work. For long-term promotion, we run various data governance campaigns and carry out ongoing operational work, such as competitions on governance effectiveness among business teams. The health score can also be extended further to support organizational data collaboration and bring the initiative of the data governance organization into play.

As for specific governance results, the Data Governance Center will quantify and display storage savings, computing savings, risks prevented, and problems repaired, together with the corresponding health score improvement, so that the governance effect is clearly visible.

The Data Governance Center also aims to transform data governance from a job for the few into routine work with broad participation. The governance ranking list lets participants clearly see where they stand, so that strong performers are recognized and weaker ones are encouraged to improve. It also provides different perspectives for governance administrators and ordinary staff, so that they can clearly understand the governance health level and the problems that need to be optimized, and then optimize and govern them in a targeted manner.
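A ranking list of this kind can be sketched in a few lines. The workspace names and scores are made up, and the tie rule (equal scores share a rank) is an assumption rather than documented product behavior.

```python
from typing import Dict, List, Tuple


def ranking(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Sort by health score, highest first; equal scores share the same rank."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    result: List[Tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        # Reuse the previous rank when the score is tied with the entry above.
        rank = result[-1][0] if result and result[-1][2] == score else position
        result.append((rank, name, score))
    return result


for rank, name, score in ranking({"ws_a": 80, "ws_b": 92, "ws_c": 80}):
    print(rank, name, score)
```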

2. Best Practices for Cost Optimization Governance

Let's look at a concrete case of cost optimization governance. In this case, the customer uses the DataWorks + MaxCompute product combination to build an offline data warehouse, and MaxCompute uses the postpaid model. With the rapid development of the business, the cost had become unpredictable to a certain extent. The customer's demand was to reduce the overall cost by 30% while continuing to support business development. The customer also had strict SLA requirements: cost optimization must not weaken the commitment on business data output time. We adopted three major categories of optimization measures, reduced the overall cost by more than 35%, and the SLA of data production continued to improve steadily.

Measure 1: Optimize governance for existing problems, take tasks and tables offline, and reduce waste of resources

2. Use the resource usage detail function to shift job schedules away from peak hours, based on each job's SLA tolerance and CU consumption.

3. Use the Task 360 feature to view the specific issues that can be optimized for a given task and deal with them.

4. Use the governance workbench function to view the overall picture of the tasks that can be optimized, and optimize them with reference to the processing guide.
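The peak-shifting idea in item 2 can be illustrated with a small greedy scheduler. This is a sketch under simplifying assumptions, not a DataWorks feature: each job is assumed to run within a single hourly slot at a constant CU level, and `latest_hour` stands for the last start hour that still meets the job's SLA. All job names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Job:
    name: str
    cu: int             # CUs the job consumes while running (assumed constant)
    earliest_hour: int  # first hour the job may start
    latest_hour: int    # last start hour that still meets its SLA


def stagger(jobs: List[Job]) -> Dict[str, int]:
    """Greedy plan: place the largest jobs first, each into its least-loaded allowed hour."""
    load: Dict[int, int] = {}
    plan: Dict[str, int] = {}
    for job in sorted(jobs, key=lambda j: (-j.cu, j.name)):
        hours = range(job.earliest_hour, job.latest_hour + 1)
        best = min(hours, key=lambda h: (load.get(h, 0), h))
        plan[job.name] = best
        load[best] = load.get(best, 0) + job.cu
    return plan


def peak_cu(jobs: List[Job], plan: Dict[str, int]) -> int:
    """Highest total CU consumption in any single hour under the given plan."""
    load: Dict[int, int] = {}
    for job in jobs:
        hour = plan[job.name]
        load[hour] = load.get(hour, 0) + job.cu
    return max(load.values(), default=0)


jobs = [
    Job("etl_core", cu=100, earliest_hour=0, latest_hour=0),  # strict SLA
    Job("report", cu=60, earliest_hour=0, latest_hour=3),
    Job("backfill", cu=80, earliest_hour=0, latest_hour=5),   # loose SLA
]
plan = stagger(jobs)
print(plan, peak_cu(jobs, plan))
```

A greedy plan is not guaranteed to be optimal, but it shows the trade-off: jobs with loose SLAs move out of the peak hour, and the peak CU level that has to be paid for drops.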

For cost optimization, you can focus on the following check items and governance items provided by the Data Governance Center:

Check item: partition table queries must include partitions Import
Governance Item: Continuous Error Node
Governance Item: Empty Run Node
Governance Item: No Access to Leaf Node
Governance Item: SELECT Invalid Scheduling
Governance Item: Brute-force Scan
Governance Item: Input is Empty
Governance Item: Output is Empty
Governance Item: No Lifecycle is Set
Governance Item: Tables that have not been accessed for a long time

Measure 2: Switch the MaxCompute project from postpaid to prepaid, and use secondary Quota to reduce costs

MaxCompute resources have two payment modes: "postpaid" and "prepaid". The postpaid model is widely used because its resource allocation is flexible, it can meet the resource demands of large tasks in a timely manner, and it shortens task output time. Its drawback is that expenses are hard to plan and control in advance, so unexpectedly large bills can occur. In contrast, the prepaid model supports the purchase of a fixed amount of resources, which makes the overall budget easier to control. As a result, many customers now want to move from postpaid to prepaid to control the overall budget and to refine and optimize costs.

Moving from postpaid to prepaid is a double-edged sword. In prepaid mode the purchased amount is limited, which may affect task completion time. Before the conversion, you need to understand the characteristics of the project, such as whether resource usage is bursty and when its peaks and troughs occur, and carry out a comprehensive investigation. For projects in postpaid mode, the Data Governance Center provides a CU consumption trend that converts resource usage into its prepaid equivalent, which can be used as a reference for the number of CUs to purchase. As a rule of thumb, we recommend 1.2 to 1.5 times the peak value of the trend graph. If you want to switch but are not sure how many CUs to buy, you can also contact us for help with a capacity assessment.

After switching from postpaid to prepaid, making full use of MaxCompute's secondary quota groups helps optimize resource allocation. There are three practical experiences to share:

Strong isolation: Set the minimum guaranteed amount of a quota group equal to its maximum (min = max) to guarantee its resource allocation, as with the "algorithm group" in the figure below. This suits projects that need strong guarantees during the peak hours of night-time jobs.

Resource skew: If min < max is set, other quota groups can use the resources when this quota group is idle. This provides better flexibility.

Time-sharing: Using the time-sharing settings of the Quota component, you can balance the resource needs of peak production jobs at night against those of analysis and query projects during the day, thereby reducing the overall peak of purchased CUs.

In addition, two points need special attention:

1. Sort out job priorities and configure DataWorks baseline monitoring for high-priority jobs so that they get resources first. If the system predicts that the output of a key task will be delayed, an alert is sent in advance, leaving enough lead time to respond.

2. After switching to prepaid, MCQA query acceleration resources need to be re-planned. Pay special attention to this if you use the feature.

Measure 3: For data backfill scenarios, use the use quota feature flexibly to keep resource consumption controllable

Backfilling data, that is, re-running jobs to refresh historical data, is widely used in algorithm experiment scenarios. Typically, when a model validates well, algorithm engineers need to refresh the data for a week, a month, or even half a year. A typical characteristic of such algorithm jobs is that the amount of scanned data is extremely large while the SLA for completion time is relatively loose; for example, finishing within one day is acceptable. Under the postpaid model, fees are charged in proportion to the amount of data scanned, which brings very high cost overhead. The left side of the figure below shows this situation: when costs are broken down, the cost of periodically scheduled tasks is relatively stable and controllable, but the uncertainty of backfill costs makes the overall cost uncontrollable to a certain degree.

For this scenario, MaxCompute provides a new feature, use quota, which directs a job to a specific prepaid quota group and limits it to a lower CU ceiling. This ensures that the task completes while keeping the cost under control. For periodically scheduled tasks, use quota is not recommended in principle, because it has a greater impact on the SLA. Evaluate carefully before using it, and at least configure baseline monitoring so that delays in task output can be predicted in advance.

3. Future Planning

The future planning of the Data Governance Center is based on the core demand of reducing costs and increasing efficiency. It focuses on the automatic prevention and evolution of governance issues and on improving the efficiency of handling them.

Function construction

Based on the best practices of Alibaba internally and of DataWorks customers, continuously enrich the built-in governance items and check items so that governance problems can be discovered and prevented more comprehensively.
Provide elegant solutions for governance operations such as taking tasks offline and deleting tables, to address concerns about governance risk and improve the efficiency and completion rate of problem handling.
Continue to strengthen resource usage analysis and insight to help control unreasonable resource costs.
Expand the supported engine types, from MaxCompute only to more engines such as EMR Hive and Hologres.
Best practices and industry templates for different industries.

Commercialization

Fully open: July 2022, with a one-month free trial for a limited time (no edition limit).
Product charges: August 2022 (ETA), as a core feature of the Enterprise Edition, with no separate additional charge. For details, see "DataWorks value-added version billing and description".

Original link: http://click.aliyun.com/m/100…

This article is original content of Alibaba Cloud and may not be reproduced without permission.