Data governance


Learning and organizing data governance knowledge is not that hard. If taking in all the content at once is overwhelming, start by writing it down roughly. The following is best understood together with the address data governance example used throughout.

1、 What is data governance
Business, technical and management activities that improve data quality all fall within the scope of data governance.
Current mainstream data governance covers data standards, data models, metadata, master data, data distribution and storage, data lifecycle management, data quality, data security, etc.


The purpose of data governance is to control data through effective data resource management, thereby improving data quality and the ability to monetize data.

2、 Why data governance
In China, enterprise informatization across industries has roughly gone through three stages: siloed ("chimney") system construction in the early stage, integrated system construction in the middle stage, and data management system construction in the later stage. It has been a process of building first and governing later, and the construction process often leaves behind data quality problems:
1. Uneven data quality levels
In the information age, enterprises, governments and other organizations pay increasing attention to managing their "data assets". These organizations produce large amounts of text, video and audio data every day, including both high-quality, usable data and low-quality data that is difficult to use. Eliminating low-quality data and retaining well-governed, valuable data is essential governance work.

2. Difficulties in data exchange and sharing
In the early stage of enterprise informatization, there was a lack of overall planning, and data became scattered across systems with inconsistent architectures, different development languages and diverse databases, forming "information islands" within the enterprise. Only by connecting data and eliminating these "information islands" can data-driven business and data-driven management be realized and data value truly released.

3. Lack of effective management mechanism
Many enterprises now try to control data flow through the business processes of their production systems, but due to the lack of an effective management mechanism and various human factors, data maintenance errors, duplication, inconsistency and incompleteness arise as data flows, producing large amounts of junk data. Unclear data ownership, confused management responsibilities, and ill-defined management and usage processes are important causes of data quality problems.

4. There are data security risks
In March 2018, the information of 50 million Facebook users was leaked and abused; in 2016, SF Express employees stole large amounts of customer information and were taken to court; in 2017, JD.com employees stole 5 billion pieces of users' personal information and sold it on the online black market through various channels. With the development of big data, such data security incidents are numerous. Data asset management is moving from traditional, decentralized manual management to centralized, computerized management, and people pay more and more attention to data security.

3、 Specific content of data governance
Each data governance field can be studied as an independent direction. The data governance fields summarized so far include, but are not limited to, the following: data standards, metadata, data models, data distribution, data storage, data exchange, data lifecycle management, data quality, data security and data sharing services.

1. Data standards
A good data standard system is conducive to data sharing, interaction and application. Data standards apply to business data description, information management and application system development, and include basic standards and indicator standards (or application standards). They serve both as the basis of information management and as the basis for data definitions during application development. National, industry, enterprise and local standards are all involved, and they are associated with metadata entities or elements when those are defined. A data standard is mainly composed of a business definition, a technical definition and management information.
(1) The business definition mainly states the business subject to which the standard belongs and the business concept behind it, including the rules for business use and the sources of the standard. For code standards, the coding rules and related code contents are further clarified.
(2) The technical definition describes technical attributes such as data type, data format, data length and source system, so as to guide and constrain the construction and use of information systems.
(3) The management information identifies the owner, managers, user departments and other responsible parties of the standard, so that its management and maintenance have a clear owner and the standard can be continuously updated and improved.
For example, to be compatible with and integrate the application's existing multi-source address data, the address data specification was designed with full reference to national, provincial and municipal address standards and industry address standard specifications.
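The three parts of a data standard described above can be sketched as a single structured entry. This is a minimal illustration, not any real registry format; the field names, the `postal_code` example and its six-digit rule are all assumptions for the sake of the sketch.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one data-standard entry combining the three parts
# described above (business, technical and management definitions).

@dataclass
class DataStandard:
    # Business definition: owning subject area and business meaning
    name: str
    business_subject: str
    business_rule: str
    # Technical definition: type, format, length, source system
    data_type: str
    data_format: str
    max_length: int
    source_system: str
    # Management information: responsible parties for ongoing maintenance
    owner: str
    steward: str
    user_departments: list = field(default_factory=list)

    def validate(self, value: str) -> bool:
        """Check a value against the technical definition (length only here)."""
        return isinstance(value, str) and len(value) <= self.max_length

# Illustrative entry for an address field, referencing a (hypothetical)
# national postal-code standard:
postal_code = DataStandard(
    name="postal_code",
    business_subject="Address",
    business_rule="Six-digit postal code as defined by the national standard",
    data_type="CHAR",
    data_format=r"\d{6}",
    max_length=6,
    source_system="address_master",
    owner="Data Governance Office",
    steward="Address data steward",
    user_departments=["Logistics", "CRM"],
)

print(postal_code.validate("100080"))  # a valid-length value
```

Keeping all three definitions in one record makes the "clear responsibility subject" point concrete: whoever consumes the technical definition can also see who owns it.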

2. Data model
The data model is an important part of data governance. An appropriate, reasonable and compliant data model can effectively improve how data is distributed and used. It includes the conceptual model, the logical data model and the physical data model, and it is a key focus of data governance. A data model consists of three parts: data structure, data operations and data constraints.
(1) Data structure. The data structure in a data model describes the type, content, nature and relationships of data. It is the basis of the data model: data operations and data constraints are largely defined on top of it, and different data structures admit different operations and constraints.
(2) Data operations. The data operations in a data model describe the types of operations that can be performed on the corresponding data structure and the ways they are performed.
(3) Data constraints. The data constraints in a data model describe the syntactic and semantic relationships between data in the structure, their mutual constraints and dependencies, and the rules by which data may change dynamically, so as to ensure that the data stays correct, valid and compatible.
To me, this leans more toward the underlying data structures, data operations, data relationships, and so on.
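The three parts above can be made concrete with a toy in-memory table: the schema is the data structure, insert/query are the data operations, and the checks are the data constraints. The `AddressTable` class, its fields and its dependency rule are all invented for illustration.

```python
# Illustrative sketch of the three parts of a data model:
# structure (fields and types), operations (insert/query), and
# constraints (rules that keep the data valid). All names are made up.

class AddressTable:
    # Data structure: field names and their expected Python types
    schema = {"id": int, "province": str, "city": str, "detail": str}

    def __init__(self):
        self.rows = []

    def _check_constraints(self, row: dict) -> None:
        # Data constraints: structural conformance plus one dependency rule
        for field_name, field_type in self.schema.items():
            if field_name not in row:
                raise ValueError(f"missing field: {field_name}")
            if not isinstance(row[field_name], field_type):
                raise TypeError(f"bad type for field: {field_name}")
        if row["detail"] and not row["city"]:
            # dependency: a detailed address must belong to a city
            raise ValueError("detail requires city")

    # Data operations: the ways the structure may be manipulated
    def insert(self, row: dict) -> None:
        self._check_constraints(row)
        self.rows.append(row)

    def query(self, **filters) -> list:
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in filters.items())]

table = AddressTable()
table.insert({"id": 1, "province": "Beijing", "city": "Beijing",
              "detail": "Zhongguancun South St."})
print(table.query(city="Beijing"))
```

Because the constraints run inside the operations, invalid data cannot enter the structure, which is exactly the "correctness, validity and compatibility" role the text assigns to constraints.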

Author: datahunter
Source: Zhihu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, and for non-commercial reprint, please indicate the source.

3. Metadata
Metadata management refers to taking stock of, integrating and managing the business, technical and operational metadata involved in an enterprise, supporting the development and maintenance of business systems and data analysis platforms. Metadata is divided into business metadata, technical metadata and operational metadata, which are closely related: business metadata guides the design of technical metadata, and operational metadata supports the management of both.
(1) Business metadata. Business metadata defines business-related data and helps users locate, understand and access business information. Its scope mainly includes business indicators, business rules, data quality rules, professional terminology, data standards, conceptual data models, entities/attributes, logical data models, etc.
(2) Technical metadata. Technical metadata can be divided into structural and relational technical metadata. Structural technical metadata describes data within the IT infrastructure, such as storage location, storage type and data lineage. Relational technical metadata describes associations between data and how data flows through the IT environment. Its scope mainly includes technical rules (calculation/statistics/conversion/summary), data quality rules, technical descriptions, fields, derived fields, facts/dimensions, statistical indicators, tables/views/files/interfaces, reports/multidimensional analyses, databases/view groups/file groups/interface groups, source code/programs, systems, software, hardware, etc.
(3) Operational metadata. Operational metadata mainly refers to the organizations, positions, responsibilities and processes related to metadata management, together with the operational data generated by the system's daily running. Its management scope mainly includes the organizations, positions, responsibilities, processes, projects and versions related to metadata management, as well as operational records from production, such as run logs, applications and batch jobs.
For example, for an address product: the address data, longitude/latitude data and zip code data are business metadata; the address data structures, address interfaces and programs are technical metadata; and run records and interface call statistics are operational metadata.
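The address example above can be sketched as a small metadata registry linking the three types. Everything here is illustrative: the table name, lineage entries and interface path are made-up stand-ins, not any real system's metadata.

```python
# Hypothetical metadata registry for the address example above; all the
# concrete names (table, lineage, interface) are illustrative only.

metadata = {
    "business": {
        "address": "Standardized address text shown to users",
        "longitude_latitude": "WGS-84 coordinates of the address",
        "zip_code": "Postal code associated with the address",
    },
    "technical": {
        "table": "dw.address",                         # storage location
        "lineage": ["src.raw_address", "dw.address"],  # data lineage
        "interface": "GET /api/address/{id}",          # serving interface
    },
    "operational": {
        "steward": "address data steward",
        "refresh_job": "daily_address_etl",
        "interface_calls_today": 0,                    # call statistics
    },
}

def record_call(md: dict) -> None:
    """Operational metadata grows as the system runs (e.g. call counts)."""
    md["operational"]["interface_calls_today"] += 1

record_call(metadata)
print(metadata["operational"]["interface_calls_today"])
```

Note how the three sections depend on each other in the direction the text describes: the technical entries exist to serve the business entries, and the operational entries record how both are run and managed.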

4. Master data
Master data management is the effective management of an enterprise's core data using the relevant processes, technologies and solutions. The task is to extract the most core data that needs to be shared (the master data) from the business systems of various departments, manage it centrally, and deliver it, in the form of services, to the operational and analytical applications across the enterprise that need it. Master data management involves all participants around master data, such as users, applications and business processes. Master data is data that is widely used and shared inside and outside the enterprise; it is known as the "golden data" among enterprise data assets. Master data management is the fulcrum for leveraging enterprise digital transformation and a core part of enterprise data governance.
For the address product, the standardized address data and longitude/latitude information constitute the master data.
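A minimal sketch of the consolidation step described above: the same address appears in several business systems, and one candidate is kept as the shared "golden" copy. The completeness-based picking rule and the field names are assumptions; real master data management uses far richer matching and survivorship rules.

```python
# Minimal sketch of master-data consolidation: pick the most complete of
# several per-system records as the shared golden record. The completeness
# score and field names are invented for illustration.

def completeness(record: dict) -> int:
    """Count non-empty fields as a crude quality score."""
    return sum(1 for v in record.values() if v)

def golden_record(candidates: list) -> dict:
    """Keep the most complete candidate as the master record."""
    return max(candidates, key=completeness)

# The same address as seen by two hypothetical source systems:
crm = {"address": "1 Main St", "zip": "", "lat_lng": None}
orders = {"address": "1 Main St", "zip": "100080", "lat_lng": (39.9, 116.3)}

master = golden_record([crm, orders])
print(master["zip"])
```

Once consolidated, the golden record is what the service layer hands to the operational and analytical applications, instead of each of them keeping its own divergent copy.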


5. Data distribution and storage
Data distribution and storage mainly covers how data is divided and stored, how data is distributed between the overall system and subsystems, and how master data and reference data (also known as replica or secondary data) are managed. Only by distributing and storing data reasonably can data sharing be improved and the storage cost of redundancy kept down. Taking commercial banks as an example, considering factors such as data scale, usage frequency, usage characteristics and service timeliness, data storage can generally be divided into four storage areas: the transaction data area, the integrated data area, the analytical data area and the historical data area.
(1) Transaction data area. The transaction data area includes online application data for channel access, interactive control, business processing, decision support and management. It stores the original data generated when customers serve themselves or interact with operators, including business processing data, internal management data and some external data; it holds current-status data.
(2) Integrated data area. The integrated data area includes operational data (OLTP) and data warehouse data (OLAP).
(3) Analytical data area. Analytical data mainly serves applications for decision support and management. To analyze business execution in depth, the original data must be further summarized and analyzed, and the statistical results feed the final decision displays; the analytical data area therefore stores the indicator data produced by these statistical and analytical models.
(4) Historical data area. The historical data area stores the data of near-line applications, archiving applications, external data platform applications and the like, mainly to provide storage and query services for archived historical data.
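One rough way to think about the four areas is as a routing decision based on how current, how frequently used, and whether archived a piece of data is. The thresholds below are invented purely for illustration; the text defines the areas by purpose, not by any specific cutoffs.

```python
# Rough sketch of routing data to the four storage areas described above.
# The age/usage thresholds are invented; real systems decide by purpose
# (online processing, integration, analysis, archive), not fixed cutoffs.

def storage_area(age_days: int, access_per_month: int, archived: bool) -> str:
    if archived:
        return "historical"       # archived data, query-after-archive service
    if age_days <= 1:
        return "transactional"    # current-status data from online systems
    if access_per_month >= 10:
        return "integrated"       # frequently used OLTP/OLAP integration layer
    return "analytical"           # summarized data for decision support

print(storage_area(age_days=0, access_per_month=30, archived=False))
```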

6. Data quality
High-quality data is an important basis for analysis, decision-making and business planning. Only by establishing a complete data quality management system, clarifying management objectives, control objects and indicators, defining data quality inspection rules, carrying out inspections and producing data quality reports can data quality be managed effectively. Through the data quality issue-handling process and related functions, issues are then managed in a closed loop from discovery to resolution, driving continuous improvement of data quality.
(1) At the technical level, the system and its specifications should completely and comprehensively define the evaluation dimensions of data quality, including completeness, timeliness, etc. Against these dimensions, data quality should be checked and standardized at every stage of system construction, with problems governed promptly to avoid cleanup later.
(2) Clarify the corresponding management processes. Data quality problems can arise at every stage, so the data quality management process for each stage must be defined. For example, data quality rules should be defined during requirements and design, guiding the design of data structures and program logic; during development and testing those rules must be verified to ensure they take effect; and after go-live there should be corresponding inspections, so that data quality problems are nipped in the bud as far as possible. For management measures, adopt the strategy of controlling the increment while eliminating the stock: keep new problems under effective control and steadily clear out existing ones.
For the address product: set accuracy and recall standards for address data, and improve address quality through error correction, quality inspection of problem addresses, and user intervention.
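The "define rules, inspect, report" loop above can be sketched as a tiny rules-and-report pass over address records. The two rules and the field names are illustrative assumptions, not the product's real quality rules.

```python
import re

# Illustrative data-quality check for address records: rules are defined up
# front, applied to each record, and failures are summarized in a report,
# mirroring the inspect-and-report loop described above. Rules are made up.

RULES = {
    "address_present": lambda r: bool(r.get("address")),
    "zip_is_six_digits": lambda r: bool(re.fullmatch(r"\d{6}", r.get("zip", ""))),
}

def quality_report(records: list) -> dict:
    """Count rule failures across a batch of records."""
    report = {name: 0 for name in RULES}
    for record in records:
        for name, rule in RULES.items():
            if not rule(record):
                report[name] += 1
    return report

batch = [
    {"address": "1 Main St", "zip": "100080"},
    {"address": "", "zip": "10008"},   # fails both rules
]
print(quality_report(batch))
```

The report is the input to the closed loop: each non-zero count becomes an issue to be traced back, fixed, and re-checked on the next inspection run.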

7. Data lifecycle
Everything has a life cycle, and data is no exception. The generation, processing, use and eventual destruction of data should be managed scientifically. Data that is rarely or no longer used should be moved out of the system and retained on verified storage devices; this improves system performance and service to customers while greatly reducing the cost of keeping data in primary storage long term.
The data life cycle generally includes three stages: online stage, archiving stage (sometimes further divided into online archiving stage and offline archiving stage) and destruction stage. The management content includes establishing reasonable data categories and formulating retention time, storage media, cleaning rules and methods, precautions, etc. for different types of data.
For the address product: the access, use and handover of address data to users.
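The three stages above can be sketched as a simple function that assigns a lifecycle stage from a record's last-access date. The retention periods are invented for illustration; real policies come from the retention rules set per data category.

```python
from datetime import date

# Sketch of assigning a lifecycle stage from last-access age. The one-year
# and five-year retention periods are illustrative, not from any standard.

ONLINE_DAYS = 365          # keep online for one year
ARCHIVE_DAYS = 365 * 5     # then archive for five years, then destroy

def lifecycle_stage(last_access: date, today: date) -> str:
    age = (today - last_access).days
    if age <= ONLINE_DAYS:
        return "online"
    if age <= ARCHIVE_DAYS:
        return "archived"      # could be split into online/offline archive
    return "destroy"

today = date(2020, 1, 1)
print(lifecycle_stage(date(2019, 6, 1), today))
```

In practice the "archived" branch would be subdivided into online and offline archiving, as the text notes, with different storage media and cleanup rules for each.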

8. Data services
Data service management studies how to make full use of the data accumulated over many years to analyze and optimize business processes. Data is typically used through in-depth processing and analysis: various reports and tools expose problems at the operational level, while data mining and similar tools process data further to better serve managers. A unified data service platform is established to meet cross-departmental and cross-system data needs; by unifying data sources it turns multiple sources into a single source, speeds up data flow and improves the efficiency of data services.
For the address product: data services such as multi-dimensional statistics on address data and system usage analysis.
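A toy sketch of the "single source" idea above: consumers call one service layer over one governed source, and the platform tracks usage statistics as a by-product. The class and its methods are hypothetical, not a real platform API.

```python
# Toy sketch of a unified data service: every consumer reads through the
# service layer from a single governed source, instead of wiring up its
# own copy. The AddressDataService name and methods are illustrative.

class AddressDataService:
    def __init__(self, source: dict):
        self._source = source          # the single governed data source
        self.call_count = 0            # usage statistics for the platform

    def get(self, address_id: int) -> dict:
        self.call_count += 1
        return self._source.get(address_id, {})

    def usage_stats(self) -> dict:
        return {"calls": self.call_count}

service = AddressDataService({1: {"address": "1 Main St", "zip": "100080"}})
report_app = service.get(1)        # reporting consumer
mining_app = service.get(1)        # analytics consumer, same single source
print(service.usage_stats())
```

Because both consumers go through the same service, "multiple sources become a single source", and the call counter is exactly the kind of system-usage data the address example mentions.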

9. Data security
Data security should run through the whole data governance process, walking on the two legs of management and technology. On the management side, establish a data security management system, set data security standards, and cultivate security awareness among all employees. On the technical side, data security covers storage security, transmission security and interface security. Of course, security and efficiency are always in tension: the stricter the data security controls, the more limited the use of the data may be, and enterprises need to find a balance between the two. Data security management mainly covers the following three aspects:
(1) Data storage security, including physical security, system security and storage security itself, is mainly ensured through the procurement of secure hardware.
(2) Data transmission security includes data encryption and network security control, and is mainly implemented with products from professional encryption software vendors.
(3) Data use security must be controlled at the business system level: prevent unauthorized access, downloading and printing of customer data; deploy client-side security tools and a sound leak-prevention mechanism to stop unauthorized spread of personal customer information stored on clients; establish a complete data security management system, a data security standard system, a data security management organization and an effective security review mechanism; strictly manage all kinds of sensitive data used in production, R&D and testing; and strictly manage the information security of individual customers when cooperating with external parties.
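One concrete use-security control from the list above is masking sensitive customer fields before data leaves the business system. This is a hedged sketch: the masking rule (keep first three and last two characters) and the field names are invented for illustration.

```python
# Hedged example of one data-use security control mentioned above: masking
# sensitive customer fields before export. The masking rule (keep first 3
# and last 2 characters) and field names are illustrative only.

def mask_value(value: str) -> str:
    """Keep the first 3 and last 2 characters, hide the middle."""
    if len(value) < 6:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 5) + value[-2:]

def mask_record(record: dict, sensitive_fields=("phone", "id_number")) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    masked = dict(record)
    for field_name in sensitive_fields:
        if field_name in masked:
            masked[field_name] = mask_value(masked[field_name])
    return masked

customer = {"name": "A. Customer", "phone": "13800001234"}
print(mask_record(customer)["phone"])
```

Masking at the service boundary complements, rather than replaces, the storage- and transmission-level controls: even an authorized consumer of the data never sees the full sensitive value unless it genuinely needs it.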
