Big data quality monitoring and data quality processing system

Time: 2021-12-05

1 Overview of data quality on a big data platform

Data quality is the foundation of effective big data mining. A widely quoted saying in the field is "garbage in, garbage out": the quality of the input data directly determines the quality of the processing results. In big data processing, the sheer volume of data means that value density is low. If there are too many data quality problems, the noise density directly erodes the value density and reduces the final value of the data.
Data quality generally includes the following dimensions:

[Figure: data quality dimensions]

Completeness: data completeness. The source data received by the data platform must be complete relative to the data generated at the source. If records are lost, the data set is incomplete, the underlying patterns are not well reflected, and the final algorithmic results can diverge significantly from the real-world patterns.
Consistency: data consistency. Data that are linked by internal business logic must not lose that logic. For example, if sales = Σ(unit price per customer × purchase quantity), then the sales, unit price, and purchase quantity figures obtained by the platform should still satisfy this relationship; if the relationship no longer holds, consistency has been lost.
Validity: data validity. The stored value must be valid with respect to the field definition. For example, if a revenue field is defined as numeric and the stored value is "AAA", the value is invalid.
Accuracy: data accuracy. For numeric data, the data platform usually defines a maximum number of digits and decimal places. When the numeric type defined on the platform cannot hold the actual data, accuracy is lost. For example, the price sent by the external system is 3.2536; a data engineer who sees that this is a currency field may take it for granted that two decimal places are enough, so the stored result becomes 3.25, accuracy is lost, and every calculation derived from the field is affected. Another point to watch is the consistency of numeric type definitions across systems. If the source system defines a field as decimal but the big data platform defines it as float, the stored values may look identical, yet when the field is used in calculations, especially as a divisor, the different underlying types produce different precision in the result (a small numeric illustration follows this list).
Integrity: data integrity. Data integrity refers to the overall consistency of the data, including the consistency of relationships between field values across tables; associated data must be present in the associated tables without loss.
Timeliness: data timeliness. In big data systems, data is generated faster and faster, so the processing system must also be highly timely in order to deliver analysis results quickly enough for decision-makers to act on them.
Uniqueness: data uniqueness. There must be no duplicate data: the data received by the platform should match the metadata system's records without duplication. What counts as a duplicate must be defined according to the actual business. In particular, the rules by which the source system generates data, and what each record actually represents, need to be analysed; duplicates cannot be defined through dimension fields alone. When multiple records legitimately share the same dimension fields, naively using those fields as the associated primary key can easily mark valid data as invalid and introduce errors.
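To make the decimal-versus-float point under Accuracy concrete, here is a small illustration in plain Python. The price 3.2536 comes from the example above; the revenue figure and the two-decimal truncation are illustrative assumptions:

```python
from decimal import Decimal, ROUND_HALF_UP

source_price = "3.2536"            # value sent by the external system

# Case 1: the platform field only keeps 2 decimal places.
truncated = Decimal(source_price).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(truncated)                   # 3.25 -- the extra precision is already gone

# Case 2: the source defines decimal, the platform stores float.
as_decimal = Decimal(source_price)
as_float = float(source_price)

# Using the field as a divisor makes the difference visible.
revenue = 1_000_000
print(Decimal(revenue) / as_decimal)   # exact decimal arithmetic (28 significant digits by default)
print(revenue / as_float)              # IEEE-754 double; may differ from the decimal result in the trailing digits
```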

2 Data quality monitoring rules for source data

The data quality monitoring system defines monitoring rules along the dimensions of data quality described above. These rules are the basis for handling data quality problems. If too few rules are set, much of the noise (problem data) goes undetected; if the rules are set too aggressively, valuable data may be treated as problem data, shrinking the data set and harming its completeness. Setting the rules sensibly is therefore critical. Monitoring rules must be defined from the requirements of the specific business system and the relevant quality dimensions, so the rule definitions and settings of different business systems are completely different; hastily deciding the monitoring rules for your own system by simply copying industry practice is a mistake.
The general rules we can use are as follows (a minimal sketch of these checks appears after the list):
Completeness: null value check. Note that in the source data, null values may take several forms, e.g. "null", "n", "a".
Uniqueness: duplicate record check. How duplicates are identified depends on how the source data is stored; the choice of the associated primary key determines whether duplicate detection is effective.
Validity: data format check, including value type, decimal places, date field format, etc.
Integrity / consistency: check the consistency/integrity of data across fields and tables with additional SQL. (This rule is highly dependent on the specific business.)
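A minimal sketch of these general rules in PySpark, assuming a Spark environment. The input path, the business key order_id, and the columns price, sales, and quantity are placeholders chosen for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/orders/")       # placeholder source table

# Completeness: count null-like values per column.
null_tokens = ["null", "n", "a"]                             # forms listed above
null_counts = df.select([
    F.sum((F.col(c).isNull() |
           F.lower(F.col(c).cast("string")).isin(null_tokens)).cast("int")).alias(c)
    for c in df.columns
])

# Uniqueness: duplicate records over the business key.
duplicates = df.groupBy("order_id").count().filter(F.col("count") > 1)

# Validity: format check, e.g. price must parse as decimal(18, 2).
invalid_price = df.filter(F.col("price").cast("decimal(18,2)").isNull() &
                          F.col("price").isNotNull())

# Integrity / consistency: a business-specific SQL check, e.g. sales = price * quantity.
df.createOrReplaceTempView("orders")
inconsistent = spark.sql(
    "SELECT * FROM orders WHERE ABS(sales - price * quantity) > 0.01"
)
```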

3 Handling methods for data quality problems

The monitoring system needs to identify abnormal data through the monitoring rules, and it should also provide reasonable ways of handling the problem data it finds, so that data quality can actually be improved. The following sections describe several handling methods.

3.1 Removing problem data

Removing problem data is the simplest way to handle a data problem. Before removing data, however, two questions need to be considered: 1. Is the data to be removed genuinely worthless? 2. If the data still has value, is there another way to mitigate the impact of the problem data?
Data removal method (a minimal sketch of these steps follows the list):
a. Identify problem data through monitoring rules;
b. Separate the problem data from the clean data and place them in different locations (e.g. clean data path: s3://{bucket}/{layer}/{domain}/{entity}/{data}; problem data path: s3://{bucket}/quarantine/{layer}/{domain}/{entity}/{data})
c. Set up an email notification system: when data is moved to the problem data path, an email should automatically be sent to the relevant people. The notification generally includes the problem records, their count, and the quarantine path they were moved to.
d. After the problem data has been removed, the remaining clean data can continue through the rest of the processing pipeline.
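A minimal PySpark sketch of steps a-d. The flagging rule, the S3 paths, and the SNS topic ARN are placeholders; the notification could equally be sent by email through SES or any other channel:

```python
import boto3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-quarantine").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/orders/")           # placeholder input

# Step a: a placeholder rule flagging records the monitoring rules identified as problems.
problem_condition = F.col("order_id").isNull() | (F.col("price") < 0)

problem_df = df.filter(problem_condition)
clean_df = df.filter(~problem_condition)

# Step b: write clean data and problem data to separate paths.
clean_path = "s3://my-bucket/curated/sales/orders/2021-12-05/"
quarantine_path = "s3://my-bucket/quarantine/curated/sales/orders/2021-12-05/"
clean_df.write.mode("overwrite").parquet(clean_path)
problem_df.write.mode("overwrite").parquet(quarantine_path)

# Step c: notify the relevant people (here via a hypothetical SNS topic).
problem_count = problem_df.count()
if problem_count > 0:
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:dq-alerts",  # placeholder ARN
        Subject="Problem data quarantined",
        Message=f"{problem_count} records moved to {quarantine_path}",
    )

# Step d: downstream processing continues with clean_df only.
```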

3.2 Fixing data format problems

As the questions in the previous section suggest, when problem data still has value, the first choice is to correct it rather than remove it. How the data is corrected depends on the specific problem. If the problem lies in the data format, the following situations are common (a sketch of these fixes follows the list):
a. Date/time format problems (e.g. 2020/10/20 → 2020-10-20)
b. Numeric values stored as strings in the source file (for example, CSV files store all data as strings by default)
c. Decimal places (the value type and decimal places defined on the data platform need to match the source data)
d. Inconsistent spelling and case in string values (for example, the same value stored as "Men", "MEN", "men")
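A minimal PySpark sketch of the four fixes above; the input path, column names, and formats are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-format-fixes").getOrCreate()
df = spark.read.csv("s3://my-bucket/raw/orders.csv", header=True)   # CSV reads everything as strings

fixed = (
    df
    # a. Normalize the date format: 2020/10/20 -> a proper date column.
    .withColumn("order_date", F.to_date("order_date", "yyyy/MM/dd"))
    # b. Cast numeric columns that arrived as strings back to numeric types.
    .withColumn("quantity", F.col("quantity").cast("int"))
    # c. Align the decimal definition with the source system.
    .withColumn("price", F.col("price").cast("decimal(18,4)"))
    # d. Normalize spelling/case of categorical values, e.g. "Men" / "MEN" / "men".
    .withColumn("gender", F.lower(F.trim(F.col("gender"))))
)
```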

3.3 Handling missing data

3.3.1 Reasons for missing data

a. The data has not been collected;
b. The provider refused to provide the data;
c. The data does not exist in the business (not applicable).

3.3.2 How to handle missing data

a. Fill the missing values with the mean, median, or mode of the other values;
b. Fill with a default value (e.g. if the birthday is missing, fill it with 1990-01-01);
c. Fill or correct the value from a calculation over other fields (only applicable when the field has a calculation relationship with fields in other tables), as sketched after this list.
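A minimal PySpark sketch of the three filling strategies; the input path and the columns income, birthday, sales, price, and quantity are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-missing").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/customers/")    # placeholder input

# a. Fill numeric gaps with a statistic of the other values (mean here;
#    the median or mode can be computed and used the same way).
mean_income = df.select(F.avg("income")).first()[0]
df = df.fillna({"income": mean_income})

# b. Fill with an agreed default value.
df = df.fillna({"birthday": "1990-01-01"})

# c. Derive the value from related fields when a calculation relationship exists.
df = df.withColumn("sales", F.coalesce(F.col("sales"), F.col("price") * F.col("quantity")))
```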

3.4 Correcting problem data

3.4.1 Causes of problem data

a. The user provided wrong data when the data was collected;
b. The user deliberately avoided providing correct data (for example, age or income in a questionnaire);
c. The external system provided incorrect data.

3.4.2 Correction methods for problem data

1. Set boundary-value monitoring in the monitoring rule (for example, age >= 5; values below this are treated as invalid. When the age is invalid, analyse whether the user may have provided false information only for that field; if so, the data in the other fields may still be valid.)
2. Replace the erroneous value with a default value;
3. Correct the erroneous field value by calculating it from other fields (see the sketch after this list).
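A minimal PySpark sketch of the three correction methods; the age threshold, the default value -1, and the column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-corrections").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/survey/")       # placeholder input

MIN_AGE = 5   # boundary value from the monitoring rule (illustrative)

df = (
    df
    # 1. Boundary-value rule: ages below the threshold become null (invalid),
    #    while the rest of the record is kept.
    .withColumn("age", F.when(F.col("age") >= MIN_AGE, F.col("age")))
    # 2. Replace the invalidated value with a default.
    .withColumn("age", F.coalesce(F.col("age"), F.lit(-1)))
    # 3. Correct a wrong field by recomputing it from other fields.
    .withColumn("sales", F.col("price") * F.col("quantity"))
)
```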

4 Data quality monitoring dashboard

The data quality monitoring dashboard is very important for data analysts: it lets them monitor the current quality of the data and judge how reliable it is.
The data quality dashboard can be merged into other data dashboards for display and can be built with tools such as Tableau or Power BI.


5 Introduction to data quality monitoring tools

There are currently some open-source monitoring tools on the market, such as Amazon Deequ and Apache Griffin, but most of them only provide monitoring and have no built-in handling of data problems. Data engineers can generally set monitoring rules, handle data problems, and display data monitoring reports following the methods described in this article.
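For example, a minimal sketch using PyDeequ (the Python wrapper for Deequ) to run a few of the rules from section 2 against a Spark DataFrame. The Spark version, the S3 path, and the column names are placeholder assumptions; the API usage follows the examples in the PyDeequ README:

```python
import os
os.environ.setdefault("SPARK_VERSION", "3.3")   # PyDeequ selects the matching Deequ build from this

from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/raw/orders/")   # placeholder input

check = Check(spark, CheckLevel.Error, "orders data quality")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("order_id")                          # completeness
                    .isUnique("order_id")                            # uniqueness
                    .isNonNegative("price")                          # validity
                    .isContainedIn("currency", ["USD", "EUR", "CNY"]))  # validity of categorical values
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```

The resulting DataFrame contains one row per constraint with its status, which could feed the dashboard described in section 4; handling of the flagged data still has to be built separately, as noted above.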
Deequ Git: https://github.com/awslabs/deequ
Deequ Introduction: https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
Griffin Git: https://github.com/apache/griffin
Griffin Introduction: https://griffin.apache.org/docs/quickstart.html