Big data preprocessing methods: how many do you know?


Big data holds great value and has attracted wide attention from all walks of life. Big data comes from many sources, and the data collected from the real world are generally incomplete, inconsistent "dirty" data that cannot be mined or analyzed directly, or that produce unsatisfactory results when they are. To improve the quality of data analysis and mining, the data must first be preprocessed.
The main data preprocessing methods are data cleaning, data integration, data transformation, and data reduction.

1. Data cleaning

Real-world data are often incomplete, noisy, and inconsistent. The data cleaning process therefore covers handling missing data, handling noisy data, and handling inconsistent data.
Missing data can be handled by ignoring the record, supplying the missing value manually, filling it with a default value, filling it with the mean of the attribute, or filling it with the most probable value.
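Mean filling, for instance, can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a list of numeric values in which `None` marks a missing entry:

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    avg = sum(observed) / len(observed)
    return [avg if v is None else v for v in values]

# Two incomes are missing; both are filled with the mean of the rest.
incomes = [3000, None, 5000, 4000, None]
print(fill_missing_with_mean(incomes))  # [3000, 4000.0, 5000, 4000, 4000.0]
```

Filling with the most probable value works the same way structurally, but uses a predictive model (e.g., regression over the other attributes) instead of the mean.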
Noisy data can be handled with binning, cluster analysis, combined human-machine inspection, and regression.
Inconsistent data can often be corrected manually with the help of external references, such as the original source documents or known integrity constraints.
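As a sketch of the binning idea: sort the values, partition them into equal-frequency bins, and smooth each value by replacing it with its bin's mean. The price data below are illustrative only:

```python
def smooth_by_bin_means(values, n_bins):
    """Smooth noisy data: sort, split into equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * size
        end = start + size if i < n_bins - 1 else len(ordered)  # last bin takes any remainder
        bin_vals = ordered[start:end]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([bin_mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Variants smooth by bin medians or by bin boundaries instead of bin means.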

2. Data integration

Big data processing often involves data integration, that is, combining data from multiple sources, such as databases, data cubes, and ordinary files, into a unified data set, providing a complete data foundation for subsequent processing.
In the process of data integration, the following problems need to be considered and solved.

(1) Schema integration problem

Schema integration concerns how to match real-world entities across multiple data sources, which involves entity identification.
For example, how can we determine whether "custom_id" in one database and "custom_number" in another refer to the same attribute of the same real-world entity?

(2) Redundancy problem

Redundancy is another common problem in data integration. An attribute is redundant if it can be derived from other attributes.
For example, an average monthly income attribute in a customer data table is redundant, because it can be computed from the monthly income attribute. In addition, inconsistent attribute naming can also introduce redundancy into the integrated data set.
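Redundancy between numeric attributes is commonly detected with correlation analysis. The sketch below computes the Pearson coefficient by hand; the income figures are made up, with the annual column exactly 12x the monthly one, so the pair is fully redundant:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

monthly_income = [3000, 4000, 5000, 6000]
annual_income = [36000, 48000, 60000, 72000]  # exactly 12 x monthly: derivable
r = pearson_r(monthly_income, annual_income)
print(round(r, 6))  # a coefficient near 1 flags the pair as redundant
```

A coefficient near +1 or -1 suggests one attribute of the pair can be dropped or derived on demand.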

(3) Data value conflict detection and elimination

Data value conflict detection and elimination is another problem in data integration. Attribute values for the same real-world entity may differ across data sources, owing to differences in representation, scale, or encoding.
For example, a weight attribute may be stored in metric units in one system and imperial units in another, and a price attribute may use different currency units in different regions. Such semantic differences cause many problems for data integration.
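Resolving such a scale conflict usually means normalizing all records to one canonical unit. A minimal sketch for the weight example, where the conversion factor and the record layout are assumptions for illustration:

```python
LB_PER_KG = 2.20462  # assumed conversion factor; production code should use an authoritative table

def weight_to_kg(value, unit):
    """Normalize a weight attribute from mixed source units to kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value / LB_PER_KG
    raise ValueError(f"unknown unit: {unit}")

# Records from two systems, one metric and one imperial, unified to kg.
records = [(70.0, "kg"), (154.3234, "lb")]
unified = [round(weight_to_kg(v, u), 2) for v, u in records]
print(unified)  # [70.0, 70.0]
```

The same pattern, a lookup of conversion rules keyed by source encoding, applies to currency and other coded attributes.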

3. Data transformation

Data transformation converts or consolidates data into forms suitable for processing. Common transformation strategies are as follows.

(1) Normalization processing

Normalization projects an attribute's value range onto a specific interval, eliminating the bias that differently scaled numeric attributes would otherwise introduce into mining results. It is commonly used when preprocessing data for neural networks, distance-based nearest-neighbor classification, and clustering. For neural networks, normalized data help ensure correct learning results and improve learning efficiency. For distance-based mining, normalization prevents attributes with large value ranges from unfairly dominating the distance computation.
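Min-max normalization is the most direct form of this projection. A small sketch, with illustrative income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly project an attribute's value range onto [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    return [new_min + (v - old_min) / (old_max - old_min) * (new_max - new_min)
            for v in values]

incomes = [10000, 20000, 30000]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```

After this step, an income attribute spanning tens of thousands and an age attribute spanning decades contribute on the same scale to a Euclidean distance.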

(2) Attribute construction processing

Attribute construction builds new attributes from the existing attribute set and adds them to it, helping the mining process uncover deeper pattern knowledge and improving the accuracy of mining results.
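For instance, an area attribute constructed from separate width and length attributes may reveal patterns neither original attribute shows alone. The record layout below is hypothetical:

```python
# Hypothetical records: each has separate width and length attributes.
apartments = [
    {"width_m": 4.0, "length_m": 6.0, "rent": 900},
    {"width_m": 3.0, "length_m": 5.0, "rent": 700},
]

# Construct a new "area" attribute from the existing ones and add it to each record.
for row in apartments:
    row["area_m2"] = row["width_m"] * row["length_m"]

print([row["area_m2"] for row in apartments])  # [24.0, 15.0]
```

A miner looking for a rent-per-square-meter pattern can now use `area_m2` directly instead of rediscovering the relationship from the raw dimensions.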

(3) Data discretization processing

Data discretization replaces the raw values of a numeric attribute with interval labels or concept labels. In essence, it maps continuous attribute values onto a small, finite number of intervals, which can markedly improve the computational efficiency of data mining.
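A minimal sketch of interval labeling, with assumed income boundaries and labels:

```python
def discretize(values, boundaries, labels):
    """Replace each numeric value with the label of the first interval it falls into.

    boundaries[i] is the inclusive upper bound of interval i.
    """
    result = []
    for v in values:
        for upper, label in zip(boundaries, labels):
            if v <= upper:
                result.append(label)
                break
        else:
            raise ValueError(f"value {v} exceeds the last boundary")
    return result

incomes = [2500, 4800, 9000]
labels = ["(0,3000]", "(3000,6000]", "(6000,10000]"]
print(discretize(incomes, [3000, 6000, 10000], labels))
# ['(0,3000]', '(3000,6000]', '(6000,10000]']
```

Algorithms that must enumerate distinct values now see three interval labels instead of arbitrarily many raw incomes.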

(4) Data generalization processing

Data generalization replaces low-level (raw) data objects with more abstract, higher-level concepts. It is widely used in the transformation of nominal data. For example, a street attribute can be generalized to higher-level concepts such as city and country, and a numeric attribute such as age can be mapped to higher-level concepts such as youth, middle age, and old age.
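Both examples reduce to a mapping up a concept hierarchy. The sketch below uses an assumed city-to-country hierarchy and assumed age thresholds, chosen purely for illustration:

```python
# Assumed two-level concept hierarchy for a nominal "city" attribute.
city_to_country = {"Beijing": "China", "Shanghai": "China",
                   "Paris": "France", "Lyon": "France"}

def generalize_age(age):
    """Map a numeric age to a higher-level concept label (thresholds are illustrative)."""
    if age < 45:
        return "youth"
    if age < 60:
        return "middle age"
    return "old age"

cities = ["Shanghai", "Paris", "Beijing"]
print([city_to_country[c] for c in cities])        # ['China', 'France', 'China']
print([generalize_age(a) for a in [30, 50, 72]])   # ['youth', 'middle age', 'old age']
```

Generalization of a numeric attribute is thus discretization plus a concept label for each interval.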

4. Data reduction

Complex analysis of large-scale data usually takes a long time, so data reduction techniques are needed. Their main purpose is to obtain a much smaller data set from the original huge one while preserving its integrity, so that mining on the reduced data set is more efficient yet produces essentially the same results as mining the original.
The main strategies of data reduction are as follows.
(1) Data aggregation, such as constructing a data cube (a data warehouse operation).
(2) Dimension reduction, which detects and eliminates irrelevant, weakly relevant, or redundant attributes or dimensions (the attributes of a data warehouse), for example by removing redundant attributes through correlation analysis.
(3) Data compression, which uses encoding techniques to shrink the size of the data set.
(4) Numerosity reduction, which replaces the original data with smaller representations, such as parametric models or nonparametric methods (clustering, sampling, histograms, etc.). In addition, generalization based on concept trees can also reduce the data scale.
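A histogram is perhaps the simplest nonparametric reduction: raw values are replaced by per-bin counts. A sketch with made-up price data:

```python
def equal_width_histogram(values, n_bins):
    """Summarize a numeric attribute as equal-width per-bin counts instead of raw values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return counts

prices = [1, 1, 2, 3, 5, 8, 8, 9, 10, 10]
print(equal_width_histogram(prices, 3))  # [4, 1, 5]
```

Ten raw values have been reduced to three counts; many aggregate queries (totals, densities, approximate distributions) can be answered from the counts alone.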
The above content is extracted from the book "Big Data Acquisition and Processing".