Python data analysis: common data preprocessing methods



The following article comes from the data theory, author: wpc7113

Introduction to Python data analysis


1. Standardization: removing the mean and scaling to unit variance

Standardization rescales each feature so that it has a mean of 0 and a variance of 1, as a standard normal (Gaussian) distribution does. It only shifts and scales the data; it does not change the shape of the distribution.

The reason for standardizing is that a feature whose variance is much larger than that of the others can dominate the objective function, so that the estimator fails to learn from the other features properly.

Standardization consists of two steps: centering (subtracting the mean so that it becomes 0) and scaling (dividing by the standard deviation so that the variance becomes 1).

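from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the iris data: X holds the features, y the class labels
iris = load_iris()
X, y = iris.data, iris.target

# Fit the scaler, then apply the standard transformation
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)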
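
To check the effect of standardization, the short sketch below fits a StandardScaler on the iris features and inspects the column means and standard deviations of the result. The names X_scaled, col_means and col_stds are chosen here for illustration only.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Fit on the data, then apply the learned per-column mean and standard deviation
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Each column should now have a mean close to 0 and a standard deviation close to 1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

In a train/test setting the scaler would normally be fitted on the training data only and then reused to transform the test data.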


2. Min-max normalization

Min-max normalization linearly rescales each feature to the interval [0, 1] (or to another interval with fixed minimum and maximum values).

min_max_scaler = preprocessing.MinMaxScaler()
x_train_minmax = min_max_scaler.fit_transform(X)
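
The linear mapping can be written as (x - min) / (max - min) for each column. The sketch below compares MinMaxScaler with that manual formula on the iris features; X_manual is an illustrative name, not part of the original code.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# feature_range=(0, 1) is the default; it is spelled out here for clarity
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_minmax = min_max_scaler.fit_transform(X)

# Manual version: (x - min) / (max - min), computed column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```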



3. MaxAbsScaler: scaling by the maximum absolute value

max_abs_scaler = preprocessing.MaxAbsScaler()
x_train_maxabs = max_abs_scaler.fit_transform(X)
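
MaxAbsScaler divides each column by its largest absolute value, so the result lies in [-1, 1] and zeros stay zero. The following sketch uses a small made-up array (X_demo is hypothetical data, not from the article) so the effect is easy to follow by hand.

```python
import numpy as np
from sklearn import preprocessing

# Small illustrative array with negative values and a zero
X_demo = np.array([[1.0, -2.0],
                   [2.0, 0.0],
                   [-4.0, 1.0]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_demo)

# Column 0 is divided by 4, column 1 by 2; signs and zeros are preserved
print(X_maxabs)
```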


4. RobustScaler: scaling data that contains outliers

transformer = preprocessing.RobustScaler().fit(X)
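
With its default settings, RobustScaler subtracts the median and divides by the interquartile range (75th minus 25th percentile), so a single extreme value has little influence on the scale. The sketch below uses a made-up column with one outlier (X_demo is hypothetical) and compares the result with a manual computation.

```python
import numpy as np
from sklearn import preprocessing

# One feature with an obvious outlier (illustrative data)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

transformer = preprocessing.RobustScaler().fit(X_demo)
X_robust = transformer.transform(X_demo)

# Manual version: (x - median) / (q75 - q25)
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
X_manual = (X_demo - median) / (q75 - q25)

print(X_robust.ravel())
```

The outlier remains large after scaling, but the other values are no longer squeezed together as they would be under min-max scaling.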


5. QuantileTransformer: quantile transformation

quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X)
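
QuantileTransformer maps each feature through its empirical quantiles, by default onto a uniform distribution on [0, 1]; a normal output distribution can be requested instead. The sketch below shows both on the iris features. Setting n_quantiles=100 is an assumption made here so that it stays below the 150 available samples; it is not a value from the original article.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Uniform output (the default): values end up in [0, 1]
qt_uniform = preprocessing.QuantileTransformer(n_quantiles=100, random_state=0)
X_uniform = qt_uniform.fit_transform(X)

# Normal output: values follow an approximately Gaussian shape
qt_normal = preprocessing.QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
X_normal = qt_normal.fit_transform(X)

print(X_uniform.min(axis=0), X_uniform.max(axis=0))
```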



6. PowerTransformer: Box-Cox transformation

The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964. It is commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the correlation between the unobserved error and the predictor variables can be reduced to some extent. Its main feature is that it introduces a parameter λ, estimates that parameter from the data itself, and thereby determines the form of the transformation. The Box-Cox transformation can markedly improve the normality, symmetry and homogeneity of variance of the data, and it works well on many real data sets. It requires strictly positive input values and is defined as follows:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
$$


pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
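
As a rough check of how the parameter is used, the sketch below fits the Box-Cox PowerTransformer on the iris features (all strictly positive), reads the estimated per-column parameters from lambdas_, and applies the usual Box-Cox formula by hand. The names lam and X_manual are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # all values are > 0, as Box-Cox requires

pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
X_boxcox = pt.fit_transform(X)

# One lambda per column, estimated from the data by maximum likelihood
lam = pt.lambdas_

# Manual Box-Cox: (x**lam - 1) / lam, valid here because no fitted lambda is exactly 0
# (for lam == 0 the transformation is log(x) instead)
X_manual = (X ** lam - 1.0) / lam

print(np.round(lam, 3))
```

With standardize=True (the library default) the transformed columns would additionally be scaled to zero mean and unit variance.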


7. Normalization

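Normalization here means rescaling each sample (row) independently so that its norm, for example the L2 norm, equals 1. This differs from min-max scaling, which maps each feature to a fixed range such as [0, 1] and is sometimes also called normalization.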

X_normalized = preprocessing.normalize(X, norm='l2')
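
The sketch below illustrates that preprocessing.normalize works row by row: after the call, every sample of the iris data has an L2 norm of 1. The name row_norms is chosen here for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Scale every row (sample) to unit L2 norm
X_normalized = preprocessing.normalize(X, norm='l2')

# The Euclidean length of each row should now be 1
row_norms = np.linalg.norm(X_normalized, axis=1)
print(row_norms[:5])
```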


8. One-hot encoding

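# Assumed reconstruction of a garbled line: fit the encoder on the labels y as a single column
enc = preprocessing.OneHotEncoder(categories='auto').fit(y.reshape(-1, 1))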
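
One-hot encoding turns a categorical column into one binary column per category. As a minimal sketch, assuming the iris class labels are the column to encode (the original line does not make this certain), it could look like this:

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

_, y = load_iris(return_X_y=True)  # class labels 0, 1, 2

enc = preprocessing.OneHotEncoder(categories='auto')

# The encoder expects a 2-D input, so the labels are reshaped into one column;
# the result is a sparse matrix, converted to a dense array here for display
y_onehot = enc.fit_transform(y.reshape(-1, 1)).toarray()

print(enc.categories_)
print(y_onehot[:3])
```

Each row contains exactly one 1, in the column that corresponds to its class.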


9. Binarizer: binarization

binarizer = preprocessing.Binarizer(threshold=1.1)
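
Binarizer maps values strictly greater than the threshold to 1 and all other values to 0. The sketch below applies the same threshold of 1.1 to a small made-up array (X_demo is hypothetical data).

```python
import numpy as np
from sklearn import preprocessing

# Illustrative values around the threshold
X_demo = np.array([[1.0, 1.1, 1.2],
                   [0.5, 2.0, -1.0]])

binarizer = preprocessing.Binarizer(threshold=1.1)

# Binarizer is stateless, so fitting learns nothing; fit_transform is used for uniformity
X_binary = binarizer.fit_transform(X_demo)

print(X_binary)
```

Note that a value equal to the threshold (1.1 here) is mapped to 0.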


10. Polynomial transformation

poly = preprocessing.PolynomialFeatures(2)
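
PolynomialFeatures(2) expands the features into all polynomial terms up to degree 2, including a constant column and the interaction terms. For a single made-up sample (a, b) = (2, 3), the sketch below produces the terms 1, a, b, a², ab, b²; X_demo is hypothetical data.

```python
import numpy as np
from sklearn import preprocessing

# One sample with two features: a = 2, b = 3
X_demo = np.array([[2.0, 3.0]])

poly = preprocessing.PolynomialFeatures(2)
X_poly = poly.fit_transform(X_demo)

# Columns: 1, a, b, a^2, a*b, b^2
print(X_poly)
```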


11. Custom transformation

import numpy as np

# Wrap an arbitrary function (here log(1 + x)) as a transformer
transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
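
FunctionTransformer wraps any function so that it can be used like the other transformers, for example inside a pipeline. The sketch below applies log(1 + x) to a small made-up array and, as an addition not present in the original, passes np.expm1 as inverse_func so that the transformation can be undone.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative non-negative data
X_demo = np.array([[0.0, 1.0],
                   [2.0, 3.0]])

# inverse_func is optional; np.expm1 is the exact inverse of np.log1p
transformer = preprocessing.FunctionTransformer(
    np.log1p, inverse_func=np.expm1, validate=True
)
X_log = transformer.fit_transform(X_demo)

# Round trip back to the original values
X_back = transformer.inverse_transform(X_log)

print(X_log)
```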