# Python data analysis: common data preprocessing methods

Time: 2021-02-24

The following article comes from the Data Theory account, author: wpc7113

Introduction to Python data analysis

``https://www.bilibili.com/video/BV18f4y1i7q9/``

1. Standardization: remove the mean and scale to unit variance

Standardization adjusts the distribution of the feature data toward a standard normal distribution (also called a Gaussian distribution), i.e., a distribution with mean 0 and variance 1.

Standardization matters because a feature with a very large variance can dominate the objective function, preventing the estimator from learning the other features correctly.

Standardization consists of two steps: centering (subtracting the mean, so the mean becomes 0) and scaling (dividing by the standard deviation, so the variance becomes 1).

```python
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the iris data used throughout this article
iris = load_iris()
X, y = iris.data, iris.target

# Standard (z-score) transformation
scaler = preprocessing.StandardScaler().fit(X)
x_scaler = scaler.transform(X)
```
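As a quick sanity check of the "mean 0, variance 1" claim, the sketch below applies `StandardScaler` to a small toy matrix (the matrix is an illustrative stand-in, not part of the original article):

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix standing in for data such as the iris X
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Each column now has mean ~0 and standard deviation ~1
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```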

2. Min-max normalization

Min-max normalization linearly maps the original data into the [0, 1] interval (any other interval with fixed minimum and maximum values can also be used).

```python
min_max_scaler = preprocessing.MinMaxScaler()
x_train_minmax = min_max_scaler.fit_transform(X)
```
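The linear mapping is (x - min) / (max - min) per feature; the sketch below checks this against `MinMaxScaler` on a toy matrix (an illustrative example, not from the article):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

mms = preprocessing.MinMaxScaler()  # default feature_range=(0, 1)
X_mm = mms.fit_transform(X)

# Equivalent computation by hand: (x - min) / (max - min), per column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(X_mm, X_manual))  # True
```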

3. MaxAbsScaler

```python
max_abs_scaler = preprocessing.MaxAbsScaler()
x_train_maxabs = max_abs_scaler.fit_transform(X)
```
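MaxAbsScaler divides each feature by its maximum absolute value, mapping the data into [-1, 1] without shifting or centering it (so sparsity is preserved). A minimal sketch on a toy matrix (assumed data, not from the article):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[-4.0, 2.0], [2.0, -1.0], [1.0, 0.5]])

mas = preprocessing.MaxAbsScaler()
X_ma = mas.fit_transform(X)

# Each column is divided by its maximum absolute value,
# so all results lie in [-1, 1]
print(np.allclose(X_ma, X / np.abs(X).max(axis=0)))  # True
```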

4. RobustScaler: scaling data that contains outliers

```python
transformer = preprocessing.RobustScaler().fit(X)
x_robust_scaler = transformer.transform(X)
```
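RobustScaler centers on the median and scales by the interquartile range (IQR), statistics that are not distorted by extreme values; a minimal sketch on made-up data with one outlier:

```python
import numpy as np
from sklearn import preprocessing

# One extreme outlier in the single column
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

rs = preprocessing.RobustScaler()  # centers on the median, scales by IQR
X_rs = rs.fit_transform(X)

# The median row maps to 0 regardless of the outlier:
# median = 3, IQR = 4 - 2 = 2, so x -> (x - 3) / 2
print(X_rs[2, 0])  # 0.0
```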

5. QuantileTransformer: quantile transformation

```python
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X)
```
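By default, QuantileTransformer maps each feature onto a uniform distribution on [0, 1] via its empirical quantiles; a sketch on skewed synthetic data (the exponential sample is an assumption for illustration):

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X = rng.exponential(size=(1000, 1))  # heavily right-skewed input

qt = preprocessing.QuantileTransformer(n_quantiles=100, random_state=0)
X_qt = qt.fit_transform(X)

# Default output_distribution='uniform': values land in [0, 1]
print(X_qt.min() >= 0.0 and X_qt.max() <= 1.0)  # True
```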

6. Box-Cox

The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964, commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the correlation between the unobservable error and the predictor variables can be reduced to some extent. Its main feature is that it introduces a parameter, estimates that parameter from the data itself, and uses it to determine the form of the transformation. Box-Cox can noticeably improve the normality, symmetry, and homogeneity of variance of the data, and is effective for many practical data sets. In scikit-learn it is applied as follows:

```python
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
pt.fit_transform(X)
```
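One caveat: `method='box-cox'` requires strictly positive input (for data with zeros or negative values, `method='yeo-johnson'` can be used instead). The sketch below, on synthetic positive skewed data, also shows that with `standardize=True` the output is additionally zero-mean and unit-variance:

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X = rng.lognormal(size=(200, 1))  # strictly positive, right-skewed

pt = preprocessing.PowerTransformer(method='box-cox', standardize=True)
X_bc = pt.fit_transform(X)

# standardize=True additionally rescales to mean 0 and variance 1
print(np.allclose(X_bc.mean(), 0.0, atol=1e-6))  # True
print(np.allclose(X_bc.std(), 1.0, atol=1e-6))   # True
```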

7. Normalization

Normalization here rescales each individual sample (row) so that it has unit norm (L2 by default). This is different from min-max scaling, which maps feature values into a fixed interval such as [0, 1].

```python
X_normalized = preprocessing.normalize(X, norm='l2')
```
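A minimal sketch confirming that `normalize` gives each row unit L2 norm (the matrix is an illustrative assumption):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[3.0, 4.0], [1.0, 1.0]])

X_normalized = preprocessing.normalize(X, norm='l2')

# Each row now has Euclidean (L2) norm 1, e.g. [3, 4] -> [0.6, 0.8]
print(np.linalg.norm(X_normalized, axis=1))  # [1. 1.]
```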

8. One-hot encoding

```python
enc = preprocessing.OneHotEncoder(categories='auto')
enc.fit(y.reshape(-1, 1))
y_one_hot = enc.transform(y.reshape(-1, 1))
y_one_hot.toarray()
```
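A small end-to-end example on a made-up label vector: three distinct labels produce three indicator columns, with exactly one 1 per row.

```python
import numpy as np
from sklearn import preprocessing

y = np.array([0, 1, 2, 1])

enc = preprocessing.OneHotEncoder(categories='auto')
y_one_hot = enc.fit_transform(y.reshape(-1, 1))

# Rows map to [1,0,0], [0,1,0], [0,0,1], [0,1,0]
print(y_one_hot.toarray())
```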

9. Binarizer: binarization

```python
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.fit(X)
binarizer.transform(X)
```
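With Binarizer, values strictly greater than `threshold` become 1 and everything else becomes 0 (a value exactly equal to the threshold maps to 0); a sketch on a toy matrix:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[0.5, 1.2], [1.1, 2.0]])

binarizer = preprocessing.Binarizer(threshold=1.1)
X_bin = binarizer.transform(X)

# 0.5 and 1.1 (equal to the threshold) become 0; 1.2 and 2.0 become 1
print(X_bin)  # [[0. 1.] [0. 1.]]
```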

10. Polynomial transformation

```python
poly = preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
```
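For a single sample (a, b), a degree-2 expansion yields [1, a, b, a², ab, b²]; a minimal sketch with assumed toy values:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[2.0, 3.0]])

poly = preprocessing.PolynomialFeatures(2)
X_poly = poly.fit_transform(X)

# [1, a, b, a^2, a*b, b^2] for a=2, b=3
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```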

11. Custom transformation

```python
import numpy as np

transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
transformer.fit(X)
log1p_x = transformer.transform(X)
```
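Since `np.log1p(x)` computes log(1 + x), two easy-to-verify points are log1p(0) = 0 and log1p(e - 1) = 1; a sketch on assumed toy values:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[0.0, 1.0], [np.e - 1.0, 3.0]])

transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
X_log = transformer.fit_transform(X)

# log1p(0) = 0 and log1p(e - 1) = log(e) = 1
print(X_log[0, 0], X_log[1, 0])  # 0.0 1.0
```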
