Machine learning D2 — encoding: a summary of encoding methods for categorical variables in machine learning

Time: 2022-2-3

This article mainly introduces several common ways to encode categorical variables. The following is a basic classification:


What is categorical data

A feature is categorical when it takes values from a finite set of discrete categories. Variables are commonly divided into nominal, ordinal, and continuous; categorical data covers the nominal and ordinal kinds.

Why encoding

Most machine learning models only accept numerical input, so categorical features must be encoded before they can be fed into a model.

Common encoding types

1. Label Encoding

Example:

Map each category directly to an integer code, e.g. class_map = {'freshman': 1, 'sophomore': 2, 'junior': 3, 'senior': 4}

Application

• For discrete variables with a fixed set of categories that have a natural high-to-low ordering

code

• LabelEncoder (sklearn), factorize (pandas)
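A minimal sketch of both implementations; the school-year column is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"class": ["freshman", "senior", "sophomore", "freshman"]})

# sklearn: assigns integer codes in alphabetical order of the categories
le = LabelEncoder()
df["class_le"] = le.fit_transform(df["class"])

# pandas: assigns codes in order of first appearance
df["class_fz"], uniques = pd.factorize(df["class"])
print(df)
```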

advantage

• You can define the order of the codes yourself

disadvantage

• The magnitudes of the assigned values carry no inherent meaning (they are self-defined), so a model may read significance into them that does not exist

• When the mapping is defined manually to respect an order, some sources call this ordinal encoding instead

Next question

• If you want the codes to follow the ranking of the categories, what method can you use? Ordinal Encoding

2. Ordinal Encoding

• For discrete variables with a fixed set of categories and a ranking between them; compare with label encoding above

code

• map on a DataFrame column (pandas)

• from sklearn.preprocessing import OrdinalEncoder
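A minimal sketch of both routes, reusing the made-up school-year ordering from the label-encoding example above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"class": ["freshman", "junior", "senior", "sophomore"]})

# pandas: map with an explicit, self-defined order
class_map = {"freshman": 1, "sophomore": 2, "junior": 3, "senior": 4}
df["class_map"] = df["class"].map(class_map)

# sklearn: pass the desired order through `categories`
enc = OrdinalEncoder(categories=[["freshman", "sophomore", "junior", "senior"]])
df["class_ord"] = enc.fit_transform(df[["class"]]).ravel()  # 0.0 ... 3.0
print(df)
```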

3. One-hot Encoding

method

• For a discrete feature with k distinct values, create k binary columns; each sample gets a 1 in the column for its category and 0 elsewhere, e.g. (1, 0, 0)

Application

• When the values of a discrete feature carry no meaningful order

• Suitable for models that are sensitive to numeric magnitude, such as SVM and LR (logistic regression)

code

• Implementation method 1: get_dummies from pandas

• Implementation method 2: OneHotEncoder() from sklearn.preprocessing
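A minimal sketch of both implementation methods, on a made-up color column (the sparse_output argument assumes scikit-learn >= 1.2; older versions use sparse=False):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# method 1: pandas
dummies = pd.get_dummies(df["color"], prefix="color")

# method 2: sklearn; request a dense array for readability
ohe = OneHotEncoder(sparse_output=False)
onehot = ohe.fit_transform(df[["color"]])
print(dummies)
print(ohe.get_feature_names_out())
print(onehot)
```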

advantage

• No order is introduced artificially

• Makes feature distances more reasonable: projected into Euclidean space, every pair of categories is equidistant. With three categories, the vectors (1,0,0), (0,1,0), (0,0,1) sit like corners of a cube

• Features can be normalized: the binary columns can be treated as continuous features, so normalization can be applied

disadvantage

• The more categories, the more columns are generated; the result is very sparse, which can lead to overfitting, e.g. in tree-based models. When a feature has many categories, this method is not suitable

• Dummy variable trap

• If features are highly similar, it is easy to fall into the dummy variable trap: too many variables describe essentially the same information. Recognizing this up front usually requires industry knowledge or experience

4. Dummy Encoding

method

• Very similar to one-hot encoding, but one column is dropped: n-1 columns represent the n categories, and the dropped category is encoded as all zeros, e.g. (0,0,0,0). See the sketch below

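A minimal sketch via get_dummies with drop_first=True, on the same made-up color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first removes the first category after sorting ("blue" here),
# which is then represented by an all-zero row
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)
```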

5. Binary Encoding

method

• Can be understood as a variant of label encoding: each category's integer code is converted to a binary number, and each binary digit becomes its own column


code

• BinaryEncoder from category_encoders package
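A minimal sketch with the category_encoders package (installed separately, e.g. pip install category_encoders), on a made-up city column:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["tokyo", "paris", "lima", "oslo", "paris"]})

# each category's integer code is written out in binary,
# producing one output column per binary digit (city_0, city_1, ...)
enc = ce.BinaryEncoder(cols=["city"])
print(enc.fit_transform(df))
```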

6. Target Encoding (Mean Encoding)

method

• Two factors are involved: the feature itself and the target (often a 0/1 label). For each category, compute the mean of the target as sum(target) / count, and replace the category with that mean

• Regularization is usually required

example (see the sketch below)

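A minimal sketch of mean/target encoding with plain pandas; the city values and 0/1 targets are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["a", "a", "a", "b", "b", "c"],
    "target": [1, 0, 1, 0, 0, 1],
})

# per-category mean of the target: sum(target) / count
means = df.groupby("city")["target"].mean()
df["city_te"] = df["city"].map(means)
print(df)  # a -> 0.667, b -> 0.0, c -> 1.0
```

For the regularization mentioned above, one common option is category_encoders.TargetEncoder, which smooths each category mean toward the global mean.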

advantage

• Often used in Kaggle competitions, where it can give better results

• Reflects not only the feature's different categories but also each category's relationship with the target

• Does not increase the size of the data, so machine learning efficiency is unaffected

disadvantage

• Prone to overfitting and hard to validate, so cross-validation is required

characteristic

• The encoded value is directly derived from the target

7. Frequency Encoding

method

• Group the feature by category, compute each category's frequency, and replace the category value with that frequency

code

• pandas: groupby, then divide the group sizes by len(df)
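A minimal sketch of frequency encoding with pandas, on a made-up city column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b", "a", "c", "b"]})

# relative frequency of each category
freq = df.groupby("city").size() / len(df)
df["city_freq"] = df["city"].map(freq)
print(df)  # a -> 0.5, b -> 0.333..., c -> 0.166...
```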

advantage

• Adds frequency information about each category

Other encodings that haven't been introduced here:

8) Weight of Evidence Encoding

9) Probability Ratio Encoding

10) Hashing Encoding

11) Backward Difference Encoding

12) Leave One Out Encoding

13) James-Stein Encoding

14) M-estimator Encoding

15) Effect Encoding

16) BaseN Encoding

reference:

One hot Encoding

https://www.cnblogs.com/zongfa/p/9305657.html#:~:text=%E4%B8%BA%E4%BA%86%E8%A7%A3%E5%86%B3%E4%B8%8A%E8%BF%B0%E9%97%AE%E9%A2%98%EF%BC%8C%E5%85%B6%E4%B8%AD,%E5%85%B6%E4%B8%AD%E5%8F%AA%E6%9C%89%E4%B8%80%E4%BD%8D%E6%9C%89%E6%95%88%E3%80%82

Label encoding

https://zhuanlan.zhihu.com/p/42075740

other

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

https://www.datacamp.com/community/tutorials/encoding-methodologies

Here’s All you Need to Know About Encoding Categorical Data (with Python code)