Detailed implementation of one-hot encoding in sklearn

Time:2019-11-18

One-hot encoding is an essential step in feature processing. This is how we apply it in projects:

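# sklearn usage
from sklearn.preprocessing import OneHotEncoder

# sparse=False returns a dense array
# (scikit-learn 1.2 and later rename this argument to sparse_output)
enc = OneHotEncoder(sparse=False)
ans = enc.fit_transform([[0, 0, 3],
                         [1, 1, 0],
                         [0, 2, 1],
                         [1, 0, 2]])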

For an analysis of the principle, see: link
The core logic of one-hot encoding in sklearn is in the _fit_transform method:

import numpy as np
from scipy import sparse

def _fit_transform(self, X):
    # Get the number of rows and columns of the input
    n_samples, n_features = X.shape
    # Maximum value of each column plus 1, i.e. the number of categories per feature
    n_values = np.max(X, axis=0) + 1
    # Cumulative sum with a leading 0: the starting output column of each feature,
    # used later to build the sparse matrix
    indices = np.cumsum(np.hstack([[0], n_values]))
    # Column index of every nonzero entry of the sparse matrix
    column_indices = (X + indices[:-1]).ravel()
    # Row index of every nonzero entry: [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
    row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                            n_features)
    # After one-hot encoding every entry is 0 or 1, so start with an array of all ones
    data = np.ones(n_samples * n_features)
    # Assemble the matrix in coordinate (COO) format, then convert it to CSR
    out = sparse.coo_matrix((data, (row_indices, column_indices)),
                            shape=(n_samples, indices[-1]),
                            dtype=self.dtype).tocsr()
    # Return the CSR matrix (this excerpt is simplified)
    return out
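
To make the index arithmetic concrete, here is a minimal sketch that recomputes the intermediate arrays for the example input from the usage section. It is an illustration, not sklearn's own code. It assumes the cumulative sum is prefixed with a 0, so that indices[:-1] holds the starting output column of each feature and indices[-1] is the total width.

```python
import numpy as np

X = np.array([[0, 0, 3],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]])
n_samples, n_features = X.shape

# Number of categories per column, assuming values run from 0 to the column maximum
n_values = np.max(X, axis=0) + 1
# Starting output column of each feature, followed by the total output width
indices = np.cumsum(np.hstack([[0], n_values]))
# Shift every value by its feature's starting column, then flatten row by row
column_indices = (X + indices[:-1]).ravel()
# Each sample's row number, repeated once per feature
row_indices = np.repeat(np.arange(n_samples), n_features)

print(n_values.tolist())        # [2, 3, 4]
print(indices.tolist())         # [0, 2, 5, 9]
print(column_indices.tolist())  # [0, 2, 8, 1, 3, 5, 0, 4, 6, 1, 2, 7]
print(row_indices.tolist())     # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Reading the two index arrays in pairs gives the position of every 1 in the output: sample 0 has ones in columns 0, 2 and 8, sample 1 in columns 1, 3 and 5, and so on.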

Take the first sample, [0, 0, 3], as an example.
First, look at the first feature, that is, the first column [0,1,0,1]. It takes two values, 0 or 1, so one-hot uses two bits to represent it: [1,0] represents 0 and [0,1] represents 1. The first two bits [1,0…] of the output for this sample mean that the feature is 0.
The second feature, the second column [0,1,2,0], takes three values, so one-hot uses three bits to represent it: [1,0,0] represents 0, [0,1,0] represents 1, and [0,0,1] represents 2. The third to fifth bits [… 1,0,0…] of the output for this sample mean that the feature is 0.
The third feature, the third column [3,0,1,2], takes four values, so one-hot uses four bits to represent it: [1,0,0,0] represents 0, [0,1,0,0] represents 1, [0,0,1,0] represents 2, and [0,0,0,1] represents 3. The last four bits [… 0,0,0,1] of the output for this sample mean that the feature is 3.
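
The per-feature bit groups described above can be sketched in plain Python. The helper one_hot_row is a hypothetical name used only for illustration; it is not part of sklearn, and it assumes every feature takes integer values from 0 up to its category count minus 1.

```python
def one_hot_row(row, n_values):
    """Encode one sample; n_values[i] is the number of categories of feature i."""
    bits = []
    for value, width in zip(row, n_values):
        group = [0] * width   # one bit per possible value of this feature
        group[value] = 1      # switch on the bit for the observed value
        bits.extend(group)    # append this feature's group to the output
    return bits

# Category counts of the three columns in the example: 2, 3 and 4
n_values = [2, 3, 4]
print(one_hot_row([0, 0, 3], n_values))  # [1, 0, 1, 0, 0, 0, 0, 0, 1]
```

The output has 2 + 3 + 4 = 9 bits: the first two belong to the first feature, the next three to the second, and the last four to the third.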

How this is implemented in the _fit_transform method is shown in the following figure:

[Figure: detailed implementation of one-hot encoding in sklearn]

After that, the full matrix is rebuilt from the coordinate (COO) sparse storage format: each (row, column) pair marks the position of a 1, and every other entry is 0.
At this point, the one-hot encoding is complete.
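
As a rough illustration of that last step, the sketch below expands COO triplets into a dense matrix in plain Python. The helper coo_to_dense is a hypothetical stand-in for what scipy does internally, and the index lists are the ones the method computes for the four-sample example input.

```python
def coo_to_dense(data, row_indices, column_indices, shape):
    """Expand (value, row, column) triplets into a dense list of lists."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for value, r, c in zip(data, row_indices, column_indices):
        dense[r][c] += value  # duplicate coordinates would be summed
    return dense

row_indices = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
column_indices = [0, 2, 8, 1, 3, 5, 0, 4, 6, 1, 2, 7]
data = [1] * len(row_indices)  # every stored entry is a 1

dense = coo_to_dense(data, row_indices, column_indices, shape=(4, 9))
for row in dense:
    print(row)
# [1, 0, 1, 0, 0, 0, 0, 0, 1]
# [0, 1, 0, 1, 0, 1, 0, 0, 0]
# [1, 0, 0, 0, 1, 0, 1, 0, 0]
# [0, 1, 1, 0, 0, 0, 0, 1, 0]
```

Storing only the twelve coordinates instead of all 36 cells is what makes the sparse format worthwhile when features have many categories.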