# Feature Engineering in Machine Learning

Date: 2021-09-26

The focus of traditional programming is on code. In machine learning projects, the focus shifts to feature representation; that is, developers tune the model by adding and improving its features.

### Mapping raw data to features

In Figure 1, the left side represents raw data from an input data source, and the right side represents a feature vector, i.e., the set of floating-point values that make up the examples in the dataset. Feature engineering means converting raw data into feature vectors, and it typically consumes a significant share of the time spent on a machine learning project.

Many machine learning models must represent features as real-valued vectors, because feature values must be multiplied by the model weights.

### Mapping numeric values

Integers and floating-point numbers do not require special encoding, because they can be multiplied directly by a numeric weight. As shown in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:

### Mapping categorical values

A categorical feature has a discrete set of possible values. For example, there might be a feature named street_name with options that include:

``{ 'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue' }``

Because the model cannot multiply a string by a learned weight, we use feature engineering to convert strings to numeric values.

To achieve this, we can define a mapping from feature values (which we call the vocabulary of possible values) to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.

In this way, we can map street names to numbers as follows:

1. Map Charleston Road to 0
2. Map North Shoreline Boulevard to 1
3. Map Shorebird Way to 2
4. Map Rengstorff Avenue to 3
5. Map all other streets (OOV) to 4
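This mapping can be sketched in a few lines of Python (the dictionary and function names here are illustrative, not from any particular library):

```python
# Vocabulary: each known street name maps to a unique integer index.
STREET_VOCAB = {
    'Charleston Road': 0,
    'North Shoreline Boulevard': 1,
    'Shorebird Way': 2,
    'Rengstorff Avenue': 3,
}
OOV_INDEX = 4  # bucket for any street not in the vocabulary

def street_to_index(name):
    """Map a street name to its integer index, falling back to the OOV bucket."""
    return STREET_VOCAB.get(name, OOV_INDEX)

print(street_to_index('Shorebird Way'))  # -> 2
print(street_to_index('Main Street'))    # -> 4 (OOV bucket)
```

Any street missing from the vocabulary, such as the hypothetical "Main Street" above, lands in the OOV bucket rather than raising an error.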

However, if we incorporate these index numbers directly into the model, it will impose some limitations:

1. We will be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on, when using street_name as a feature to predict house prices. It is unlikely that house prices vary linearly with street names, and this scheme would also assume that the streets are ordered by average house price. Our model needs the flexibility to learn a separate weight for each street, which will be added to the price estimated using the other features.
2. We aren't accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there is no way to encode that information in the street_name value if the model contains a single index.

To remove both of these limitations, we can instead create a binary vector for each categorical feature in the model that represents the values as follows:

1. For values that apply to the sample, set the corresponding vector elements to 1.
2. Set all other elements to 0.

The length of the vector is equal to the number of elements in the vocabulary. When a single value is 1, this representation is called a one-hot encoding; when multiple values are 1, it is called a multi-hot encoding.

Figure 3 shows a one-hot encoding of the street Shorebird Way. In this binary vector, the element representing Shorebird Way has a value of 1, while the elements for all other streets have a value of 0.

#### Figure 3. Mapping street addresses with one-hot encoding

This approach effectively creates a Boolean variable for every feature value (e.g., street name). With this approach, if a house is on Shorebird Way, only the binary value for Shorebird Way is 1, so the model uses only the weight for Shorebird Way.

Similarly, if a house is at the corner of two streets, both binary values are set to 1, and the model uses both of their respective weights.
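A minimal sketch of one-hot and multi-hot encoding, continuing the five-element street vocabulary from above (the helper name is illustrative):

```python
VOCAB_SIZE = 5  # four known streets plus the OOV bucket

def multi_hot(indices, size=VOCAB_SIZE):
    """Build a binary vector with a 1.0 at each given index and 0.0 elsewhere."""
    vec = [0.0] * size
    for i in indices:
        vec[i] = 1.0
    return vec

# One-hot: a house on Shorebird Way (index 2).
print(multi_hot([2]))     # [0.0, 0.0, 1.0, 0.0, 0.0]

# Multi-hot: a corner house on Charleston Road (0) and Rengstorff Avenue (3).
print(multi_hot([0, 3]))  # [1.0, 0.0, 0.0, 1.0, 0.0]
```

Passing a single index produces a one-hot vector; passing several produces the multi-hot vector used for the corner-house case.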

Note: One-hot encoding also extends to numeric data that you do not want to multiply directly by a weight, such as a postal code.

### Sparse representation

Suppose the dataset contains 1 million different street names that you want to include as values for street_name. Explicitly creating a binary vector of 1 million elements, of which only one or two are non-zero, is a very inefficient representation: it occupies a lot of storage and takes a long time to process. In this situation, a common approach is to use a sparse representation, in which only the non-zero values are stored. An independent model weight is still learned for each feature value, as described above.
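A sparse representation can be as simple as a dictionary of non-zero entries. The sketch below is an illustrative plain-Python version (production systems would typically use a library such as scipy.sparse instead):

```python
def to_sparse(dense):
    """Store only {index: value} pairs for the non-zero entries of a dense vector."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

# A one-hot vector over a vocabulary of 1 million streets.
dense = [0.0] * 1_000_000
dense[2] = 1.0  # e.g. the index for Shorebird Way in the earlier mapping
sparse = to_sparse(dense)

print(sparse)       # {2: 1.0}
print(len(sparse))  # 1 stored entry instead of 1,000,000
```

Only the single non-zero entry is stored, so memory use scales with the number of active features rather than the vocabulary size.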