🙊 Feature engineering is the most important and time-consuming part of machine learning modeling, and it covers a lot of ground. Experienced practitioners know the road well, but beginners tend to pick up the knowledge in scattered bits and pieces, never systematically. While reading the highly rated feature engineering book

*Introduction and Practice of Feature Engineering*, I took these notes to record the knowledge points systematically, together with runnable, reproducible code. I hope they are helpful to fellow practitioners.

### Directory

- Feature understanding
- Feature enhancement
- Feature construction
- ✅ Feature selection
- 💫 Feature transformation
- Feature learning

You can see the mind map first:

### 🔍 01 feature understanding

When we get the data, the first step we need to do is to understand it. Generally, we can start from the following perspectives:

(Note: two datasets are used in this section: Salary Ranges by Job Classification and GlobalLandTemperaturesByCity.)

#### 1. Distinguish structured data from unstructured data

Data stored in tabular form is structured data, while unstructured data has no fixed table structure, for example free text, messages, and logs.

#### 2. Distinguish between quantitative and qualitative data

Quantitative data: refers to some numerical values used to measure the quantity of something;

Qualitative data: refers to categories used to describe the nature of something.

In fact, quantitative and qualitative data can be further divided into four levels: **nominal, ordinal, interval, and ratio**. Let's go through an example of each to make them stick.

**1) nominal**

That is, pure categories, such as blood type (A / B / O / AB), gender (male / female), or currency (RMB / USD / JPY). Note that there is no order among these categories; generally we can only look at the distribution of proportions, which can be shown with a bar chart or a pie chart.

**2) ordinal**

Compared with nominal data, ordinal data has one extra "sortable" property: although the values are still categories, there is an order among them. For example: final grades (A, B, C, D, E, F) or questionnaire answers (very satisfied, satisfied, average, dissatisfied). For visualization it supports the same charts as nominal data, plus the **box plot** (because an ordinal variable has a well-defined median).
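As a small illustration (with hypothetical survey data), an ordered pandas `Categorical` makes the "sortable" property concrete: the median is well defined for ordinal values, which is exactly what the box plot relies on:

```python
import pandas as pd

# Hypothetical survey answers (an ordinal variable)
answers = pd.Series(['satisfied', 'very satisfied', 'average', 'satisfied', 'dissatisfied'])
order = ['dissatisfied', 'average', 'satisfied', 'very satisfied']
cat = pd.Categorical(answers, categories=order, ordered=True)

# Ordinal data supports a median (via the ordered codes), unlike nominal data
median_code = int(pd.Series(cat.codes).median())
print(order[median_code])  # 'satisfied'
```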

**3) interval**

At the interval level, addition and subtraction between values are meaningful, so quantities such as the mean and variance can be introduced. More charts become available, including all of the previous ones as well as the histogram.

**4) ratio**

The ratio level is stricter than the interval level: it has all the properties of interval data plus an **absolute zero point**, so statements like "one product's price is twice another's" make sense. Note that temperature (in Celsius or Fahrenheit) generally belongs to the interval level rather than the ratio level: we cannot say that 20 degrees is twice as warm as 10 degrees.

Finally, a summary of the four data levels:

| Level | Properties | Example | Typical statistics / charts |
| --- | --- | --- | --- |
| Nominal | categories, no order | blood type, gender | proportions; bar chart, pie chart |
| Ordinal | ordered categories | grades, survey answers | median; box plot |
| Interval | differences are meaningful | temperature (°C) | mean, variance; histogram |
| Ratio | absolute zero, ratios meaningful | price, weight | all of the above |

#### 3. Key code collection

The following code shows only the core fragments; the full code can be obtained by sending the keyword **Feature Engineering** to the official account (SAMshare).

**1) common simple plots**

```python
# Bar chart
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(10).plot(kind='bar')
# Pie chart
salary_ranges['Grade'].value_counts().sort_values(ascending=False).head(5).plot(kind='pie')
# Box plot
salary_ranges['Union Code'].value_counts().sort_values(ascending=False).head(5).plot(kind='box')
# Histogram
climate['AverageTemperature'].hist()
# Histogram of the average temperature for each century
climate_sub_china['AverageTemperature'].hist(by=climate_sub_china['Century'],
                                             sharex=True,
                                             sharey=True,
                                             figsize=(10, 10),
                                             bins=20)
# Scatter plot
x = climate_sub_china['year']
y = climate_sub_china['AverageTemperature']
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y)
plt.show()
```

**2) check for missing**

```python
# Remove rows with missing values
climate.dropna(axis=0, inplace=True)
# Check the number of missing values
climate.isnull().sum()
```

**3) variable type conversion**

```python
# Date conversion: convert dt to datetime and extract the year (note the use of map)
climate['dt'] = pd.to_datetime(climate['dt'])
climate['year'] = climate['dt'].map(lambda value: value.year)
# China only
climate_sub_china = climate.loc[climate['Country'] == 'China']
climate_sub_china['Century'] = climate_sub_china['year'].map(lambda x: int(x / 100 + 1))
climate_sub_china.head()
```

### 🔋 02 feature enhancement

This step is essentially data cleaning. Although the previous step already involved some cleaning (removing null values, date conversion, etc.), it was scattered. This section focuses on data-cleaning techniques and practical code you can use in real projects.

#### Step 1: carry out EDA (exploratory data analysis) with the following ideas:

(1) First look at the target distribution (for a binary classification problem, the proportions of 0s and 1s); `value_counts()` solves this directly and shows whether the sample is imbalanced.

(2) Then check for null values with `isnull().sum()`. Note, however, that the statistics may show no missing values not because nothing is missing, but because the missing values have been **filled with a special value**, commonly -9, a blank, "unknown", or 0. We need to ⚠ identify these first, and then give the missing values a **reasonable filling**.

(2.1) How do we identify such disguised missing values? Generally, `data.describe()` gives the basic descriptive statistics, and we can judge from the mean, standard deviation, minimum and maximum, combined with the meaning of each variable.
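As a sketch of this idea (with a hypothetical blood-pressure column), `describe()` exposes an impossible minimum, which flags a sentinel value that should be restored to missing:

```python
import pandas as pd
import numpy as np

# Hypothetical medical data where missing values were filled with 0
df = pd.DataFrame({'diastolic_blood_pressure': [72, 0, 64, 80, 0, 76]})

# describe() exposes the suspicious minimum: a blood pressure of 0 is impossible
print(df['diastolic_blood_pressure'].describe()['min'])  # 0.0

# Restore the sentinel 0s to real missing values before imputing
df['diastolic_blood_pressure'] = df['diastolic_blood_pressure'].replace(0, np.nan)
print(df['diastolic_blood_pressure'].isnull().sum())  # 2
```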

(3) Then look at how feature values are distributed across the different classes. You can observe this by plotting histograms (numeric variables) or computing the proportion of each value (categorical variables).

(4) Observe the correlation between variables by plotting a **correlation-matrix heatmap** to get an overall picture.

#### Step 2: deal with data missing

There are many ways to handle missing data, but the two most commonly used, as the author says, are filling and deleting.

Before handling the missing data, we first restore the manually filled missing values we identified above:

```python
# The missing values were wrongly filled with 0; revert them to empty (handled separately)
pima['serum_insulin'] = pima['serum_insulin'].map(lambda x: x if x != 0 else None)
# Check the number of missing values for this variable
pima['serum_insulin'].isnull().sum()
# Batch operation to restore missing values
columns = ['serum_insulin', 'bmi', 'plasma_glucose_concentration',
           'diastolic_blood_pressure', 'triceps_thickness']
for col in columns:
    pima[col].replace([0], [None], inplace=True)
# Check the number of missing values
pima.isnull().sum()
```

**1) delete rows with missing values**

This one is relatively simple: just use `dropna()`. At the same time, we can check what fraction of the data was deleted: `(data.shape[0] - data_dropped.shape[0]) / float(data.shape[0])`.

After deleting, we still need to look at the data distribution and compare the target proportion and feature distributions with the original to see **whether there are obvious differences**; if so, deletion is not recommended.
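A minimal sketch of that check, on a hypothetical frame:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with missing cells in two of four rows
data = pd.DataFrame({'a': [1, np.nan, 3, 4], 'b': [1, 2, np.nan, 4]})
data_dropped = data.dropna()

# Fraction of rows lost to dropna() -- if it is large, prefer imputation
dropped_ratio = (data.shape[0] - data_dropped.shape[0]) / float(data.shape[0])
print(round(dropped_ratio, 2))  # 0.5
```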

**2) reasonable filling of missing value**

Filling options include mean filling, -9 filling, and median filling. It is simpler here: we can usually use sklearn's **Pipeline and Imputer** to implement it. Here is a short, complete demo:

```python
# Fill missing values with sklearn's Pipeline and imputer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer in older versions
# Candidate parameters
knn_params = {'classify__n_neighbors': [1, 2, 3, 4, 5, 6]}
# Instantiate the KNN model
knn = KNeighborsClassifier()
# Pipeline design
mean_impute = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                        ('classify', knn)])
X = pima.drop('onset_diabetes', axis=1)  # drop the target column
y = pima['onset_diabetes']
# Grid search
grid = GridSearchCV(mean_impute, knn_params)
grid.fit(X, y)
# Print model performance
print(grid.best_score_, grid.best_params_)
# 0.73177
```

#### Step 3: standardization and normalization

After the above processing, the accuracy of the model can reach 0.73177, but can we continue to optimize it? That’s for sure.

Let’s first look at the distribution of all features (this can be seen when there are few features):

`pima_imputed_mean.hist(figsize=(15,15))`

From the figure we can see a problem: the **scales** of the features are all different, which is a "fatal" issue for a distance-based model such as KNN, so we need to standardize or normalize the data.

We focus on three approaches:

**1) Z-score standardization**

The most commonly used standardization technology makes use of the Z-score idea in statistics, which is to convert the data into a distribution with a mean value of 0 and a standard deviation of 1. Its calling method in Python is as follows:

```python
# Z-score standardization (single feature)
from sklearn.preprocessing import StandardScaler
# Instantiate the method
scaler = StandardScaler()
glucose_z_score_standarScaler = scaler.fit_transform(pima[['plasma_glucose_concentration']].fillna(-9))
# Check that the mean and standard deviation after conversion are 0 and 1
glucose_z_score_standarScaler.mean(), glucose_z_score_standarScaler.std()

# Z-score standardization (all features)
scaler = StandardScaler()
pima_imputed_mean_scaled = pd.DataFrame(scaler.fit_transform(pima_imputed_mean), columns=pima_columns)
# Look at the distribution after standardization
pima_imputed_mean_scaled.hist(figsize=(15, 15), sharex=True)

# Use in a pipeline
model = Pipeline([
    ('imputer', SimpleImputer()),  # sklearn.preprocessing.Imputer in older versions
    ('standardize', StandardScaler())
])
```

**2) min max standardization**

Min-max standardization is similar to the Z-score; its formula is: (x − x_min) / (x_max − x_min)

Call method in Python:

```python
# Min-max standardization
from sklearn.preprocessing import MinMaxScaler
# Instantiate the method
min_max = MinMaxScaler()
# Apply min-max standardization
pima_min_maxed = pd.DataFrame(min_max.fit_transform(pima.fillna(-9)), columns=pima_columns)
```

**3) row normalization**

Row normalization operates on each row of the data, unlike the two methods above (which operate on columns). Its purpose is to ensure that every row has the same vector length (i.e. unit norm); both the L1 and L2 norms are available.

Call method in Python:

```python
# Row normalization
from sklearn.preprocessing import Normalizer
# Instantiate the method
normalize = Normalizer()
# Apply row normalization
pima_normalized = pd.DataFrame(normalize.fit_transform(pima.fillna(-9)), columns=pima_columns)
# Check the average row norm of the matrix (should be 1)
np.sqrt((pima_normalized**2).sum(axis=1)).mean()
```
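A small sketch of what unit norm means in practice (with made-up rows):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical rows with very different scales
X = np.array([[3.0, 4.0], [30.0, 40.0]])

# L2 row normalization rescales each ROW to unit length,
# so both rows map to the same direction vector
X_l2 = Normalizer(norm='l2').fit_transform(X)
print(X_l2)                          # both rows become [0.6, 0.8]
print(np.linalg.norm(X_l2, axis=1))  # every row norm is 1
```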

### 🔨 03 feature construction

If the effect is still not ideal after processing the existing variables, we need to construct features, that is, derive new variables.

Before that, we need to understand our dataset. From the previous two sections we know that `data.info()` and `data.describe()` give an overview, which we combine with the **data levels** (nominal, ordinal, interval, ratio) to understand each variable.

#### 🙊 basic operation

In this section we use a custom dataset.

```python
# Dataset used in this section
import pandas as pd
X = pd.DataFrame({'city': ['tokyo', None, 'london', 'seattle', 'san francisco', 'tokyo'],
                  'boolean': ['y', 'n', None, 'n', 'n', 'y'],
                  'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'],
                  'quantitative_column': [1, 11, -.5, 10, None, 20]})
X
```

First, we need to fill in the categorical variables. Categorical variables are generally filled with the mode or a special value. Following the earlier sections, we still use the pipeline approach, so we can encapsulate the filling logic in a class based on `TransformerMixin` and then call it directly in a pipeline. The code:

```python
# Fill categorical variables (custom imputer based on TransformerMixin, filled with the mode)
from sklearn.base import TransformerMixin

class CustomCategoryzImputer(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, df):
        X = df.copy()
        for col in self.cols:
            X[col].fillna(X[col].value_counts().index[0], inplace=True)
        return X

    def fit(self, *_):
        return self

# Call the custom imputer
cci = CustomCategoryzImputer(cols=['city', 'boolean'])
cci.fit_transform(X)
```

Alternatively, use scikit-learn's `Imputer` class (`SimpleImputer` in newer versions), whose `strategy` parameter naturally supports 'mean', 'median', and 'most_frequent'.
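A minimal sketch of the `strategy` parameter (shown here with `SimpleImputer`, the class's name in newer scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer in older versions

X_demo = np.array([[1.0], [np.nan], [3.0]])

# strategy can be 'mean', 'median', or 'most_frequent'
imp = SimpleImputer(strategy='mean')
print(imp.fit_transform(X_demo).ravel())  # the NaN becomes the mean, 2.0
```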

```python
# Fill quantitative variables (custom imputer based on Imputer, filled with the mean)
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer in older versions

class CustomQuantitativeImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def transform(self, df):
        X = df.copy()
        impute = SimpleImputer(strategy=self.strategy)
        for col in self.cols:
            X[col] = impute.fit_transform(X[[col]])
        return X

    def fit(self, *_):
        return self

# Call the custom imputer
cqi = CustomQuantitativeImputer(cols=['quantitative_column'], strategy='mean')
cqi.fit_transform(X)
```

Pipeline the above two kinds of filling:

```python
# Fill everything
from sklearn.pipeline import Pipeline
imputer = Pipeline([('quant', cqi),
                    ('category', cci)])
imputer.fit_transform(X)
```

After filling the categorical variables, we need to encode them (because most machine-learning algorithms cannot operate on categorical values directly). There are generally two methods: **one-hot encoding and label encoding**.

**1) one-hot encoding**

One-hot encoding is mainly for nominal variables, i.e. those without an order among their values. We could generally use scikit-learn's `OneHotEncoder`, but here we again use a custom class to deepen understanding.

```python
# Encode categorical variables (one-hot encoding)
class CustomDummifier(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols

    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)

    def fit(self, *_):
        return self

# Call the custom encoder
cd = CustomDummifier(cols=['boolean', 'city'])
cd.fit_transform(X)
```

**2) label encoding**

Label encoding is for ordinal variables, i.e. categorical variables with an order, like the values of `ordinal_column` in our case (dislike, somewhat like, and like can be represented by 0, 1, and 2 respectively). Likewise, we can write a custom label encoder:

```python
# Encode categorical variables (label encoding)
class CustomEncoder(TransformerMixin):
    def __init__(self, col, ordering=None):
        self.ordering = ordering
        self.col = col

    def transform(self, df):
        X = df.copy()
        X[self.col] = X[self.col].map(lambda x: self.ordering.index(x))
        return X

    def fit(self, *_):
        return self

# Call the custom encoder
ce = CustomEncoder(col='ordinal_column', ordering=['dislike', 'somewhat like', 'like'])
ce.fit_transform(X)
```

**3) binning numeric variables**

The above covered some simple, commonly used operations on categorical variables. Next we explain some simple processing methods for numeric variables.

Sometimes, even though a variable is continuous, it is only interpretable after being converted into categories. For example, age often needs to be divided into age bands; here we can use pandas' `cut` function.
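A quick sketch of `pd.cut` itself, on hypothetical ages and bin edges:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 48, 67])

# pd.cut turns a continuous variable into ordered age bands;
# labels=False returns the bin index instead of an interval label
bands = pd.cut(ages, bins=[0, 18, 40, 65, 100], labels=False)
print(bands.tolist())  # [0, 0, 1, 2, 3]
```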

```python
# Numeric variable processing -- the cut function
class CustomCutter(TransformerMixin):
    def __init__(self, col, bins, labels=False):
        self.labels = labels
        self.bins = bins
        self.col = col

    def transform(self, df):
        X = df.copy()
        X[self.col] = pd.cut(X[self.col], bins=self.bins, labels=self.labels)
        return X

    def fit(self, *_):
        return self

# Call the custom binner
cc = CustomCutter(col='quantitative_column', bins=3)
cc.fit_transform(X)
```

To sum up, we can call the custom methods above in a single pipeline, in this order:

**1) impute missing values (imputer)**

**2) one-hot encode city and boolean (cd)**

**3) label-encode ordinal_column (ce)**

**4) bin quantitative_column (cc)**

The code is:

```python
from sklearn.pipeline import Pipeline
# Assemble the pipeline
pipe = Pipeline([('imputer', imputer),
                 ('dummify', cd),
                 ('encode', ce),
                 ('cut', cc)])
# Fit the pipeline
pipe.fit(X)
# Transform with the pipeline
pipe.transform(X)
```

#### 🙊 numerical variable extension

In this section, we use a new data set (human chest acceleration data set). First, we import the data:

```python
# Human chest acceleration dataset; the label `activity` takes values 1-7
'''
1 - working at a computer
2 - standing up, walking and going up/down stairs
3 - standing
4 - walking
5 - going up/down stairs
6 - walking and talking with someone
7 - talking while standing
'''
df = pd.read_csv('./data/activity_recognizer/1.csv', header=None)
df.columns = ['index', 'x', 'y', 'z', 'activity']
df.head()
```

Here we only introduce one way of deriving new features: polynomial features, implemented with `PolynomialFeatures`.

```python
# Extend numeric features
from sklearn.preprocessing import PolynomialFeatures
x = df[['x', 'y', 'z']]
y = df['activity']
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
x_poly = poly.fit_transform(x)
pd.DataFrame(x_poly, columns=poly.get_feature_names_out()).head()  # get_feature_names() in older sklearn
```
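To make the derived columns concrete, here is a tiny sketch of what a degree-2 expansion produces for a single made-up row:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# degree=2 on two features a, b yields: a, b, a^2, a*b, b^2
poly2 = PolynomialFeatures(degree=2, include_bias=False)
out = poly2.fit_transform(np.array([[2.0, 3.0]]))
print(out)  # [[2. 3. 4. 6. 9.]]
```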

You can also view the correlation after deriving new variables. The darker the color, the greater the correlation:

```python
# View the heatmap (darker color means stronger correlation)
%matplotlib inline
import seaborn as sns
sns.heatmap(pd.DataFrame(x_poly, columns=poly.get_feature_names_out()).corr())
```

Implementation code in pipeline:

```python
# Import related libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
knn = KNeighborsClassifier()
# Parameters to search in the pipeline
pipe_params = {'poly_features__degree': [1, 2, 3],
               'poly_features__interaction_only': [True, False],
               'classify__n_neighbors': [3, 4, 5, 6]}
# Instantiate the pipeline
pipe = Pipeline([('poly_features', poly),
                 ('classify', knn)])
# Grid search
grid = GridSearchCV(pipe, pipe_params)
grid.fit(x, y)
print(grid.best_score_, grid.best_params_)
```

`0.721189408065 {'classify__n_neighbors': 5, 'poly_features__degree': 2, 'poly_features__interaction_only': True}`

#### 🙊 text variable processing

Text processing is most widely used in the NLP (natural language processing) field. Generally, text needs to be vectorized; the most common approaches are **bag of words, CountVectorizer, and TF-IDF**.

**1) bag of words**

The bag-of-words approach has three steps: **tokenizing, counting, normalizing**.

**2) CountVectorizer**

The text is transformed into a matrix in which each column represents a word and each row represents a document, so the matrix is generally very sparse. It is available as `CountVectorizer` in `sklearn.feature_extraction.text`, ready to use.

**3) TF-IDF**

The TF-IDF vectorizer is composed of two parts: the TF part representing term frequency, and the IDF part representing inverse document frequency. TF-IDF is a word-weighting method used in information retrieval and clustering, available as `TfidfVectorizer` in `sklearn.feature_extraction.text`.

TF (term frequency): how often a word appears in a document.

IDF (inverse document frequency): a measure of how informative a word is; if a word appears in many documents, its weight is reduced.
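A minimal sketch of both vectorizers on a few toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the cat sat', 'the cat sat on the mat', 'dogs and cats']

# CountVectorizer: raw token counts, one column per word in the vocabulary
counts = CountVectorizer().fit_transform(docs)
print(counts.shape)  # (3 documents, 8 distinct words)

# TfidfVectorizer: counts reweighted so words common to many documents
# (like 'the') get lower weight than distinctive words
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)
```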

### ✅ 04 feature selection

OK, after all the feature derivation above we now have a great many features (variables). Should we throw them all into model training? Of course not: that wastes resources without improving results, so we need **feature selection**. Feature-selection methods fall roughly into two categories: **statistics-based and model-based**.

Before feature selection, we need to clarify a concept: **what counts as "better", and which metrics can quantify it?**

This can be roughly divided into two categories. One is **model metrics**, e.g. accuracy, F1-score, R², etc. The other is **meta metrics**, i.e. indicators not directly related to the model's predictive performance, such as: **the time needed to fit/train the model, the time needed for the fitted model to predict new instances, and the size of the data that must be persisted (permanently saved)**.

We can wrap the metrics above in a helper method for easy reuse later. The code:

```python
from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, x, y):
    grid = GridSearchCV(model, params, error_score=0.)
    grid.fit(x, y)
    # Classic performance metric
    print("Best Accuracy: {}".format(grid.best_score_))
    # Parameters that achieved the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # Average fit time
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average prediction time
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))

############### Usage example ###############
# Import related libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
knn = KNeighborsClassifier()
# Parameters to search in the pipeline
pipe_params = {'poly_features__degree': [1, 2, 3],
               'poly_features__interaction_only': [True, False],
               'classify__n_neighbors': [3, 4, 5, 6]}
# Instantiate the pipeline
pipe = Pipeline([('poly_features', poly),
                 ('classify', knn)])
# Grid search
get_best_model_and_accuracy(pipe, pipe_params, x, y)
```

Through the above operations, we can create a model performance baseline to compare the effect of subsequent optimization. Next, we introduce some common feature selection methods.

#### 1) feature selection based on statistics

For single variables, we can use the **Pearson correlation coefficient and hypothesis testing** to select features.

(1) The Pearson correlation coefficient can be computed with `corr()`. The returned value is between -1 and 1; the larger the absolute value, the stronger the correlation.

(2) Hypothesis testing, as a statistical test, works in feature selection by testing the null hypothesis **"this feature is unrelated to the response variable"** for each variable, to see whether it has a significant relationship with the target. This can be implemented with `SelectKBest` and `f_classif`. The p-value lies **between 0 and 1**; in short, **the smaller the p-value, the stronger the evidence against the null hypothesis, i.e. the stronger the relationship between this feature and the target**.
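A minimal sketch of `SelectKBest` with `f_classif`, on sklearn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Keep the k=2 features with the smallest ANOVA p-values
selector = SelectKBest(f_classif, k=2)
X_new = selector.fit_transform(iris.data, iris.target)
print(X_new.shape)          # (150, 2)
print(selector.pvalues_)    # one p-value per original feature
```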

#### 2) feature selection based on Model

(1) For text features, `CountVectorizer` in `sklearn.feature_extraction.text` has parameters that act as feature selection: **max_features, min_df, max_df, stop_words**. You can select features by grid-searching these parameters, and combine this with `SelectKBest` in a pipeline.

(2) For tree models, we can directly use **feature importances**: for example, `DecisionTreeClassifier` exposes **feature_importances_** (as do RandomForest, GBDT, XGBoost, ExtraTreesClassifier, etc.), which returns the importance of each feature for the fitted model. We can therefore drop features with low importance, combined with `SelectFromModel` in a pipeline.
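A minimal sketch of combining `feature_importances_` with `SelectFromModel`, on sklearn's built-in iris data (by default the threshold is the mean importance):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a tree, then keep only features whose importance beats the mean
sfm = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_reduced = sfm.fit_transform(iris.data, iris.target)
print(iris.data.shape[1], '->', X_reduced.shape[1])  # 4 -> fewer features
```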

(3) use regularization to filter variables (for linear models). There are two common regularization methods:**L1 regularization (lasso) and L2 regularization (ridge)**。

#### In summary, there are several methods and experiences for feature selection:

(1) if the features are categorical variables, you can start with SelectKBest, using the chi-square test or a tree-based selector to pick variables;

(2) if the features are quantitative variables, linear models and correlation-based selectors can be used directly;

(3) for a binary classification problem, consider using SelectFromModel with an SVC;

(4) before feature selection, EDA is still needed.

### 💫 05 feature transformation

After the "baptism" of the steps above, we arrive at **feature transformation**: using the hidden structure of the source dataset to create new columns. The two common methods are **PCA and LDA**.

#### ✅ PCA：

PCA (principal component analysis) is a common data-compression technique: it projects a dataset with multiple correlated features onto a coordinate system with fewer correlated features. A consequence is that the transformed features are no longer interpretable, because you cannot explain the business meaning of the new variables.

We will not go into the theory of PCA here; plenty of articles explain it thoroughly. The goal is mainly to walk through PCA's usage in sklearn: first to keep practising pipelines, and second to understand how PCA is applied.

```python
# Import related libraries
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
# Import the dataset
iris = load_iris()
iris_x, iris_y = iris.data, iris.target
# Instantiate the method
pca = PCA(n_components=2)
# Fit and transform
pca.fit(iris_x)
pca.transform(iris_x)[:5, ]

# Define a reusable visualization helper
label_dict = {i: k for i, k in enumerate(iris.target_names)}

def plot(x, y, title, x_label, y_label):
    ax = plt.subplot(111)
    for label, marker, color in zip(
            range(3), ('^', 's', 'o'), ('blue', 'red', 'green')):
        plt.scatter(x=x[:, 0].real[y == label],
                    y=x[:, 1].real[y == label],
                    color=color,
                    alpha=0.5,
                    label=label_dict[label])
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title(title)

# Visualization
plot(iris_x, iris_y, "original iris data", "sepal length(cm)", "sepal width(cm)")
plt.show()
plot(pca.transform(iris_x), iris_y, "Iris: Data projected onto first two PCA components", "PCA1", "PCA2")
```

The above is the simple call and effect display of PCA on sklearn. In addition, the author puts forward an interesting question:

Generally speaking, scaling (normalizing) features helps machine-learning algorithms, so why is the example in the book the opposite?

The explanation: after scaling, the covariances between columns become more uniform, so the variance explained by each principal component becomes spread out instead of concentrated in a single component. In practice, therefore, the safest approach is to test performance on both the scaled and the unscaled data.
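A small sketch of that claim on sklearn's iris data: after scaling, the first component explains a smaller share of the variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_x = load_iris().data

# Compare explained variance with and without scaling --
# scaling spreads the variance across more components
raw = PCA(n_components=2).fit(iris_x)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(iris_x))
print(raw.explained_variance_ratio_)
print(scaled.explained_variance_ratio_)
```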

#### ✅ LDA：

LDA, or linear discriminant analysis, is a supervised algorithm (by the way, PCA is unsupervised), which is generally used in the preprocessing steps of classification pipeline. Similar to PCA, LDA also extracts a new coordinate axis to project the original high-dimensional data into the low-dimensional space. The difference is that LDA does not focus on the variance between the data, but directly optimizes the low-dimensional space to obtain the best category separability.

```python
# Using LDA
# Import related libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Instantiate the LDA module
lda = LinearDiscriminantAnalysis(n_components=2)
# Fit and transform
x_lda_iris = lda.fit_transform(iris_x, iris_y)
# Visualization
plot(x_lda_iris, iris_y, "LDA Projection", "LDA1", "LDA2")
```

### 📖 06 feature learning

Here we come to the last chapter, whose theme is **"AI for AI"**. It sounds abstract, and honestly it felt a little strange to me too. Feature-learning algorithms are nonparametric methods: they are built without relying on assumptions about the structure of the data.

#### Parameter assumption of data

A parametric assumption is a basic assumption an algorithm makes about the shape of the data. For example, in the previous chapter's PCA, we assumed:

that the shape of the original data can be decomposed (via eigendecomposition) and represented by a single linear transformation (a matrix operation).

Feature-learning algorithms remove this "assumption": they do not depend on the shape of the data but on **stochastic learning**. That means these algorithms do not output the same result every run; instead they examine the data points batch by batch, looking for the best features to extract, and converge toward an optimal solution.

In the field of feature learning there are two common methods, explained below: **the restricted Boltzmann machine (RBM) and word embeddings**.

#### Restricted Boltzmann machine (RBM)

The RBM is a simple deep-learning architecture, a family of unsupervised feature-learning algorithms that learn a certain number of new features according to a probabilistic model of the data. The features an RBM extracts often work best when fed into linear models (linear regression, logistic regression, perceptrons, etc.).

Conceptually, an RBM is a shallow (2-layer) neural network and one of the building blocks of the **DBN (deep belief network)** algorithm. It is an unsupervised algorithm, and **the number of features it can learn is limited only by computing power**; it may learn fewer or more features than the original data has, and the exact number to learn depends on the problem at hand.

It is called "restricted" because only connections between the two layers are allowed; connections between nodes within the same layer are not.

We need to understand "reconstruction": the operation in which the network makes several forward and backward passes between the visible layer (input layer) and the hidden layer, without involving a deeper network.

In the reconstruction phase, the RBM reverses the network: the visible layer becomes the hidden layer and the hidden layer becomes the visible layer. Using the same weights but different biases, the activations a are passed backward to the visible layer, where the original input vector is reconstructed from the forward-pass activations. The RBM uses this method to "evaluate itself": by passing activations backward and obtaining an approximation of the original input, the network can adjust its weights to bring the approximation closer to the input.

At the start of training, because the weights are randomly initialized (standard practice), the gap between the approximation and the true input may be large. The weights are then adjusted through backpropagation to minimize the distance between the original input and the approximation, and this process repeats until the approximation is as close to the input as possible. (The number of times this process runs is called the **number of iterations**.)

That is the general principle; more detailed explanations are easy to find online. Now let's look at applying an RBM in a machine-learning pipeline. We again use the MNIST dataset, which we also used earlier when discussing Keras: a pile of pixel data for handwritten digits, used for digit recognition.

```python
# Using an RBM
# We use the MNIST dataset for illustration
# Import related libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Import the dataset
images = np.genfromtxt('./data/mnist_train.csv', delimiter=',')
print(images.shape)
# Split the data
images_x, images_y = images[:, 1:], images[:, 0]
# Scale features to 0-1
images_x = images_x / 255.
# Learn new features with an RBM
rbm = BernoulliRBM(random_state=0)
lr = LogisticRegression()
# Set the pipeline's parameter grid
params = {'clf__C': [1e-1, 1e0, 1e1],
          'rbm__n_components': [100, 200]}
# Create the pipeline
pipeline = Pipeline([('rbm', rbm),
                     ('clf', lr)])
# Instantiate the grid search
grid = GridSearchCV(pipeline, params)
# Fit the data
grid.fit(images_x, images_y)
# Return the best parameters
grid.best_params_, grid.best_score_
```

#### Word embedding

Widely used in the NLP field, word embedding projects strings (words or phrases) into an n-dimensional feature space so that the nuances of context and wording can be captured. We can use `CountVectorizer` and `TfidfVectorizer` to convert strings into vectors, but those are only collections of word-count features; to learn embeddings that capture meaning we turn to a package called `gensim`.

There are two common ways to embed words: word2vec and glove.

**Word2vec:** an algorithm invented at Google, based on deep learning. Word2vec is itself a shallow neural network with an input layer, a hidden layer, and an output layer, where the input and output layers have the same number of nodes.

**GloVe:** an algorithm from Stanford University that learns embeddings from a series of corpus-wide co-occurrence statistics.
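A tiny sketch of the idea with hypothetical hand-made vectors (real embeddings come from training word2vec or GloVe, e.g. via the `gensim` package): words with related meanings end up close together in the embedding space, as measured by cosine similarity.

```python
import numpy as np

# Hypothetical 4-d embeddings for three words (made up for illustration)
vectors = {
    'king':  np.array([0.9, 0.1, 0.8, 0.2]),
    'queen': np.array([0.8, 0.2, 0.9, 0.3]),
    'apple': np.array([0.1, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words have higher cosine similarity in the embedding space
print(cosine(vectors['king'], vectors['queen']))  # close to 1
print(cosine(vectors['king'], vectors['apple']))  # much smaller
```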

Word embeddings have many applications, for example information retrieval: when we type keywords, the search engine can recall and accurately return the articles or news matching them.
