#### 1. PCA and SVD

The dimensionality reduction algorithms in sklearn are all included in the module `decomposition`, which is essentially a matrix decomposition module. Over the past decade, if we were to name the pioneer of algorithmic progress, matrix decomposition would be in a class of its own. Matrix decomposition can be used in dimensionality reduction, deep learning, cluster analysis, data preprocessing, low-dimensional feature learning, recommender systems, big-data analysis and other fields. In 2006, Netflix hosted a $1 million recommender-system algorithm contest, and the winner used the star of matrix decomposition: singular value decomposition, SVD. Later in this course we will talk about the application of SVD in recommender systems, so don't miss it!

Both SVD and PCA are introductory matrix decomposition algorithms; both reduce dimensionality by decomposing the feature matrix, and they are the focus of our discussion today. "Introductory" does not mean PCA and SVD are simple: the two images below are two pages I picked at random from an SVD paper, and you can see they are packed with mathematical formulas (mostly linear algebra). If I tried to explain all of these formulas to you, I would cough up blood by the end, and you would too after listening. So today I will show you the principles of dimensionality reduction algorithms in the simplest possible way, but that necessarily means you will not see the full picture of these algorithms. **Avoiding the mathematics in machine learning is a bad habit**, so please read more of the underlying theory on your own.

In dimensionality reduction we reduce the number of features, which means deleting data, and less data means the model can extract less information, so model performance may suffer. At the same time, high-dimensional data inevitably contain features with no effective information (such as noise), or features whose information is redundant with other features (for example, features that are linearly correlated). We would therefore like a way to measure the amount of information in each feature, so that during dimensionality reduction we can **reduce the number of features while retaining most of the effective information**: merge features with duplicated information, delete features with no effective information, and so on, gradually creating a new feature matrix that represents most of the information of the original feature matrix with fewer features.

In last week's feature engineering class, we covered an important feature selection method: variance filtering. If a feature's variance is very small, it means the feature probably takes a large number of identical values (for example, 90% are 1 and only 10% are 0, or even 100% are 1); such a feature cannot discriminate between samples and carries no effective information. Conversely, if a feature's variance is large, the feature carries a lot of information. Therefore, in dimensionality reduction, **the information measure used by PCA is the sample variance, also called the explained variance: the larger the variance, the more information the feature carries.**

$$

Var = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \hat{x})^2

$$

$Var$ represents the variance of a feature, $n$ the sample size, $x_i$ the value of each sample on this feature, and $\hat{x}$ the mean of this column of samples.
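The formula can be checked numerically. A minimal sketch using numpy, where `ddof=1` gives the $n-1$ divisor used above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Sample variance with the n-1 divisor, matching the formula above
var_manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

# numpy computes the same thing when ddof=1
var_numpy = x.var(ddof=1)

print(var_manual, var_numpy)  # both equal 1.0
```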

Interview High-Risk Question |
---|

Why is the divisor in the variance formula n−1? This gives an unbiased estimate of the sample variance; interested readers can explore the details on their own. |

#### 2. How to realize dimensionality reduction?

*class* `sklearn.decomposition.PCA`

(*n_components=None*, *copy=True*, *whiten=False*, *svd_solver='auto'*, *tol=0.0*, *iterated_power='auto'*, *random_state=None*)

As the core algorithm among matrix decomposition methods, PCA does not have many parameters, but unfortunately the meaning and use of each parameter is quite difficult to grasp, because almost every parameter involves deep mathematical principles. To clarify the use and significance of the parameters, let's look at dimensionality reduction on a simple set of two-dimensional data.
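Before diving into the theory, here is a minimal sketch of how the class is typically instantiated and fitted (the random data here is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)           # 100 samples, 5 features

pca = PCA(n_components=2)      # keep 2 features after dimensionality reduction
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```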

We now have a simple data set with two features, x1 and x2. The coordinates of the three samples are (1,1), (2,2) and (3,3). We can take x1 and x2 as two feature vectors and describe the data in a two-dimensional plane. In this data set, the mean of each feature is 2, and the variance equals:

$$

x1\_var = x2\_var = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{2} = 1

$$

The two features take identical values on every sample, so each feature has variance 1, and the total variance of the data is 2.

Our goal now is to describe this data set with only one feature vector, that is, to reduce the two-dimensional data to one dimension while retaining as much information as possible, i.e. keeping the total variance of the data as close to 2 as possible. To do this, we rotate the original rectangular coordinate system 45 degrees counterclockwise, forming a new plane spanned by the new feature vectors x1* and x2*, in which the three sample points have coordinates $(\sqrt{2}, 0)$, $(2\sqrt{2}, 0)$ and $(3\sqrt{2}, 0)$. Notice that the values on x2* have all become zero, so x2* clearly carries no effective information (the variance of x2* is now zero). The mean of the data on x1* is $2\sqrt{2}$, and its variance can be expressed as:

$$

x1^*\_var = \frac{(\sqrt{2} - 2\sqrt{2})^2+(2\sqrt{2} - 2\sqrt{2})^2+(3\sqrt{2} - 2\sqrt{2})^2}{2} = 2

$$

The mean of the data on x2* is 0, and its variance is also 0.

Now, ranking the features by the amount of information they carry, we keep the single feature with the most information, since we want one-dimensional data. So we delete x2*, and with it the x2* feature vector in the figure; the remaining x1* represents the three sample points that previously needed two features. By rotating the axes of the original feature vectors to find new feature vectors and a new coordinate plane, we compress the information of the three sample points onto a straight line, achieving a reduction from two dimensions to one while retaining as much of the original information as possible. A successful dimensionality reduction.
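This worked example can be reproduced with sklearn. One caveat: sklearn centers the data before projecting, so the 1-D coordinates come out as the centered values $-\sqrt{2}, 0, \sqrt{2}$ rather than $\sqrt{2}, 2\sqrt{2}, 3\sqrt{2}$, but the retained variance is the same 2:

```python
import numpy as np
from sklearn.decomposition import PCA

# The three sample points from the example above
X = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print(pca.explained_variance_)  # [2.] -- the kept component carries all the variance
print(np.sort(X_1d.ravel()))    # approximately [-1.414, 0, 1.414]
```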

It is not difficult to note that there are several important steps in this dimension reduction process:

Step | Two-Dimensional Feature Matrix | n-Dimensional Feature Matrix |
---|---|---|
1 | Input the original data with structure (3, 2). Finding the Cartesian coordinate system of the two original features is, in essence, finding the two-dimensional plane spanned by those two features. | Input the original data with structure (m, n). Find the n-dimensional space V spanned by the n original feature vectors. |
2 | Decide the number of features after dimensionality reduction: 1 | Decide the number of features after dimensionality reduction: k |
3 | Rotate to find a new coordinate system. In essence, find two new feature vectors and the new two-dimensional plane they span; the new feature vectors allow the data to be compressed onto a few features without losing too much of the total information. | Through some transformation, find n new feature vectors and the new n-dimensional space V they span. |
4 | Find the coordinates of the data points on the two new axes of the new coordinate system | Find the values of the original data on the n new feature vectors in the new feature space V, i.e. "map the data into the new space" |
5 | Select the first new feature vector, the one with the largest variance, and delete the unselected one; the two-dimensional plane is successfully reduced to one dimension | Select the k features carrying the most information, delete the unselected ones, and successfully reduce the n-dimensional space V to a k-dimensional space |

In step 3, we use the technique of **matrix decomposition to find the n new feature vectors, so that the data can be compressed onto a few features without losing too much of the total information**. PCA and SVD are two different dimensionality reduction algorithms that both follow the process above; what differs is the matrix decomposition method and the measure of information. PCA uses variance as the measure of information and eigenvalue decomposition to find the space V. During dimensionality reduction, it decomposes the feature matrix X, through a series of mathematically mysterious operations (e.g. forming the covariance matrix $\frac{1}{n}XX^{T}$), into a product of three matrices $Q\Sigma Q^{-1}$, where $Q$ and $Q^{-1}$ are auxiliary matrices and $\Sigma$ is a diagonal matrix (that is, a matrix with values only on the main diagonal and zeros everywhere else) whose diagonal elements are the variances. After dimensionality reduction, each new feature vector found by PCA is called a "principal component", and the discarded feature vectors are considered to carry little information, which is probably noise.

$$
X = U\Sigma V^{T}
$$

SVD uses singular value decomposition to find the space V, where $\Sigma$ is also a diagonal matrix, but the elements on its diagonal are singular values, which are the index SVD uses to measure the amount of information in features. $U$ and $V^{T}$ are the left and right singular matrices respectively, and they too are auxiliary matrices.
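The factorization can be verified numerically. A minimal numpy sketch on a small random matrix (the shapes and data are illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 3)   # a small (m, n) = (4, 3) feature matrix

# Reduced SVD: U is (4, 3), s holds the 3 singular values, Vt is (3, 3)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values come sorted from largest (most information) to smallest
print(s)

# X is exactly recovered from the three factors: U @ diag(s) @ Vt
print(np.allclose(X, U @ np.diag(s) @ Vt))  # True
```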

Mathematically, both PCA and SVD must traverse all features and samples to compute the information measure, and the matrix decomposition process generates matrices larger than the original feature matrix. For example, if the original data have structure (m, n), then to find the best new feature space V the decomposition may need to generate matrices of size (n, n) and (m, m), and PCA additionally needs to compute the covariance matrix, which costs even more. No language, whether Python, R or anything else, is particularly good at huge matrix operations; no matter how streamlined the code, we inevitably have to wait for the computer to finish this very large computation. Dimensionality reduction algorithms are therefore computationally expensive and slow to run, but their capabilities are irreplaceable, and they remain darlings of the machine learning field.

Think: PCA and feature selection technology are both part of feature engineering. What are the differences between them? |
---|

There are three approaches in feature engineering: feature extraction, feature creation and feature selection. Looking at the dimensionality reduction example above and the feature selection we explained last week, do you notice any differences?
Feature selection picks, from the existing features, those that carry the most information; after selection the features are still interpretable, and we still know where each feature sits in the original data and what it means. PCA instead compresses the existing features: the features after dimensionality reduction are not any of the features in the original feature matrix, but new features combined from them. Generally speaking, this means PCA is usually not suitable for exploring the relationship between features and labels (as in linear regression), because an uninterpretable relationship between new features and labels is not very meaningful. In linear regression models, we therefore use feature selection. |
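The interpretability contrast can be seen directly in code. A minimal sketch (the constant column is a contrived example): variance filtering returns untouched original columns, while each PCA component mixes all original features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
X[:, 0] = 1.0                 # a constant, zero-variance feature

# Feature selection: drops the constant column; survivors are original columns
X_sel = VarianceThreshold().fit_transform(X)
print(X_sel.shape)            # (50, 3) -- still interpretable original features

# PCA: each new feature is a weighted combination of ALL original features
pca = PCA(n_components=3).fit(X)
print(pca.components_.shape)  # (3, 4) -- 3 components, each mixing 4 features
```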

<div STYLE="page-break-after: always;"></div>