Data dimension reduction: principal component analysis

Time:2021-4-19

Preface

What is principal component analysis? Let's start with a picture of an ellipse. Suppose you were asked to find a line such that the points of the ellipse, once projected onto it, are spread out as much as possible and so keep as much information as possible. Which line would you choose? In the figure below, the horizontal line is the choice: it represents the two-dimensional data in one dimension while losing as little as possible. More generally, can multidimensional data be represented as faithfully as possible in fewer dimensions?

[Figure: an ellipse and the horizontal line chosen for the projection]
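The ellipse question can be checked numerically. The sketch below is only an illustration: the ellipse (semi-axes 3 and 1) and the helper name `projection_variance` are assumptions made for this example. It projects points of the ellipse onto lines at different angles and measures how scattered the projected points are.

```python
import math


def projection_variance(points, angle):
    """Variance of the points after projecting them onto a line through
    the origin that makes `angle` radians with the x-axis."""
    ux, uy = math.cos(angle), math.sin(angle)          # unit vector along the line
    coords = [x * ux + y * uy for x, y in points]      # 1-D coordinate of each point
    mean = sum(coords) / len(coords)
    return sum((c - mean) ** 2 for c in coords) / len(coords)


# Hypothetical data: 360 points on an ellipse whose long axis is horizontal.
ellipse = [
    (3 * math.cos(math.radians(t)), math.sin(math.radians(t)))
    for t in range(360)
]

var_horizontal = projection_variance(ellipse, 0.0)
var_diagonal = projection_variance(ellipse, math.pi / 4)
var_vertical = projection_variance(ellipse, math.pi / 2)

print("horizontal:", var_horizontal)
print("diagonal:  ", var_diagonal)
print("vertical:  ", var_vertical)
```

The horizontal line gives the largest variance of the three, which matches the intuition that the projected points are the most scattered along the long axis of the ellipse.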

In the same way, how would you choose a two-dimensional plane that represents an ellipsoid as faithfully as possible?

[Figure: an ellipsoid and a two-dimensional plane]

Idea

Principal component analysis is a statistical method for simplifying data. It is a linear transformation that maps the data into a new coordinate system in which the direction of greatest variance of the projected data becomes the first coordinate (the first principal component), the direction of second greatest variance becomes the second principal component, and so on. If we discard the later principal components, which carry little variance, we keep the components that contribute most to the variance and so retain the main features of the data. To make the data easier to handle, we also move the origin of the coordinate system to the center of the data, that is, we subtract the mean from every point.
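For two-dimensional data the whole idea fits in a few lines. The sketch below is an illustration under assumptions, not a general implementation: the sample points and the function names are made up for this example, and it uses the closed-form angle of the principal axis, which only exists in two dimensions. It centers the data, rotates the axes onto the principal directions, and compares the variance along each new axis.

```python
import math


def center(points):
    """Move the origin to the center (mean) of the data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return [(x - mx, y - my) for x, y in points]


def principal_angle(centered):
    """Angle of the direction of greatest variance for centered 2-D data."""
    n = len(centered)
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


def rotate_into_axes(centered, angle):
    """Coordinates of the points in the axes rotated by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c + y * s, -x * s + y * c) for x, y in centered]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical data lying roughly along the line y = x.
points = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]

centered = center(points)
angle = principal_angle(centered)
rotated = rotate_into_axes(centered, angle)

var_first = variance([a for a, _ in rotated])    # first principal component
var_second = variance([b for _, b in rotated])   # second principal component

print("angle of first axis (radians):", angle)
print("variance along first axis:    ", var_first)
print("variance along second axis:   ", var_second)
```

Almost all of the variance ends up on the first axis, so dropping the second coordinate loses very little.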


In Mathematics

In mathematics, we measure the error with the square of the $L^2$ norm. The squared $L^2$ norm attains its minimum at the same point as the $L^2$ norm itself, because the norm is non-negative and squaring is monotonically increasing on non-negative values, and it has nicer properties to work with. Here $x$ is the input and $c^*$ is the optimal code:

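$$c^* = \arg\min_c \|x - g(c)\|_2^2$$

With a linear decoder $g(c) = Dc$ whose columns are orthonormal, setting the gradient with respect to $c$ to zero gives $c = D^T x$, so only one matrix multiplication is needed to get $c$. Define the reconstruction operation:

$$r(x) = g(D^T x) = D D^T x$$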
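Encoding and reconstruction can be tried out directly. The sketch below assumes the usual setup, in which the code is $c = D^T x$ and the reconstruction is $D D^T x$ for a matrix $D$ with orthonormal columns. The matrix `D`, the vector `x` and the function names are hypothetical values chosen for this example, and plain Python lists are used only to keep it dependency-free.

```python
import math


def encode(D, x):
    """Code c = D^T x. D is a list of n rows with l entries each."""
    n, l = len(D), len(D[0])
    return [sum(D[i][j] * x[i] for i in range(n)) for j in range(l)]


def decode(D, c):
    """Decoded point D c, back in the original n-dimensional space."""
    return [sum(D[i][j] * c[j] for j in range(len(c))) for i in range(len(D))]


def reconstruct(D, x):
    """Reconstruction D D^T x: encode, then decode."""
    return decode(D, encode(D, x))


# Hypothetical D: two orthonormal columns in three-dimensional space.
s = 1 / math.sqrt(2)
D = [
    [s, 0.0],
    [s, 0.0],
    [0.0, 1.0],
]

x = [3.0, 1.0, 2.0]
code = encode(D, x)            # two numbers instead of three
restored = reconstruct(D, x)   # back to three numbers, with some loss

# Squared L2 norm of what was lost by keeping only two dimensions.
error = sum((xi - ri) ** 2 for xi, ri in zip(x, restored))

print("code:     ", code)
print("restored: ", restored)
print("error:    ", error)
```

The reconstruction is the closest point to `x` in the plane spanned by the columns of `D`; the error is the part of `x` that sticks out of that plane.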

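After a longer derivation, which can be completed by mathematical induction, it turns out that the matrix $D$ is composed of the eigenvectors of $X^TX$ that correspond to its largest eigenvalues, one eigenvector for each dimension we keep. Here $X$ is the matrix whose rows are the input vectors.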
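To get a feeling for this result, the eigenvector of $X^TX$ with the largest eigenvalue can be approximated by power iteration. Note that power iteration is a technique swapped in for this example, not the induction proof mentioned above, and it only finds the first column of $D$. The data matrix `X` is a small hypothetical, already centered, data set; in practice a library routine such as `numpy.linalg.eigh` would be the more usual choice.

```python
import math


def gram_matrix(X):
    """X^T X for a data matrix X given as a list of rows (one sample per row)."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


def multiply(A, v):
    """Matrix-vector product A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def top_eigenvector(A, iterations=200):
    """Power iteration: apply A again and again and renormalise.

    Assumes the start vector is not orthogonal to the wanted eigenvector
    and that the largest eigenvalue is strictly larger than the others.
    """
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = multiply(A, v)
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    eigenvalue = sum(a * b for a, b in zip(v, multiply(A, v)))  # Rayleigh quotient
    return v, eigenvalue


# Hypothetical centered data: five samples with two features each.
X = [
    [-2.0, -1.0],
    [-1.0, -1.0],
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 1.0],
]

A = gram_matrix(X)
first_direction, largest_eigenvalue = top_eigenvector(A)

print("X^T X:              ", A)
print("first column of D:  ", first_direction)
print("largest eigenvalue: ", largest_eigenvalue)
```

Projecting each row of `X` onto `first_direction` gives the one-dimensional code with the smallest reconstruction error for this data set.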

Summary

Principal component analysis is mainly used for dimensionality reduction. The goal is to reduce the amount of data as much as possible while losing as little of the original information as possible.

  • This article was first published by rais