Principal component analysis (PCA) is a widely used dimensionality-reduction technique in data mining. It is a multivariate statistical method introduced by Pearson in 1901 and later developed by Hotelling in 1933. Its main purpose is dimensionality reduction: by extracting the principal components that capture the largest individual differences, it can also be used to reduce the number of variables in regression analysis and cluster analysis, in a way similar to factor analysis.
Dimensionality reduction here means reducing the number of correlated variables and replacing the original variables with fewer ones. If the original variables are already orthogonal to each other, that is, uncorrelated, principal component analysis has no effect.
Correspondence analysis (CA) is an extension of principal component analysis suited to large contingency tables formed by two qualitative (categorical) variables. This post uses correspondence analysis to study the individual differences between husbands' and wives' occupations.
Husband and wife occupation data
Consider the following data on the occupations of couples, given as a frequency table.
Traditionally, for this kind of data, we use the chi-square test, chi-square distances, and chi-square contributions to examine the differences in the data.
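As a minimal sketch of this classical approach, assuming a small hypothetical husband-by-wife frequency table (the post's actual counts are not reproduced here):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical frequency table: husbands in rows, wives in columns.
table = np.array([
    [120,  30,  10],
    [ 25, 140,  20],
    [ 15,  25,  90],
])

# Chi-square test of independence between the two categorical variables.
chi2, pvalue, dof, expected = chi2_contingency(table)

# Chi-square contribution of each cell: (observed - expected)^2 / expected.
contrib = (table - expected) ** 2 / expected
print(chi2, pvalue, dof)
print(contrib)
```

The per-cell contributions show which husband/wife occupation pairs drive the departure from independence; their sum is the chi-square statistic.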
A mosaic plot is often used to display categorical data. Its strength is that it can clearly show the relationship between two or more categorical variables; it can also be seen as a way of displaying categorical data graphically.
When the variables are categorical and there are more than three of them, a mosaic plot can be used. In a mosaic plot, the area of each nested rectangle is proportional to the cell frequency, i.e. the frequency in the multidimensional contingency table. Colors and shading represent the residuals of the fitted model.
We can visualize the result with a mosaic plot.
The husbands are in the rows and the wives in the columns. The important cells are blue or red, corresponding to "positive" associations (joint probability higher than under independence) or "negative" associations (joint probability lower than under independence).
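One way to produce such a plot in Python is `statsmodels`' `mosaic` function, which can color the tiles according to standardized Pearson residuals. A sketch with hypothetical occupation labels and counts:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical (husband, wife) -> count mapping; labels are illustrative.
counts = {
    ("farmer", "farmer"): 120, ("farmer", "clerk"): 30,
    ("clerk", "farmer"):   25, ("clerk", "clerk"): 140,
}

# statistic=True colors each tile by its standardized residual,
# highlighting cells that deviate from independence.
fig, rects = mosaic(counts, statistic=True, gap=0.02)
fig.savefig("mosaic.png")
```

The area of each rectangle is proportional to the cell count, so the strong diagonal of the couples table shows up immediately as large, highlighted tiles.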
In the other direction, with the table transposed:
But the conclusion is the same as before: there are strong blue values on the diagonal.
In other words, within couples the two occupations tend to be relatively similar.
Principal component analysis and correspondence analysis
In correspondence analysis, we look at the table of probabilities, by rows or by columns. Writing \(n_{ij}\) for the counts and \(n\) for the total, set \(p_{ij}=n_{ij}/n\). For example, the \(i\)-th row profile is the probability vector
\[ L_i = \left(\frac{p_{i1}}{p_{i\cdot}},\dots,\frac{p_{iJ}}{p_{i\cdot}}\right), \qquad p_{i\cdot}=\sum_j p_{ij}. \]
Note that, writing \(D_r=\mathrm{diag}(p_{1\cdot},\dots,p_{I\cdot})\) and \(P=(p_{ij})\), we can write the matrix of row profiles as \(L = D_r^{-1}P\).
The center of gravity of our row profiles is the vector of column margins,
\[ \bar L = (p_{\cdot 1},\dots,p_{\cdot J}). \]
Similarly, noting that \(p_{\cdot j}=\sum_i p_{ij}\), we can write this in matrix form as \(\bar L = \mathbf{1}^\top P\), where \(\mathbf{1}\) is the vector of ones.
To each row profile we attach its (relative) frequency \(p_{i\cdot}\) as a weight, which amounts to using the matrix \(D_r\). To measure the distance between two points, we weight the Euclidean distance by the reciprocals of the column probabilities, i.e. by \(D_c^{-1}\) with \(D_c=\mathrm{diag}(p_{\cdot 1},\dots,p_{\cdot J})\). The (chi-square) distance between rows \(i\) and \(i'\) is then
\[ d^2(i,i') = \sum_{j} \frac{1}{p_{\cdot j}}\left(\frac{p_{ij}}{p_{i\cdot}} - \frac{p_{i'j}}{p_{i'\cdot}}\right)^2. \]
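These quantities are easy to compute directly; a minimal sketch, again assuming a small hypothetical table:

```python
import numpy as np

# Hypothetical husband x wife frequency table (counts).
N = np.array([
    [120,  30,  10],
    [ 25, 140,  20],
    [ 15,  25,  90],
], dtype=float)

P = N / N.sum()              # joint probabilities p_ij
r = P.sum(axis=1)            # row margins p_i.
c = P.sum(axis=0)            # column margins p_.j

L = P / r[:, None]           # row profiles L_i (each row sums to 1)
centroid = c                 # center of gravity of the row profiles

def chi2_dist(i, ip):
    # chi-square distance: Euclidean distance between profiles,
    # weighted by the reciprocal of the column margins.
    return np.sum((L[i] - L[ip]) ** 2 / c)

print(chi2_dist(0, 1))
```

Each row profile sums to one, and the distance is symmetric and zero between a row and itself, as expected.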
We then perform a principal component analysis with these weights. From a matrix perspective, this amounts to a singular value decomposition of the matrix of standardized residuals,
\[ S = D_r^{-1/2}\left(P - r c^\top\right) D_c^{-1/2}, \]
where \(r\) and \(c\) are the vectors of row and column margins.
Writing \(S = U\Sigma V^\top\), we define the principal components of the rows as \(F = D_r^{-1/2}\,U\Sigma\).
The projection of the rows on the first two components is shown here.
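This weighted PCA can be sketched in a few lines of NumPy; the table is again hypothetical:

```python
import numpy as np

N = np.array([[120,  30, 10],
              [ 25, 140, 20],
              [ 15,  25, 90]], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)            # row margins
c = P.sum(axis=0)            # column margins

# Standardized residuals S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows and columns.
F = U * sv / np.sqrt(r)[:, None]      # row coordinates, F = D_r^{-1/2} U Sigma
G = Vt.T * sv / np.sqrt(c)[:, None]   # column coordinates

print(F[:, :2])  # projection of the rows on the first two axes
```

The sum of squared singular values equals the total inertia (the chi-square statistic divided by \(n\)), and the row coordinates are centered with respect to the row weights.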
The idea is to visualize the individuals corresponding to the rows. In a second step, we do the same thing for the columns.
principal component analysis
We can then run the principal component analysis and look at the visualization of the individuals.
The magic of correspondence analysis is that we can represent the two projections, of the rows and of the columns, on the same plane.
The result is as follows.
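A self-contained sketch of such a joint (biplot-style) representation, with hypothetical occupation labels and counts:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical table: husbands in rows, wives in columns; labels illustrative.
rows = ["farmer", "clerk", "manager"]
cols = ["farmer", "clerk", "manager"]
N = np.array([[120,  30, 10],
              [ 25, 140, 20],
              [ 15,  25, 90]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
F = U * sv / np.sqrt(r)[:, None]      # row (husband) coordinates
G = Vt.T * sv / np.sqrt(c)[:, None]   # column (wife) coordinates

# Plot both sets of points on the same factorial plane.
fig, ax = plt.subplots()
ax.scatter(F[:, 0], F[:, 1], color="blue", label="husband (rows)")
ax.scatter(G[:, 0], G[:, 1], color="red", label="wife (columns)")
for i, lab in enumerate(rows):
    ax.annotate(lab, F[i, :2])
for j, lab in enumerate(cols):
    ax.annotate(lab, G[j, :2])
ax.legend()
fig.savefig("ca_biplot.png")
```

Row and column categories that are strongly associated (here, the diagonal pairs) end up close together on the shared plane.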