# tecdat: PCA (principal component analysis) and CA (correspondence analysis) of husband and wife occupational differences in R, with mosaic plot visualization

Date: 2021-08-16

Principal component analysis is a commonly used dimensionality reduction method in data mining. It is a multivariate statistical method proposed by Pearson in 1901 and later developed by Hotelling in 1933. Its main purpose is "dimensionality reduction": by extracting the principal components that capture the largest individual differences, it can also be used to reduce the number of variables in regression and cluster analysis. In this respect it is similar to factor analysis.

Dimensionality reduction here means reducing the number of correlated variables, replacing the original variables with a smaller set of new ones. If the original variables are already orthogonal to each other, that is, uncorrelated, principal component analysis has nothing to compress.

Correspondence analysis (CA) is an extension of principal component analysis suited to large contingency tables formed by two qualitative (categorical) variables. This post analyzes the occupational differences between husbands and wives by applying principal component analysis to the row and column profiles of such a table.

# Husband and wife occupation data

Consider the following data on the occupations of married couples. We have the following frequency table:

``M <- read.table("data.csv", header = TRUE)``

Traditionally, for this kind of data, we use the chi-square test, chi-square distances, and chi-square contributions to examine how the table departs from independence:
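As a minimal base-R sketch of that chi-square machinery: the occupation labels and counts below are made up for illustration (the original `data.csv` is not reproduced here).

```r
# Hypothetical 3x3 husband/wife occupation table -- illustrative counts only
M <- matrix(c(20,  5,  3,
               4, 25,  6,
               2,  7, 18),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("farmer", "clerk", "manager"),
                            wife    = c("farmer", "clerk", "manager")))

ct <- chisq.test(M)
ct$statistic                        # chi-square statistic
ct$expected                         # expected counts under independence
(M - ct$expected)^2 / ct$expected   # cell-wise chi-square contributions
```

The cell-wise contributions show which husband/wife occupation pairs drive the overall statistic; here the diagonal cells dominate.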

``chisq.test(M)``

# Mosaic plot

A mosaic plot is often used to display categorical data. Its strength is that it shows the relationship between two or more categorical variables clearly; it can also be seen as a graphical rendering of a contingency table.

A mosaic plot can be used when the variables are categorical and there are two or more of them. In a mosaic plot, the area of each nested rectangle is proportional to the cell frequency in the multidimensional contingency table, and colors and shading represent the residuals of the fitted independence model.
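A sketch with base R's `mosaicplot()`, again on a hypothetical 3x3 occupation table (the counts are illustrative, not the original data):

```r
# Hypothetical husband/wife occupation counts -- illustrative only
M <- matrix(c(20,  5,  3,
               4, 25,  6,
               2,  7, 18),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("farmer", "clerk", "manager"),
                            wife    = c("farmer", "clerk", "manager")))

# shade = TRUE colours cells by Pearson residuals: blue where the observed
# count exceeds what independence predicts, red where it falls short
mosaicplot(M, shade = TRUE, main = "Husband vs. wife occupation (illustrative)")
```

With `shade = TRUE` the plot encodes exactly the "positive" and "negative" links discussed below.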

We can visualize the result with a mosaic plot.

``plot(t(M))``

The husband is in the rows and the wife in the columns. The important links are blue or red, corresponding to "positive" links (joint probability higher than under independence) or "negative" links (joint probability lower than under independence). In the other direction:

``plot(M)``

But the conclusion is the same either way: there are strong blue values on the diagonal. In other words, spouses tend to hold the same occupation.

# Principal component analysis and correspondence analysis

In correspondence analysis, we look at the table of probabilities, by row or by column. For example, we can define the row profiles, each row being a probability vector: ``L = N/apply(N,1,sum)``

Note that $L_{i,j} = n_{i,j}/n_{i,\cdot}$, so each row of $L$ sums to one. The center of gravity of these row profiles is the vector of column frequencies, $\bar L_j = n_{\cdot,j}/n$. In matrix form, the centred profiles are $L^0_{i,j} = L_{i,j} - \bar L_j$, which in R we can write as ``L0 = t(t(L) - Lbar)``

To each row (point) we associate its relative frequency $w_i = n_{i,\cdot}/n$ as a weight, which amounts to using the diagonal matrix $D_r = \mathrm{diag}(w_1, \dots, w_I)$. To measure the distance between two rows, we weight the Euclidean distance by the reciprocal of the column probabilities, giving the chi-square distance

$$d^2(i, i') = \sum_j \frac{(L_{i,j} - L_{i',j})^2}{\bar L_j}.$$

We then run a principal component analysis with these weights. In matrix terms, writing $D_c = \mathrm{diag}(\bar L)$, we diagonalise $D_c^{-1/2}\, {L^0}^\top D_r\, L^0\, D_c^{-1/2}$; its eigenvectors define the principal components, and the rows are projected onto the first two of them below.
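This construction can be checked step by step in base R. The counts below are hypothetical; the eigenvalues of the weighted matrix are the CA inertias, and they sum to $\chi^2/n$.

```r
# Hypothetical contingency table -- illustrative counts only
N <- matrix(c(20,  5,  3,
               4, 25,  6,
               2,  7, 18), nrow = 3, byrow = TRUE)
n    <- sum(N)
L    <- N / rowSums(N)      # row profiles: each row sums to 1
Lbar <- colSums(N) / n      # centre of gravity (column frequencies)

# Chi-square distance between rows i and ip:
# d^2(i, ip) = sum_j (L[i,j] - L[ip,j])^2 / Lbar[j]
chisq_dist <- function(i, ip) sum((L[i, ] - L[ip, ])^2 / Lbar)
chisq_dist(1, 2)

# Weighted PCA of the centred profiles: rows weighted by n_i./n,
# columns weighted by 1/Lbar -- this is the matrix CA diagonalises
L0 <- sweep(L, 2, Lbar)                  # centred row profiles
Dr <- diag(rowSums(N) / n)               # row weights
B  <- t(L0) %*% Dr %*% L0
M2 <- diag(1/sqrt(Lbar)) %*% B %*% diag(1/sqrt(Lbar))
ev <- eigen(M2)
ev$values                                # CA inertias; they sum to chi2/n
```

The last (third) eigenvalue is essentially zero, since a 3x3 table has at most two non-trivial dimensions.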

``PCA(L0, scale.unit = FALSE)`` (``PCA()`` is from the FactoMineR package; scaling is turned off because the profiles are already on a common scale.) The idea is to visualize the individuals corresponding to the rows. In a second step, we do the same thing for the columns, starting from the column profiles ``C = sweep(N, 2, colSums(N), "/")``

Center: ``C0=C-Cbar``
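A base-R sketch of the column-profile step (hypothetical counts again). `sweep()` is used for the division because `N / apply(N, 2, sum)` would recycle the column sums by row position rather than by column:

```r
# Hypothetical contingency table -- illustrative counts only
N <- matrix(c(20,  5,  3,
               4, 25,  6,
               2,  7, 18), nrow = 3, byrow = TRUE)
C    <- sweep(N, 2, colSums(N), "/")   # column profiles: each column sums to 1
Cbar <- rowSums(N) / sum(N)            # centre of gravity of the column profiles
C0   <- sweep(C, 1, Cbar)              # centred profiles (same as C - Cbar)
```

After centring, every column of `C0` sums to zero, which is what the subsequent unscaled PCA expects.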

# Principal component analysis

Then we can run a principal component analysis on the centred column profiles:

``PCA(C0, scale.unit = FALSE)``

Look at the resulting visualization of the individuals.

# Correspondence analysis

The magic of correspondence analysis is that the two projections, of rows and of columns, can be represented on the same plane.

``plot(C[, 1:2])`` gives the following result:

``afc <- CA(N)``
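For reference, the simultaneous row/column representation that `CA()` produces can be sketched in base R via the SVD of the standardized residuals (hypothetical counts, illustrative labels):

```r
# Hypothetical husband/wife occupation table -- illustrative only
N <- matrix(c(20,  5,  3,
               4, 25,  6,
               2,  7, 18), nrow = 3, byrow = TRUE,
            dimnames = list(c("farmer", "clerk", "manager"),
                            c("farmer", "clerk", "manager")))
P     <- N / sum(N)
rmass <- rowSums(P)                     # row masses
cmass <- colSums(P)                     # column masses

# SVD of the standardized residuals; sv$d^2 are the CA inertias
S  <- diag(1/sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1/sqrt(cmass))
sv <- svd(S)

# Principal coordinates of rows and columns
row_coord <- diag(1/sqrt(rmass)) %*% sv$u %*% diag(sv$d)
col_coord <- diag(1/sqrt(cmass)) %*% sv$v %*% diag(sv$d)

# Overlay both clouds on the first factorial plane
plot(rbind(row_coord[, 1:2], col_coord[, 1:2]), type = "n",
     xlab = "Dim 1", ylab = "Dim 2")
text(row_coord[, 1:2], labels = rownames(N), col = "blue")   # husbands (rows)
text(col_coord[, 1:2], labels = colnames(N), col = "red")    # wives (columns)
```

Matching occupations land close together on the plane, mirroring the strong diagonal seen in the mosaic plot.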
