Extension data tecdat: R language PCA (principal component analysis) and CA (correspondence analysis) of husband and wife occupational differences, with mosaic visualization

Time:2021-8-16

Original link:http://tecdat.cn/?p=22762 

Principal component analysis (PCA) is a commonly used dimensionality reduction algorithm in data mining. It is a multivariate statistical method proposed by Pearson in 1901 and further developed by Hotelling in 1933. Its main purpose is “dimensionality reduction”: the principal components capture the largest differences between individuals. It can also be used to reduce the number of variables in regression analysis and cluster analysis, in a way similar to factor analysis.

Dimensionality reduction means reducing the number of correlated variables by replacing the original variables with a smaller set of new ones. If the original variables are orthogonal to each other, that is, uncorrelated, principal component analysis brings no benefit.
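The point about orthogonal variables can be checked numerically. The sketch below is not from the original article; it assumes two made-up correlation matrices (`R_orth` and `R_corr` are hypothetical names) and uses the fact that PCA on standardized variables diagonalizes the correlation matrix.

```r
# Hypothetical correlation matrix of three mutually uncorrelated variables
R_orth <- diag(3)

# Hypothetical correlation matrix with pairwise correlation 0.9
R_corr <- matrix(0.9, nrow = 3, ncol = 3)
diag(R_corr) <- 1

# Each eigenvalue is the variance carried by one principal component
ev_orth <- eigen(R_orth, symmetric = TRUE)$values
ev_corr <- eigen(R_corr, symmetric = TRUE)$values

# Share of total variance per component
share_orth <- ev_orth / sum(ev_orth)  # equal shares: nothing to compress
share_corr <- ev_corr / sum(ev_corr)  # first component carries about 93%

print(round(share_orth, 3))
print(round(share_corr, 3))
```

With uncorrelated variables every component carries the same share of variance, so dropping any of them loses as much information as dropping an original variable.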

Correspondence analysis (CA) is an extension of principal component analysis suited to large contingency tables formed by two qualitative (categorical) variables. This article analyzes the differences between husbands' and wives' occupations through a principal component decomposition.

Husband and wife occupation data

Consider the following data on the occupations of couples. We have the following frequency table:

N = read.table("data.csv",header=TRUE)


Traditionally, for this kind of data, we use the chi-square test, the chi-square distance and the chi-square contributions to examine how far the data depart from independence.

chisq.test(M)
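As an illustration of the chi-square test and the cell contributions, here is a minimal sketch. The article's data file is not reproduced here, so `tab` is a made-up 4 x 4 table with hypothetical occupation labels; only base R's `chisq.test()` is used.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
tab <- matrix(c(30, 10,  5,  5,
                 8, 40, 12,  5,
                 4, 12, 36,  8,
                 6,  4, 10, 30),
              nrow = 4, byrow = TRUE,
              dimnames = list(husband = occ, wife = occ))

test <- chisq.test(tab)

expected <- test$expected      # counts expected under independence
resid    <- test$residuals     # Pearson residuals: (obs - exp) / sqrt(exp)
contrib  <- resid^2            # contribution of each cell to the statistic

print(test)
print(round(100 * contrib / sum(contrib), 1))  # contributions in percent
```

The squared Pearson residuals add up to the chi-square statistic, so the percentage table shows which cells drive the departure from independence.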


Mosaic plot

A mosaic plot is often used to display categorical data. Its strength is that it shows the relationship between two or more categorical variables clearly; it can be thought of as a graphical representation of a contingency table.

When the variables are categorical, even with three or more of them, a mosaic plot can be used. In the plot, the area of each nested rectangle is proportional to the cell frequency in the multidimensional contingency table. Colors and shading represent the residuals of the fitted model.

We can visualize the result with mosaic.

plot(tM)

The husband is in the rows and the wife is in the columns. The important associations are shown in blue or red, which correspond to “positive” associations (joint probability higher than under independence) or “negative” associations (joint probability lower than under independence).


In the other direction, with the table transposed:

plot(M)

But the conclusion is the same as before: there are strong blue values on the diagonal.


In other words, husbands and wives tend to have the same occupation.
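The mosaic plots described in this section can be sketched with base R's `mosaicplot()`. The table `tab` below is made up for illustration (hypothetical labels and counts), and `shade = TRUE` is an assumption about how the article's colored plots were produced.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
tab <- as.table(matrix(c(30, 10,  5,  5,
                          8, 40, 12,  5,
                          4, 12, 36,  8,
                          6,  4, 10, 30),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(husband = occ, wife = occ)))

# Shaded mosaic: blue cells have positive residuals under independence,
# red cells have negative residuals
mosaicplot(tab, shade = TRUE, main = "Husband by wife")

# Same data in the other direction
mosaicplot(t(tab), shade = TRUE, main = "Wife by husband")

# The residuals behind the shading; the diagonal is positive here
res <- chisq.test(tab)$residuals
print(round(res, 2))
```

In this made-up table the diagonal cells are the ones with large positive residuals, which is the pattern the article reports for its own data.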

Principal component analysis and correspondence analysis

In correspondence analysis, we look at the probability table by rows or by columns. For example, each row can be turned into a probability vector (a row profile) by dividing it by its row total:

$$L_i = \left(\frac{N_{i,j}}{N_{i,\cdot}}\right)_{j}$$

L = N/apply(N,1,sum)

Writing $D_r = \mathrm{diag}(N_{i,\cdot})$ for the diagonal matrix of row totals, the matrix of row profiles can be written as

$$L = D_r^{-1} N$$

The center of gravity of the row profiles, weighted by the row frequencies, is the column marginal distribution, where $n$ is the grand total:

$$\bar{L} = \left(\frac{N_{\cdot,j}}{n}\right)_{j}$$

Subtracting it from every row profile gives the centered profiles, which in matrix form reads $L_0 = L - \mathbf{1}\bar{L}^\top$.

Lbar = apply(N,2,sum)/sum(N)
L0 = t(t(L)-Lbar)
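The row profiles and their centering can be checked on a small example. This is a sketch on a made-up table `N` (hypothetical labels and counts, not the article's data); `r_w` is a name introduced here for the row weights.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
N <- matrix(c(30, 10,  5,  5,
               8, 40, 12,  5,
               4, 12, 36,  8,
               6,  4, 10, 30),
            nrow = 4, byrow = TRUE,
            dimnames = list(husband = occ, wife = occ))

L    <- N / rowSums(N)        # row profiles: row i divided by its total
r_w  <- rowSums(N) / sum(N)   # weight (relative frequency) of each row
Lbar <- colSums(N) / sum(N)   # column marginal distribution

# The weighted mean of the row profiles is the column marginal
center <- colSums(r_w * L)

# Centered row profiles: subtract Lbar from every row
L0 <- sweep(L, 2, Lbar)

print(round(L, 3))
print(round(rbind(center, Lbar), 3))
```

`sweep(L, 2, Lbar)` subtracts the vector column by column, which avoids the double transpose and makes the intended direction explicit.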

Each row profile is given its relative frequency $N_{i,\cdot}/n$ as a weight, which is equivalent to using the diagonal matrix $\mathrm{diag}(N_{i,\cdot}/n)$. To measure the distance between two row profiles $L_i$ and $L_{i'}$, with $L_{i,j} = N_{i,j}/N_{i,\cdot}$, we weight the Euclidean distance by the reciprocal of the column probability $N_{\cdot,j}/n$. The distance between two rows is then

$$d^2(L_i, L_{i'}) = \sum_{j} \frac{n}{N_{\cdot,j}} \left(L_{i,j} - L_{i',j}\right)^2$$
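The weighted (chi-square) distance between row profiles can be computed directly. The sketch below uses a made-up table `N` (hypothetical labels and counts); `chisq_dist` is a helper defined here for illustration, not a library function.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
N <- matrix(c(30, 10,  5,  5,
               8, 40, 12,  5,
               4, 12, 36,  8,
               6,  4, 10, 30),
            nrow = 4, byrow = TRUE,
            dimnames = list(husband = occ, wife = occ))

L    <- N / rowSums(N)        # row profiles
Lbar <- colSums(N) / sum(N)   # column probabilities

# Chi-square distance between rows i and k:
# squared differences are weighted by 1 / column probability
chisq_dist <- function(i, k) sqrt(sum((L[i, ] - L[k, ])^2 / Lbar))

# All pairwise distances at once: rescale columns by 1 / sqrt(Lbar),
# then take the ordinary Euclidean distance
D <- as.matrix(dist(sweep(L, 2, sqrt(Lbar), "/")))

print(chisq_dist(1, 2))
print(round(D, 3))
```

Rescaling the columns first turns the weighted distance into a plain Euclidean one, which is why a PCA on suitably rescaled profiles reproduces it.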

We then run a principal component analysis with these weights. From a matrix perspective, this amounts to diagonalizing the weighted covariance matrix of the centered row profiles: each eigenvector defines a principal axis, and the corresponding principal component is the projection of the centered profiles onto that axis.

The projection of the rows on the first two components is given here:

library(FactoMineR)  # assumed source of PCA() and CA()
PCA(L0,scale.unit=FALSE)
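To make the weighted PCA concrete, here is a base R sketch that diagonalizes the weighted covariance matrix of the centered row profiles. It does not use the `PCA()` call above; the table `N` is made up, and `r_w`, `c_w`, `Z` and `A` are names introduced for this illustration.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
N <- matrix(c(30, 10,  5,  5,
               8, 40, 12,  5,
               4, 12, 36,  8,
               6,  4, 10, 30),
            nrow = 4, byrow = TRUE,
            dimnames = list(husband = occ, wife = occ))

n   <- sum(N)
r_w <- rowSums(N) / n                 # row weights
c_w <- colSums(N) / n                 # column probabilities

L0 <- sweep(N / rowSums(N), 2, c_w)   # centered row profiles
Z  <- sweep(L0, 2, sqrt(c_w), "/")    # columns rescaled for the chi-square metric

A   <- t(Z) %*% (r_w * Z)             # weighted covariance matrix
eig <- eigen(A, symmetric = TRUE)

comp <- Z %*% eig$vectors[, 1:2]      # first two principal components of the rows
rownames(comp) <- rownames(N)

print(round(eig$values, 4))
print(round(comp, 3))
```

The eigenvalues add up to the chi-square statistic divided by the grand total, which ties this decomposition back to the independence test.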


The idea is to visualize the individuals corresponding to the rows. In a second step, we do the same thing for the columns, starting from the column profiles:

C = t(t(N)/apply(N,2,sum))

Center them on their center of gravity, the row marginal distribution:

Cbar = apply(N,1,sum)/sum(N)
C0=C-Cbar
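Column profiles are easy to get wrong because R recycles vectors down the columns of a matrix. The sketch below, on a made-up table `N` (hypothetical labels and counts), compares two ways of computing them correctly and checks the centering.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
N <- matrix(c(30, 10,  5,  5,
               8, 40, 12,  5,
               4, 12, 36,  8,
               6,  4, 10, 30),
            nrow = 4, byrow = TRUE,
            dimnames = list(husband = occ, wife = occ))

# Column profiles: each column divided by its own total
C     <- prop.table(N, margin = 2)
C_alt <- t(t(N) / colSums(N))      # same result via a double transpose

c_w  <- colSums(N) / sum(N)        # column weights
Cbar <- rowSums(N) / sum(N)        # row marginal distribution

# Cbar has one entry per row, so plain recycling subtracts it
# from every column, which is what centering requires here
C0 <- C - Cbar

print(round(C, 3))
print(round(C0 %*% c_w, 12))       # weighted mean of centered columns
```

Writing `N / colSums(N)` directly would divide row i by the total of column i, which runs without error on a square table but does not give column profiles.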

Principal component analysis

Then we can run a principal component analysis:

PCA(matC0)

Look at the plot of individuals.


Correspondence analysis

The magic of correspondence analysis is that we “can” represent both projections, rows and columns, on the same plane.

> plot(C[,1:2])


The result is as follows:

> afc=CA(N)
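For readers who want to see what a correspondence analysis computes, here is a base R sketch using a singular value decomposition of the standardized residuals. It is not the implementation behind the `CA()` call above; the table `N` is made up, and names such as `row_coord` and `col_coord` are introduced here.

```r
# Hypothetical husband-by-wife occupation counts (not the article's data)
occ <- c("farmer", "clerk", "manager", "worker")
N <- matrix(c(30, 10,  5,  5,
               8, 40, 12,  5,
               4, 12, 36,  8,
               6,  4, 10, 30),
            nrow = 4, byrow = TRUE,
            dimnames = list(husband = occ, wife = occ))

P   <- N / sum(N)                  # joint probabilities
r_w <- rowSums(P)                  # row masses
c_w <- colSums(P)                  # column masses

# Standardized residuals from the independence model
S   <- (P - r_w %o% c_w) / sqrt(r_w %o% c_w)
dec <- svd(S)

# Principal coordinates of rows and columns on the same axes
row_coord <- (dec$u / sqrt(r_w)) %*% diag(dec$d)
col_coord <- (dec$v / sqrt(c_w)) %*% diag(dec$d)

# Joint display on the first two dimensions
plot(rbind(row_coord[, 1:2], col_coord[, 1:2]), type = "n",
     xlab = "Dim 1", ylab = "Dim 2")
abline(h = 0, v = 0, lty = 3)
text(row_coord[, 1:2], labels = paste0("H:", rownames(N)), col = "blue")
text(col_coord[, 1:2], labels = paste0("W:", colnames(N)), col = "red")

print(round(dec$d^2, 4))           # principal inertias
```

Rows and columns share the same singular values, which is what allows both sets of points to be drawn on one plane.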
