## Original link: http://tecdat.cn/?p=22762

Principal component analysis (PCA) is a dimensionality reduction algorithm commonly used in data mining. It is a multivariate statistical method proposed by Pearson in 1901 and further developed by Hotelling in 1933. Its main purpose is dimensionality reduction: the principal components capture the largest differences between individuals, and, like factor analysis, PCA can also be used to reduce the number of variables in regression analysis and cluster analysis.

Dimensionality reduction means reducing the number of correlated variables by replacing the original variables with a smaller set of new ones. If the original variables are already orthogonal to each other, that is, uncorrelated, principal component analysis brings no benefit.

Correspondence analysis (CA) is an extension of principal component analysis suited to large contingency tables formed by two qualitative (categorical) variables. This article analyzes the differences in occupation between husbands and wives through a principal component decomposition.

# Husband and wife occupation data

Consider the following data on the occupations of married couples. We have the following frequency table:

`read.table("data.csv",header=TRUE)`

Traditionally, for this kind of data, we use the chi-square test, chi-square distances, and chi-square contributions to examine how the data depart from independence:

`chisq.test(M)`
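As a minimal sketch of what the test, the expected counts, and the per-cell contributions look like, the following uses a small table whose counts and category names are invented for illustration (they are not the article's data). Only base R is used.

```r
# Hypothetical 3x3 table: husband's occupation in rows, wife's in columns.
# The counts are made up for illustration only.
M <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

test <- chisq.test(M)

# Expected counts under independence: (row total * column total) / n
E <- outer(rowSums(M), colSums(M)) / sum(M)

# Pearson residuals and the contribution of each cell to the statistic
res     <- (M - E) / sqrt(E)
contrib <- res^2

test$statistic   # chi-square statistic reported by the test
sum(contrib)     # the same statistic rebuilt cell by cell
```

Cells with a large contribution are the ones that depart most from independence; the sign of the residual tells whether the cell is over- or under-represented.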

# Mosaic plot

A mosaic plot is often used to display categorical data. Its strength is that it shows the relationship between two or more categorical variables well; it can be thought of as a graphical display of a contingency table.

A mosaic plot can be used whenever the variables are categorical, even when there are more than three of them. In the plot, the area of each nested rectangle is proportional to the cell frequency, that is, the frequency in the multidimensional contingency table. Colors and shading represent the residuals of the fitted model.

We can visualize the result with a mosaic plot.

`plot(tM)`

The husband is in the rows and the wife is in the columns. The important associations are shown in blue or red, corresponding to “positive” associations (joint probability higher than under independence) or “negative” associations (joint probability lower than under independence).

In the other direction

`plot(M)`

The conclusion is the same as before: there are strong blue values on the diagonal.

In other words, spouses tend to work in the same occupational category.
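The mosaic plots described above can be sketched with base R's `mosaicplot()`, where `shade = TRUE` colors the cells by the residuals of the independence model. The table below is invented for illustration and is not the article's data.

```r
# Hypothetical 3x3 table (husband in rows, wife in columns); counts are made up.
M <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

# Shaded mosaic plot: cell area is proportional to the cell frequency,
# shading follows the Pearson residuals of the independence model
mosaicplot(M, shade = TRUE, main = "Husband in rows, wife in columns")

# The other direction: transpose the table
mosaicplot(t(M), shade = TRUE, main = "Wife in rows, husband in columns")

# The residuals behind the shading; positive values on the diagonal mean
# that same-occupation couples are more frequent than under independence
res <- chisq.test(M)$residuals
round(res, 2)
```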

# Principal component analysis and correspondence analysis

In correspondence analysis, we look at the table of probabilities, either by row or by column. For example, we can define the row profiles, which are probability vectors:

`L=N/apply(N,1,sum)`

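Writing $D_r = \mathrm{diag}(n_{i\cdot})$ for the diagonal matrix of row totals, we can write $L = D_r^{-1} N$.

The center of gravity of the row profiles, each weighted by its relative frequency $n_{i\cdot}/n$, is the vector of column marginal frequencies, $\bar{L}_j = n_{\cdot j}/n$.

Similarly, writing $D_c = \mathrm{diag}(n_{\cdot j})$ for the diagonal matrix of column totals, this can be written in matrix form as $\bar{L} = \frac{1}{n} D_c \mathbf{1}$.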

`L0=t(t(L)-Lbar)`
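The row profiles, their center of gravity, and the centering step can be checked on a small example. This is a sketch on an invented table (not the article's data); the names `r` and `Lbar` are chosen here for illustration.

```r
# Hypothetical 3x3 table (husband in rows, wife in columns); counts are made up.
N <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

# Row profiles: each row divided by its total, so each row sums to 1
L <- N / rowSums(N)

# Row weights (relative frequency of each row)
r <- rowSums(N) / sum(N)

# Center of gravity: the column marginal frequencies
Lbar <- colSums(N) / sum(N)

# The weighted average of the row profiles gives back the center of gravity
colSums(r * L)

# Centered row profiles: subtract Lbar from every row
L0 <- sweep(L, 2, Lbar)
round(L0, 3)
```

Using `sweep(L, 2, Lbar)` gives the same matrix as `t(t(L) - Lbar)` while keeping the rows as individuals.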

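To each point we attach its relative frequency $r_i = n_{i\cdot}/n$ as a weight, which is equivalent to using the diagonal matrix of these weights. To measure the distance between two points, we weight the Euclidean distance by the reciprocal of the column probabilities $c_j = n_{\cdot j}/n$. The (chi-square) distance between two rows $i$ and $k$ is then $d^2(L_i, L_k) = \sum_j \frac{(L_{ij} - L_{kj})^2}{c_j}$.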
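A short sketch of this weighted distance, again on an invented table (not the article's data). The helper `chi2_dist` is a hypothetical name introduced here. The last lines check a standard identity: the weighted sum of squared distances to the center of gravity equals the chi-square statistic divided by the total count.

```r
# Hypothetical 3x3 table (husband in rows, wife in columns); counts are made up.
N <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

L  <- N / rowSums(N)          # row profiles
r  <- rowSums(N) / sum(N)     # row weights
cj <- colSums(N) / sum(N)     # column probabilities

# Chi-square distance between row profiles i and k (hypothetical helper)
chi2_dist <- function(i, k) sqrt(sum((L[i, ] - L[k, ])^2 / cj))
chi2_dist(1, 2)

# All pairwise distances at once: rescale each column by 1/sqrt(cj),
# then the ordinary Euclidean distance is the chi-square distance
D <- dist(sweep(L, 2, sqrt(cj), "/"))
D

# Total inertia: weighted squared distances to the center of gravity
d2_to_center <- rowSums(sweep(sweep(L, 2, cj)^2, 2, cj, "/"))
inertia <- sum(r * d2_to_center)
inertia
```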

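We then run a principal component analysis using these weights and this metric. In matrix terms, this amounts to an eigendecomposition of the weighted covariance matrix of the centered row profiles.

Denoting the eigenvectors by $u_k$, we define the principal components as the projections of the centered profiles onto them.

The projection of the rows onto the first two components is given here: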

`PCA(L0,scal=FALSE)`
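To make the weighted PCA of the row profiles concrete without relying on a particular package, here is a base R sketch using `eigen()`. It assumes row weights equal to the row frequencies and the inverse column probabilities as metric, as described above. The table is invented, and this is not necessarily how the `PCA()` function called in the article proceeds internally.

```r
# Hypothetical 3x3 table (husband in rows, wife in columns); counts are made up.
N <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

n  <- sum(N)
r  <- rowSums(N) / n                    # row weights
cj <- colSums(N) / n                    # column probabilities
L0 <- sweep(N / rowSums(N), 2, cj)      # centered row profiles

# Rescale columns so that the chi-square metric becomes the Euclidean one
X <- sweep(L0, 2, sqrt(cj), "/")

# Weighted covariance matrix (each row i counts with weight r[i])
A <- t(X) %*% (r * X)

dec <- eigen(A, symmetric = TRUE)
eig <- pmax(dec$values, 0)              # clip tiny negative rounding errors

# Principal components: projections of the rows onto the eigenvectors
comp <- X %*% dec$vectors
round(comp[, 1:2], 3)                   # coordinates on the first two axes
round(eig / sum(eig), 3)                # share of inertia carried by each axis
```

The eigenvalues add up to the total inertia, which is the chi-square statistic divided by the total count, so the axes can be read as a decomposition of the departure from independence.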

The idea is to visualize the individuals corresponding to the rows. In a second step, we do the same thing for the columns:

`C=t(t(N)/apply(N,2,sum))`

Center:

`C0=C-Cbar`

# Principal component analysis

Then we can do a principal component analysis

`PCA(t(C0),scal=FALSE)`

Look at the plot of the individuals.

# Correspondence analysis

The magic of correspondence analysis is that we “can” represent both projections of the individuals, rows and columns, on the same plane.

`> plot(C[,1:2])`

The result is as follows:

`> afc=CA(N)`
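For readers who want to see what a correspondence analysis computes, the following base R sketch obtains row and column coordinates from a singular value decomposition of the standardized residuals and draws both sets of points on the same plane. The table is invented. The `CA()` call above presumably comes from a package such as FactoMineR (an assumption, since the article does not say), and this sketch does not claim to reproduce its output format.

```r
# Hypothetical 3x3 table (husband in rows, wife in columns); counts are made up.
N <- matrix(c(40, 10,  5,
              12, 30,  8,
               6,  9, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(husband = c("manual", "clerical", "professional"),
                            wife    = c("manual", "clerical", "professional")))

P  <- N / sum(N)              # joint probabilities
r  <- rowSums(P)              # row masses
cj <- colSums(P)              # column masses

# Standardized residuals from independence
S <- (P - outer(r, cj)) / sqrt(outer(r, cj))

dec <- svd(S)

# Principal coordinates of rows (husbands) and columns (wives)
row_coord <- sweep(dec$u, 1, sqrt(r),  "/") %*% diag(dec$d)
col_coord <- sweep(dec$v, 1, sqrt(cj), "/") %*% diag(dec$d)

# Both clouds on the same plane
plot(rbind(row_coord, col_coord)[, 1:2], type = "n",
     xlab = "Dimension 1", ylab = "Dimension 2")
text(row_coord[, 1:2], labels = paste0("H:", rownames(N)), col = "blue")
text(col_coord[, 1:2], labels = paste0("W:", colnames(N)), col = "red")
abline(h = 0, v = 0, lty = 2)
```

A husband category and a wife category that appear close together on this plane are associated more often than independence would predict.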
