tecdat: K-means clustering, hierarchical clustering, principal component (PCA) dimensionality reduction and visualization of the iris dataset in R

Time: 2021-7-26

Original link: http://tecdat.cn/?p=22838

Problems using the iris dataset in R:

(a) Part: K-means clustering
  Cluster the data into two groups with the K-means method.
  Draw a graph to show the clustering.
  Cluster the data into three groups with the K-means method.
  Draw a graph to show the clustering.
(b) Part: hierarchical clustering
  Cluster the observations using complete linkage.
  Cluster the observations using average and single linkage.
  Draw dendrograms for the above clustering methods.
 

Q01: use the iris dataset built into R.

(a): K-means clustering

Discuss and/or consider standardizing the data.

data.frame(
  "Average" = apply(iris[, 1:4], 2, mean),
  "Standard deviation" = apply(iris[, 1:4], 2, sd)
)


In this case we will standardize the data, because petal width is on a much smaller scale than the other measurements.
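As a minimal sketch of what standardization means here (using only base R and the built-in iris data), the four measurements can be centred and scaled with scale(), after which every column has mean 0 and standard deviation 1:

# Standardize the four numeric measurements
iris.scaled <- scale(iris[, 1:4])

# Each column now has (approximately) mean 0 and standard deviation 1
round(colMeans(iris.scaled), 10)
apply(iris.scaled, 2, sd)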

Cluster the data into two groups with the K-means method

Using a sufficiently large nstart makes it much more likely that we find the solution with the smallest total within-cluster sum of squares (the k-means objective).

# K-means with two clusters on the standardized measurements
km2 <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 100)
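To see the effect of nstart described above, one optional check (a sketch assuming the scaled iris measurements used here) is to compare the total within-cluster sum of squares for a single random start against many starts:

# With many random starts the objective is usually no worse than with one
set.seed(1)
km.one  <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 1)
km.many <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 100)
c(one_start = km.one$tot.withinss, many_starts = km.many$tot.withinss)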

Draw a graph to show the clustering

# Draw the data, coloured by the two-cluster assignment
plot(iris$Sepal.Width, iris$Sepal.Length, col = km2$cluster,
     xlab = "Sepal.Width", ylab = "Sepal.Length")


To take the petal length and width into account as well, it is more appropriate to first reduce the dimension with PCA.

# Create the PCA model on the standardized measurements
pca.mod <- prcomp(iris[, 1:4], scale. = TRUE)

# Put the predicted group last
pca.df <- as.data.frame(pca.mod$x)
pca.df$Pred <- factor(km2$cluster)

# Draw a chart of the first two principal components
plot(pca.df$PC2, pca.df$PC1, col = pca.df$Pred,
     xlab = "PC2", ylab = "PC1")


To interpret the PCA plot better, consider the variance explained by each principal component.

## Look at the variance explained by the principal components
var.exp <- data.frame(PC = character(4),
                      Variance.ratio = pca.mod$sdev^2 / sum(pca.mod$sdev^2))
for (i in 1:nrow(var.exp)) {
  var.exp[["PC"]][i] <- paste("PC", i)
}


# Variance ratio for each principal component (scree-style plot)
plot(var.exp$Variance.ratio, type = "b", xlab = "Principal component", ylab = "Variance ratio")


The first two principal components explain 80% of the variance in the data, so this is a very good two-dimensional visualization of it.
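To see the exact proportion of variance captured by each component (a minimal check, assuming the pca.mod object created above), summary() reports it directly, and it can also be computed by hand from the component standard deviations:

# Proportion and cumulative proportion of variance for each component
summary(pca.mod)

# The same quantity computed by hand
cumsum(pca.mod$sdev^2) / sum(pca.mod$sdev^2)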

Cluster the data into three groups with the K-means method

In the previous principal-component plot the clusters already look quite distinct, and since we know there should in fact be three groups, we can fit a three-cluster model.

# K-means with three clusters on the standardized measurements
km3 <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 100)

# Predicted cluster for each observation
groupPred <- factor(km3$cluster)
print(groupPred)
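Because the true species labels are available for this dataset, a quick, purely illustrative sanity check is to cross-tabulate the predicted clusters against Species:

# Cross-tabulate the k-means assignments with the known species labels
table(Cluster = groupPred, Species = iris$Species)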


Draw a graph to show the clustering

# Draw the data, coloured by the three-cluster assignment
plot(iris$Sepal.Width, iris$Sepal.Length, col = groupPred,
     xlab = "Sepal.Width", ylab = "Sepal.Length")


PCA plot

As before, to take the petal measurements into account as well, it is better to reduce the dimension with PCA first.

# Create the PCA model
pca.mod <- prcomp(iris[, 1:4], scale. = TRUE)

# Put the predicted group last
pca.df <- as.data.frame(pca.mod$x)
pca.df$kmeansPred <- groupPred

# Draw a chart: the first two principal components of the iris data,
# coloured by the k-means predicted cluster
# (the original figure also drew 90% normal-confidence ellipses around each cluster)
plot(pca.df$PC2, pca.df$PC1, col = pca.df$kmeansPred,
     xlab = "PC2", ylab = "PC1",
     main = "First two principal components, coloured by predicted cluster")


PCA biplot

The sepal length versus sepal width plot separates the groups reasonably well. To decide which variables to put on the x and y axes, we can use a biplot.

biplot(pca.mod)
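To read the biplot numerically, we can also inspect the loadings (the rotation matrix) of the PCA model; this is a small sketch assuming the pca.mod object from above:

# Variable loadings on the first two principal components
round(pca.mod$rotation[, 1:2], 2)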


The biplot shows that petal length and sepal width explain most of the variation in the data, so a more appropriate chart is:

plot(iris$Petal.Length, iris$Sepal.Width, col = groupPred,
     xlab = "Petal.Length", ylab = "Sepal.Width")


Evaluate all possible variable combinations.

library(tidyverse)
iris %>%
  mutate(kmeansPred = groupPred) %>%
  pivot_longer(cols = Sepal.Length:Petal.Width) %>%
  ggplot(aes(kmeansPred, value, colour = kmeansPred)) +
  geom_jitter(width = 0.2) + facet_grid(name ~ ., scales = "free_y")
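A simpler base-R alternative that also looks at every pairwise combination of variables is a scatterplot matrix coloured by the predicted cluster; this is a sketch assuming the groupPred factor from the three-cluster model above:

# Scatterplot matrix of the four measurements, coloured by k-means cluster
pairs(iris[, 1:4], col = groupPred, pch = 19)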


(b): hierarchical clustering

Cluster the observations using complete linkage.

The observations can be clustered with complete linkage (note that the data are standardized first).

dst <- dist(scale(iris[, 1:4]))  # distances on the standardized measurements
hc.complete <- hclust(dst, method = 'complete')

Cluster the observations using average and single linkage.

hc.average <- hclust(dst, method = 'average')
hc.single <- hclust(dst, method = 'single')

Draw the prediction chart

Now that the models have been fitted, the dendrogram can be cut by specifying the required number of groups.

# Cut the complete-linkage dendrogram into three groups
hclustPred <- cutree(hc.complete, k = 3)
iris$hclustPred <- factor(hclustPred)

# Draw the data, coloured by the predicted cluster
plot(iris$Sepal.Width, iris$Sepal.Length, col = iris$hclustPred,
     xlab = "Sepal.Width", ylab = "Sepal.Length")
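To compare the hierarchical solution with the earlier k-means solution, a quick, purely illustrative check (assuming the hclustPred and groupPred objects defined above) is a cross-tabulation of the two sets of labels:

# Agreement between complete-linkage and k-means cluster assignments
table(Hierarchical = hclustPred, KMeans = groupPred)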


Draw dendrograms for the above clustering methods

Plot the three dendrograms (shading of the clusters is shown in the sketch after the code).

type <- c('Average', 'Complete', 'Single')
models <- list(hc.average, hc.complete, hc.single)
for (i in seq_along(models)) plot(models[[i]], cex = 0.3, main = paste(type[i], 'linkage'))
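The note about shading the dendrograms can be realized with base R's rect.hclust(), which draws boxes around a chosen number of clusters; a minimal sketch for the complete-linkage tree, assuming the hc.complete object defined above:

# Highlight three clusters on the complete-linkage dendrogram
plot(hc.complete, cex = 0.3, main = 'Complete linkage')
rect.hclust(hc.complete, k = 3, border = 2:4)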


Most popular insights

1. R language K-shape algorithm for stock price time series clustering

2. Comparison of different types of clustering methods in R language

3. K-medoids clustering modeling and GAM regression of power load time series data in R language

4. Hierarchical clustering of the iris data set in R language

5. Python Monte Carlo K-means clustering practice

6. Web comment text mining and clustering with R

7. Python for NLP: multi-label text LSTM neural network using Keras

8. Analysis of the MNIST data set with R language and exploration of handwritten digit classification data

9. Deep learning image classification on a small data set based on Keras in R language