# tecdat: K-means clustering, hierarchical clustering, and PCA dimensionality reduction with visual analysis of the iris dataset in R

Time: 2021-07-26

## Problems using the iris dataset in R

(a) Part: K-means clustering
Cluster the data into two groups with K-means and draw a graph showing the clustering.
Cluster the data into three groups with K-means and draw a graph showing the clustering.
(b) Part: hierarchical clustering
Cluster the observations using complete linkage.
Cluster the observations using average and single linkage.
Draw the dendrograms for the above clustering methods.

## Q01: use the iris dataset built into R.

(a) : K-means clustering

Discuss and/or consider standardizing the data.

```r
data.frame(
  "Average"            = apply(iris[, 1:4], 2, mean),
  "Standard deviation" = apply(iris[, 1:4], 2, sd)
)
```

In this case we will standardize the data, because petal width is measured on a much smaller scale than the other variables.
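As a quick sketch using base R's `scale()`, standardization gives every column mean 0 and standard deviation 1:

```r
# Standardize the four numeric columns of iris
input <- scale(iris[, 1:4])

# Each column now has mean ~0 and standard deviation 1
round(colMeans(input), 10)
apply(input, 2, sd)
```

The scaled matrix `input` is what the clustering functions below should receive, so that no single variable dominates the distance calculations.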

## Cluster the data into two groups with K-means

With a sufficiently large `nstart`, the algorithm is more likely to find the solution with the minimum total within-cluster sum of squares.

```r
set.seed(1)
km2 <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 100)
```
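To illustrate the effect of `nstart` (a sketch; the seed is illustrative), compare the objective for a single random start against many restarts. With restarts, `kmeans` keeps the best of the candidate solutions, so the objective can only be as good or better:

```r
set.seed(42)
km_one  <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 1)
km_many <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 100)

# Total within-cluster sum of squares: many restarts never do worse
km_one$tot.withinss >= km_many$tot.withinss
```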

## Draw a graph to show the clustering

```r
library(ggplot2)

# Draw data, colored by predicted cluster
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, col = factor(km2$cluster))) +
  geom_point()
```

To also take the petal length and width into account, it is more appropriate to first reduce the dimension with PCA.

```r
# Create model
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Put the predicted groups in a data frame with the scores
pcadf <- data.frame(pca$x, Pred = factor(km2$cluster))

# Draw a chart
ggplot(pcadf, aes(x = PC2, y = PC1, col = Pred)) +
  geom_point()
```

To better interpret the PCA plot, consider the variance explained by each principal component.

```r
# Look at the variance explained by the principal components
pcadat <- data.frame(
  PC       = paste0("PC", seq_along(pca$sdev)),
  VarRatio = pca$sdev^2 / sum(pca$sdev^2)
)

ggplot(pcadat, aes(x = PC, y = VarRatio, group = 1)) +
  geom_point() +
  geom_line()
```

More than 80% of the variance in the data is explained by the first two principal components, so this two-dimensional view is a faithful visualization.
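The claim can be checked directly from the fitted model (a sketch assuming the PCA was fitted with `prcomp` on the scaled data):

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion and cumulative proportion of variance explained
var_ratio <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_ratio)
```

The second entry of the cumulative sum is the fraction of total variance captured by the first two components.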

## Cluster the data into three groups with K-means

In the principal component plot above, the clustering looks very clear, and since we know there should in fact be three groups, we can fit a three-cluster model.

```r
set.seed(1)
km3 <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 100)

# Predicted groups
print(km3$cluster)
```
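Since the true species labels are available in `iris$Species`, a cross-tabulation is a quick sanity check on the three-cluster solution (a sketch; cluster numbering is arbitrary, so only the shape of the table matters):

```r
set.seed(1)
km3 <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 100)

# Rows: predicted cluster; columns: true species (50 flowers each)
table(km3$cluster, iris$Species)
```

A good solution shows each column concentrated in one row; setosa is typically recovered perfectly, while versicolor and virginica overlap somewhat.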

## Draw a graph to show the clustering

```r
# Draw data
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, col = factor(km3$cluster))) +
  geom_point()
```

## PCA diagram

To also take the petal length and width into account, it is again more appropriate to reduce the dimension with PCA first.

```r
# Create model
prcomp_mod <- prcomp(iris[, 1:4], scale. = TRUE)

# Put the predicted groups last
pcadf <- data.frame(prcomp_mod$x, Pred = factor(km3$cluster))

# Draw a chart
ggplot(pcadf, aes(x = PC2, y = PC1, col = Pred)) +
  geom_point() +
  stat_ellipse(type = "norm", level = 0.9) +
  labs(col = "Predicted\ncluster",
       caption = "First two principal components of the iris data; ellipses are 90% normal confidence regions; groups are the k-means predictions")
```

## PCA biplot

The sepal length vs. sepal width plot separates the groups reasonably well. To choose which variables to put on x and y, we can use a biplot.

``biplot(pca)``

The biplot shows that petal length and sepal width explain most of the variation in the data. A more appropriate chart is:

```r
plot(iris[, 1:4], col = km3$cluster)
```

Evaluate all possible combinations.

```r
library(dplyr)
library(tidyr)

iris %>%
  mutate(Pred = factor(km3$cluster)) %>%
  pivot_longer(cols = Sepal.Length:Petal.Width) %>%
  ggplot(aes(x = Pred, y = value, col = Pred)) +
  geom_point() +
  facet_grid(name ~ ., scales = "free_y")
```

# Hierarchical clustering

## Cluster the observations using complete linkage

The observations can be clustered with complete linkage (note that the data should be standardized first).

```r
dst <- dist(scale(iris[, 1:4]))
hc_complete <- hclust(dst, method = "complete")
```

## Cluster the observations using average and single linkage

```r
hc_average <- hclust(dst, method = "average")
hc_single  <- hclust(dst, method = "single")
```

## Draw prediction chart

Now that the models are fitted, the dendrogram can be cut into the required number of groups.

```r
# Cut the tree into three groups
groupPred <- cutree(hc_complete, k = 3)

# Draw data
plot(iris[, 1:4], col = groupPred)
```

## Draw the dendrograms of the above clustering methods

Plot a dendrogram for each linkage method.

```r
models <- list(average  = hc_average,
               complete = hc_complete,
               single   = hc_single)

for (type in names(models)) plot(models[[type]], cex = 0.3, main = type)
```
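As with K-means, the hierarchical solutions can be compared against the species labels (a sketch using a three-group cut of each tree):

```r
dst <- dist(scale(iris[, 1:4]))
hcs <- list(complete = hclust(dst, method = "complete"),
            average  = hclust(dst, method = "average"),
            single   = hclust(dst, method = "single"))

# Three-group cut of each dendrogram vs. true species
lapply(hcs, function(hc) table(cutree(hc, k = 3), iris$Species))
```

Single linkage tends to produce "chained" clusters and usually matches the species worse than complete or average linkage on this dataset.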
