# tecdat: A case study of text mining and hierarchical clustering visualization of a novel with Python and R

Time: 2021-7-28

## Original source: Tuoduan Data Tribe (tecdat) official account

Catch-22 is a novel by the American writer Joseph Heller. Set during the Second World War, it reveals an irrational, disorderly, nightmarish and absurd world through a series of events involving a U.S. Army Air Forces squadron stationed on a Mediterranean island called Pianosa (fictionalized by the author). I like the creative use of language and the interplay of absurd characters throughout the book. This article applies text mining and visualization to the novel.

## Data set

The novel has about 175,000 words and is divided into 42 chapters. I found a plain-text version of the book online.

I use a combination of regular expressions and simple string matching to parse text in Python.
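The parsing step is not shown in the article, so the following is only a minimal sketch of how regular expressions and simple string matching could be combined. The sample text, the `CHAPTER <n>` heading format, the character list, and the function names `split_chapters` and `find_mentions` are all assumptions made for illustration.

```python
import re

# Hypothetical miniature text; the real input would be the full novel.
SAMPLE = """CHAPTER 1
Yossarian was in the hospital. The chaplain visited Yossarian.
CHAPTER 2
Orr shared a tent with Yossarian. Milo was away.
"""

# Hypothetical list of names to search for.
CHARACTERS = ["Yossarian", "Orr", "Milo", "chaplain"]


def split_chapters(text):
    """Split on 'CHAPTER <n>' heading lines and return {chapter_number: body}."""
    parts = re.split(r"^CHAPTER[ \t]+(\d+)[ \t]*$", text, flags=re.MULTILINE)
    # With one capture group, re.split yields [preamble, num, body, num, body, ...].
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def find_mentions(chapters, names):
    """Return sorted (position, name) pairs.

    position = chapter number + fractional offset of the match inside the
    chapter, so it can be used directly as a time-like x coordinate.
    Plain substring matching is used, so 'Orr' would also match 'Orrery'.
    """
    mentions = []
    for number, body in chapters.items():
        for name in names:
            start = body.find(name)
            while start != -1:
                mentions.append((number + start / len(body), name))
                start = body.find(name, start + 1)
    return sorted(mentions)


chapters = split_chapters(SAMPLE)
mentions = find_mentions(chapters, CHARACTERS)
for position, name in mentions:
    print(f"{position:.3f}  {name}")
```

The fractional position is one possible way to get a time-like x coordinate for later plots; the article does not say how its own positions were computed.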

I use Shiny to visualize these datasets interactively in R.

## Geographical map

```r
library(ggplot2)  # borders() also needs the maps package

# Keep only the locations and paths inside the selected chapter range.
# Assumption: `geo` is the location data frame (the excerpt indexes `catch22` here).
geo_sub <- geo[(geo$Time > chapters[1]) & (geo$Time < (chapters[2] + 1)), ]
paths_sub <- paths[(paths$time > chapters[1]) & (paths$time < (chapters[2] + 1)), ]

# Base map
p <- ggplot() + borders("world", colour = "black", fill = "lightyellow") +
  ylab(NULL) + xlab(NULL)

# Draw the locations and paths only when the selection is not empty
if (nrow(geo_sub) != 0) {
  p + geom_point(data = geo_sub, aes(x = Lon, y = Lat), size = 3, colour = "red") +
    geom_point(data = paths_sub[1, ], aes(x = lon, y = lat), size = 3, colour = "red") +
    geom_path(data = paths_sub, aes(x = lon, y = lat, alpha = alpha), size = .7)
}
```

The visualization maps the locations around the Mediterranean that are mentioned throughout the book.

## Character chapter relationship

```r
# One tick mark per mention: chapter position on x, character on y
ggplot(catch22, aes(x = Chapter, y = Character, colour = cols)) +
  geom_point(size = size, shape = "|", alpha = 0.8) +
  scale_x_continuous(limits = c(chapters[1], (chapters[2] + 1)), expand = c(0, 0),
                     breaks = (1:42) + 0.5, labels = labs) +
  ylab(NULL) + xlab("Chapter") +
  theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
        axis.text.y = element_text(colour = "black"),
        axis.title.x = element_text(vjust = 5),
        plot.title = element_text(vjust = 1))
```

The figure shows where each character is mentioned over the course of the book.

I plot the data as a standard scatter plot, with the chapters on the x-axis (because they behave like time) and the characters on a discrete y-axis.

## Character co-occurrence matrix

```r
# One tile per character pair, filled by cluster and faded by co-occurrence strength
ggplot(coloca, aes(x = Character, y = variable, alpha = alpha)) +
  geom_tile(aes(fill = factor(cluster)), colour = "white") +
  ylab(NULL) + xlab(NULL) +
  theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
        axis.text.y = element_text(colour = "black"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.minor = element_line(colour = "white", size = 1),
        panel.grid.major = element_blank()) +
  scale_fill_manual(values = cols, guide = "none") +
  scale_alpha_continuous(guide = "none")
```

The data behind this visualization is exactly the same as in the previous one, but it needs a lot of transformation first.
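The transformation itself is not shown. As a rough sketch, mention data can be turned into a co-occurrence matrix and then into long-format rows of the kind a tile plot consumes. The sample mentions, the function name `cooccurrence`, and the choice to count two characters as co-occurring when they appear in the same chapter are assumptions; the article does not state which window it used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (chapter, character) mentions, shaped like the scatter-plot data.
mentions = [
    (1, "Yossarian"), (1, "Chaplain"), (1, "Yossarian"),
    (2, "Yossarian"), (2, "Orr"), (2, "Milo"),
    (3, "Milo"), (3, "Orr"),
]


def cooccurrence(mentions):
    """Count, for every pair of characters, the chapters in which both appear."""
    by_chapter = {}
    for chapter, name in mentions:
        by_chapter.setdefault(chapter, set()).add(name)

    names = sorted({name for _, name in mentions})
    pair_counts = Counter()
    for present in by_chapter.values():
        for a, b in combinations(sorted(present), 2):
            pair_counts[(a, b)] += 1

    # Symmetric matrix stored as nested dicts, with zeros on the diagonal.
    matrix = {a: {b: 0 for b in names} for a in names}
    for (a, b), n in pair_counts.items():
        matrix[a][b] = n
        matrix[b][a] = n
    return names, matrix


names, matrix = cooccurrence(mentions)

# Long format: one row per (character, other character, count) tile.
rows = [(a, b, matrix[a][b]) for a in names for b in names]
for row in rows:
    print(row)
```

Each long-format row corresponds to one tile, with the two names playing the role of the `Character` and `variable` columns in the plot code.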

Clustering adds another dimension to this plot. I applied hierarchical clustering across the whole book to try to find communities among the characters, using the AGNES (agglomerative nesting) algorithm. After manually inspecting different clustering schemes, I chose the one in which the most frequent characters dominate the least. The resulting tree is cut into six clusters:

```r
library(cluster)  # provides agnes()

# Agglomerative clustering with complete linkage, without standardizing the columns
ag <- agnes(cat2[, -1], method = "complete", stand = FALSE)
# Cut the tree into the requested number of clusters
cluster <- cutree(ag, k = clusters)
```

Note that the clustering is performed on the entire text, not chapter by chapter. Sorting the characters by cluster places members of the same community next to each other, so the reader can also see some of the interactions between characters.
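To make the idea concrete, here is a small pure-Python sketch of agglomerative clustering with complete linkage that stops once `k` clusters remain, which gives the same partition as cutting the full tree at `k`. It is not the `agnes` implementation; the function name `complete_linkage`, the Euclidean distance, and the per-character count vectors are assumptions made for illustration.

```python
import math

# Hypothetical per-character mention counts (one column per section of the book).
names = ["Yossarian", "Orr", "Milo", "Cathcart", "Korn"]
counts = [
    [9, 8, 9],
    [3, 0, 1],
    [2, 1, 0],
    [0, 5, 6],
    [0, 4, 6],
]


def complete_linkage(points, k):
    """Merge the two closest clusters until k remain; return 1-based labels."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for m in members:
            labels[m] = label
    return labels


labels = complete_linkage(counts, k=3)
for name, label in zip(names, labels):
    print(name, label)
```

In this toy data the most frequent character ends up in a cluster of its own, which illustrates why the choice of the number of clusters needs manual inspection.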

## Characteristic words

```r
ggplot(pos2, aes(Chapter, normed, colour = Word, fill = Word)) +
  # Assumption: the geom layer is missing from the excerpt; bars stack by default
  geom_bar(stat = "identity") +
  scale_color_brewer(type = "qual", palette = "Set1", guide = "none") +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  scale_y_continuous(limits = c(0, y_max), expand = c(0, 0)) +
  ylab("Relative Word Frequency") + xlab("Chapter")
```

A stacked bar chart shows more clearly which chapters each word appears in.
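The article does not say how the `normed` column is computed. One plausible reading, sketched below, is the count of each tracked word divided by the total number of word tokens in the chapter. The sample chapters, the word list, the tokenizer, and the function name `relative_frequencies` are assumptions.

```python
import re
from collections import Counter

# Hypothetical miniature chapters and tracked words.
chapters = {
    1: "Crazy, he said. Everyone is crazy.",
    2: "He was dead. Nobody was crazy.",
}
WORDS = ["crazy", "dead"]


def relative_frequencies(chapters, words):
    """Return (chapter, word, frequency) rows, frequency = count / tokens in chapter."""
    rows = []
    for number, body in sorted(chapters.items()):
        tokens = re.findall(r"[a-z']+", body.lower())
        counts = Counter(tokens)
        for word in words:
            freq = counts[word] / len(tokens) if tokens else 0.0
            rows.append((number, word, freq))
    return rows


rows = relative_frequencies(chapters, WORDS)
for number, word, freq in rows:
    print(number, word, round(freq, 3))
```

Normalizing by chapter length keeps long chapters from dominating the chart, which is why a relative rather than absolute frequency is a natural choice for the y-axis.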

## Conclusion

I learned a lot in this process, particularly about using Shiny.