Original link: http://tecdat.cn/?p=5673
Original source: Tecdat (拓端数据部落) official account
Catch-22 is a novel by the American writer Joseph Heller. Set during the Second World War, it reveals an irrational, disorderly, nightmarish and absurd world through a series of events in a U.S. Army Air Forces squadron stationed on the Mediterranean island of Pianosa (fictionalized by the author). I like the creative use of language and the interaction of the absurd characters throughout the book. This article applies text mining and visualization to the novel.
Data set
The novel has about 175,000 words and is divided into 42 chapters. I found a plain-text version of the book on the Internet.
I parsed the text in Python, using a combination of regular expressions and simple string matching.
I used Shiny to visualize the resulting datasets interactively in R.
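The parsing step is not shown in the original, so the following is only a minimal sketch of how regular expressions and plain string matching might be combined. The heading format ("CHAPTER 12" on its own line), the sample text and the helper names `split_chapters` and `count_mentions` are all assumptions made for illustration.

```python
import re

# Hypothetical sample text; the real input would be the full plain-text novel.
SAMPLE = """CHAPTER 1
Yossarian was in the hospital. The chaplain came to see Yossarian.
CHAPTER 2
Clevinger argued with Yossarian about the war.
"""

# Assumed heading format: a line such as "CHAPTER 12".
CHAPTER_RE = re.compile(r"^CHAPTER[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def split_chapters(text):
    """Return a dict mapping chapter number to that chapter's body text."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = {}
    for i, m in enumerate(matches):
        # A chapter runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters[int(m.group(1))] = text[m.end():end].strip()
    return chapters


def count_mentions(chapters, names):
    """Count occurrences of each name per chapter with plain substring matching."""
    return {
        num: {name: body.count(name) for name in names}
        for num, body in chapters.items()
    }


chapters = split_chapters(SAMPLE)
mentions = count_mentions(chapters, ["Yossarian", "Clevinger", "chaplain"])
print(mentions)
```

Substring matching is case-sensitive and would also match a name inside a longer word, so a real character list would probably need some manual checking.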
Geographical map
library(ggplot2)
# Keep only the locations and paths inside the selected chapter range
# (assumes `geo` holds the location mentions and `paths` the travel paths)
geo_sub <- geo[(geo$Time > chapters[1]) & (geo$Time < (chapters[2] + 1)), ]
paths_sub <- paths[(paths$time > chapters[1]) & (paths$time < (chapters[2] + 1)), ]
# Base map
p <- ggplot() + borders("world", colour = "black", fill = "lightyellow") +
  ylab(NULL) + xlab(NULL)
# Draw the positions and the path only if the selection contains any locations
if (nrow(geo_sub) != 0) {
  p <- p + geom_point(data = geo_sub, aes(x = Lon, y = Lat), size = 3, colour = 'red') +
    geom_point(data = paths_sub[1, ], aes(x = lon, y = lat), size = 3, colour = 'red') +
    # The source is cut off after `size`; any further arguments are not shown
    geom_path(data = paths_sub, aes(x = lon, y = lat, alpha = alpha), size = .7)
}
The visualization maps the locations around the Mediterranean mentioned in the whole book.
Character chapter relationship
# One tick mark per mention: chapters on the x-axis, characters on the y-axis
ggplot(catch22, aes(x = Chapter, y = Character, colour = cols)) +
  geom_point(size = size, shape = '|', alpha = 0.8) +
  scale_x_continuous(limits = c(chapters[1], (chapters[2] + 1)), expand = c(0, 0), breaks = (1:42) + 0.5, labels = labs) +
  ylab(NULL) + xlab('Chapter') +
  theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
        axis.text.y = element_text(colour = "black"),
        axis.title.x = element_text(vjust = 5),
        plot.title = element_text(vjust = 1))
The figure shows where each character is mentioned over the course of the book.
I plotted the data as a standard scatter plot, with the chapters on the x-axis (they play the role of time) and the characters on a discrete y-axis.
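The axis breaks at (1:42)+0.5 suggest that each mention has a continuous x-position inside its chapter, with the label centred on the chapter. How those positions were computed is not stated, so the sketch below is only one plausible scheme: a mention at character offset i in chapter n is placed at n + i / (chapter length). The function name and the sample input are hypothetical.

```python
import re


def mention_positions(chapters, name):
    """Return one fractional x-position per mention of `name`.

    A mention at character offset i in chapter n is placed at
    n + i / len(chapter), so chapter 3 spans the interval [3, 4).
    """
    positions = []
    for num in sorted(chapters):
        body = chapters[num]
        for m in re.finditer(re.escape(name), body):
            positions.append(num + m.start() / len(body))
    return positions


# Hypothetical two-chapter input keyed by chapter number.
chapters = {
    1: "Yossarian slept. Later Yossarian woke.",
    2: "Nobody saw Yossarian.",
}
xs = mention_positions(chapters, "Yossarian")
print([round(x, 2) for x in xs])
```

With positions like these, a character who appears throughout a chapter produces a dense band of tick marks, while a single passing mention leaves one isolated mark.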
Character co-occurrence matrix
# One tile per character pair, filled by cluster and shaded by alpha
ggplot(coloca, aes(x = Character, y = variable, alpha = alpha)) +
  geom_tile(aes(fill = factor(cluster)), colour = 'white') +
  ylab(NULL) + xlab(NULL) +
  theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
        axis.text.y = element_text(colour = "black"),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.minor = element_line(colour = "white", size = 1),
        panel.grid.major = element_blank()) +
  # guide = "none" hides both legends (replaces the deprecated guide = FALSE)
  scale_fill_manual(values = cols, guide = "none") +
  scale_alpha_continuous(guide = "none")
This visualization is built from exactly the same data as the previous one, but the data needed considerable transformation first.
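The transformation itself is not shown. As a rough sketch, assuming that two characters co-occur whenever both are mentioned in the same chapter (the original may well use a finer unit such as a paragraph), the pair counts could be built and flattened into one record per tile like this. All names and sample data here are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(mentions_by_unit):
    """Count, for every pair of characters, the units in which both appear.

    `mentions_by_unit` is a list of sets, one set of character names per
    unit of text (a chapter, in this sketch).
    """
    counts = Counter()
    for present in mentions_by_unit:
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


def to_long_format(counts, names):
    """Flatten the counts into (row, column, value) records, one per tile."""
    return [(a, b, counts[(a, b)]) for a in names for b in names if a != b]


# Hypothetical per-chapter character sets.
units = [
    {"Yossarian", "Chaplain"},
    {"Yossarian", "Clevinger"},
    {"Yossarian", "Chaplain", "Clevinger"},
]
names = ["Chaplain", "Clevinger", "Yossarian"]
counts = cooccurrence(units)
long_rows = to_long_format(counts, names)
for row in long_rows:
    print(row)
```

The long format, with one row per pair, corresponds to the shape a tile plot expects: one column for each axis and one for the value.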
Clustering adds another dimension to this graph. I applied hierarchical clustering over the whole book to try to find communities among the characters, using the AGNES (agglomerative nesting) algorithm. After manually inspecting different clustering schemes, I chose the one in which the most frequent characters dominate the least. This is a tree of six clusters:
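library(cluster)
# Agglomerative clustering with complete linkage, without standardizing the data
ag <- agnes(cat2[, -1], method = "complete", stand = FALSE)
# Cut the tree into k clusters (converted to hclust, the class cutree() documents)
cluster <- cutree(as.hclust(ag), k = clusters)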
Note that clustering is performed on the entire text, not per chapter. Sorting the axes by cluster places characters from the same community next to each other, so that the reader can also see some of the interactions between the characters.
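To make the idea behind complete-linkage agglomerative clustering concrete, here is a small pure-Python sketch. It is not the `agnes` implementation from the R `cluster` package, only an illustration of the same bottom-up principle. The input rows stand for hypothetical per-character feature vectors.

```python
import math


def complete_linkage(points, k):
    """Merge clusters bottom-up until `k` remain; return a label per point.

    The distance between two clusters is the largest Euclidean distance
    between any pair of their members (complete linkage).
    """
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        # Find the closest pair of clusters under complete linkage.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical feature vectors: two characters per community.
points = [(5, 0, 1), (4, 1, 0), (0, 6, 5), (1, 5, 6)]
labels = complete_linkage(points, k=2)
print(labels)
```

This naive version recomputes every pairwise distance at each step, which is acceptable for a few dozen characters but would be slow on large inputs.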
Characteristic words
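# The geom_bar layer is an assumption: the source omits the geom,
# and the text describes a stacked bar chart
ggplot(pos2, aes(Chapter, normed, colour = Word, fill = Word)) +
  geom_bar(stat = "identity") +
  scale_color_brewer(type = 'qual', palette = 'Set1', guide = "none") +
  scale_fill_brewer(type = 'qual', palette = 'Set1') +
  scale_y_continuous(limits = c(0, y_max), expand = c(0, 0)) +
  ylab('Relative Word Frequency') + xlab('Chapter')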
A stacked bar chart shows more clearly which chapters each word appears in.
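How the `normed` column was computed is not stated. One common choice, assumed here, is to divide a word's count in a chapter by the total number of tokens in that chapter. The sketch below uses that definition. The tokenizer, function name and sample chapters are hypothetical.

```python
import re
from collections import Counter


def relative_frequencies(chapters, words):
    """Return {chapter: {word: count / total tokens in that chapter}}."""
    result = {}
    for num, body in chapters.items():
        # Very simple tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", body.lower())
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty chapter
        result[num] = {w: counts[w.lower()] / total for w in words}
    return result


# Hypothetical chapter texts keyed by chapter number.
chapters = {
    1: "Crazy. Everyone here is crazy, said Yossarian.",
    2: "The war went on and on.",
}
freqs = relative_frequencies(chapters, ["crazy", "war"])
print(freqs)
```

Normalizing by chapter length keeps long chapters from dominating the chart simply because they contain more words.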
Conclusion
I learned a lot in this process, including how to use Shiny.