Tecdat: a case study of text mining and hierarchical clustering visual analysis of a novel with Python and R

Time:2021-7-28

Original link:http://tecdat.cn/?p=5673

Original source: Tuoduan Data Tribe (tecdat) official account

Catch-22 is a novel by the American writer Joseph Heller. Set during the Second World War, it reveals an irrational, disorderly and nightmarishly absurd world through a series of events in a U.S. Army Air Forces bomber squadron stationed on a Mediterranean island called Pianosa (fictionalized by the author). I like the creative use of language and the interaction of absurd characters throughout the book. This article applies text mining and visualization to the novel.

Data set

The novel has about 175,000 words and is divided into 42 chapters. I found a plain-text version of the book online.

I parsed the text in Python using a combination of regular expressions and simple string matching.
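The parsing script itself is not shown in the article. As a minimal sketch of the approach described above, the following splits a plain-text book into chapters with a regular expression and counts character mentions per chapter. The heading pattern (`CHAPTER <number>`), the function names and the sample text are all assumptions made for illustration; the real file may mark chapters differently.

```python
import re

# Hypothetical heading format: a line such as "CHAPTER 1" or "Chapter 12".
# The real text file may mark chapters differently.
CHAPTER_HEADING = re.compile(r"^\s*chapter\s+(\d+)\b.*$", re.IGNORECASE | re.MULTILINE)


def split_chapters(text):
    """Return a dict mapping chapter number -> chapter body text."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = {}
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters[int(match.group(1))] = text[start:end].strip()
    return chapters


def count_mentions(chapter_text, names):
    """Count whole-word, case-sensitive mentions of each name in one chapter."""
    return {
        name: len(re.findall(r"\b" + re.escape(name) + r"\b", chapter_text))
        for name in names
    }


# Tiny made-up sample standing in for the book text.
sample = """CHAPTER 1
Yossarian was in the hospital. Yossarian liked the chaplain.
CHAPTER 2
Clevinger argued with Yossarian.
"""
chapters = split_chapters(sample)
counts = {
    number: count_mentions(body, ["Yossarian", "Clevinger"])
    for number, body in chapters.items()
}
print(counts)
```

Whole-word matching with `\b` keeps a short name from matching inside a longer word, which is the kind of case where simple string matching alone can overcount.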

I used Shiny in R to visualize these datasets interactively.

Geographical map

  # Keep only the locations and paths that fall inside the selected chapter range
  geo_sub <- geo[(geo$Time > chapters[1]) & (geo$Time < (chapters[2] + 1)), ]
  paths_sub <- paths[(paths$time > chapters[1]) & (paths$time < (chapters[2] + 1)), ]

  # Base map
  p <- ggplot() + borders("world", colour = "black", fill = "lightyellow") +
    ylab(NULL) + xlab(NULL)

  # Draw the positions and paths only if the selection contains any locations
  if (nrow(geo_sub) != 0) {
    p + geom_point(data = geo_sub, aes(x = Lon, y = Lat), size = 3, colour = 'red') +
      geom_point(data = paths_sub[1, ], aes(x = lon, y = lat), size = 3, colour = 'red') +
      # The original snippet is cut off after size = .7
      geom_path(data = paths_sub, aes(x = lon, y = lat, alpha = alpha), size = .7)
  }


The visualization maps the locations around the Mediterranean mentioned in the whole book.

Character-chapter relationship

ggplot(catch22, aes(x = Chapter, y = Character, colour = cols)) +
      # One vertical tick per mention
      geom_point(size = size, shape = '|', alpha = 0.8) +
      scale_x_continuous(limits = c(chapters[1], (chapters[2] + 1)), expand = c(0, 0),
                         breaks = (1:42) + 0.5, labels = labs) +
      ylab(NULL) + xlab('Chapter') +
      theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
            axis.text.y = element_text(colour = "black"),
            axis.title.x = element_text(vjust = 5),
            plot.title = element_text(vjust = 1))


The figure shows where in the book each character is mentioned.

I plotted the data as a standard scatter plot, with the chapters on the x-axis (because they behave like time) and the characters on a discrete y-axis.
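The article does not show how the x-coordinates were derived. One plausible way to prepare such data, sketched here under assumptions, is to give each mention the value "chapter number plus relative offset within the chapter", so that mentions later in a chapter land further right. The function name, the record layout and the sample text are hypothetical.

```python
import re


def mention_positions(chapters, names):
    """Return (character, x) records, where x = chapter number + relative offset.

    `chapters` maps chapter number -> text. The fractional part (0 <= f < 1)
    is the mention's character offset divided by the chapter length.
    """
    records = []
    for number, body in sorted(chapters.items()):
        length = max(len(body), 1)  # guard against empty chapters
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", body):
                records.append((name, number + match.start() / length))
    return records


# Made-up two-chapter sample.
chapters = {
    1: "Yossarian met Orr. Then Yossarian left.",
    2: "Orr fixed the stove.",
}
records = mention_positions(chapters, ["Yossarian", "Orr"])
for name, x in records:
    print(f"{name:10s} {x:.2f}")
```

Written to a CSV file, records of this shape could be read into R and plotted with one tick per row.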

Character co-occurrence matrix

ggplot(coloca, aes(x = Character, y = variable, alpha = alpha)) +
    # One tile per character pair, filled by cluster and faded by alpha
    geom_tile(aes(fill = factor(cluster)), colour = 'white') +
    ylab(NULL) + xlab(NULL) +
    theme(axis.text.x = element_text(colour = "black", angle = 45, hjust = 1, vjust = 1.03),
          axis.text.y = element_text(colour = "black"),
          axis.ticks.y = element_blank(),
          axis.ticks.x = element_blank(),
          panel.grid.minor = element_line(colour = "white", size = 1),
          panel.grid.major = element_blank()) +
    # guide = "none" replaces the deprecated guide = FALSE
    scale_fill_manual(values = cols, guide = "none") +
    scale_alpha_continuous(guide = "none")


The data behind this visualization is exactly the same as in the previous one, but it required substantial transformation.
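The transformation is not spelled out in the article. As an illustration only, the sketch below turns per-chapter character sets into a symmetric co-occurrence matrix and then melts it into long rows suitable for a tile plot. Treating "mentioned in the same chapter" as co-occurrence is an assumption; the author may have used a finer window. All names and the sample data are hypothetical.

```python
from itertools import combinations


def cooccurrence(chapter_characters):
    """Build a symmetric co-occurrence count matrix as a nested dict.

    `chapter_characters` maps a unit of text (here, a chapter number) to the
    set of characters mentioned in it.
    """
    names = sorted(set().union(*chapter_characters.values()))
    matrix = {a: {b: 0 for b in names} for a in names}
    for present in chapter_characters.values():
        for a, b in combinations(sorted(present), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return names, matrix


def to_long(names, matrix):
    """Melt the matrix into (character, other, count) rows for a tile plot."""
    return [(a, b, matrix[a][b]) for a in names for b in names]


# Made-up sample: which characters appear in which chapter.
chapter_characters = {
    1: {"Yossarian", "Orr"},
    2: {"Yossarian", "Orr", "Milo"},
    3: {"Milo"},
}
names, matrix = cooccurrence(chapter_characters)
long_rows = to_long(names, matrix)
print(long_rows)
```

The long format mirrors the `Character` / `variable` columns used in the plot: one row per pair, with the count available for mapping to transparency.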

Clustering adds another dimension to this plot. I applied hierarchical clustering over the whole book to try to find communities of characters, using the AGNES (agglomerative nesting) algorithm. I compared the candidate clustering schemes by manual inspection and kept the one in which the most frequent characters dominated the least. This is the tree cut into six clusters:

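library(cluster)  # provides agnes()
    # Complete-linkage agglomerative clustering on the unstandardized data
    ag <- agnes(cat2[, -1], method = "complete", stand = FALSE)
    # Cut the tree into the requested number of clusters
    cluster <- cutree(as.hclust(ag), k = clusters)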
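To make the idea behind complete-linkage agglomerative clustering concrete, here is a small pure-Python sketch. It is not the `agnes` implementation, only an illustration of the same technique: start with one cluster per row, repeatedly merge the two clusters whose farthest members are closest, and stop when `k` clusters remain. Euclidean distance and the sample rows are assumptions chosen for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage_clusters(rows, k):
    """Merge the two closest clusters until k remain; return a label per row.

    The distance between two clusters is the largest pairwise distance
    between their members (complete linkage).
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        return max(euclidean(rows[i], rows[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(rows)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Made-up co-occurrence rows for four characters: two obvious pairs.
rows = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 6, 5],
    [0, 1, 5, 6],
]
labels = complete_linkage_clusters(rows, k=2)
print(labels)
```

This naive version recomputes every linkage on each pass, which is fine for a few dozen characters but slow for large inputs.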


Note that clustering is performed on the entire text, not per chapter. Sorting the characters by cluster places members of the same community next to each other, so the reader can also see some of the interactions between characters.

Characteristic words

 ggplot(pos2, aes(Chapter, normed, colour = Word, fill = Word)) +
      scale_color_brewer(type = 'qual', palette = 'Set1', guide = "none") +
      scale_fill_brewer(type = 'qual', palette = 'Set1') +
      scale_y_continuous(limits = c(0, y_max), expand = c(0, 0)) +
      # The original snippet is cut off after this line; the bar layer is not shown
      ylab('Relative Word Frequency') + xlab('Chapter')


A stacked bar chart shows more clearly which chapters each word appears in.
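The plot's y-axis is a relative word frequency, but the article does not say how it was normalized. A minimal sketch, assuming each word's count is divided by the total number of words in the chapter, could look like this. The tokenizer, function name and sample text are hypothetical.

```python
import re
from collections import Counter


def relative_frequencies(chapters, words):
    """Return (chapter, word, share) rows; share = count / words in chapter."""
    rows = []
    for number, body in sorted(chapters.items()):
        # Simple lowercase tokenizer: runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", body.lower())
        counts = Counter(tokens)
        total = max(len(tokens), 1)  # guard against empty chapters
        for word in words:
            rows.append((number, word, counts[word] / total))
    return rows


# Made-up two-chapter sample.
chapters = {
    1: "Crazy men fly. Crazy rules.",
    2: "The war goes on and on.",
}
rows = relative_frequencies(chapters, ["crazy", "war"])
for chapter, word, share in rows:
    print(chapter, word, round(share, 3))
```

Normalizing by chapter length keeps long chapters from dominating the chart simply because they contain more words.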

Conclusion

I learned a lot in the process, both about the tools in general and about Shiny.
