Case study: community detection on topic networks in R


Original source: Tuo end data tribal official account

Create topic network

My research examines publications in the social sciences, computing and informatics by analyzing their text and co-authorship social networks.

One question I have encountered is: how do we measure the relationship (relevance) between topics? I want to create a network visualization that connects similar topics and helps users browse a large number of topics more easily.

Data preparation

Our first step is to load the topic matrices output by LDA. LDA has two outputs: a word-topic matrix and a document-topic matrix.

As an alternative to loading files, you can use the output of the LDA function in the topicmodels package to create the word-topic and document-topic matrices.
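
The alternative route can be sketched roughly as follows. This is a minimal example, assuming the topicmodels package and its bundled AssociatedPress document-term matrix as stand-in data; the number of topics (k = 3), the seed, and the names word.topic and doc.topic are illustrative choices, not the author's.

```r
library(topicmodels)

# Stand-in data: a small slice of the AssociatedPress document-term matrix
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]

# Fit a small LDA model (k and seed are arbitrary choices for this sketch)
lda <- LDA(dtm, k = 3, control = list(seed = 1))

# posterior() returns both LDA outputs
post <- posterior(lda)
word.topic <- t(post$terms)   # words in rows, topics in columns
doc.topic  <- post$topics     # documents in rows, topics in columns

dim(word.topic)
dim(doc.topic)
```

Each row of doc.topic holds one document's topic proportions, so the rows sum to 1.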

#   Load the author-topic matrix; the first column is the author name
author.topic <- read.csv("topics.csv", stringsAsFactors = FALSE)
#   Load the word-topic matrix; the first column is the word (loading code not shown)

#   Rename the topic columns; `name` holds the topic labels (its construction is not shown)
colnames(author.topic) <- c("author_name", name$topic_name)

Unlike standard LDA, I run an "author-centric" LDA in which each author's abstracts are merged and treated as a single document for that author. This is because my ultimate goal is to use topic modeling as an information retrieval process to determine the expertise of researchers.
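
The merging step can be illustrated with a small sketch. The data frame papers, its columns author and abstract, and the sample rows are all hypothetical; the source does not show how the merge was actually done.

```r
# Hypothetical input: one row per paper
papers <- data.frame(
  author   = c("A. Smith", "B. Jones", "A. Smith"),
  abstract = c("topic models for text",
               "social network analysis",
               "community detection in graphs"),
  stringsAsFactors = FALSE
)

# Concatenate all abstracts of each author into one "document" per author
author.docs <- aggregate(abstract ~ author, data = papers,
                         FUN = paste, collapse = " ")

author.docs
```

The resulting data frame has one row per author, which is the unit LDA then treats as a document.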

Create static network

In the next step, I use the correlation between the word probabilities of each topic to create a network.

First, I keep only the relationships (edges) with a meaningful correlation (0.2 or higher). I use 0.2 because, for a sample of 100 observations, a correlation of about that size is statistically significant at the 0.05 level.
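
The 0.2 figure can be checked with a short calculation. This sketch assumes a two-sided test of a Pearson correlation against zero, which the source does not state explicitly; the variable names are illustrative.

```r
n     <- 100    # number of observations
alpha <- 0.05   # two-sided significance level

# Critical t value with n - 2 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 2)

# Convert it to the critical correlation: r = t / sqrt(df + t^2)
r_crit <- t_crit / sqrt((n - 2) + t_crit^2)

round(r_crit, 3)
```

The critical value comes out just under 0.2, so rounding up to 0.2 is a slightly conservative cutoff.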

cor_threshold <- .2

Next, we use the correlation matrix to create the igraph data structure and delete all edges whose correlation is below the 0.2 threshold.
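
Since the graph-building code is not shown, here is a minimal sketch of one way to do it with igraph. The toy word.topic data frame (100 words, 4 topics) is simulated, and the use of graph_from_adjacency_matrix and delete_edges is an assumption about the approach, not the author's exact code.

```r
library(igraph)

# Simulated word-topic matrix: t1 and t2 share a common signal, t3 and t4 are noise
set.seed(42)
base <- runif(100)
word.topic <- data.frame(
  word = paste0("w", 1:100),
  t1 = base + rnorm(100, sd = 0.1),
  t2 = base + rnorm(100, sd = 0.1),
  t3 = runif(100),
  t4 = runif(100)
)

cor_threshold <- .2

# Topic-by-topic correlation of word probabilities (drop the word column)
cor_mat <- cor(word.topic[, -1])

# Build a weighted, undirected graph from the upper triangle, ignoring the diagonal
graph <- graph_from_adjacency_matrix(cor_mat, mode = "upper",
                                     weighted = TRUE, diag = FALSE)

# Delete every edge whose correlation is below the threshold
graph <- delete_edges(graph, which(E(graph)$weight < cor_threshold))

E(graph)$weight
```

Each remaining edge weight is a correlation of at least 0.2, and topics left without edges stay in the graph as isolated nodes.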

Let's plot a simple igraph network.

title( cex.main=.8)

Each node is labeled with the number that identifies its topic.

I use community detection, specifically the label propagation algorithm in igraph, to identify the clusters in the network.

clp <- cluster_label_prop(graph)

Community detection found 13 communities, plus a separate community for each of the isolated topics (i.e. topics without any connections).

In line with my initial observations, the algorithm found the three main clusters we identified in the first graph, but it also added smaller clusters that do not seem to fit into any of the three main ones.

V(graph)$community <- clp$membership
V(graph)$degree <- degree(graph, v = V(graph))
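
To see how label propagation treats isolated nodes, consider this small sketch on a toy graph (two triangles and one unconnected vertex). The graph g is made up for illustration; only cluster_label_prop and the attribute assignments mirror the code above.

```r
library(igraph)

# Toy graph: two separate triangles; vertex 7 has no edges
g <- make_empty_graph(n = 7, directed = FALSE)
g <- add_edges(g, c(1, 2,  2, 3,  1, 3,    # first triangle
                    4, 5,  5, 6,  4, 6))   # second triangle

set.seed(1)                      # label propagation is randomized
clp <- cluster_label_prop(g)

# Store community membership and degree as vertex attributes
V(g)$community <- clp$membership
V(g)$degree    <- degree(g)

table(V(g)$community)
```

Labels cannot spread across disconnected parts of the graph, so each triangle ends up as one community and the isolated vertex forms a community of its own.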

Dynamic visualization

In this section, we will use the visNetwork package to build an interactive network diagram.

First, load the library and run visIgraph, which draws an interactive network directly from the igraph structure (graph).
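
The code for this step is not shown; a minimal sketch, assuming the visNetwork package and a toy ring graph in place of the topic network, might look like this:

```r
library(igraph)
library(visNetwork)

# Toy stand-in for the topic graph
g <- make_ring(5)

# Build an interactive widget straight from the igraph object
net <- visIgraph(g)
```

Printing net in an interactive R session opens the network in the viewer or browser.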

We create the visNetwork data structure, and then split the resulting list into two data frames: nodes and edges.

data <- toVisNetworkData(graph)
nodes <- data[[1]]
edges <- data[[2]]

Remove unconnected nodes, i.e. topics with degree 0.

nodes <- nodes[nodes$degree != 0, ]
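
The conversion and filtering can be tried end to end on a toy graph. In this sketch the graph g and its vertex names are made up; the degree attribute is set the same way as earlier in the article.

```r
library(igraph)
library(visNetwork)

# Toy graph: t1 - t2 - t3 are connected, t4 is isolated
g <- make_empty_graph(n = 4, directed = FALSE)
g <- add_edges(g, c(1, 2,  2, 3))
V(g)$name   <- c("t1", "t2", "t3", "t4")
V(g)$degree <- degree(g)

# Convert to visNetwork's format: a list with a nodes and an edges data frame
data  <- toVisNetworkData(g)
nodes <- data$nodes
edges <- data$edges

# Drop isolated topics
nodes <- nodes[nodes$degree != 0, ]

nodes$id
```

Vertex attributes such as degree carry over as columns of nodes, which is what makes the filter possible.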

Add colors and other network parameters to improve the network.

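col <- brewer.pal(12, "Set3")[as.factor(nodes$community)]
nodes$shape <- "dot"
#   Node size, scaled by betweenness (left-hand side reconstructed; the line is truncated in the source)
nodes$size <- ((nodes$betweenness / max(nodes$betweenness)) + .2) * 20
nodes$color.highlight.background <- "orange"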

Finally, we create our network with interactive charts. You can use the mouse wheel to zoom.

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, selectedBy = "community", nodesIdSelection = TRUE)
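
For readers without the topic data, the same call can be tried on hand-built data frames. This is a sketch under the assumption that nodes has id, label and community columns and edges has from and to columns; the four toy topics are invented.

```r
library(visNetwork)

# Hypothetical nodes: an id, a display label, and a community per topic
nodes <- data.frame(
  id        = 1:4,
  label     = c("topic 1", "topic 2", "topic 3", "topic 4"),
  community = c(1, 1, 2, 2)
)

# Hypothetical edges between topic ids
edges <- data.frame(from = c(1, 3), to = c(2, 4))

net <- visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE,    # dim everything except a clicked node's neighbors
             selectedBy = "community",   # drop-down that selects by the community column
             nodesIdSelection = TRUE)    # drop-down that selects a single node
```

Printing net in an interactive session displays the widget with both drop-down menus.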

The network has two drop-down menus. The first lets you find any topic by its name (its top five words by word probability).

The second drop-down list highlights the communities detected in our algorithm.

The three largest seem to be:

  • Computing (grey, cluster 4)
  • Social (green-blue, cluster 1)
  • Health (yellow, cluster 2)

What is unique about the smaller communities detected? Can you explain?