Python Web Crawling and LDA Topic Semantic Data Analysis

Time: 2022-05-31

Original link: Python crawler web crawling LDA topic semantic data analysis | tecdat

Original source: Tuoduan Data Tribe official account

What is web crawling?

The method of extracting data from a website is called web scraping, also known as web data extraction or web harvesting.

Why do I crawl a web page?

The purpose of web scraping is to obtain data from any website, saving a great deal of manual labor in collecting data. For example, you can collect all the reviews of a movie from the IMDB website and then perform text analysis on the collected comments to gain insights about the movie.

Crawling the first page

If we change the page number in the address bar, we can see pages 0 through 15. We'll start by crawling the first page of opencodez.com.

In the first step, we send a request to the URL and store its response in a variable named response. The response contains the full HTML of the page.

import requests

url = "https://www.opencodez.com/page/0"
response = requests.get(url)

Next, we parse the HTML content with BeautifulSoup, using the html.parser parser.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")


We can then format the parsed output to organize it for inspection.

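A minimal sketch of that formatting step, assuming it refers to BeautifulSoup's prettify() (the inline HTML is a stand-in for the real opencodez page fetched above):

```python
from bs4 import BeautifulSoup

# Stand-in snippet; the real soup comes from the page fetched above.
html = "<html><body><h2 class='title'><a href='/post'>Post</a></h2></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parse tree as an indented, one-tag-per-line string.
print(soup.prettify())
```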

Let's look at the parts of the page from which we must extract the details. If we inspect the elements via the right-click method described earlier, we see that the href and title of each article sit in an h2 tag with the class named title.


In the page's HTML, each article title and its link appear inside such an h2 tag.

We will pull them all out with the following command.

soup_title = soup.find_all("h2", {"class": "title"})
len(soup_title)

This returns a list of 12 elements. From these, we use the following commands to extract the titles and hrefs of all the published articles.

for x in range(12):
    print(soup_title[x].a['href'])

for x in range(12):
    print(soup_title[x].a['title'])


To collect each post's short description, author, and date, we look for the div tag whose class is "post content image-caption-format-1".

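A hedged sketch of that step. The markup below is invented to mirror the structure described above; the class string is taken from the text, but the inner tags (the description paragraph and the author/date spans) are assumptions and may differ on the live site.

```python
from bs4 import BeautifulSoup

# Invented sample markup mirroring the structure described above.
html = """
<div class="post content image-caption-format-1">
  <p>A short description of the first post.</p>
  <span class="author">Jane Doe</span>
  <span class="date">May 31, 2022</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup matches a multi-class string against the tag's full class attribute.
posts = soup.find_all("div", {"class": "post content image-caption-format-1"})
for post in posts:
    desc = post.p.get_text(strip=True)
    author = post.find("span", {"class": "author"}).get_text(strip=True)
    date = post.find("span", {"class": "date"}).get_text(strip=True)
    print(desc, author, date)
```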

What can we do with the captured data?

You can perform a variety of analyses on the data collected in the Excel sheet. The first is word cloud generation; after that we will introduce topic modeling, another NLP technique.

Word cloud

1) What is word cloud:

A word cloud is a visual representation that highlights the high-frequency words in a text corpus, after removing the least important common English words (called stop words) and other alphanumeric noise.

2) Use word cloud:

This is an interesting way to view text data and gain useful insights immediately without having to read the entire text.

3) Tools and knowledge required:

Python

4) Summary:

In this step, we again use the Excel data scraped earlier as our input.

5) Code


6) Explanation of some terms used in the code:

Stop words are common words used in sentence construction. They usually add no meaning of their own and do not help us gain any insight; examples include a, the, this, that, who, etc.

7) Word cloud output


8) Read output:

The prominent words are QA, SQL, testing, developers, microservices, etc. These show us the most commonly used words in the article_para column of our data frame.

Topic modeling

1) What is topic modeling:

Topic modeling is an NLP technique. The goal is to identify the various topics present in a corpus of texts or documents.

2) Uses of topic modeling:

Its purpose is to identify all the topics present in a particular collection of texts or documents.

3) Tools and knowledge required:

  • Python
  • Gensim
  • NLTK

4) Code summary:

We will apply LDA (Latent Dirichlet Allocation) for topic modeling, generate the topics, and print them to inspect the output.

5) Code


6) Read output:

We can change the parameter values to get any number of topics, or any number of words displayed per topic. Here we want five topics, each containing seven words. We can see that these topics relate to Java, Salesforce, unit testing, and microservices. If we increase the number of topics to, say, 10, we can find other topics present in the corpus as well.



