Crawling NetEase Cloud Music comments! Python crawler beginner practice (VI): an introduction to Selenium

Time:2020-4-5

When it comes to crawlers, the comments on NetEase Cloud Music are often the first thing that comes to mind. There are many treasures hidden in those comments, so let's learn how to dig them up with Python!

Since it's a treasure, it is naturally locked away: the comment request is encrypted with a key. Open Chrome's developer tools and inspect the request headers.

The request parameters look complicated, so we won't try to call this link directly with requests.

This time we use Selenium, a browser automation and testing framework that can simulate manual operation of the browser!

To do this, we need two things: the ChromeDriver driver and the Chrome browser.

ChromeDriver can be downloaded from the Taobao mirror. Pick the version that matches your installed Chrome browser. The download address is as follows.
http://npm.taobao.org/mirrors…

The whole project uses Python 3, the standard json module, and a few third-party libraries. See below.

import json

from selenium import webdriver
import jieba
from wordcloud import WordCloud
from PIL import Image
import numpy as np

Then configure config.json.

{
  "id":"1336789644",
  "page": 200,
  "useCache": true,
  "font_path": "SimHei.ttf",
  "mask": "mask.png",
  "chromedriver": "chromedriver"
}
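
The later snippets refer to CONFIG, SOUND_ID and PAGES without showing where they come from. A minimal sketch of how they might be loaded is given here. The helper name load_config and the mapping of SOUND_ID to the "id" key and PAGES to the "page" key are assumptions, not taken from the original script.

```python
import json


def load_config(path='config.json'):
    """Read the JSON settings file and return its contents as a dict."""
    # The encoding is given explicitly so the file reads the same on every platform
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Hypothetical usage (the key-to-name mapping is an assumption):
# CONFIG = load_config()
# SOUND_ID = CONFIG['id']
# PAGES = CONFIG['page']
```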

Run sound.py to generate a word cloud, along with a file containing all of the comment data.

Now that we've seen how to use it, let's move on to the analysis!

Look at the address of a song page on NetEase Cloud Music, work out its pattern, and open it with webdriver!

# Selenium 3 style: the first positional argument is the path to chromedriver
driver = webdriver.Chrome(CONFIG['chromedriver'])
driver.get(f'https://music.163.com/#/song?id={SOUND_ID}')

Then switch the driver into the iframe that holds the comment box.

driver.switch_to.frame('g_iframe')

Why? The comment data lives inside this iframe, and an XPath query issued from the top-level page cannot reach elements inside a frame until the driver has switched into it.

Select one of the comments and inspect its markup. You can see that all comments share the same class name.

Write the corresponding XPath to get a list of all comments.

element_list = driver.find_elements_by_xpath('//div[@class="cnt f-brk"]')

Select the "next page" button and inspect its markup. You can see that its class name starts with a fixed prefix.

Write the corresponding XPath, get the next button, and simulate clicking when needed.

next_button = driver.find_element_by_xpath('//a[starts-with(@class,"zbtn znxt js-n-")]')
driver.execute_script('arguments[0].click();', next_button)
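
Putting the two XPath expressions together, the page-by-page collection could look roughly like the sketch below. The function name collect_comments, the delay parameter, and the fixed sleep are assumptions for illustration. It uses the generic find_elements(by, value) call, where the string 'xpath' is the value behind Selenium's By.XPATH constant, and it expects a driver that has already been switched into the comment iframe.

```python
import time

COMMENT_XPATH = '//div[@class="cnt f-brk"]'
NEXT_XPATH = '//a[starts-with(@class,"zbtn znxt js-n-")]'


def collect_comments(driver, pages, delay=1.0):
    """Collect comment texts from up to `pages` pages of the comment list."""
    comments = []
    for page in range(pages):
        # Read the text of every comment element on the current page
        for element in driver.find_elements('xpath', COMMENT_XPATH):
            comments.append(element.text)
        if page == pages - 1:
            break  # enough pages collected, no need to click again
        buttons = driver.find_elements('xpath', NEXT_XPATH)
        if not buttons:
            break  # no next button found, so stop early
        # Click through JavaScript, as in the snippet above
        driver.execute_script('arguments[0].click();', buttons[0])
        time.sleep(delay)  # crude pause to let the next page render
    return comments
```

A fixed sleep is the simplest option. An explicit wait on the new page's content would be more robust but is left out to keep the sketch short.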

With the page structure analyzed, it's time to generate the results.

Save the comment list as JSON.

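# Write as UTF-8 so the Chinese text kept by ensure_ascii=False is saved correctly
with open(filePath, 'w', encoding='utf-8') as f:
    json.dump(comments_list, f, ensure_ascii=False, indent=4)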
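
config.json contains a "useCache" flag, but the article does not show how it is used. One plausible reading is that a previously saved JSON file is reused instead of crawling again. The sketch below illustrates that idea only. The helper names save_comments and load_cached_comments are hypothetical.

```python
import json
import os


def save_comments(path, comments):
    """Write the comment list to `path` as UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(comments, f, ensure_ascii=False, indent=4)


def load_cached_comments(path, use_cache=True):
    """Return the cached comment list, or None when no usable cache exists."""
    if not use_cache or not os.path.exists(path):
        return None
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

A caller would crawl only when load_cached_comments returns None, then store the fresh result with save_comments.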

Use jieba for word segmentation and wordcloud to generate the word cloud.

# Word cloud processing
image_mask = np.array(Image.open(CONFIG['mask']))
wordlist = jieba.cut(';'.join(comments_list))
wordcloud = WordCloud(font_path=CONFIG['font_path'], background_color='white', mask=image_mask, scale=1.5).generate(' '.join(wordlist))
# Save the image
wordcloud.to_file(f'./result/{SOUND_ID}-{PAGES}.png')
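
The snippet above passes every token to WordCloud, including single characters and punctuation. An optional variation, shown here only as a sketch, is to count the tokens first and drop the very short ones. The function word_frequencies and its min_length parameter are hypothetical. jieba is imported only when no other tokenizer is supplied.

```python
from collections import Counter


def word_frequencies(comments, tokenizer=None, min_length=2):
    """Count tokens across all comments, skipping very short tokens."""
    if tokenizer is None:
        import jieba  # imported lazily so the helper also works with other tokenizers
        tokenizer = jieba.cut
    counts = Counter()
    for comment in comments:
        for token in tokenizer(comment):
            token = token.strip()
            if len(token) >= min_length:
                counts[token] += 1
    return dict(counts)
```

The resulting dict can be handed to WordCloud's generate_from_frequencies method in place of the generate call on joined text.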

Those are all the steps for crawling NetEase Cloud Music comments with Selenium!


This article is only for personal learning and communication. Please do not use it for other purposes!

