Using Python to crawl comments on Wushan Wuxing, the most dazzling Chinese animation, to see what 100,000 netizens are saying

Time: 2021-02-21

The text and images in this article come from the Internet and are for learning and communication only, not for any commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly.

This article comes from Tencent Cloud, by Python sophomore.
Anyone who follows animation should know that a stunning donghua, "Wushan Wuxing" (Fog Hill of Five Elements), was released recently and has been widely praised for its unique ink-painting style and blazing fight scenes. Within 24 hours of the first episode airing, it topped Bilibili's trending search, and its Douban score opened at 9.5, so its popularity speaks for itself. As far as fight scenes go, it is no exaggeration to call it the most dazzling animation around. The only shortcoming is that the season is short, with just 3 episodes.

Having watched the animation, do you agree that calling it the most dazzling is no empty claim? Next, let's crawl some comments to see what viewers think of it. We will pull data from three platforms: Bilibili, Weibo, and Douban.
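All the snippets below rely on a handful of third-party packages. Assuming a standard Python 3 environment, they can be installed up front with:

pip install requests beautifulsoup4 lxml pandas jieba wordcloud numpy pillow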

Crawling Bilibili

Let's first crawl the danmaku (bullet comment) data from Bilibili. The animation is at: https://www.bilibili.com/bangumi/play/ep331423 and the danmaku feed is at: http://comment.bilibili.com/186803402.xml The number in the XML file name is the video's cid.
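If you need to locate the cid for a different episode yourself, it is usually embedded in the episode page's source. Here is a hedged sketch; the regex relies on the current page layout, which Bilibili may change at any time:

import re
import requests

# Fetch the episode page and pull the first "cid" field out of the embedded JSON
page = requests.get(
    "https://www.bilibili.com/bangumi/play/ep331423",
    headers={"User-Agent": "Mozilla/5.0"},
)
cid = re.search(r'"cid":(\d+)', page.text).group(1)
print("http://comment.bilibili.com/%s.xml" % cid)

With the danmaku URL in hand, the crawling code is as follows: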

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download the danmaku XML
url = "http://comment.bilibili.com/218796492.xml"
req = requests.get(url)
html = req.content
html_doc = str(html, "utf-8")
# Parse the XML and pull out every <d> (danmaku) element
soup = BeautifulSoup(html_doc, "lxml")
results = soup.find_all('d')
contents = [x.text for x in results]
# Save the results
dic = {"contents": contents}
df = pd.DataFrame(dic)
df["contents"].to_csv("bili.csv", encoding="utf-8", index=False)

 

If you are not familiar with crawling Bilibili danmaku data, you can refer to the earlier article on crawling Bilibili danmaku.

We then generate a word cloud from the crawled danmaku data. The implementation is as follows:

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Open the comment data file
    content = open("bili.csv", "r", encoding="utf-8").read()
    # Segment the text with jieba
    word_list = jieba.cut(content)
    words = []
    # Load the stopword list used to filter out noise words
    stopwords = open("stopwords.txt", "r", encoding="utf-8").read().split("\n")[:-1]
    for word in word_list:
        if word not in stopwords:
            words.append(word)
    global word_cloud
    # Join the words with commas
    word_cloud = ','.join(words)

def cloud():
    # Open the background image for the word cloud
    cloud_mask = np.array(Image.open("bg.png"))
    # Configure the word cloud
    wc = WordCloud(
        # White background
        background_color='white',
        # Background shape mask
        mask=cloud_mask,
        # Maximum number of words to show
        max_words=500,
        # Font that can render Chinese
        font_path='./fonts/simhei.ttf',
        # Maximum font size
        max_font_size=60,
        repeat=True
    )
    global word_cloud
    # Generate the word cloud
    x = wc.generate(word_cloud)
    # Render it as an image
    image = x.to_image()
    # Show the image
    image.show()
    # Save the image
    wc.to_file('cloud.png')

jieba_()
cloud()

 

Take a look at the effect:
[Word cloud generated from the Bilibili danmaku]
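Incidentally, if you want finer control over word weighting, the wordcloud library can also build the image from explicit counts via generate_from_frequencies instead of a comma-joined string. A minimal variant, assuming the same bili.csv and stopwords.txt files:

from collections import Counter

import jieba
from wordcloud import WordCloud

def cloud_from_counts():
    # Count each segmented word, skipping stopwords and single characters
    text = open("bili.csv", "r", encoding="utf-8").read()
    stopwords = set(open("stopwords.txt", "r", encoding="utf-8").read().split("\n"))
    counts = Counter(w for w in jieba.cut(text) if w not in stopwords and len(w) > 1)
    wc = WordCloud(background_color="white", font_path="./fonts/simhei.ttf", max_words=500)
    # Build the cloud directly from the frequency dict
    wc.generate_from_frequencies(counts)
    wc.to_file("cloud_freq.png")

cloud_from_counts()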

Crawling Weibo

Next we crawl the Weibo comments on the animation. Our target is the comment data on the pinned post of the Wushan Wuxing official Weibo account, as shown in the figure:
[Screenshot: the pinned post on the Wushan Wuxing official Weibo]

The crawling code is implemented as follows:

import re
import time

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Crawl one page of comments
def get_one_page(url):
    headers = {
        'User-agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
        'Host' : 'weibo.cn',
        'Accept' : 'application/json, text/plain, */*',
        'Accept-Language' : 'zh-CN,zh;q=0.9',
        'Accept-Encoding' : 'gzip, deflate, br',
        'cookie' : 'your own cookie',
        'DNT' : '1',
        'Connection' : 'keep-alive'
    }
    # Fetch the HTML
    response = requests.get(url, headers=headers, verify=False)
    # Crawl succeeded
    if response.status_code == 200:
        # Return the HTML document, which is passed to the parsing function
        return response.text
    return None

# Parse and save the comment information
def save_one_page(html):
    # The tag pattern was stripped from the original post; <span class="ctt"> is
    # the tag weibo.cn wraps comment text in (an assumption worth verifying)
    comments = re.findall('<span class="ctt">(.*?)</span>', html)
    for comment in comments[1:]:
        # Strip any remaining inner HTML tags
        result = re.sub('<.*?>', '', comment)
        # Skip replies to other comments ('回复@' means 'reply @')
        if '回复@' not in result:
            with open('wx_comment.txt', 'a+', encoding='utf-8') as fp:
                fp.write(result + '\n')

for i in range(50):
    url = 'https://weibo.cn/comment/Je5bqpmCn?uid=6569999648&rl=0&page=' + str(i)
    html = get_one_page(url)
    print('Crawling comments on page %d' % (i + 1))
    save_one_page(html)
    time.sleep(3)

 

For those unfamiliar with crawling Weibo comments, you can refer to the earlier article on crawling Weibo comments.

Similarly, we generate a word cloud from the comments to see the effect.
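The logic is identical to the Bilibili section; only the input file changes. A minimal sketch, where jieba_file is a small helper written here for illustration, reusing cloud() and stopwords.txt from above:

import jieba

def jieba_file(path):
    # Segment the comment file and drop stopwords, as in the Bilibili section
    content = open(path, "r", encoding="utf-8").read()
    stopwords = open("stopwords.txt", "r", encoding="utf-8").read().split("\n")[:-1]
    return ','.join(w for w in jieba.cut(content) if w not in stopwords)

# cloud() reads the global word_cloud variable, so set it before the call
word_cloud = jieba_file("wx_comment.txt")
cloud()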
[Word cloud generated from the Weibo comments]
Crawling Douban

Finally, we crawl the Douban short-review data for the animation. Its Douban page is: https://movie.douban.com/subject/30395914/ The crawling code is implemented as follows:

import random
import time

import pandas as pd
import requests
from lxml import etree

def spider():
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    # Comment URL; a formatted number after "start" turns the page.
    # Each short-review page holds 20 items, so start increases by 20 per page
    url_comment = 'https://movie.douban.com/subject/30395914/comments?start=%d&limit=20&sort=new_score&status=P'
    data = {
        'ck': '',
        'name': 'user name',
        'password': 'password',
        'remember': 'false',
        'ticket': ''
    }
    session = requests.session()
    session.post(url=url, headers=headers, data=data)
    # Initialize four lists to store user name, star rating, time, and comment text
    users = []
    stars = []
    times = []
    content = []
    # Grab 500 comments, 20 per page, which is Douban's limit
    for i in range(0, 500, 20):
        # Fetch the HTML
        data = session.get(url_comment % i, headers=headers)
        # Status code 200 indicates success
        print('Page', i, 'status code:', data.status_code)
        # Pause 0-1 second to avoid getting the IP banned
        time.sleep(random.random())
        # Parse the HTML
        selector = etree.HTML(data.text)
        # Get all comments on the page with XPath
        comments = selector.xpath('//div[@class="comment"]')
        # Walk through every comment and extract the details
        for comment in comments:
            # User name
            user = comment.xpath('.//h3/span[2]/a/text()')[0]
            # Star rating (the digit inside the class name, e.g. allstar40 -> 4)
            star = comment.xpath('.//h3/span[2]/span[2]/@class')[0][7:8]
            # Time
            date_time = comment.xpath('.//h3/span[2]/span[3]/@title')
            # The time is sometimes empty, so check first
            if len(date_time) != 0:
                date_time = date_time[0]
                date_time = date_time[:10]
            else:
                date_time = None
            # Comment text
            comment_text = comment.xpath('.//p/span/text()')[0].strip()
            # Append everything to the lists
            users.append(user)
            stars.append(star)
            times.append(date_time)
            content.append(comment_text)
    # Pack into a dictionary
    comment_dic = {'user': users, 'star': stars, 'time': times, 'comments': content}
    # Convert to a DataFrame
    comment_df = pd.DataFrame(comment_dic)
    # Save the data
    comment_df.to_csv('db.csv')
    # Save the comments separately
    comment_df['comments'].to_csv('comment.csv', index=False)

spider()
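Since the Douban scrape also records star ratings, a quick pandas tally of db.csv gives a rough feel for the score distribution (a minimal sketch, assuming the CSV written above):

import pandas as pd

df = pd.read_csv("db.csv")
# Count how many short reviews gave each star level (1-5)
print(df["star"].value_counts().sort_index())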

 

For those unfamiliar with crawling Douban comments, you can refer to the earlier article on crawling Douban comments.

Take a look at the generated word cloud effect:
[Word cloud generated from the Douban reviews]
