Crawling the bullet comments of "Houlang" with Python and displaying them as a word cloud

Time:2021-2-20

The text and pictures in this article come from the Internet and are for learning and communication only, not for any commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly.

This article is from Tencent Cloud, by Python sophomore.
A few days ago, Bilibili (station B) released a short video called "Houlang", which drew a strong response across the web, with both praise and criticism: https://www.bilibili.com/video/BV1FV411d7u7 . In this article, we crawl the video's bullet comments (the "barrage") to see what Bilibili users think of it.

A video's bullet comments are stored in an XML file whose link has the format http://comment.bilibili.com/ + cid + .xml, so we only need the video's CID.
To find it, open the video link https://www.bilibili.com/video/BV1FV411d7u7 , press F12 to open the developer tools, select the Network tab, and refresh the page. Typing "cid" into the filter box narrows the list to the requests that carry the CID.
With the CID in hand (186803402), the bullet-comment file link is http://comment.bilibili.com/186803402.xml . Opening the link shows an XML document in which each comment is stored in a <d> element.
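
The link format described above can be wrapped in a small helper. This is only a minimal sketch: the function name danmaku_url is hypothetical and does not appear in the original article, and the URL pattern is simply the one quoted in the text.

```python
def danmaku_url(cid):
    """Build the bullet-comment XML link for a CID (pattern quoted in the article)."""
    return "http://comment.bilibili.com/{}.xml".format(cid)


# The CID used in this article
print(danmaku_url(186803402))
```

Keeping the pattern in one place makes it easy to reuse the crawler for another video once its CID is known.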

The code for crawling the bullet comments is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://comment.bilibili.com/186803402.xml"
# Download the bullet-comment XML file
req = requests.get(url)
html = req.content
# Decode the raw bytes as UTF-8
html_doc = str(html, "utf-8")
# Parse the document; each comment is stored in a <d> element
soup = BeautifulSoup(html_doc, "lxml")
results = soup.find_all('d')
contents = [x.text for x in results]
# Save the results
dic = {"contents": contents}
df = pd.DataFrame(dic)
df["contents"].to_csv("bili.csv", encoding="utf-8", index=False)
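
To see what the parsing step does without touching the network, the sketch below extracts comments from a small made-up XML snippet. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, so it is an alternative illustration, not the article's method. The sample text, the <i> root element, and the helper name extract_comments are assumptions; only the use of <d> elements comes from the code above.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the comment file: one <d> element per comment.
SAMPLE_XML = """<i>
  <d>first comment</d>
  <d>second comment</d>
  <d></d>
</i>"""


def extract_comments(xml_text):
    """Return the text of every non-empty <d> element in the document."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]


print(extract_comments(SAMPLE_XML))
```

Skipping empty elements keeps blank rows out of the CSV that the later steps read.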

 

Now that we have the bullet-comment data, we display it as a word cloud. The code is as follows:

import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Read the saved comment data
    with open("bili.csv", "r", encoding="utf-8") as f:
        content = f.read()
    # Segment the text into words with jieba
    word_list = jieba.cut(content)
    words = []
    # Load the stop words to filter out (one per line)
    with open("stopwords.txt", "r", encoding="utf-8") as f:
        stopwords = set(f.read().splitlines())
    for word in word_list:
        if word not in stopwords:
            words.append(word)
    global word_cloud
    # Join the remaining words with commas
    word_cloud = ','.join(words)

def cloud():
    # Load the image used as the word cloud mask
    cloud_mask = np.array(Image.open("bg.png"))
    # Configure the word cloud
    wc = WordCloud(
        # Background color of the image
        background_color='white',
        # Shape mask
        mask=cloud_mask,
        # Maximum number of words to display
        max_words=500,
        # Font that can render Chinese characters
        font_path='./fonts/simhei.ttf',
        # Maximum font size
        max_font_size=60,
        # Repeat words until max_words or the minimum font size is reached
        repeat=True
    )
    global word_cloud
    # Build the word cloud from the comma-separated words
    x = wc.generate(word_cloud)
    # Render the word cloud as an image
    image = x.to_image()
    # Show the word cloud image
    image.show()
    # Save the word cloud image
    wc.to_file('cloud.png')

if __name__ == "__main__":
    # jieba_() must run first because it sets the global word_cloud
    jieba_()
    cloud()
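
A word cloud sizes each word by how often it occurs, so it can help to look at the counts directly before drawing anything. The following is a minimal sketch of that idea, assuming the tokens have already been produced by a segmenter such as jieba.cut. The function name filter_and_count and the sample tokens are made up for illustration.

```python
from collections import Counter


def filter_and_count(tokens, stopwords):
    """Drop stop words and blank tokens, then count the remaining words."""
    stop = set(stopwords)
    kept = [t for t in tokens if t.strip() and t not in stop]
    return Counter(kept)


# Made-up tokens standing in for the output of a word segmenter
tokens = ["后浪", "的", "加油", "后浪", ",", "的"]
counts = filter_and_count(tokens, ["的", ","])
print(counts.most_common(2))
```

Printing the most common words is a quick way to check whether the stop word list is filtering enough before generating the image.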

 

Running jieba_() and then cloud() displays the word cloud and saves it as cloud.png.