Python analysis of the movie “Party at South Station”: what do viewers think of it?



This article is from Tencent Cloud, by Python sophomore.

The film, released internationally as “The Wild Goose Lake”, was directed by Diao Yinan and stars Hu Ge, Gwei Lun-mei, Liao Fan and Wan Qian. It premiered at the Cannes Film Festival on May 18, 2019 and was officially released in China on December 6, 2019. Inspired by real news events, it tells the story of Zhou Zenong (Hu Ge), the leader of a theft gang, who goes on the run with a large bounty on his head and struggles to find redemption.

The film has been out for more than a week, and its box office is close to 200 million. For an art-house film, that is an above-average result. Next, open Douban and take a look at the score, as shown in the figure below:

To crawl the reviews we need to log in to Douban, so we first look at what the login request contains. Enter a random phone number / email and password (not your real credentials), press F12 to open the developer tools, and then click the Douban login button. The result is as follows:
Click the basic item in the figure above; the result is as follows:
Everything we need is now available. The next step is the implementation. Douban login and review crawling are implemented as follows:

import requests
import time
import random
from lxml import etree
import csv

# Create the CSV file
csvfile = open('party at south station.csv', 'w', encoding='utf-8', newline='')
writer = csv.writer(csvfile)
writer.writerow(['time', 'star', 'comment content'])

def spider():
    url = ''
    headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    comment_url = ''
    data = {
        'ck': '',
        'name': 'your own user name',
        'password': 'your own password',
        'remember': 'false',
        'ticket': ''
    }
    session = requests.session()
    # Log in so that later requests carry the session cookies
    session.post(url=url, headers=headers, data=data)
    # 500 in total, 20 per page
    for i in range(0, 500, 20):
        # Get HTML
        data = session.get(comment_url % i, headers=headers)
        print('page', i, 'status code:', data.status_code)
        # Pause 0-1 seconds
        time.sleep(random.random())
        #Parsing HTML
        selector = etree.HTML(data.text)
        #Get all comments on current page
        comments = selector.xpath('//div[@class="comment"]')
        #Traverse all comments
        for comment in comments:
            #Get stars
            star = comment.xpath('.//h3/span[2]/span[2]/@class')[0][7]
            #Get time
            t = comment.xpath('.//h3/span[2]/span[3]/text()')
            #Get comments
            content = comment.xpath('.//p/span/text()')[0].strip()
            #Exclude items with empty time
            if len(t) != 0:
                t = t[0].strip()
                writer.writerow([t, star, content])
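
The star value above is taken as the eighth character of the rating span's class attribute. A minimal sketch of a more defensive version is shown below. It assumes the class string looks like "allstar40 rating" for rated comments, which is an assumption about the page markup and not something the article states; `parse_star` is a hypothetical helper name.

```python
import re

# Assumed format of the rating span's class attribute, e.g. "allstar40 rating".
# This pattern is an assumption about the page markup.
_STAR_PATTERN = re.compile(r"allstar(\d)0")


def parse_star(class_attr):
    """Return the star count as an int, or None if no rating is present."""
    match = _STAR_PATTERN.search(class_attr)
    if match is None:
        return None
    return int(match.group(1))


print(parse_star("allstar40 rating"))  # a rated comment
print(parse_star("comment-time"))      # a comment without a rating
```

Returning None for unrated comments makes it possible to skip them explicitly, instead of relying on the empty-time check to filter them out.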


Next, we use a word cloud to show the overall tone of the reviews at a glance. The implementation is as follows:

import csv
import jieba
from wordcloud import WordCloud
import numpy as np
from PIL import Image

#Jieba word segmentation processing
def jieba_():
    csv_list = csv.reader(open('party at south station.csv', 'r', encoding='utf-8'))
    comments = ''
    for i,line in enumerate(csv_list):
        if i != 0:
            comment = line[2]
            comments += comment
    # Segment the text with jieba
    words = jieba.cut(comments)
    new_words = []
    # Words to exclude (shown here in translation)
    remove_words = ['and', 'lie in', 'some', 'one scene', 'only',
                    'but', 'things', 'scenes', 'all', 'so',
                    'but', 'the whole film', 'before', 'one film', 'one film',
                    'as', 'though', 'everything', 'how', 'performance',
                    'character', 'No', 'not', 'a kind', 'personal',
                    'if', 'after', 'Come out', 'Start', 'Is',
                    'movie', 'or', 'not', 'Wuhan', 'lens']
    for word in words:
        if word not in remove_words:
            new_words.append(word)
    global word_cloud
    # Separate words with commas
    word_cloud = ','.join(new_words)

# Generate the word cloud
def world_cloud():
    # Background image
    cloud_mask = np.array(Image.open('bg.jpg'))
    wc = WordCloud(
        # The background color, maximum number of words, Chinese font and
        # font size limit were also set here; their values are not preserved
        # Background image used as the mask
        mask=cloud_mask
    )
    global word_cloud
    x = wc.generate(word_cloud)
    # Generate the word cloud image
    image = x.to_image()
    # Show the word cloud image
    image.show()
    # Save the word cloud image


Word cloud of all comments
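
The filtering step can also be expressed as a frequency count. The sketch below uses only the standard library and a made-up token list standing in for the output of jieba.cut; `word_frequencies` is a hypothetical helper, not part of the original code.

```python
from collections import Counter


def word_frequencies(tokens, stop_words):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    stop = set(stop_words)  # set lookup is faster than scanning a list
    kept = (t.strip() for t in tokens)
    return Counter(t for t in kept if t and t not in stop)


# Hypothetical tokens standing in for the output of jieba.cut(comments).
sample_tokens = ["film", "and", "lens", "film", " ", "actor", "and", "film"]
freqs = word_frequencies(sample_tokens, ["and", "lens"])
print(freqs.most_common(2))
```

A mapping like this can be passed to WordCloud.generate_from_frequencies instead of joining the words back into one string, which avoids re-tokenizing the text inside the word cloud library.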

Some viewers say that opinion on the film is polarized. Next, let's look at the word clouds for 1-star and 5-star reviews. The main change is as follows:

for i,line in enumerate(csv_list):
    if i != 0:
        star = line[1]
        comment = line[2]
        # Use '1' for one-star reviews and '5' for five-star reviews
        if star == '1':
            comments += comment


Word cloud of one-star comments

Word cloud of five-star comments
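
Rather than editing the star value and rerunning the loop for each word cloud, the comments can be grouped by star in a single pass. This is a minimal sketch using made-up rows in the same column order as the CSV (time, star, comment content); `comments_by_star` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def comments_by_star(rows):
    """Map each star value to the concatenated comment text for that star."""
    grouped = defaultdict(str)
    for row in rows:
        # Columns follow the CSV layout: time, star, comment content.
        star, comment = row[1], row[2]
        grouped[star] += comment
    return dict(grouped)


# Made-up sample rows standing in for the scraped CSV file.
sample = io.StringIO(
    "time,star,comment content\n"
    "2019-12-06,5,great\n"
    "2019-12-07,1,dull\n"
    "2019-12-08,5,moody\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
grouped = comments_by_star(reader)
print(grouped["1"], grouped["5"])
```

The text for each star can then be segmented and turned into its own word cloud.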
So far we have only used the comment text; the time and star fields have gone unused. Finally, we can use these two fields to analyze how the star rating fluctuates over time, computing it by month from the premiere (May 2019) to the present (December 2019). The implementation is as follows:

import csv
from pyecharts.charts import Line
import pyecharts.options as opts
import numpy as np
from datetime import datetime

def score():
    csv_list = csv.reader(open('party at south station.csv', 'r', encoding='utf-8'))
    print('csv_list', csv_list)
    comments = ''
    ts = []
    ss = set()
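    for i, line in enumerate(csv_list):
        if i != 0:
            t = line[0][0:7]
            s = line[1]
            # Store each record as 'YYYY-MM:star' and collect the distinct months
            ts.append(t + ':' + s)
            ss.add(t)
    new_times = []
    new_starts = []
    new_ss = []
    for i in ss:
        new_ss.append(i)
    arr = np.array(new_ss)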
    new_ss = arr[np.argsort([datetime.strptime(i, '%Y-%m') for i in np.array(new_ss)])].tolist()
    for i in new_ss:
        x = 0
        y = 0
        z = 0
        for j in ts:
            t = j.split(':')[0]
            s = int(j.split(':')[1])
            if i == t:
                x += s
                z += 1
        new_starts.append(round(x / z, 1))
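    c = (
        Line()
        # The x-axis call is missing in the source; the sorted months are assumed
        .add_xaxis(new_ss)
        .add_yaxis('party at South Station', new_starts)
        .set_global_opts(title_opts=opts.TitleOpts(title='douban star fluctuation chart'))
    )
    # Assumed final step (not in the source): write the chart to an HTML file
    c.render()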
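
The monthly averaging can also be done with a dictionary, without the string splitting and the nested loop. The sketch below uses made-up (time, star) rows and assumes times start with 'YYYY-MM'; `monthly_average` is a hypothetical helper, not the author's code.

```python
from collections import defaultdict


def monthly_average(rows):
    """Return (months, averages), sorted by month, from (time, star) rows."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of stars, count]
    for time_text, star in rows:
        if not star.isdigit():
            continue  # skip rows without a numeric star value
        month = time_text[:7]  # 'YYYY-MM'
        totals[month][0] += int(star)
        totals[month][1] += 1
    # 'YYYY-MM' strings sort chronologically as plain text.
    months = sorted(totals)
    averages = [round(totals[m][0] / totals[m][1], 1) for m in months]
    return months, averages


# Made-up sample rows, not real Douban data.
rows = [
    ("2019-12-06", "5"),
    ("2019-05-18", "4"),
    ("2019-12-07", "2"),
    ("2019-12-08", "-"),
]
months, averages = monthly_average(rows)
print(months, averages)
```

The two returned lists can be used directly as the x-axis and y-axis values of the line chart.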


The star fluctuation of the film is shown in the figure below:

From the fluctuation in stars, we can get a rough sense of how the film's rating is likely to move.
