Box office and word of mouth dominate the National Day season: using Python to crawl Maoyan's review section to see how good "Me and My Hometown" really is

Time: 2021-2-23

This year's National Day film market performed quite strongly, with the two headliners "Me and My Hometown" and "Jiang Ziya" playing a leading role.

Jiang Ziya broke 200 million yuan on its opening day, beating the first-day box office record for animated films in China's film market previously held by Ne Zha. However, as its word of mouth declined, Jiang Ziya has since been comprehensively surpassed by Me and My Hometown in both reputation and box office. Barring surprises, Me and My Hometown will be the biggest winner of this year's National Day season.

[Figure: Maoyan rating page for Me and My Hometown]

From the screenshot above, we can see that "Me and My Hometown" currently has 296,000 ratings on Maoyan, with an overall score of 9.3, which is a pretty good result. In this article, we crawl the film's Maoyan reviews and analyze the content of the review section.

Crawling

First, let's crawl the Maoyan review data. Since the PC site only shows a handful of reviews, we use the app interface instead. The interface format is:

http://m.maoyan.com/mmdb/comments/movie/movieid.json?_v_=yes&offset=15&startTime=xxx

The two parameters are described as follows:

  • movieid: the unique ID of the movie on the site
  • startTime: the time of the first comment on the current page; each page contains 15 comments

The main crawling code is as follows:

import requests

# Get page content
def get_page(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'
                      '/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
        'accept': '*/*'
    }
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        return r.text
    except requests.HTTPError as e:
        print(e)
    except requests.RequestException as e:
        print(e)
    except Exception:
        print("wrong")

import json

# Parse the data returned by the interface
def parse_data(html):
    json_data = json.loads(html)['cmts']
    comments = []
    # Parse each comment and store it in the array
    try:
        for item in json_data:
            comment = []
            comment.append(item['nickName'])
            comment.append(item['cityName'] if 'cityName' in item else '')
            comment.append(item['content'].strip().replace('\n', ' '))  # comment content
            comment.append(item['score'])
            comment.append(item['startTime'])
            comment.append(item['time'])
            comment.append(item['approve'])
            comment.append(item['reply'])
            if 'gender' in item:
                comment.append(item['gender'])
            comments.append(comment)
        return comments
    except Exception as e:
        print(comment)
        print(e)

import pandas as pd

# Save the data by appending it to a CSV file
def save_data(comments):
    filename = 'comments.csv'
    dataObject = pd.DataFrame(comments)
    dataObject.to_csv(filename, mode='a', index=False, sep=',', header=False, encoding='utf_8_sig')

In this way we crawled about 20,000 comments and saved them to a CSV file.
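The driver that ties the three functions together is not shown above; a minimal sketch might look like the following. It pages backwards through the comments by feeding the time of the oldest comment on each page into the next request's startTime, as described earlier; the page count and the politeness delay are my own assumptions:

import time
from datetime import datetime

def crawl(movieid, pages=1000):
    # Start from "now"; each request returns the 15 comments posted before startTime
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for _ in range(pages):
        url = ('http://m.maoyan.com/mmdb/comments/movie/%s.json'
               '?_v_=yes&offset=15&startTime=%s' % (movieid, start_time.replace(' ', '%20')))
        html = get_page(url)
        if not html:
            break
        comments = parse_data(html)
        if not comments:
            break
        save_data(comments)
        # Field 5 of each parsed comment is its posting time; the oldest
        # comment on this page becomes the next page's startTime
        start_time = comments[-1][5]
        time.sleep(1)  # be polite to the server

# crawl('xxxx')  # replace with the movie's Maoyan movieid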

Data analysis

Star rating

First, let's look at the proportion of each star rating in the crawled data. The main code is as follows:

# Star rating
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.globals import ThemeType

# Load the crawled data (no header row was written)
df = pd.read_csv('comments.csv', header=None)

rates = []
for s in df.iloc[:, 3]:  # column 3 is the score
    rates.append(s)
sx = ["five star", "four star", "three star", "two star", "one star"]
sy = [
    str(rates.count(5.0) + rates.count(4.5)),
    str(rates.count(4.0) + rates.count(3.5)),
    str(rates.count(3.0) + rates.count(2.5)),
    str(rates.count(2.0) + rates.count(1.5)),
    str(rates.count(1.0) + rates.count(0.5))
]
(
    Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK, width='700px', height='400px'))
    .add("", list(zip(sx, sy)), radius=["40%", "70%"])
    .set_global_opts(title_opts=opts.TitleOpts(title="star rating ratio", subtitle="data source: Maoyan", pos_left="left"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=12))
).render_notebook()

The results are as follows

[Figure: pie chart of star-rating proportions]

We can see from the chart that nearly 90% of reviewers gave the film five stars, while one, two, and three stars together account for only about 5%, indicating that the film's quality is recognized by most viewers.

Gender ratio

Next, let’s look at the gender of the reviewers. The main implementation code is as follows:

# Gender ratio
rates = []
for s in df.iloc[:, 8]:  # column 8 is the gender field
    if s != 1 and s != 2:
        s = 3  # treat missing values as "unknown"
    rates.append(s)
gx = ["male", "female", "unknown"]
gy = [
    rates.count(1),
    rates.count(2),
    rates.count(3)
]
(
    Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK, width="700px", height="400px"))
    .add("", list(zip(gx, gy)))
    .set_global_opts(title_opts=opts.TitleOpts(title="gender ratio", subtitle="data source: Maoyan", pos_left="left"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=12))
).render_notebook()

The results are as follows

[Figure: pie chart of reviewer gender ratio]

From the above figure, we can see that most reviewers keep their gender private. Among those whose gender is visible, both men and women are active in the comment area, with women slightly more so.

Location distribution

Next, let's look at the geographic distribution of reviewers. First, the comment counts of the top 100 cities plotted on a map. The main code is as follows:

from collections import Counter
from pyecharts.charts import Geo

cities = []
for city in df.iloc[:, 1]:  # column 1 is the city name
    if city != "":
        cities.append(city)
data = Counter(cities).most_common(100)
gx1 = []
gy1 = []
for c in data:
    gx1.append(c[0])
    gy1.append(c[1])
geo = Geo(init_opts=opts.InitOpts(width="700px", height="400px", theme=ThemeType.DARK, bg_color="#404a59"))
(
    geo.add_schema(maptype="china", itemstyle_opts=opts.ItemStyleOpts(color="#323c48", border_color="#111"))
    .add("number of comments", list(zip(gx1, gy1)))
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
       toolbox_opts=opts.ToolboxOpts(),
       title_opts=opts.TitleOpts(title="location distribution (geographic coordinates)", subtitle="data source: Maoyan", pos_left="left"),
       visualmap_opts=opts.VisualMapOpts(max_=500, is_piecewise=True)
    )
).render_notebook()

The results are as follows

[Figure: map of comment counts by city]

Next, we show the top 15 cities by comment count in a bar chart. The main code is as follows:

from pyecharts.charts import Bar

data_top15 = Counter(cities).most_common(15)
gx2 = []
gy2 = []
for c in data_top15:
    gx2.append(c[0])
    gy2.append(c[1])
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK, width="700px", height="400px"))
    .add_xaxis(gx2)
    .add_yaxis("", gy2)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="top 15 cities by comments", subtitle="data source: Maoyan", pos_left="center")
    )
).render_notebook()

The results are as follows

[Figure: bar chart of the top 15 cities by comment count]

The two charts above give an intuitive view of which cities the commenters come from, and thus where interest in the film is concentrated.

Number of reviews

Let's look at how the number of comments varies over the 24 hours of the day. The main code is as follows:

from pyecharts.charts import Line

times = df.iloc[:, 5]  # column 5 is the comment time, e.g. "2020-10-05 19:23:01"
hours = []
for t in times:
    hours.append(str(t)[11:13])  # extract the hour
hdata = sorted(Counter(hours).most_common())
hx = []
hy = []
for c in hdata:
    hx.append(c[0])
    hy.append(c[1])
(
    Line(init_opts=opts.InitOpts(theme=ThemeType.CHALK, width="700px", height="400px"))
    .add_xaxis(hx)
    .add_yaxis("", hy, areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="number of comments in 24 hours", subtitle="data source: Maoyan", pos_left="center")
    )
).render_notebook()

The results are as follows

[Figure: line chart of comment counts by hour]

As the figure shows, commenters are more active in the afternoon and evening; around 19:00 is dinner time, so the dip in comment volume during that period is reasonable.

Leading actors

Next, let's look at how often the main actors (and their roles) are mentioned in the comment area. The main code is as follows:

cts_list = df.iloc[:, 2]  # column 2 is the comment content
cts_str = "".join([str(i) for i in cts_list])
px = ["Huang Bo", "Wang Baoqiang", "Liu Haoran", "Ge You", "Liu Mintao", "Fan Wei", "Zhang Yi", "Deng Chao", "Yan Ni", "Shen Teng", "Ma Li"]
# Count each actor's name together with the name of the role they play
py = [cts_str.count("Huang Bo") + cts_str.count("Huang Dabao"), cts_str.count("Wang Baoqiang") + cts_str.count("Lao Tang"),
      cts_str.count("Liu Haoran") + cts_str.count("Xiao Qin"), cts_str.count("Ge You") + cts_str.count("Zhang Beijing"),
      cts_str.count("Liu Mintao") + cts_str.count("Lingzi"), cts_str.count("Fan Wei") + cts_str.count("Lao Fan"),
      cts_str.count("Zhang Yi") + cts_str.count("Jiang Qianfang"), cts_str.count("Deng Chao") + cts_str.count("Qiao Shulin"),
      cts_str.count("Yan Ni") + cts_str.count("Yan Feiyan"), cts_str.count("Shen Teng") + cts_str.count("Ma Liang"),
      cts_str.count("Ma Li") + cts_str.count("Qiu Xia")]
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK, width="700px", height="400px"))
    .add_xaxis(px)
    .add_yaxis("", py)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="mentions of the main actors and their roles", subtitle="data source: Maoyan", pos_left="center")
    )
).render_notebook()

The results are as follows

[Figure: bar chart of actor/role mention counts]

From the chart, the three most-mentioned main actors in the comment area are Shen Teng, Fan Wei and Deng Chao, which shows that these actors are very popular and sparked wide discussion.

Film segments

Next, let's see how often each of the film's five segments is mentioned in the comment area. The main code is as follows:

mx = ["A UFO Falls from the Sky", "Beijing Good Man", "The Last Lesson", "The Way Back Home", "Magic Pen Ma Liang"]
my = [cts_str.count("A UFO Falls from the Sky"), cts_str.count("Beijing Good Man"), cts_str.count("The Last Lesson"),
      cts_str.count("The Way Back Home"), cts_str.count("Magic Pen Ma Liang")]
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="700px", height="400px"))
    .add_xaxis(mx)
    .add_yaxis("", my)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="mentions of each film segment", subtitle="data source: Maoyan", pos_left="center")
    )
).render_notebook()

The results are as follows

[Figure: bar chart of film segment mention counts]

From the chart, the segment "The Last Lesson" is mentioned more times than all the other segments combined, which suggests it was especially popular and resonated widely, standing out from the rest.

Word cloud display

Overall word cloud

First, let's look at the word cloud of all the comments. The code is as follows:

import stylecloud
from IPython.display import Image

cts_list = df.iloc[:, 2]
cts_str = "".join([str(i) for i in cts_list])
stylecloud.gen_stylecloud(text=cts_str, max_words=400,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-home",
                          size=800,
                          output_name="total.png")
Image(filename="total.png")

The results are as follows

[Figure: word cloud of all comments]

From the picture, we can intuitively see that words like "good-looking", "very good-looking", "worth watching", "good" and "The Last Lesson" appear most often, which shows that most people are satisfied with the film and that the segment "The Last Lesson" is especially popular and resonates with many viewers.
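A practical note (my own addition, not part of the original pipeline): stylecloud, like the underlying wordcloud library, splits text on whitespace and does not segment Chinese on its own, so if the cloud comes out as long unbroken phrases, the comment text can be tokenized with jieba first:

import jieba

# Segment the Chinese comment text into space-separated tokens so the
# word-cloud tokenizer can pick out individual words
cts_str = " ".join(jieba.cut(cts_str))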

Hot-comment word cloud

Finally, let's look at the word cloud of popular comments (those with many likes or replies). The code is as follows:

hot_str = ""
for index, row in df.iterrows():
    content = row[2]
    support = row[6]  # number of likes
    reply = row[7]    # number of replies
    if support > 30 or reply > 5:
        hot_str += str(content)
stylecloud.gen_stylecloud(text=hot_str, max_words=200,
                          collocations=False,
                          font_path="SIMLI.TTF",
                          icon_name="fas fa-fire",
                          size=800,
                          output_name="hot.png")
Image(filename="hot.png")

The results are as follows

[Figure: word cloud of hot comments]

The tone of these popular comments is a little different from what came before. The most eye-catching words are: "UFO", "ugly", "I didn't watch the movie", "I broke up with my partner ten minutes before the show"... I won't say more; go experience it for yourself~

Since the amount of data collected is limited, the results may deviate somewhat from the actual situation; please treat them rationally.

Reply "201005" in the background of the official account "Python sophomore" to get the source code.