Crawling All Danmaku (Bullet Comments) of Any Bilibili Uploader with Python

Time: 2019-10-18

Scraping Bilibili danmaku is not hard. To get the danmaku of all of an uploader's videos, first open the uploader's video page at https://space.bilibili.com/id/video. Press F12 to open the developer tools and refresh the page. Under the Network tab's XHR filter there is a getSubmitVideos request, which contains the AV numbers we need. Scraping the page HTML directly will not work, because the video list is loaded asynchronously.


Under the data field of this response, count is the total number of videos, pages is the number of pages, and vlist holds the video information we are after; the aid in each entry is that video's AV number. The request URL is https://space.bilibili.com/ajax/member/getSubmitVideos?mid=uploader id&pagesize=30&tid=0&page=1&keyword=&order=pubdate, where pagesize is how many videos are returned per request.
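Because pagesize caps how many videos one request returns, the number of pages to fetch must be derived from count. A minimal sketch of that arithmetic (the helper name is hypothetical; 99 is the pagesize the code below actually uses):

```python
import math

def page_count(total_videos, pagesize=99):
    # number of getSubmitVideos requests needed to cover all videos
    return math.ceil(total_videos / pagesize)

print(page_count(100))  # 100 videos span two pages of 99
print(page_count(99))   # exactly one full page
```

Ceiling division avoids requesting a trailing empty page when the total is an exact multiple of pagesize.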


After getting all the AV numbers, we open a video page. Again press F12 to open the developer tools and refresh. Under Network's XHR there are two files: one starting with pagelist and the other with list.so. The first contains the video's cid, and the second is the danmaku file fetched by that cid. So we request the first URL with the AV number to get the cid, then request the second URL with the cid.
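The two requests chain together: pagelist maps an aid to a cid, and list.so maps the cid to the danmaku. A sketch of the JSON shape the code below relies on (the sample response is made up for illustration; no request is sent here):

```python
import json

# made-up pagelist response illustrating the ['data'][0]['cid'] access used later
sample_pagelist = '{"code":0,"data":[{"cid":12345,"page":1,"part":"P1"}]}'

def extract_cid(pagelist_json):
    # the first entry in data is the video's first part; its cid indexes the danmaku
    return json.loads(pagelist_json)['data'][0]['cid']

cid = extract_cid(sample_pagelist)
danmaku_url = 'https://api.bilibili.com/x/v1/dm/list.so?oid={}'.format(cid)
```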


Finally, we process the files we obtained: extract the danmaku text from the <d> tags, deduplicate and count it, and save the results to a file.
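The danmaku file is XML in which each <d> element holds one comment. A self-contained sketch of the extract-and-count step (the article parses with lxml, but the standard library suffices here; the sample fragment is made up):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# made-up fragment in the danmaku XML format: each <d> holds one comment
sample = ('<i>'
          '<d p="1.2,1,25,16777215">233333</d>'
          '<d p="5.0,1,25,16777215">233333</d>'
          '<d p="9.8,1,25,16777215">awsl</d>'
          '</i>')

def count_barrages(xml_text):
    root = ET.fromstring(xml_text)
    # count occurrences of each comment text, which deduplicates as a side effect
    return Counter(d.text for d in root.iter('d'))

counts = count_barrages(sample)
```

Counter gives the same text-to-frequency mapping that the dictionary in reorganize_barrage builds by hand.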

import requests
from lxml import etree
import os
import json
from bs4 import BeautifulSoup
from requests import exceptions
import re
import time


def download_page(url):
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    data = requests.get(url, headers=headers)
    return data


def get_video_page(space_num):
    base_url = "https://www.bilibili.com/av"
    url = "https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&pagesize=99&tid=0&page=1&keyword=&order=pubdate".format(space_num)
    data = json.loads(download_page(url).content)['data']
    total = data['count']
    page_num = (total + 98) // 99  # ceiling division: pages of up to 99 videos
    video_list = data['vlist']
    video_url = []
    for video in video_list:
        video_url.append(base_url + str(video['aid']))
    for i in range(2, page_num+1):
        time.sleep(1)
        url = "https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&pagesize=99&tid=0&page={}&keyword=&order=pubdate".format(space_num, i)
        data = json.loads(download_page(url).content)['data']
        video_list = data['vlist']
        for video in video_list:
            video_url.append(base_url + str(video['aid']))
    return video_url


def get_barrage(name, space_num):
    video_list = get_video_page(space_num)
    aid_to_oid = 'https://api.bilibili.com/x/player/pagelist?aid={}&jsonp=jsonp'
    barrage_url = 'https://api.bilibili.com/x/v1/dm/list.so?oid={}'
    for url in video_list:
        time.sleep(1)  # slow down the crawl to avoid being banned
        aid = re.search(r'\d+$', url).group()  # the AV number is the trailing digits of the URL
        try:
            oid = json.loads(download_page(aid_to_oid.format(aid)).content)['data'][0]['cid']
            barrage = download_page(barrage_url.format(oid)).content
        except requests.exceptions.ConnectionError:
            print('av:',aid)
            continue
        if not os.path.exists('barrage/{}'.format(name)):
            os.makedirs('barrage/{}'.format(name))
        with open('barrage/{}/av{}.xml'.format(name,aid),'wb') as f:
            f.write(barrage)


def reorganize_barrage(name):
    results = {}
    for filename in os.listdir('barrage/{}'.format(name)):
        # parse the XML file and extract the text inside the <d> tags
        html = etree.parse('barrage/{}/{}'.format(name, filename), etree.HTMLParser())
        barrages = html.xpath('//d//text()')
        for barrage in barrages:
            barrage = barrage.replace('\r', '')
            if barrage in results:
                results[barrage] += 1
            else:
                results[barrage] = 1
    if not os.path.exists('statistical result'):
        os.makedirs('statistical result')
    with open('statistical result/{}.txt'.format(name), 'w', encoding='utf8') as f:
        for key,value in results.items():
            f.write('{}\t:\t{}\n'.format(key.rstrip('\r'),value))


In the space list.txt file, each line uses the format "uploader name:id".

if __name__ == '__main__':
    with open('space list.txt', 'r') as f:
        for line in f.readlines():
            name, num = line.strip().split(':')
            print(name)
            get_barrage(name, num)
            reorganize_barrage(name)