Learn that! Crawling Weibo comments with Python (no duplicate data)

Time: 2021-01-22

Python crawling comments

Preface

Some time ago, the comments on a certain Weibo post became sharply polarized. Out of curiosity, I wanted to do a simple analysis of the comments and the users involved, so I found some relevant code online, modified a few parameters such as the cookies, and ran it.
It ran without a single error!! I was shocked. I had never been so happy~

After a while, though, I discovered that things were not so simple: the data was full of duplicates!!

Yes, the duplication rate was appallingly high. So I began my exploration~
Note: the overall code is adapted from an earlier author's work; my main contribution is solving the data duplication problem. If there is any infringement, please contact me!

1、 Overall approach

The overall idea is quite clear. I drew a very simple flow chart:
(Flow chart: get the user's post addresses → fetch each post's main comments → fetch each main comment's sub-comments → save the results.)

As for why the main comments and the sub-comments are fetched separately: this is the key to solving the duplication problem. Testing shows that if you simply increment the page parameter according to the obvious pattern, or crawl through the .cn mobile page, then after a certain number of comments (roughly a few hundred) you either get no data at all or get duplicated data.
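
Even so, as a cheap safety net, the collected comments can be deduplicated by their comment id once everything is fetched. A minimal sketch, assuming each item carries the id field that the parsing functions below store:

def dedupe_comments(comments):
    seen = set()
    unique = []
    for item in comments:
        if item["id"] not in seen:  # keep only the first occurrence of each comment id
            seen.add(item["id"])
            unique.append(item)
    return unique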

2、 Get the post addresses

The request URL for a Weibo user's page is:

https://weibo.com/xxxxxxxxx?is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page=1

The page number can be controlled through the page parameter. However, careful readers will notice that, besides the directly loaded HTML, each page's data is also fetched dynamically through two Ajax requests:

start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)
start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)

In other words, each page of data consists of three parts:

1. the directly loaded HTML;
2. the data from the first Ajax request (pagebar=0);
3. the data from the second Ajax request (pagebar=1).

1. Get the Ajax addresses

Request the main page and extract the corresponding Ajax request addresses from it:

def get_ajax_url(user):
    url = 'https://weibo.com/%s?page=1&is_all=1' % user
    res = requests.get(url, headers=headers, cookies=cookies)
    html = res.text
    # page_id and domain are embedded in the page's CONFIG javascript object
    page_id = re.findall("CONFIG\['page_id'\]='(.*?)'", html)[0]
    domain = re.findall("CONFIG\['domain'\]='(.*?)'", html)[0]
    start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)
    start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)
    return start_ajax_url1, start_ajax_url2
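
A quick sanity check might look like this (the user id here is a made-up placeholder; headers and cookies are set up as in section 5):

ajax1, ajax2 = get_ajax_url('1234567890')  # hypothetical user id
print(ajax1.format(1))  # the first Ajax request for page 1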

2. Parse the post addresses out of a page

After sending the request, parse the post IDs out of the page (the handling is the same for the main page request and the Ajax requests):

def parse_home_url(url):
    res = requests.get(url, headers=headers, cookies=cookies)
    response = res.content.decode().replace("\\", "")  # strip the escaping in the Ajax response
    every_id = re.compile('name=(\d+)', re.S).findall(response)  # the post IDs needed for the comment pages
    home_url = []
    for id in every_id:
        base_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={}&from=singleWeiBo'
        url = base_url.format(id)
        home_url.append(url)
    return home_url
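
To make the regex concrete: on a fragment like the following (invented for illustration; real responses embed many such name=... attributes), it pulls out the numeric post IDs:

sample = 'action-data="allowForward=1&name=4498052401861557"'
print(re.compile('name=(\d+)', re.S).findall(sample))  # -> ['4498052401861557']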

3. Get the post addresses for a specified user

Putting the two functions above together:

def get_home_url(user, page):
    start_url = 'https://weibo.com/%s?page={}&is_all=1' % user
    start_ajax_url1, start_ajax_url2 = get_ajax_url(user)
    all_url = []
    for i in range(page):
        home_url = parse_home_url(start_url.format(i + 1))         # posts in the directly loaded page
        ajax_url1 = parse_home_url(start_ajax_url1.format(i + 1))  # posts in the first Ajax-loaded part
        ajax_url2 = parse_home_url(start_ajax_url2.format(i + 1))  # posts in the second Ajax-loaded part
        all_url += home_url + ajax_url1 + ajax_url2                # accumulate rather than overwrite
        print('page %d parsed' % (i + 1))
    return all_url

The parameters are the user's ID and the number of pages to crawl; the return value is the comment-interface address of every post found.


3、 Get the main comments

A quick look at the request data shows that the interface for fetching a post's comments is:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557&root_comment_max_id=185022621492535&root_comment_max_id_type=0&root_comment_ext_param=&page=1&from=singleWeiBo

A very conspicuous page parameter comes into view, and the request still seems to work after the other parameters are stripped. Your first reaction might be to write a loop over page and fetch directly, emmmm, and that is exactly where the terror of data duplication originates. The innocuous-looking root_comment_max_id parameter actually matters a great deal, and you have to obtain it somehow. Further analysis shows that the data returned by each request already contains the address of the next request; you just need to extract it and keep going.
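
As a quick probe (using the post id from the interface URL above, with headers and cookies as in section 5), the stripped-down request does respond normally for the first page:

probe = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557&page=1&from=singleWeiBo'
res = requests.get(probe, headers=headers, cookies=cookies)
print(res.json()['data']['count'])  # total number of comments on the post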

The code is as follows:

def parse_comment_info(data_json): 
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div[@class='WB_face W_fl']/a/img/@alt")
    info = html.xpath("//div[@node-type='replywrap']/div[@class='WB_text']/text()")
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='WB_face W_fl']/a/@href")
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@node-type='root_comment']/@comment_id")    
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&'+html.xpath('/html/body/div/div/div[%d]/@action-data'%(len(name)+1))[0]+'&__rnd='+str(int(time.time()*1000))
    except:
        try:
            next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&'+html.xpath('/html/body/div/div/a/@action-data')[0]+'&__rnd='+str(int(time.time()*1000))
        except:
            next_url = ''
    comment_info_list = []
    for i in range(len(name)): 
        item = {}
        item["id"] = ids[i]
        Item ["name"] = name [i] # stores the online name of the reviewer
        item["comment_ Info "] = info [i] [1:] # store comment information
        item["comment_ time"] = comment_ Time [i] # store comment time
        item["comment_ url"] = name_ URL [i] # stores the related homepage of the reviewer
        try:
            action_data = html.xpath("/html/body/div/div/div[%d]//a[@action-type='click_more_child_comment_big']/@action-data"%(i+1))[0]
            child_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + action_data
            item["child_url"] = child_url
        except:
            item["child_url"] = ''    
        comment_info_list.append(item)
    return comment_info_list,next_url
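
For clarity, the action-data attribute extracted above is just a URL query string; with the ids from the example URL in section 3, the assembled next_url comes out as:

action_data = 'id=4498052401861557&root_comment_max_id=185022621492535&root_comment_max_id_type=0'
next_url = ('https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&'
            + action_data + '&__rnd=' + str(int(time.time() * 1000)))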

The parameter is the JSON data of the response; the function returns the parsed comment data together with the address of the next request.

Here, child_url is the address used to fetch the corresponding sub-comments.

4、 Get the sub-comments

The idea for fetching sub-comments is the same as for main comments. Once all the main comments have been fetched, we traverse the results and, whenever child_url is not empty (i.e. the comment has sub-comments), request it to fetch them.

1. Parse the sub-comments

def parse_comment_info_child(data_json): 
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/text()")
    info=html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/text()")
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/@href")
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@class='list_li S_line1 clearfix']/@comment_id")
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&'+html.xpath('/html/body/div[%d]/div/a/@action-data'%(len(name)+1))[0]+'&__rnd='+str(int(time.time()*1000))
    except:
        next_url = ''
    comment_info_list = []
    for i in range(len(name)): 
        item = {}
        item["id"] = ids[i]
        Item ["name"] = name [i] # stores the online name of the reviewer
        item["comment_ Info "] = info [i] [1:] # store comment information
        item["comment_ time"] = comment_ Time [i] # store comment time
        item["comment_ url"] = name_ URL [i] # stores the related homepage of the reviewer
        comment_info_list.append(item)
    return comment_info_list,next_url

2. Fetch the sub-comments

Call the parsing function above in a loop to collect all the sub-comments:

def get_childcomment(url_child):
    print('start fetching sub-comments...')
    comment_info_list = []
    res = requests.get(url_child, headers=headers, cookies=cookies)
    data_json = res.json()
    count = data_json['data']['count']
    comment_info,next_url = parse_comment_info_child(data_json)
    comment_info_list.extend(comment_info)
    print('got %d comments' % len(comment_info_list))
    while len(comment_info_list) < count:
        if next_url == '':
            break
        res = requests.get(next_url,headers=headers,cookies=cookies)
        data_json = res.json()
        comment_info,next_url = parse_comment_info_child(data_json)
        comment_info_list.extend(comment_info)
        print('got %d comments' % len(comment_info_list))
    return comment_info_list

The parameter is the child_url of a main comment; the function returns the corresponding sub-comments.

5、 Main function call

1. Import related libraries

import re
import time
import json
import urllib
import requests
from lxml import etree

2. Main function execution

if "__main__" == __name__: 
    #Set the corresponding parameters
    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
            'Accept': '*/*',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Content-Type': 'application/x-www-form-urlencoded',
            'X-Requested-With': 'XMLHttpRequest',
            'Connection': 'keep-alive',
    }
    cookies = {}  # your Weibo cookies (copy them from a logged-in browser request)
    userid = ''   # the ID of the Weibo user to crawl
    page = 1      # number of pages to crawl
    #Start crawling
    all_urls = get_home_url(userid,page)
    for index in range(len(all_urls)):
        url = all_urls[index] 
        print('start fetching the main comments of post %d...' % (index + 1))
        comment_info_list = []
        res = requests.get(url, headers=headers, cookies=cookies)
        data_json = res.json()
        count = data_json['data']['count']
        comment_info,next_url = parse_comment_info(data_json)
        comment_info_list.extend(comment_info)
        print('got %d comments' % len(comment_info_list))
        while True:
            if next_url == '':
                break
            res = requests.get(next_url,headers=headers,cookies=cookies)
            data_json = res.json()
            comment_info,next_url = parse_comment_info(data_json)
            comment_info_list.extend(comment_info)
            print('got %d comments' % len(comment_info_list))
        for i in range(len(comment_info_list)):
            child_url = comment_info_list[i]['child_url']
            if child_url != '':
                comment_info_list[i]['child'] = get_childcomment(child_url)
            else:
                comment_info_list[i]['child'] = []
        with open('comments_of_post_%d.txt' % (index + 1), 'w') as f:
            f.write(json.dumps(comment_info_list))
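
One small caveat: json.dumps escapes non-ASCII characters by default, so the saved files will contain \uXXXX sequences instead of readable Chinese. If readable output is preferred, the last two lines can be written as:

with open('comments_of_post_%d.txt' % (index + 1), 'w', encoding='utf-8') as f:
    f.write(json.dumps(comment_info_list, ensure_ascii=False))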

3. Results

With page = 1, data for ten posts was obtained, one .txt file per post, and this time with no duplicates.

Written at the end

Of course, there are still many shortcomings; for example, the speed is not ideal and the structure is messy. I considered adding multithreading and other ways to speed things up, but since the crawler needs a logged-in account, that carries risks. Be careful out there~
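
For reference, a minimal sketch of what the multithreaded variant might look like, using the standard library's concurrent.futures (all worker threads share the same logged-in cookies, which is exactly where the risk lies):

from concurrent.futures import ThreadPoolExecutor

def fetch_children(item):
    # fill in the sub-comments of a single main comment
    item['child'] = get_childcomment(item['child_url']) if item['child_url'] else []
    return item

with ThreadPoolExecutor(max_workers=4) as pool:
    comment_info_list = list(pool.map(fetch_children, comment_info_list))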

