A simple analysis of 150K Juejin boiling points (I)

Date: 2021-07-26

Data analysis starts with data collection (here, just a web crawler). This article once again starts from the crawler, but this time in Python.

1. Another way of crawling

A crawler usually works like this: ① get the URL of the target page; ② issue an HTTP request and fetch the page data; ③ parse the page in some way and extract the desired data.
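For contrast, here is a minimal sketch of that classic three-step flow using requests + lxml (the URL and XPath below are placeholders, not from the original article):

import requests
from lxml import etree

# ①② fetch the target page over plain HTTP (placeholder URL)
html = requests.get('https://example.com').text
# ③ parse the raw HTML and extract the target data
titles = etree.HTML(html).xpath('//h1/text()')
print(titles)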

Usually, step ② does not execute the JS code in the page. Some sites use Ajax to load part of their data asynchronously and then render it into the page, or use JS to modify the page DOM. As a result, the page fetched in step ② may be missing the target data, or may not contain it at all. In such cases, the page's JS has to be executed after the page data is fetched.

PhantomJS + Selenium was used for this at first; after Chrome gained headless mode, Chrome has become the usual choice. The processing logic is: ① request the page and execute its JS; ② save the rendered page data; ③ do the follow-up processing (parse the page to extract the data).

1.1 Selenium usage example

We'll take a Juejin article as an example and fetch all the comments under it.

Note: although the comments can actually be fetched directly from an API, we assume here that the data cannot be obtained directly and only appears after the JS has run.

To use Selenium + Chrome, you first need to download the ChromeDriver that matches your Chrome version.
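For reference, a minimal sketch of how the driver used below could be created in headless mode (assuming Selenium is installed and ChromeDriver is on the PATH):

from selenium import webdriver

# launch Chrome without a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)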

The example code is as follows:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

self.driver.get(self.article_url)
# Wait (up to 10 seconds) for the comment list to appear before saving the page
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'comment-list'))
)
self.save_page()
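save_page() is not shown in the excerpt; a plausible minimal version, assuming it only needs to dump the rendered HTML to disk (the filename is a placeholder), could be:

def save_page(self):
    # page_source holds the DOM after the page's JS has executed
    with open('article.html', 'w', encoding='UTF-8') as f:  # placeholder filename
        f.write(self.driver.page_source)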

When using Selenium to drive Chrome to load a page, we often run into this problem: network delay means the target data has not finished downloading, but the page has already been saved. The simplest workaround is something like time.sleep(5), which works but is crude. A better approach is the WebDriverWait helper that Selenium provides, as used above.

The official documentation is worth reading: selenium-python

1.2 Subsequent page processing

After saving the rendered page, the next step is to parse it and extract the data. This time we use XPath.

As usual, let's analyze the page first.

(Screenshot: inspecting the comment list structure in the browser dev tools)

The data is located at: //div[@class="comment-list-box"]/div[contains(@class, "comment-list")]/div[@class="item"]

For convenience, we only extract the user and the content of first-level comments.

The example code is as follows:

from lxml import etree

root = etree.HTML(page_source)
comments = root.xpath('//div[@class="comment-list-box"]/div[contains(@class, "comment-list")]/div[@class="item"]')
for comment in comments:
    # fix_content() is a helper that merges the list of text nodes (see the sketch below)
    username = self.fix_content(comment.xpath('.//div[@class="meta-box"]//span[@class="name"]/text()'))
    content = self.fix_content(comment.xpath('.//div[@class="content"]//text()'))
    print(f'{username} --> {content}')
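fix_content() is the author's helper and is not shown in the excerpt; a plausible minimal stand-in, assuming it only needs to merge the text nodes returned by XPath, could be:

def fix_content(self, texts):
    # join the list of text() nodes into one trimmed string
    return ''.join(t.strip() for t in texts if t.strip())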

Result data:

(Screenshot: the extracted usernames and comment contents)

2. Fetching Juejin boiling points

Let's look directly at the boiling point API.

(Screenshot: the boiling point API request captured in the browser dev tools)

Interface analysis:

Boiling point endpoint: https://apinew.juejin.im/recommend_api/v1/short_msg/hot
Request method: POST
Request body: JSON data, in the following format:

{
    "cursor": "0",      // cursor; "0" on the first request, the response returns the next cursor to use
    "id_type": 4,       // boiling point category (irrelevant here)
    "limit": 20,        // page size
    "sort_type": 200    // sort type (irrelevant here)
}

Then we can use Python to simulate the request and obtain the boiling point data.

We use requests to simulate the request; see its official documentation for usage details.

Code example:

import requests

HOT_URL = 'https://apinew.juejin.im/recommend_api/v1/short_msg/hot'
json_form = {
    'cursor': '0',
    'id_type': 4,
    'limit': 20,
    'sort_type': 200,
}
resp = requests.post(HOT_URL, json=json_form)
print(resp.json())

# The data comes back normally:
# {'err_no': 0, 'err_msg': 'success', 'data': [{'msg_id': '6864704084000112654',
#  'msg_info': {'id': 980761, 'msg_id': '6864704084000112654', 'user_id': '2207475080373639',
#  'topic_id': '6824710203301167112', 'content': "Don't look for a girlfriend, your girlfriend is here", ...

2.1 Saving all the data

Building on the example above, we refine the code as follows:

import logging
import time

import requests

url = 'https://apinew.juejin.im/recommend_api/v1/short_msg/hot'
sess = requests.Session()

def save_pins(idx=0, cursor='0'):
    json_data = {
        'id_type': 4,
        'sort_type': 200,
        'cursor': cursor,
        'limit': 200  # no documented cap, but too large a value makes the server error or drop user info
    }
    resp = sess.post(url, json=json_data)
    if resp.ok:
        resp_json = resp.json()
        with open(f'json_data/pins-{idx:04}.json', 'w+') as json_file:
            json_file.write(resp.content.decode('UTF-8'))
        # Is there more data?
        if resp_json['err_no'] == 0 and resp_json['err_msg'] == 'success':
            logging.debug(f'no error, idx={idx}')
            if resp_json['has_more']:
                logging.debug(f'has more, next idx={idx+1}')
                time.sleep(5)
                save_pins(idx + 1, cursor=resp_json['cursor'])
        else:
            # Something went wrong: wait, then retry the same page
            logging.warning(resp_json['err_msg'])
            logging.debug(f'sleep 10s, retry idx={idx}')
            time.sleep(10)
            save_pins(idx, cursor)
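A usage sketch (the os.makedirs call is my addition; the code above assumes the json_data directory exists). Note that save_pins recurses once per page, so crawling very many pages could in principle hit Python's default recursion limit; rewriting it as an iterative loop would sidestep that:

import os

os.makedirs('json_data', exist_ok=True)  # the function writes its JSON files here
save_pins()  # start at the first page with cursor '0'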

A sample of the data:

(Screenshot: the saved pins-*.json data files)

The source code has been uploaded to GitHub and Gitee.

3. Postscript

At this point the whole boiling point data collection is done. Since the data comes straight from the API, almost no processing is needed apart from removing some duplicate records.

3.1 Follow-up processing

The plan is to do some light processing on the data, store it in a MySQL database, and then build charts with Superset.
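A rough sketch of that pipeline (assuming pandas and SQLAlchemy are available; the connection string and table name are placeholders, and the input files are the ones saved in section 2.1):

import glob
import json

import pandas as pd
from sqlalchemy import create_engine

rows = []
for path in sorted(glob.glob('json_data/pins-*.json')):
    with open(path, encoding='UTF-8') as f:
        # each file is one API response; 'data' holds the list of boiling points
        for item in json.load(f)['data']:
            info = item['msg_info']
            rows.append({'msg_id': info['msg_id'], 'content': info['content']})

# drop the duplicates mentioned above, then write to MySQL for Superset to chart
engine = create_engine('mysql+pymysql://user:password@localhost/juejin')  # placeholder DSN
df = pd.DataFrame(rows).drop_duplicates('msg_id')
df.to_sql('pins', engine, if_exists='replace', index=False)  # placeholder table name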

3.2 Suggestions for Juejin boiling points

  • Cap the paging limit parameter at a maximum of 20?
  • Limit how much data is visible when not logged in.