Crawling match highlights from a live-broadcast website with Python (a work requirement)

Time: 2022-4-26

Before reading this article, you should be familiar with the following:

  1. The five steps of a crawler

    a) Requirements analysis (done by the programmer, manually)

    b) Find the URLs related to the content (done by the programmer)

    c) Fetch the response for each URL (done by the program: urllib, requests)

    d) Locate the required information in the response (done by the program: re regular expressions, XPath, CSS selectors)

    e) Store the content (done by the program: file system with open, pymysql, pymongo)
  1. What we cover today

    a) HTTP/HTTPS

    b) How to observe HTTP packets

    c) Sending requests with requests: get, post
  1. Important fields in the request headers

    a) Cookie: stores information issued by the server and works together with the session to identify the user

    b) User-Agent: identifies what kind of client you are

    c) Referer: which page you came from

  2. If the browser can access a page but your script cannot, add headers: add User-Agent first, then

    Referer, then Cookie; if that still fails, add all of the browser's headers

  3. When crawling a website, first make sure the information you want is actually in the page you request

  4. When capturing packets, it is best to tick Preserve log

    Right-click -> Inspect -> Network -> Preserve log

    Preserve log in the Chrome developer tools keeps the request log. With it ticked, you can still see the requests made before the page jumped to another page. It also helps with other packet-capture problems in the Chrome developer tools

  5. If the content is only accessible after logging in, log in first and then access it

    For this we need a class from requests, Session

    Just replace every call on requests with the same call on a Session instance

  6. If your IP is blocked, you can use proxies from https://www.xicidaili.com/api

    This is the Xici proxy service, which provides 150,000 IPs every day
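
Notes 2, 5 and 6 above can be sketched with requests as follows. This is only a minimal sketch: the User-Agent string, the cookie and the proxy address in the usage note are placeholders (assumptions, not values from this article), and build_session and preview_request are hypothetical helper names. The request is prepared but not sent, so the final headers can be inspected safely.

```python
import requests


def build_session(user_agent, referer=None, cookies=None, proxy=None):
    """Create a Session, adding headers in the order suggested above:
    User-Agent first, then Referer, then cookies, and optionally a proxy."""
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    if referer:
        session.headers["Referer"] = referer
    if cookies:
        # A plain dict of cookie names and values copied from the browser.
        session.cookies.update(cookies)
    if proxy:
        # Route both HTTP and HTTPS traffic through the same proxy address.
        session.proxies.update({"http": proxy, "https": proxy})
    return session


def preview_request(session, url):
    """Prepare (but do not send) a GET request to inspect the merged headers."""
    return session.prepare_request(requests.Request("GET", url))
```

For example, build_session("Mozilla/5.0", referer="https://www.zhibo8.cc/", proxy="http://127.0.0.1:8080") returns a session whose later session.get(...) calls reuse the same headers, cookies and proxy, which is why switching from requests.get to a Session instance is enough once a login is involved.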

Requirements

Match the specified li element according to the conditions, then follow it to the highlights page.

Scrape the list of highlights.

1. Target website

Live broadcast site: https://www.zhibo8.cc

Events: the section further down the home page is the finished-matches section.

Match the home team and the away team by date, open the match page, and scrape the highlights.

2. Analyze the web page

First, let's look at how the finished-matches section of the home page is implemented: is it loaded with AJAX, or shown and hidden with jQuery?

Check whether any requests are sent when the tab is clicked:

No request is sent, so it must be controlled by jQuery show/hide. That is, as soon as the home page is opened, these HTML elements and their data are already loaded.

So the goal is clear: the first page we crawl must be the home page.

Second: find the HTML elements and distinguishing features of the finished matches we want.
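
A quick way to confirm this from a script is to fetch the home page once and check that the text you expect is already in the raw HTML. The sketch below uses urllib from the standard library; fetch_html and missing_markers are hypothetical helper names, and the marker strings you pass in are whatever team names or dates you see in the browser.

```python
import urllib.request


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Download a page and decode it with the charset the server declares."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_markers(html, markers):
    """Return the markers that do NOT appear in the raw HTML.

    An empty result suggests the data is in the initial page; anything left
    over is probably loaded later by a separate request."""
    return [marker for marker in markers if marker not in html]
```

If missing_markers(fetch_html("https://www.zhibo8.cc"), [...]) comes back empty, the home page alone is enough and no AJAX endpoint has to be tracked down.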

The tab that gets clicked turns out to have no distinguishing features.

So we locate the div that holds the finished matches instead.

Third: find the patterns.

Find the data we want to match:

the date, the home team, and the away team.

They are all present in the page, as shown by the arrows.

  1. The date is in div class content -> div class titlebar.
  2. The names of the home team and the away team are in div class content -> li, which also has an attribute, left team.
  3. Away team name: div class content -> li -> img text.

So we have found the patterns. Let's start writing the script.
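
The patterns above can be turned into a small parser. The sketch below uses html.parser from the standard library rather than XPath or CSS selectors, and SAMPLE_HTML is invented markup that only mimics the described structure (a titlebar div with the date, then one li per match); the real page's tags, attributes and link layout may differ, so treat every name here as an assumption.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT copied from the real site.
SAMPLE_HTML = """
<div class="content">
  <div class="titlebar"><h2>2022-04-26</h2></div>
  <ul>
    <li><img src="a.png"> Team A - <img src="b.png"> Team B <a href="/match/1.htm">highlights</a></li>
    <li><img src="c.png"> Team C - <img src="d.png"> Team D <a href="/match/2.htm">highlights</a></li>
  </ul>
</div>
"""


class FinishedMatchParser(HTMLParser):
    """Collect one record per li, tagged with the most recent titlebar date."""

    def __init__(self):
        super().__init__()
        self.matches = []
        self._date = ""
        self._in_titlebar = False
        self._li = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "titlebar" in (attrs.get("class") or "").split():
            self._in_titlebar = True
            self._date = ""
        elif tag == "li":
            self._li = {"date": self._date.strip(), "text": "", "url": None}
        elif tag == "a" and self._li is not None and self._li["url"] is None:
            # Keep the first link inside the li as the match URL.
            self._li["url"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self._in_titlebar:
            self._in_titlebar = False
        elif tag == "li" and self._li is not None:
            # Collapse runs of whitespace before storing the record.
            self._li["text"] = " ".join(self._li["text"].split())
            self.matches.append(self._li)
            self._li = None

    def handle_data(self, data):
        if self._in_titlebar:
            self._date += data
        elif self._li is not None:
            self._li["text"] += data


def find_match(html, date, home, away):
    """Return the first li record matching the date and both team names."""
    parser = FinishedMatchParser()
    parser.feed(html)
    for match in parser.matches:
        if date in match["date"] and home in match["text"] and away in match["text"]:
            return match
    return None
```

Calling find_match(SAMPLE_HTML, "2022-04-26", "Team C", "Team D") returns the record for the second li, and its "url" field is what the next step would open.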

3. Write the crawler script

Now let's start writing the first part of the script.

It is expected to be divided into two steps.

1. Find the URL of the specified match on the home page and pass it to the next part of the script to get the details

2. Crawl the highlights on the details page

Store the results in the database

Screenshot of the crawl results

Two GET methods have been written so far. The data is already being fetched, and the matching rules are still being written.
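
The "store in the database" step could look something like the sketch below. It uses sqlite3 from the standard library as a stand-in so that it runs without a server; the article itself mentions pymysql and pymongo, and the table name, columns and helper names here are all assumptions made for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS highlights ("
        " match_date TEXT, home TEXT, away TEXT, title TEXT,"
        " url TEXT PRIMARY KEY)"
    )
    return conn


def save_highlights(conn, match_date, home, away, items):
    """Insert (title, url) pairs for one match; rows with a known url are skipped.

    Returns the total number of rows now stored."""
    rows = [(match_date, home, away, title, url) for title, url in items]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO highlights VALUES (?, ?, ?, ?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM highlights").fetchone()[0]
```

Using the URL as the primary key means the crawler can be run again for the same match without creating duplicate rows.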
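
The author's two GET methods are not shown, so the following is only a guess at their shape: one function for the home page and one for the details page, both going through a shared session. The session argument is assumed to be a requests.Session (or anything with a compatible get method), and the function names are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.zhibo8.cc"


def get_page(session, url, timeout=10):
    """GET a URL through the session and return the decoded body."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def get_home_page(session):
    """First GET: the home page, which already contains the finished matches."""
    return get_page(session, BASE_URL)


def get_detail_page(session, href):
    """Second GET: the match page; href may be relative to the home page."""
    return get_page(session, urljoin(BASE_URL, href))
```

With a real session this would be used as get_detail_page(session, href), where href is the link found on the home page for the matched game.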

This work is released under a CC license; reprints must credit the author and link to this article

Thank you for your attention