Python crawler tutorial (3)

Time:2022-11-24

1. Handle cookies and log in to the 17K novel site (17k.com)

Some websites only return the data you want after you log in. In this example we want the bookshelf data from the novel site. That data belongs to a specific account, so we have to log in to that account first.
1. Session tracking is a common technique in web applications for following a user through an entire session. The two usual mechanisms are cookies and sessions: a cookie identifies the user by storing information on the client side, while a session identifies the user by storing information on the server side.
2. Use the browser's developer tools to find the URL that the login request is sent to.
3. Use a session object to request the desired content.
4. Get cookies.
5. Request the page data through the same session, which keeps the login state.
6. A second way to get the data.
Calling requests directly, without a session, does not carry the login state. Cookies solve this as well: copy the cookie from the browser's developer tools and send it along with the request.
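Both approaches can be sketched in a few lines. This is a minimal sketch, not the tutorial's original code: `fetch_with_session` and `parse_cookie_header` are hypothetical helper names, the URLs and form fields are left as parameters because the real ones have to be read from the developer tools, and `session` is assumed to be a `requests.Session`.

```python
def fetch_with_session(session, login_url, login_form, data_url):
    """Method 1: log in, then request the data on the same session.

    `session` is assumed to be a requests.Session. It stores the cookies
    set by the login response and sends them with the second request.
    """
    login_resp = session.post(login_url, data=login_form)
    login_resp.raise_for_status()
    data_resp = session.get(data_url)
    data_resp.raise_for_status()
    return data_resp.text


def parse_cookie_header(raw_cookie):
    """Method 2 helper: turn a 'k1=v1; k2=v2' string copied from the
    developer tools into a dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.strip().split("=", 1)
        cookies[name] = value
    return cookies
```

With requests installed, the first method would be called as `fetch_with_session(requests.Session(), login_url, login_form, data_url)`. For the second, `requests.get(data_url, cookies=parse_cookie_header(raw_cookie))` sends the copied cookies without logging in from the script at all.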

2. Handle anti-leeching with requests and download a Pear Video clip

Pick any video on Pear Video. The browser's developer tools show its video link, but that link cannot be found in the page source.
Refresh the page, and the developer tools show both the Request URL and the srcUrl.
Opening the srcUrl in the browser produces an error. Comparing it with the working video URL shows that the beginning and the end are identical; only the middle part differs.
Locate where each of the two URLs comes from and compare them side by side.
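The comparison suggests a simple string substitution. The sketch below assumes that the differing middle part of the srcUrl is a timestamp that the JSON response also reports (commonly described as `systemTime`), and that the working URL carries `cont-` followed by the contID in its place. The text above does not spell this out, so check it against what the developer tools show. `rewrite_src_url` is a hypothetical helper and the sample values are made up to mimic the shape of the URLs.

```python
def rewrite_src_url(src_url, system_time, cont_id):
    """Swap the timestamp segment of srcUrl for 'cont-<contID>'."""
    if system_time not in src_url:
        raise ValueError("timestamp segment not found in srcUrl")
    return src_url.replace(system_time, f"cont-{cont_id}")


# Made-up values; only the shape matters here.
broken = "https://video.example.invalid/mp4/20221124/1669260000000-123_hd.mp4"
print(rewrite_src_url(broken, "1669260000000", "1234567"))
# prints https://video.example.invalid/mp4/20221124/cont-1234567-123_hd.mp4
```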
The steps for crawling the video are as follows.
1. Get contID
2. Request videoStatus, which returns JSON.
The response says that the article has been taken offline, even though the content displays normally in the browser. This is anti-leeching at work. Anti-leeching traces where a request came from, that is, the page one level above the request, so the request has to name that page as its source (the Referer header).
Filter the response content and extract the JSON.
3. Replace the differing part of the obtained URL.
4. Download the video.
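The request side of these steps can be sketched as follows. This is an illustration under stated assumptions rather than the tutorial's original code: all function names are hypothetical, the page URL is assumed to end in `video_<contID>`, the query parameter is assumed to be named `contId`, the videoStatus URL is passed in because it has to be read from the developer tools, and `session` is assumed to be a `requests.Session`.

```python
def extract_cont_id(page_url):
    """Assumes the video page URL ends in 'video_<contID>'."""
    return page_url.rstrip("/").rsplit("_", 1)[-1]


def referer_headers(page_url):
    """Headers that name the video page as the source of the request."""
    return {"Referer": page_url}


def fetch_video_status(session, status_url, page_url):
    """Request videoStatus with the Referer set and return the JSON."""
    resp = session.get(
        status_url,
        params={"contId": extract_cont_id(page_url)},  # assumed parameter name
        headers=referer_headers(page_url),
    )
    resp.raise_for_status()
    return resp.json()


def download_video(session, video_url, out_path):
    """Stream the video to disk in chunks instead of loading it whole."""
    with session.get(video_url, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
```

`download_video` should receive the URL after the replacement in step 3, not the raw srcUrl from the JSON.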

3. Comprehensive exercise: crawling NetEase Cloud Music comments

1. Use the developer tools to locate the request that carries the desired content.
2. The request data turns out to be encrypted. Follow the request through its call stack to find where the encryption happens.
3. Find the parameters as they are before encryption.
4. Read through NetEase's code to work out its encryption logic. The request needs two parameters, params and encSecKey.
5. Following that logic, find the values passed in as d, e, f, and g.
6. Next, crawl the comments.
7. Run the script. The result is the comment data for the NetEase Cloud Music song.
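The shape of the encryption step can be outlined without committing to the site's exact constants. The sketch below assumes the scheme commonly reported for this endpoint: the JSON payload d is padded and encrypted twice with a block cipher, first with the fixed key g and then with a random 16-character key, and the final text becomes params. None of that is stated in the steps above, so treat it as an assumption to verify in the site's script. The cipher itself is passed in as `encrypt` because it needs a third-party library, and every function name here is hypothetical. encSecKey, which would be derived from the random key together with e and f, is left out.

```python
import json
import secrets
import string

BLOCK_SIZE = 16  # block size in bytes assumed for the cipher


def random_key(length=16):
    """Random alphanumeric key, generated fresh for each request."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def pkcs7_pad(data, block_size=BLOCK_SIZE):
    """Pad bytes to a multiple of block_size (PKCS#7 style)."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def build_params(payload, fixed_key, session_key, encrypt):
    """Two encryption passes over the JSON payload.

    `encrypt(plaintext_bytes, key)` must return text; it is supplied by
    the caller, so no particular cipher library is assumed here.
    """
    first = encrypt(pkcs7_pad(json.dumps(payload).encode("utf-8")), fixed_key)
    return encrypt(pkcs7_pad(first.encode("utf-8")), session_key)
```

Passing the cipher in as a function keeps the padding and the two-pass structure easy to check on their own before the real keys and cipher are plugged in.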