Python crawler in practice

Time: 2021-7-26

Statement: the following content reflects my personal understanding. If you find any errors or have questions, feel free to contact me to discuss.

Introduction to the crawler

Website introduction

The website crawled this time is Radish Investment Research, an intelligent investment research platform for stock fundamental analysis built with artificial intelligence, big data, and mobile application technology. It provides research reports and various data sets that can be used for analysis when making investment decisions.

Reasons for writing the crawler and its uses

In my spare time I study investment and personal finance. Radish Investment Research offers a lot of financial information and research reports. This time I want to write a crawler that persistently stores the target data and sends program reminders about relevant news. Because the site's content is rich, it is hard to crawl everything at once, so for now the goal is to fetch the research feed on the home page with Scrapy and handle pagination. Later I will keep updating the project and try to crawl the whole site. (For personal research only.)

Scrapy

Brief introduction

Scrapy is a crawler framework written to crawl website data and extract structured data. It lets you collect the data you need with very little code.

Scrapy is built on Twisted, an asynchronous networking framework, which greatly improves download speed.

Using tutorials

Tutorials on using Scrapy can be found in the official documentation, which is a good way to get started and to understand the role of each module in the framework. The official documentation is very thorough; it is recommended to study it systematically before starting to use the framework.

The most important part of learning Scrapy is to understand its workflow, follow the examples in the official documentation, analyze in detail what happens at each step, and compare the similarities and differences with how you wrote crawlers before.

Packet capture tools

What is a packet capture tool

A packet capture tool is software that intercepts network packets and lets you inspect their contents. By analyzing the captured packets, useful information can be obtained.

Why use one

For more complex websites, the browser's built-in developer tools can be cumbersome when analyzing what to crawl. A packet capture tool makes it easier to analyze the relevant requests, so the URL that returns the data we need, and the whole request flow, can be found faster.

Recommended tools and tutorials

For learning packet capture tools, I recommend Zhu Anbang's blog. His tutorials cover three tools: Charles, Fiddler, and Wireshark. They differ in features, but the basic principles are the same; picking the one you find most comfortable and learning it well is basically enough.

Business logic analysis

Find the URL that loads the data

By capturing and analyzing the packets of the whole home page loading process, the URL that returns the home page data is found to be https://gw.datayes.com/rrp_mammon/web/feed/list, and the URL of the next page is https://gw.datayes.com/rrp_mammon/web/feed/list?timeStamp=20210401170127&feedIds=66233,66148

Pagination parameter analysis

Observing the URL shows that timeStamp and feedIds are the two parameters that control pagination. Requesting several more pages shows that 20210401170127 can be understood as a point in time: the first eight digits are guessed to be the date of this refresh, and the last six digits are a seconds-level time taken from the current moment. After working this out, the parameter can be written as ''.join(str(datetime.now())[:10].split('-')) + str(time.clock()).split('.')[1]
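Note that time.clock() was removed in Python 3.8, so the expression above no longer runs on current interpreters. If the last six digits are read as hour-minute-second (which matches the shape of 20210401170127), a simpler way to build a value of the same shape might be the following sketch (an interpretation, not the author's original code):

from datetime import datetime

# 14-digit string shaped like 20210401170127: date plus hour/minute/second of "now"
time_stamp = datetime.now().strftime('%Y%m%d%H%M%S')
print(time_stamp)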

After fetching several pages, the feedIds parameter turns out to consist of the IDs of the first four items in the first response plus the ID of the last item of each response, and it grows with the number of requests: each time, the ID of the last item is appended. Splicing these values into a next-page URL at first returned no data; copying the original timeStamp made the request work, so the problem was the timeStamp parameter rather than feedIds. Looking at the response again, three fields are date-like: insertTime, updateTime, and publishTime. Further analysis shows that removing the last three zeros turns them into a Unix timestamp (the values are in milliseconds), and converting that gives exactly the value we need.
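To check the "remove the last three zeros" observation, a millisecond value can be converted directly; the sample number below is a hypothetical value chosen for illustration:

from datetime import datetime

publish_time_ms = 1617267687000  # hypothetical millisecond value as returned by the API
print(datetime.fromtimestamp(publish_time_ms / 1000))  # prints 2021-04-01 17:01:27 on a machine set to UTC+8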

Simulated request testing

scrapy shell

The scrapy shell lets us simulate a request to an address and drops us into an interactive terminal, where we can inspect everything about the request and response and debug. The advantage over Postman is that the shell works directly with Scrapy's own response and selector objects, which Postman cannot do.
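For example, a quick session might look like this (a sketch; that the endpoint returns JSON is an assumption based on the analysis above):

scrapy shell "https://gw.datayes.com/rrp_mammon/web/feed/list"

# then, inside the interactive terminal (response is provided by the shell):
import json
data = json.loads(response.text)  # parse the JSON body
data.keys()                       # inspect the top-level structure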


Postman

Brief introduction

Postman is a Chrome plug-in for debugging web pages and sending HTTP requests. It makes it easy to simulate various types of requests and debug an interface, and it can be used to verify our ideas when writing crawlers.

Usage and Chinese localization

Postman's official tutorial is very detailed and is a good way to learn the tool. If you want a Chinese-language version, you can download it from the Postman localization project.

Actual use

By sending a request through Postman, we can see the data we want, already formatted, which makes it easier to read. Combined with debugging in the scrapy shell, we can easily get the data we need.


Write crawler

Create crawler project

scrapy startproject datayes

Robots protocol

In settings.py, you can set ROBOTSTXT_OBEY = True to follow the rules of robots.txt.

Create crawler

scrapy genspider mammon gw.datayes.com

Modify start_urls

The default start_urls is not the link we want to crawl, so replace it with the link we need:

start_urls = ['https://gw.datayes.com/rrp_mammon/web/feed/list']

Complete parse method

Based on the results of the earlier analysis, design and complete the parse method. The main difficulty this time is how to splice next_url. Because the feedIds parameter accumulates across requests, it is kept outside the parse method so that it keeps accumulating each time another page is fetched.

class MammonSpider(scrapy.Spider):
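    # NOTE: the original listing is truncated above; the body below is a minimal sketch based on
    # the earlier analysis. `import scrapy` is assumed at the top of the (truncated) module, and the
    # JSON field names (data, list, feedId, title, publishTime) are assumptions, not confirmed code.
    name = 'mammon'
    allowed_domains = ['gw.datayes.com']
    start_urls = ['https://gw.datayes.com/rrp_mammon/web/feed/list']

    base_url = 'https://gw.datayes.com/rrp_mammon/web/feed/list'
    feed_ids = []  # kept outside parse() so the IDs accumulate across pages

    def parse(self, response):
        # imports would normally sit at the top of the file; kept here because the header is truncated
        import json
        from datetime import datetime

        data = json.loads(response.text)
        items = data.get('data', {}).get('list', [])

        for entry in items:
            yield {
                'feedId': entry.get('feedId'),
                'title': entry.get('title'),
                'publishTime': entry.get('publishTime'),
            }

        if not items:
            return

        # feedIds: IDs of the first four items of the first response, then the last ID of every page
        if not self.feed_ids:
            self.feed_ids.extend(str(e.get('feedId')) for e in items[:4])
        self.feed_ids.append(str(items[-1].get('feedId')))

        # timeStamp: publishTime of the last item, milliseconds -> YYYYMMDDHHMMSS
        last_ms = items[-1].get('publishTime')
        time_stamp = datetime.fromtimestamp(last_ms / 1000).strftime('%Y%m%d%H%M%S')

        next_url = f'{self.base_url}?timeStamp={time_stamp}&feedIds={",".join(self.feed_ids)}'
        yield scrapy.Request(next_url, callback=self.parse)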

While running the crawler over a couple of days, I noticed that the cookie changes; it is only kept alive after logging in. Inspecting the parameters in the cookie shows that cloud-sso-token is the necessary one, so select it and add it to settings.py.
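A minimal sketch of how the token might be wired into settings.py, assuming it is sent through the default request headers (the header approach and the placeholder value are assumptions, not the author's exact configuration):

# settings.py (sketch)
COOKIES_ENABLED = False  # send the Cookie header below as-is instead of letting the cookie middleware manage it

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/plain, */*',
    # cloud-sso-token copied from a logged-in browser session (placeholder value)
    'Cookie': 'cloud-sso-token=YOUR_TOKEN_HERE',
}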

Complete data storage

First, configure the item pipeline and the database-related parameters in settings.py:

# Configure item pipelines
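# (The settings fragment above is truncated; the lines below are a sketch. The pipeline path
#  and the MYSQL_* parameter names and values are placeholders, not the author's exact settings.)
ITEM_PIPELINES = {
    'datayes.pipelines.DatayesPipeline': 300,
}

# Database parameters, read later in the pipeline through spider.settings
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'
MYSQL_DB = 'datayes'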

Add open_spider and close_spider methods to the DatayesPipeline class we defined, and read the database-related parameters through spider.settings.

import pymysql
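# (The pipeline listing above is truncated; below is a minimal sketch. The MYSQL_* setting names
#  mirror the placeholders sketched in settings.py, and the table and column names are assumptions.)


class DatayesPipeline:

    def open_spider(self, spider):
        # open the database connection once, using the parameters from settings.py
        settings = spider.settings
        self.conn = pymysql.connect(
            host=settings.get('MYSQL_HOST'),
            port=settings.getint('MYSQL_PORT'),
            user=settings.get('MYSQL_USER'),
            password=settings.get('MYSQL_PASSWORD'),
            database=settings.get('MYSQL_DB'),
            charset='utf8mb4',
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # store one feed entry; table and column names are placeholders
        sql = 'INSERT INTO feed (feed_id, title, publish_time) VALUES (%s, %s, %s)'
        self.cursor.execute(sql, (item.get('feedId'), item.get('title'), item.get('publishTime')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.cursor.close()
        self.conn.close()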

Finally, create the database and start the crawler to collect the data.
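Since the spider was generated with genspider mammon above, it is started with:

scrapy crawl mammon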

This work is licensed under a CC license; any reprint must credit the author and link to this article.
