Disclaimer: the following is my personal understanding. If you find any errors or have questions, feel free to contact me to discuss.
The website crawled this time is Radish Investment Research, an intelligent investment-research platform for stock fundamental analysis built with artificial intelligence, big data, and mobile technology. It provides research reports and many kinds of data to analyze when making investment decisions.
In my spare time I study investing and personal finance, and Radish Investment Research offers a wealth of financial information and research reports. This time I want to write a crawler that persistently stores the target data and can programmatically alert me to relevant news. The site is too rich in content to crawl all at once, so for now I will fetch the investment-research feed on the home page with Scrapy and implement page turning; later I plan to keep updating the project and try to crawl the whole site (for personal research only).
Scrapy is a crawler framework for crawling websites and extracting structured data; it can fetch the relevant data with very little code.
Scrapy is an asynchronous network framework built on Twisted, which greatly improves download speed.
Tutorials for Scrapy can be found in the official documentation, which explains the role of each module in the framework. The official documentation is very thorough; it is recommended to study it systematically before starting.
The most important part of learning Scrapy is understanding its workflow: follow the examples in the official documentation, analyze what happens at each step in detail, and compare it with how you wrote crawlers before.
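That workflow can be sketched in plain Python, independent of Scrapy itself: a scheduler queue feeds a downloader, and the spider's parse callback yields either items or new requests back to the scheduler. Everything here (the fake two-page site, the toy parse function) is illustrative only, not Scrapy's real API.

```python
from collections import deque

def crawl(start_urls, fetch, parse):
    """Simulate Scrapy's loop: the scheduler queues requests, the
    downloader (fetch) gets responses, and the spider callback (parse)
    yields either items or new requests."""
    scheduler = deque(start_urls)      # scheduler: pending requests
    seen, items = set(scheduler), []   # dedup filter + collected items
    while scheduler:
        url = scheduler.popleft()
        response = fetch(url)                # downloader fetches
        for result in parse(url, response):  # spider parses
            if isinstance(result, str):      # a new request (URL)
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                            # an item -> pipeline
                items.append(result)
    return items

# Tiny fake site: page1 links to page2, each page carries one item.
site = {"page1": (["page2"], {"id": 1}), "page2": ([], {"id": 2})}
fetch = lambda url: site[url]

def parse(url, response):
    links, item = response
    yield item          # scraped data goes to the pipeline
    yield from links    # discovered links go back to the scheduler

print(crawl(["page1"], fetch, parse))  # [{'id': 1}, {'id': 2}]
```

The real framework adds middleware, asynchronous downloading, and deduplication on request fingerprints, but the request/parse/schedule cycle is the same.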
Packet capture tools
A packet capture tool is software that intercepts and inspects the contents of network packets; analyzing the captured packets yields useful information.
For more complex websites, the browser's built-in debugging tools can be cumbersome when analyzing the data to crawl. A packet capture tool lets you analyze the corresponding requests, making it much faster to find the URL of the data we need and to trace the whole request flow.
For learning packet capture tools I recommend the tutorials on Zhu Anbang's blog, which cover three of them: Charles, Fiddler, and Wireshark. Their features differ, but the basic principles are the same; picking whichever one feels smoothest to you is basically enough.
Capturing and analyzing the entire home-page load reveals that the URL serving the home-page data is
https://gw.datayes.com/rrp_mammon/web/feed/list, the URL of the next page is:
Page-turning parameter analysis
Observing the URL shows that timestamp and feedIds are the two parameters that control paging. Requesting several more pages shows that 20210401170127 can be read as a point in time: the leading digits appear to be the date of the refresh, and the last six digits are the hour-minute-second portion of the current time, so the parameter can be assembled accordingly.
Fetching multiple pages of data shows that the first four entries of the feedIds parameter are the IDs of the first four items in the first response, and the last number is the ID of the last item in the current response; the list grows with each access, appending the ID of the last item every time. Splicing these into the next-page URL at first failed to return the next page, but copying over the original timestamp made the request succeed, so the problem lay in the timestamp parameter. With the feedIds field spliced correctly, three fields in the response turn out to be dates, namely:
publishTime. Further analysis shows that removing the last three zeros leaves a Unix timestamp, and converting it gives exactly the result we need.
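Based on this analysis, the conversion could look like the sketch below. The field name publishTime comes from the capture above; the exact yyyymmddHHMMSS layout of the timestamp parameter is my reading of the captured URLs, so treat it as an assumption.

```python
import time

def publish_time_to_param(publish_time_ms):
    """Turn the publishTime field (milliseconds, e.g. ...687000) into
    the yyyymmddHHMMSS string used by the timestamp URL parameter."""
    seconds = publish_time_ms // 1000  # drop the trailing three zeros (ms -> s)
    return time.strftime("%Y%m%d%H%M%S", time.localtime(seconds))

# Example with a made-up millisecond value from early April 2021:
print(publish_time_to_param(1617267687000))
```

The result depends on the local timezone; on the site's side the server presumably formats it in China Standard Time.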
The scrapy shell can simulate a request to an address and drop us into an interactive terminal, where we can inspect every part of the request and response and debug. It works well for reproducing GET requests in Scrapy's own format, but it is not as convenient for POST requests.
Postman is a Chrome plug-in for debugging web pages and sending HTTP requests. It makes it easy to simulate all kinds of requests to debug an interface, and it can be used to verify our ideas while writing crawlers.
Postman's official tutorial is very detailed and is a good place to learn. If you want a Chinese-localized version, you can download one from a Postman localization project.
Sending the request through Postman returns the data we want, nicely formatted and easier to read. Combined with scrapy shell debugging, we can easily obtain the data we need.
scrapy startproject datayes
In settings.py you can set
ROBOTSTXT_OBEY = True to follow the rules in robots.txt.
scrapy genspider mammon gw.datayes.com
The default start_urls is not the link we want to crawl; replace it with the one we need:
start_urls = ['https://gw.datayes.com/rrp_mammon/web/feed/list']
Following the earlier analysis, design and implement parse. The main difficulty this time is how to splice next_url: because the feedIds parameter accumulates, it is kept outside the parse function (as an attribute of the spider) so that it keeps accumulating across requests.
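The splicing logic, with feedIds kept outside parse so it accumulates, might look like this. It is written as plain Python so it can run outside Scrapy; the comma-separated feedIds layout and the sample IDs are assumptions based on the captured URLs.

```python
from urllib.parse import urlencode

BASE = "https://gw.datayes.com/rrp_mammon/web/feed/list"

class FeedPager:
    """Accumulates feedIds across pages as described above: keep the
    first four IDs from the first response, then append the ID of the
    last item of every response."""

    def __init__(self):
        self.feed_ids = []  # lives outside parse so it accumulates

    def next_url(self, ids_in_response, timestamp):
        if not self.feed_ids:
            # first response: seed with its first four IDs
            self.feed_ids.extend(ids_in_response[:4])
        # every response: append the ID of its last item
        self.feed_ids.append(ids_in_response[-1])
        query = {
            "timestamp": timestamp,  # e.g. "20210401170127"
            "feedIds": ",".join(map(str, self.feed_ids)),
        }
        # safe="," keeps the commas unescaped in the query string
        return BASE + "?" + urlencode(query, safe=",")

pager = FeedPager()
print(pager.next_url([101, 102, 103, 104, 105], "20210401170127"))
```

In a real spider, `FeedPager`'s state would live as attributes on the spider class, and `next_url` would be called from `parse` to yield the next `scrapy.Request`.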
While running the crawler I noticed that the cookie changes every two days, and it only stays valid after logging in. Examining the parameters in the cookie shows that
cloud-sso-tokenSelect the necessary parameters and add them to settings.py.
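In settings.py this could look like the fragment below. The token value is a placeholder to replace with your own after logging in; disabling Scrapy's cookie middleware so that the Cookie header configured here is actually sent is one common way to set this up.

```python
# settings.py -- carry the login cookie on every request.
DEFAULT_REQUEST_HEADERS = {
    # Placeholder: paste your own cloud-sso-token value after logging in.
    "Cookie": "cloud-sso-token=YOUR_TOKEN_HERE",
}
# Scrapy's cookie middleware would otherwise override the Cookie header;
# disabling it makes the header above take effect.
COOKIES_ENABLED = False
```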
First, configure pipeline and database related parameters in settings.py
# Configure item pipelines
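A sketch of the relevant settings.py fragment: the pipeline path follows Scrapy's convention for a project named datayes, and the database parameter names and values are placeholders, not taken from the original post.

```python
# settings.py
ITEM_PIPELINES = {
    # project.module.Class: priority (lower runs first)
    "datayes.pipelines.DatayesPipeline": 300,
}

# Database parameters read back via spider.settings in the pipeline.
# Placeholder names and values -- adjust to your own database.
DB_HOST = "localhost"
DB_PORT = 3306
DB_USER = "root"
DB_PASSWORD = "******"
DB_NAME = "datayes"
```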
Add open_spider and close_spider methods to the DatayesPipeline class we defined, and import the database-related parameters through spider.settings.
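A minimal sketch of such a pipeline. The original post does not name the database, so this version uses the standard-library sqlite3 as a stand-in, reading a DB_NAME setting via spider.settings; the feed table schema and item fields are illustrative placeholders.

```python
import sqlite3

class DatayesPipeline:
    """Open the database when the spider starts, store each item,
    and commit/close when the spider finishes."""

    def open_spider(self, spider):
        # Read database parameters through spider.settings.
        db = spider.settings.get("DB_NAME", "datayes.db")
        self.conn = sqlite3.connect(db)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS feed (id TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Upsert so re-crawled pages do not raise on duplicate IDs.
        self.conn.execute(
            "INSERT OR REPLACE INTO feed VALUES (?, ?)",
            (item["id"], item["title"]),
        )
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
```

With a MySQL backend the structure is identical: connect in open_spider using the parameters from spider.settings, and commit and close in close_spider.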
Finally, create the database and start the crawler to crawl the data.
This work is licensed under a CC license; reproduction must credit the author and link to this article.