Day 12 | Python in 12 Days: What Is a Web Crawler?


The pretty young woman from the HR department came to ask me: Lao Chen, what is the relationship between data analysis and crawlers? To be honest, I didn't really want to answer, because I figured it had little to do with her job. But then I remembered that she handles recruiting for my department, so I told her: data analysis and crawlers go hand in hand, and each depends on the other.

In the era of big data, data analysis needs data sources first. The trickle of data inside the company is not enough on its own. Only by learning to write crawlers and pulling relevant, useful data from external websites can you give the boss a basis for business decisions, and the same holds when you are the boss yourself.

At the mention of bosses, the HR girl got excited and immediately blurted out: in your IT world, isn't the most handsome one Boss Li, the search engine guy?

I was a little unconvinced and a little unhappy, but what could I do? After all, when it comes to web crawlers, his (Boss Li's) technology really is better. He uses crawlers to work through the massive amount of information on the Internet every day, pick out the high-quality pages, and include them in his database. When a user enters keywords into the search engine, the engine analyzes the keywords, finds the relevant pages among those it has included, sorts them according to its ranking rules, and displays the results to the user.

When I thought about all the money made from search rankings, and how not a single cent of it passes between Li and me, I told the HR girl: "All right, I won't stand here chatting with you. I'm going to explain the principle of web crawlers to my readers. You go find your Boss Li."

1. What is a crawler

A web crawler is also called a web spider, web ant, or web robot. It crawls data on the network according to rules we define. The results may contain HTML code, JSON data, images, audio, or video. Depending on the actual requirements, the programmer filters the data, extracts the useful parts, and stores them.

Put plainly, it means using the Python programming language to simulate a browser: visit the specified website, receive the returned results, filter them according to your rules to extract the data you need, and store it for later use.

If you have read my "Day 10 | Python in 12 Days: File Operations" and "Day 11 | Python in 12 Days: Database Operations", you already know that data is usually kept in files or databases.

2. Crawling process

A user gets data from the network through a browser: open the browser, enter the web address, let the browser submit the request and download the page code, and then watch it render that code into a page.

A crawler does the same thing in code: specify the web address -> simulate the browser sending a request (to get the page code) -> extract the useful data -> store it in a file or database.
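Here is a rough sketch of that flow using only Python's standard library. The function names (fetch, extract_title, store, crawl), the User-Agent string, and the choice of pulling out the page title are my own illustrative assumptions, not a fixed recipe.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Send a GET request with a browser-like User-Agent and return the page text."""
    request = urllib.request.Request(
        url,
        # Assumed header value; many sites only check that a User-Agent is present.
        headers={"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def store(path, text):
    """Append one line of extracted data to a text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


def crawl(url, path, fetcher=fetch):
    """Request -> extract -> store. `fetcher` can be swapped out for offline testing."""
    html = fetcher(url)
    title = extract_title(html)
    if title is not None:
        store(path, title)
    return title
```

Calling crawl("https://example.com/", "titles.txt") would request the page and append its title to titles.txt. Passing your own fetcher lets you try the extract and store steps without touching the network.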

Python is recommended for writing crawlers because its crawler libraries are simple and easy to use. Python's built-in libraries alone cover most needs. With them you can:

(1) Send a request (including request headers and request body) to the target site with an HTTP library;

(2) Parse the response returned by the server with the built-in libraries (HTML, JSON, regular expressions);

(3) Store the required data in a file or database.
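Steps (2) and (3) can be sketched with the built-in html.parser, json, and sqlite3 modules. This is only a minimal illustration: the LinkCollector class, the links table, and the {"items": [...]} JSON shape are assumptions I made up for the example.

```python
import json
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href="..."> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    """Step (2) for an HTML response: return a list of (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def parse_json_items(body):
    """Step (2) for a JSON response; the top-level "items" key is an assumed shape."""
    return json.loads(body).get("items", [])


def save_links(conn, links):
    """Step (3): store the extracted links in a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, title TEXT)")
    conn.executemany("INSERT INTO links (href, title) VALUES (?, ?)", links)
    conn.commit()
```

For example, save_links(sqlite3.connect("crawl.db"), parse_links(page_html)) would write every link found in page_html to a local database file, where page_html is whatever step (1) returned.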

If the built-in libraries are not enough, you can run pip install followed by the library name to quickly download a third-party library and use it.

3. Locating the node to crawl

When writing crawler code, you often need to specify the node or path to crawl. If I told you that the Chrome browser can get that node or path for you quickly, would you immediately check whether it is installed on your computer?

If it is installed, good. If not, go and install it.

On the page, press F12 to open the developer tools and display the source code. Right-click the node you want on the page and choose Inspect to locate its code. Then right-click that code and choose Copy, then Copy selector or Copy XPath, to copy the node's selector or path.
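To show what you can do with a copied XPath, here is a small sketch using the built-in xml.etree.ElementTree module. The sample page and the XPath string are made up for illustration. ElementTree only understands a limited subset of XPath and needs well-formed markup, so for real-world pages a third-party parser such as lxml is the more usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for a downloaded document.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="news">
      <ul>
        <li><a href="/post/1">First post</a></li>
        <li><a href="/post/2">Second post</a></li>
      </ul>
    </div>
  </body>
</html>
"""

# The kind of string Chrome's "Copy XPath" produces (assumed for this sample page).
COPIED_XPATH = '//*[@id="news"]/ul/li[2]/a'


def select_by_xpath(markup, xpath):
    """Return the first element matching `xpath`, or None if nothing matches."""
    root = ET.fromstring(markup.strip())
    # ElementTree rejects absolute paths on an element, so make the path
    # relative to the root by prefixing it with ".".
    return root.find("." + xpath)


node = select_by_xpath(SAMPLE_PAGE, COPIED_XPATH)
print(node.text, node.get("href"))  # prints: Second post /post/2
```

The same idea carries over to Copy selector: you paste the copied string into whichever parsing library you use and let it find the node for you.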

Well, that is all Lao Chen has to say about how crawlers work. If you found it helpful, please share and like it so that more people can see this article. Your shares and likes are the greatest encouragement for Lao Chen to keep creating and sharing.

An old hand with 10 years as a technical director, sharing years of programming experience. If you want to learn programming, follow my Toutiao account: Lao Chen Talks Programming. I share practical material on Python, front end (mini programs), and apps. Follow me, you won't regret it.
