When learning Python, almost everyone starts with web crawlers. After all, resources on the topic are abundant online, and there are plenty of open-source projects to learn from.
Web crawling in Python breaks down into three main stages: fetching, parsing, and storage.
When we type a URL into the browser and press Enter, what happens behind the scenes?
In short, the process involves four steps:
- Find the IP address corresponding to the domain name.
- Send the request to the server corresponding to IP.
- The server responds to the request and sends back the web content.
- The browser parses the web content.
What a web crawler does, in short, is implement the browser's job in code: given a URL, it returns the data the user needs directly, without driving a browser step by step.
Before fetching, be clear about what you want to get: HTML source, or a string in JSON format? Then parse the content accordingly. As for how to parse and how to process the data, a detailed list of open-source libraries is provided later in this article.
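As a minimal sketch of that decision, the hypothetical helper below guesses whether a fetched payload is JSON or HTML, first from the `Content-Type` header and then by sniffing the body (the function name and fallback logic are illustrative, not from any particular library):

```python
import json

def classify_payload(content_type: str, body: str) -> str:
    """Guess whether a fetched payload is JSON or HTML.

    content_type: the value of the HTTP Content-Type response header.
    body: the decoded response body.
    """
    if "application/json" in content_type:
        return "json"
    if "text/html" in content_type:
        return "html"
    # Header was inconclusive: try to parse the body as JSON.
    try:
        json.loads(body)
        return "json"
    except ValueError:
        return "html"

print(classify_payload("application/json; charset=utf-8", "{}"))  # json
print(classify_payload("text/html", "<html></html>"))             # html
```

In real code you would feed this the `Content-Type` header and body of whatever HTTP response your fetching library returns.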
Of course, crawling other people's sites is likely to run into anti-crawler mechanisms. What then? Use a proxy.
This applies when the site limits requests per IP address, or when "frequent clicks" force you back to a login page with a CAPTCHA.
In that case, the best approach is to maintain a pool of proxy IPs. There are plenty of free proxies on the internet, of wildly varying quality; with some screening you can find usable ones.
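A proxy pool can be as simple as "pick one at random, drop it when it fails." Here is a minimal sketch (class and method names are illustrative; the proxy addresses are placeholders):

```python
import random

class ProxyPool:
    """A minimal rotating proxy pool: pick at random, drop on failure."""

    def __init__(self, proxies):
        self._proxies = list(proxies)

    def get(self):
        # Hand out a random proxy from the surviving candidates.
        if not self._proxies:
            raise RuntimeError("proxy pool exhausted")
        return random.choice(self._proxies)

    def mark_bad(self, proxy):
        # Remove a proxy that timed out, failed, or got banned.
        if proxy in self._proxies:
            self._proxies.remove(proxy)

pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
proxy = pool.get()
pool.mark_bad(proxy)  # pretend it failed; it won't be handed out again
```

A real pool would also re-test dead proxies periodically and refill itself from a free-proxy list.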
For the "frequent clicks" problem, we can also throttle how often the crawler hits the site, to avoid being banned.
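Throttling can be a simple enforced gap between consecutive requests. A sketch (the `fetch` callable is a stand-in for whatever download function you use):

```python
import time

def throttled_fetch(urls, min_interval=1.0, fetch=lambda url: url):
    """Call fetch(url) for each URL, sleeping so that consecutive
    requests are at least min_interval seconds apart."""
    results = []
    last = 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        results.append(fetch(url))
    return results
```

Randomizing the interval a little (e.g. `min_interval + random.random()`) makes the traffic pattern look less mechanical.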
Some sites check whether you are genuinely browsing or visiting automatically. Setting a User-Agent header signals that you are a browser. Some sites also check whether the request carries a Referer header and whether that Referer is legitimate, so in general it is worth adding one. In other words: disguise the request as a browser visit, and defeat the site's hotlink protection.
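With only the standard library, disguising a request looks like this (the URL, User-Agent string, and Referer value are illustrative):

```python
import urllib.request

# Build a request that presents itself as a regular browser visit:
# the User-Agent claims a desktop browser, and the Referer claims we
# arrived from the site's own front page.
req = urllib.request.Request(
    "https://example.com/data",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://example.com/",
    },
)
print(req.get_header("User-agent"))
# resp = urllib.request.urlopen(req)  # the actual fetch; needs network
```

With the requests library the same idea is `requests.get(url, headers={...})`.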
There are three methods for sites with CAPTCHAs:
- Use a proxy to rotate IP addresses.
- Log in with saved cookies.
- CAPTCHA recognition.
Next, we will focus on CAPTCHA recognition.
The open-source Tesseract OCR engine can be used to recognize downloaded CAPTCHA images, and the recognized text is then passed back to the crawler for a simulated login. Alternatively, the CAPTCHA image can be uploaded to a human captcha-solving service. If recognition fails, simply fetch a fresh CAPTCHA and try again until it succeeds.
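The retry loop described above can be sketched independently of the OCR backend. Here `recognize` and `submit` are placeholders you would supply (e.g. a Tesseract wrapper such as `pytesseract.image_to_string` for `recognize`, and your login request for `submit`):

```python
def solve_captcha(image_bytes, recognize, submit, max_tries=5):
    """Keep recognizing and submitting a CAPTCHA until the site accepts it.

    recognize: callable(image_bytes) -> str, e.g. an OCR wrapper or a
               call to a human captcha-solving service.
    submit:    callable(text) -> bool, True if the login succeeded.
    """
    for _ in range(max_tries):
        text = recognize(image_bytes)
        if submit(text):
            return text
    raise RuntimeError("captcha not solved after %d tries" % max_tries)
```

In practice each retry should also re-download the CAPTCHA image, since most sites issue a new one per attempt.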
That is all for the crawler overview; interested readers can find more details online.
Finally, here is the main point of this article: a catalog of practical Python libraries.
Network
- urllib – network library (stdlib).
- requests – network library.
- grab – network library (based on pycurl).
- pycurl – network library (bindings for libcurl).
- urllib3 – Python HTTP library with thread-safe connection pooling, file post support, and more.
- httplib2 – network library.
- RoboBrowser – a simple, Pythonic library for browsing the web without a standalone browser.
- MechanicalSoup – a Python library for automating interaction with websites.
- mechanize – stateful, programmatic web browsing library.
- socket – low-level networking interface (stdlib).
Web crawler frameworks
- grab – web crawler framework (based on pycurl/multicurl).
- Scrapy – web crawler framework.
- pyspider – a powerful spider system.
- cola – a distributed crawler framework.
HTML/XML parsers
- lxml – efficient HTML/XML processing library written in C; supports XPath.
- cssselect – parses DOM trees with CSS selectors.
- pyquery – parses DOM trees with jQuery-like selectors.
- BeautifulSoup – HTML/XML processing library in pure Python; slower than lxml, but forgiving of malformed markup.
- html5lib – builds DOM trees for HTML/XML documents according to the WHATWG spec, the same parsing rules used by all modern browsers.
- feedparser – parses RSS/Atom feeds.
- MarkupSafe – provides safely escaped strings for XML/HTML/XHTML.
Text processing
Libraries for parsing and manipulating plain text.
- difflib – (Python standard library) helps compute deltas between sequences.
- python-Levenshtein – fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy – fuzzy string matching.
- esmre – regular expression accelerator.
- ftfy – automatically fixes broken Unicode text (mojibake).
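For a quick taste of the string-similarity tools in this group, the standard library's difflib already covers simple cases with no extra dependency:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher (stdlib)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

print(similarity("python crawler", "python crawler"))   # identical -> 1.0
print(similarity("python crawler", "python crawlers"))  # near match, close to 1
```

python-Levenshtein and fuzzywuzzy compute comparable scores much faster on large inputs.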
Natural language processing
Libraries for working with human language.
- NLTK – the leading platform for building Python programs to work with human language data.
- Pattern – a web mining module for Python, with tools for natural language processing, machine learning, and more.
- TextBlob – provides a consistent API for common NLP tasks; built on the shoulders of NLTK and Pattern.
- jieba – Chinese word segmentation tool.
- SnowNLP – Chinese text processing library.
- loso – another Chinese word segmentation library.
Browser automation and simulation
- Selenium – automates real browsers (Chrome, Firefox, Opera, IE).
- Ghost.py – wrapper around PyQt's WebKit (requires PyQt).
- Spynner – wrapper around PyQt's WebKit (requires PyQt).
- Splinter – a general API over several browser backends (Selenium WebDriver, Django test client, Zope).
Multiprocessing and concurrency
- threading – the Python standard library's threading module. Effective for I/O-bound tasks, but useless for CPU-bound work because of Python's GIL.
- multiprocessing – the standard library module for running multiple processes.
- Celery – asynchronous task queue/job queue based on distributed message passing.
- concurrent.futures – provides a high-level interface for asynchronously executing callables.
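Because crawling is I/O-bound, a thread pool speeds it up despite the GIL. A sketch with concurrent.futures, using a sleep as a stand-in for a real download:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for an I/O-bound download; real code would use requests.
    time.sleep(0.1)
    return (url, 200)

urls = ["https://example.com/%d" % i for i in range(8)]
start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() runs fetch() across the worker threads; the 0.1 s waits overlap.
    results = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start
print(len(results), elapsed < 0.8)  # 8 fetches, far faster than serially
```

Eight serial fetches would take about 0.8 s; overlapped in threads they finish in roughly 0.1 s.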
Asynchronous network programming libraries
- asyncio – (Python standard library, 3.4+) asynchronous I/O, event loop, coroutines, and tasks.
- Twisted – an event-driven networking engine.
- Tornado – a web framework and asynchronous networking library.
- pulsar – event-driven concurrency framework for Python.
- diesel – greenlet-based event I/O framework for Python.
- gevent – a coroutine-based Python networking library using greenlet.
- eventlet – asynchronous framework with WSGI support.
- Tomorrow – magic decorator syntax for asynchronous code.
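The asyncio module above achieves the same overlap as threads with a single event loop. A minimal sketch, again with a sleep standing in for a real non-blocking network call (which you would make with a library such as aiohttp):

```python
import asyncio

async def fetch(url):
    # Stand-in for a non-blocking network call.
    await asyncio.sleep(0.1)
    return url

async def main():
    urls = ["https://example.com/%d" % i for i in range(5)]
    # gather() schedules all the coroutines concurrently on one event loop.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 5
```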
Task queues
- Celery – asynchronous task queue/job queue based on distributed message passing.
- huey – a little multi-threaded task queue.
- mrq – "Mr. Queue", a distributed worker task queue in Python using Redis and gevent.
- RQ – a lightweight, Redis-based task queue manager.
- simpleq – a simple, infinitely scalable queue based on Amazon SQS.
- python-gearman – Python API for Gearman.

Cloud computing
- picloud – executes Python code in the cloud.
- dominoup.com – executes R, Python, and MATLAB code in the cloud.
Web content extraction
Libraries for extracting content from web pages.
Text and metadata from HTML pages:
- newspaper – news extraction, article extraction, and content curation in Python.
- html2text – converts HTML to Markdown-formatted text.
- python-goose – HTML content/article extractor.
- lassie – a humane way to retrieve basic content from websites.
WebSocket
Libraries for WebSocket.
- Crossbar – open-source application messaging router (Python implementation of WebSocket and WAMP via Autobahn).
- AutobahnPython – open-source Python implementation of the WebSocket and WAMP protocols.
- WebSocket-for-Python – WebSocket client and server library for Python 2 and 3 as well as PyPy.
DNS resolution
- dnsyo – checks your DNS against more than 1500 DNS servers worldwide.
- pycares – interface to c-ares, a C library for asynchronous DNS requests and name resolution.
Computer vision
- OpenCV – open-source computer vision library.
- SimpleCV – a frontend for cameras, image processing, feature extraction, and format conversion with a readable interface (based on OpenCV).
- Mahotas – fast computer vision algorithms (implemented entirely in C++), operating on NumPy arrays.
Proxy servers
- shadowsocks – a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user, and graceful restart; destination IP blacklist).
- tproxy – a simple TCP routing proxy (layer 7) based on gevent, configured in Python.
Python has many web development frameworks. The big, all-in-one framework is none other than Django, which is also the most widely used; many companies, including some well-known Chinese internet firms, build on it. web.py and Flask are famous for their simplicity and are very easy to pick up. Tornado, known for asynchronous, high-performance serving, has beautifully written source code; both Zhihu and Quora use it.
This is original content from the Yunqi community and may not be reproduced without permission.