AutoScraper! Make your crawler “smart”!

Time:2021-7-25

[Introduction]: AutoScraper is an intelligent, automatic, fast and lightweight web scraper. It is simple to use and spares you the trouble of manually parsing web pages and writing extraction rules.

Brief introduction

AutoScraper is a web scraper implemented in Python. It is compatible with Python 3 and can quickly and intelligently extract data from a specified web page. The data can be text, URLs, or other HTML elements. It also learns the scraping rules from the sample data you give it and returns similar elements.
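The idea of learning a rule from one sample and returning similar elements can be illustrated with a toy sketch. This is not AutoScraper's actual algorithm: it assumes a "rule" is simply the tag and class that enclose the sample text, and the names TextCollector, learn_and_apply and SAMPLE_HTML are invented for this illustration. It uses only the standard library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Record each text node together with the tag and class that enclose it."""

    def __init__(self):
        super().__init__()
        self._stack = []  # open (tag, class) pairs, innermost last
        self.nodes = []   # (tag, class, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.nodes.append((tag, cls, text))


def learn_and_apply(html, wanted):
    """Learn which (tag, class) encloses `wanted`, then return all similar texts."""
    collector = TextCollector()
    collector.feed(html)
    # "Learn" the rule from the sample text.
    rules = {(tag, cls) for tag, cls, text in collector.nodes if text == wanted}
    # "Apply" the rule to every text node on the page.
    return [text for tag, cls, text in collector.nodes if (tag, cls) in rules]


# Invented sample page, so the sketch runs without network access.
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/q/1">First question</a></li>
  <li><a class="title" href="/q/2">Second question</a></li>
</ul>
<p class="footer">About this site</p>
"""

print(learn_and_apply(SAMPLE_HTML, "First question"))
# ['First question', 'Second question']
```

One sample title is enough to recover both titles, while the footer text is left out because it sits under a different tag and class.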

Download and install

The source code address of the project is:

https://github.com/alirezamik… 

AutoScraper is compatible with Python 3 and can be installed in any of the following ways:

(1) Install the latest version from the Git repository

$ pip install git+https://github.com/alirezamika/autoscraper.git

(2) Install from PyPI

$ pip install autoscraper

(3) Download the source code and install it

$ python setup.py install
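Whichever method is used, it can be handy to confirm that the package is visible to the interpreter. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name autoscraper_version is made up for this example.

```python
from importlib import metadata, util


def autoscraper_version():
    """Return the installed autoscraper version, or None if it is not installed."""
    if util.find_spec("autoscraper") is None:
        return None
    try:
        return metadata.version("autoscraper")
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found
        # (for example, when running from a source checkout).
        return "unknown"


print(autoscraper_version())
```

A result of None means none of the installation methods above has taken effect in the current environment.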

Simple use

Suppose we want to get the titles of all the related questions on a Stack Overflow page:

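from autoscraper import AutoScraper

# Page to learn from
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
# One sample of the data we want: the title of a related question on that page
wanted_list = ["How to call an external command?"]

scraper = AutoScraper()
# build() learns the scraping rules from the sample and returns the matching elements
result = scraper.build(url, wanted_list)
print(result)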

The output is as follows:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Grab similar results

To get the related question titles from other Stack Overflow pages as well, call the get_result_similar method directly:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

[Image: capture results for the two pages]
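Once a scraper has been built, the same call can be repeated for any number of pages. The sketch below shows one way to collect the results per URL. The helper scrape_pages and the StubScraper class are hypothetical; the stub only mimics the get_result_similar method name so that the example runs without network access.

```python
def scrape_pages(scraper, urls):
    """Run an already-built scraper over several pages, keyed by URL."""
    return {url: scraper.get_result_similar(url) for url in urls}


class StubScraper:
    """Stand-in for a built AutoScraper; returns fake data instead of fetching."""

    def get_result_similar(self, url):
        return ["title scraped from " + url]


pages = [
    "https://stackoverflow.com/questions/2081586/web-scraping-with-python",
    "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string",
]
results = scrape_pages(StubScraper(), pages)
for url, titles in results.items():
    print(url, titles)
```

With the real library, the scraper built in the earlier example would be passed in place of StubScraper().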

Grab exact results

When you only want an exact result, use the get_result_exact method, which retrieves the data in exactly the same order as in wanted_list:

scraper.get_result_exact('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

For example, this grabs the title of the second related question on the page.

[Image: execution result]

Custom request module parameters

You can also pass any custom parameters to the underlying requests module. For example, you might want to use a proxy or custom headers:

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}
result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
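The text mentions custom headers as well as proxies. A small sketch of how such a dictionary might be assembled, assuming request_args is forwarded as keyword arguments to the requests call, so that keys such as proxies, headers and timeout are accepted. The helper make_request_args and the User-Agent value are invented for this example.

```python
def make_request_args(proxy=None, user_agent=None, timeout=10):
    """Build keyword arguments intended for the underlying requests call."""
    args = {"timeout": timeout}  # seconds to wait before giving up
    if proxy:
        # Assumption: the same HTTP proxy handles both plain and TLS traffic.
        args["proxies"] = {"http": proxy, "https": proxy}
    if user_agent:
        args["headers"] = {"User-Agent": user_agent}
    return args


request_args = make_request_args(
    proxy="http://127.0.0.1:8001",
    user_agent="my-crawler/0.1",  # hypothetical value
)
print(request_args)
```

The resulting dictionary would then be passed as scraper.build(url, wanted_list, request_args=request_args).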

Capture multiple pieces of information

Suppose we want to grab the About text, the number of stars, and the link to the issues page from a GitHub repository page:

from autoscraper import AutoScraper
url = 'https://github.com/alirezamika/autoscraper'
wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.5k', 'https://github.com/alirezamika/autoscraper/issues']
scraper = AutoScraper()
scraper.build(url, wanted_list)

[Image: execution results]
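When several values are scraped at once, a plain list can be hard to read. Since get_result_exact is described above as returning data in the same order as wanted_list, each value can be paired with a label by position. This is only a sketch: label_results and the label names are hypothetical, and the values are copied from wanted_list rather than scraped.

```python
def label_results(labels, values):
    """Pair each scraped value with a readable label, in order."""
    if len(labels) != len(values):
        raise ValueError(f"expected {len(labels)} values, got {len(values)}")
    return dict(zip(labels, values))


# Hypothetical labels; the values are copied from wanted_list for illustration.
labels = ["about", "stars", "issues_url"]
values = [
    "A Smart, Automatic, Fast and Lightweight Web Scraper for Python",
    "2.5k",
    "https://github.com/alirezamika/autoscraper/issues",
]
record = label_results(labels, values)
print(record["stars"])  # 2.5k
```

The length check makes a missing or extra value fail loudly instead of silently shifting the labels.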

Save model

We can save the built model for later use:

# Save the model to the given file path
scraper.save('stackoverflow')
# Load the saved model again later
scraper.load('stackoverflow')
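A common pattern with save and load is to build the model only when no saved file exists yet. The sketch below illustrates this with a stand-in object, so it runs without network access. The helper load_or_build and the StubScraper class are hypothetical and only mimic the save and load method names used above.

```python
import os
import tempfile


def load_or_build(scraper, model_path, build):
    """Load a saved model if present; otherwise build one and save it."""
    if os.path.exists(model_path):
        scraper.load(model_path)
        return "loaded"
    build(scraper)  # caller-supplied function that trains the scraper
    scraper.save(model_path)
    return "built"


class StubScraper:
    """Stand-in for AutoScraper with the same save/load method names."""

    def save(self, path):
        with open(path, "w") as f:
            f.write("{}")

    def load(self, path):
        with open(path) as f:
            f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stackoverflow")
    first = load_or_build(StubScraper(), path, build=lambda s: None)
    second = load_or_build(StubScraper(), path, build=lambda s: None)

print(first, second)  # built loaded
```

With the real library, the build argument could be something like lambda s: s.build(url, wanted_list).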

That’s all for this brief introduction to AutoScraper. For more features, see the project's official home page.
