Using Celery to Accelerate Python Crawlers

Time: 2019-3-4

Celery is an asynchronous task queue based on distributed message passing; it focuses on real-time processing and also supports task scheduling. For more background and examples, you can refer to the article "Introduction and Use of Python Celery".
This article will introduce how to use Celery to speed up a crawler.
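As a quick refresher, the basic Celery workflow is to define a task, start a worker, and then call the task asynchronously from the client side. Below is a minimal sketch; the module name demo and the Redis URLs are illustrative assumptions, not part of this project.

from celery import Celery

# A minimal, self-contained Celery example (illustrative only).
app = Celery('demo',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

# With a worker running (celery -A demo worker -l info):
# result = add.delay(3, 4)   # returns immediately with an AsyncResult
# print(result.get())        # blocks until the worker finishes and prints 7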
The crawler in this example comes from the article "N Ways of Python Crawlers", so it is not introduced again here. Our project structure is as follows:

(Screenshot of the project structure: app_test.py, tasks.py and celeryconfig.py in the proj package, plus the crawler script scrapy.py.)

Among these files, app_test.py is the main program; its code is as follows:

from celery import Celery

app = Celery('proj', include=['proj.tasks'])
app.config_from_object('proj.celeryconfig')

if __name__ == '__main__':
    app.start()
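The include argument tells the worker which task modules to import, and config_from_object loads the settings defined in proj/celeryconfig.py (shown below). The same effect could be achieved by setting the options inline, for example (a sketch using two of this project's settings):

# Sketch: an inline equivalent of config_from_object('proj.celeryconfig'),
# showing only two of the settings for brevity.
app.conf.update(
    BROKER_URL='redis://localhost',
    CELERY_RESULT_BACKEND='redis://localhost:6379/0',
)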

tasks.py defines the task functions; the code is as follows:

import re
import requests
from celery import group
from proj.app_test import app

# Parallel invocation task: fan each URL out to its own subtask
@app.task(trail=True)
def get_content(urls):
    return group(C.s(url) for url in urls)()

# Intermediate task that dispatches the parser for a single URL
@app.task(trail=True)
def C(url):
    return parser.delay(url)

# Get the name and description from each page
@app.task(trail=True)
def parser(url):
    req = requests.get(url)
    html = req.text
    try:
        name = re.findall(r'<span class="wikibase-title-label">(.+?)</span>', html)[0]
        desc = re.findall(r'<span class="wikibase-descriptionview-text">(.+?)</span>', html)[0]
        if name is not None and desc is not None:
            return name, desc
    except Exception:
        return '', ''
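Here get_content fans the URL list out as a group of C subtasks, and each C in turn dispatches parser asynchronously; because the tasks are declared with trail=True, the caller can later walk the whole result tree with collect(). For comparison, a flatter alternative (a sketch, not the author's code) would apply parser directly in a group:

# Sketch of a flatter fan-out: build a group of parser signatures and apply it
# directly. Illustrative alternative only, not part of the project.
from celery import group
from proj.tasks import parser

def get_content_direct(urls):
    job = group(parser.s(url) for url in urls)
    return job.apply_async()  # returns a GroupResult; .get() waits for all results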

celeryconfig.py is the Celery configuration file; the code is as follows:

BROKER_URL = 'redis://localhost'  # Use Redis as the message broker

CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'  # Store task results in Redis

CELERY_TASK_SERIALIZER = 'msgpack'  # Serialize and deserialize tasks with msgpack

CELERY_RESULT_SERIALIZER = 'json'  # Reading results has no high performance requirement, so the more readable JSON is used

CELERY_TASK_RESULT_EXPIRES = 60 * 60 * 24  # Task result expiration time (in seconds)

CELERY_ACCEPT_CONTENT = ['json', 'msgpack']  # Accepted content types
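These are the older uppercase setting names; Celery 4 and later also accept the newer lowercase names, so an equivalent configuration would look roughly like this:

# Equivalent configuration with the newer lowercase setting names (Celery 4+).
broker_url = 'redis://localhost'
result_backend = 'redis://localhost:6379/0'
task_serializer = 'msgpack'
result_serializer = 'json'
result_expires = 60 * 60 * 24
accept_content = ['json', 'msgpack']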

Finally, our crawler file scrapy.py is as follows:

import time
import requests
from bs4 import BeautifulSoup
from proj.tasks import get_content

t1 = time.time()

url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
# Request header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, \
            like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send HTTP requests
req = requests.get(url, headers=headers)
# Parse the web page
soup = BeautifulSoup(req.text, "lxml")
# Find the record where name and description are located
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URLs to crawl
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org'+url)

#print(urls)

# Call the get_content function and get the crawler results
result = get_content.delay(urls)

res = [v for v in result.collect()]

for r in res:
    if isinstance(r[1], list) and isinstance(r[1][0], str):
        print(r[1])


t2 = time.time()  # end time
print('Time consumed: %s' % (t2 - t1))
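Note that result.collect() walks the whole result graph produced by get_content (this is what trail=True enables) and yields (AsyncResult, value) pairs, which is why the loop above inspects r[1]. A minimal sketch of the iteration:

# Sketch: collect() yields (AsyncResult, return_value) pairs for every node in
# the result tree; only the parser() results carry a (name, description) pair.
for async_res, value in result.collect():
    print(async_res.id, value)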

Start Redis in the background, switch to the directory containing the proj project, and start the Celery worker with the following command:

celery -A proj.app_test worker -l info

Then run the crawler script scrapy.py. Its output is as follows (only the last few lines are shown):

......
['Antoine de Saint-Exupery', 'French writer and aviator']
['', '']
['Sir John Barrow, 1st Baronet', 'English statesman']
['Amy Johnson', 'pioneering English aviator']
['Mike Oldfield', 'English musician, multi-instrumentalist']
['Willoughby Newton', 'politician from Virginia, USA']
['Mack Wilberg', 'American conductor']
Time consumed: 80.05160284042358

Viewing the data in RDM (Redis Desktop Manager):

(Screenshot of the task results stored in Redis, viewed in RDM.)

In the article "N Ways of Python Crawlers", we already saw that implementing this crawler in the ordinary sequential way takes about 725 seconds, while the Celery version takes only about 80 seconds in total, roughly one ninth of the time. Although it is not as fast as the Scrapy crawler framework or the asynchronous approach with aiohttp and asyncio, it can still serve as a useful idea for speeding up crawlers.
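For reference, the "general way" behind the 725-second figure simply fetches and parses the pages one by one, roughly like this (a hypothetical sketch; the actual code is in the referenced article):

# Rough sketch of the sequential baseline, for comparison only.
import re
import requests

def crawl_sequentially(urls):
    results = []
    for url in urls:
        html = requests.get(url).text
        name = re.findall(r'<span class="wikibase-title-label">(.+?)</span>', html)
        desc = re.findall(r'<span class="wikibase-descriptionview-text">(.+?)</span>', html)
        results.append((name[0] if name else '', desc[0] if desc else ''))
    return results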
That is the end of this post. Thank you for reading!
Note: I have opened a WeChat official account, Python Crawler and Algorithms (WeChat ID: easy_web_scrape). You are welcome to follow it.