WOW! This article introduces in detail a treasure of a library more powerful than requests

Time: 2022-1-7

Hello, I’m Jiannan!

While making a tutorial, I crashed a novel website, which really gave me a scare. Every request came back with error code 503, meaning the server was unavailable, all because I had written the crawler with coroutines.

Be careful: this article is for learning and practice only. Do not wreck other people’s networks, or you bear the consequences yourself!!

Because the server could not take that much pressure, its resources became temporarily inaccessible; once I stopped the crawler, the novel website gradually returned to normal.

If you read my blog carefully, you will find that I have written only one summary article each on multithreading, queues, and multiprocessing, yet today is already the fifth time I am writing about coroutines. To be honest, coroutines involve a great many pitfalls and are not easy; you have to keep summarizing the problems you hit and keep optimizing your code, over and over.

I don’t use multithreading, multiprocessing, or queues very much, which is why each got only a single summary. I hope readers will forgive me.

Coroutines

A coroutine is, in essence, single-threaded: it simply uses the waiting time inside a program to keep switching between blocks of code. Switching between coroutine tasks is highly efficient and reuses the time a thread would otherwise spend idly waiting, which is why coroutines get priority in real-world I/O-heavy work.
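To make that concrete, here is a minimal sketch of my own (not from the original article): two coroutines share one thread, and the waiting time of one is used to run the other.

import asyncio


async def task(name, delay):
    print(f'{name} started')
    # While this coroutine waits, the event loop switches to the other one
    await asyncio.sleep(delay)
    print(f'{name} finished after {delay}s')


async def main():
    # Two coroutines on a single thread: total time is about 2s, not 3s
    await asyncio.gather(task('A', 2), task('B', 1))


asyncio.run(main())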

First look at the asynchronous HTTP framework httpx

If you don’t know coroutines yet, consider digging out my earlier articles for a quick primer. I’m sure everyone is familiar with the requests library, but the HTTP requests implemented in requests are synchronous. In fact, the I/O-blocking nature of HTTP requests makes them an excellent match for asynchronous requests.

httpx is an open-source library that inherits all the features of requests and, on top of that, supports asynchronous HTTP requests.

Installing httpx

pip install httpx
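As a quick sanity check after installing (a small sketch of my own, not from the article), the synchronous side of httpx looks just like requests:

import httpx

# The synchronous API mirrors requests almost one for one
resp = httpx.get('http://www.httpbin.org/get')
print(resp.status_code)        # 200
print(resp.json()['headers'])  # parsed JSON body, just like requests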

Practice

Next, I will compare how long batches of HTTP requests take with httpx in synchronous and asynchronous mode. Let’s look at the results together.

import httpx
import threading
import time


def send_requests(url, sign):
    status_code = httpx.get(url).status_code
    print(f'send_requests:{threading.current_thread()}:{sign}: {status_code}')


start = time.time()
url = 'http://www.httpbin.org/get'
[send_requests(url, sign=i) for i in range(200)]
end = time.time()
print('Run time:', int(end - start))

The code is fairly simple. You can see that send_requests accesses the target address 200 times, synchronously.

Some operation results are as follows:

send_requests:<_MainThread(MainThread, started 9552)>:191: 200
send_requests:<_MainThread(MainThread, started 9552)>:192: 200
send_requests:<_MainThread(MainThread, started 9552)>:193: 200
send_requests:<_MainThread(MainThread, started 9552)>:194: 200
send_requests:<_MainThread(MainThread, started 9552)>:195: 200
send_requests:<_MainThread(MainThread, started 9552)>:196: 200
send_requests:<_MainThread(MainThread, started 9552)>:197: 200
send_requests:<_MainThread(MainThread, started 9552)>:198: 200
send_requests:<_MainThread(MainThread, started 9552)>:199: 200
Running time: 102

From the output, you can see the main thread executing the requests one by one, because they are synchronous.

The program took 102 seconds.

Here it comes, here it comes. Now let’s try asynchronous HTTP requests and see what surprises they bring us.

import asyncio
import httpx
import threading
import time


client = httpx.AsyncClient()


async def async_main(url, sign):

    response = await client.get(url)
    status_code = response.status_code
    print(f'{threading.current_thread()}:{sign}:{status_code}')


def main():
    loop = asyncio.get_event_loop()
    tasks = [async_main(url='https://www.baidu.com', sign=i) for i in range(200)]
    async_start = time.time()
    loop.run_until_complete(asyncio.wait(tasks))
    async_end = time.time()
    loop.close()
    print('Run time:', async_end - async_start)


if __name__ == '__main__':
    main()

Some operation results are as follows:

<_MainThread(MainThread, started 13132)>:113:200
<_MainThread(MainThread, started 13132)>:51:200
<_MainThread(MainThread, started 13132)>:176:200
<_MainThread(MainThread, started 13132)>:174:200
<_MainThread(MainThread, started 13132)>:114:200
<_MainThread(MainThread, started 13132)>:49:200
<_MainThread(MainThread, started 13132)>:52:200
Running time: 1.4899322986602783

Are you surprised by this running time? 200 visits to Baidu in just over 1 second. Fast enough to fly.
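One caveat: passing bare coroutines to asyncio.wait is deprecated and rejected on newer Python versions (3.11+), so if the script above fails on a recent interpreter, an equivalent driver (a sketch reusing async_main from above) looks like this:

async def run_all():
    # Build the coroutines; gather schedules and awaits them all
    tasks = [async_main(url='https://www.baidu.com', sign=i) for i in range(200)]
    await asyncio.gather(*tasks)


# asyncio.run() creates and closes the event loop for us
asyncio.run(run_all())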

Limit concurrency

As I mentioned earlier, too much concurrency will crush a server, so we have to consider limiting the concurrency. How do we limit it when asyncio is combined with httpx?

Using Semaphore

asyncio actually ships with a class for limiting the number of concurrent coroutines, called Semaphore. We just initialize it with the maximum number of coroutines allowed to run at once, and then use it as an async context manager. The specific code is as follows:

import asyncio
import httpx
import time


async def send_requests(delay, sem):
    print(f'Requesting an interface with a delay of {delay} seconds')
    await asyncio.sleep(delay)
    async with sem:
        # Execute the concurrency-limited code
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get('http://www.httpbin.org/get')
            print(resp)


async def main():
    start = time.time()
    delay_list = [3, 6, 1, 8, 2, 4, 5, 2, 7, 3, 9, 8]
    task_list = []
    sem = asyncio.Semaphore(3)
    for delay in delay_list:
        task = asyncio.create_task(send_requests(delay, sem))
        task_list.append(task)
    await asyncio.gather(*task_list)
    end = time.time()
    print('Total time:', end - start)


asyncio.run(main())

Some operation results are as follows:

<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
<Response [200 OK]>
Total time: 9.540421485900879

However, what if you want to allow only three coroutines through per minute?

Just hold the semaphore for the full minute before releasing it, so each of the three slots frees up only once every 60 seconds. The changed code is shown below:

async def send_requests(delay, sem):
    print(f'Requesting an interface with a delay of {delay} seconds')
    await asyncio.sleep(delay)
    async with sem:
        # Execute the concurrency-limited code
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get('http://www.httpbin.org/get')
            print(resp)
        # Keep holding the semaphore for 60 seconds, so at most
        # three requests go out per minute
        await asyncio.sleep(60)

Summary

If you want to limit the number of concurrent coroutines, the simplest way is a Semaphore. Note, though, that it must be created before the coroutines start and then passed into them, so that all the concurrent coroutines hold a reference to the same Semaphore object.

Of course, different parts of a program may need different concurrency limits, in which case you simply initialize several Semaphore objects, as sketched below.
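A minimal sketch of my own (the names worker, sem_fetch, and sem_save are made up for illustration):

import asyncio


async def worker(name, sem):
    async with sem:
        await asyncio.sleep(1)  # stand-in for some I/O-bound work
        print(f'{name} done')


async def main():
    sem_fetch = asyncio.Semaphore(10)  # up to 10 concurrent downloads
    sem_save = asyncio.Semaphore(2)    # but only 2 concurrent file writers
    await asyncio.gather(
        *[worker(f'fetch-{i}', sem_fetch) for i in range(20)],
        *[worker(f'save-{i}', sem_save) for i in range(5)],
    )


asyncio.run(main())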

Hands-on practice: Biquge

Web page analysis

(Screenshot: novel home page)

First, on the novel’s home page, you can find that all the chapter links sit in the href attribute of the a tag under each dd tag.

The first step is to get all the chapter links.

The next thing to do is to go into each chapter and get the content.

(Screenshot: novel chapter page)

As the chapter page shows, the article content sits in a <div id="content"> tag, and there are a large number of line breaks and blanks in it, so the code will also need to strip whitespace.
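The stripping itself is a one-line idiom, used in the save step further down; a tiny sketch of how it behaves:

# str.split() with no argument splits on any whitespace (spaces, tabs,
# \r\n, even full-width spaces), so joining with '' removes all of it
line = '\u3000\u3000正文内容\r\n'
print(''.join(line.split()))  # 正文内容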

Get web source code

# Imports used by all the snippets in this section
import asyncio

import httpx
from lxml import etree


async def get_home_page(url, sem):
    async with sem:
        async with httpx.AsyncClient(timeout=20) as client:
            resp = await client.get(url)
            resp.encoding = 'utf-8'
            html = resp.text
            return html

Get links to all chapters

async def parse_home_page(sem):
    async with sem:
        url = 'https://www.biqugeu.net/13_13883/'
        html = etree.HTML(await get_home_page(url, sem))
        content_urls = ['https://www.biqugeu.net/' + url for url in html.xpath('//dd/a/@href')]
        return content_urls

One thing to note here: I do one extra operation, splicing the URLs. The hrefs we grab are not complete URLs, so a simple join with the site root is needed.
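Plain string concatenation works as long as the hrefs have a predictable shape; a more defensive alternative (my suggestion, not what the article uses) is urllib.parse.urljoin, which resolves both root-relative and page-relative paths against the base (the '123.html' filenames below are hypothetical):

from urllib.parse import urljoin

base = 'https://www.biqugeu.net/13_13883/'
print(urljoin(base, '/13_13883/123.html'))  # https://www.biqugeu.net/13_13883/123.html
print(urljoin(base, '123.html'))            # https://www.biqugeu.net/13_13883/123.html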

Save data

async def data_save(url, sem):
    async with sem:
        html = etree.HTML(await get_home_page(url, sem))
        title = html.xpath('//h1/text()')[0]
        contents = html.xpath('//div[@id="content"]/text()')
        print(f'Downloading {title}')
        for content in contents:
            text = ''.join(content.split())

            with open(f'./jinzhi2/{title}.txt', 'a', encoding='utf-8') as f:
                f.write(text)
                f.write('\n')

Each URL obtained above is passed into the data_save() function, which parses the page, extracts the text content, and then saves it.

Create coroutine tasks

async def main():
    sem = asyncio.Semaphore(20)
    urls = await parse_home_page(sem)
    tasks_list = []
    for url in urls:
        task = asyncio.create_task(data_save(url, sem))
        tasks_list.append(task)
    await asyncio.gather(*tasks_list)


asyncio.run(main())

Result display

(Screenshot: grab results)

In less than a minute, the entire novel was grabbed. Imagine how long an ordinary synchronous crawler would take for that.

At least 737 seconds!!

Finally

Will this be the last article I write on coroutines? Definitely not. There is still the asynchronous network request library aiohttp; I will share an article about it once I have worked through it.

That’s the end of this sharing. If you have read this far, I hope you can give me a 【Like】 and a 【Looking】, and if you can, share it with more people so we can learn together.

Every word of this article is written with my heart; your 【Like】 lets me know that you are the one working alongside me.
