Python crawler: collecting destination data from QiongYou. The world is so big, and I want to have a look.


The text and pictures in this article come from the Internet and are for learning and communication only; they have no commercial use. The copyright belongs to the original author. If you have any questions, please contact us promptly.

1、 Preface

The world is so big, I want to see it.
Whether by reading or by traveling, either the body or the mind must be on the road.
I think most of us yearn to travel. So what itineraries and popular scenic spots does a region offer? You usually have to look up travel guides online. Today, I will walk you through collecting scenic-spot data from a travel website.

2、 Course highlights

  1. Systematically analysing how the target web pages are structured
  2. Parsing structured data
  3. Saving the data to CSV

3、 Library used

import csv
import requests
import parsel
from concurrent.futures import ProcessPoolExecutor
import multiprocessing
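
Of these, `requests` and `parsel` are third-party packages; `csv`, `multiprocessing`, and `concurrent.futures` ship with Python. If the third-party packages are not installed yet, they can be added with pip:

```shell
pip install requests parsel
```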


4、 Environment configuration

Python 3.6

5、 General implementation steps of a crawler:

1. Find the URL where the data lives
2. Send the network request
3. Parse the response (extract the data we need)
4. Store the data

6、 Find where the data is

lock = multiprocessing.Lock()  # create a process lock object

def send_request(url):
    """Request the page and return its HTML text."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}
    html_data = requests.get(url=url, headers=headers).text
    return html_data
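
The `send_request` above returns whatever the server sends, even on errors. A slightly more defensive variant (a hypothetical helper, not from the original post) adds a timeout, fails loudly on HTTP error codes, and guards against mis-detected encodings:

```python
import requests

def send_request_safe(url, timeout=10):
    """Like send_request, but with a timeout and an HTTP status check
    (hypothetical helper, not part of the original post)."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                      ' (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    response.encoding = response.apparent_encoding  # guard against mis-detected encodings
    return response.text
```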


7、 Full code:

def parse_data(html_data):
    selector = parsel.Selector(html_data)
    lis = selector.xpath('//ul[@class="plcCitylist"]/li')

    for li in lis:
        travel_place = li.xpath('.//h3/a/text()').get()  # destination name
        travel_people = li.xpath('.//p[@class="beento"]/text()').get()  # number of people who have been there

        travel_hot = li.xpath('.//p[@class="pois"]/a/text()').getall()  # popular scenic spots
        travel_hot = [hot.strip() for hot in travel_hot]
        travel_hot = '、'.join(travel_hot)

        travel_url = li.xpath('.//h3/a/@href').get()  # destination details page URL
        travel_imgUrl = li.xpath('./p/a/img/@src').get()  # image URL
        print(travel_place, travel_people, travel_hot, travel_url, travel_imgUrl, sep=' | ')

        yield travel_place, travel_people, travel_hot, travel_url, travel_imgUrl

def save_data(data_generator):
    with open('qiongyou.csv', mode='a', encoding='utf-8', newline='') as f:
        csv_write = csv.writer(f)
        for data in data_generator:
            lock.acquire()  # acquire the lock
            csv_write.writerow(data)
            lock.release()  # release the lock
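
The lock-then-write pattern in `save_data` can be exercised on its own. The sketch below uses a hypothetical `save_rows` helper that mirrors `save_data` but takes a plain list and an explicit path, writing to a temporary file so the example is self-contained:

```python
import csv
import multiprocessing
import os
import tempfile

lock = multiprocessing.Lock()

def save_rows(rows, path):
    """Append rows to a CSV file under the lock (hypothetical helper mirroring save_data)."""
    # Holding the lock keeps concurrent writers from interleaving half-written lines.
    with lock, open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)

path = os.path.join(tempfile.gettempdir(), 'qiongyou_demo.csv')
save_rows([('Tokyo', '1200 been there', 'Senso-ji、Shibuya', '/tokyo/', 'img.jpg')], path)
```

One caveat: under the `spawn` start method (the default on Windows), each worker process re-creates a module-level lock on import, so the lock does not actually serialize writers across processes there; OS-level file locking or funneling all writes through a single process is more robust.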

def main(url):
    html_data = send_request(url)
    parse_result = parse_data(html_data)
    save_data(parse_result)  # consume the generator so the rows are actually written

if __name__ == '__main__':
    # main('')
    with ProcessPoolExecutor(max_workers=13) as executor:
        for page in range(1, 172):
            url = f'{page}/'
            executor.submit(main, url)

