In the middle of the night, I used Python to crawl the entire doutu (meme battle) website. Not convinced? Come and fight.

Time: 2020-6-28


Meme battles on QQ and WeChat are always hard to win, but crawling the memes directly is easy. Now I have the whole site's images. Not convinced? Come and fight.

Without further ado: the site I picked is doutula.com, a meme site. Let's take a quick look at its structure.

(Screenshot: a list page, showing several sets of images per page.)

From the screenshot we can see that a single page contains multiple sets of images, so we need to think about how to store each set separately (explained in detail below).

Inspecting the page shows that everything we need is already in the initial HTML, so there is no asynchronous loading to deal with. That leaves pagination, and clicking through a few pages makes the pagination rule easy to spot.
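As a quick sanity check, here is a minimal sketch (the class name list-group-item is taken from the full code below): if the per-set anchors appear in the raw HTML, the page is rendered server-side and no AJAX handling is needed.

# -*- coding:utf-8 -*-
import requests

# If the <a class="list-group-item"> anchors that wrap each image set
# show up in the raw HTML, there is no asynchronous loading to handle.
html = requests.get('https://www.doutula.com/article/list/?page=1').text
print('list-group-item' in html)  # expected: True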

The structure of the paging URL is easy to understand, and the image links all sit in the page source, so once that is clear you can start writing the scraping code.
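For example, the list of page URLs can be generated like this (the URL pattern comes from the full code below; only the first 30 pages are used as a test):

base = 'https://www.doutula.com/article/list/?page={}'
page_urls = [base.format(page) for page in range(1, 31)]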

How to save pictures

Because I want to save each set of images into its own folder (using the os module), I name the folder after the last segment of each set's URL, and then name each file after the last segment of the image URL. The code below shows the details.
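A minimal sketch of that naming logic (the URLs here are hypothetical, just to show the string handling):

import os

sub_url = 'https://www.doutula.com/article/detail/12345'  # hypothetical set URL
folder = os.path.join(r'J:\train\image', sub_url.split('/')[-1])  # J:\train\image\12345
os.makedirs(folder, exist_ok=True)  # one folder per image set

img_url = 'http://img.doutula.com/production/uploads/image/demo.jpg'  # hypothetical image URL
filename = img_url.split('/')[-1]  # demo.jpg
img_path = os.path.join(folder, filename)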

With that settled, the next step is the code (this follows my parsing approach; only the first 30 pages are fetched as a test). Full source code:

# -*- coding:utf-8 -*-
import os

import requests
from bs4 import BeautifulSoup


class doutuSpider(object):
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

    def get_url(self, url):
        # Fetch one list page and walk every image set on it.
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all("a", {"class": "list-group-item"})
        for one in totals:
            sub_url = one.get('href')
            # Name the per-set folder after the last segment of the set URL.
            path = os.path.join(r'J:\train\image', sub_url.split('/')[-1])
            os.makedirs(path, exist_ok=True)
            try:
                self.get_img_url(sub_url, path)
            except Exception:
                pass

    def get_img_url(self, url, path):
        # Fetch one set's detail page and download every image in it.
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all('div', {'class': 'artile_des'})
        for one in totals:
            img = one.find('img')
            if img is None:
                continue
            sub_url = img.get('src')
            if not sub_url:
                continue
            # The src attributes are protocol-relative (//...), so add a scheme.
            urls = 'http:' + sub_url
            try:
                self.get_img(urls, path)
            except Exception:
                pass

    def get_img(self, url, path):
        # Name the file after the last segment of the image URL.
        filename = url.split('/')[-1]
        img_path = os.path.join(path, filename)
        img = requests.get(url, headers=self.headers)
        try:
            with open(img_path, 'wb') as f:
                f.write(img.content)
        except Exception:
            pass

    def create(self):
        # Only the first 30 list pages are crawled as a test.
        for count in range(1, 31):
            url = 'https://www.doutula.com/article/list/?page={}'.format(count)
            print('start downloading page {}'.format(count))
            self.get_url(url)


if __name__ == '__main__':
    spider = doutuSpider()
    spider.create()

Result

Summary

In general, the structure of this website is not very complicated. You can use the approach here as a reference and go crawl some other interesting sites.