Crawler primary operation (2)

Time: 2021-3-9

This article is a brief introduction to the basic operations of a Python web crawler, covering the following two parts:

  • Parsing web pages
  • Data storage

Parsing web pages

Generally speaking, there are three ways to parse web pages: regular expressions, BeautifulSoup, and lxml. Regular expressions are the most difficult to learn, while BeautifulSoup is well suited to beginners and lets you quickly pick up how to extract data from web pages.

Regular expressions

The common regular characters and their meanings are as follows:

. matches any character except the newline character
* matches the previous character 0 or more times
+ matches the previous character 1 or more times
? matches the previous character 0 or 1 times

^ matches the beginning of the string
$ matches the end of the string

() matches the expression inside the parentheses and also defines a group

\s matches any whitespace character
\S matches any non-whitespace character

\d matches a digit, equivalent to [0-9]
\D matches any non-digit, equivalent to [^0-9]

\w matches an alphanumeric character, equivalent to [a-zA-Z0-9_]
\W matches a non-alphanumeric character, equivalent to [^a-zA-Z0-9_]

[] matches any one character from a set of characters

Python regular expressions are commonly used through the following three methods:

re.match method: matches a pattern only at the beginning of the string; if the string does not match from the very start, match() returns None.
Syntax: re.match(pattern, string, flags=0)
pattern: the regular expression
string: the string to match
flags: controls how the regular expression is matched, such as case-insensitive or multiline matching

re.search method: scans the whole string and returns the first match, not just a match at the beginning.

re.findall method: finds all matches and returns them as a list.
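To make the difference between the three methods concrete, here is a minimal sketch (the sample string is made up for illustration):

import re

text = 'Python 3.9 was released in 2020'  # sample string for illustration only

# re.match only succeeds if the pattern matches at the very start of the string
print(re.match(r'Python', text))        # a match object
print(re.match(r'\d+', text))           # None, because the string does not start with digits

# re.search scans the whole string and returns the first match
print(re.search(r'\d+', text).group())  # '3'

# re.findall returns every match as a list
print(re.findall(r'\d+', text))         # ['3', '9', '2020']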

BeautifulSoup

BeautifulSoup extracts data from HTML or XML files. First, install it from the command line:

pip install bs4

To use it, import it first:

from bs4 import BeautifulSoup

For example, use BeautifulSoup to get the article titles from the blog's home page. The code and comments are as follows:

import requests
from bs4 import BeautifulSoup

link = 'http://www.santostang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(link, headers=headers)
#Converts the string of the response body of the web page to a soup object
soup = BeautifulSoup(r.text, 'html.parser')
first_title = soup.find('h1', class_='post-title').a.text.strip()
print('The title of the first article is: %s' % first_title)

title_list = soup.find_all('h1', class_='post-title')

for i in range(len(title_list)):
    title = title_list[i].a.text.strip()
    print('The title of article %s is %s' % (i + 1, title))

Running this script prints the title of each article on the home page, confirming that the required content was captured successfully.

Finally, to wrap up BeautifulSoup, let's look at a practical project: crawling second-hand housing prices in Beijing from Anjuke. The code is as follows:

import requests
from bs4 import BeautifulSoup
import time

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

for i in range(1, 11):
    # Build the URL for each results page (Anjuke list pages follow the /sale/p1/, /sale/p2/, ... pattern)
    link = 'https://beijing.anjuke.com/sale/p' + str(i) + '/'
    r = requests.get(link, headers=headers)
    print('Page', i)

    soup = BeautifulSoup(r.text, 'lxml')
    house_list = soup.find_all('li', class_='list-item')

    for house in house_list:
        name = house.find('div', class_='house-title').a.text.strip()
        price = house.find('span', class_='price-det').text.strip()
        price_area = house.find('span', class_='unit-price').text.strip()

        no_room = house.find('div', class_='details-item').span.text
        area = house.find('div', class_='details-item').contents[3].text
        floor = house.find('div', class_='details-item').contents[5].text
        year = house.find('div', class_='details-item').contents[7].text
        broker = house.find('span', class_='brokername').text
        broker = broker[1:]
        address = house.find('span', class_='comm-address').text.strip()
        address = address.replace('\xa0\xa0\n                 ', '    ')
        tag_list = house.find_all('span', class_='item-tags')
        tags = [i.text for i in tag_list]
        print(name, price, price_area, no_room, area, floor, year, broker, address, tags)
    time.sleep(5)

This successfully crawls the first 10 pages of Beijing second-hand housing prices from Anjuke.

Data storage

There are two kinds of data storage: storing in files (TXT and CSV) and storing in databases (the MySQL relational database and the MongoDB database).

CSV (Comma-Separated Values) is a plain-text file format that stores tabular data (numbers and text) with values separated by commas.
Rows of a CSV file are separated by newline characters, and columns within a row are separated by commas.
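As a small illustration of that layout, writing rows with Python's built-in csv module produces exactly such a file; the file name and values below are made up:

import csv

# The file name and rows are placeholders for illustration only
with open('houses.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'price'])          # each writerow() call produces one comma-separated line
    writer.writerow(['Example house', '500'])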

MySQL is a relational database management system that uses SQL, the most commonly used standardized language for accessing databases. A relational database (a database based on the relational model) stores data in separate tables rather than putting everything into one big warehouse, which speeds up writing and retrieving data and makes storage more flexible.

Storing data in files will not be repeated here. The following first introduces how to store data in a MySQL database.

First, download and install MySQL from the official website. The blogger uses macOS Sierra. After installation, open System Preferences.

A MySQL entry appears at the bottom of the panel; open it and start the MySQL server.

Open a terminal and enter the following command to add MySQL to the PATH:

PATH="$PATH":/usr/local/mysql/bin

Next, log in to MySQL with the command mysql -u root -p and enter your password. After a successful login, the mysql> prompt appears.

Next, the basic operations of MySQL are introduced:

  • Create database

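A minimal sketch of this step, assuming the MyScraping database name that the Python code later connects to:

CREATE DATABASE MyScraping;
USE MyScraping;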

  • Create data table

To create a data table, you must specify the name of each column (column_name) and its data type (column_type).
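A sketch of such a statement, based on the column description below (the exact VARCHAR lengths are assumptions):

CREATE TABLE urls (
    id INT NOT NULL AUTO_INCREMENT,
    url VARCHAR(1000) NOT NULL,        -- length is an assumed value
    content VARCHAR(4000) NOT NULL,    -- length is an assumed value
    created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);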
Four columns are created here: id, url, content, and created_time. The id column is an integer (INT) with the AUTO_INCREMENT attribute, so each newly added row automatically gets an id one larger than the previous one, and PRIMARY KEY defines id as the primary key.

url and content are variable-length strings (VARCHAR); the number in parentheses is the maximum length, and NOT NULL means url and content cannot be empty. created_time does not need to be set when adding data: because it is a timestamp, it is filled in automatically with the time of insertion.

After creating the data table, you can view its structure (for example with DESCRIBE urls;).

  • Insert data into a data table

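A sketch of such an INSERT statement (the content value is made up for illustration):

INSERT INTO urls (url, content) VALUES ('http://www.santostang.com/', 'Hello World!');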

Only the url and content columns are inserted here; id auto-increments and created_time is the timestamp of the insertion, so both are filled in automatically.

  • Extract data from data table


There are three common ways to extract data (sketched after this list):
(1) extract the rows whose id equals 1;
(2) extract only some of the fields;
(3) extract the rows whose content contains a given substring.
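Sketches of the corresponding queries (the column picked in (2) and the keyword in (3) are only examples):

SELECT * FROM urls WHERE id = 1;                     -- (1) rows whose id equals 1
SELECT url FROM urls;                                -- (2) only some of the fields
SELECT * FROM urls WHERE content LIKE '%crawler%';   -- (3) rows whose content contains a keyword (keyword is illustrative)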

  • Delete data

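A sketch of a DELETE statement restricted by a WHERE clause (the condition is illustrative):

DELETE FROM urls WHERE url = 'http://www.santostang.com/';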

⚠️ Note that if no WHERE clause is specified, DELETE FROM urls will delete every record in the urls table, that is, the whole table's data is wiped out by mistake.

  • Modify data

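A sketch of such an UPDATE statement (the new values are made up; the WHERE condition targets the row with id 2 mentioned below):

UPDATE urls SET url = 'http://www.santostang.com/', content = 'Hello again!' WHERE id = 2;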

Because id and created_time are filled in automatically by the database, the id of this row of data is 2, and only url and content need to be specified when modifying it.

More operations can be found in the Rookie Tutorial.

The following shows how to operate the MySQL database from Python. First, install the driver from the command line (MySQL-Python is for Python 2; mysqlclient is its Python 3 fork):

brew install mysql
export PATH=$PATH:/usr/local/mysql/bin
pip install MySQL-Python
pip3 install mysqlclient

If the commands finish without errors, mysqlclient has been installed successfully.

The specific code and explanation for operating MySQL from Python are as follows:

#coding=UTF-8
import MySQLdb
import requests
from bs4 import BeautifulSoup

#The connect () method is used to create a database connection, in which parameters can be specified: user name, password, host and other information
#This is just a connection to the database. To operate the database, you need to create a cursor
conn = MySQLdb.connect(host='localhost', user='root', passwd='your_password', db='MyScraping', charset='utf8')
#The cursor is created by the cursor () method under the conn database connection.
cur=conn.cursor()

link = 'http://www.santostang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')
title_list = soup.find_all('h1', class_='post-title')
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    #Insert the data by writing a pure SQL statement and running it through the cursor's execute() method
    cur.execute('INSERT INTO urls (url, content) VALUES (%s, %s)', (url, title))

cur.close()
conn.commit()
conn.close()

Finally, here is how to store data in a MongoDB database.

First, NoSQL broadly refers to non-relational databases, which offer very high read/write performance and do not store relationships between data items. MongoDB is one of the most popular NoSQL databases: it is a document-oriented database that stores data as JSON-like documents rather than in the tables of a relational system.

The following still uses the example above of crawling the article titles and URLs from the blog.

The first step is to create a MongoDB client, connect to the database blog_database, and then select the collection blog. If they do not exist, they are created automatically. The code example is as follows:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

The second step is to crawl all the article titles on the blog's home page and store them in the MongoDB database. The code is as follows:

import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

link = 'http://www.santostang.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')
title_list = soup.find_all('h1', class_='post-title')
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    post = {'url': url,
            'title': title,
            'date': datetime.datetime.utcnow()}
    collection.insert_one(post)

Focus on the last part of the code: the data fetched by the crawler is first stored in the post dictionary, and insert_one is then used to add it to the collection.

Finally, start MongoDB and view the results.

Open a terminal and enter:

sudo mongod --config /usr/local/etc/mongod.conf

After granting the required permissions, leave this terminal open and, in a new terminal, enter the following in turn:

mongod

mongo

If the mongo shell starts and connects without errors, the connection succeeded.

Then, enter:

use blog_database

db.blog.find().pretty()

This queries the documents stored in the collection and prints them.

Note that the results are stored and displayed in JSON format.

For more information about operating MongoDB from Python, refer to the PyMongo website.


This article is a learning record and summary based on Cui Qingcai's blog and Tang Song's "Python Web Crawler: From Introduction to Practice". The practice code written during the learning process has also been uploaded to GitHub.

Corrections of any shortcomings are welcome.