A Python example of crawling rental data: a small case said to be a good way to get started with crawlers!

Time: 2019-8-12
1. What is a crawler?

A crawler, also known as a "web crawler", is a program that automatically accesses the Internet and downloads website content. It is also the foundation of search engines: Baidu and Google, for example, rely on powerful web crawlers to retrieve enormous amounts of information from the Internet, store it in the cloud, and provide high-quality search services to users.

2. What are crawlers used for?

You might ask: apart from working for a search-engine company, what is the point of learning to crawl? Good question. Let's take an example: company A runs a user forum where many users leave messages about their experience with the product. A now needs to understand user needs and analyse user preferences to prepare for the next product iteration. How does it get that data? It needs a crawler to collect it from the forum, of course. So besides Baidu and Google, many companies hire crawler engineers at high salaries. Search any recruitment website for "crawler engineer" and look at the number of openings and the salary range, and you will see how much demand there is for crawlers.

3. How crawlers work

Initiate a request: send a request to the target site over the HTTP protocol, then wait for the target site's server to respond.

Get the response content: if the server responds normally, we get a response. The response body is the content of the page we want; it may be HTML, a JSON string, binary data (such as images and videos), and so on.

Parse the content: if the content is HTML, it can be parsed with regular expressions or a web-parsing library; if it is JSON, it can be converted directly into a JSON object and parsed; if it is binary data, it can be saved or processed further.

Save the data: once parsing is complete, the data is saved, either as a text document or in a database.
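These four steps map directly onto a few lines of Python. Here is a minimal sketch, assuming the requests, BeautifulSoup and lxml libraries used later in this article are installed (example.com and result.txt are placeholders of my own, not part of the rental example):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'                        # placeholder target URL
response = requests.get(url)                        # 1. initiate the request
if response.status_code == 200:                     # 2. get the response content
    soup = BeautifulSoup(response.text, 'lxml')     # 3. parse the HTML
    title = soup.find('title').text
    with open('result.txt', 'w', encoding='utf-8') as f:
        f.write(title)                              # 4. save the data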

4. A Python crawler example

Now that the definition, uses and working principle of crawlers have been introduced, many of you are probably interested and ready to give it a try. So let's get to the good stuff and paste a simple Python crawler program directly:

1. Preliminary preparation: install the Python environment, the PyCharm IDE and a MySQL database; create a new database named exam, and inside exam create a table named house to store the crawl results [SQL statement: create table house(price varchar(88), unit varchar(88), area varchar(88));]

2. The goal of the crawler: crawl the price, unit and area from every listing linked on the home page of a rental site, then store the crawled results in the database.

3. Crawler source code:

import requests                  # request the content of a URL page
from bs4 import BeautifulSoup    # parse page elements
import pymysql                   # connect to the database
import time                      # time functions (sleep)
import lxml                      # parsing library (HTML/XML parsing, XPath support)

# get_page: fetch the content of a URL with requests.get, then wrap it
# in a format that BeautifulSoup can work with.
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup

# get_links: collect all rental-listing links on the list page.
def get_links(link_url):
    soup = get_page(link_url)
    links_div = soup.find_all('div', class_="pic-panel")
    links = [div.a.get('href') for div in links_div]
    return links

# get_house_info: extract the details of one listing page: price, unit, area, etc.
def get_house_info(house_url):
    soup = get_page(house_url)
    price = soup.find('span', class_='total').text
    unit = soup.find('span', class_='unit').text.strip()
    area = 'test'    # area field; we use a fixed test string here
    info = {
        'price': price,
        'unit': unit,
        'area': area
    }
    return info

# The database connection settings, written as a dictionary.
DataBase = {
    'host': '127.0.0.1',
    'database': 'exam',
    'user': 'root',
    'password': 'root',
    'charset': 'utf8mb4'
}

# Connect to the database.
def get_db(setting):
    return pymysql.connect(**setting)

# Insert one crawled record into the database.
def insert(db, house):
    values = "'{}'," * 2 + "'{}'"
    sql_values = values.format(house['price'], house['unit'], house['area'])
    sql = """
    insert into house(price, unit, area) values({})
    """.format(sql_values)
    cursor = db.cursor()
    cursor.execute(sql)
    db.commit()

# Main flow: 1. connect to the database  2. get the list of listing URLs
# 3. loop over the URLs and fetch each listing's details (price, etc.)
# 4. insert the records into the database one by one
db = get_db(DataBase)
links = get_links('https://bj.lianjia.com/zufang/')
for link in links:
    time.sleep(2)    # pause between requests
    house = get_house_info(link)
    insert(db, house)

 

First of all, to do a job well you must first sharpen your tools, and the same is true for writing a crawler in Python: during crawling we need to import various library files, and it is these libraries and their functions that do most of the crawler's work for us; we only need to call the relevant interface functions. The import format is: import <library name>. Note that the libraries must be installed first. In PyCharm this can be done by placing the cursor on the library name and using the Ctrl + Alt shortcut, or from the command line (pip install <library name>). If a library fails to install or is missing, the crawler program will certainly report errors later. In this code, the first five lines import the relevant libraries: requests is used to request the content of the URL page; BeautifulSoup is used to parse page elements; pymysql is used to connect to the database; time provides various time functions; lxml is a parsing library for HTML and XML files and also supports XPath parsing.
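As a quick sanity check (a small sketch of my own, not part of the original program), you can verify that all five libraries import cleanly before running the crawler:

# Verify that the required libraries are installed; if one is missing,
# install it from the command line, e.g. pip install requests beautifulsoup4 pymysql lxml
import importlib

for name in ('requests', 'bs4', 'pymysql', 'lxml'):
    try:
        importlib.import_module(name)
        print(name, 'is installed')
    except ImportError:
        print(name, 'is missing - install it with pip')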

Next, let's start from the main program at the end of the code and walk through the whole crawling process:

First, connect to the database through the get_db function. Looking inside get_db, you can see that the connection is made by calling pymysql's connect function. Here **setting is Python's way of unpacking keyword arguments: we write the database connection information into a dictionary called DataBase and pass the items of that dictionary to connect as keyword arguments.
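The **setting syntax is ordinary keyword-argument unpacking and has nothing to do with pymysql itself. A minimal standalone sketch (the connect function below is a stand-in I made up, not pymysql's):

# A stand-in connect function to illustrate ** unpacking
def connect(host, database, user, password, charset):
    return 'connecting to {}/{} as {}'.format(host, database, user)

settings = {
    'host': '127.0.0.1',
    'database': 'exam',
    'user': 'root',
    'password': 'root',
    'charset': 'utf8mb4',
}

# connect(**settings) is equivalent to writing out every keyword argument:
# connect(host='127.0.0.1', database='exam', user='root', password='root', charset='utf8mb4')
print(connect(**settings))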

Next, get the links of all the listings on the home page of the Lianjia rental site through the get_links function; all of them are collected in the list links. get_links first fetches the content of the home page with requests, then organises it into a format BeautifulSoup can work with. It then finds all the div elements carrying the picture-panel style with the find_all function, and a list comprehension extracts the hyperlink contained in each of those divs (that is, the href attribute of the a tag). All the hyperlinks are stored in the list links.
Then iterate over all the links with a for loop (for example, one of them is https://bj.lianjia.com/zufang/101101101570737.html).
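To make the find_all step in get_links concrete, here is a self-contained sketch against a made-up HTML fragment that mimics the structure of the list page (the real page markup may of course differ):

from bs4 import BeautifulSoup

# A tiny, invented fragment: two picture panels, each wrapping a listing link
html = """
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/111.html"><img/></a></div>
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/222.html"><img/></a></div>
"""
soup = BeautifulSoup(html, 'lxml')
links_div = soup.find_all('div', class_='pic-panel')   # every picture container
links = [div.a.get('href') for div in links_div]       # the href of the <a> inside each div
print(links)
# ['https://bj.lianjia.com/zufang/111.html', 'https://bj.lianjia.com/zufang/222.html']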

For each link, use the same approach as in get_links: locate the elements with the find function to get the price, unit and area information of that listing and write them into a dictionary named info.
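Again purely as an illustration, a made-up fragment standing in for one listing page shows how find picks out the price and unit (the class names follow get_house_info; the values are invented):

from bs4 import BeautifulSoup

html = '<span class="total">3500</span><span class="unit">yuan/month</span>'
soup = BeautifulSoup(html, 'lxml')
price = soup.find('span', class_='total').text          # '3500'
unit = soup.find('span', class_='unit').text.strip()    # 'yuan/month'
info = {'price': price, 'unit': unit, 'area': 'test'}   # area is the fixed test string
print(info)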

Finally, call the insert function to write the info from one link into the house table of the database. Looking inside insert, we can see that it executes an SQL statement through the database cursor obtained with cursor(), and then commits the transaction with db.commit(). The SQL statement is built in a particular way: the format function is used to fill in the values, which makes the function easy to reuse.
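One design note: building the SQL string with format works for this demo, but pymysql can also substitute the parameters itself, which avoids quoting problems with the scraped text. A sketch of that variant (same table, same data):

# A variant of insert() that lets pymysql substitute the parameters
def insert_safe(db, house):
    sql = "insert into house(price, unit, area) values(%s, %s, %s)"
    cursor = db.cursor()
    cursor.execute(sql, (house['price'], house['unit'], house['area']))
    db.commit()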

Finally, run the crawler code, and you can see that the information of every listing on the Lianjia home page has been written into the database. (Note: 'test' is the test string I set manually for the area field.)
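If you want to confirm the result from Python rather than from a MySQL client, a quick read-back sketch (reusing the same connection settings as above) could look like this:

import pymysql

# Read back what the crawler inserted into the house table
db = pymysql.connect(host='127.0.0.1', database='exam', user='root',
                     password='root', charset='utf8mb4')
cursor = db.cursor()
cursor.execute("select price, unit, area from house")
for row in cursor.fetchall():
    print(row)
db.close()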

 

[Screenshot: the crawl results stored in the house table of the database]

 

Postscript: Python crawlers are really not difficult. Once you are familiar with the whole crawling process, what remains is a number of details that need attention, such as how to get page elements, how to build SQL statements, and so on. Don't panic when you hit a problem: read the IDE's hints, eliminate the bugs one by one, and you will eventually get the result we expect.
