Getting started with web scraping in Python: an effective way to extract data for data science projects

Date: 2021-09-16

Author: Lakhay Arora
Translated by: Flin
Source: Analytics Vidhya

Overview

  • Web scraping is an efficient way to extract data from a website (subject to the website's rules and terms)

  • Learn how to scrape web pages in Python using the popular BeautifulSoup library

  • We will cover the different types of data that can be scraped, such as text and images

Introduction

We have too little data to build a machine learning model. We need more data!

If this sentence sounds familiar, you are not alone! Wanting more data to train our machine learning models is a problem that has long plagued data scientists. After all, we can't always get Excel or CSV files ready to use directly in our data science projects, can we?

So how do we deal with the problem of not having enough data?

One of the most effective and simplest ways to get more data is web scraping. I personally find web scraping a very useful technique for collecting data from multiple websites. Some websites today also provide APIs for many of the different types of data you might want, such as tweets or LinkedIn posts.

But sometimes you may need to collect data from a website that does not provide a specific API. This is where web scraping comes in handy. As a data scientist, you can write a simple Python script and extract the data you need.

Therefore, in this article we will learn about the different components of web scraping and then dive into Python to see how to scrape web pages using the popular and efficient BeautifulSoup library.

We have also created a free course based on this article.

Please note that there are many guidelines and rules to follow when scraping the web. Not every website allows users to scrape its content, so there are legal restrictions. Always read a site's terms and conditions before attempting to scrape it.

Table of Contents

  1. Three popular tools and libraries for web scraping in Python

  2. Components of web scraping

    1. Crawl
    2. Parse and Transform
    3. Store
  3. Scraping URLs and email IDs from web pages

  4. Scraping images

  5. Scraping data from lazy-loaded pages

Three popular tools and libraries for web scraping in Python

You will come across several libraries and frameworks for web scraping in Python. Here are three popular ones that get the job done efficiently:

BeautifulSoup

  • BeautifulSoup is an excellent parsing library in Python that can be used to scrape data from HTML and XML documents.

  • BeautifulSoup automatically detects encodings and gracefully handles HTML documents, even ones with special characters. We can navigate the parsed document and find what we need, which makes extracting data from web pages fast and easy. In this article, we will learn in detail how to build a web scraper using BeautifulSoup; a tiny example of its parsing API follows below.
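As a quick illustration, the snippet below parses a small, made-up HTML string and pulls out the text of a tag:

from bs4 import BeautifulSoup

# a tiny, made-up HTML document, just for illustration
html = "<html><body><p class='name'>Hotel Snowview</p></body></html>"

# parse the document and extract the text of the first <p> tag
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('p').text)   # prints: Hotel Snowview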

Scrapy

  • Scrapy is a fast, open-source web crawling framework for Python, well suited to larger projects that need to follow links across many pages and extract structured data.

Selenium

  • Selenium is a tool for automating web browsers. It is handy for scraping pages that render their content with JavaScript, since it lets you load the page in a real browser before extracting the data.
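As a rough sketch of that idea (assuming Selenium and a matching browser driver are installed; this is separate from the tutorial's main BeautifulSoup workflow):

# a minimal sketch: load a page in a real browser and grab the rendered HTML
from selenium import webdriver

driver = webdriver.Chrome()     # requires a matching chromedriver on your system
driver.get("https://www.goibibo.com/hotels/hotels-in-shimla-ct/")
html = driver.page_source       # the HTML after JavaScript has run
driver.quit()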

Components of web scraping

Web scraping can be broken down into three main components: crawl, parse and transform, and store.

Let's look at each of these components in detail. To do so, we will scrape hotel details, such as the hotel name and price per room, from the Goibibo website:

Note: always respect the target site's robots.txt file, also known as the robots exclusion protocol. It tells web crawlers which pages they should not crawl.
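If you want to check this programmatically, Python's standard library ships a robots.txt parser; a minimal sketch of one way to do it:

from urllib.robotparser import RobotFileParser

# read the site's robots.txt and ask whether a given URL may be fetched
rp = RobotFileParser()
rp.set_url("https://www.goibibo.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"))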

In this case, we are allowed to scrape data from the target URL, so we can go ahead and write the script for our web bot. Let's get started!

Step 1: crawl

The first step of web scraping is to navigate to the target website and download the source code of the web page. We will use the requests library to do this; http.client and urllib2 are two other libraries that can make requests and download source code.

After downloading the source code of the web page, we need to filter out the content we want:

"""
Web Scraping - Beautiful Soup
"""

# importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# target URL to scrape
url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

# headers
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
    }

# send request to download the data
response = requests.request("GET", url, headers=headers)

# parse the downloaded data
data = BeautifulSoup(response.text, 'html.parser')
print(data)

Step 2: parse and transform

The next step is to parse this data with an HTML parser, which is what we use the BeautifulSoup library for. Now, if you look at our target page, as on most pages, the details of each hotel sit on a separate card.

Therefore, the next step is to filter out the card data from the complete source code. To do this, right-click a card and choose the "Inspect element" option to see the source code of that specific card.

All the cards share the same class name, so we can get a list of them by passing the tag name and its attributes, such as the class of the <div> tag shown below:

# find all the sections with the specified class name
cards_data = data.find_all('div', attrs={'class': 'width100 fl htlListSeo hotel-tile-srp-container hotel-tile-srp-container-template new-htl-design-tile-main-block'})

# total number of cards
print('Total Number of Cards Found : ', len(cards_data))

# source code of hotel cards
for card in cards_data:
    print(card)

We have now filtered the card data out of the page's full source code. Each card here contains the information for one hotel. Select just the hotel name, perform the "Inspect element" step on it, and do the same for the room price:

Now, for each card, we have to find the hotel name, which can only be extracted from the <p> tag. This is because each card has only one <p> tag for the name, while the room price sits in an <li> tag with its own class name:

# extract the hotel name and price per room
for card in cards_data:

    # get the hotel name
    hotel_name = card.find('p')

    # get the room price
    room_price = card.find('li', attrs={'class': 'htl-tile-discount-prc'})
    print(hotel_name.text, room_price.text)
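Note that find() returns None when a tag is missing, so on a real page it is safer to guard these lookups. A small defensive variant of the loop above:

# skip cards that are missing either field instead of raising an AttributeError
for card in cards_data:
    hotel_name = card.find('p')
    room_price = card.find('li', attrs={'class': 'htl-tile-discount-prc'})

    if hotel_name is None or room_price is None:
        continue
    print(hotel_name.text.strip(), room_price.text.strip())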

Step 3: store data

The final step is to store the extracted data in a CSV file. For each card, we will extract the hotel name and price, store them in a Python dictionary, and then append the dictionary to a list.

Next, let's convert this list into a pandas DataFrame, since that lets us easily write the data out as a CSV or JSON file:

# create a list to store the data
scraped_data = []

for card in cards_data:

    # initialize the dictionary
    card_details = {}

    # get the hotel name
    hotel_name = card.find('p')

    # get the room price
    room_price = card.find('li', attrs={'class': 'htl-tile-discount-prc'})

    # add data to the dictionary
    card_details['hotel_name'] = hotel_name.text
    card_details['room_price'] = room_price.text

    # append the scraped data to the list
    scraped_data.append(card_details)

# create a data frame from the list of dictionaries
dataFrame = pd.DataFrame.from_dict(scraped_data)

# save the scraped data as CSV file
dataFrame.to_csv('hotels_data.csv', index=False)
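Since the data is already in a pandas DataFrame, saving it as JSON instead is a one-liner:

# save the scraped data as a JSON file (one record per hotel)
dataFrame.to_json('hotels_data.json', orient='records')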

Congratulations! We have successfully built a basic web scraper. I encourage you to repeat these steps and try to extract more data, such as each hotel's rating and address. Now let's look at how to perform some common tasks, such as scraping URLs, email IDs, and images, and scraping data as a page loads.

Scraping URLs and email IDs from web pages

The two things we most often try to scrape from the web are website URLs and email IDs. I'm sure you have worked on projects or challenges that required extracting email IDs in bulk. So let's see how to scrape them in Python.

Using the web browser console

Suppose we want to keep track of our Instagram followers and find out the usernames of people who unfollow our account. First, log in to your Instagram account and click on followers to see the list:

  • Scroll all the way down so that all the usernames are loaded into the browser's memory

  • Right-click in the browser window and click Inspect element

  • In the console window, type the following command:

urls = $$('a'); for (url in urls) console.log(urls[url].href);

With just one line of code, we can find all the URLs present on that particular page.

  • Next, save this list at two different timestamps; a simple Python program (see the sketch after this list) will tell you the difference between the two, and you will know which usernames have unfollowed your account!

  • There are several ways to simplify this task. The main idea is that, with just one line of code, we can get all the URLs at once.
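As a minimal sketch of that comparison, assuming you saved the console output into two text files (the file names here are placeholders):

# compare two snapshots of follower profile URLs taken at different times
# 'followers_old.txt' and 'followers_new.txt' are hypothetical file names
with open('followers_old.txt') as f:
    old_followers = set(line.strip() for line in f)

with open('followers_new.txt') as f:
    new_followers = set(line.strip() for line in f)

# URLs present in the old snapshot but missing from the new one
print(old_followers - new_followers)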

Using the Chrome extension Email Extractor

Email Extractor is a Chrome extension that captures the email IDs that appear on the page we are currently browsing.

It even lets us download the list of email IDs as a CSV or text file.

BeautifulSoup and regular expressions

The above solutions only work if we want to scrape data from a single page. But what if we want to perform the same steps across multiple pages?

There are many websites that can do this for us. But here's the good news: we can also write our own web scraper in Python! Let's see how in the sketch below.
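Here is a minimal sketch of such a scraper, using a regular expression that matches most common email formats; the URL list is a placeholder you would replace with the pages you actually want to scan:

import re
import requests
from bs4 import BeautifulSoup

# placeholder list of pages to scan; replace with your own URLs
urls = ["https://example.com/contact", "https://example.com/about"]

# a simple pattern that matches most common email addresses
email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")

found_emails = set()
for url in urls:
    response = requests.get(url)
    text = BeautifulSoup(response.text, 'html.parser').get_text()
    found_emails.update(email_pattern.findall(text))

print(found_emails)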

Scraping images in Python

In this section, we will scrape all the images from the same Goibibo page. The first step, as before, is to navigate to the target website and download the source code. Next, we find all the images using the <img> tag:

"""
Web Scraping - Scrape Images
"""

# importing required libraries
import requests
from bs4 import BeautifulSoup

# target URL
url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
    }

response = requests.request("GET", url, headers=headers)

data = BeautifulSoup(response.text, 'html.parser')

# find all image tags that have a src attribute
images = data.find_all('img', src=True)

print('Number of Images: ', len(images))

for image in images:
    print(image)

From all the image tags, select only the src attribute. Also note that the hotel images are in JPG format, so we will keep only those:

# select the src attribute of each image
image_src = [x['src'] for x in images]

# select only jpg format images
image_src = [x for x in image_src if x.endswith('.jpg')]

for image in image_src:
    print(image)

Now that we have a list of image URLs, all we have to do is request each image's content and write it to a file. Make sure to open the file in 'wb' (write binary) mode:

image_count = 1
for image in image_src:
    with open('image_'+str(image_count)+'.jpg', 'wb') as f:
        res = requests.get(image)
        f.write(res.content)
    image_count = image_count+1

You can also update the initial page URL by page number and request the pages repeatedly to collect a large amount of data.
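A minimal sketch of that idea; note that the 'page' query parameter below is an assumption made for illustration, not Goibibo's actual URL scheme:

# generate page URLs by changing only the page number
# the 'page' query parameter is a hypothetical example, not the site's real scheme
base_url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

for page_number in range(1, 6):
    paged_url = base_url + "?page=" + str(page_number)
    response = requests.get(paged_url, headers=headers)
    # ... parse each response with BeautifulSoup exactly as shown above ...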

Scraping data from lazy-loaded pages

Let's take a look at the Steam Community Grand Theft Auto V reviews page. You will notice that the page's full content does not load in one go.

We need to scroll down to load more of the page's content. This is an optimization technique called "lazy loading" used by the website's back-end developers.

The problem for us is that when we try to scrape data from this page, we will only get the limited content that has already loaded.

Some websites use a "load more" button instead of endless scrolling, loading additional content only when the button is clicked. Either way, the problem of limited content remains. So let's see how to scrape such pages.

Navigate to the target URL and open the Network tab of the Inspect element window. Then click the reload button, which records the sequence of network activity, such as image loads, API requests, POST requests, and so on.

Clear the current log and scroll down. You will notice that as you scroll, the page sends requests for more data.


Scroll further and you will see how the site makes these requests. Only some parameter values in the request URLs change, so you can easily generate them with simple Python code.

You then follow the same steps as before to scrape and store the data, sending a request for each page one by one.
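A minimal sketch of that loop; the endpoint and the 'offset' parameter below are placeholders for whatever request you actually observe in the Network tab, not the exact request the Steam page makes:

# request lazy-loaded content chunk by chunk
# the URL template and 'offset' parameter are placeholders; copy the real
# request you see in your browser's Network tab
for offset in range(0, 100, 10):
    page_url = "https://example.com/reviews?offset=" + str(offset)
    response = requests.get(page_url, headers=headers)
    page_data = BeautifulSoup(response.text, 'html.parser')
    # ... extract and store the reviews from page_data ...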

Endnote

This was a simple, beginner-friendly introduction to web scraping in Python using the powerful BeautifulSoup library. Honestly, I have found web scraping very useful whenever I'm looking for a new project or need information for an existing one.

Note: if you want to learn this material in a more structured form, we have a free course in which we teach web scraping with BeautifulSoup. You can check it out here: An Introduction to Web Scraping Using Python.

As mentioned earlier, there are other libraries that can be used for web scraping. I would love to hear about your favorite library (even if you use R!) and your experience with the topic. Let me know in the comments section below and we will get in touch!

Original link: https://www.analyticsvidhya.com/blog/2019/10/web-scraping-hands-on-introduction-python/
