Crawling news site data with Python



The text and images in this article come from the Internet and are for learning and communication only, not for any commercial purpose. If you have any questions, please contact us promptly.


Basic development environment

  • Python 3.6
  • Pycharm
import parsel
import requests
import re

# Request headers used by every request below; the original post omitted the
# definition, so a minimal User-Agent is supplied here
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

Target web page analysis

Today we'll crawl the international news column of the news site.
Clicking "more news content" loads additional articles.

In the browser's network panel you can see the relevant data interface, which contains the news titles and the URL addresses of the news detail pages.

How to extract the URL addresses

1. Convert the response to JSON and read the values of the key-value pairs;
2. Match the URL addresses with a regular expression.

Both approaches work; choose whichever you prefer.
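Both approaches can be sketched against a simplified stand-in for the interface payload (the `docs` field name and the sample data below are assumptions for illustration, not the site's real response):

```python
import json
import re

# A simplified stand-in for the interface response body (assumed shape)
data = '{"docs":[{"title":"News A","url":"https://example.com/a.shtml"}]}'

# Method 1: parse as JSON and read the key-value pairs
urls_from_json = [item["url"] for item in json.loads(data)["docs"]]

# Method 2: match the URL values with a regular expression
urls_from_re = re.findall(r'"url":"(.*?)"', data)

print(urls_from_json)  # ['https://example.com/a.shtml']
print(urls_from_re)    # ['https://example.com/a.shtml']
```

The regex route is simpler when you only need the URLs; the JSON route is more robust if you later want the titles or other fields too.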

Paging is done by changing the pager parameter in the interface data link, which corresponds to the page number.
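Building the paginated interface URLs is then a matter of substituting the page number. The base address and parameter names below are placeholders, since the real interface URL is site-specific:

```python
# Placeholder interface URL; only the pager value changes per page
base = 'https://example.com/api/news?pager={}&pagenum=9&t=5_58'

# Generate the first three page URLs
page_urls = [base.format(page) for page in range(1, 4)]
for u in page_urls:
    print(u)
```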

On the detail page, you can see that the news content sits inside a div tag containing p tags. Parsing the page in the usual way yields the news content.
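The article's code does this with a parsel CSS selector; as a dependency-free illustration of the same idea, a regex over the p tags of a minimal stand-in page (the HTML below is assumed for illustration):

```python
import re

# A minimal stand-in for a news detail page (structure assumed)
html = '<div class="left_zw"><p>First paragraph.</p><p>Second paragraph.</p></div>'

# Pull the text of every <p> tag inside the content div
paragraphs = re.findall(r'<p>(.*?)</p>', html, re.S)
content = '\n'.join(paragraphs)
print(content)
```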

Save mode

1. Save as a TXT text file
2. Save as a PDF

I have previously covered crawling article content and saving it as PDF; see the links below for the relevant saving methods.

Python crawls the winning-bid documents from the Bibi network and saves them in PDF format

Python crawls CSDN blog posts and makes them into PDF files

This article uses the TXT text format for saving.

Summary of the overall crawling approach

  • On the column list page, click "more news content" to find the interface data URL
  • Match the URLs of the news detail pages from the data returned by the interface URL
  • Extract the news content using the usual site-parsing methods (re, CSS, XPath)
  • Save the data

Code implementation

  • Get the web page source code
def get_html(html_url):
    """
    Get the web page source code (response)
    :param html_url: web page URL address
    :return: response object
    """
    response = requests.get(url=html_url, headers=headers)
    return response
  • Get the URL address of each news article
def get_page_url(html_data):
    """
    Get the URL address of each news article
    :param html_data: response.text
    :return: list of news article URL addresses
    """
    page_url_list = re.findall('"url":"(.*?)"', html_data)
    return page_url_list
  • File names cannot contain special characters, so the news title needs to be sanitized
def file_name(name):
    """
    File names cannot carry special characters
    :param name: news title
    :return: title without special characters
    """
    replace = re.compile(r'[\\/:*?"<>|]')
    new_name = re.sub(replace, '_', name)
    return new_name
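As a quick check of the sanitizer, here it is re-stated as a standalone snippet (using the standard set of characters that are illegal in Windows file names):

```python
import re

def file_name(name):
    # Replace characters that are illegal in file names with underscores
    replace = re.compile(r'[\\/:*?"<>|]')
    return re.sub(replace, '_', name)

print(file_name('News: "Breaking"?'))  # News_ _Breaking__
```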
  • Save data
def download(content, title):
    """
    Save the news content as TXT with open()
    :param content: news content
    :param title: news title
    """
    path = 'news\\' + title + '.txt'
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(content)
        print('Saving', title)
  • Main function
def main(url):
    """
    Main function
    :param url: URL address of the news list page
    """
    html_data = get_html(url).text  # get the interface data response.text
    lis = get_page_url(html_data)  # get the list of news URL addresses
    for li in lis:
        page_data = get_html(li).content.decode('utf-8', 'ignore')  # news detail page
        selector = parsel.Selector(page_data)
        title = re.findall('<title>(.*?)</title>', page_data, re.S)[0]  # get the news title
        new_title = file_name(title)
        new_data = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
        content = ''.join(new_data)
        download(content, new_title)

if __name__ == '__main__':
    for page in range(1, 101):
        # the base address of the interface URL (before the '{}') is missing in the original post
        url_1 = '{}&pagenum=9&t=5_58'.format(page)
        main(url_1)

Screenshot of the running result

