Python crawler for beginners: crawling Movie Paradise data

Time: 2022-1-7

The text and images in this article come from the internet and are for learning and exchange only, with no commercial purpose. If you have any concerns, please contact us so we can address them promptly.

The following article comes from IT Sharing Home, by IT Sharer.

[I. Project background]

I believe everyone has run into this headache: downloading movies is tedious. You have to download them one by one, and there is no easy way to see which movies have been updated recently.

Today we take Movie Paradise as an example and show a more direct way to browse your favorite movies and get their download links.

[II. Project preparation]

First, we need to install PyCharm. For installation, see this tutorial: "Python environment setup: a detailed beginner's tutorial on installing Python and PyCharm".

Movie Paradise website:

https://www.ygdy8.net/html/gndy/dyzz/list_23_1.html

How do we install the libraries we need? First open PyCharm, click File, and then click Settings.

The Settings window appears. Under Project: (your project name), click Project Interpreter, then click the plus sign to install the library we need, requests, as shown in the following figure. The time and re modules ship with Python's standard library and need no installation.

If the interpreter fails to load, you can refer to this tutorial: "A simple tutorial on how to configure the Python interpreter after installing PyCharm".

If a library is still missing, you can download and install it as follows.
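
As a rough check, assuming you run it with the same interpreter PyCharm uses for the project, the sketch below reports which of the required modules cannot be found. The helper name missing_modules is made up for this illustration; installing with pip from a terminal is the usual alternative to the PyCharm dialog.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # requests is third-party; time and re ship with Python.
    missing = missing_modules(["requests", "time", "re"])
    if missing:
        print("Install with: python -m pip install " + " ".join(missing))
    else:
        print("All required modules are available.")
```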

[III. Project implementation]

We need requests, time, and re, as shown in the figure below.

Use encapsulation to implement each part as a method. First, write a skeleton: define a class filmsky, give it an __init__ method that takes self, and then define a main method (main). Finally, call the main method. The code is as follows:
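
A minimal sketch of such a skeleton is given here. The original listing is not reproduced in this text, so the class name spelling (FilmSky), the delay attribute, and its default of one second are assumptions for illustration only.

```python
import time


class FilmSky:
    """Skeleton of the crawler; later steps fill in the methods."""

    def __init__(self, delay=1.0):
        # Seconds to wait between requests (an assumed value, tune as needed).
        self.delay = delay

    def main(self):
        # The crawl loop will go here; pausing keeps the request rate low.
        time.sleep(self.delay)


if __name__ == "__main__":
    spider = FilmSky()
    spider.main()
```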

The time module is used to add a delay between requests, which helps avoid triggering the site's anti-crawling measures.

First, let's analyze how the URL changes from one page to the next on this website.

Clicking through three pages shows that only the number at the end of the address changes: list_23_3, list_23_4, list_23_5.

We can use {} in place of the changing value, like this:

https://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html

In this way, we initialize the URL address and construct the request header in the __init__ method.
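
The initialization step might look like the sketch below. The URL template is the one given above; the attribute names url and headers and the User-Agent string are illustrative assumptions, not the original code.

```python
class FilmSky:
    def __init__(self):
        # {} marks the page number, the only part that changes between pages.
        self.url = "https://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html"
        # A browser-like User-Agent; the exact string here is only an example.
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }


spider = FilmSky()
print(spider.url.format(3))
# prints https://www.ygdy8.net/html/gndy/dyzz/list_23_3.html
```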

In the main method, use a for loop to traverse the page URLs.
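
As a small sketch of that loop, written as a standalone function for brevity: the function name page_urls and the page range 1 to 5 are assumptions chosen for the example.

```python
URL_TEMPLATE = "https://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html"


def page_urls(first_page, last_page):
    """Build the listing-page URLs for first_page..last_page, inclusive."""
    return [URL_TEMPLATE.format(page) for page in range(first_page, last_page + 1)]


for url in page_urls(1, 5):
    print(url)
```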

The following results are obtained:

That means you're halfway there. Keep going!

Now we need to send requests to these URLs. To keep the code easy to follow, we wrap this step in its own method of the class.

We use requests to send the requests. This website is encoded in GBK. How can you tell a website's encoding?

Open the site, right-click, and inspect the meta tag in the page head. For this website, you can see charset="gb2312".

GB2312 is the encoding. GBK is a superset of GB2312, so decoding the page as GBK works. The two encodings you will meet most often are UTF-8 and GBK.

We can verify that the request really succeeded: if print(html) shows a complete HTML page, the request was successful.
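
The request and decoding steps can be sketched roughly as follows. This is a guess at how such code might look, not the original listing: the function names decode_page and get_html, the 10-second timeout, and errors="ignore" are illustrative choices. requests.get and response.content are part of the requests library.

```python
def decode_page(raw, encoding="gbk"):
    """Decode raw response bytes; GBK also covers pages declared as GB2312."""
    # errors="ignore" drops bytes that are not valid GBK instead of raising.
    return raw.decode(encoding, errors="ignore")


def get_html(url, headers):
    """Download one page and return its decoded HTML text."""
    import requests  # third-party; imported here so decode_page works without it

    response = requests.get(url, headers=headers, timeout=10)
    return decode_page(response.content)


# Offline check of the decoding step with bytes encoded as GB2312.
sample = "电影天堂".encode("gb2312")
print(decode_page(sample))
```

Calling print(get_html(url, headers)) on a listing URL is the verification described above: a full HTML page in the output means the request worked.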

We define another method to parse the page source.

We use regular expressions to parse the data. Right-click and inspect the page: the address we want is the href of an a tag nested inside the table tag.

So we can locate the table first and then work inward, layer by layer. See the following figure.

In a regular expression, (.*?) captures the part you want, while a plain .*? skips over the tags in between until you reach the layer you want. Use a for loop to get each URL. Each of these URLs leads to a second-level page, which we also need to request and parse.
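
To make the (.*?) idea concrete, here is a small sketch run against a made-up listing fragment. The sample HTML, its class names, and the function name parse_listing are assumptions for illustration; the real page markup may differ, so the pattern would need adjusting.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.ygdy8.net/html/gndy/dyzz/list_23_1.html"

# Hypothetical listing markup, modelled on the description in the text.
sample_html = """
<table class="tbspan">
<tr><td><b><a href="/html/gndy/dyzz/20220101/1.html" class="ulink">Movie A</a></b></td></tr>
</table>
<table class="tbspan">
<tr><td><b><a href="/html/gndy/dyzz/20220102/2.html" class="ulink">Movie B</a></b></td></tr>
</table>
"""

# (.*?) captures what we want; a plain .*? skips the markup in between.
# re.S lets "." match newlines so one pattern can span several lines.
LISTING_PATTERN = re.compile(r'<table.*?<a href="(.*?)".*?>(.*?)</a>', re.S)


def parse_listing(html):
    """Return (absolute detail-page URL, title) pairs found in a listing page."""
    return [
        (urljoin(BASE_URL, href), title)
        for href, title in LISTING_PATTERN.findall(html)
    ]


for detail_url, title in parse_listing(sample_html):
    print(title, detail_url)
```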

Because some of the links on the site are empty, the movie names and download links can end up mismatched. So we add a check: if the number of download links found is greater than 0, use the link as usual; otherwise assign an empty value so that nothing breaks. Finally, the result is returned, as shown in the figure below.
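
The check itself is short. A sketch, assuming the regular expression returned a list of matches (the function name pick_link and the example link are made up):

```python
def pick_link(links):
    """Return the first download link, or an empty string if none was found."""
    if len(links) > 0:
        return links[0]
    return ""


print(repr(pick_link(["ftp://example.com/movie.mkv"])))  # 'ftp://example.com/movie.mkv'
print(repr(pick_link([])))  # ''
```

Returning an empty string keeps one entry per movie, so names and links stay aligned even when a page has no link.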

Open a second-level page and right-click the download link to inspect it, as shown in the figure below:

We use a regular expression to extract the download link address, as shown in the following figure:
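
Since the original pattern is not reproduced here, the following is only a sketch against an invented detail-page fragment. The td attributes, the magnet link, and the function name find_download_links are all assumptions; inspect the real page and adapt the pattern before relying on it.

```python
import re

# Hypothetical detail-page fragment; the real markup may differ.
sample_detail = """
<td bgcolor="#fdfddf"><a href="magnet:?xt=urn:btih:0123456789abcdef">magnet link</a></td>
"""

# Anchor on the enclosing cell, skip ahead with .*?, capture the href with (.*?).
DOWNLOAD_PATTERN = re.compile(r'<td bgcolor="#fdfddf">.*?<a href="(.*?)"', re.S)


def find_download_links(html):
    """Return every href found inside a matching download cell."""
    return DOWNLOAD_PATTERN.findall(html)


print(find_download_links(sample_detail))
```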

The output doesn't look very tidy. Let's clean up the link, as shown in the figure below:

The results are as follows:

Finally, we save the data in a dictionary with download links and movie names:
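
One possible shape for that data, sketched under assumptions: the text does not say how the dictionary is laid out, so the keys "name" and "link", the one-dictionary-per-movie layout, and the function name build_records are illustrative.

```python
def build_records(names, links):
    """Pair each movie name with its download link, one dictionary per movie."""
    return [{"name": name, "link": link} for name, link in zip(names, links)]


names = ["Movie A", "Movie B"]
links = ["magnet:?xt=urn:btih:aaaa", ""]  # "" marks a movie with no link found

for record in build_records(names, links):
    print(record)
```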

Finally, we tidy up the request code, which is a little repetitive.

Store the request headers in one attribute and wrap the request in a single method. From then on, every request just calls this method, as shown in the following figure:

After the program runs, you can see the output, as shown in the following figure:

Click the blue link to download (installing Xunlei is recommended, since Xunlei downloads are faster).

Now you can see the movies you want at a glance. Click to download!

[V. Summary]

1. Using Python web crawling, this article provides a more direct way to browse your favorite movies and download them conveniently.

2. Don't crawl too much at once; it can easily put a heavy load on the server.
