How to automatically download high-score movies with Python 3.x + Xunlei X

Time:2021-2-22

Chinese New Year is coming. What is everyone busy with? At the end of the year the whole company is scrambling for train tickets and stocking up on new year's goods, and once the holiday mood kicks in nobody feels like working. Heart already home = low output = ten bugs per line of code = boredom. So I thought of the Python I had picked up a while back. I like watching movies, and manually clicking into each movie's detail page to download them one by one is tedious, so why not write an automatic movie download tool in Python? Suddenly things are not so boring. Back before all these XX memberships existed, people went to a certain "Paradise" site for movie resources, and most of what you wanted to watch was there. That settles it: let's crawl it!

I used to crawl quite a few websites while playing with Python, all of it at the office (Python is not in my company's line of business; it was purely for fun). The colleague in charge of operations came over every day saying: what are you crawling now? Go read the news, so-and-so just got caught scraping again! You'll be held responsible if something goes wrong! That scared me enough to stop for a while. This post crawls the resources of a certain Paradise site (which Paradise will be clear from the code below). Will I get caught? It should be fine as long as we only discuss the technique, practice a bit, and don't use it commercially, right? My hands are trembling a little as I write this.

Here goes nothing: if I don't go to hell, who will? First, let's look at the final effect.

As shown above, this download tool has a GUI (impressive, right?). Enter a root address and a minimum movie score, and it crawls movies automatically. To build this tool you need the following knowledge:

  • Installation and use of PyCharm. Not much to explain here; fellow code apes will know it, and if you're not one a blog post can't cover it anyway. It's just an IDE.
  • tkinter. A GUI development library that ships with Python; the humble interface in the screenshot is built on Tk. You can drop the interface entirely; it does not affect the crawling at all, and honestly the UI is allowed to be a bit ugly. The main reason it's here is that I wanted to learn something new.
  • Analysis of static web pages. Compared with crawling dynamic sites, crawling a static site is a piece of cake: press F12, right-click to view the page source, work out the page's layout rules from these simple inspections, then write the crawler against those rules. So easy. (A short warm-up sketch follows this list.)
  • Data persistence. If you don't want to re-download a movie you have already downloaded, store the downloaded links and compare against them before each download to filter out repeats.
  • Download and installation of Xunlei X. Needless to say: what promising young person of today hasn't used Xunlei? Whose hard drive doesn't hold a stash of "action movies"?
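
Before the full code, here is a tiny warm-up sketch of the static-page analysis mentioned above: fetch a list page with requests, decode it (the site serves gb2312-encoded pages, the same encoding the spider below uses), and print every link BeautifulSoup finds. The URL is the category entry used later in the article; treat this as an illustration, not part of the tool.

import requests
from bs4 import BeautifulSoup

def list_links(url):
  #Fetch the page; a short timeout keeps a dead site from hanging the script
  resp = requests.get(url, timeout=10)
  resp.raise_for_status()
  #The site is encoded in gb2312, not utf-8
  doc = BeautifulSoup(resp.content.decode('gb2312', 'ignore'), 'lxml')
  for a in doc.find_all('a'):
    href = a.get('href')
    if href:
      print(a.text.strip() + ' -> ' + href)

if __name__ == '__main__':
  list_links('https://www.dytt8.net/html/gndy/dyzz/index.html')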

That's about it. As for the technical details of the implementation, there are not many: the use of requests + BeautifulSoup, the re regular-expression module, Python data types, Python threads, persistence libraries such as dbm and pickle, and so on. This tool covers roughly that range of knowledge. Of course Python is object-oriented, and programming ideas are common to all languages; none of this comes overnight, nor can it all be spelled out in words. Check yourself against the list above and study whatever you're missing on your own; meanwhile I'll paste the code directly.

Speaking of learning Python, let me say a few more words. Back when I was learning Python crawlers, the articles by @craftsman Ruoshui (https://blog.csdn.net/yanbober) helped me a lot. They are genuinely good for people who have programming experience but have never touched Python; with them you can get a small project going quite quickly. Code:

import url_manager
import html_parser
import html_download
import persist_util
from tkinter import *
from threading import Thread
import os
 
class SpiderMain(object):
  def __init__(self):
    self.mUrlManager = url_manager.UrlManager()
    self.mHtmlParser = html_parser.HtmlParser()
    self.mHtmlDownload = html_download.HtmlDownload()
    self.mPersist = persist_util.PersistUtil()
 
  #Load history download link
  def load_history(self):
    history_download_links = self.mPersist.load_history_links()
    if history_download_links is not None and len(history_download_links) > 0:
      for download_link in history_download_links:
        self.mUrlManager.add_download_url(download_link)
        d_ Log ("load history download link): + download_ link)
 
  #Save history download link
  def save_history(self):
    history_download_links = self.mUrlManager.get_download_url()
    if history_download_links is not None and len(history_download_links) > 0:
      self.mPersist.save_history_links(history_download_links)
 
  def craw_movie_links(self, root_url, score=8):
    count = 0
    self.mUrlManager.add_url(root_url)
    while self.mUrlManager.has_continue():
      try:
        count = count + 1
        url = self.mUrlManager.get_url()
        d_log("craw %d : %s" % (count, url))
        headers = {
          'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
          'Referer': url
        }
        content = self.mHtmlDownload.down_html(url, retry_count=3, headers=headers)
        if content is not None:
          doc = content.decode('gb2312', 'ignore')
          movie_urls, next_link = self.mHtmlParser.parser_movie_link(doc)
          if movie_urls is not None and len(movie_urls) > 0:
            for movie_url in movie_urls:
              d_log('movie info url: ' + movie_url)
              content = self.mHtmlDownload.down_html(movie_url, retry_count=3, headers=headers)
              if content is not None:
                doc = content.decode('gb2312', 'ignore')
                movie_name, movie_score, movie_xunlei_links = self.mHtmlParser.parser_movie_info(doc, score=score)
                if movie_xunlei_links is not None and len(movie_xunlei_links) > 0:
                  for xunlei_link in movie_xunlei_links:
                    #Determine whether the movie has been downloaded
                    is_download = self.mUrlManager.has_download(xunlei_link)
                    if not is_download:
                      #Movies that haven't been downloaded yet are handed to Xunlei's download list
                      d_log('Start downloading ' + movie_name + ', link address: ' + xunlei_link)
                      self.mUrlManager.add_download_url(xunlei_link)
                      #Full path of Thunder.exe inside the Xunlei install directory; adjust to your own
                      os.system(r'"D:\Thunder\Program\Thunder.exe" {url}'.format(url=xunlei_link))
                      #Every time a movie is downloaded, the database will be updated in real time, so that even if the program exits abnormally, the movie will not be downloaded repeatedly
                      self.save_history()
          if next_link is not None:
            d_log('next link: ' + next_link)
            self.mUrlManager.add_url(next_link)
      except Exception as e:
        d_log('error message: ' + str(e))
 
 
def runner(rootLink=None, scoreLimit=None):
  if rootLink is None:
    return
  spider = SpiderMain()
  spider.load_history()
  if scoreLimit is None:
    spider.craw_movie_links(rootLink)
  else:
    spider.craw_movie_links(rootLink, score=float(scoreLimit))
  spider.save_history()
 
# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/list_23_207.html'
def start(rootLink, scoreLimit):
  loop_thread = Thread(target=runner, args=(rootLink, scoreLimit,), name='LOOP THREAD')
  #loop_thread.setDaemon(True)
  loop_thread.start()
  #Deliberately no loop_thread.join(): the main thread must not wait, otherwise the GUI freezes
  btn_start.configure(state='disable')
 
#Refresh GUI interface, text scrolling effect
def d_log(log):
  s = log + '\n'
  txt.insert(END, s)
  txt.see(END)
 
if __name__ == "__main__":
  rootGUI = Tk()
  rootGUI.title('XX movie auto download tool')
  #Set form background color
  black_background = '#000000'
  rootGUI.configure(background=black_background)
  #Get the screen width and height
  screen_w, screen_h = rootGUI.maxsize()
  #Center form
  window_x = (screen_w - 640) / 2
  window_y = (screen_h - 480) / 2
  window_xy = '640x480+%d+%d' % (window_x, window_y)
  rootGUI.geometry(window_xy)
 
  lable_link = Label(rootGUI, text='Root address to parse: ',
            bg='black',
            fg='red',
            font=('SimSun', 12),
            relief=FLAT)
  lable_link.place(x=20, y=20)
 
  lable_link_width = lable_link.winfo_reqwidth()
  lable_link_height = lable_link.winfo_reqheight()
 
  input_link = Entry(rootGUI)
  input_link.place(x=20+lable_link_width, y=20, relwidth=0.5)
 
  lable_score = Label(rootGUI, text='Movie rating limit: ',
            bg='black',
            fg='red',
            font=('SimSun', 12),
            relief=FLAT)
  lable_score.place(x=20, y=20+lable_link_height+10)
 
  input_score = Entry(rootGUI)
  input_score.place(x=20+lable_link_width, y=20+lable_link_height+10, relwidth=0.3)
 
  btn_start = Button(rootGUI, text='Start download', command=lambda: start(input_link.get(), input_score.get()))
  btn_start.place(relx=0.4, rely=0.2, relwidth=0.1, relheight=0.1)
 
  txt = Text(rootGUI)
  txt.place(rely=0.4, relwidth=1, relheight=0.5)
 
  rootGUI.mainloop()

spider_main.py is the main entry point. It mainly uses tkinter to implement a simple interface where you enter the root address and the minimum movie score. The so-called root address is the entry page of one movie category on the Paradise site. For example, the home page has categories such as latest movies, Japanese and Korean movies, European and American movies, and the 2019 boutique zone. Take the 2019 boutique zone as an example (https://www.dytt8.net/html/gndy/dyzz/index.html); other category entry addresses work too, of course. The score is the condition for filtering movies. Learn to say no to junk movies; they are a waste of time and feeling. You can require that only movies scoring at least 8 get downloaded, or at least 9, as you like. It must be a number: if you type something messy the program will crash, and I was too lazy to handle that detail.
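
Since the tool admittedly crashes on non-numeric input, here is a minimal sketch of the validation that was skipped, assuming we simply fall back to a default minimum score when the Entry text is not a number. parse_score is a hypothetical helper, not part of the tool's code:

def parse_score(raw, default=8.0):
  #Fall back to the default when the input is not a number
  try:
    score = float(raw)
  except (TypeError, ValueError):
    return default
  #Clamp to the usual 0-10 rating range
  return min(max(score, 0.0), 10.0)

print(parse_score('8.5'))  # 8.5
print(parse_score('abc'))  # 8.0, falls back instead of crashing

Calling something like this on input_score.get() inside runner() would remove the crash.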

'''
The URL link management class is responsible for managing the crawled movie link addresses, both newly parsed links and already-downloaded links, ensuring that the same link is only downloaded once
'''
class UrlManager(object):
  def __init__(self):
    self.urls = set()
    self.used_urls = set()
    self.download_urls = set()
 
  def add_url(self, url):
    if url is None:
      return
    if url not in self.urls and url not in self.used_urls:
      self.urls.add(url)
 
  def add_urls(self, urls):
    if urls is None or len(urls) == 0:
      return
    for url in urls:
      self.add_url(url)
 
  def has_continue(self):
    return len(self.urls) > 0
 
  def get_url(self):
    url = self.urls.pop()
    self.used_urls.add(url)
    return url
 
  def get_download_url(self):
    return self.download_urls
 
  def has_download(self, url):
    return url in self.download_urls
 
  def add_download_url(self, url):
    if url is None:
      return
    if url not in self.download_urls:
      self.download_urls.add(url)

url_manager.py. The comments speak for themselves; basically I have written detailed comments at the key points of every .py file.
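
To illustrate the deduplication guarantee described in the class comment, a quick usage sketch (the URLs are made up for illustration):

manager = UrlManager()
manager.add_url('https://www.dytt8.net/html/gndy/dyzz/index.html')
manager.add_url('https://www.dytt8.net/html/gndy/dyzz/index.html')  #duplicate, silently ignored
while manager.has_continue():
  print('crawl: ' + manager.get_url())  #printed exactly once

manager.add_download_url('ftp://example.com/movie.mkv')
print(manager.has_download('ftp://example.com/movie.mkv'))  #True, so the spider would skip it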

import requests
from requests import ConnectionError, Timeout  #requests' own exceptions, not the builtins
 
'''
HTML download: fetch an HTML page as a whole from a link address; the valuable information is then extracted by html_parser.py
'''
class HtmlDownload(object):
  def __init__(self):
    #A single shared session reuses TCP connections across requests
    self.request_session = requests.session()
 
  def down_html(self, url, retry_count=3, headers=None, proxies=None, data=None):
    if headers:
      self.request_session.headers.update(headers)
    try:
      if data:
        content = self.request_session.post(url, data=data, proxies=proxies)
        print('result code: ' + str(content.status_code) + ', link: ' + url)
        if content.status_code == 200:
          return content.content
      else:
        content = self.request_session.get(url, proxies=proxies)
        print('result code: ' + str(content.status_code) + ', link: ' + url)
        if content.status_code == 200:
          return content.content
    except (ConnectionError, Timeout) as e:
      print('HtmlDownload ConnectionError or Timeout: ' + str(e))
      if retry_count > 0:
        #Propagate the retried result; without the return the caller would always get None
        return self.down_html(url, retry_count-1, headers, proxies, data)
      return None
    except Exception as e:
      print('HtmlDownload Exception: ' + str(e))

html_download.py uses requests to download the content of static web pages as a whole.
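
For a standalone feel of how it is called, a small sketch using the same gb2312 decoding as spider_main.py (the URL is just the category entry from earlier):

downloader = HtmlDownload()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
content = downloader.down_html('https://www.dytt8.net/html/gndy/dyzz/index.html',
                               retry_count=3, headers=headers)
if content is not None:
  #down_html returns raw bytes; decode before handing to the parser
  doc = content.decode('gb2312', 'ignore')
  print(doc[:200])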

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import urllib.parse
import base64
 
'''
HTML page parser
'''
class HtmlParser(object):
  #Analyze the movie list page to get the link of movie details page
  def parser_movie_link(self, content):
    try:
      urls = set()
      next_link = None
      doc = BeautifulSoup(content, 'lxml')
      div_content = doc.find('div', class_='co_content8')
      if div_content is not None:
        tables = div_content.find_all('table')
        if tables is not None and len(tables) > 0:
          for table in tables:
            link = table.find('a', class_='ulink')
            if link is not None:
              print('movie name: ' + link.text)
              movie_link = urljoin('https://www.dytt8.net', link.get('href'))
              print('movie link ' + movie_link)
              urls.add(movie_link)
        #the site's "next page" link text is the Chinese 下一页
        next = div_content.find('a', text=re.compile(r".*?下一页.*?"))
        if next is not None:
          next_link = urljoin('https://www.dytt8.net/html/gndy/dyzz/', next.get('href'))
          print('movie next link ' + next_link)
 
      return urls, next_link
    except Exception as e:
      print('error parsing movie link address: ' + str(e))
 
  #Analyze the movie details page to get the movie details
  def parser_movie_info(self, content, score=8):
    try:
      movie_name = None           #movie name
      movie_score = 0             #movie score
      movie_xunlei_links = set()  #the movie's Xunlei download addresses; there may be several
      doc = BeautifulSoup(content, 'lxml')
      #Strip the site's suffix from the <title> text to get the movie name
      movie_name = doc.find('title').text.replace('迅雷下载_电影天堂', '')
      #print(movie_name)
      div_zoom = doc.find('div', id='Zoom')
      if div_zoom is not None:
        #Get movie ratings
        span_txt = div_zoom.text
        txt_list = span_txt.split('◎')
        if txt_list is not None and len(txt_list) > 0:
          for tl in txt_list:
            if 'IMDB' in tl or 'IMDb' in tl or 'imdb' in tl or 'IMdb' in tl:
              txt_score = tl.split('/')[0]
              print(txt_score)
              movie_score = re.findall(r"\d+\.?\d*", txt_score)
              if movie_score is None or len(movie_score) <= 0:
                movie_score = 1
              else:
                movie_score = movie_score[0]
        print(movie_name + ' IMDB movie score: ' + str(movie_score))
        if float(movie_score) < score:
          print('movie score is lower than ' + str(score) + ', ignore')
          return movie_name, movie_score, movie_xunlei_links
        txt_a = div_zoom.find_all('a', href=re.compile(r".*?ftp:.*?"))
        if txt_a is not None:
          #Get movie Thunder download address, Base64 into thunder format
          for alink in txt_a:
            xunlei_link = alink.get('href')
            '''
            This originally converted the plain link into Xunlei's dedicated thunder:// download format; it later turned out Xunlei recognizes the plain link too, so the conversion is kept only for reference
            xunlei_link = urllib.parse.quote(xunlei_link)
            xunlei_link = xunlei_link.replace('%3A', ':')
            xunlei_link = xunlei_link.replace('%40', '@')
            xunlei_link = xunlei_link.replace('%5B', '[')
            xunlei_link = xunlei_link.replace('%5D', ']')
            xunlei_link = 'AA' + xunlei_link + 'ZZ'
            xunlei_link = base64.b64encode(xunlei_link.encode('gbk'))
            xunlei_link = 'thunder://' + str(xunlei_link, encoding='gbk')
            '''
            print(xunlei_link)
            movie_xunlei_links.add(xunlei_link)
      return movie_name, movie_score, movie_xunlei_links
    except Exception as e:
      print('error parsing movie details page: ' + str(e))

html_parser.py uses BS4 to parse the downloaded HTML page content and, following the page's layout rules, pick out the things we need. This is the most important part of a crawler: the whole point of writing one is to extract what is useful to us.
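
To see how the score extraction in parser_movie_info behaves, here is its core run on a made-up IMDb line (the sample text is illustrative, not from a real page):

import re

tl = 'IMDb评分 7.9/10 from 245,000 users'
txt_score = tl.split('/')[0]                  #'IMDb评分 7.9'
scores = re.findall(r"\d+\.?\d*", txt_score)  #pull out the first number
movie_score = scores[0] if scores else 1      #fall back to 1, as the parser does
print(movie_score)                            #'7.9'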

import dbm
import pickle
import os
 
'''
Data persistence tool class
'''
class PersistUtil(object):
  #dbm-based variant; the spider itself uses the pickle-based methods below
  def save_data(self, name='No Name', urls=None):
    if urls is None or len(urls) <= 0:
      return
    try:
      history_db = dbm.open('downloader_history', 'c')
      history_db[name] = str(urls)
    finally:
      history_db.close()
 
  def get_data(self):
    history_links = set()
    try:
      history_db = dbm.open('downloader_history', 'r')
      for key in history_db.keys():
        history_links.add(str(history_db[key], 'gbk'))
    except Exception as e:
      print('traversal of dbm data failed: ' + str(e))
    return history_links
 
  #Using pickle to save historical download records
  def save_history_links(self, urls):
    if urls is None or len(urls) <= 0:
      return
    with open('DownloaderHistory', 'wb') as pickle_file:
      pickle.dump(urls, pickle_file)
 
  #Get the download history saved in pickle
  def load_history_links(self):
    if os.path.exists('DownloaderHistory'):
      with open('DownloaderHistory', 'rb') as pickle_file:
        return pickle.load(pickle_file)
    else:
      return None

persist_util.py, the data persistence utility class.
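
A quick round-trip check of the pickle-based methods, using the same 'DownloaderHistory' file name as PersistUtil (the ftp links are made up):

persist = PersistUtil()
persist.save_history_links({'ftp://example.com/movie-a.mkv', 'ftp://example.com/movie-b.mkv'})
links = persist.load_history_links()
print(links)  #the same set, read back from the 'DownloaderHistory' file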

That completes the code. Now a word about Xunlei. I installed the latest Xunlei X, and it is necessary to enable the one-click download option in Xunlei's settings as shown in the figure below; otherwise a confirmation dialog pops up every time a new download task is added. As for the line that calls Xunlei to download, os.system(r'"D:\Thunder\Program\Thunder.exe" {url}'.format(url=xunlei_link)): be sure to point at the Thunder.exe file inside the actual Xunlei installation directory. You cannot use the shortcut path (right-click the Xunlei icon, Properties, Target: for Xunlei X the path shown there is the shortcut's own path and won't work), or the program will not be found.
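
As an aside, a sketch of an alternative to os.system using subprocess, which sidesteps the fiddly quoting of the raw command string. The install path is an assumption; adjust it to wherever your own Thunder.exe lives:

import subprocess

def push_to_xunlei(url, thunder_exe=r'D:\Thunder\Program\Thunder.exe'):
  #Passing an argument list avoids manual quoting of paths containing spaces
  subprocess.run([thunder_exe, url], check=False)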

At this point you should be able to get it running. Of course, if you want to optimize, there is plenty of room: the threading, the data persistence, and so on. Beginners can practice on this, then analyze the rules of other static websites themselves and change the HTML-parsing code to crawl other sites, say, movies with dangerous action in them. But it's better to watch fewer of those: read more books, and, ahem, mind your hygiene.
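
For instance, one small thread-related improvement, sketched as a variant of start(): creating the crawl thread as a daemon means closing the GUI window also ends the crawl instead of leaving an orphaned background thread. Since save_history() is already called after every download, little is lost on an abrupt exit:

def start(rootLink, scoreLimit):
  loop_thread = Thread(target=runner, args=(rootLink, scoreLimit), name='LOOP THREAD', daemon=True)
  loop_thread.start()
  btn_start.configure(state='disable')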

That is the whole content of this article. I hope it helps with your learning, and I hope you will continue to support Developer.
