An example of crawling torrent seed links with Python 3

Date: 2020-10-16

This article uses Python 3, built on urllib and Beautiful Soup.

The project is divided into a manager, a URL manager, a downloader, a parser, and an HTML outputter. Each component has a single job, and the manager schedules them all. Finally, the parsed seed links are rendered as an HTML file for display; they could just as well be saved to a plain file. The final result is shown in the figure.
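Judging from the module names used in the constructor later in the article, the project layout is presumably something like the following (the file names are inferred from the imports and are not confirmed by the source):

```
project/
  spider_main.py      # the manager (SpiderMain), schedules the other parts
  url_manager.py      # UrlManager
  html_downloader.py  # HtmlDownloader
  html_parser.py      # HtmlParser
  html_outputer.py    # HtmlOutputer
```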

First, initialize the URL manager, downloader, parser, and HTML outputter in the constructor of the SpiderMain class. The code is as follows.


def __init__(self):
  self.urls = url_manager.UrlManager()                # URL manager
  self.downloader = html_downloader.HtmlDownloader()  # page downloader
  self.parser = html_parser.HtmlParser()              # HTML parser
  self.outputer = html_outputer.HtmlOutputer()        # HTML outputter

Then set the entry URL in the main block and kick off the download, parse, and output steps.

if __name__ == '__main__':
  url = "http://www.btany.com/search/Tasc-1"
  # Escape the URL so that non-ASCII (e.g. Chinese) search terms work,
  # while leaving the characters / : ? = unescaped
  root_url = quote(url, safe='/:?=')
  obj_spider = SpiderMain()
  obj_spider.craw(root_url)

The manager method, renamed here to craw (its original name, parser, would be shadowed by the self.parser attribute set in the constructor), wires the framework together: it downloads the page, hands it to the parser, and passes the parsed data to the outputter.

def craw(self, root_url):
  html = self.downloader.download(root_url)
  datas = self.parser.parserTwo(html)
  self.outputer.output_html3(datas)
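As a quick illustration of the escaping step, urllib.parse.quote percent-encodes any non-ASCII characters while leaving the characters listed in safe untouched (the Chinese search term below is a made-up example, not the one from the article):

```python
from urllib.parse import quote

# '电影' ('movie') is a placeholder search term for illustration
url = "http://www.btany.com/search/电影"
escaped = quote(url, safe='/:?=')
print(escaped)
# http://www.btany.com/search/%E7%94%B5%E5%BD%B1
```

Without the safe argument, quote would also escape the `:` and `?` of the URL itself, which is why the article passes safe='/:?='.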

The downloader's code is as follows:


# requires: import urllib.request and import urllib.error at the top of the module
def download(self, page_url):
  if page_url is None:
    return None
  # Mimic a browser's request headers; without them the site refuses the request
  headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
  req = urllib.request.Request(url=page_url, headers=headers)
  try:
    response = urllib.request.urlopen(req)
  except urllib.error.URLError:
    # network failure or HTTP error status
    return None
  if response.getcode() != 200:
    return None
  return response.read()

The headers dictionary mimics a browser's request headers; without it, the site refuses to serve the HTML at all.
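To see what the Request object actually carries, it can be built and inspected offline, without sending anything over the network. Note that urllib normalizes stored header names so that only the first letter is capitalized:

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://www.btany.com/", headers=headers)

# urllib stores the header under the key 'User-agent'
print(req.get_header('User-agent'))
# Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
```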

The parser code is as follows:

# requires: import re and from bs4 import BeautifulSoup at the top of the module

# Parse the downloaded page for seed links
def parserTwo(self, html):
  if html is None:
    return None
  soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
  res_datas = self._get_data(soup)
  return res_datas

# Collect each seed's title, magnet link and Thunder (Xunlei) link
def _get_data(self, soup):
  res_datas = []
  all_data = soup.find_all('a', href=re.compile(r"/detail"))   # detail pages carry the titles
  all_data2 = soup.find_all('a', href=re.compile(r"magnet"))   # magnet links
  all_data3 = soup.find_all('a', href=re.compile(r"thunder"))  # Thunder links
  for i in range(len(all_data)):
    res_data = {}
    res_data['title'] = all_data[i].get_text()
    res_data['cl'] = all_data2[i].get('href')
    res_data['xl'] = all_data3[i].get('href')
    res_datas.append(res_data)
  return res_datas

Inspecting the crawled HTML shows that every seed link sits in an a tag: the detail-page links carry the titles, while the hrefs containing magnet and thunder are the magnet and Thunder download links, respectively.
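The extraction logic can be checked offline against a hand-written fragment shaped like one search-result row (the hrefs below are made-up sample values, not real links from the site):

```python
import re
from bs4 import BeautifulSoup

# A made-up fragment mimicking one result row of the search page
sample = """
<a href="/detail/1234">Example Torrent</a>
<a href="magnet:?xt=urn:btih:abcd">magnet link</a>
<a href="thunder://QUFtYWduZXQ=">thunder link</a>
"""
soup = BeautifulSoup(sample, 'html.parser')
row = {
    'title': soup.find('a', href=re.compile(r"/detail")).get_text(),
    'cl': soup.find('a', href=re.compile(r"magnet")).get('href'),
    'xl': soup.find('a', href=re.compile(r"thunder")).get('href'),
}
print(row['title'])  # Example Torrent
```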

Finally, the outputter writes the HTML file. The code is as follows:

def __init__(self):
  self.datas = []

def collect_data(self, data):
  if data is None:
    return
  self.datas.append(data)

# Write the results out as an HTML table
def output_html3(self, datas):
  with open('output.html', 'w', encoding="utf-8") as fout:
    fout.write("<html>")
    fout.write("<head><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\"></head>")
    fout.write("<body>")
    fout.write("<table border=\"1\">")

    for data in datas:
      fout.write("<tr>")
      fout.write("<td>%s</td>" % data['title'])
      fout.write("<td>%s</td>" % data['cl'])
      fout.write("<td>%s</td>" % data['xl'])
      fout.write("</tr>")

    fout.write("</table>")
    fout.write("</body>")
    fout.write("</html>")
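The same table markup can be exercised without touching the filesystem, for example when testing, by writing to an in-memory buffer instead of a file (rows is made-up sample data):

```python
import io

rows = [{'title': 'Example Torrent',
         'cl': 'magnet:?xt=urn:btih:abcd',
         'xl': 'thunder://QUFtYWduZXQ='}]

buf = io.StringIO()
buf.write("<table border=\"1\">")
for row in rows:
    buf.write("<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
              % (row['title'], row['cl'], row['xl']))
buf.write("</table>")
print(buf.getvalue())
```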

The project is now complete. The source code has been uploaded to https://github.com/Ahuanghaifeng/python3-torrent. Please give it a star on GitHub; your encouragement keeps the author writing.

This concludes the example of crawling torrent seed links with Python 3. I hope it gives you a useful reference, and I hope you will continue to support developpaer.