Python batch crawls to the music of Chinese superstar Jay Chou

Time:2021-11-27

preface

The text and pictures of this article come from the network, only for learning and communication, and do not have any commercial purpose. If you have any questions, please contact us in time for handling.

PS: if you need Python learning materials, you can click the link below to get them by yourself

Python free learning materials and group communication solutions. Click to join

The little friend said he wanted to listen to Jay Chou’s music. What website can listen to Jay Chou’s music for free? Then he found that Migu music can listen to Jay Chou’s songs for free. Since he can listen to Jay Chou’s music for free, can’t he climb~

Basic development environment

  • Python 3.6
  • Pycharm
import requests
import parsel

Related modulespipInstall it

在这里插入图片描述

Target site analysis

在这里插入图片描述
Click the play button to automatically jump to the music playing page

在这里插入图片描述
There is a download button on the playback interface. Click download.

Yes, you need to log in

在这里插入图片描述

  • Open developer tools
  • Select network
  • Click download now

There will be a download data interface and a post request data interface. The returned data has the real address of the audio.
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
Copying the URL address will automatically download the file to the local

Since it is a post request, you only need to see the changes of data parameters and the parameters it needs to pass
在这里插入图片描述
Check the downloads of a few more songs to find outcopyrightIdIt is the ID value of each song. You can download the music only by obtaining the ID value of each song.

So return to Jay Chou’s music list page

在这里插入图片描述
It can be found that the music list page is a static website. You can directly use requests to request the website to parse the website data, and you can obtain the ID value and title of music.

Now there is the last problem left, that is, page turning and multi page acquisition.

For page turning and crawling, just click the next page to view the change of URL address and find its corresponding change law.

在这里插入图片描述
Page is the corresponding page number, so it’s done to turn the page and crawl. The next step is to write code

1. Request the web page to get the ID value and title of the music

I won’t bring the cookie. You can log in to Migu music and copy it in the developer tool

ef get_mp3_info(url):
    headers = {
        'cookie': '',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    lis = selector.css('#J_PageSonglist > div.songlist-body > div')
    for li in lis:
        page_url = li.css('.song-name-txt::attr(href)').get()
        mp3_id = page_url.split('/')[-1]
        title = li.css('.song-name-txt::attr(title)').get()

2. Post request for music download address

There is no need to write so many header parameters here. For convenience, you can directly copy and paste them. Because it is a post request, some parameters are necessary to bring, otherwise you will get the desired return result.

def get_mp3_url(mp3_id, title):
    url = 'https://music.migu.cn/v3/api/order/download'
    headers = {
        'authority': 'music.migu.cn',
        'method': 'POST',
        'path': '/v3/api/order/download',
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'content-length': '42',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'cookie': '',
        'origin': 'https://music.migu.cn',
        'pragma': 'no-cache',
        'referer': 'https://music.migu.cn/v3/music/order/download/60054701923',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
    }

    data = {
        'copyrightId': '{}'.format(mp3_id),
        'payType': '01',
        'type': '1'
    }
    response = requests.post(url=url, data=data, headers=headers)
    html_data = response.json()
    mp3_url = html_data['downUrl']

3. Save music to local

Saving code is still relatively simple and commonly usedwith open

def download(download_url, title):
    response = requests.get(url=download_url)
    Path = 'music \ \' + title + '. MP3'
    with open(path, mode='wb') as f:
        f.write(response.content)

Specific implementation effect

Some music still needs to be paid. Therefore, when you post a request for paid music, there is no download address. You can write a judgment
在这里插入图片描述
在这里插入图片描述

summary

The code can be optimized. This is only the simplest version of crawler code. The download speed is not very fast. You can use multi-threaded crawling for better speed. You can optimize yourself.

Crawling is not difficult, mainly because it analyzes websites, unless it involves heavily encrypted websites, such as font encryption and JS data encryption. Most of these websites only need to spend some time analyzing the data of the website.

In fact, the font encrypted website is not difficult, mainly because the crawling process is a little complicated. Come on