Handling Garbled Page Text Fetched with Python


When fetching website data with the requests module, character encoding is a troublesome problem. Normally requests identifies the encoding automatically from the response headers. If the Content-Type header declares a text type but no charset, requests falls back to ISO-8859-1, and a page that actually uses another encoding comes out garbled.
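The effect of a wrong default can be reproduced without any network access. The following is a minimal sketch using only the standard library; the sample string is an arbitrary choice for illustration, not something from a real site.

```python
# A UTF-8 page decoded as ISO-8859-1 turns into mojibake.
original = "编码"                      # sample text, chosen only for illustration
raw = original.encode("utf-8")         # the bytes a server would send
garbled = raw.decode("iso-8859-1")     # decoding with the wrong fallback
restored = raw.decode("utf-8")         # decoding with the real charset

print(repr(garbled))   # one Latin-1 character per byte, unreadable
print(restored)        # the original text
```

ISO-8859-1 maps every byte to a character, so the wrong decode never raises an error; it silently produces the wrong text.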

There are several ways to deal with this. The simplest is to set the encoding manually: r.encoding = 'utf-8'
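As a rough illustration of the manual override, the sketch below builds a response object by hand instead of calling a real site. Setting the private `_content` attribute is only a trick to keep the example offline and is an assumption of this sketch; in real code `r` would come from `requests.get(...)`.

```python
import requests

# Hand-built response standing in for requests.get(url); offline demo only.
r = requests.Response()
r._content = "编码".encode("utf-8")   # body bytes (private attribute, demo only)
r.encoding = "ISO-8859-1"             # the fallback requests would have chosen

garbled = r.text                      # decoded with the wrong encoding

r.encoding = "utf-8"                  # manual override
fixed = r.text                        # r.text is decoded again with the new value
print(fixed)
```

The override has to be set before reading `r.text`, since `r.text` decodes the body with whatever `r.encoding` holds at that moment.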

However, when collecting data you may visit websites under many different domain names, and it is impractical to specify the correct encoding by hand for each one. The following is a general method:

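import requests

def get_text(r):  # hypothetical helper name; r is a requests response
    if r.encoding == 'ISO-8859-1':  # fallback used when no charset was declared
        # Look for a charset declared inside the HTML itself
        encodings = requests.utils.get_encodings_from_content(r.text)
        if encodings:
            encoding = encodings[0]
        else:
            # Otherwise use the encoding guessed from the raw bytes
            encoding = r.apparent_encoding
        return r.content.decode(encoding, 'replace')
    return r.text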
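To see the fallback at work without a network call, the sketch below restates the routine compactly and runs it on a hand-built response. The names `decode_response` and `make_response` are hypothetical, and setting `_content` directly is only a way to keep the example offline. Note that in requests 2.x, `get_encodings_from_content` emits a deprecation warning.

```python
import requests

def decode_response(r):
    """Return page text, re-detecting the encoding when requests fell back."""
    if r.encoding == 'ISO-8859-1':
        encodings = requests.utils.get_encodings_from_content(r.text)
        encoding = encodings[0] if encodings else r.apparent_encoding
        return r.content.decode(encoding, 'replace')
    return r.text

def make_response(body, encoding):
    """Build an offline response (demo helper, not part of requests)."""
    r = requests.Response()
    r._content = body          # private attribute, set here for the demo only
    r.encoding = encoding
    r.status_code = 200
    return r

# A GBK page whose header carried no charset, so requests assumed ISO-8859-1.
html = '<html><head><meta charset="gbk"></head><body>中文</body></html>'
r = make_response(html.encode('gbk'), 'ISO-8859-1')

fixed = decode_response(r)
print(fixed)
```

Here the charset is recovered from the `<meta>` tag. When a page declares nothing at all, the routine relies on `r.apparent_encoding`, which is a statistical guess and may be wrong for short pages.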

This work is released under a CC license; reprints must credit the author and link to this article.