Why should crawler engineers have some basic back-end common sense?

Time: 2022-05-04

Today, in the fan exchange group, a classmate said he had found a bug in Requests and fixed it:

[screenshot of his message]

The pictures attached in the chat record are:

[screenshots from the chat record]

From this classmate's screenshot, I could roughly tell what problem he had run into and why he mistakenly thought it was a bug in Requests.

To explain it, we first need to understand one thing: the two display forms of a JSON string, and the ensure_ascii parameter of json.dumps.

Suppose we have a dictionary in Python:

info = {'name': '青南', 'age': 20}

When we want to convert it into a JSON string, we might write code like this:

import json

info = {'name': '青南', 'age': 20}
info_str = json.dumps(info)
print(info_str)

The output is shown in the figure below; the Chinese characters are turned into Unicode escape sequences:

[screenshot of the output]

We can also add the parameter ensure_ascii=False so that the Chinese characters are displayed normally:

info_str = json.dumps(info, ensure_ascii=False)

The output is shown in the figure below:

[screenshot of the output]

The classmate's reasoning was: {"name": "\u9752\u5357", "age": 20} and {"name": "青南", "age": 20} are obviously not equal as strings, and when Requests sends data by POST it calls json.dumps without this parameter, which is equivalent to ensure_ascii=True.

[screenshot]

So, he concluded, when Requests posts data containing Chinese, it converts the Chinese into Unicode escapes and sends that to the server; the server therefore never sees the original Chinese text, and that is what causes the error.
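His premise about what actually goes over the wire is easy to check. Here is a minimal sketch (using the same local test address as the server examples later in this article):

import requests

body = {'name': '青南', 'age': 20}
req = requests.Request('POST', 'http://127.0.0.1:5000/test_json', json=body)

# json= serializes the dictionary with the default ensure_ascii behaviour,
# so the body that is actually sent contains the escaped form
print(req.prepare().body)
# b'{"name": "\\u9752\\u5357", "age": 20}'

So the escaped form is indeed what Requests sends. The real question is whether that breaks anything on the server side.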

But in fact, it does not, and the conclusion is wrong. I often tell people in the group that crawler engineers should have some basic back-end common sense, so that they are not misled by this kind of phenomenon. To explain why this classmate's understanding is wrong, and why this is not a bug in Requests, let's write a POST service and see whether the different ways of posting the data make any difference. To show that the behaviour has nothing to do with any particular web framework, I will demonstrate it with Flask, FastAPI, and Gin.

First, let's look at the Requests test code. It sends the same JSON data in three ways:

import requests
import json

body = {
    'name': '青南',
    'age': 20
}
url = 'http://127.0.0.1:5000/test_json'

# Method 1: send the dictionary directly with json=
resp = requests.post(url, json=body).json()
print(resp)

headers = {
    'Content-Type': 'application/json'
}

# Method 2: serialize the dictionary to a JSON string in advance; Chinese is escaped
# to Unicode, which is equivalent to the first method
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body)).json()
print(resp)

# Method 3: serialize in advance with ensure_ascii=False so the Chinese characters are kept
resp = requests.post(url,
                     headers=headers,
                     data=json.dumps(body, ensure_ascii=False).encode()).json()
print(resp)

This test code sends the POST request in three ways. The first uses Requests' built-in json= parameter, whose value is a dictionary; Requests converts it to a JSON string automatically. In the latter two ways, we convert the dictionary into a JSON string ourselves first and then send it with the data= parameter. These two ways need 'Content-Type': 'application/json' in the headers so the server knows that a JSON string is being sent.
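Incidentally, the reason the json= method does not need the header is that Requests sets it for us; with a pre-serialized string passed to data= it does not. Here is a minimal sketch to confirm this, using the same test URL:

import json
import requests

url = 'http://127.0.0.1:5000/test_json'
body = {'name': '青南', 'age': 20}

# json= sets the Content-Type header automatically
print(requests.Request('POST', url, json=body).prepare().headers.get('Content-Type'))
# application/json

# data= with a pre-serialized string does not, so we add the header ourselves
print(requests.Request('POST', url, data=json.dumps(body)).prepare().headers.get('Content-Type'))
# None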

Let's look at the back-end code written with Flask:

from flask import Flask, request

app = Flask(__name__)


@app.route('/')
def index():
    return {'success': True}


@app.route('/test_json', methods=["POST"])
def test_json():
    body = request.json
    msg = f'received post data, {body["name"]=}, {body["age"]=}'
    print(msg)
    return {'success': True, 'msg': msg}

The output is shown in the figure below:

[screenshot of the output]

As you can see, no matter which of the three POST methods is used, the back end receives the correct information.

Now let's look at the FastAPI version:

from fastapi import FastAPI
from pydantic import BaseModel


class Body(BaseModel):
    name: str
    age: int


app = FastAPI()


@app.get('/')
def index():
    return {'success': True}


@app.post('/test_json')
def test_json(body: Body):
    msg = f'received post data, {body.name=}, {body.age=}'
    print(msg)
    return {'success': True, 'msg': msg}

The output is shown in the figure below. The data sent by all three POST methods is correctly recognized by the back end:

[screenshot of the output]

Next, let's look at the Gin version of the back end:

package main

import (
    "fmt"
    "net/http"

    "github.com/gin-gonic/gin"
)

type Body struct {
    Name string `json:"name"`
    Age  int16  `json:"age"`
}

func main() {
    r := gin.Default()
    r.GET("/", func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{
            "message": "running",
        })
    })
    r.POST("/test_json", func(c *gin.Context) {
        body := Body{}
        c.BindJSON(&body)
        msg := fmt.Sprintf("received post data, name=%s, age=%d", body.Name, body.Age)
        fmt.Println(">>>", msg)
        c.JSON(http.StatusOK, gin.H{
            "msg": msg,
        })
    })
    r.Run()
}

The output is as follows; the data received from the three request methods is exactly the same:

[screenshot of the output]

From this we can see that whether the Chinese in a POSTed JSON string is written as Unicode escapes or as the Chinese characters themselves, the back-end service parses it correctly.

Why does it not matter which form the Chinese takes in the JSON string? Because when a programming language converts the JSON string back into an object (this is called deserialization), it handles both forms correctly on its own. Look at the following figure:

[screenshot]

The ensure_ascii parameter only controls how the JSON is displayed. When ensure_ascii is True, json.dumps guarantees that the output contains only ASCII characters, so every character outside the 128 ASCII characters is escaped. When ensure_ascii is False, those non-ASCII characters are kept as they are. It is like a person with or without make-up: the essence does not change. When a modern programming language deserializes the string, both forms are recognized correctly.
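To see this concretely, here is a minimal sketch:

import json

escaped = '{"name": "\\u9752\\u5357", "age": 20}'
plain = '{"name": "青南", "age": 20}'

# The two strings look different, but they deserialize to the same object
print(json.loads(escaped) == json.loads(plain))
# True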

Therefore, if the back end is written with a modern web framework, there is no difference at all between the two JSON forms. Requests' default behaviour with the json= parameter is equivalent to ensure_ascii=True, and any modern web framework will correctly recognize the content submitted by POST.

Of course, if you hand-write a bare back-end interface in C, assembly, or some other language, things may be different. But who in their right mind would do that?

To sum up, the problem this classmate ran into is not a bug in Requests, but a problem with the back-end interface itself. Perhaps the back end uses some half-baked web framework that treats the POSTed body as a raw JSON string without deserializing it, and the back-end programmer uses regular expressions to pull data out of that string. So when no Chinese characters are found in the string, it reports an error.
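Purely as a guess at what such a back end might be doing, here is a minimal sketch of the kind of code that would break in exactly this way:

import json
import re

escaped = json.dumps({'name': '青南', 'age': 20})                    # Chinese escaped
plain = json.dumps({'name': '青南', 'age': 20}, ensure_ascii=False)  # Chinese kept

# A back end that regexes the raw body for Chinese characters instead of deserializing it
pattern = re.compile(r'"name":\s*"([\u4e00-\u9fa5]+)"')
print(pattern.search(plain))    # matches
print(pattern.search(escaped))  # None -- this is the code that "breaks" on escaped JSON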

Besides this problem of sending JSON by POST, I once had a subordinate who, while using Scrapy, could not figure out how to write the POST code. On a whim, he spliced the POST fields into the URL and sent a GET request instead, and found that he could still get the data. Something like:

body = {'name': '青南', 'age': 20}
url = 'http://www.xxx.com/api/yyy'
requests.post(url, json=body).text

# the GET version that, to his surprise, returned the same data
requests.get('http://www.xxx.com/api/yyy?name=青南&age=20').text

From this the student concluded that it was a general rule: any POST request can be turned into a GET request this way.

But obviously, this conclusion is also wrong. It only shows that the back-end programmer of that particular website made the interface accept both ways of submitting data, which requires extra code on the back end. By default, GET and POST are two completely different request methods, and they cannot be converted into each other like this.

If this student knew even a little back-end development, he could have written a small back-end program right away to test his conjecture.
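For a rough idea of the extra code the back end needs in order to accept both forms, here is a minimal Flask sketch (the endpoint path is made up; this is not the real site's code):

from flask import Flask, request

app = Flask(__name__)


# This endpoint tolerates both submission styles only because we
# explicitly wrote a branch for each of them.
@app.route('/api/yyy', methods=['GET', 'POST'])
def yyy():
    if request.method == 'POST':
        body = request.json                               # JSON body
    else:
        body = {'name': request.args.get('name'),         # query string
                'age': request.args.get('age', type=int)}
    return {'success': True, 'msg': f'name={body["name"]}, age={body["age"]}'}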

Here is another example. Some websites include another URL inside the URL, for example:

https://kingname.info/get_info?url=https://abc.com/def/xyz?id=123&db=admin

If you have no basic back-end knowledge, you may not see anything wrong with the URL above. But with a little back-end knowledge you would ask: does the &db=admin belong to https://kingname.info/get_info, at the same level as url=, or is it part of the parameters of https://abc.com/def/xyz?id=123&db=admin? You would be confused, and so would the back end. That is exactly why URL encoding is needed here. After all, the following two ways of writing it are completely different:

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123%26db%3Dadmin

https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123&db=admin
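To see how differently the back end reads these two forms, here is a minimal sketch using Python's standard library:

from urllib.parse import parse_qs, urlsplit

u1 = 'https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123%26db%3Dadmin'
u2 = 'https://kingname.info/get_info?url=https%3A%2F%2Fabc.com%2Fdef%2Fxyz%3Fid%3D123&db=admin'

# In the first form, db=admin stays inside the inner URL
print(parse_qs(urlsplit(u1).query))
# {'url': ['https://abc.com/def/xyz?id=123&db=admin']}

# In the second form, db=admin is a parameter of get_info itself
print(parse_qs(urlsplit(u2).query))
# {'url': ['https://abc.com/def/xyz?id=123'], 'db': ['admin']}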

Finally, let me summarize with a sentence from the preface of my crawler book:

Crawling is a miscellaneous craft: if crawling is all you know, you will never learn crawling well.
