Rethinking the HTTP 3xx redirection mechanism

Time:2021-7-21

Problem introduction

Some time ago, while collecting data, we needed to download CDN logs for analysis. The log download interface is quite complex and has no corresponding SDK; the vendor only provides a shell script. The script is convenient to use on Ubuntu, but its redirect handling is very complex, so I wanted to gain a deeper understanding of redirection.

Script for querying the list of logs

#!/bin/sh

TMP_FILE="/tmp/wslog_query_client.log"
#Usage
Usage() {
	echo "wslog_query_client.sh [query_url] [user] [passwd] [start_time] [end_time] [channels]"
	return 0
}
#check input parameters
if [ $# -eq 1 ]; then
	if [ "$1" = "-h" ]; then
		Usage
		exit 0
	else
		Usage
		exit -1
	fi
elif [ $# -ne 6 ]; then
	Usage
	exit -1
fi
#params set
url=$1
user=$2
passwd=`echo $3 | sed 's/&/%26/g' `
start_time=$4
end_time=$5
channels=$6
#access logQuery access API
curl -s -D $TMP_FILE $1
cat $TMP_FILE | grep "HTTP/" | grep "302" > /dev/null
if [ $? -ne 0 ]; then
	exit -2
fi
#redirect to verify url with user and passwd
TMP_URL=`cat $TMP_FILE | grep "Location: "|sed 's/\r//' | awk '{print $2}' | sed 's/http:/https:/'`
TMP_URL="${TMP_URL}?u=$user&p=$passwd&channel=$channels"
curl -s -k -D $TMP_FILE $TMP_URL
cat $TMP_FILE | grep "HTTP/" | grep "302" > /dev/null
if [ $? -ne 0 ]; then
	exit -3
fi
#redirect to query url with start_time, end_time and channels
TMP_URL=`cat $TMP_FILE | grep "Location: "|sed 's/\r//' | awk '{print $2}'`
TMP_URL="${TMP_URL}&start_time=$start_time&end_time=$end_time&channels=$channels"
curl -s -D $TMP_FILE $TMP_URL
#check query result
cat $TMP_FILE | grep "HTTP/" | grep "200" > /dev/null
if [ $? -ne 0 ]; then
	if 
		cat $TMP_FILE | grep "HTTP/" | grep "404" > /dev/null
	then
		exit -404
	else
		exit -4
	fi
fi
exit 0

Invoking the script and its result (the user name, password, domain and wskey have been sanitized; the result is for reference only)

[email protected]:/tmp# sh /root/wslog_query_client.sh "http://dx.wslog.chinanetcenter.com/logQuery/access" user1 passwd1 2017-08-30-0000 2017-08-30-2359 "rtmp-wsz.enterprise.com"
{"logs": [{"domain": "rtmp-wsz.enterprise.com", "files": [{"size": 4320, "end_time": "2017-08-30-1159", "start_time": "2017-08-30-0000", "url": "http://dx.wslog.chinanetcenter.com/log/qukan/rtmp-wsz.enterprise.com/2017-08-30-0000-1130_rtmp-wsz.enterprise.com.cn.log.gz?wskey=e4030060bdfe9d5600a77726c5900d07aa3adae00e8b2"}, {"size": 8006, "end_time": "2017-08-30-2359", "start_time": "2017-08-30-1200", "url": "http://dx.wslog.chinanetcenter.com/log/qukan/rtmp-wsz.enterprise.com/2017-08-30-1200-2330_rtmp-wsz.enterprise.com.cn.log.gz?wskey=3772006094880e8300a73cc2c59006bfeea33ae00d9da"}]}]}

The script walks through a chain of 302 redirects, step by step. Each redirect depends on parameters, and the parameters differ from hop to hop, so this is more than a plain URL jump. If you request the following HTTP link directly, with all the parameters appended at once, the redirect chain does not complete. You therefore have to resolve the redirects layer by layer, as the shell script does.

http://dx.wslog.chinanetcenter.com/logQuery/access?user=user1&passwd=passwd1&channels=rtmp-wsz.enterprise.com&start_time=2017-08-30-0000&end_time=2017-08-30-2359
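The layer-by-layer flow of the shell script can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names fetch_once, expect and query_log_list are made up for this sketch, the query parameter names (u, p, channel, start_time, end_time, channels) are copied from the script, and the fetch function is injectable so the URL-building logic can be checked without touching the real service.

```python
import http.client
from urllib.parse import urlsplit


def fetch_once(url):
    """Send one GET without following redirects; return (status, location, body)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # Unlike `curl -k` in the script, this verifies the server certificate.
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", "replace")
        return resp.status, resp.getheader("Location"), body
    finally:
        conn.close()


def expect(status, wanted, step):
    """Raise if a hop did not answer with the status code the script checks for."""
    if status != wanted:
        raise RuntimeError("step %d: expected %d, got %d" % (step, wanted, status))


def query_log_list(url, user, passwd, start_time, end_time, channels, fetch=fetch_once):
    # Hop 1: the entry URL should answer 302 and point at the verification URL.
    status, location, _ = fetch(url)
    expect(status, 302, 1)

    # Hop 2: switch to https and append the credentials, escaping "&" in the password.
    verify_url = location.replace("http:", "https:", 1)
    verify_url += "?u=%s&p=%s&channel=%s" % (user, passwd.replace("&", "%26"), channels)
    status, location, _ = fetch(verify_url)
    expect(status, 302, 2)

    # Hop 3: append the time range and channels to the query URL.
    query_url = location + "&start_time=%s&end_time=%s&channels=%s" % (
        start_time, end_time, channels)
    status, _, body = fetch(query_url)
    expect(status, 200, 3)
    return body
```

Called with the same six arguments as the shell script, query_log_list would return the JSON body of the last response. Whether the service accepts a client with certificate verification enabled is not something the original script tells us, since it uses curl -k for the second hop.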

The principle of HTTP redirection

The client initiates an HTTP request. If the server returns an HTTP redirect response (a 3xx status code with the new address in the Location header), the client then requests the new URL. This process is redirection. It is normally completed automatically between the client and the server and is invisible to the user.

Redirects can be divided into three categories: permanent redirection, temporary redirection and special redirection.

If you want to move your website permanently to a new domain name, you should use a 301 permanent redirect. When a search engine crawler encounters this status code, it triggers an update and changes the URL associated with the resource in its index.
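As a small illustration of the three categories, the sketch below maps status codes to a category name. The grouping follows the one used in Mozilla's HTTP redirection documentation; the function name redirect_category is an assumption made for this example.

```python
# Status codes grouped by redirect category.
PERMANENT = {301, 308}       # the resource has moved for good
TEMPORARY = {302, 303, 307}  # the resource is elsewhere for now
SPECIAL = {300, 304}         # multiple choices / not modified (use the cached copy)


def redirect_category(status_code):
    """Return 'permanent', 'temporary' or 'special', or None for other codes."""
    if status_code in PERMANENT:
        return "permanent"
    if status_code in TEMPORARY:
        return "temporary"
    if status_code in SPECIAL:
        return "special"
    return None
```

The 302 responses in the log download flow fall into the temporary category, which is why a client should not cache the target and must walk the chain on every call.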

The use of HTTP redirection

This article mainly covers how to handle HTTP redirection in Python and in the shell.

Python

The HTTP libraries commonly used in Python, urllib, urllib2 (Python 2) and requests, all support HTTP redirection. Take the requests library as an example.

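import requests
from requests.compat import urljoin


def get_final_link(url):
    """Follow 301/302 redirects by hand and return the last URL reached."""
    try:
        r = requests.get(url=url, allow_redirects=False)
        if r.status_code in (301, 302):
            # Location may be relative, so resolve it against the current URL.
            return get_final_link(urljoin(url, r.headers['Location']))
        return r.url
    except (requests.RequestException, KeyError):
        # Network error, missing Location, or a scheme requests cannot
        # fetch (for example ftp://): return the URL reached so far.
        return url


def get_final_link1(url):
    """Let requests follow the redirects and print every intermediate URL."""
    r = requests.get(url=url, allow_redirects=True)
    for rsp in r.history:
        print(rsp.url)
    return r.url

print(get_final_link(url='http://runreport.dnion.com/DCC/logDownLoad.do?user=user1&password=password1&domain=rtmpdist-d.quklive.com&date=20171026&hour=10'))
print(get_final_link(url='https://github.com'))
print(get_final_link(url='http://github.com'))
print(get_final_link1(url='http://github.com'))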

If every hop in the redirect chain is known to be an HTTP(S) request, set the allow_redirects parameter to True to get the final HTTP link. If not, you need to resolve the redirects recursively yourself.
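When the chain may leave HTTP(S), a hand-written resolver can also be iterative instead of recursive. The following is a minimal sketch, not a definitive implementation: resolve_chain and _requests_fetch are names invented here, the hop limit of 10 is an arbitrary assumption, and the fetch function is injectable so the loop can be exercised without network access.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def _requests_fetch(url):
    """Default fetcher: one GET via requests, redirects not followed."""
    import requests  # imported lazily; only needed when this fetcher is used
    r = requests.get(url, allow_redirects=False, timeout=10)
    return r.status_code, r.headers.get("Location")


def resolve_chain(url, fetch=_requests_fetch, max_hops=10):
    """Return the list of URLs visited, starting with url and ending at the final one."""
    chain = [url]
    for _ in range(max_hops):
        if urlsplit(url).scheme not in ("http", "https"):
            break  # e.g. ftp://: not an HTTP request, so stop here
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            break  # a normal response: this is the final URL
        url = urljoin(url, location)  # resolves relative Location values
        chain.append(url)
    else:
        # The loop never hit a final URL: probably a redirect loop.
        raise RuntimeError("gave up after %d redirects" % max_hops)
    return chain
```

The last element of the returned list is the final link, and the hop limit guards against redirect loops, which the recursive version only stops at Python's recursion limit.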

If you simply want to download a file, urllib.urlretrieve (urllib.request.urlretrieve in Python 3) handles it easily, even if the final link is FTP.
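A minimal Python 3 sketch of such a download is shown below. The wrapper name download and the destination path are assumptions for illustration; urlretrieve itself is the standard library function.

```python
import urllib.request


def download(url, dest):
    """Download url to the local path dest and return that path."""
    # urlretrieve follows HTTP redirects on its own and returns
    # (local_filename, headers); only the filename is used here.
    path, _headers = urllib.request.urlretrieve(url, dest)
    return path
```

For example, download(log_url, "/tmp/log.gz") would save the file that log_url finally resolves to, where log_url stands for one of the download links discussed in this article.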

Shell

Simulate with the curl command

-L: when the page redirects, curl follows the redirect and outputs the page it lands on.

-I: fetch only the header information. When there is a redirect, the new URL can be found with curl -L -I URL | grep -i location

[email protected]:~# curl -L -I "http://runreport.dnion.com/DCC/logDownLoad.do?user=user1&password=password1&domain=rtmpdist-d.quklive.com&date=20171026&hour=10"
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=0F11668F6EBF4DC16B43E322CCF16C85; Path=/DCC
Location: http://runreport.dnion.com/logDownLoad.do?user=qukan&password=0cddcbf6d292fab5de0aas931bf19c&domain=rtmpdist-d.quklive.com&date=20171026&hour=10
Content-Type: text/html;charset=GBK
Content-Length: 0
Date: Mon, 06 Nov 2017 09:46:44 GMT

HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
Location: ftp://ABA606843D412DAE34F28CDB23F7A31E:0687B16F2F5D0A2637FACDB23F[email protected]125.39.237.48:55621/rtmpdist-d.quklive.com_20171026_10_11.gz
Content-Type: text/html;charset=GBK
Content-Length: 0
Date: Mon, 06 Nov 2017 09:46:44 GMT

Last-Modified: Thu, 26 Oct 2017 02:30:13 GMT
Content-Length: 1932
Accept-ranges: bytes

The chain finally lands on the required FTP link.
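The grep step can also be done in Python. The sketch below pulls every Location value out of a header dump like the one above, in order; the function name locations_from_headers is an assumption, and the input is simply the text that curl -L -I prints.

```python
def locations_from_headers(dump):
    """Return the Location header values found in a header dump, in order."""
    found = []
    for line in dump.splitlines():
        # Header lines look like "Name: value"; status lines have no colon.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            found.append(value.strip())
    return found
```

The last element of the returned list is the URL the chain ends on, which in the example above is the FTP link.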

HTTP redirection packet capture verification

The results of a packet capture with Wireshark are as follows:

The first redirect:

The second redirect:

The packet capture clearly shows the 302 redirect process.

