Knowledge points
- Dynamic packet capture
- Dynamic page analysis
- Sending requests with parameters via requests
- JSON data parsing
Development environment
- Python 3.8 (or any newer stable release) to run the code
- PyCharm 2021.2 as the IDE
- requests third-party module
I. Data source analysis (thought process)
1. Open the developer tools and refresh the page
- Right-click > Inspect, or press F12, to open developer tools
- Select the Network tab and refresh the page
- Click to open a video
- Search the captured requests for the content
- Expand the entries in turn to find the video address we need
2. Determine the URL address, request method, request parameters, and request header parameters
- Request header parameters
- Request parameters
3. Summary
- Request method: POST
- Request headers (disguise):
headers = {
    'content-type': 'application/json',
    'cookie': 'your own cookie',
    'Host': 'www.kuaishou.com',
    'Origin': 'https://www.kuaishou.com',
    'Referer': 'https://www.kuaishou.com/profile/3xv78fxycm35nn4',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
}
- Request parameters:
data = {
    'operationName': "visionProfilePhotoList",
    'query': "query visionProfilePhotoList($pcursor: String, $userId: String, $page: String, $webPageArea: String) {\n visionProfilePhotoList(pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) {\n result\n llsid\n webPageArea\n feeds {\n type\n author {\n id\n name\n following\n headerUrl\n headerUrls {\n cdn\n url\n __typename\n }\n __typename\n }\n tags {\n type\n name\n __typename\n }\n photo {\n id\n duration\n caption\n likeCount\n realLikeCount\n coverUrl\n coverUrls {\n cdn\n url\n __typename\n }\n photoUrls {\n cdn\n url\n __typename\n }\n photoUrl\n liked\n timestamp\n expTag\n animatedCoverUrl\n stereoType\n videoRatio\n profileUserTopPhoto\n __typename\n }\n canAddComment\n currentPcursor\n llsid\n status\n __typename\n }\n hostName\n pcursor\n __typename\n }\n}\n",
    'variables': {'userId': "3x9dquvtb9n9fps", 'pcursor': "", 'page': "profile"}
}
- Crawling further pages later is implemented with recursion (see the page-crawling code in Section II)
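For reference, the userId in 'variables' is just the last path segment of the creator's profile URL (compare the Referer above). A minimal sketch, using the profile URL from this example:

profile_url = 'https://www.kuaishou.com/profile/3x9dquvtb9n9fps'
user_id = profile_url.rstrip('/').rsplit('/', 1)[-1]
print(user_id)  # 3x9dquvtb9n9fps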
II. Code implementation
1. Send a request to the target URL
import requests  # third-party module: pip install requests
import re        # standard library, used later to sanitize file names

url = 'https://www.kuaishou.com/graphql'
# Disguise the request as a normal browser visit
headers = {
    # Declare the request body as a JSON-typed string
    'content-type': 'application/json',
    'Cookie': 'kpf=PC_WEB; kpn=KUAISHOU_VISION; clientid=3; did=web_ea128125517a46bd491ae9ccb255e242; client_key=65890b29; userId=270932146; kuaishou.server.web_st=ChZrdWFpc2hvdS5zZXJ2ZXIud2ViLnN0EqABnjkpJPZ-QanEQnI0XWMVZxXtIqPj-hwjsXBn9DHaTzispQcLjGR-5Xr-rY4VFaIC-egxv508oQoRYdgafhxSBpZYqLnApsaeuAaoLj2xMbRoytYGCrTLF6vVWJvzz3nzBVzNSyrXyhz-RTlRJP4xe1VjSp7XLNLRnVFVEtGPuBz0xkOnemy7-1-k6FEwoPIbOau9qgO5mukNg0qQ2NLz_xoSKS0sDuL1vMmNDXbwL4KX-qDmIiCWJ_fVUQoL5jjg3553H5iUdvpNxx97u6I6MkKEzwOaSigFMAE; kuaishou.server.web_ph=b282f9af819333f3d13e9c45765ed62560a1',
    'Host': 'www.kuaishou.com',
    'Origin': 'https://www.kuaishou.com',
    'Referer': 'https://www.kuaishou.com/profile/3xauthkq46ftgkg',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}
# <Response [200]> means the request was sent successfully
# (data is the request-parameter dict from the summary in Section I)
response = requests.post(url=url, headers=headers, json=data)
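Before moving on, a quick sanity check is worth doing (a sketch; a 200 status alone does not guarantee the cookie was accepted, so eyeball the body too):

print(response)              # expect <Response [200]>
print(response.status_code)  # expect 200
print(response.text[:200])   # first 200 characters of the raw body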
2. Obtain data
json_data = response.json()
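To see where 'feeds' and 'pcursor' live in the returned structure, pretty-printing the dict helps (a small exploratory sketch, not part of the crawler itself):

import pprint
pprint.pprint(json_data)  # nested dict: data -> visionProfilePhotoList -> feeds / pcursor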
3. Parse the data and extract the content we need
feeds = json_data['data']['visionProfilePhotoList']['feeds']
# Cursor needed to request the next page
pcursor = json_data['data']['visionProfilePhotoList']['pcursor']
# print(pcursor)
for feed in feeds:
    caption = feed['photo']['caption']    # video title
    photoUrl = feed['photo']['photoUrl']  # video link
    # \ is the escape character: a bare \ in a pattern cannot match a literal \,
    # so write \\ to match one \
    # (CSS selectors and XPath need HTML page source; this response is JSON,
    # so we read the keys directly instead)
    caption = re.sub('[\\/:*?"<>|\n\t]', '', caption)  # strip characters illegal in file names
    print(caption, photoUrl)
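A quick check that the sanitizer behaves as intended (the character class lists characters Windows forbids in file names, plus newline and tab):

import re
assert re.sub('[\\/:*?"<>|\n\t]', '', 'a/b:c*d?.mp4') == 'abcd.mp4'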
4. The video data obtained is binary data (these lines continue inside the for loop)
    video_data = requests.get(url=photoUrl).content
5. Save the video in binary ('wb') mode
    with open(f'video/{caption}.mp4', mode='wb') as f:
        f.write(video_data)
    print(caption, 'download complete!')
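One caveat: open() raises FileNotFoundError if the video/ folder does not exist. Creating it once before the loop avoids that (a small addition, not in the original code):

import os
os.makedirs('video', exist_ok=True)  # create the output folder if it is missing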
Page crawling
def get_page(pcursor):
    # Recursive paging: each response carries the cursor (pcursor)
    # needed to request the following page
    data = {
        'operationName': "visionProfilePhotoList",
        'query': "query visionProfilePhotoList($pcursor: String, $userId: String, $page: String, $webPageArea: String) {\n visionProfilePhotoList(pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) {\n result\n llsid\n webPageArea\n feeds {\n type\n author {\n id\n name\n following\n headerUrl\n headerUrls {\n cdn\n url\n __typename\n }\n __typename\n }\n tags {\n type\n name\n __typename\n }\n photo {\n id\n duration\n caption\n likeCount\n realLikeCount\n coverUrl\n coverUrls {\n cdn\n url\n __typename\n }\n photoUrls {\n cdn\n url\n __typename\n }\n photoUrl\n liked\n timestamp\n expTag\n animatedCoverUrl\n stereoType\n videoRatio\n profileUserTopPhoto\n __typename\n }\n canAddComment\n currentPcursor\n llsid\n status\n __typename\n }\n hostName\n pcursor\n __typename\n }\n}\n",
        'variables': {'userId': "3xauthkq46ftgkg", 'pcursor': pcursor, 'page': "profile"}
    }
    response = requests.post(url=url, headers=headers, json=data)
    json_data = response.json()
    feeds = json_data['data']['visionProfilePhotoList']['feeds']
    pcursor = json_data['data']['visionProfilePhotoList']['pcursor']
    for feed in feeds:
        caption = re.sub('[\\/:*?"<>|\n\t]', '', feed['photo']['caption'])
        photoUrl = feed['photo']['photoUrl']
        video_data = requests.get(url=photoUrl).content
        with open(f'video/{caption}.mp4', mode='wb') as f:
            f.write(video_data)
        print(caption, 'download complete!')
    # Jump out of the recursion when no usable next-page cursor is returned
    if not pcursor or pcursor == 'no_more':
        print('download complete')
        return
    get_page(pcursor)  # call itself with the next page's cursor

get_page('')
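Recursion works, but Python's default recursion limit (roughly 1000 frames) caps how many pages can be fetched in one run. An iterative sketch of the same loop, reusing the url, headers, and data objects defined above:

pcursor = ''
while True:
    data['variables']['pcursor'] = pcursor
    json_data = requests.post(url=url, headers=headers, json=data).json()
    pcursor = json_data['data']['visionProfilePhotoList']['pcursor']
    # ... download each feed exactly as inside get_page ...
    if not pcursor or pcursor == 'no_more':  # assumed end-of-list sentinel
        break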
Results