Can crawling Jiayuan.com's data with Python prove it is unreliable?


The text and images in this article come from the internet and are for learning and exchange only, with no commercial purpose. If you have any questions, please contact us promptly.

The following article is from the Python Technology account, by Parson sauce.




Today I came across a discussion on Zhihu: "Is Jiayuan.com reliable for finding a partner?" It has 1,903 followers and 1,940,753 views, and most of its 355 answers say it is unreliable. Can crawling Jiayuan.com's data with Python prove that unreliability?



1、Data scraping

Open the Jiayuan.com website on a PC and search for women aged 20 to 30, with no restriction on region.



After paging through a few results, I found a search_v2.php request. Its response is an irregular JSON string containing the nickname, gender, marital status, match conditions, and so on.
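As a minimal sketch of what "irregular JSON" means here (the snippet below is made up, since the real payload isn't shown in this article), the string can be made parseable by stripping stray backslashes and passing strict=False to json.loads, which tolerates raw control characters inside strings:

```python
import json

# a made-up snippet mimicking an "irregular" JSON response: it contains a
# stray backslash and a raw tab character that strict JSON would reject
raw = '{"userInfo": [{"uid": "1", "nickname": "a\\\'b", "shortnote": "hi\tthere"}]}'

cleaned = raw.replace('\\', '')           # drop the stray backslashes
data = json.loads(cleaned, strict=False)  # strict=False tolerates control chars
print(data['userInfo'][0]['nickname'])
```

The same two-step cleanup (backslash removal, then a lenient parse) is what the parseHtml function below relies on.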



Click Headers and scroll to the bottom. Among the request parameters, sex is gender, stc encodes the age range, p is the page number, and listStyle controls whether results include photos.
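The parameters described above can be assembled into a query string with urllib.parse.urlencode. The values below are illustrative only; in particular, the stc value here mimics an age filter of 20 to 30 and is not necessarily the site's exact encoding:

```python
from urllib.parse import urlencode

# illustrative parameter values, not the site's exact encoding
params = {
    'sex': 'f',               # gender: female
    'stc': '2:20.30',         # age range 20 to 30 (assumed encoding)
    'p': 1,                   # page number
    'listStyle': 'bigPhoto',  # results with photos
}
query = urlencode(params)
print(query)  # sex=f&stc=2%3A20.30&p=1&listStyle=bigPhoto
```

Bumping p in a loop is all the pagination the scraper below needs.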



Fetching page by page with a GET request (URL plus parameters), I scraped 10,000 pages of data, 240,116 records in total.



Besides requests, the module to install is openpyxl, whose ILLEGAL_CHARACTERS_RE is used to filter out special characters.

# coding:utf-8
import csv
import json

import requests
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

line_index = 0

def fetchURL(url):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'Cookie': 'guider_quick_search=on; accessID=20201021004216238222; PHPSESSID=11117cc60f4dcafd131b69d542987a46; is_searchv2=1; SESSION_HASH=8f93eeb87a87af01198f418aa59bccad9dbe5c13; user_access=1; Qs_lvt_336351=1603457224; Qs_pv_336351=4391272815204901400%2C3043552944961503700'
    }
    r = requests.get(url, headers=headers)
    # round-trip through GBK, dropping any characters the codec cannot represent
    return r.text.encode("gbk", 'ignore').decode("gbk", "ignore")

def parseHtml(html):

    # the response is not valid JSON as-is: strip stray backslashes and the
    # characters openpyxl flags as illegal before parsing
    html = html.replace('\\', '')
    html = ILLEGAL_CHARACTERS_RE.sub(r'', html)
    s = json.loads(html, strict=False)  # strict=False tolerates control characters
    global line_index

    userInfo = []
    for key in s['userInfo']:
        line_index = line_index + 1
        a = (key['uid'], key['nickname'], key['age'], key['work_location'], key['height'], key['education'], key['matchCondition'], key['marriage'], key['shortnote'].replace('\n', ' '))
        userInfo.append(a)  # collect one row per profile

    with open('sjjy.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(userInfo)  # append this page's rows to the CSV

if __name__ == '__main__':
    for i in range(1, 10000):
        # NOTE: the endpoint prefix (the search_v2.php URL found above) was
        # lost when this article was reposted; only the query-string tail
        # survives here, so the full URL must be restored before running
        url = ',2:20.30&sn=default&sv=1&p=' + str(i) + '&f=select&listStyle=bigPhoto'
        html = fetchURL(url)
        print(str(i) + ' Page ' + str(len(html)) + '*********' * 20)
        parseHtml(html)  # parse and save this page's profiles

2、Deduplication

While deduplicating the data I found a surprising number of duplicates. At first I assumed my code was buggy, but after a long debugging session it turned out that the site itself repeats many profiles across pages. The two screenshots below show page 110 and page 111; notice the familiar faces.

Page 110



Page 111



After filtering out the duplicates, only 1,872 unique records remain. The data is heavily padded.
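The overlap between two pages like the ones above can be quantified by intersecting their uid sets; here is a sketch with made-up uids standing in for real profiles:

```python
# made-up uids standing in for the profiles returned on two consecutive pages
page_110 = {'101', '102', '103', '104'}
page_111 = {'103', '104', '105', '106'}

overlap = page_110 & page_111          # uids appearing on both pages
ratio = len(overlap) / len(page_111)   # fraction of the later page repeated
print(f'{len(overlap)} repeated profiles ({ratio:.0%} of the page)')
```

With the real data, this ratio is what makes the padding obvious; the filterData function below removes it.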

def filterData():
    seen = set()  # uids already encountered; a set makes the membership test O(1)
    csv_reader = csv.reader(open("sjjy.csv", encoding='gbk'))
    i = 0
    for row in csv_reader:
        i = i + 1
        print('processing: ' + str(i) + ' Line ')
        if row[0] not in seen:
            seen.add(row[0])  # first time we see this uid: keep it
    print('unique records: ' + str(len(seen)))
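An equivalent, self-contained sketch of the same idea on in-memory data: keep only the first row per uid with a dict (the CSV text below is made up, in the same shape as sjjy.csv with uid as the first column):

```python
import csv
import io

# made-up CSV text in the same shape as sjjy.csv: uid first, then the fields
raw = '1001,alice,25\n1002,bella,27\n1001,alice,25\n'

seen = {}
for row in csv.reader(io.StringIO(raw)):
    seen.setdefault(row[0], row)  # keep only the first row seen per uid

unique_rows = list(seen.values())
print(len(unique_rows))  # 2
```

Because dicts preserve insertion order, the surviving rows stay in their original page order.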