Can crawling Jiayuan.com's data with Python prove that it is unreliable?

Time: 2021-9-18

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial purpose. If you have any questions, please contact us promptly.

The following article was first published on Python Technology, by Parson Jiang.


Preface

Today I came across a discussion on Zhihu titled "Is it reliable to look for a partner on Jiayuan.com (世纪佳缘)?" It had 1,903 followers and 1,940,753 views, and most of the 355 answers said it was not reliable. Can crawling Jiayuan.com's data with Python prove that it is unreliable?


1. Data scraping

Open the Jiayuan.com website on a PC and search for women aged 20 to 30, with no restriction on region.


After paging through a few results, I found a search_v2.php request. Its response is an irregular JSON string containing nickname, gender, marriage status, match conditions, and so on.


Click Headers and scroll to the bottom. Among the request parameters, sex is the gender, stc is the age filter, p is the page number, and listStyle means listings with photos.
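For clarity, the same request can be written with a parameter dict so each field's meaning stays visible. This is a minimal sketch, assuming only the requests library; the values are the ones visible in the captured request, and for a real run the User-Agent and Cookie headers from the full script below would be added:

import requests

# sex=f: female; stc=23:1,2:20.30: age range 20-30; p: page number; listStyle=bigPhoto: listings with photos
params = {
    'key': '',
    'sex': 'f',
    'stc': '23:1,2:20.30',
    'sn': 'default',
    'sv': '1',
    'p': '1',
    'f': 'select',
    'listStyle': 'bigPhoto',
}
r = requests.get('http://search.jiayuan.com/v2/search_v2.php', params=params)
print(r.status_code, len(r.text))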


Sending GET requests with these URL parameters, I scraped 10,000 pages of data, 240,116 records in total.


The module to install is openpyxl; its ILLEGAL_CHARACTERS_RE is used to filter out special characters.

# coding:utf-8
import csv
import json

import requests
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

line_index = 0  # running count of profiles processed


def fetchURL(url):
    # Fetch one page of search results, using a browser User-Agent and a logged-in Cookie
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'Cookie': 'guider_quick_search=on; accessID=20201021004216238222; PHPSESSID=11117cc60f4dcafd131b69d542987a46; is_searchv2=1; SESSION_HASH=8f93eeb87a87af01198f418aa59bccad9dbe5c13; user_access=1; Qs_lvt_336351=1603457224; Qs_pv_336351=4391272815204901400%2C3043552944961503700'
    }

    r = requests.get(url, headers=headers)
    r.raise_for_status()
    # Drop characters GBK cannot represent, since the CSV is read back with the gbk codec later
    return r.text.encode("gbk", "ignore").decode("gbk", "ignore")


def parseHtml(html):
    # The response is not strictly valid JSON: remove backslashes and illegal
    # control characters, then parse it leniently
    html = html.replace('\\', '')
    html = ILLEGAL_CHARACTERS_RE.sub(r'', html)
    s = json.loads(html, strict=False)
    global line_index

    userInfo = []
    for key in s['userInfo']:
        line_index = line_index + 1
        # uid, nickname, age, work location, height, education, match conditions,
        # marriage status, and self-introduction (newlines flattened to spaces)
        a = (key['uid'], key['nickname'], key['age'], key['work_location'], key['height'], key['education'], key['matchCondition'], key['marriage'], key['shortnote'].replace('\n', ' '))
        userInfo.append(a)

    # Append this page's rows to the CSV
    with open('sjjy.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(userInfo)


if __name__ == '__main__':

    # Pages 1-9999: fetch, parse and append each page to sjjy.csv
    for i in range(1, 10000):
        url = 'http://search.jiayuan.com/v2/search_v2.php?key=&sex=f&stc=23:1,2:20.30&sn=default&sv=1&p=' + str(i) + '&f=select&listStyle=bigPhoto'
        html = fetchURL(url)
        print(str(i) + ' Page ' + str(len(html)) + '*********' * 20)
        parseHtml(html)

2. Deduplication

While deduplicating the data I found a surprising number of duplicates. At first I assumed my code was at fault, but after chasing the bug for a long time I discovered that the site's own data beyond roughly 100 pages is heavily repeated. The following two screenshots show page 110 and page 111 of the results; plenty of familiar faces, aren't there? A quick overlap check is sketched below the screenshots.

Page 110 data (screenshot)

Page 111 data (screenshot)
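One way to confirm the overlap is to fetch two adjacent pages and compare their uid sets. This is a minimal sketch that reuses fetchURL, json and ILLEGAL_CHARACTERS_RE from the script above; the exact overlap count depends on what the site returns at the moment:

def pageUids(page):
    # Fetch one results page and return the set of uids it contains
    url = 'http://search.jiayuan.com/v2/search_v2.php?key=&sex=f&stc=23:1,2:20.30&sn=default&sv=1&p=' + str(page) + '&f=select&listStyle=bigPhoto'
    html = fetchURL(url).replace('\\', '')
    s = json.loads(ILLEGAL_CHARACTERS_RE.sub(r'', html), strict=False)
    return {key['uid'] for key in s['userInfo']}

uids_110 = pageUids(110)
uids_111 = pageUids(111)
print(len(uids_110 & uids_111), 'uids appear on both page 110 and page 111')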


After filtering out duplicates, only 1,872 unique profiles remained. That is an awful lot of padding.

def filterData():
    # Collect the distinct uids (first column) from the scraped CSV and count them
    seen = set()
    with open("sjjy.csv", encoding='gbk') as f:
        csv_reader = csv.reader(f)
        i = 0
        for row in csv_reader:
            i = i + 1
            print('Processing line ' + str(i))
            if row[0] not in seen:
                seen.add(row[0])
    print(len(seen))
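For comparison, the same count can be done in a couple of lines with pandas. This is a minimal sketch, assuming the sjjy.csv layout written above (no header row, uid in the first column, gbk encoding); pandas is not otherwise used in this article:

import pandas as pd

# Read the scraped CSV and compare total rows with distinct uids
df = pd.read_csv('sjjy.csv', header=None, encoding='gbk')
print(len(df), 'rows scraped')
print(df[0].nunique(), 'unique uids')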