Crawling down jacket data with Python and using visualization to help you choose your favorite clothes

Time:2021-9-12

The text and pictures in this article come from the internet and are for learning and communication only, with no commercial purpose. If you have any questions, please contact us promptly so they can be handled.

The following article comes from Cai J Xue Python, by brother J

Beginners who have just started with Python can copy the following link to watch a free introductory Python video course

https://v.douyu.com/author/y6AZ4jn9jwKW

 

Preface

A few days ago, friends in Guangzhou and Shenzhen were probably still wearing short sleeves and envying the snowy weather in the north. Then, just last week, Guangzhou and Shenzhen also got their cold snap, and everyone joined the "cooling group chat".


 

To help everyone keep out the cold, I crawled the down jacket data from JD.com (Jingdong). Why not Tmall? The reason is simple: its slider verification is a bit troublesome.

Data acquisition

JD.com loads its content dynamically through Ajax, so it can only be crawled by analyzing its data interfaces or by using an automated testing tool such as selenium. Dynamic web crawling has been covered in earlier original articles on this official account; interested readers can look them up.

This time the data is collected with selenium. Because my Chrome browser updates frequently, the chromedriver I had been using stopped working. So I changed the browser's automatic update setting and downloaded the driver version that matches the browser.

Then selenium searches for "down jacket" on jd.com, I scan the QR code with my phone to log in, and the script collects the commodity name, commodity price, store name, number of comments and other information for each down jacket.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree
import random
import json
import csv
import time

# Path to the local chromedriver (Selenium 3 style; Selenium 4 expects a Service object instead)
browser = webdriver.Chrome('/cuisine J learning Python/JD/chromedriver')
wait = WebDriverWait(browser, 50)  # wait up to 50 seconds for each condition
url = 'https://www.jd.com/'
data_list = []  # global list that stores the scraped data
keyword = "down jacket"  # search keyword

def page_click(page_number):
    try:
        # Scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.randint(1, 3))  # random delay of 1 to 3 seconds
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em'))
        )  # next-page button
        button.click()  # click the button
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)"))
        )  # wait until the first 30 items are loaded
        # Scroll to the bottom again to load the last 30 items
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)"))
        )  # wait until all 60 items are loaded
        wait.until(
            EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number))
        )  # the page turn succeeded if the highlighted page button shows the requested page number
        html = browser.page_source  # get the page source
        prase_html(html)  # call the data-extraction function (not shown in this listing)
    except TimeoutException:
        # WebDriverWait raises selenium's TimeoutException, not the built-in TimeoutError
        return page_click(page_number)
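
The function above handles a timeout by calling itself again, which keeps retrying without limit if a page never finishes loading. A bounded retry is one possible alternative; the sketch below is only an illustration, and `retry`, `max_attempts`, `delay` and the `flaky` stand-in are hypothetical names that do not appear in the original script.

```python
import time


def retry(action, max_attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call action() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except exceptions as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Stand-in for a page load that times out twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page did not load in time")
    return "page loaded"


result = retry(flaky, max_attempts=5, exceptions=(TimeoutError,))
print(result, "after", calls["count"], "attempts")
```

In the crawler, `action` would wrap the page-turning steps and `exceptions` would name selenium's timeout exception, so that a page that never loads raises an error after a few attempts instead of recursing forever.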

 

Data cleaning

Import data

import pandas as pd
import numpy as np
df = pd.read_csv("/Cai J Xue Python/JD/down jacket.csv")
df.sample(10)

 


 

Rename columns

df = df.rename(columns={'title': 'commodity name', 'price': 'commodity price', 'shop_name': 'store name', 'comment': 'number of comments'})

 

View data information

df.info()
'''
1. There may be duplicate values
2. The store name has a missing value
3. The number of comments needs cleaning
'''

RangeIndex: 4950 entries, 0 to 4949
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   commodity name      4950 non-null   object 
 1   commodity price     4950 non-null   float64
 2   store name          4949 non-null   object 
 3   number of comments  4950 non-null   object 
dtypes: float64(1), object(3)
memory usage: 154.8+ KB

 

Delete duplicate data

df = df.drop_duplicates()

 

Missing value processing

df["store name"] = df["store name"].fillna("anonymous")

 

Commodity name cleaning

Thickness

tmp = []
for i in df["commodity name"]:
    if "thick" in i:
        tmp.append("thick style")
    elif "thin" in i:
        tmp.append("thin style")
    else:
        tmp.append("other")
df['thickness'] = tmp

 

Fit

tmp = []  # reset the list, otherwise it still holds the labels from the previous step
for i in df["commodity name"]:
    if "slim fit" in i:
        tmp.append("slim fit")
    elif "loose" in i:
        tmp.append("loose type")
    else:
        tmp.append("other")
df['version'] = tmp

 

Style

tmp = []
for i in df["commodity name"]:
    if "Korean" in i:
        tmp.append("Korean style")
    elif "business" in i:
        tmp.append("business style")
    elif "casual" in i:
        tmp.append("casual style")
    elif "minimalist" in i:
        tmp.append("minimalist style")
    else:
        tmp.append("other")
df['style'] = tmp
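
The three loops above follow the same pattern: test the commodity name against a list of keywords in order and record the label of the first match. A minimal sketch of that pattern as a reusable function is shown below; `classify`, the rule table and the sample titles are hypothetical and only illustrate the idea.

```python
def classify(title, rules, default="other"):
    """Return the label of the first (keyword, label) rule whose keyword is in title."""
    for keyword, label in rules:
        if keyword in title:
            return label
    return default


# Hypothetical rule table and titles, for illustration only.
thickness_rules = [("thick", "thick style"), ("thin", "thin style")]
titles = [
    "men's thick hooded down jacket",
    "women's thin short down jacket",
    "long down jacket with fur collar",
]

labels = [classify(t, thickness_rules) for t in titles]
print(labels)
```

With a DataFrame, the same function could be applied as `df["commodity name"].apply(classify, rules=thickness_rules)`, which also removes the need to reset a shared temporary list between steps.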

 

Commodity price cleaning

df["price range"] = pd.cut(df["commodity price"], [0, 100, 300, 500, 700, 1000, 1000000], labels=['less than 100 yuan', '100 yuan - 300 yuan', '300 yuan - 500 yuan', '500 yuan - 700 yuan', '700 yuan - 1000 yuan', 'more than 1000 yuan'], right=False)
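
How prices that sit exactly on a boundary are assigned depends on the `right` argument of `pd.cut`. The sketch below uses made-up prices and assumes the bin edges are 0, 100, 300, 500, 700, 1000 and 1000000, which is what the six labels imply.

```python
import pandas as pd

# Made-up prices chosen to sit on or next to the bin edges.
prices = pd.Series([99.0, 100.0, 299.9, 300.0, 1000.0])

bins = [0, 100, 300, 500, 700, 1000, 1000000]
labels = ['less than 100 yuan', '100 yuan - 300 yuan', '300 yuan - 500 yuan',
          '500 yuan - 700 yuan', '700 yuan - 1000 yuan', 'more than 1000 yuan']

# right=False makes each interval closed on the left: [0, 100), [100, 300), ...
ranges = pd.cut(prices, bins, labels=labels, right=False)
result = ranges.astype(str).tolist()
print(result)
```

With `right=False`, a jacket priced at exactly 100 yuan lands in "100 yuan - 300 yuan" rather than "less than 100 yuan", and one priced at exactly 1000 yuan lands in "more than 1000 yuan".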

 

Number of comments cleaning

import re
df['number'] = [re.findall(r'(\d+\.{0,1}\d*)', i)[0] for i in df['number of comments']]  # extract the numeric part
df['number'] = df['number'].astype('float')  # convert to a numeric type
df['unit'] = [''.join(re.findall(r'(万)', i)) for i in df['number of comments']]  # extract the unit (万 = 10,000)
df['unit'] = df['unit'].apply(lambda x: 10000 if x == '万' else 1)
df['number of comments'] = df['number'] * df['unit']  # calculate the number of comments
df['number of comments'] = df['number of comments'].astype("int")
df.drop(['number', 'unit'], axis=1, inplace=True)
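
The cleaning step above splits each comment string into a number and a unit and multiplies them. The same logic can be sketched as a single function; this assumes the raw values look like "2万+" or "500+", where 万 means ten thousand, and `parse_comment_count` and the sample strings are hypothetical.

```python
import re


def parse_comment_count(text):
    """Convert strings such as '2万+' or '500+' into an integer count."""
    match = re.search(r'\d+\.?\d*', text)
    if match is None:
        return 0  # assumption: treat text without digits as zero comments
    number = float(match.group())
    unit = 10000 if '万' in text else 1  # 万 means ten thousand
    return int(round(number * unit))


# Hypothetical raw values in the style of the 'number of comments' column.
samples = ['2万+', '1.5万+', '500+', '37']
counts = [parse_comment_count(s) for s in samples]
print(counts)
```

Returning 0 for strings without digits is a design choice of this sketch; the list comprehension in the original raises an IndexError in that case.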

 

Store name cleaning

df["store type"] = df["store name"].str[-3:]  # the last three characters of the store name

 

Visualization

Import the visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font so that Chinese labels display correctly
plt.rcParams['axes.unicode_minus'] = False  # fix the minus sign '-' being displayed as a square in saved images
import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts 
from pyecharts.globals import ThemeType  
import stylecloud
from IPython.display import Image

 

Descriptive statistics


 

 

Correlation analysis

Histogram of commodity price distribution

sns.set_style('white')   
fig,axes=plt.subplots(figsize=(15,8)) 
# distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent
sns.distplot(df["commodity price"], color="salmon", bins=10)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("histogram of commodity price distribution")

 


 

 

Histogram of the number of comments

sns.set_style('white')
fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["number of comments"], color="green", bins=10, rug=True)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
axes.set_title("histogram of the number of comments")

 


 

 

Relationship between the number of comments and commodity price

fig, axes = plt.subplots(figsize=(15, 8))
sns.regplot(x='number of comments', y='commodity price', data=df, color='orange', marker='*')
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
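
The regression plot shows the trend visually; the strength of a linear relationship can also be summarized as a single number with the Pearson correlation coefficient. The sketch below computes it by hand on made-up values that are not taken from the crawled data; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration, not taken from the crawled data.
comments = [100, 200, 300, 400, 500]
prices = [900, 700, 650, 400, 350]

r = pearson_r(comments, prices)
print(round(r, 3))
```

A value close to -1 or 1 indicates a strong linear relationship and a value near 0 a weak one. On the DataFrame, `df['number of comments'].corr(df['commodity price'])` computes the same coefficient.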

 


 

 

Down jacket price distribution

df2 = df["price range"].astype("str").value_counts()
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
        Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
        .add("", list(zip(regions, values)))
        .set_global_opts(legend_opts=opts.LegendOpts(is_show=False), title_opts=opts.TitleOpts(title="down jacket price range distribution", subtitle="data source: JD\nchart: Cai J Xue Python", pos_top="0.5%", pos_left='left'))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=14))  # label each slice as name:percentage
    )
c.render_notebook()

 


 

 

Top 10 stores by number of comments

df5 = df.groupby('store name')['number of comments'].mean()  # average number of comments per store
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)  # the 10 stores with the highest averages
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x-axis and the y-axis
    .set_global_opts(title_opts=opts.TitleOpts(title="top 10 stores by number of comments", subtitle="data source: JD\tchart by: brother J", pos_left='left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # change the x-axis label font size
                       #yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                       yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
    )
c.render_notebook()
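
Note that the ranking above is based on the mean number of comments per listing in each store, not on the total. The sketch below spells out that calculation without pandas on made-up rows; `top_stores_by_mean` and the store names are hypothetical.

```python
from collections import defaultdict


def top_stores_by_mean(rows, n=10):
    """Return the n stores with the highest mean comment count, best first."""
    totals = defaultdict(lambda: [0, 0])  # store -> [sum of comments, listing count]
    for store, comments in rows:
        totals[store][0] += comments
        totals[store][1] += 1
    means = {store: total / count for store, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up (store name, number of comments) pairs for illustration.
rows = [
    ("store A", 1000), ("store A", 3000),
    ("store B", 2500),
    ("store C", 100), ("store C", 200), ("store C", 300),
]

ranking = top_stores_by_mean(rows, n=2)
print(ranking)
```

Here store A has the most comments in total (4000) but ranks below store B, because its mean per listing is 2000 against 2500. Ranking by total instead would only require summing rather than averaging in the grouping step.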

 


 

 

Fit

df5 = df.groupby('version')['commodity price'].mean()  # average price per fit
df5 = df5.sort_values(ascending=True)[:2]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x-axis and the y-axis
    .set_global_opts(title_opts=opts.TitleOpts(title="average down jacket price by fit", subtitle="data source: JD\tchart by: brother J", pos_left='left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # change the x-axis label font size
                       #yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                       yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
    )
c.render_notebook()

 


 

 

Thickness

df5 = df.groupby('thickness')['commodity price'].mean()  # average price per thickness
df5 = df5.sort_values(ascending=True)[:2]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x-axis and the y-axis
    .set_global_opts(title_opts=opts.TitleOpts(title="average down jacket price by thickness", subtitle="data source: JD\tchart by: brother J", pos_left='left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # change the x-axis label font size
                       #yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                       yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
    )
c.render_notebook()

 


 

 

Style

df5 = df.groupby('style')['commodity price'].mean()  # average price per style
df5 = df5.sort_values(ascending=True)[:4]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list()).reversal_axis()  # swap the x-axis and the y-axis
    .set_global_opts(title_opts=opts.TitleOpts(title="average down jacket price by style", subtitle="data source: JD\tchart by: brother J", pos_left='left'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),  # change the x-axis label font size
                       #yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                       yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30})  # rotate the y-axis labels by 30 degrees
                       )
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
    )
c.render_notebook()

 


 

 

Down jacket word cloud picture
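
The code for the word cloud is not included in the article. The earlier imports of jieba and stylecloud suggest the commodity names were segmented into words and then rendered by frequency, but that is an assumption. The sketch below only illustrates the word-counting step behind a word cloud, using English sample titles and a simple regular expression in place of jieba; all names and data in it are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop words and titles, for illustration only.
STOP_WORDS = {"down", "jacket", "and", "with", "the", "for"}
titles = [
    "men's thick hooded down jacket",
    "women's thin hooded down jacket",
    "thick long down jacket with fur collar",
]


def word_frequencies(texts, stop_words=frozenset()):
    """Count lower-cased words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


freq = word_frequencies(titles, STOP_WORDS)
print(freq.most_common(2))
```

Words that appear in nearly every title, such as "down" and "jacket" here, are filtered out so that they do not dominate the picture; the remaining counts decide how large each word is drawn.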
