Double Eleven over? Use Python for another round of bargain hunting (simulate login and collect Taobao product data)

Time: 2022-09-18

Foreword

On November 4, the China Consumers Association published a consumption reminder on its official website, listing six things consumers should watch out for during the "Double Eleven" shopping festival. The gist: don't take Double Eleven's "low prices" on faith, and beware of merchant tricks. So how can we pick out genuinely good products at the lowest price?

Today I'll show you how to use Python with Selenium to collect this public merchant data, so you can gather product prices and reviews and compare them yourself.

Environment introduction

  • python 3.8
  • pycharm
  • selenium
  • csv
  • time
  • random

 

Import the required modules (install selenium first with pip if needed)

from selenium import webdriver
import time  # time module, used for program delays
import random  # random number module
from constants import TAO_USERNAME1, TAO_PASSWORD1  # local module holding your account credentials
import csv  # module for saving the data
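The `constants` import above refers to a small local file the article never shows; a minimal sketch, assuming it only stores the two credentials implied by the import (the values are placeholders):

```python
# constants.py -- hypothetical local module holding the Taobao credentials.
# The original article does not show this file; only the two names below
# are implied by its import statement.
TAO_USERNAME1 = 'your_taobao_username'  # placeholder, replace with your account
TAO_PASSWORD1 = 'your_taobao_password'  # placeholder, replace with your password
```

Keeping credentials in a separate, untracked file avoids hard-coding them in the script itself.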

 

Create a browser

driver = webdriver.Chrome()

 

Perform automated browser actions

driver.get('https://www.taobao.com/')
driver.implicitly_wait(10) # implicit wait: poll up to 10 seconds for elements to appear
driver.maximize_window() # maximize the browser window

 

Search function

First, open the developer tools. Then use the element picker in the upper-left corner to click the search box; DevTools highlights the tag of the selected element. Finally, right-click the highlighted node, choose Copy, then Copy XPath. (Note: the find_element_by_xpath calls below are Selenium 3 style; Selenium 4 removed them in favor of driver.find_element(By.XPATH, ...) with from selenium.webdriver.common.by import By.)
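An XPath like `//*[@id="q"]` simply means "any element whose id attribute is q". The standard library's ElementTree understands a small subset of XPath, enough to illustrate this on a toy fragment (this is not Taobao's real markup):

```python
import xml.etree.ElementTree as ET

# A toy fragment mimicking a search form; not Taobao's real page source.
html = '<form id="J_TSearchForm"><div><input id="q"/></div></form>'
root = ET.fromstring(html)

# ".//*[@id='q']" = any descendant element whose id attribute equals "q"
box = root.find(".//*[@id='q']")
print(box.tag)  # the matched element is the <input>
```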

def search_product(keyword):
    driver.find_element_by_xpath('//*[@id="q"]').send_keys(keyword)
    time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection

    driver.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button').click()
    time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection

word = input('Please enter the keyword you want to search for:')

# Call the function for product search
search_product(word)

 

Login interface

Using the same method as above, locate the elements we need

driver.find_element_by_xpath('//*[@id="f-login-id"]').send_keys(TAO_USERNAME1)
time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection

driver.find_element_by_xpath('//*[@id="f-login-password"]').send_keys(TAO_PASSWORD1)
time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection

driver.find_element_by_xpath('//*[@id="login-form"]/div[4]/button').click()
time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection
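The same one-to-three-second random pause appears after every action, so it can be factored into a small helper (a sketch; `human_delay` is my name, not the article's):

```python
import random
import time

def human_delay(low=1, high=3):
    """Pause a random number of seconds to mimic human pacing."""
    seconds = random.uniform(low, high)
    time.sleep(seconds)
    return seconds
```

Calling `human_delay()` then replaces each `time.sleep(random.randint(1, 3))` line.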

 


A browser driven by Selenium is recognized by Taobao, so the login fails

Overriding the navigator.webdriver property helps bypass this detection. Note that Page.addScriptToEvaluateOnNewDocument only affects pages loaded afterwards, so run this before navigating to the login page:

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
            {"source": """Object.defineProperty(navigator, 'webdriver', {get: () => false})"""})
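An alternative (or complement) is to start Chrome with its automation hints disabled. This is a configuration sketch; whether any given combination still evades detection depends on Taobao's current checks:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Hide the "Chrome is being controlled by automated software" switch
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Stop Blink from exposing the automation flag to pages
options.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(options=options)
```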

 

Parse product data

def parse_data():
    # all product cards on the results page
    divs = driver.find_elements_by_xpath('//div[@class="grid g-clearfx"]/div/div')

    for div in divs:
        try:
            info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
            price = div.find_element_by_xpath('.//strong').text + '元'
            deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text
            name = div.find_element_by_xpath('.//div[@class="shop"]/a/span[2]').text
            location = div.find_element_by_xpath('.//div[@class="location"]').text
            detail_url = div.find_element_by_xpath('.//div[@class="pic"]/a').get_attribute('href')

            print(info, price, deal, name, location, detail_url)
            # the CSV write and the closing except clause follow in the next step

 

Save the data

This block continues inside the try of parse_data, right after the print; the except closes the loop body so a product card missing any field is simply skipped:

            with open('某宝.csv', mode='a', encoding='utf-8', newline='') as f:
                csv_write = csv.writer(f)
                csv_write.writerow([info, price, deal, name, location, detail_url])
        except Exception:
            continue  # skip product cards that are missing a field
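The CSV above has no header row, so the columns are unlabeled. One way to add one is to write the header only when the file is first created (a sketch; `save_row` and the column names are my choices, not the article's):

```python
import csv
import os

def save_row(row, path='某宝.csv'):
    """Append one row to the CSV, writing a header line first if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, mode='a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            # column names chosen for illustration
            writer.writerow(['title', 'price', 'deals', 'shop', 'location', 'url'])
        writer.writerow(row)
```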

 

Page flipping

Observe the URL pattern across pages: the s parameter forms an arithmetic sequence with common difference 44, and the first page is s=0

from urllib.parse import quote  # percent-encode the keyword for the URL

for page in range(100):  # s offsets: 0, 44, 88, ...
    print(f'\n================== Fetching page {page + 1} data ==================')
    url = f'https://s.taobao.com/search?q={quote(word)}&s={page * 44}'
    driver.get(url)  # open this results page
    # Parse product data
    parse_data()
    time.sleep(random.randint(1, 3)) # random delay, to reduce the chance of bot detection
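The q parameter in the search URL is the percent-encoded keyword, which urllib.parse.quote produces; a small helper makes the loop's URL construction explicit (`search_url` is my helper name, not the article's):

```python
from urllib.parse import quote

def search_url(keyword, page):
    """Build the Taobao search URL; page is 0-based and s advances by 44."""
    return f'https://s.taobao.com/search?q={quote(keyword)}&s={page * 44}'

# '巴黎世家' percent-encodes to %E5%B7%B4%E9%BB%8E%E4%B8%96%E5%AE%B6
print(search_url('巴黎世家', 0))
```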

 

Finally, run the code to collect the results.