If you crawl Sina Weibo with this script, the callback parse_item is never called. What is wrong?
XDMonkey asked 1 month ago

Why is parse_item never called? If I change all the weibo.com URLs to CSDN URLs, the callback runs, even with the Weibo cookies still attached. I wonder whether Weibo's redirection is the cause. The code is as follows:

import re

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import DmozItem

def extractData(regex, content, index=1):
    """Return capture group `index` of the first match, or '0' if none."""
    m = re.search(regex, content)
    return m.group(index) if m else '0'

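class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["weibo.com"]
    download_delay = 2
    # parse_item runs only for links extracted from a response; the start URL's
    # own response goes to parse_start_url, and off-site links are dropped.
    rules = [
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    ]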

    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        # "Host": "login.sina.com.cn",
        "Referer": "http://weibo.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    }
    cookies = {
        'ALF': 'my cookies',
        'Apache': 'my cookies',
        'SCF': 'my cookies',
        'SINAGLOBAL': 'my cookie',
        'SSOLoginState': 'my cookie',
        'SUB': 'my cookie',
        'SUBP': 'my cookie',
        'SUHB': 'my cookie',
        'TC-Page-G0': 'my cookie',
        'TC-Ugrow-G0': 'my cookie',
        'TC-V5-G0': 'my cookie',
        'ULV': 'my cookie',
        'UOR': 'my cookie',
        'WBStorage': 'my cookies',
        'YF-Page-G0': 'my cookie',
        'YF-Ugrow-G0': 'my cookie',
        'YF-V5-G0': 'my cookie',
        '_s_tentry': '-',
        'log ABCD Sid': 'my cookies',
        'UN': 'my cookie',
    }
    def start_requests(self):
        # No callback is set, so CrawlSpider's default parse() applies the rules.
        return [Request(
            "http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1",
            cookies=self.cookies,
            headers=self.headers,
        )]

    def parse_item(self, response):
        print("comehere!")
        # Capture the text inside the element with class="username".
        regexID = r'class="username">(.*?)</h1>'
        content = response.text
        item = DmozItem()
        ID = extractData(regexID, content, 1)
        item['ID'] = ID
        print(ID)
        yield item

Judging from the console output, I suspect that redirection is the cause. How can I solve this problem so that parse_item gets called?

2 Answers
Best Answer
XDMonkey answered 1 month ago

It was a cookie problem. With correct cookies, the page no longer redirects.
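One way to confirm this kind of diagnosis is to compare the URL that was requested with the URL that actually came back. The sketch below is stdlib-only Python 3; `describe_redirect` is a hypothetical helper, not part of Scrapy. In Scrapy, the redirect middleware records the URLs a request passed through in `response.meta['redirect_urls']`, which is what would be passed as `redirect_urls` here.

```python
from urllib.parse import urlparse


def describe_redirect(final_url, redirect_urls, allowed_domains):
    """Summarise whether a response was redirected and where it ended up.

    final_url       -- URL of the response that was finally received
    redirect_urls   -- URLs visited before it (empty if no redirect happened)
    allowed_domains -- the spider's allowed_domains list
    """
    host = urlparse(final_url).hostname or ""
    # A host is on-site if it equals an allowed domain or is a subdomain of it.
    on_site = any(host == d or host.endswith("." + d) for d in allowed_domains)
    return {
        "redirected": bool(redirect_urls),
        "requested": redirect_urls[0] if redirect_urls else final_url,
        "final": final_url,
        "on_site": on_site,
    }
```

Inside a spider callback this could be called as `describe_redirect(response.url, response.meta.get('redirect_urls', []), self.allowed_domains)` and logged. If `redirected` is true and the final URL is a login or visitor page, the cookies are the likely culprit.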

haofly answered 1 month ago

It should be callback=self.parse_item
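For context, Scrapy's documentation describes a rule's `callback` as either a callable or a string naming a method on the spider, so the string form in the question is also accepted. The toy model below is not Scrapy's actual code (`ToyRule`, `ToySpider`, and `resolve` are hypothetical names); it only illustrates why a string is convenient for rules declared at class level.

```python
class ToyRule:
    """Stand-in for a crawl rule: stores the callback exactly as given."""

    def __init__(self, callback):
        self.callback = callback


class ToySpider:
    # 'self' does not exist at class-definition time, so the method is named
    # by string here and looked up once an instance exists.
    rules = [ToyRule(callback="parse_item")]

    def resolve(self, rule):
        cb = rule.callback
        # Strings are resolved to bound methods; callables are used as-is.
        return getattr(self, cb) if isinstance(cb, str) else cb

    def parse_item(self, response):
        return "parsed " + response
```

Calling `ToySpider().resolve(ToySpider.rules[0])` returns the bound `parse_item` method, which is roughly the lookup a string callback relies on.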

XDMonkey replied 1 month ago

That is not the cause. I later found that wrong cookies were causing the redirect. Separately, Sina Weibo seems to deliver its data directly inside JS. For example, I first assumed that a user's following list was sent as JSON over Ajax, but it turns out to be placed directly in a JS file. How is this done? Does every user get their own JS file?
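When page data is embedded in inline scripts rather than fetched over Ajax, one common approach is to pull the JSON argument out of the script text with a regular expression and decode it. The sketch below assumes the data is passed to a call shaped like `FM.view({...})` inside a `<script>` tag. That shape and the helper name `extract_script_payloads` are assumptions for illustration, not something confirmed in this thread.

```python
import json
import re

# Assumed shape: <script>FM.view({...json...})</script>
SCRIPT_CALL = re.compile(
    r"<script>\s*FM\.view\((\{.*?\})\)\s*;?\s*</script>", re.DOTALL
)


def extract_script_payloads(page_source):
    """Return the decoded JSON objects passed to each FM.view(...) call."""
    payloads = []
    for match in SCRIPT_CALL.finditer(page_source):
        try:
            payloads.append(json.loads(match.group(1)))
        except ValueError:
            # Skip blocks that are not valid JSON instead of failing the page.
            continue
    return payloads
```

In a spider, `extract_script_payloads(response.text)` would return a list of dicts. If a payload carries an HTML fragment in one of its fields, that fragment can then be parsed with a selector like any other page.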