Scrapy simulated login to Zhihu keeps being redirected and fails to log in


  • When I use Scrapy to simulate logging in to Zhihu and then crawl the questions and answers on the home page, it keeps running into a redirect problem.

Python:


 from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from zhihu.items import ZhihuItem



class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = [
        "http://www.zhihu.com"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow = r'http://www\.zhihu\.com/question/\d+'), callback = 'parse_page'),
    )

    def start_requests(self):
        return [Request("https://www.zhihu.com/login", callback = self.post_login)]

    # The FormRequest seems to be the problem
    def post_login(self, response):
        print 'Preparing login'
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print xsrf
        ##############
        return [FormRequest.from_response(response,  #"http://www.zhihu.com/login",
                            formdata = {
                            '_xsrf': xsrf,
                            'email': '[email protected]',
                            'password': 'HUAZANG.55789260',
                            'rememberme': 'y',
                            },
                            callback = self.parse_page
                            )]

    def parse_page(self, response):
        problem = Selector(response)
        item = ZhihuItem()
        item['url'] = response.url
        item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
        item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
        item['answer']= problem.xpath('//div[@class=" zm-editable-content clearfix"]/text()').extract()
        return item

Running the spider with the following command prints the xsrf correctly, but the login does not succeed:


 $ scrapy crawl zhihu

The error output is as follows:
2014-12-18 14:45:11+0800 [zhihu] INFO: Spider opened
2014-12-18 14:45:11+0800 [zhihu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-18 14:45:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-18 14:45:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-18 14:45:11+0800 [zhihu] DEBUG: Redirecting (301) to <GET https://www.zhihu.com/> from <GET https://www.zhihu.com/login>
2014-12-18 14:45:11+0800 [zhihu] DEBUG: Redirecting (302) to <GET http://www.zhihu.com/> from <GET https://www.zhihu.com/>
2014-12-18 14:45:12+0800 [zhihu] DEBUG: Crawled (200) <GET http://www.zhihu.com/> (referer: None)
Preparing login
d117e46de0dcc5e8ee2f0c7031fcafe9
2014-12-18 14:45:12+0800 [zhihu] DEBUG: Redirecting (302) to <GET http://www.zhihu.com/> from <POST http://www.zhihu.com/login>
2014-12-18 14:45:12+0800 [zhihu] DEBUG: Filtered duplicate request: <GET http://www.zhihu.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2014-12-18 14:45:12+0800 [zhihu] INFO: Closing spider (finished)
2014-12-18 14:45:12+0800 [zhihu] INFO: Dumping Scrapy stats:

I hope someone can explain why the login does not succeed. I am very confused. Many thanks.

python scrapy web-crawler

朕射你无罪 asked 12 years, 2 months ago

The code above actually logs in successfully; I tested it myself later. It just never calls a function that requests the page URLs to crawl.
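
For reference, a minimal sketch of the kind of function this refers to, assuming the same Scrapy 0.24 / Python 2 setup as the question. The callback name after_login and the dont_filter=True flag are additions made here for illustration and are not in the original code; the idea is to point the FormRequest callback at this method instead of parse_page and let the CrawlSpider rules take over from there (Request is already imported in the question's code):

    def after_login(self, response):
        # The 302 after the login POST means the session cookie is already set.
        # Re-issue the start URLs so the CrawlSpider rules can extract and
        # follow the question links. dont_filter=True is needed because the
        # homepage was already crawled before logging in, so the duplicate
        # filter would otherwise drop this request (the "Filtered duplicate
        # request" line in the log above).
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

With that change the spider keeps crawling after the login POST instead of closing once the redirected homepage request is filtered as a duplicate.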

NINI酱 answered 12 years, 2 months ago
