Why doesn't my Scrapy spider extract any data?


I'd like to ask for some help. I'm building a crawler, and the first step is to scrape the codes and names of all stocks from http://app.finance.ifeng.com/list/stock.php?t=ha&f=symbol&o=asc&p=1

My items.py looks like this:

```python
import scrapy

class NameItem(scrapy.Item):
    code = scrapy.Field()
    name = scrapy.Field()
```

My spider script looks like this:

```python
from scrapy.spider import BaseSpider
from Stock.items import NameItem
from scrapy.selector import Selector
from scrapy.http import Request

class StockNameSpider(BaseSpider):
    name = "stock_name"
    allowed_domains = ["http://app.finance.ifeng.com"]
    start_urls = ["http://app.finance.ifeng.com/list/stock.php?t=ha"]

    def parse(self, response):
        sel = Selector(response)
        links = sel.xpath('//*[@class= "tab01"]/table/tbody/tr')
        for link in links:
            code = link.xpath('td[1]/a/text()').extract()
            name = link.xpath('td[2]/a/text()').extract()
            nameitem = NameItem()
            nameitem['code'] = code[0] if code else None
            nameitem['name'] = name[0] if name else None
            yield nameitem
```

The XPath is not wrong; I have already tested it in the shell.
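One way to double-check the XPath against the response Scrapy itself downloads (rather than what Firebug displays) is the Scrapy shell; here is a small sketch, assuming Scrapy 0.24 as in the log below:

```python
# In a terminal, open the Scrapy shell on the start URL:
#   scrapy shell "http://app.finance.ifeng.com/list/stock.php?t=ha"
# The shell exposes the downloaded page as `response` and a Selector as `sel`;
# the same expression used in parse() can then be evaluated directly:
rows = sel.xpath('//*[@class= "tab01"]/table/tbody/tr')
len(rows)                                  # number of matched rows
rows[0].xpath('td[1]/a/text()').extract()  # first code, if any rows matched
```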

No errors are reported while it runs.

Here is the run log:

2015-02-19 20:22:49+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: Stock)
2015-02-19 20:22:49+0800 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-02-19 20:22:49+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Stock.spiders', 'SPIDER_MODULES': ['Stock.spiders'], 'LOG_FILE': 'test.log', 'BOT_NAME': 'Stock'}
2015-02-19 20:22:50+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-19 20:22:51+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-19 20:22:51+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-19 20:22:51+0800 [scrapy] INFO: Enabled item pipelines:
2015-02-19 20:22:51+0800 [stock_name] INFO: Spider opened
2015-02-19 20:22:51+0800 [stock_name] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-19 20:22:51+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-19 20:22:51+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-19 20:22:51+0800 [stock_name] DEBUG: Crawled (200) <GET http://app.finance.ifeng.com/list/stock.php?t=ha> (referer: None)
2015-02-19 20:22:51+0800 [stock_name] INFO: Closing spider (finished)
2015-02-19 20:22:51+0800 [stock_name] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 239,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 11784,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 19, 12, 22, 51, 897000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 2, 19, 12, 22, 51, 352000)}
2015-02-19 20:22:51+0800 [stock_name] INFO: Spider closed (finished)


But in the end no data was scraped at all.
Can anyone tell me why? I'm a beginner and would really appreciate any help.

python scrapy python-crawler

7824902 9 years, 8 months ago

Ah well, it turns out the HTML structure that Firebug shows differs slightly from the actual page source you see by right-clicking "View Source" in the browser: the rendered DOM contains a <tbody> element inside the table that is not present in the raw HTML Scrapy downloads, so the /tbody/ step in the XPath matches nothing.
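Given that, a minimal sketch of the fix (assuming the browser-inserted <tbody> is the only structural difference) is to rewrite the row XPath so it does not depend on <tbody>, i.e. replace the parse method of the spider above:

```python
# Inside StockNameSpider; Selector and NameItem are imported as in the question.
def parse(self, response):
    sel = Selector(response)
    # 'table//tr' matches the rows whether or not a <tbody> element exists
    # in the HTML that Scrapy actually downloads.
    rows = sel.xpath('//*[@class="tab01"]/table//tr')
    for row in rows:
        code = row.xpath('td[1]/a/text()').extract()
        name = row.xpath('td[2]/a/text()').extract()
        if not code:
            # header/spacer rows have no <a> in the first cell; skip them
            continue
        item = NameItem()
        item['code'] = code[0]
        item['name'] = name[0] if name else None
        yield item
```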

下限凹凸曼 answered 9 years, 8 months ago
