Selenium prints incorrect output when scraping a website


Site URL: http://www.ncbi.nlm.nih.gov/pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)

I want to extract the titles of the papers.


from selenium import webdriver
import time

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.Firefox()
browser.get(url)

def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title for title in titles]

print "Page 1"
print extract_data(browser)

for page in range(2, 4):
    print "Page %d" % page
    next_page = browser.find_element_by_xpath("//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']").click()
    print extract_data(browser)
    print "------"
    time.sleep(5)

Output:


Page 1
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449150>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449090>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4490d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4626d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>]
Page 2
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4626d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea42aed0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452bd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c50>]
------
Page 3
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452bd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c90>]
------

Someone on Stack Overflow ran into a problem similar to mine. In fact, I modeled my code on the top answer there, but it still goes wrong.

P.S.:


next_page = browser.find_element_by_xpath("//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']").click()

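This line is wrong. The XPath on this site is odd: the First, Prev, Next, and Last links all have the same XPath (can anyone explain why?), namely //*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page'] . So when I simulate the click, it first clicks Next and goes to page 2, and then it clicks First and goes back to page 1.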
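A possible explanation, offered as an assumption rather than a confirmed diagnosis: find_element_by_xpath returns only the first element that matches, so when several pager links share one id, whichever comes first in the page is clicked. The sketch below picks a link by its visible text instead. `FakeLink` and `pick_link` are hypothetical names introduced here, and the link labels are guesses at what the page shows.

```python
# XPath copied from the question; it matches every pager link on the page.
PAGER_XPATH = "//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']"


class FakeLink(object):
    """Hypothetical stand-in for a selenium WebElement; only .text is modeled."""

    def __init__(self, text):
        self.text = text


def pick_link(links, label):
    """Return the first link whose visible text contains label, or None."""
    for link in links:
        if label in link.text:
            return link
    return None


# Assumed labels for the pager on page 2, where First and Prev precede Next.
page_two_links = [
    FakeLink("<< First"),
    FakeLink("< Prev"),
    FakeLink("Next >"),
    FakeLink("Last >>"),
]

chosen = pick_link(page_two_links, "Next")
print(chosen.text)
```

With a real browser, the list would come from browser.find_elements_by_xpath(PAGER_XPATH) (the plural form), and the click would go to the link that pick_link returns, after checking that it is not None.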

selenium python-crawler

ahhaore 9 years, 9 months ago

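return [title.text for title in titles]
What you return here should not be return [title for title in titles]. Otherwise nothing is converted: you get the WebElement objects back unchanged instead of their text.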
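A minimal sketch of that fix, runnable without a browser. `FakeElement` is a hypothetical stand-in for selenium's WebElement and `extract_titles` is an illustrative name; the only behavior relied on is that each element exposes a `.text` attribute. The sample titles are made up.

```python
class FakeElement(object):
    """Hypothetical stand-in for a selenium WebElement; only .text is modeled."""

    def __init__(self, text):
        self.text = text


def extract_titles(elements):
    # Read .text from each element so the result is a list of strings
    # rather than a list of WebElement objects.
    return [element.text.strip() for element in elements]


# Made-up sample data standing in for the elements found on the page.
sample = [FakeElement("First paper title. "), FakeElement("Second paper title.")]
titles = extract_titles(sample)
print(titles)
```

In the question's code, the same list comprehension would be applied to the result of browser.find_elements_by_css_selector(...), so that print shows the title strings instead of object representations.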

Ko君D日常 answered 9 years, 9 months ago
