Incorrect output when scraping a website with Selenium
I want the titles of the papers.
from selenium import webdriver
import time

domain = "http://www.ncbi.nlm.nih.gov/"
url_tail = "pubmed?term=(%222013%22%5BDate%20-%20Publication%5D%20%3A%20%222013%22%5BDate%20-%20Publication%5D)"
url = domain + url_tail

browser = webdriver.Firefox()
browser.get(url)

def extract_data(browser):
    titles = browser.find_elements_by_css_selector("div.rprt div.rslt p.title a")
    return [title for title in titles]

print "Page 1"
print extract_data(browser)

for page in range(2, 4):
    print "Page %d" % page
    next_page = browser.find_element_by_xpath("//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']").click()
    print extract_data(browser)
    print "------"
    time.sleep(5)
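A side note on timing: the loop above reads the titles immediately after the click and only sleeps afterwards, so the extraction may still see the previous page. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for this purpose. The sketch below shows the same polling idea in plain Python so it runs without a browser; `wait_until` and `page_changed` are hypothetical names introduced here, not part of Selenium.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    # Poll `condition` until it returns a truthy value or `timeout` seconds pass.
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)

# Hypothetical condition: pretends the new page is ready on the third check.
calls = {"n": 0}

def page_changed():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(page_changed, timeout=2.0, interval=0.01))  # True
```

In a real script the condition could compare the first title on the page before and after the click, so that extraction only happens once the result list has actually changed.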
Output:
Page 1
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449150>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449090>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4490d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4626d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>]
Page 2
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4626d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea42aed0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452bd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c50>]
------
Page 3
[<selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449050>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea449110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452850>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452890>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4528d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452910>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452990>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea4529d0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452a90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452ad0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452b90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452bd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c10>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f2dea452c90>]
------
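The lists above contain `WebElement` objects because `extract_data` returns the elements themselves rather than their text. A `WebElement` exposes its visible text through the `.text` property, so returning `title.text` for each element should give the titles. The sketch below illustrates this with `FakeElement`, a hypothetical stand-in for `WebElement` used only so the example runs without a browser.

```python
class FakeElement(object):
    # Hypothetical stand-in for selenium's WebElement; only `.text` is modelled.
    def __init__(self, text):
        self.text = text

def extract_titles(elements):
    # Return each element's visible text instead of the element objects.
    return [element.text for element in elements]

elements = [FakeElement("Paper A"), FakeElement("Paper B")]
print(extract_titles(elements))  # ['Paper A', 'Paper B']
```

Applied to the script, the return line of `extract_data` would become `return [title.text for title in titles]`.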
Someone on Stack Overflow ran into a similar problem (link). My code is in fact modeled on the best answer there, but it still gives the wrong result.
P.S.:
next_page = browser.find_element_by_xpath("//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']").click()
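This line is wrong. The XPath on this site is odd: the First, Prev, Next and Last links all have the same XPath (can anyone explain why?), namely
//*[@id='EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page']
so when I simulate the click, it first clicks Next and goes to page 2, then clicks First and goes back to page 1.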
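One possible explanation, sketched under assumptions: if all pager links share one `id`, an id-only XPath matches every one of them, and a single-element lookup returns the first match in document order. The sketch assumes that page 1 has no active First/Prev links, so the first match there is Next, while on page 2 it is First. The pager markup built here is hypothetical and not copied from the site; it uses `xml.etree.ElementTree` so it runs without a browser.

```python
import xml.etree.ElementTree as ET

PAGER_ID = "EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page"
BY_ID = ".//a[@id='%s']" % PAGER_ID

def build_pager(labels):
    # Build a hypothetical pager in which every link carries the same id.
    root = ET.Element("div")
    for label in labels:
        link = ET.SubElement(root, "a", {"id": PAGER_ID})
        link.text = label
    return root

def first_match(root):
    # Like a single-element lookup: only the first match in document order.
    return root.find(BY_ID).text

def pick_by_label(root, label):
    # Look at every match and keep the one whose text names the wanted link.
    for link in root.findall(BY_ID):
        if link.text.startswith(label):
            return link.text
    return None

page1 = build_pager(["Next >", "Last >>"])
page2 = build_pager(["<< First", "< Prev", "Next >", "Last >>"])

print(first_match(page1))            # Next >
print(first_match(page2))            # << First
print(pick_by_label(page2, "Next"))  # Next >
```

In Selenium the same narrowing could be expressed by adding a text predicate to the XPath, for example `//a[@id='...' and contains(., 'Next')]`; that the link text on the site actually contains "Next" is an assumption.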
ahhaore
9 years, 9 months ago