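I ran into a problem while scraping with selenium and phantomjs. (I use the Lantern proxy software while scraping; the site cannot be scraped directly.)
The code is as follows: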
from selenium import webdriver
import time
driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()
It scrapes the content of one link, then reports this error:
Traceback (most recent call last):
  File "D:/python-work/test.py", line 10, in <module>
    a = a.get_attribute('href')
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen
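The StaleElementReferenceException above is raised on the second pass through the loop: the elements in alla belong to the listing page, and driver.get(a) replaces that page, so the remaining element handles no longer point at anything. A minimal sketch of one common workaround follows, assuming the same Selenium 3-era API used in the question. The helper names collect_links, scrape_article, scrape_all, and main are hypothetical. The idea is to read every href into a plain string before navigating anywhere.

```python
def collect_links(driver, class_name='question_link'):
    """Return the href strings of all matching links on the current page."""
    elements = driver.find_elements_by_class_name(class_name)
    # Plain strings survive navigation; WebElement handles do not.
    return [e.get_attribute('href') for e in elements]


def scrape_article(driver, url):
    """Load one article page and return (writer, title, content)."""
    driver.get(url)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    return writer, title, content


def scrape_all(driver, start_url='http://chuansong.me'):
    """Collect all links from the listing page first, then visit each one."""
    driver.get(start_url)
    links = collect_links(driver)  # read every href before leaving the page
    return [scrape_article(driver, link) for link in links if link]


def main():
    # Imported here so the helpers above can be exercised without a browser.
    from selenium import webdriver
    driver = webdriver.PhantomJS()
    try:
        for writer, title, content in scrape_all(driver):
            print(writer, title, content)
    finally:
        driver.quit()
```

Calling main() runs the sketch against the live site. Because only strings are kept across page loads, no element handle is reused after navigation.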
PHPz 2017-04-18 10:08:34
Dear experts, I modified the code, but it still runs very slowly even with image loading disabled, and the same error sometimes reappears. Please tell me what could be changed or optimized. The code is as follows:
__author__ = 'Administrator'
from selenium import webdriver
import time
cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = False
#cap["phantomjs.page.settings.javascriptEnabled"] = False
cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
driver = webdriver.PhantomJS(desired_capabilities=cap)
#driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
length = len(driver.find_elements_by_class_name('question_link'))
for i in range(0,length):
    alla = driver.find_elements_by_class_name('question_link')
    a = alla[i]
    print(a)
    if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
        a.click()
        driver.get(a.get_attribute('href'))
        title = driver.find_element_by_id('activity-name').text
        writer = driver.find_element_by_id('post-user').text
        content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
        print(writer,title,content)
        driver.back()
        time.sleep(8)
driver.close()
driver.quit()
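Two things in the code above are likely to cost time or bring the error back: the fixed time.sleep(8) after every article, and calling a.get_attribute('href') after a.click() has already left the listing page. A rough sketch of one alternative follows, assuming the article URLs have already been gathered as plain strings. The names wait_until and read_article are hypothetical, and wait_until is a hand-rolled stand-in for the explicit waits that Selenium provides through WebDriverWait. It polls for the article body and returns as soon as it is present.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.2):
    """Call probe() repeatedly until it returns a truthy value.

    Exceptions raised by probe() are treated as "not ready yet".
    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = probe()
            if result:
                return result
        except Exception:
            pass  # e.g. element not found yet; retry until the deadline
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


def read_article(driver, url, timeout=10.0):
    """Open `url` and return (writer, title, content_html) once the body exists."""
    driver.get(url)
    # Wait only as long as needed instead of a fixed time.sleep(8).
    body = wait_until(lambda: driver.find_element_by_id('js_content'), timeout)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    return writer, title, body.get_attribute('outerHTML')
```

Loading each URL directly with read_article(driver, url) also removes the need for click() and driver.back(), so the listing page is loaded only once.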