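I ran into a problem while crawling with selenium and phantomjs. (I was running Lantern as a proxy while crawling, because the site cannot be crawled directly.) The code is as follows: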
from selenium import webdriver
import time
driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()
It manages to collect the content of one link, then fails with this error:
Traceback (most recent call last):
File "D:/python-work/test.py", line 10, in <module>
a = a.get_attribute('href')
File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
return self._parent.execute(command, params)
File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
self.error_handler.check_response(response)
File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen
PHPz 2017-04-18 10:08:34
Experts, I modified the code, but it runs very slowly even though image loading is disabled, and the same error still occurs from time to time. Could you tell me what can be changed or optimized? The code is as follows:
__author__ = 'Administrator'
from selenium import webdriver
import time
cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = False
#cap["phantomjs.page.settings.javascriptEnabled"] = False
cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
driver = webdriver.PhantomJS(desired_capabilities=cap)
#driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
length = len(driver.find_elements_by_class_name('question_link'))
for i in range(0,length):
    alla = driver.find_elements_by_class_name('question_link')
    a = alla[i]
    print(a)
    if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
        a.click()
        driver.get(a.get_attribute('href'))
        title = driver.find_element_by_id('activity-name').text
        writer = driver.find_element_by_id('post-user').text
        content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
        print(writer,title,content)
        driver.back()
    time.sleep(8)
driver.close()
driver.quit()
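
A possible direction, offered as a sketch rather than a confirmed fix: StaleElementReferenceException means an element reference no longer belongs to the page the browser is currently showing. Once driver.get() or a.click() navigates away from the index page, the elements found there can no longer be queried. One common way around this is to read every href into a plain string first and only then start visiting pages, which also removes the need for driver.back() and for re-finding the links on every iteration. The function names collect_article_links and scrape_articles below are made up for illustration; the element ids and the find_element(s)_by_* calls are the ones used in the posts above (Selenium 3 style, newer Selenium releases use a different lookup API).

```python
def collect_article_links(driver, index_url, link_class="question_link"):
    """Load the index page and return the href of each matching link as a string."""
    driver.get(index_url)
    links = []
    for element in driver.find_elements_by_class_name(link_class):
        href = element.get_attribute("href")
        # Skip elements without an href and drop duplicate URLs.
        if href and href not in links:
            links.append(href)
    return links


def scrape_articles(driver, index_url):
    """Visit each article URL and return a list of (writer, title, content) tuples."""
    results = []
    # All hrefs are read before any navigation happens: plain strings stay
    # valid after the page changes, element references do not.
    for url in collect_article_links(driver, index_url):
        driver.get(url)
        title = driver.find_element_by_id("activity-name").text
        writer = driver.find_element_by_id("post-user").text
        content = driver.find_element_by_id("js_content").text
        results.append((writer, title, content))
    return results
```

With the driver created as in the post above, this would be called as scrape_articles(driver, 'http://chuansong.me'), followed by a single driver.quit(). Whether the fixed time.sleep(8) is still needed depends on how the site responds, which I have not tested.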