
python - Cache problem ("Element does not exist in cache") when crawling with selenium and phantomjs?

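I ran into a problem while crawling with selenium and phantomjs. (Note: I was running the Lantern proxy software while crawling, because the site cannot be crawled directly.)

The code is as follows: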

from selenium import webdriver
import time 
driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()

It scrapes the content of one link successfully, then fails with this error:

Traceback (most recent call last):
  File "D:/python-work/test.py", line 10, in <module>
    a = a.get_attribute('href')
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen
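
The traceback points at `a.get_attribute('href')`, which suggests the following cause: the elements in `alla` belong to the list page, and once `driver.get(a)` loads another page those element references are no longer valid, so the second loop iteration raises `StaleElementReferenceException`. A minimal sketch of one way around this, assuming the same Selenium 3 style `find_elements_by_class_name` / `find_element_by_id` calls used in the question, is to copy every `href` into a plain string before navigating anywhere. The function names `collect_hrefs` and `scrape_articles` are made up for this illustration.

```python
def collect_hrefs(driver, class_name="question_link"):
    """Return the href of every matching link as a plain string.

    Strings stay usable after the browser navigates away;
    WebElement references from the old page do not.
    """
    elements = driver.find_elements_by_class_name(class_name)
    hrefs = [el.get_attribute("href") for el in elements]
    # Skip anchors that have no href attribute at all.
    return [h for h in hrefs if h]


def scrape_articles(driver, start_url):
    """Visit each linked article and return its writer, title and content.

    The element ids are the ones used in the question; they are assumed
    to exist on every article page.
    """
    driver.get(start_url)
    urls = collect_hrefs(driver)  # read everything needed from the list page first
    results = []
    for url in urls:
        driver.get(url)  # safe: no element from the list page is touched again
        results.append({
            "url": url,
            "title": driver.find_element_by_id("activity-name").text,
            "writer": driver.find_element_by_id("post-user").text,
            "content": driver.find_element_by_id("js_content").text,
        })
    return results
```

With a real browser this would be called as `scrape_articles(webdriver.PhantomJS(), 'http://chuansong.me')`, followed by `driver.quit()` once the results have been collected.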
PHP中文网 · 2741 days ago · 837 views

Replies (1)

  • PHPz

    PHPz, 2017-04-18 10:08:34

    I modified the code, but it still runs very slowly even with image loading disabled, and the same error sometimes occurs again. Could anyone tell me what can be changed or optimized? The code is as follows:

    __author__ = 'Administrator'
    
    from selenium import webdriver
    import time
    
    cap = webdriver.DesiredCapabilities.PHANTOMJS
    cap["phantomjs.page.settings.resourceTimeout"] = 1000
    cap["phantomjs.page.settings.loadImages"] = False
    #cap["phantomjs.page.settings.javascriptEnabled"] = False
    cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
    driver = webdriver.PhantomJS(desired_capabilities=cap)
    
    #driver = webdriver.PhantomJS()
    driver.get('http://chuansong.me')
    length = len(driver.find_elements_by_class_name('question_link'))
    for i in range(0,length):
        alla = driver.find_elements_by_class_name('question_link')
        a = alla[i]
        print(a)
        if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
            a.click()
            driver.get(a.get_attribute('href'))
            title = driver.find_element_by_id('activity-name').text
            writer = driver.find_element_by_id('post-user').text
            content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
            print(writer,title,content)
            driver.back()
            time.sleep(8)
    driver.close()
    driver.quit()
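
Two things in this version stand out. First, `a.click()` may already navigate away from the list page, so the following `a.get_attribute('href')` touches an element from the previous page, which is probably why the error still appears from time to time; reading the `href` into a variable before navigating avoids that. Second, the fixed `time.sleep(8)` plus `driver.back()` after every article likely accounts for most of the running time. The sketch below replaces a fixed sleep with polling that returns as soon as the content is available. `wait_until` is a hypothetical helper written for this illustration; Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.2):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Exceptions raised by condition() (for example
    "element not found yet") are treated as "not ready" and retried.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            value = condition()
            if value:
                return value
        except Exception as exc:
            last_error = exc
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %.1f seconds" % timeout
            ) from last_error
        time.sleep(poll)
```

For example, `title = wait_until(lambda: driver.find_element_by_id('activity-name').text)` waits only as long as the page actually needs, instead of always pausing for eight seconds.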
