python - Cache problem when crawling with selenium and phantomjs?

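I ran into a problem while crawling with selenium and phantomjs. (I used Lantern as a proxy while crawling, because the site cannot be reached directly.)

The code is as follows: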

from selenium import webdriver
import time 
driver = webdriver.PhantomJS()
driver.get('http://chuansong.me')
alla = driver.find_elements_by_class_name('question_link')
for a in alla:
    a = a.get_attribute('href')
    print(a)
    driver.get(a)
    title = driver.find_element_by_id('activity-name').text
    writer = driver.find_element_by_id('post-user').text
    content = driver.find_element_by_id('js_content').text
    print(writer,title,content)
    #time.sleep(8)
driver.close()
driver.quit()

It scrapes the content of one link and then fails with this error:

Traceback (most recent call last):
  File "D:/python-work/test.py", line 10, in <module>
    a = a.get_attribute('href')
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 141, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 494, in _execute
    return self._parent.execute(command, params)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 236, in execute
    self.error_handler.check_response(response)
  File "D:\Program Files\Python35-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 192, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: {"errorMessage":"Element does not exist in cache","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:60284","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"GET","url":"/attribute/href","urlParsed":{"anchor":"","query":"","file":"href","directory":"/attribute/","path":"/attribute/href","relative":"/attribute/href","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/attribute/href","queryKey":{},"chunks":["attribute","href"]},"urlOriginal":"/session/bcbced70-c66a-11e6-a824-4b87531d9c78/element/:wdc:1482207278197/attribute/href"}}
Screenshot: available via screen

All replies (1)

  • PHPz  2017-04-18 10:08:34

    Dear experts, I have modified the code, but it runs very slowly even with image loading disabled, and the same problem still occurs from time to time. Please tell me what could be changed or optimized. The code is as follows:

    __author__ = 'Administrator'
    
    from selenium import webdriver
    import time
    
    cap = webdriver.DesiredCapabilities.PHANTOMJS
    cap["phantomjs.page.settings.resourceTimeout"] = 1000
    cap["phantomjs.page.settings.loadImages"] = False
    #cap["phantomjs.page.settings.javascriptEnabled"] = False
    cap["phantomjs.page.settings.localToRemoteUrlAccessEnabled"] = False
    driver = webdriver.PhantomJS(desired_capabilities=cap)
    
    #driver = webdriver.PhantomJS()
    driver.get('http://chuansong.me')
    length = len(driver.find_elements_by_class_name('question_link'))
    for i in range(0,length):
        alla = driver.find_elements_by_class_name('question_link')
        a = alla[i]
        print(a)
        if 'question_link' in a.get_attribute('class') or 'n' in a.get_attribute('href'):
            a.click()
            driver.get(a.get_attribute('href'))
            title = driver.find_element_by_id('activity-name').text
            writer = driver.find_element_by_id('post-user').text
            content = driver.find_element_by_id('js_content').get_attribute('outerHTML')
            print(writer,title,content)
            driver.back()
            time.sleep(8)
    driver.close()
    driver.quit()
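
    One likely cause of the StaleElementReferenceException in both versions: element objects found on the list page stop being valid once the browser navigates elsewhere (driver.get(), a.click(), driver.back()), so calling get_attribute('href') on them afterwards fails. A minimal sketch of one common workaround is to read every href into a plain string first and only then visit the pages. It assumes the Selenium 3-era find_elements_by_* methods used in this thread; the function names collect_links and scrape_articles are hypothetical, not part of Selenium.

```python
def collect_links(driver, class_name="question_link"):
    """Return the href strings of all matching links on the current page."""
    # Read every href while the elements are still attached to the page;
    # plain strings remain usable after the browser navigates away.
    elements = driver.find_elements_by_class_name(class_name)
    hrefs = [el.get_attribute("href") for el in elements]
    return [h for h in hrefs if h]  # skip anchors without an href


def scrape_articles(driver, start_url):
    """Visit each linked article and return (writer, title, content) tuples."""
    driver.get(start_url)
    results = []
    for href in collect_links(driver):
        # Navigating invalidates old element objects, but not the strings.
        driver.get(href)
        title = driver.find_element_by_id("activity-name").text
        writer = driver.find_element_by_id("post-user").text
        content = driver.find_element_by_id("js_content").text
        results.append((writer, title, content))
    return results
```

    With the driver created as in the code above, this would be called as scrape_articles(driver, 'http://chuansong.me'), followed by driver.quit(). Because the list page is loaded only once and there is no driver.back() or repeated lookup per iteration, it may also reduce the run time, though how much depends on the site and the proxy.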
