While using Selenium to crawl 12306, I found that PhantomJS does not work but chromedriver does. Presumably the site detects and blocks PhantomJS. However, chromedriver opens a visible browser window, and crawling with it is slow.
I have two questions. I have searched Google for a long time without finding an effective solution.
1. How can I disguise PhantomJS as much as possible?
2. How can I set up chromedriver so that it does not display a window, or are there other ways to improve crawling efficiency?
Thanks!
PHP中文网 2017-05-18 10:55:13
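You can do this with PyVirtualDisplay, which runs the browser inside a virtual display instead of on your screen (it relies on Xvfb being installed, so this applies to Linux-like systems). The code looks roughly like this: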
#!/usr/bin/env python
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
# Chrome now runs inside the virtual display,
# so no browser window appears on screen.
browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
print(browser.title)
browser.quit()
display.stop()
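As an alternative to a virtual display, Chrome 59 and later can run without any window through its own headless mode. The following is a minimal sketch, not a tested recipe for 12306: the helper names headless_chrome_arguments and make_headless_chrome are made up for illustration, and the options= keyword assumes a reasonably recent Selenium release (older ones take chrome_options= instead).

```python
def headless_chrome_arguments(width=800, height=600):
    """Return the command-line switches for a windowless Chrome.

    Hypothetical helper; the switches themselves are standard Chrome flags.
    """
    return [
        "--headless",     # run without opening a browser window
        "--disable-gpu",  # commonly recommended with early headless builds
        "--window-size=%d,%d" % (width, height),
    ]


def make_headless_chrome(width=800, height=600):
    """Start chromedriver with the headless switches applied."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments(width, height):
        options.add_argument(argument)
    # Assumption: a Selenium version that accepts options=
    # (older releases use chrome_options= instead).
    return webdriver.Chrome(options=options)
```

Calling make_headless_chrome() returns a driver that is used exactly like the one above (get, title, quit), without needing display.start() or display.stop().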
I don't know whether you have modified the request headers that PhantomJS sends. With chromedriver, the language and User-Agent can be set through ChromeOptions like this:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')
options.add_argument('user-agent="Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20"')
browser = webdriver.Chrome(chrome_options=options)
url = "https://baidu.com"
browser.get(url)
browser.quit()
This method modifies the request headers sent by the browser, here Chrome; the same approach of overriding the User-Agent applies to PhantomJS. You can also try this
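For PhantomJS itself, headers are usually overridden through desired capabilities rather than ChromeOptions. The sketch below assumes Selenium 3.x, since the PhantomJS driver was removed in Selenium 4. The helper names and the User-Agent string are illustrative only, and changing headers alone may not be enough to get past the site's detection.

```python
# Example desktop User-Agent; an illustrative value, not one known to pass 12306.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def phantomjs_header_settings(user_agent, accept_language="zh-CN,zh;q=0.8"):
    """Build the capability entries that override PhantomJS request headers.

    Hypothetical helper; the keys follow the GhostDriver capability naming.
    """
    return {
        # Replaces the default User-Agent, which contains "PhantomJS".
        "phantomjs.page.settings.userAgent": user_agent,
        # Adds a custom header to every request the page makes.
        "phantomjs.page.customHeaders.Accept-Language": accept_language,
    }


def make_phantomjs(user_agent=EXAMPLE_USER_AGENT):
    """Start PhantomJS with the overridden headers (Selenium 3.x only)."""
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    capabilities = dict(DesiredCapabilities.PHANTOMJS)  # copy the defaults
    capabilities.update(phantomjs_header_settings(user_agent))
    return webdriver.PhantomJS(desired_capabilities=capabilities)
```

The driver returned by make_phantomjs() is used the same way as the Chrome one: call get(url), read what you need, then quit().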