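I'm using Python to monitor a web page that updates at irregular intervals, looking for a keyword such as 'ABC'. The rough structure of the program is below.
The basic approach: get the page source with selenium, parse it with BeautifulSoup, then match against the result, looping once every 2 seconds.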
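import re
import time

from bs4 import BeautifulSoup

# `browser` is assumed to be a selenium WebDriver that already has the page open
starttime = time.time()
while True:
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    abc = soup.find_all(string=re.compile("ABC"))  # text nodes containing the keyword
    if not abc:
        ...  # (elided in the original)
    else:
        ...  # (elided in the original)
    browser.refresh()
    # sleep until the next 2-second boundary
    time.sleep(2.0 - ((time.time() - starttime) % 2.0))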
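The problem is that this program depends heavily on network speed: a single browser.refresh() can take a full second.
Is there a way to tell that the page has changed without refreshing it?
Or is there some other approach that keeps the program from being slowed down by the network?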
黄舟2017-04-18 10:19:02
HTTP Last-Modified
1) What is "Last-Modified"?
When the browser requests a URL for the first time, the server returns status 200 together with the requested resource and a Last-Modified header that records when the file was last modified on the server. The format looks like this:
Last-Modified: Fri, 12 May 2006 18:53:33 GMT
When the client requests the same URL a second time, the browser sends an If-Modified-Since header, as defined by the HTTP protocol, asking whether the file has been modified since that time:
If-Modified-Since: Fri, 12 May 2006 18:53:33 GMT
If the resource on the server has not changed, the server returns HTTP 304 (Not Modified) with an empty body, which saves the data transfer. When the server-side resource changes or the server is restarted, the resource is sent again and the response looks like the one for the first request. This keeps the server from resending resources unnecessarily, while still ensuring the client gets the latest version once the server changes.
Example: sending the 'If-Modified-Since' header with requests. Status code 304 (Not Modified) means the page has not changed:
>>> import requests as req
>>> url='http://www.guancha.cn/'
>>> rsp=req.head(url,headers={'If-Modified-Since':'Sun, 05 Feb 2017 05:39:11 GMT'})
>>> rsp
<Response [304]>
>>> rsp.headers
{'Server': 'NWS_TCloud_S1', 'Content-Type': 'text/html', 'Date': 'Sun, 05 Feb 2017 05:45:20 GMT', 'Cache-Control': 'max-age=60', 'Expires': 'Sun, 05 Feb 2017 05:46:20 GMT', 'Content-Length': '0', 'Connection': 'keep-alive'}
With the time changed to the previous day (the 4th), the server returns status code 200, and the response includes 'Last-Modified': 'Sun, 05 Feb 2017 06:00:03 GMT', the time of the last modification:
>>> hds={'If-Modified-Since':'Sat, 04 Feb 2017 05:39:11 GMT'} # time changed to the previous day (the 4th)
>>> rsp=req.head(url,headers=hds)
>>> rsp
<Response [200]>
>>> rsp.headers
{'Last-Modified': 'Sun, 05 Feb 2017 06:00:03 GMT', 'Date': 'Sun, 05 Feb 2017 06:04:59 GMT', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'X-Daa-Tunnel': 'hop_count=2', 'X-Cache-Lookup': 'Hit From Disktank3 Gz, Hit From Inner Cluster, Hit From Upstream', 'Server': 'nws_ocmid_hy', 'Content-Type': 'text/html', 'Expires': 'Sun, 05 Feb 2017 06:05:59 GMT', 'Cache-Control': 'max-age=60', 'Content-Length': '62608'}
>>>
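The REPL session above can be turned into a polling loop. What follows is only a minimal sketch, not a definitive implementation: the names fetch_if_modified and watch are made up for illustration, and it assumes both that the server actually sends Last-Modified and honors If-Modified-Since (many dynamically generated pages do neither, in which case every poll returns 200) and that the keyword appears in the raw HTML rather than being rendered by JavaScript.

```python
import time


def fetch_if_modified(session, url, last_modified=None, timeout=5):
    """Conditional GET: return (body, last_modified); body is None on 304."""
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    rsp = session.get(url, headers=headers, timeout=timeout)
    if rsp.status_code == 304:
        return None, last_modified  # unchanged, nothing to parse
    rsp.raise_for_status()
    # Remember the server's timestamp for the next request (it may be absent).
    return rsp.text, rsp.headers.get("Last-Modified", last_modified)


def watch(url, keyword, interval=2.0, session=None):
    """Poll url forever and report whenever a changed page contains keyword."""
    if session is None:
        import requests  # imported lazily so fetch_if_modified has no hard dependency

        session = requests.Session()  # reuses the connection between polls
    last_modified = None
    while True:
        body, last_modified = fetch_if_modified(session, url, last_modified)
        if body is not None and keyword in body:
            print("keyword found:", keyword)
        time.sleep(interval)
```

Calling watch('http://www.guancha.cn/', 'ABC') would start the loop. A 304 response carries no body, so unchanged polls cost little more than one round trip.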
伊谢尔伦2017-04-18 10:19:02
Either way, you have to hit the source site to get the data. If you don't fetch it, how would you know whether anything has changed?
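If the page has to be fetched on every poll anyway, one way to keep the per-poll work small is to hash the raw bytes and only run the keyword match when the hash differs from the previous one. This is a rough sketch under stated assumptions: check_page is a hypothetical helper, the page is assumed to be UTF-8, and a page that embeds a timestamp or random token will look "changed" on every poll.

```python
import hashlib
import re

KEYWORD = re.compile(r"ABC")  # the example keyword from the question


def check_page(body, previous_digest=None):
    """Return (changed, matches, digest) for the raw page bytes."""
    digest = hashlib.sha256(body).hexdigest()
    if digest == previous_digest:
        return False, [], digest  # identical bytes: skip the matching step
    text = body.decode("utf-8", errors="replace")
    return True, KEYWORD.findall(text), digest
```

The bytes could come from something like requests.get(url).content, which avoids a full browser refresh but only sees what the server sends, not content added later by scripts.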
大家讲道理2017-04-18 10:19:02
Updates like this are probably loaded via ajax. I'd suggest looking through the site's JS code to find the request URL and its parameters; if possible, request that URL directly.
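As an illustration of requesting the ajax URL directly, here is a hedged sketch. The endpoint, its parameters, and the JSON shape are not known from this thread, so api_url and params are placeholders to be filled in from the browser's network tab; contains_keyword and poll_endpoint are hypothetical helper names, and the response is assumed to be JSON.

```python
def contains_keyword(payload, keyword):
    """Recursively search every string value in decoded JSON for keyword."""
    if isinstance(payload, str):
        return keyword in payload
    if isinstance(payload, dict):
        return any(contains_keyword(v, keyword) for v in payload.values())
    if isinstance(payload, list):
        return any(contains_keyword(v, keyword) for v in payload)
    return False  # numbers, booleans, None


def poll_endpoint(session, api_url, params, keyword, timeout=5):
    """Fetch the (hypothetical) ajax endpoint once and test its JSON body."""
    rsp = session.get(api_url, params=params, timeout=timeout)
    rsp.raise_for_status()
    return contains_keyword(rsp.json(), keyword)
```

Here session would be a requests.Session(). The JSON payload is usually much smaller than the full page, so each poll should be correspondingly cheaper than browser.refresh().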