
Web crawler - Python scraping question: why can't I scrape the data from this page? The site opens fine in a browser. Please help.

import time

import requests

# Current epoch time, renamed so it no longer shadows the `time` module
timestamp = int(time.time())

session = requests.session()
user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36')
headers = {
    'User-Agent': user_agent,
    'Host': 'xygs.gsaic.gov.cn',
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
# pripid is passed via `params`, so the URL itself no longer repeats it
params = {'pripid': '62030200052016012700011'}
cookies = {'JSESSIONID': '2B33BC6D34DF44BE8D76C2AE20701D95'}
url = 'http://xygs.gsaic.gov.cn/gsxygs/smallEnt!view.do'

page = session.get(url, headers=headers, params=params, cookies=cookies).text
print(page)

I can't get the information inside the table. Why not?

PHP中文网 · 2741 days ago · 229

All replies (2)

  • 高洛峰 (2017-04-17 17:51:31)

    https://segmentfault.com/q/1010000005117988
    Your previous question was answered at the link above; I don't know whether it solved your problem, since there was no response. If it did, remember to accept the answer. The code for this question is as follows:

    import requests

    headers = {
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6'
    }
    url = 'http://xygs.gsaic.gov.cn/gsxygs/smallEnt!view.do?pripid=62030200052016012700011'
    r = requests.get(url, headers=headers)
    print(r.text)
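    Once `r.text` comes back, the cells still have to be pulled out of the HTML. A minimal sketch using only the standard library's `html.parser` (the sample HTML string below is a made-up stand-in for the real page, not actual data from the site):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td> cell in the page."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

# Hypothetical snippet standing in for r.text
html = "<table><tr><td>名称</td><td>某企业</td></tr></table>"
parser = TableParser()
parser.feed(html)
print(parser.cells)  # the collected cell texts
```

    For anything beyond a flat table, a dedicated parser such as BeautifulSoup is usually less brittle than hand-rolled handlers.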

  • 迷茫 (2017-04-17 17:51:31)

    The web form is loaded with AJAX. You can use the Network tab in Chrome's developer tools to find the request that actually returns the table data.
    Also, crawling is not just a matter of the Python language: you'd better pick up some web-development knowledge, especially JavaScript and the HTTP protocol. Sorry, I didn't read carefully at first because I answered from my phone.

    I just checked: it's because you are missing the Accept-Language request header.
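    The header fix can be sanity-checked without touching the network, using only the standard library: the snippet below builds a request object carrying Accept-Language and reads the header back out (the `Mozilla/5.0` User-Agent string here is just a placeholder, not the thread's full UA).

```python
import urllib.request

url = 'http://xygs.gsaic.gov.cn/gsxygs/smallEnt!view.do?pripid=62030200052016012700011'
headers = {
    'User-Agent': 'Mozilla/5.0',  # placeholder UA; any browser-like string
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',  # the header this server appears to require
}

# Build the request without sending it, to confirm the header is attached
req = urllib.request.Request(url, headers=headers)
print(req.get_header('Accept-language'))  # urllib capitalizes header names this way
```

    The same `headers` dict can be passed straight to `requests.get(url, headers=headers)`.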
