>웹 프론트엔드 >HTML 튜토리얼 >POST抓取页面的问题_html/css_WEB-ITnose

POST抓取页面的问题_html/css_WEB-ITnose

WBOY
WBOY원래의
2016-06-24 11:48:591115검색

某同学反映,spider通过post方式抓取某站点有问题,老是302到自己,具体如下:

url :http://www.meituan.com/multiact/default/deal/25814805.html

post数据:"yui_3_16_0_1_1423700000_000:{\"act\":\"deal/dynamiccomponent\",\"args\":25814805,\"__referer\":\"\"}"

通过python可以正常抓取,抓取代码如下:

import urllibimport urllib2values = {        'yui_3_16_0_1_1423700000_000':'{"act":"deal/dynamiccomponent","args":25814805,"__referer":""}',}header={        "X-Requested-With":"XMLHttpRequest",}url="http://www.meituan.com/multiact/default/deal/25814805.html"data = urllib.urlencode(values)print datareq = urllib2.Request(url, data,header)response = urllib2.urlopen(req)the_page = response.read()print the_page

但是自己构造http请求包无法抓取,请求包如下:
POST /multiact/default/deal/25814805.html HTTP/1.1^M
Host: www.meituan.com^M
Content-Length: 126^M
Connection: close^M
Content-Type: application/x-www-form-urlencoded^M
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2^M
Accept-Encoding: gzip^M
Accept: */*^M
X-Requested-With:  XMLHttpRequest^M


抓取失败原因,缺少该参数:Content-Type: application/x-www-form-urlencoded^M

加上就可以了,具体如下:

POST /multiact/default/deal/25814805.html HTTP/1.1^M
Host: www.meituan.com^M
Content-Length: 126^M
Connection: close^M
Content-Type: application/x-www-form-urlencoded^M
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2^M
Accept-Encoding: gzip^M
Accept: */*^M
X-Requested-With:  XMLHttpRequest^M

Content-Type: application/x-www-form-urlencoded^M

성명:
본 글의 내용은 네티즌들의 자발적인 기여로 작성되었으며, 저작권은 원저작자에게 있습니다. 본 사이트는 이에 상응하는 법적 책임을 지지 않습니다. 표절이나 침해가 의심되는 콘텐츠를 발견한 경우 admin@php.cn으로 문의하세요.