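Even a simple request to Xueqiu through Scrapy fails. I know you have to visit Xueqiu once first, because the cookie is needed before the link will actually open. Scrapy supposedly takes care of cookies for you and picks them up automatically. I enabled cookies in the middleware following this link, http://stackoverflow.com/ques..., so why does it still return a 404 error? I have searched for days without finding an answer, which is frustrating. Could someone share some simple code showing how to access it? Thanks.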
class XueqiuSpider(scrapy.Spider):
    name = "xueqiu"
    start_urls = "https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1"
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def __init__(self, url = None):
        self.user_url = url

    def start_requests(self):
        yield scrapy.Request(
            url = self.start_urls,
            headers = self.headers,
            meta = {
                'cookiejar': 1
            },
            callback = self.request_captcha
        )

    def request_captcha(self,response):
        print response
Error log:
2017-03-04 12:42:02 [scrapy.core.engine] INFO: Spider opened
2017-03-04 12:42:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-04 12:42:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://xueqiu.com/robots.txt>
Set-Cookie: aliyungf_tc=AQAAAGFYbBEUVAQAPSHDc8pHhpYZKUem; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://xueqiu.com/robots.txt> (referer: None)
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>
Set-Cookie: aliyungf_tc=AQAAAPTfyyJNdQUAPSHDc8KmCkY5slST; Path=/; HttpOnly
2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1> (referer: None)
2017-03-04 12:42:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>: HTTP status code is not handled or not allowed
2017-03-04 12:42:12 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-04 12:42:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
PHP中文网 2017-04-18 10:26:03
I tried it again, and you really don't need to log in. I was overthinking it. Just request xueqiu.com first to get the cookie, then request the API address. That's it.
==============The dividing line of shame=============
My original answer, from my own testing, was that you need to log in:
import scrapy
import hashlib
from scrapy.http import FormRequest, Request


class XueqiuScrapeSpider(scrapy.Spider):
    name = "xueqiu_scrape"
    allowed_domains = ["xueqiu.com"]

    def start_requests(self):
        m = hashlib.md5()
        m.update(b"your password")  # put your password here
        password = m.hexdigest().upper()
        form_data={
            "telephone": "your account",  # put your username here
            "password": password,
            "remember_me": str(),
            "areacode": "86",
        }
        print(form_data)
        return [FormRequest(
            url="https://xueqiu.com/snowman/login",
            formdata=form_data,
            meta={"cookiejar": 1},
            callback=self.loged_in
        )]

    def loged_in(self, response):
        # print(response.url)
        return [Request(
            url="https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1",
            meta={"cookiejar": response.meta["cookiejar"]},
            callback=self.get_result,
        )]

    def get_result(self, response):
        print(response.body)
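In addition, the website really does check the User-Agent. You can set it in settings.py, or of course write it in the spider file yourself. The password is sent as its MD5 hash (the uppercase hex string computed in the code).
Oh, one more thing: because I registered with a phone number, form_data has these fields. If you signed up some other way, just use Chrome's developer tools to see which parameters the POST request sends, and adjust the contents of form_data yourself.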
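The password hashing and form fields described here can be pulled into two small helpers. This is only a sketch under the thread's assumptions: the function names `md5_upper` and `build_login_form` are made up for illustration, and the field names are the ones from the phone-number login above, which may differ for other account types or may have changed on the site since.

```python
import hashlib


def md5_upper(password: str) -> str:
    """Return the uppercase hex MD5 digest of a password."""
    return hashlib.md5(password.encode("utf-8")).hexdigest().upper()


def build_login_form(telephone: str, password: str, areacode: str = "86") -> dict:
    """Build the login form fields used for a phone-number account."""
    return {
        "telephone": telephone,
        "password": md5_upper(password),  # the hash is sent, not the plain text
        "remember_me": "",                # same value as str() in the answer
        "areacode": areacode,
    }


if __name__ == "__main__":
    # Placeholder credentials; replace them with your own.
    print(build_login_form("your account", "your password"))
```

The resulting dict can be passed as `formdata` to a `FormRequest`. For a different login method, check the POST parameters in the browser and change the keys to match.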
黄舟 2017-04-18 10:26:03
Haha, thank you, that clears up several days of confusion. I used to do this with requests, with no login needed. Here is the code:
import requests

# One session keeps the cookies from the first request for all later ones.
session = requests.Session()
session.headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
# Visit the home page first so the session picks up the cookies.
session.get('https://xueqiu.com')
for page in range(1,100):
    url = 'https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=%s&size=1' % page
    print(url)
    r = session.get(url)
    # print(r.json()["list"])
    a = r.text