

python - How do I access a site with cookies in Scrapy?

Simply requesting Xueqiu with Scrapy returns an error. I know you have to visit Xueqiu once first; the link only really opens once you have the cookie information. Scrapy supposedly takes care of cookies for you and fetches them automatically. Following this link, http://stackoverflow.com/ques..., I have already enabled cookies in the middleware, but I still get a 404 error. Why? I've been searching for days without finding an answer, and it's frustrating. Could someone post a simple snippet showing how to access it? Thanks.





import scrapy


class XueqiuSpider(scrapy.Spider):
    name = "xueqiu"
    # note: a single URL string, not Scrapy's usual start_urls list --
    # it is only used directly in start_requests() below
    start_urls = "https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1"
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        # this Host header points at www.zhihu.com while the request goes to
        # xueqiu.com -- apparently left over from another example, and a
        # likely cause of a 404 all by itself
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def __init__(self, url=None, *args, **kwargs):
        super(XueqiuSpider, self).__init__(*args, **kwargs)
        self.user_url = url

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls,
            headers=self.headers,
            meta={'cookiejar': 1},  # use a named cookie jar for this chain
            callback=self.request_captcha,
        )

    def request_captcha(self, response):
        print(response)

The error log:

2017-03-04 12:42:02 [scrapy.core.engine] INFO: Spider opened
2017-03-04 12:42:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-04 12:42:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://xueqiu.com/robots.txt>
Set-Cookie: aliyungf_tc=AQAAAGFYbBEUVAQAPSHDc8pHhpYZKUem; Path=/; HttpOnly

2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://xueqiu.com/robots.txt> (referer: None)
********Current UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6************
2017-03-04 12:42:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>
Set-Cookie: aliyungf_tc=AQAAAPTfyyJNdQUAPSHDc8KmCkY5slST; Path=/; HttpOnly

2017-03-04 12:42:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1> (referer: None)
2017-03-04 12:42:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1>: HTTP status code is not handled or not allowed
2017-03-04 12:42:12 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-04 12:42:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
高洛峰 · asked 2784 days ago

All replies (2)

  • PHP中文网 2017-04-18 10:26:03

    I tried it again... you really don't need to log in... I was overthinking it... Just request xueqiu.com first, then request the API address once you have the cookies. That's all.

    ============== Embarrassed dividing line ==============

    As I have since confirmed, you do need to log in after all...

    import scrapy
    import hashlib
    from scrapy.http import FormRequest, Request

    class XueqiuScrapeSpider(scrapy.Spider):
        name = "xueqiu_scrape"
        allowed_domains = ["xueqiu.com"]

        def start_requests(self):
            # the site expects the password as an uppercase hex MD5 digest
            m = hashlib.md5()
            m.update(b"your password")  # fill in your password here
            password = m.hexdigest().upper()
            form_data = {
                "telephone": "your account",  # fill in your account here
                "password": password,
                "remember_me": str(),  # empty string: do not stay logged in
                "areacode": "86",
            }
            print(form_data)
            return [FormRequest(
                url="https://xueqiu.com/snowman/login",
                formdata=form_data,
                meta={"cookiejar": 1},
                callback=self.logged_in,
            )]

        def logged_in(self, response):
            # the login cookies travel on in the same cookie jar
            return [Request(
                url="https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=1&size=1",
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.get_result,
            )]

        def get_result(self, response):
            print(response.body)

    Also, the site does check the User-Agent. It can be set in settings.py, or of course you can set it yourself in the spider file. The password is the MD5-encrypted string.
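    That MD5 step is easy to sanity-check in isolation; for example, with the obviously hypothetical password "test":

```python
import hashlib

# the login form expects the password as an uppercase hex MD5 digest
m = hashlib.md5()
m.update(b"test")  # "test" is a stand-in password, for illustration only
print(m.hexdigest().upper())  # 098F6BCD4621F373CADE4E832627B4F6 without the typo: 098F6BCD4621D373CADE4E832627B4F6
```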
    Oh, and one more thing: since I registered with my mobile phone, form_data has these particular fields. If you signed up some other way, just use Chrome's dev tools to see which parameters your POST request carries, adjust the contents of form_data accordingly, and it will work.
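    As a sketch of the settings.py route for the User-Agent (the UA string is just an example; COOKIES_ENABLED is already Scrapy's default):

```python
# settings.py -- illustrative values
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
COOKIES_ENABLED = True  # Scrapy's default: CookiesMiddleware stores and resends cookies
COOKIES_DEBUG = True    # log Set-Cookie / Cookie headers, like the crawl log in the question
```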

  • 黄舟 2017-04-18 10:26:03

    Haha, thanks, that cleared up several days of confusion. I had done it with requests before, no login needed. Posting the code:

    import requests

    session = requests.Session()
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    }
    # warm-up request: pick up the cookies before hitting the API
    session.get('https://xueqiu.com')
    for page in range(1, 100):
        url = 'https://xueqiu.com/stock/f10/finmainindex.json?symbol=SZ000001&page=%s&size=1' % page
        print(url)
        r = session.get(url)
        # print(r.json())
        a = r.text
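    Why does the warm-up request work? The session stores the Set-Cookie values from the first response and resends them on later requests to the same domain. The same mechanism can be sketched offline with the standard library's http.cookiejar, using a made-up cookie value modeled on the aliyungf_tc cookie in the log above:

```python
import email.message
from http.cookiejar import CookieJar
from urllib.request import Request

class FakeResponse:
    """Minimal stand-in for an HTTP response carrying one Set-Cookie header."""
    def info(self):
        msg = email.message.Message()
        msg["Set-Cookie"] = "aliyungf_tc=EXAMPLEVALUE; Path=/"
        return msg

jar = CookieJar()

# 1. the "warm-up" request: the jar extracts the cookie from the response
first = Request("https://xueqiu.com/")
jar.extract_cookies(FakeResponse(), first)

# 2. a later request to the same domain automatically gets the cookie back
second = Request("https://xueqiu.com/stock/f10/finmainindex.json")
jar.add_cookie_header(second)
print(second.get_header("Cookie"))  # aliyungf_tc=EXAMPLEVALUE
```

    requests.Session and Scrapy's cookiejar meta key wrap exactly this bookkeeping for you.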
