python - How can Scrapy crawl the thunder links in a web page?

Target URL:
http://www.xiaopian.com/html/...

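This is the source code as shown in Chrome.

This is what response.css().extract() shows after running scrapy shell url.

Why do the two differ? The content Scrapy fetches does not contain the corresponding thunder links, only the plain ftp links.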

PHPz · 2769 days ago · 1174


  • 黄舟 2017-04-18 09:43:53

    A crawler sees what you get from right-click > View page source, not what appears in Inspect Element. The code shown in Inspect Element has already been rendered by JavaScript and differs from the original source, while the code the crawler fetches has not been rendered by JavaScript; it is the original source.
    I took a look and found that the Thunder download address is computed in JavaScript.

    The specific code is as follows:

    function ThunderEncode(t_url) {
        var thunderPrefix = "AA";
        var thunderPosix = "ZZ";
        var thunderTitle = "thunder://";
    
        var thunderUrl = thunderTitle + base64encode(utf16to8(thunderPrefix + t_url + thunderPosix));
    
        return thunderUrl;
    }
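
Since Scrapy only sees the plain ftp link, one option is to redo the same computation on the crawler side. The following is a minimal Python sketch of the function above, not the site's own code. It assumes that the page's `utf16to8` helper produces UTF-8 bytes and that `base64encode` is standard Base64; neither helper body is shown here, so both are assumptions. The name `thunder_encode` is made up for illustration.

```python
import base64


def thunder_encode(url: str) -> str:
    """Build a thunder:// link from a plain download URL.

    Assumption: the page's utf16to8 + base64encode helpers are equivalent
    to UTF-8 encoding followed by standard Base64.
    """
    # Wrap the URL in the fixed "AA" ... "ZZ" markers, as the JS does.
    payload = ("AA" + url + "ZZ").encode("utf-8")
    # Base64-encode the bytes and prepend the thunder:// scheme.
    return "thunder://" + base64.b64encode(payload).decode("ascii")
```

In a spider, this could be applied to each ftp link extracted from the response before the item is yielded.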

    I tested it:
    Passing the address ftp://a:a@dygod18.com:21/[电影天堂www.dy2018.com]忍者神龟2破影而出BD中英双字.rmvb as the parameter produces a Thunder link, but it differs from the one on the web page. On closer inspection, the difference comes down to URL-encoding of the Chinese characters. As long as both sides use the same encoding, there is no problem.
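
To check whether two Thunder links point at the same file despite the URL-encoding difference, one approach is to decode both and compare them after percent-decoding. This is only a sketch under assumptions: the payload is taken to be UTF-8 text wrapped in "AA" and "ZZ", and the function names `thunder_decode` and `same_target` are hypothetical. If the site uses another character encoding (for example GBK), the `decode` call would need adjusting.

```python
import base64
from urllib.parse import unquote


def thunder_decode(link: str) -> str:
    """Recover the wrapped URL from a thunder:// link (assumes UTF-8 payload)."""
    prefix = "thunder://"
    if not link.startswith(prefix):
        raise ValueError("not a thunder:// link")
    raw = base64.b64decode(link[len(prefix):]).decode("utf-8")
    # The payload is expected to look like "AA" + url + "ZZ".
    if not (raw.startswith("AA") and raw.endswith("ZZ")):
        raise ValueError("missing AA/ZZ markers")
    return raw[2:-2]


def same_target(link_a: str, link_b: str) -> bool:
    """Compare two thunder links after undoing any percent-encoding."""
    return unquote(thunder_decode(link_a)) == unquote(thunder_decode(link_b))
```

Two links that differ only in whether the Chinese characters are percent-encoded should compare equal under `same_target`, even though the raw strings differ.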
