

Web crawler - [As pictured] The HTML page crawled by Python differs from the page source shown in the browser

高洛峰 · 2889 days ago · 774

All replies (4)

  • 伊谢尔伦 2017-04-18 09:33:41

    After actual testing, the conclusion is that bs4 changes the order of attributes.

    1. Right-click the page in the browser and compare the two options:

    Inspect Element

    View page source

    2. Compare in a Python 3 program:

    # Regular expression that grabs the first <tr ...> opening tag from the HTML
    import re
    ptn_tr = re.compile(r'<tr[^>]+>')

    # Fetch the page with requests
    import requests as req
    rsp = req.get('http://www.pythonscraping.com/pages/page3.html')
    html = rsp.text
    print('requests:\t', ptn_tr.findall(html)[0])

    # Fetch the same page with urllib
    from urllib.request import urlopen
    rsp = urlopen("http://www.pythonscraping.com/pages/page3.html")
    html = rsp.read().decode()
    print('urlopen:\t', ptn_tr.findall(html)[0])

    # Re-serialize the same HTML through BeautifulSoup and compare
    from bs4 import BeautifulSoup
    html = str(BeautifulSoup(html, "lxml"))
    print('bs4Soup:\t', ptn_tr.findall(html)[0])

    Result:

    requests:     <tr id="gift1" class="gift">
    urlopen:     <tr id="gift1" class="gift">
    bs4Soup:     <tr class="gift" id="gift1">
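
    In other words, only the attribute order changes, not the content. If the goal is to extract those rows, selecting on the parsed tree instead of running a regex over the serialized HTML makes the attribute order irrelevant. A minimal sketch against the same test page (the tag and attribute names are taken from the result above):

    import requests as req
    from bs4 import BeautifulSoup

    rsp = req.get('http://www.pythonscraping.com/pages/page3.html')
    soup = BeautifulSoup(rsp.text, 'lxml')

    # Select by id instead of matching the serialized markup,
    # so it does not matter in which order bs4 writes out the attributes
    row = soup.find('tr', id='gift1')
    print(row.get('id'), row.get('class'))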

  • 阿神 2017-04-18 09:33:41

    Only the order of the class and id attributes is different.
    If you view the source of the same page in Chrome and in Firefox, the order also differs.

  • 高洛峰 2017-04-18 09:33:41

    It is recommended that you post the website, or even your own code, so that everyone can help you debug it. Some difference is normal. But if the content your crawler fetches, saved as a static page, differs from what you see in the browser, the site's anti-crawler mechanism has most likely detected the crawler, so the server is returning different content. There are many ways a site can identify crawlers; see the sketch below for a quick check. If you still have questions, feel free to ask again.
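
    A quick way to check is to save exactly what the crawler receives and open that file in a browser next to the live page. A minimal sketch (the URL here is just the test page from the reply above; use the page you are actually crawling):

    import requests

    # Placeholder URL; substitute the page you are crawling
    url = 'http://www.pythonscraping.com/pages/page3.html'

    rsp = requests.get(url)
    # Write the raw response body without parsing or re-serializing it,
    # then open crawled.html in the browser and compare it with "View page source"
    with open('crawled.html', 'w', encoding=rsp.encoding or 'utf-8') as f:
        f.write(rsp.text)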

  • 巴扎黑 2017-04-18 09:33:41

    I recommend that the poster share the full source code, because the website can detect whether it is being visited by a human in a browser or by a crawler.

    Judging from the current code, it is recommended that you add request headers, in particular the User-Agent line, as in the sketch below.
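
    With requests this could look like the following. A minimal sketch; the User-Agent string is just an example copied from a desktop browser, and the URL is the test page from the first reply:

    import requests

    # Example User-Agent header so the request looks like it comes from a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/58.0.3029.110 Safari/537.36'
    }

    rsp = requests.get('http://www.pythonscraping.com/pages/page3.html', headers=headers)
    print(rsp.status_code)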
