伊谢尔伦2017-04-18 09:33:41
After actual testing, the conclusion is that bs4 changes the order of attributes.
Censorship Element
View the web page source code
import re
ptn_tr = re.compile(r'<tr[^>]+>')
import requests as req
rsp=req.get('http://www.pythonscraping.com/pages/page3.html')
html = rsp.text
print('requests:\t', ptn_tr.findall(html)[0])
from urllib.request import urlopen
rsp = urlopen("http://www.pythonscraping.com/pages/page3.html")
html = rsp.read().decode()
print('urlopen:\t', ptn_tr.findall(html)[0])
from bs4 import BeautifulSoup
html = str(BeautifulSoup(html,"lxml"))
print('bs4Soup:\t', ptn_tr.findall(html)[0])
Result:
requests: <tr id="gift1" class="gift">
urlopen: <tr id="gift1" class="gift">
bs4Soup: <tr class="gift" id="gift1">
阿神2017-04-18 09:33:41
The order of class and id is just different.
If you use Chrome and Firefox to view the source code of the same web page, the order is also different.
高洛峰2017-04-18 09:33:41
It is recommended that the questioner post the website or even his own code so that everyone can help you debug it. It's normal to be different. If the content crawled by your crawler is saved as a static page and is different from what you see with the browser, then the other party's anti-crawler mechanism must have recognized it, so the server will return different information. There are many ways to identify crawlers. If you still have any questions, please feel free to ask again
巴扎黑2017-04-18 09:33:41
The poster recommends that you post all the source code, because the website can identify whether you are operating a human browser or a crawler.
Looking at the current code, it is recommended that you add header information! use-agent That line of code!