import requests
from lxml import etree


def main():
    # Baidu image search (flip version) for the keyword 暴走漫画
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'
    html = requests.get(url).content
    # Also tried BeautifulSoup, with the same (empty) result:
    # soup = BeautifulSoup(html, 'lxml')
    # imgs = soup.find_all('img')
    content = etree.HTML(html)
    # XPath copied from the browser's element inspector
    paths = content.xpath('//*[@id="imgid"]/ul/li[1]/a/img/text()')
    print(paths)


main()
天蓬老师 2017-04-18 10:32:15
When writing a crawler, first run your XPath against the raw source of the page to confirm the data is actually there; if it is not, the content is loaded asynchronously or rendered by JS. A quick check is sketched below.
view-source:https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0
The picture list targeted by the question's XPath does not appear in that source, so we can conclude the pictures are rendered by js.
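A minimal sketch of that check, assuming the URL and XPath from the question: requests fetches only the static HTML (no JavaScript runs), so if the XPath matches nothing here, the element must be built by JS.

import requests
from lxml import etree

url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'

# requests returns the static HTML only, which is exactly what
# view-source shows; no JavaScript is executed.
html = requests.get(url).content
tree = etree.HTML(html)

# Same XPath as in the question (copied from the element inspector,
# which reflects the rendered DOM, not the raw source).
nodes = tree.xpath('//*[@id="imgid"]/ul/li[1]/a/img')
print(nodes)  # an empty list confirms the image list is not in the static page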
Looking at the Network tab in F12 (refresh the page with DevTools open, otherwise the requests are not shown), I did not find any asynchronous request returning the image data, so my guess is that the data is already in the HTML, just embedded inside a JS block and processed when the images are rendered.
View the source the same way as above and search for the parameter objURL: that is where the real image URLs live. There are many of them, mostly in the lower half of the HTML, for example:
http://img3.duitang.com/uploads/item/201608/06/20160806110540_MAcru.jpeg
The rest is up to you~ find a way to parse those real URLs out of the source; one possible approach is sketched below.
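A rough sketch of that parsing step using a regular expression, assuming the page embeds entries of the form "objURL":"http://..." as the view-source search above suggests (adjust the pattern if the markup differs):

import re
import requests

url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'

html = requests.get(url).text

# Pull every value of the objURL parameter out of the embedded JS.
img_urls = re.findall(r'"objURL"\s*:\s*"(.*?)"', html)

for img_url in img_urls:
    print(img_url)

From there, each URL can be fetched with another requests.get call and the bytes written to disk if you want to download the pictures.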