
python - Why doesn't my XPath scrape any content?

# -*- coding: utf-8 -*-

import lxml, requests, sys
from bs4 import BeautifulSoup
from lxml import etree

reload(sys)
sys.setdefaultencoding("utf-8")

def main():
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0'

    req = requests.get(url).content

    # soup = BeautifulSoup(req.content,'lxml')
    # imgs = soup.find_all('img')

    content = etree.HTML(req)
    paths = content.xpath('//*[@id="imgid"]/ul/li[1]/a/img/text()')
    # for img in imgs:
    #     print img

    print paths

main()

阿神 2741 days ago 774

All replies (1)

  • 天蓬老师 2017-04-18 10:32:15

    When you write a crawler that uses XPath, you must first confirm that the data is actually present in the source code of the web page. If it is not, the content is loaded asynchronously.
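    That check can also be done from code instead of by eye. The following is only a minimal sketch under stated assumptions: it uses the standard-library `html.parser` rather than lxml, the helper names `ImgCounter` and `count_imgs_in_source` are made up for illustration, and the depth tracking assumes the markup closes its tags reasonably well.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImgCounter(HTMLParser):
    """Count <img> tags nested inside the element with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while the parser is inside the target element
        self.img_count = 0

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == "img":
                self.img_count += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag not in VOID_TAGS and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def count_imgs_in_source(html_text, target_id="imgid"):
    """Return how many <img> tags the raw HTML contains under target_id."""
    parser = ImgCounter(target_id)
    parser.feed(html_text)
    parser.close()
    return parser.img_count


# Two made-up pages: one with the list in the HTML, one that fills it in via JS.
static_page = '<div id="imgid"><ul><li><a><img src="a.jpg"></a></li></ul></div>'
js_page = '<div id="imgid"></div><script>var data = {};</script><img src="logo.png">'

static_count = count_imgs_in_source(static_page)
js_count = count_imgs_in_source(js_page)
```

    With a real page you would pass in `requests.get(url).text`. A count of zero suggests the list is built by JavaScript, so no XPath against the raw HTML can match it. Separately, even when the images are in the source, `img/text()` selects text nodes, which an `<img>` element does not have; the address lives in an attribute such as `@src`.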

    1. Open this link in the browser to view the page source, then press Ctrl+F and search for imgid

    view-source:https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%9A%B4%E8%B5%B0%E6%BC%AB%E7%94%BB&pn=0

    2. Observe

    The picture list is not found under imgid in the source, so we can conclude that the pictures are loaded by JS

    3. Find the data

    In the Network tab of the F12 developer tools (entries only appear after you refresh the page), I found no asynchronous request that loads the image information. So my guess is that the data is already in the HTML, embedded in JS and processed when the images are loaded

    View the source code the same way as above and search for the parameter objURL to find the real URLs

    // There are many of these, concentrated in the lower half of the HTML
    http://img3.duitang.com/uploads/item/201608/06/20160806110540_MAcru.jpeg

    Solution

    The rest is up to you~ Find a way to parse the real URLs out of the page source!
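    One possible way to do that parsing, offered only as a sketch: a regular expression over the raw page source. It assumes each address appears in the embedded JS as a quoted key/value pair of the form `"objURL":"..."`, which you should verify against the actual source first. The name `extract_obj_urls` and the sample string are hypothetical.

```python
import re

# Assumption: the embedded JS stores each address as "objURL":"<url>".
OBJ_URL_PATTERN = re.compile(r'"objURL"\s*:\s*"([^"]+)"')


def extract_obj_urls(html_text):
    """Return the objURL values found in html_text, in order, without duplicates."""
    seen = set()
    urls = []
    for raw in OBJ_URL_PATTERN.findall(html_text):
        # Undo JSON-style escaped slashes, in case the page uses them (an assumption).
        url = raw.replace("\\/", "/")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up fragment shaped like the assumed format, reusing the URL from the answer.
sample_source = (
    '<script>var imgs = [{"objURL":"http://img3.duitang.com/uploads/item/'
    '201608/06/20160806110540_MAcru.jpeg","width":500},'
    '{"objURL":"http://img3.duitang.com/uploads/item/'
    '201608/06/20160806110540_MAcru.jpeg","width":500}];</script>'
)

found = extract_obj_urls(sample_source)
```

    For the real page, pass in `requests.get(url).text` instead of the sample string. A regex is used here because the data sits inside a script block rather than in elements, which is why XPath cannot reach it.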
