
python - Can these two pieces of information on this page be scraped without a headless browser?

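While scraping the page "http://www.haodf.com/doctor/DE4r08xQdKSLBVM8i9sHYQ8uQGIO.htm", I found that the two fields "擅长" (specialties) and "执业经历" (practice experience) cannot be retrieved with BeautifulSoup. The code I use to select them is as follows: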

soup.select_one('#full_DoctorSpecialize').get_text(strip=True)
soup.select_one('#full').get_text(strip=True)

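Inspecting the page, it looks like these two fields are the result of a JS query. Apart from running regular expressions over the whole page, I would like to ask:
1. Can these two fields be retrieved directly?
2. Besides tools like Selenium, is there another way to get them?
3. Can this be solved by analyzing the query interface?

Thanks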

巴扎黑  2740 days ago  530

All replies (3)

  • PHP中文网

    PHP中文网 2017-04-18 10:20:55

    The data you want is probably rendered by JS after the page has loaded. In other words, the content of #full_DoctorSpecialize
    is loaded via AJAX from the server rather than being present in the static markup. One way to get such data is a headless browser such as PhantomJS; search for it on Baidu and you will find plenty of material.

  • PHP中文网

    PHP中文网 2017-04-18 10:20:55

    These two pieces of information can be obtained directly, but they are contained in the JS block BigPipe.onPageletArrive({...}), so you can extract them with a regular expression. The argument inside the parentheses is a JSON-formatted string; once matched, it is easy to parse as JSON. Getting the data through the query interface should also be possible, but you would have to analyze the JS code, which is too much trouble. You could use a packet capture tool to capture the HTTP requests and then inspect the data they return. In comparison, writing a regular expression match is faster.
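
A minimal sketch of that extraction, for illustration only. It assumes the argument of each BigPipe.onPageletArrive(...) call is strict JSON, as described above; the function name extract_pagelets and the "id" / "content" keys in the sample are hypothetical and should be checked against the real page source. A regular expression only locates the call; the object itself is read with json.JSONDecoder.raw_decode, because a pattern such as \{.*?\} tends to stop early when the JSON contains nested braces.

```python
import json
import re

# Matches the start of each call, up to (but not including) the JSON argument.
CALL_MARKER = re.compile(r"BigPipe\.onPageletArrive\(\s*")


def extract_pagelets(html):
    """Return the decoded JSON argument of every BigPipe.onPageletArrive call."""
    decoder = json.JSONDecoder()
    pagelets = []
    for match in CALL_MARKER.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it (here: ");").
            payload, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            # Not strict JSON (e.g. a JS object literal); skip this call.
            continue
        pagelets.append(payload)
    return pagelets


if __name__ == "__main__":
    # Made-up sample input; the real page layout may differ.
    sample = (
        '<script>BigPipe.onPageletArrive({"id":"bp_doctor_about",'
        '"content":"<div id=\\"full_DoctorSpecialize\\">Sample text<\\/div>"});'
        "</script>"
    )
    for pagelet in extract_pagelets(sample):
        print(pagelet.get("id"), pagelet.get("content"))
```

The page HTML itself can be fetched with any ordinary HTTP client; no browser is involved, since the data is already in the response body inside the script block.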

  • 怪我咯

    怪我咯 2017-04-18 10:20:55

    As mentioned above, this content is rendered by JS, and the data sits inside the JS code. You can use a regular expression to match the relevant part of the JS code and get the information you want.
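
Once the string has been pulled out of the JS code, it is usually an HTML fragment, and the field still has to be picked out of it. A rough sketch using only the standard library follows; it assumes the fragment contains an element whose id is the one used in the question (#full_DoctorSpecialize), and the names TextById and text_by_id are made up for this example. Passing the fragment to BeautifulSoup and calling select_one would work just as well.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first-level element with a given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())


def text_by_id(fragment, target_id):
    """Return the whitespace-trimmed text of the element with id=target_id."""
    parser = TextById(target_id)
    parser.feed(fragment)
    parser.close()
    return " ".join(chunk for chunk in parser.chunks if chunk)


if __name__ == "__main__":
    # Made-up fragment standing in for the string extracted from the JS code.
    fragment = ('<div id="full_DoctorSpecialize">Sample <b>specialty</b>'
                '<br>text</div><div id="other">ignored</div>')
    print(text_by_id(fragment, "full_DoctorSpecialize"))
```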
