How Python crawls content added by js in web pages (code)
This article shows how to use Python to crawl content that JavaScript adds to a web page, with example code.
When we crawl a web page, we apply extraction rules to the HTML data the server returns. But if the page builds its content with JavaScript, that content only exists after the page has been rendered, so scraping the raw response with conventional methods yields nothing. WebKit solves this problem: it is the underlying page-rendering engine for some browsers, so it can load a page and run its scripts the way a browser does. WebKit is available through the Qt library (as the QtWebKit module), so if you have installed the Qt and PyQt4 libraries, you can use it directly.
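To make the problem concrete, here is a minimal sketch of why conventional scraping comes back empty. Both HTML strings are made-up examples, not taken from a real site, and the helper names `TagCounter` and `count_tags` are hypothetical. It uses only the standard library's `html.parser` to count `<li>` elements before and after the script has run.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> elements found in html."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical page as the server returns it: the list is empty and a
# script is expected to fill it in once the page is rendered.
RAW_HTML = """
<html><body>
<ul id="items"></ul>
<script>
  var list = document.getElementById("items");
  ["a", "b", "c"].forEach(function (name) {
    var li = document.createElement("li");
    li.textContent = name;
    list.appendChild(li);
  });
</script>
</body></html>
"""

# Hypothetical result after a rendering engine has executed the script.
RENDERED_HTML = """
<html><body>
<ul id="items"><li>a</li><li>b</li><li>c</li></ul>
</body></html>
"""

print(count_tags(RAW_HTML, "li"))       # 0: the items are not in the raw response
print(count_tags(RENDERED_HTML, "li"))  # 3: they appear only after rendering
```

The raw response contains the script that would create the items, but not the items themselves, which is why a rendering step is needed before extraction.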
Linux: sudo apt-get install python-qt4
Windows:
Step 1: Download the .whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyqt4, which offers packages for different Python versions.
Step 2: Put the downloaded file in a directory of your choice, open cmd, cd into that directory, and run the command: pip install PyQt4-4.11.4-cp36-cp36m-win_amd64.whl to complete the installation.
Step 3: Verify whether the installation is successful.
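The article does not say how to verify the installation. One simple way, sketched here as an assumption rather than the author's method, is to try importing the modules the crawler needs. The helper name `is_installed` is hypothetical.

```python
import importlib


def is_installed(module_name):
    """Return True if the named module can be imported, else False."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Check the package itself and the WebKit module the crawler relies on.
for name in ("PyQt4", "PyQt4.QtWebKit"):
    print(name, "OK" if is_installed(name) else "missing")
```

If either line reports "missing", the wheel probably does not match the Python version or architecture in use.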
First, send the request through WebKit and wait until the web page has loaded completely, then assign the rendered result to a variable. Next, we use lxml to extract the useful information from the HTML data. Rendering the page takes a while.
import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    # Renders a web page: loads everything at the url and stores it in a new frame.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # run the event loop until _loadFinished calls quit()

    def _loadFinished(self, result):
        # Called once the page has finished loading; keep the rendered frame.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://jandan.net/ooxx'
r = Render(url)
html = r.frame.toHtml()
print(html)
The remaining work is to parse the rendered HTML, which is not covered in detail here.
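As a rough illustration of that parsing step, the sketch below collects image URLs from an HTML string. It is not the article's method. To stay free of extra dependencies it uses the standard library's `html.parser` as a stand-in for lxml. The sample HTML, the example.com URLs, and the names `ImageSrcCollector` and `extract_image_urls` are all hypothetical.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html):
    """Return the list of image URLs found in an HTML string."""
    collector = ImageSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical rendered HTML standing in for the crawler's output.
SAMPLE_HTML = (
    '<div><img src="http://example.com/a.jpg"><p>text</p>'
    '<img src="http://example.com/b.jpg"></div>'
)

print(extract_image_urls(SAMPLE_HTML))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

In practice, the string passed to `extract_image_urls` would be the rendered HTML returned by `r.frame.toHtml()`, converted with `str()` if the binding returns a Qt string type.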