How Python crawls content added by js in web pages (code)

不言
2018-09-28 14:14:57

This article shows, with code, how to use Python to crawl content that JavaScript adds to a web page. It may be a useful reference for anyone facing this problem.

When we crawl a web page, we apply rules to extract useful information from the HTML the server returns. If the page builds its content with JavaScript, however, that content only exists after the page has been rendered, so conventional scraping of the raw response finds nothing. WebKit solves this problem: it is the rendering engine underneath some browsers, so it can load a page and run its scripts much as a browser would. Qt ships WebKit as its QtWebKit module, so once Qt and PyQt4 are installed, the code in this article runs directly.
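To illustrate why conventional extraction comes back empty, here is a minimal sketch that uses only the standard library. The HTML string, the `comments` id, and the names `ItemCounter` and `count_items` are made up for illustration and do not come from any real page.

```python
from html.parser import HTMLParser

# Hypothetical page as a server might return it: the list is empty, and
# the script would fill it in only when a browser engine runs it.
RAW_HTML = """
<html><body>
<ul id="comments"></ul>
<script>
var li = document.createElement("li");
li.textContent = "hello";
document.getElementById("comments").appendChild(li);
</script>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Count <li> start tags found in the markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_items(html):
    """Return how many <li> elements a static parse of html can see."""
    parser = ItemCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# The parser never executes the script, so it sees no list items.
print(count_items(RAW_HTML))
```

A static parser only reads markup, so the list item that the script would create is invisible to it. Rendering the page first is what makes that item appear in the HTML.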

1. Environment preparation

Linux: sudo apt-get install python-qt4

Windows:

Step 1: Download the .whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyqt4, which offers packages for different Python versions.

Step 2: Put the downloaded file in a directory of your choice, open cmd, cd into that directory, and run: pip install PyQt4-4.11.4-cp36-cp36m-win_amd64.whl to complete the installation.

Step 3: Verify whether the installation is successful.
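One way to carry out this check, as a minimal sketch, is to ask Python whether it can locate the package. The helper name `is_importable` is hypothetical and not part of PyQt4. It relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # PyQt4 is the package installed in Step 2.
    status = "installed" if is_importable("PyQt4") else "NOT installed"
    print("PyQt4", status)
```

This only confirms that the package can be found. Running `from PyQt4.QtWebKit import QWebPage` in an interpreter is a stricter check, because it also loads the module that the solution depends on.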

2. Solution

First, send the request through WebKit and wait until the page has fully loaded, then assign the rendered page to a variable. Next, use lxml to extract the useful information from the HTML. Expect this process to take a little while.

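import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Render a web page: load everything at url and keep the resulting frame."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Block in the Qt event loop until _loadFinished calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # result is False if loading failed; the frame is stored either way.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://jandan.net/ooxx'
r = Render(url)
html = r.frame.toHtml()  # HTML of the page after its JavaScript has run
print(html)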

What remains is parsing the rendered HTML, which this article does not walk through.
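For orientation only, here is a minimal sketch of what that parsing step could look like. The article names lxml for this job. The sketch swaps in the standard library's `html.parser` so that it runs without extra installs. The sample string, the example.com URLs, and the names `ImageSrcCollector` and `extract_image_sources` are hypothetical stand-ins, not the real output of the page above.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs.
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return the src values of all <img> tags in html, in document order."""
    parser = ImageSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical stand-in for the string returned by r.frame.toHtml().
sample = (
    '<ol><li><img src="http://example.com/a.jpg"></li>'
    '<li><img src="http://example.com/b.jpg"></li></ol>'
)
print(extract_image_sources(sample))
```

In practice, `sample` would be replaced by the `html` variable from the rendering code, and the tag and attribute to collect would depend on the structure of the target page.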

Statement:
This article is reproduced from cnblogs.com. In case of infringement, please contact admin@php.cn to have it removed.