
Scrapy custom crawler: crawling JavaScript content

高洛峰 (Original) · 2016-10-17 13:57:56

Many websites use javascript: page content is generated dynamically by js, js events trigger content changes and open links, and some sites do not work at all without js, instead returning a message along the lines of "Please enable JavaScript and browse".

There are four ways to support javascript:
1. Write code that simulates the relevant js logic yourself.
2. Drive a browser with an interface, for example Selenium and similar tools that are widely used for testing.
3. Use a headless browser, typically one of the webkit-based ones: casperjs, phantomjs, etc.
4. Build a lightweight browser yourself on top of a js execution engine. This is very difficult.

For simple, limited crawling tasks, if you can simulate the js logic in code, this is the preferred solution. For example, on the duckduckgo search engine the page-turning action is triggered by js and at first looked hard to simulate, but then I noticed a second form on the page: it seemed that submitting it would also turn the page, and after trying it, that is indeed the case.
When simulating js logic in code, first try disabling js in the browser and see whether you can still get the information you need; some pages provide a no-js fallback. If not, open the chrome console or firebug and observe what the js does. It may be ajax requests and responses, which you can reproduce with urllib2 (the requests library is recommended) — see the sketch below — or it may be modifying the DOM or the like, which you can reproduce with lxml. In other words, whatever js executes, simulate it with python code.
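A minimal sketch of the "reproduce the ajax call" idea, assuming you have already found the real endpoint and parameters in the browser's network panel. The URL and parameters below are hypothetical placeholders:

```python
import requests
import lxml.html

session = requests.Session()
# Pretend to be a normal browser so the site serves the same data the js would get.
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical ajax endpoint discovered by watching the network traffic.
resp = session.get("http://example.com/ajax/list", params={"page": 2})

# Parse the returned fragment and extract what the page's js would have
# inserted into the DOM.
doc = lxml.html.fromstring(resp.text)
for link in doc.xpath("//a/@href"):
    print(link)
```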

You can also choose selenium. The disadvantage is low efficiency: first test whether the time selenium needs to start a browser instance is acceptable to you. That time is generally on the order of seconds, and it gets even slower once the browser opens and renders the page. If the efficiency is acceptable, this solution is not bad.
Another problem with this solution is that selenium cannot run on a server without a desktop environment.

For larger-scale situations, where simulating js is not feasible, selenium is too slow, or the crawler has to run without a desktop environment, you need a headless browser. The general situation with the common headless browsers is as follows:
1. casperjs, phantomjs: not python, but callable from the command line, and their functionality is basically sufficient. Check whether these two meet your needs first; they are relatively mature. phantomjs also has an unofficial implementation of the webdriver protocol, so you can drive phantomjs through selenium to get a headless browser (see the sketch after this list).
2. ghost, spynner, etc.: python-customized webkit. Personally I feel the spynner code is messy and the ghost code is of decent quality, but it has bugs. I have looked at several such libraries and ended up patching them.
The details of this solution are discussed below.
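A minimal sketch of driving phantomjs through selenium's webdriver protocol, which gives a js-capable fetch without a visible browser window. It assumes the phantomjs binary is on PATH; note that selenium has since deprecated PhantomJS support, so treat this as illustrative of the era the article describes:

```python
from selenium import webdriver

driver = webdriver.PhantomJS()       # starts a headless webkit instance
driver.get("http://example.com/")    # the page is fetched and its js executed
html = driver.page_source            # DOM serialized after js has run
driver.quit()
```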

Finally, there is one more option: build a lightweight headless browser that supports js yourself, on top of a js execution engine. Unless you have a huge amount of content to crawl and efficiency is very, very, very important, don't consider it. If you do have this idea, take a look at pyv8. The v8 sample code includes a simple browser model based on v8, but it is only a model and not fully usable; you have to fill in some of its methods yourself. To do this you need to implement, on top of the js engine (v8) and an http library (urllib2): 1. obtaining the js code referenced by a page when it is opened; 2. a browser model, including the various events and the DOM tree; 3. executing the js. There may be other details besides these. It is difficult.
You can find online an article and related ppt about the shopping price-comparison crawler used by Yitao. That crawler uses only the third solution, and the ppt is worth reading. It apparently uses webkit plus scrapy, with scrapy's scheduling queue changed to a redis-based one to make the crawler distributed.
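A tiny sketch of the js-engine layer only, using the PyV8 bindings for v8; everything around it (fetching the page's scripts, the DOM and event model) is the hard part that you would have to build yourself:

```python
import PyV8

# Create a v8 context and evaluate some js in it. A real lightweight browser
# would expose window/document objects into this context and feed it the
# scripts collected from the page.
ctxt = PyV8.JSContext()
ctxt.enter()
print(ctxt.eval("var a = 1 + 2; a * 10;"))   # 30
ctxt.leave()
```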

How to implement it:

Some background knowledge first. Scrapy is built on twisted, an asynchronous network framework, so we have to watch out for anything that might block. However, there is a setting that controls the parallelism of the ItemPipeline, which suggests that the pipeline does not block; it may be executed in a thread pool (not verified). The pipeline is generally used to persist the captured items (writing to a database or to files), so you do not have to worry about time-consuming operations there blocking the whole framework, and there is no need to make those writes asynchronous inside the pipeline. The other parts of the framework are all asynchronous. Put simply: a request generated by the spider is handed to the scheduler and downloader, the spider keeps running and being scheduled, and once the downloader finishes, the response is handed back to the spider for parsing.
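For reference, these are the settings.py knobs involved; CONCURRENT_ITEMS is the pipeline-parallelism parameter mentioned above, and the pipeline class name is a hypothetical placeholder:

```python
# settings.py
CONCURRENT_ITEMS = 100       # items processed in parallel through the ItemPipeline per response (scrapy default: 100)
CONCURRENT_REQUESTS = 16     # downloader parallelism, independent of the pipeline (scrapy default: 16)

ITEM_PIPELINES = {
    # Hypothetical pipeline that writes captured items to a database or file.
    "myproject.pipelines.StoragePipeline": 300,
}
```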

The reference examples found online put the js support into a DownloaderMiddleware, and the code snippet on the scrapy official website does the same. Implemented this way, it blocks the whole framework, and the crawler's working mode becomes download-parse-download-parse instead of parallel downloading. For small-scale crawling where efficiency does not matter much, this is not a big problem.
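A minimal sketch of that middleware approach (selenium + phantomjs), assuming the class and project names below, which are hypothetical. Returning a response from process_request short-circuits scrapy's own downloader, but the browser call runs in the reactor thread, which is exactly why downloads and parsing become serial:

```python
from scrapy.http import HtmlResponse
from selenium import webdriver

class JsDownloaderMiddleware(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()   # one headless webkit instance for the crawl

    def process_request(self, request, spider):
        # Blocks until the page (and its js) has loaded and rendered.
        self.driver.get(request.url)
        body = self.driver.page_source
        # Returning a Response here means scrapy's downloader is skipped for this request.
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```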

A better approach is to write the js support into scrapy's downloader itself. There is such an implementation online (using selenium + phantomjs), but it only supports GET requests.
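A hedged sketch of wiring a custom handler into the downloader. DOWNLOAD_HANDLERS is the real scrapy setting that maps URL schemes to handler classes; the handler module and class below are hypothetical, and only the skeleton is shown. The key point is that download_request() returns a Deferred, so the blocking browser work can be pushed to a thread pool instead of blocking the twisted reactor:

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "myproject.handlers.WebkitDownloadHandler",
    "https": "myproject.handlers.WebkitDownloadHandler",
}

# handlers.py (skeleton only)
from twisted.internet import threads
from scrapy.http import HtmlResponse

class WebkitDownloadHandler(object):
    def __init__(self, settings):
        # Depending on the scrapy version, the handler is constructed with the project settings.
        self.settings = settings

    def download_request(self, request, spider):
        # Run the blocking render in a thread pool and return a Deferred to scrapy.
        return threads.deferToThread(self._render, request)

    def _render(self, request):
        # Render request.url with a webkit/phantomjs instance here and return the
        # post-js HTML; the body below is only a placeholder.
        body = "<html></html>"
        return HtmlResponse(request.url, body=body, encoding="utf-8", request=request)
```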

When adapting a webkit into scrapy's downloader like this, there are various details that need to be handled.

