
Implementation code for web scraping using PhantomJS (JavaScript skills)

WBOY | Original | 2016-05-16 16:35

Because PhantomJS is a headless browser that can execute JavaScript and work with the DOM, it is well suited for web scraping.

For example, suppose we want to batch-scrape the "Today in History" entries from http://www.todayonhistory.com/.

Looking at the DOM structure, we only need the title attribute of each .list li a element, so we can collect them with a CSS selector:

var d = '';                                          // accumulated output
var c = document.querySelectorAll('.list li a');     // every link in the list
var l = c.length;
for (var i = 0; i < l; i++) {
    d = d + c[i].title + '\n';                       // collect each link's title attribute
}

After that, all we need to do is run this JavaScript inside PhantomJS:

var page = require('webpage').create();
page.open('http://www.todayonhistory.com/', function (status) { // open the page
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        // evaluate() runs inside the page context and returns the collected titles
        console.log(page.evaluate(function () {
            var d = '';
            var c = document.querySelectorAll('.list li a');
            var l = c.length;
            for (var i = 0; i < l; i++) {
                d = d + c[i].title + '\n';
            }
            return d;
        }));
    }
    phantom.exit();
});

Finally, save the script as catch.js and run it from the command prompt, redirecting the output to a text file, e.g. phantomjs catch.js > today.txt. (You can also write the file directly with PhantomJS's file API.)
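If you prefer to write the file from inside the script instead of redirecting output, below is a minimal sketch using PhantomJS's built-in fs module; the output filename today.txt is just an example.

var fs = require('fs');
var page = require('webpage').create();

page.open('http://www.todayonhistory.com/', function (status) {
    if (status === 'success') {
        // same extraction as before, run in the page context
        var titles = page.evaluate(function () {
            var d = '';
            var c = document.querySelectorAll('.list li a');
            for (var i = 0; i < c.length; i++) {
                d = d + c[i].title + '\n';
            }
            return d;
        });
        fs.write('today.txt', titles, 'w'); // write the collected titles to a text file
    } else {
        console.log('FAIL to load the address');
    }
    phantom.exit();
});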
