当当当!终于来到了Jsoup的特色:CSS Selector部分。selector也是我写的爬虫框架webmagic开发的一个重点。附上一张street fighter的图,希望以后webmagic也能挑战Jsoup!
select机制
Jsoup的select包里,类结构如下:
在最开始介绍Jsoup的时候,就已经说过NodeVisitor和Selector了。Selector是select部分的对外facade,而NodeVisitor则是遍历树的底层API,CSS Selector也是根据NodeVisitor实现的遍历。
Jsoup的select核心是Evaluator。Selector所传递的表达式,会经过QueryParser,最终编译成一个Evaluator。Evaluator是一个抽象类,它只有一个方法:
public abstract boolean matches(Element root, Element element);
注意这里传入了root,是为了某些情况下对树进行遍历时用的。
Evaluator的设计简洁明了,所有的Selector表达式单词都会编译到对应的Evaluator。例如#xx对应Id,.xx对应Class,[]对应Attribute。这里补充一下w3c的CSS Selector规范:http://www.w3.org/TR/CSS2/selector.html
当然,只靠这几个还不够,Jsoup还定义了CombiningEvaluator(对Evaluator进行And/Or组合),StructuralEvaluator(结合DOM树结构进行筛选)。
这里我们可能最关心的是,“div ul li”这样的父子结构是如何实现的。这个的实现方式在StructuralEvaluator.Parent中,贴一下代码了:
static class Parent extends StructuralEvaluator { public Parent(Evaluator evaluator) { this.evaluator = evaluator; }public boolean matches(Element root, Element element) { if (root == element) return false;Element parent = element.parent(); while (parent != root) { if (evaluator.matches(root, parent)) return true; parent = parent.parent(); } return false; }}
这里Parent包含了一个evaluator属性,会根据这个evaluator去验证所有父节点。注意Parent是可以嵌套的,所以这个表达式”div ul li”最终会编译成And(Parent(And(Parent(Tag(“div”)),Tag(“ul”)),Tag(“li”)))这样的Evaluator组合。
select部分比想象的要简单,代码可读性也很高。经过了parser部分的研究,这部分应该算是驾轻就熟了。
关于webmagic的后续打算
webmagic是一个爬虫框架,它的Selector是用于抓取HTML中指定的文本,其机制和Jsoup的Evaluator非常像,只不过webmagic暂时是将Selector封装成较简单的API,而Evaluator直接上了表达式。之前也考虑过自己定制DSL来写一个HTML,现在看了Jsoup的源码,实现能力算是有了,但是引入DSL,实现只是一小部分,如何让DSL易写易懂才是难点。
其实看了Jsoup的源码,精细程度上比webmagic要好得多了,基本每个类都对应一个真实的概念抽象,可能以后会在这方面下点工夫。
下篇文章将讲最后一部分:白名单及HTML过滤机制。
最后依然附上这系列文章和代码的github地址:https://github.com/code4craft/jsoup-learning

The article discusses the HTML <datalist> element, which enhances forms by providing autocomplete suggestions, improving user experience and reducing errors.Character count: 159

The article discusses the HTML <progress> element, its purpose, styling, and differences from the <meter> element. The main focus is on using <progress> for task completion and <meter> for stati

The article discusses the HTML <meter> element, used for displaying scalar or fractional values within a range, and its common applications in web development. It differentiates <meter> from <progress> and ex

The article discusses the <iframe> tag's purpose in embedding external content into webpages, its common uses, security risks, and alternatives like object tags and APIs.

The article discusses the viewport meta tag, essential for responsive web design on mobile devices. It explains how proper use ensures optimal content scaling and user interaction, while misuse can lead to design and accessibility issues.

The article discusses using HTML5 form validation attributes like required, pattern, min, max, and length limits to validate user input directly in the browser.

Article discusses best practices for ensuring HTML5 cross-browser compatibility, focusing on feature detection, progressive enhancement, and testing methods.

This article explains the HTML5 <time> element for semantic date/time representation. It emphasizes the importance of the datetime attribute for machine readability (ISO 8601 format) alongside human-readable text, boosting accessibilit


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Zend Studio 13.0.1
Powerful PHP integrated development environment

SublimeText3 English version
Recommended: Win version, supports code prompts!

Dreamweaver Mac version
Visual web development tools

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools