Use jsoup to parse and manipulate HTML

WBOY
2016-06-24 11:40:02

Introduction to jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

The main functions of jsoup are as follows:

1. Parse HTML from a URL, file, or string;
2. Use DOM traversal or CSS selectors to find and retrieve data;
3. Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license and can be used in commercial projects with confidence.

The main class hierarchy of jsoup is shown in the figure below:

[Figure: main class hierarchy of jsoup]

Next, we will walk through several common application scenarios to show how elegantly jsoup processes HTML documents.

Document input

jsoup can load HTML documents from strings, URLs, and local files, and produces a Document object in each case.

The following is the relevant code:

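import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Load an HTML document directly from a string
String html = "<html><head><title>OSChina</title></head>"
    + "<body><p>Articles about the jsoup project</p></body></html>";
Document docFromString = Jsoup.parse(html);

// Load an HTML document directly from a URL
Document docFromUrl = Jsoup.connect("http://www.oschina.net/").get();
String title = docFromUrl.title();

// Send a customized POST request
Document docFromPost = Jsoup.connect("http://www.oschina.net/")
    .data("query", "Java")   // request parameter
    .userAgent("I'm jsoup")  // set the User-Agent
    .cookie("auth", "token") // set a cookie
    .timeout(3000)           // set the timeout in milliseconds
    .post();                 // access the URL with the POST method

// Load an HTML document from a file
File input = new File("D:/test.html");
Document docFromFile = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");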


Note the third parameter of parse in the last example. Why specify a URL here, when it can be omitted as in the first method? An HTML document usually contains many links, images, and references to external scripts and CSS files. The third parameter is the base URI: when the document refers to an external file by a relative path, jsoup resolves that path against the base URI to produce an absolute URL.

For example, with the base URI above, a link labeled "Open Source Software" whose href is a relative path resolves to an absolute address under http://www.oschina.net/ when the attribute is read through absUrl("href") or attr("abs:href").
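A minimal sketch of base URI resolution follows. The sample markup and the relative path /project are made up for illustration and do not come from the original article. It assumes jsoup is on the classpath and uses Element.absUrl to read the resolved address.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriExample {
    // Parses the HTML with the given base URI and returns the absolute
    // form of the first link's href attribute.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with a relative link.
        String html = "<p><a href=\"/project\">Open Source Software</a></p>";
        // The raw attribute stays relative; absUrl resolves it against the base URI.
        System.out.println(resolveFirstLink(html, "http://www.oschina.net/"));
    }
}
```

Reading the attribute with attr("href") still returns the relative value as written in the document, so the choice between the two calls depends on whether you need the original or the resolved address.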

Parse and extract HTML elements

This part covers the most basic function of an HTML parser, but jsoup takes a different approach from other open source projects: selectors. We will introduce jsoup selectors in detail in the last part; in this section you will see how much jsoup achieves with very little code.

jsoup also supports traditional DOM-style element traversal. Take a look at the following code:

File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href"); // value of the href attribute
  String linkText = link.text();       // text inside the link
}


jsoup's methods may look familiar: getElementById and getElementsByTag have the same names as their JavaScript DOM counterparts and do the same thing. You can get an element by its id, or a list of elements by tag name.

Unlike the htmlparser project, jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a node name, attributes, and text, and jsoup provides simple methods for retrieving each of these yourself, which is also why jsoup stays so lightweight.

When it comes to retrieving elements, jsoup's selectors can do almost anything:
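The three parts of an element mentioned above can be read as in the following sketch. The markup and the id "content" are sample values chosen for illustration, and the code assumes the element and link exist (getElementById returns null when no element matches).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementPartsExample {
    // Returns "tagName|href|text" for the first link inside the element
    // with the given id. Assumes both the element and the link exist.
    static String describeFirstLink(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        Element link = container.getElementsByTag("a").first();
        return link.tagName() + "|" + link.attr("href") + "|" + link.text();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup.
        String html = "<div id=\"content\"><a href=\"http://www.oschina.net/\">OSChina</a></div>";
        System.out.println(describeFirstLink(html, "content"));
    }
}
```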

File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Elements links = doc.select("a[href]");         // links that have an href attribute
Elements pngs = doc.select("img[src$=.png]");   // images whose src ends with .png

Element masthead = doc.select("div.masthead").first(); // first div with class=masthead

Elements resultLinks = doc.select("h3.r > a");  // a elements that are direct children of h3.r


This is what really impresses me about jsoup: it uses the same selector syntax as jQuery to retrieve elements. With other HTML parsers, each of the retrievals above would take many lines of code; with jsoup, each takes a single line.

jsoup's selectors also support expressions. We will introduce this very powerful feature in the last section.
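To try these selectors without a local file, the following sketch runs them against an in-memory document. The markup is invented for illustration only; the selectors are the same ones used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorQuickExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
        + "<h3 class=\"r\"><a href=\"/a\">First</a></h3>"
        + "<h3 class=\"r\"><span><a href=\"/b\">Nested</a></span></h3>"
        + "<a name=\"anchor\">No href</a>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println("a[href]        -> " + count("a[href]"));
        System.out.println("img[src$=.png] -> " + count("img[src$=.png]"));
        System.out.println("div.masthead   -> " + count("div.masthead"));
        System.out.println("h3.r > a       -> " + count("h3.r > a"));
    }
}
```

Note that h3.r > a matches only the link that is a direct child of the heading; the link wrapped in a span is skipped.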

Modify data

While parsing a document, we may also need to modify some of its elements. For example, we can add clickable links, change a link's address, or change its text.

Here are some simple examples:

// Add rel=nofollow to every link in the comments
doc.select("div.comments a").attr("rel", "nofollow");

// Add class=mylinkclass to every link in the comments
doc.select("div.comments a").addClass("mylinkclass");

// Remove the onclick attribute from every image
doc.select("img").removeAttr("onclick");

// Clear the text in every text input box
doc.select("input[type=text]").val("");


The principle is simple: use a jsoup selector to find the elements, then modify them with methods like those above. Apart from the tag name, which the methods shown here do not change (you can delete the element and insert a new one instead), an element's attributes and text can all be modified.

After modification, directly call the html() method of Element(s) to obtain the modified HTML document.
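A minimal sketch of the full round trip (parse, modify, serialize) might look like this. The comment markup is a made-up sample, and Jsoup.parseBodyFragment is used here only because the input is a fragment rather than a whole page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ModifyExample {
    // Marks every link inside div.comments with rel=nofollow and an extra
    // class, then returns the modified body HTML.
    static String markCommentLinks(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("div.comments a")
            .attr("rel", "nofollow")
            .addClass("mylinkclass");
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: one link in a comment, one outside.
        String html = "<div class=\"comments\"><a href=\"http://example.com/\">link</a></div>"
            + "<a href=\"/other\">other</a>";
        System.out.println(markCommentLinks(html));
    }
}
```

Because the Elements methods return the same collection, the two modifications can be chained instead of running the selector twice.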

HTML document cleaning

jsoup manages to offer a powerful API while remaining easy to use. Websites often let users post comments, and some users put scripts into their comments. These scripts can break the behavior of the whole page or, worse, steal confidential information, as in cross-site scripting (XSS) attacks.

jsoup has strong support for this case and is very simple to use. Take a look at the following code:

String unsafe = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>OSChina</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output:
// <p><a href="http://www.oschina.net/" rel="nofollow">OSChina</a></p>

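jsoup uses the Whitelist class to filter HTML documents (newer jsoup releases rename this class to Safelist). The class provides five preset filters:

1. none(): allows only text nodes; all HTML is removed;
2. simpleText(): allows only simple text formatting tags: b, em, i, strong, u;
3. basic(): allows a fuller range of text tags such as a, blockquote, code, li, ol, p, pre, and ul, and adds rel="nofollow" to links;
4. basicWithImages(): allows the same tags as basic() plus img;
5. relaxed(): allows a wide range of text and structural tags, including tables, headings, and images.

If none of these five filters meets your requirements, for example because you want to let users embed Flash animations, the Whitelist can be extended: call addTags("embed", "object", "param", "span", "div") on a Whitelist instance to allow more tags, and call addAttributes to allow attributes on particular elements.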
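The sketch below shows one way such an extension could look. It assumes a recent jsoup release, where the class is named Safelist; on the older versions this article targets, substitute Whitelist. The input markup and the choice of allowing div with a class attribute are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CustomSafelistExample {
    // Cleans untrusted HTML with the basic preset, extended to allow
    // div elements and their class attribute.
    // Assumption: a recent jsoup where Whitelist is named Safelist.
    static String cleanWithDivs(String unsafeHtml) {
        Safelist safelist = Safelist.basic()
            .addTags("div")
            .addAttributes("div", "class");
        return Jsoup.clean(unsafeHtml, safelist);
    }

    public static void main(String[] args) {
        // Hypothetical untrusted input.
        String unsafe = "<div class=\"note\" onclick=\"steal()\">Hello "
            + "<script>alert(1)</script><b>world</b></div>";
        // The div and its class survive; onclick and script are dropped.
        System.out.println(cleanWithDivs(unsafe));
    }
}
```

Starting from a preset and adding only what is needed keeps the filter restrictive by default, which is usually safer than starting from relaxed() and removing tags.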

The best part of jsoup: selectors

We have briefly introduced how jsoup uses selectors to retrieve elements. In this section we focus on the powerful syntax of selectors themselves. The following table is a detailed list of all syntaxes for jsoup selectors.

Basic usage
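As a hedged illustration of the basic forms (tag name, #id, .class, and attribute selectors), the following sketch counts matches in a small made-up document. The markup is not from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectorExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div id=\"logo\" class=\"head\">Logo</div>"
        + "<a href=\"http://www.oschina.net/news\" title=\"news\">News</a>"
        + "<a href=\"/local/page.html\">Local</a>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",                // by tag name
            "#logo",            // by id
            ".head",            // by class name
            "[title]",          // has the attribute
            "[title=news]",     // attribute equals value
            "[href^=http]",     // attribute starts with value
            "[href$=.html]",    // attribute ends with value
            "[href*=oschina]"   // attribute contains value
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```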

The above is the most basic selector syntax. These syntaxes can also be used in combination. The following is the combined usage supported by jsoup:
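A few of the combined forms can be sketched as follows: descendant, child, adjacent sibling, general sibling, element with class, and grouping. The markup is a made-up sample.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinedSelectorExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div id=\"box\"><a class=\"ext\" href=\"/x\">direct</a>"
        + "<p><a href=\"/y\">nested</a></p></div>"
        + "<h3>Title</h3><p>first</p><p>second</p>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println("div a   -> " + count("div a"));   // any descendant
        System.out.println("div > a -> " + count("div > a")); // direct child only
        System.out.println("a.ext   -> " + count("a.ext"));   // tag plus class
        System.out.println("h3 + p  -> " + count("h3 + p"));  // immediately following sibling
        System.out.println("h3 ~ p  -> " + count("h3 ~ p"));  // any following sibling
        System.out.println("h3, p   -> " + count("h3, p"));   // either selector
    }
}
```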

In addition to some basic syntax and the combination of these syntaxes, jsoup also supports the use of expressions for element filtering and selection. The following is a list of all expressions supported by jsoup:
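Some of these expressions are sketched below: index filters (:lt, :gt, :eq), :has, :not, :contains, and :matches. The list markup is invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ExpressionSelectorExample {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<ul><li>one</li><li>two <b>bold</b></li><li>three 3</li></ul>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "li:eq(0)",          // sibling index equals 0
            "li:lt(2)",          // sibling index less than 2
            "li:gt(0)",          // sibling index greater than 0
            "li:has(b)",         // contains a matching descendant
            "li:not(:has(b))",   // does not match the inner selector
            "li:contains(two)",  // text contains the given string
            "li:matches(\\d+)"   // text matches the regular expression
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```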

Summary

That covers the basic functions of jsoup. Because jsoup has a well-designed, extensible API, you can build very powerful HTML parsing features by composing selectors. The jsoup project itself is also under active development, so if you are using Java and need to process HTML, it is well worth a try.

The above is excerpted from the open source Chinese community:

Attachment:

jsoup online API: jsoup 1.6.3 API

jsoup Development Guide
