Use Jsoup to parse and manipulate HTML

WBOY

Jun 24, 2016 am 11:40 AM

Jsoup Introduction

jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

The main functions of jsoup are as follows:

1. Parse HTML from a URL, file or string;
2. Use DOM or CSS selectors to find and retrieve data;
3. Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license and can be used in commercial projects with confidence.

The main class hierarchy of jsoup is shown in the figure below:

Next, we look at several common scenarios to show how jsoup can elegantly process HTML documents.

Document input

jsoup can load HTML documents from strings, URL addresses and local files, and generate Document object instances.

The following is the relevant code:

// Parse an HTML document directly from a string
String html = "<html><head><title>开源中国社区</title></head>"
  + "<body><p>Articles about the jsoup project</p></body></html>";
Document doc = Jsoup.parse(html);

// Load an HTML document directly from a URL
Document doc = Jsoup.connect("http://www.oschina.net/").get();
String title = doc.title();

Document doc = Jsoup.connect("http://www.oschina.net/")
  .data("query", "Java")   // request parameter
  .userAgent("I'm jsoup")  // set the User-Agent
  .cookie("auth", "token") // set a cookie
  .timeout(3000)           // set the connection timeout
  .post();                 // access the URL with the POST method

// Load an HTML document from a file
File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Note the third parameter of parse in the last example. Why specify a URL here (it is optional, as the first example shows)? An HTML document usually contains many links, images, external scripts, CSS files, and so on. The third parameter, the base URL, tells jsoup how to handle references that use relative paths: when such a reference is converted to an absolute URL, jsoup resolves it against the base URL.

For example, a relative link such as <a href="/project">Open Source Software</a> would be resolved to <a href="http://www.oschina.net/project">Open Source Software</a>.
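To make the resolution rule concrete, here is a minimal sketch that uses java.net.URI from the JDK rather than jsoup itself. It only mirrors the standard relative-URL resolution that a base URL implies. The class name BaseUrlDemo, the helper resolve, and the sample paths are assumptions made for illustration.

```java
import java.net.URI;

public class BaseUrlDemo {
    // Resolve a (possibly relative) reference against a base URL.
    static String resolve(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.oschina.net/";
        // Root-relative path: replaces the path of the base URL
        System.out.println(resolve(base, "/project"));
        // Path relative to the base URL
        System.out.println(resolve(base, "img/logo.png"));
        // Already absolute: returned unchanged
        System.out.println(resolve(base, "http://example.com/a.css"));
    }
}
```

Within jsoup, the absolute form of an attribute can be read from an element with absUrl("href") or attr("abs:href"), provided a base URL was supplied when the document was parsed.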

Parse and extract HTML elements

This part covers the most basic function of an HTML parser, but jsoup takes a different approach from other open source projects: selectors. We introduce jsoup selectors in detail in the last section; in this section you will see how little code jsoup needs to do the job.

jsoup also supports traditional DOM-style element lookup. Take a look at the following code:

File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

These methods may look familiar: getElementById and getElementsByTag have the same names as their JavaScript counterparts and do exactly the same thing. You can get an element, or a list of elements, by node name or by the id of an HTML element.

Unlike the htmlparser project, jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a node name, attributes, and text, and jsoup provides simple methods for retrieving each of these yourself. This is one reason jsoup stays slim.

When it comes to finding elements, jsoup's selectors can do almost anything:

File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

Elements links = doc.select("a[href]");                // links that have an href attribute
Elements pngs = doc.select("img[src$=.png]");          // all elements that reference a PNG image
Element masthead = doc.select("div.masthead").first(); // first div with class=masthead
Elements resultLinks = doc.select("h3.r > a");         // a elements that are direct children of h3.r

This is what really impresses me about jsoup: it uses the same selector syntax as jQuery to find elements. With other HTML parsers, each of the lookups above would take many lines of code, while jsoup needs only one.

jsoup's selectors also support expressions. We introduce this powerful selector syntax in the last section.

Modify data

While parsing a document, we may need to modify some of its elements. For example, we can add clickable links, change link addresses, or modify text.

Here are some simple examples:

doc.select("div.comments a").attr("rel", "nofollow");  // add rel=nofollow to every comment link
doc.select("div.comments a").addClass("mylinkclass"); // add class=mylinkclass to every comment link
doc.select("img").removeAttr("onclick");               // remove the onclick attribute from every image
doc.select("input[type=text]").val("");                // clear the text of every text input

The principle is simple: use a jsoup selector to find the elements, then modify them with the methods above. Apart from the tag name, which cannot be changed (you can delete the element and insert a new one instead), an element's attributes and text can all be modified.

After making the changes, call the html() method of the Element(s) to get the modified HTML.
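The modify-then-serialize flow can be sketched end to end as follows. This is a minimal example, assuming jsoup is on the classpath; the class name ModifyDemo, the helper modifiedBodyHtml, and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDemo {
    // Parse a hypothetical comment fragment, edit it, and return the resulting markup.
    static String modifiedBodyHtml() {
        String html =
            "<div class=\"comments\">"
          + "<a href=\"/u/1\">Alice</a>"
          + "<img src=\"a.png\" onclick=\"track()\">"
          + "<input type=\"text\" value=\"draft\">"
          + "</div>";
        Document doc = Jsoup.parse(html);

        // Setters on Elements return the same Elements, so calls can be chained
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");

        // html() returns the inner HTML of the element it is called on
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(modifiedBodyHtml());
    }
}
```

The printed markup carries the new rel and class attributes on the link, no onclick attribute on the image, and an empty value on the text input.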

HTML document cleaning

jsoup combines a powerful API with ease of use, and HTML cleaning is a good example. Websites often provide a user comment feature. Some users are mischievous and add scripts to their comments; these scripts may break the behavior of the whole page or, more seriously, steal confidential information, as in cross-site scripting (XSS) attacks.

jsoup has very powerful support in this regard and is very simple to use. Take a look at the following code:

String unsafe = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>开源中国社区</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output:
// <p><a href="http://www.oschina.net/" rel="nofollow">开源中国社区</a></p>

jsoup uses the Whitelist class to filter HTML documents. This class provides several ready-made filters: none(), simpleText(), basic(), basicWithImages(), and relaxed().

If none of these five filters meets your requirements, for example because you want to allow users to embed Flash animations, that is not a problem. Whitelist can be extended, for example with whitelist.addTags("embed", "object", "param", "span", "div"); you can also call addAttributes to allow additional attributes on certain elements.
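Extending a preset filter might look like the sketch below. It assumes a recent jsoup release, where the class described above is named Safelist (Whitelist was deprecated in jsoup 1.14.1 and later removed); with an older release such as the one this article describes, substitute org.jsoup.safety.Whitelist. The class name CleanDemo, the helper clean, and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    // Clean untrusted HTML with the basic preset plus a few extra tags and attributes.
    static String clean(String unsafeHtml) {
        Safelist safelist = Safelist.basic()
            .addTags("div", "span")          // additionally allow these tags
            .addAttributes("span", "title"); // additionally allow title on span
        return Jsoup.clean(unsafeHtml, safelist);
    }

    public static void main(String[] args) {
        // Assumed input: onclick and style are not allowed by the filter above
        String unsafe =
            "<div onclick=\"steal()\"><span title=\"note\" style=\"color:red\">hi</span></div>";
        System.out.println(clean(unsafe));
    }
}
```

Attributes that are not explicitly allowed, here onclick and style, are dropped, while the div and span elements and the title attribute survive.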

jsoup's standout feature: selectors

We have briefly seen how jsoup uses selectors to find elements. This section focuses on the selector syntax itself. The following table is a detailed list of the syntax supported by jsoup selectors.

Basic usage

The above is the most basic selector syntax. These syntaxes can also be used in combination. The following is the combined usage supported by jsoup:

In addition to some basic syntax and the combination of these syntaxes, jsoup also supports the use of expressions for element filtering and selection. The following is a list of all expressions supported by jsoup:
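A few of these selector forms (basic, combined, and expression-based) can be sketched as follows. This is not a complete list, only a sample of syntax that jsoup supports. The class name SelectorDemo, its helpers count and text, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<div id=\"content\" class=\"masthead\">"
      + "<h3 class=\"r\"><a href=\"/news\">News</a></h3>"
      + "<p>Intro <a href=\"http://example.com/doc.pdf\">Guide</a></p>"
      + "<img src=\"logo.png\"><img src=\"photo.jpg\">"
      + "</div>";

    // Number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    // Combined text of all elements matched by a CSS query.
    static String text(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).text();
    }

    public static void main(String[] args) {
        // Basic forms: tag, #id, .class, [attribute]
        System.out.println(count("img"));            // 2
        System.out.println(count("#content"));       // 1
        System.out.println(count(".masthead"));      // 1
        System.out.println(count("a[href]"));        // 2

        // Attribute value tests: prefix (^=) and suffix ($=)
        System.out.println(count("a[href^=http]"));  // 1
        System.out.println(count("img[src$=.png]")); // 1

        // Combinations: descendant (space) and direct child (>)
        System.out.println(text("div.masthead a"));  // News Guide
        System.out.println(text("h3.r > a"));        // News

        // Expressions: :has(selector) and case-insensitive :contains(text)
        System.out.println(count("p:has(a)"));          // 1
        System.out.println(text("a:contains(guide)"));  // Guide
    }
}
```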

Summary

This covers the basic functions of jsoup. Because jsoup has a well-designed, extensible API, you can build very powerful HTML parsing features by defining selectors. The jsoup project itself is also under very active development, so if you use Java and need to process HTML, it is worth a try.

The above is excerpted from the Open Source China community (OSChina):

Attachment:

jsoup online API: jsoup 1.6.3 API

jsoup Development Guide

Copyright Statement: This article is an original article by the blogger and has not been published. No reproduction is allowed without the permission of the blogger.
