Detailed explanation of using HTMLParser (2)-HTML Tutorial-php.cn

Home

Web Front-end

HTML Tutorial

Detailed explanation of using HTMLParser (2)

黄舟

Dec 29, 2016 pm 03:52 PM

htmlparser

HTMLParser saves the parsed information as a tree structure. Node is the basis of data type for information storage.
Please look at the definition of Node:

public interface Node extends Cloneable;

There are several types of methods included in Node:
For functions that traverse tree structures, these functions are the easiest to understand:

Node getParent ()：取得父节点
NodeList getChildren ()：取得子节点的列表
Node getFirstChild ()：取得第一个子节点
Node getLastChild ()：取得最后一个子节点
Node getPreviousSibling ()：取得前一个兄弟（不好意思，英文是兄弟姐妹，直译太麻烦而且不符合习惯，对不起女同胞了）
Node getNextSibling ()：取得下一个兄弟节点

Function to obtain Node content:

String getText ()：取得文本
String toPlainTextString()：取得纯文本信息。
String toHtml () ：取得HTML信息（原始HTML）
String toHtml (boolean verbatim)：取得HTML信息（原始HTML）
String toString ()：取得字符串信息（原始HTML）
Page getPage ()：取得这个Node对应的Page对象
int getStartPosition ()：取得这个Node在HTML页面中的起始位置
int getEndPosition ()：取得这个Node在HTML页面中的结束位置

Function used for Filter filtering:

void collectInto (NodeList list, NodeFilter filter)：基于filter的条件对于这个节点进行过滤，符合条件的节点放到list中。

Function used for Visitor traversal:

void accept (NodeVisitor visitor)：对这个Node应用visitor

Function used to modify content, this type is rarely used:

void setPage (Page page)：设置这个Node对应的Page对象
void setText (String text)：设置文本
void setChildren (NodeList children)：设置子节点列表

Other functions:

void doSemanticAction ()：执行这个Node对应的操作（只有少数Tag有对应的操作）
Object clone ()：接口Clone的抽象函数。

Actual We use HTMLParser most to process HTML pages. Filter or Visitor related functions are necessary, and the first and second types of functions are the most used. The first type of function is easier to understand. Let’s use an example to illustrate the second type of function.
The following is the HTML file used for testing:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释-->
白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>

Test code:

/**
* @author www.baizeju.com
*/
package com.baizeju.htmlparsertester;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;
/**
* @author www.baizeju.com
*/
public class Main {
private static String ENCODE = "GBK";
private static void message( String szMsg ) {
try{ System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch(Exception e ){}
}
public static String openFile( String szFileName ) {
try {
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
String szContent="";
String szTemp;
while ( (szTemp = bis.readLine()) != null) {
szContent+=szTemp+"/n";
}
bis.close();
return szContent;
}
catch( Exception e ) {
return "";
}
}
public static void main(String[] args) {
try{
Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );
for (NodeIterator i = parser.elements (); i.hasMoreNodes(); ) {
Node node = i.nextNode();
message("getText:"+node.getText());
message("getPlainText:"+node.toPlainTextString());
message("toHtml:"+node.toHtml());
message("toHtml(true):"+node.toHtml(true));
message("toHtml(false):"+node.toHtml(false));
message("toString:"+node.toString());
message("=================================================");
} 
}
catch( Exception e ) { 
System.out.println( "Exception:"+e );
}
}
}

Output result:

getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (121[0,121],123[1,0]): /n
=================================================
getText:head
getPlainText:白泽居-www.baizeju.com
toHtml:<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toString:HEAD: Tag (123[1,0],129[1,6]): head
Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ...
Tag (197[1,74],204[1,81]): title
Txt (204[1,81],223[1,100]): 白泽居-www.baizeju.com
End (223[1,100],231[1,108]): /title
End (231[1,108],238[1,115]): /head
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (238[1,115],240[2,0]): /n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText:
白泽居-www.baizeju.com
白泽居-www.baizeju.com
白泽居-www.baizeju.com
toHtml:<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释-->
白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释-->
白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释-->
白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>
toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283[2,43],285[3,0]): /n
Tag (285[3,0],292[3,7]): body 
Txt (292[3,7],294[4,0]): /n
Tag (294[4,0],313[4,19]): div id="top_main"
Txt (313[4,19],316[5,1]): /n/t
Tag (316[5,1],336[5,21]): div id="logoindex"
Txt (336[5,21],340[6,2]): /n/t/t
Rem (340[6,2],351[6,13]): 这是注释
Txt (351[6,13],376[8,0]): /n/t/t白泽居-www.baizeju.com/n
Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com"
Txt (409[8,33],428[8,52]): 白泽居-www.baizeju.com
End (428[8,52],432[8,56]): /a
Txt (432[8,56],435[9,1]): /n/t
End (435[9,1],441[9,7]): /div
Txt (441[9,7],465[11,0]): /n/t白泽居-www.baizeju.com/n
End (465[11,0],471[11,6]): /div
Txt (471[11,6],473[12,0]): /n
End (473[12,0],480[12,7]): /body
Txt (480[12,7],482[13,0]): /n
End (482[13,0],489[13,7]): /html

== ===============================================

For the content of the first Node, the corresponding line is , this is easier to understand.
From this output result, you can also see the tree structure of the content. Or rather the structure of the woods. The first-level tags in the Page content, such as DOCTYPE, head and html, respectively form a top-level Node node (many people may be a little strange about the content of the second and fourth Node. In fact, these two Nodes are Two newline symbols. HTMLParser converts all line breaks, spaces, tabs, etc. in the HTML page content into corresponding Tags, so there is a Node like this. Although it has less content, it has a high level, haha)
getPlainTextString is Everything the user can see is included. There are two interesting points. One is that the Title content in the

tag is in plainText, so it may be visible even if it is visible in the title. In addition, as mentioned earlier, line breaks and other characters in the HTML content have also become plainText. This seems to be a bit of a logical problem.

In addition, you may find that there is no difference between the results of toHtml, toHtml(true) and toHtml(false). This is actually the case. If you trace the code of HTMLParser, you can find that the subclass of Node is AbstractNode, which implements the code of toHtml() and directly calls toHtml(false). Among the three subclasses of AbstractNode, RemarkNode, TagNode and TextNode, In the implementation of toHtml(boolean verbatim), the verbatim parameter is not processed, so the results of the three functions are exactly the same. If you don't need to implement any special processing of your own, simply use toHtml.

The above is the detailed explanation of using HTMLParser (2). For more related content, please pay attention to the PHP Chinese website (www.php.cn)!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

How to implement multi-project carousel in Bootstrap 4?Apr 30, 2025 pm 03:24 PM

Solution to implement multi-project carousel in Bootstrap4 Implementing multi-project carousel in Bootstrap4 is not an easy task. Although Bootstrap...

How does deepseek official website achieve the effect of penetrating mouse scroll event?Apr 30, 2025 pm 03:21 PM

How to achieve the effect of mouse scrolling event penetration? When we browse the web, we often encounter some special interaction designs. For example, on deepseek official website, �...

How to modify the playback control style of HTML videoApr 30, 2025 pm 03:18 PM

The default playback control style of HTML video cannot be modified directly through CSS. 1. Create custom controls using JavaScript. 2. Beautify these controls through CSS. 3. Consider compatibility, user experience and performance, using libraries such as Video.js or Plyr can simplify the process.

What problems will be caused by using native select on your phone?Apr 30, 2025 pm 03:15 PM

Potential problems with using native select on mobile phones When developing mobile applications, we often encounter the need for selecting boxes. Normally, developers...

What are the disadvantages of using native select on your phone?Apr 30, 2025 pm 03:12 PM

What are the disadvantages of using native select on your phone? When developing applications on mobile devices, it is very important to choose the right UI components. Many developers...

How to optimize collision handling of third-person roaming in a room using Three.js and Octree?Apr 30, 2025 pm 03:09 PM

Use Three.js and Octree to optimize collision handling of third-person roaming in the room. Use Octree in Three.js to implement third-person roaming in the room and add collisions...

What problems will you encounter when using native select on your phone?Apr 30, 2025 pm 03:06 PM

Issues with native select on mobile phones When developing applications on mobile devices, we often encounter scenarios where users need to make choices. Although native sel...

Why can some websites achieve mouse scrolling and penetration effect, while others cannot?Apr 30, 2025 pm 03:03 PM

Exploring the implementation principle of mouse scrolling events When browsing some websites, you may notice that some page elements still allow scrolling the entire page when the mouse is hovering...

See all articles