Home >Web Front-end >HTML Tutorial >Detailed explanation of using HTMLParser (2)
HTMLParser saves the parsed information as a tree structure. Node is the basis of data type for information storage.
Please look at the definition of Node:
public interface Node extends Cloneable;
There are several types of methods included in Node:
For functions that traverse tree structures, these functions are the easiest to understand:
Node getParent ():取得父节点 NodeList getChildren ():取得子节点的列表 Node getFirstChild ():取得第一个子节点 Node getLastChild ():取得最后一个子节点 Node getPreviousSibling ():取得前一个兄弟(不好意思,英文是兄弟姐妹,直译太麻烦而且不符合习惯,对不起女同胞了) Node getNextSibling ():取得下一个兄弟节点
Function to obtain Node content:
String getText ():取得文本 String toPlainTextString():取得纯文本信息。 String toHtml () :取得HTML信息(原始HTML) String toHtml (boolean verbatim):取得HTML信息(原始HTML) String toString ():取得字符串信息(原始HTML) Page getPage ():取得这个Node对应的Page对象 int getStartPosition ():取得这个Node在HTML页面中的起始位置 int getEndPosition ():取得这个Node在HTML页面中的结束位置
Function used for Filter filtering:
void collectInto (NodeList list, NodeFilter filter):基于filter的条件对于这个节点进行过滤,符合条件的节点放到list中。
Function used for Visitor traversal:
void accept (NodeVisitor visitor):对这个Node应用visitor
Function used to modify content, this type is rarely used:
void setPage (Page page):设置这个Node对应的Page对象 void setText (String text):设置文本 void setChildren (NodeList children):设置子节点列表
Other functions:
void doSemanticAction ():执行这个Node对应的操作(只有少数Tag有对应的操作) Object clone ():接口Clone的抽象函数。
Actual We use HTMLParser most to process HTML pages. Filter or Visitor related functions are necessary, and the first and second types of functions are the most used. The first type of function is easier to understand. Let’s use an example to illustrate the second type of function.
The following is the HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head> <html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html>
Test code:
/** * @author www.baizeju.com */ package com.baizeju.htmlparsertester; import java.io.BufferedReader; import java.io.InputStreamReader; import java.io.FileInputStream; import java.io.File; import java.net.HttpURLConnection; import java.net.URL; import org.htmlparser.Node; import org.htmlparser.util.NodeIterator; import org.htmlparser.Parser; /** * @author www.baizeju.com */ public class Main { private static String ENCODE = "GBK"; private static void message( String szMsg ) { try{ System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch(Exception e ){} } public static String openFile( String szFileName ) { try { BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) ); String szContent=""; String szTemp; while ( (szTemp = bis.readLine()) != null) { szContent+=szTemp+"/n"; } bis.close(); return szContent; } catch( Exception e ) { return ""; } } public static void main(String[] args) { try{ Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() ); for (NodeIterator i = parser.elements (); i.hasMoreNodes(); ) { Node node = i.nextNode(); message("getText:"+node.getText()); message("getPlainText:"+node.toPlainTextString()); message("toHtml:"+node.toHtml()); message("toHtml(true):"+node.toHtml(true)); message("toHtml(false):"+node.toHtml(false)); message("toString:"+node.toString()); message("================================================="); } } catch( Exception e ) { System.out.println( "Exception:"+e ); } } }
Output result:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" getPlainText: toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121 ================================================= getText: getPlainText: toHtml: toHtml(true): toHtml(false): toString:Txt (121[0,121],123[1,0]): /n ================================================= getText:head getPlainText:白泽居-www.baizeju.com toHtml:<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head> toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head> toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head> toString:HEAD: Tag (123[1,0],129[1,6]): head Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ... Tag (197[1,74],204[1,81]): title Txt (204[1,81],223[1,100]): 白泽居-www.baizeju.com End (223[1,100],231[1,108]): /title End (231[1,108],238[1,115]): /head ================================================= getText: getPlainText: toHtml: toHtml(true): toHtml(false): toString:Txt (238[1,115],240[2,0]): /n ================================================= getText:html xmlns="http://www.w3.org/1999/xhtml" getPlainText: 白泽居-www.baizeju.com 白泽居-www.baizeju.com 白泽居-www.baizeju.com toHtml:<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html> toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html> toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html> toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml" Txt (283[2,43],285[3,0]): /n Tag (285[3,0],292[3,7]): body Txt (292[3,7],294[4,0]): /n Tag (294[4,0],313[4,19]): div id="top_main" Txt (313[4,19],316[5,1]): /n/t Tag (316[5,1],336[5,21]): div id="logoindex" Txt (336[5,21],340[6,2]): /n/t/t Rem (340[6,2],351[6,13]): 这是注释 Txt (351[6,13],376[8,0]): /n/t/t白泽居-www.baizeju.com/n Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com" Txt (409[8,33],428[8,52]): 白泽居-www.baizeju.com End (428[8,52],432[8,56]): /a Txt (432[8,56],435[9,1]): /n/t End (435[9,1],441[9,7]): /div Txt (441[9,7],465[11,0]): /n/t白泽居-www.baizeju.com/n End (465[11,0],471[11,6]): /div Txt (471[11,6],473[12,0]): /n End (473[12,0],480[12,7]): /body Txt (480[12,7],482[13,0]): /n End (482[13,0],489[13,7]): /html
== ===============================================
For the content of the first Node, the corresponding line is 2b68512eae77256b3a72fc9a402719f3, this is easier to understand.
From this output result, you can also see the tree structure of the content. Or rather the structure of the woods. The first-level tags in the Page content, such as DOCTYPE, head and html, respectively form a top-level Node node (many people may be a little strange about the content of the second and fourth Node. In fact, these two Nodes are Two newline symbols. HTMLParser converts all line breaks, spaces, tabs, etc. in the HTML page content into corresponding Tags, so there is a Node like this. Although it has less content, it has a high level, haha)
getPlainTextString is Everything the user can see is included. There are two interesting points. One is that the Title content in the 93f0f5c25f18dab9d176bd4f6de5d30e tag is in plainText, so it may be visible even if it is visible in the title. In addition, as mentioned earlier, line breaks and other characters in the HTML content have also become plainText. This seems to be a bit of a logical problem.
In addition, you may find that there is no difference between the results of toHtml, toHtml(true) and toHtml(false). This is actually the case. If you trace the code of HTMLParser, you can find that the subclass of Node is AbstractNode, which implements the code of toHtml() and directly calls toHtml(false). Among the three subclasses of AbstractNode, RemarkNode, TagNode and TextNode, In the implementation of toHtml(boolean verbatim), the verbatim parameter is not processed, so the results of the three functions are exactly the same. If you don't need to implement any special processing of your own, simply use toHtml.
The above is the detailed explanation of using HTMLParser (2). For more related content, please pay attention to the PHP Chinese website (www.php.cn)!