HTMLParser stores the parsed information as a tree structure. Node is the basic data type used to store that information.
Here is the definition of Node:
public interface Node extends Cloneable;
Node contains several categories of methods:
Functions for traversing the tree structure, which are the easiest to understand:
Node getParent(): returns the parent node
NodeList getChildren(): returns the list of child nodes
Node getFirstChild(): returns the first child node
Node getLastChild(): returns the last child node
Node getPreviousSibling(): returns the previous sibling node
Node getNextSibling(): returns the next sibling node
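The traversal functions can be combined into a simple depth-first walk. The following is only a minimal sketch: SimpleNode and TraversalSketch are hypothetical stand-ins written for this illustration, not HTMLParser classes, and they merely borrow the method names getParent, getFirstChild and getNextSibling from the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; this is NOT org.htmlparser.Node.
class SimpleNode {
    private final String name;
    private SimpleNode parent;
    private final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name) { this.name = name; }

    // Attach a child and return this node so calls can be chained.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    String getName() { return name; }
    SimpleNode getParent() { return parent; }
    SimpleNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }

    // The next sibling is the node that follows this one in the parent's child list.
    SimpleNode getNextSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
    }
}

class TraversalSketch {
    // Depth-first walk that relies only on getFirstChild and getNextSibling.
    static void walk(SimpleNode node, int depth, List<String> out) {
        out.add(depth + " " + node.getName());
        for (SimpleNode c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    // Builds a small tree shaped like part of the test page and walks it.
    static List<String> outline() {
        SimpleNode logo = new SimpleNode("div#logoindex")
                .add(new SimpleNode("remark"))
                .add(new SimpleNode("a"));
        SimpleNode body = new SimpleNode("body")
                .add(new SimpleNode("div#top_main").add(logo));
        List<String> out = new ArrayList<>();
        walk(body, 0, out);
        return out;
    }

    public static void main(String[] args) {
        outline().forEach(System.out::println);
    }
}
```

Each entry is prefixed with its depth, so the two children of div#logoindex both appear at depth 3.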
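Functions for obtaining the Node content:
String getText(): returns the text
String toPlainTextString(): returns the plain-text content
String toHtml(): returns the HTML (the original HTML)
String toHtml(boolean verbatim): returns the HTML (the original HTML)
String toString(): returns a string representation of the Node
Page getPage(): returns the Page object this Node belongs to
int getStartPosition(): returns the start position of this Node in the HTML page
int getEndPosition(): returns the end position of this Node in the HTML page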
Function used for Filter-based filtering:
void collectInto(NodeList list, NodeFilter filter): filters this node against the conditions of the filter; nodes that match are added to list.
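The idea behind collectInto can be sketched as follows. This is a simplified model that assumes a node tests itself against the filter and then passes the same list and filter on to its children; FilterNode and CollectSketch are hypothetical names, and java.util.function.Predicate stands in for NodeFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in node; it only mimics the shape of collectInto.
class FilterNode {
    final String name;
    final List<FilterNode> children = new ArrayList<>();

    FilterNode(String name, FilterNode... kids) {
        this.name = name;
        for (FilterNode k : kids) children.add(k);
    }

    // Add this node if it matches, then let every child do the same.
    void collectInto(List<FilterNode> list, Predicate<FilterNode> filter) {
        if (filter.test(this)) list.add(this);
        for (FilterNode child : children) child.collectInto(list, filter);
    }
}

class CollectSketch {
    // Collects the names of all nodes whose name starts with "div".
    static List<String> divNames() {
        FilterNode tree = new FilterNode("body",
                new FilterNode("div#top_main",
                        new FilterNode("div#logoindex", new FilterNode("a"))));
        List<FilterNode> hits = new ArrayList<>();
        tree.collectInto(hits, n -> n.name.startsWith("div"));
        List<String> names = new ArrayList<>();
        for (FilterNode n : hits) names.add(n.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(divNames());
    }
}
```

The caller owns the result list, which is why the method returns void: matches accumulate in the list that was passed in.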
Function used for Visitor-based traversal:
void accept(NodeVisitor visitor): applies the visitor to this Node
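The accept function follows the visitor pattern, which can be sketched roughly as below. SketchVisitor, VisitNode, TextLeaf and TagBranch are hypothetical types made up for this illustration; the real NodeVisitor is not reproduced here and should be assumed to differ in its callbacks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor interface with one callback per kind of node.
interface SketchVisitor {
    void visitTag(String tagName);
    void visitText(String text);
}

// Hypothetical node base type: each node decides which callback to invoke.
abstract class VisitNode {
    abstract void accept(SketchVisitor visitor);
}

class TextLeaf extends VisitNode {
    final String text;
    TextLeaf(String text) { this.text = text; }
    @Override void accept(SketchVisitor visitor) { visitor.visitText(text); }
}

class TagBranch extends VisitNode {
    final String tagName;
    final List<VisitNode> children = new ArrayList<>();

    TagBranch(String tagName, VisitNode... kids) {
        this.tagName = tagName;
        for (VisitNode k : kids) children.add(k);
    }

    // Report the tag itself first, then forward the visitor to each child.
    @Override void accept(SketchVisitor visitor) {
        visitor.visitTag(tagName);
        for (VisitNode child : children) child.accept(visitor);
    }
}

class VisitorSketch {
    // Records the order in which the visitor callbacks fire.
    static List<String> events() {
        VisitNode tree = new TagBranch("div",
                new TextLeaf("www.baizeju.com"),
                new TagBranch("a", new TextLeaf("www.baizeju.com")));
        List<String> log = new ArrayList<>();
        tree.accept(new SketchVisitor() {
            @Override public void visitTag(String tagName) { log.add("tag:" + tagName); }
            @Override public void visitText(String text) { log.add("text:" + text); }
        });
        return log;
    }

    public static void main(String[] args) {
        events().forEach(System.out::println);
    }
}
```

Compared with a filter, the visitor does not build a result list; it simply receives a callback for every node it is applied to.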
Functions used to modify content, which are rarely used:
void setPage(Page page): sets the Page object for this Node
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
Other functions:
void doSemanticAction(): performs the action associated with this Node (only a few Tags have one)
Object clone(): the clone function, declared because Node extends Cloneable
In practice, we mostly use HTMLParser to process HTML pages. The Filter or Visitor functions are indispensable, and the first and second categories of functions are used most often. The first category is easy to understand, so let's use an example to illustrate the second.
The following is the HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
    <div id="logoindex">
        <!--这是注释-->
        白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
    </div>
    白泽居-www.baizeju.com
</div>
</body>
</html>
Test code:
/**
 * @author www.baizeju.com
 */
package com.baizeju.htmlparsertester;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;

/**
 * @author www.baizeju.com
 */
public class Main {
    private static String ENCODE = "GBK";

    // Print a message, converting it from ENCODE to the platform's file encoding.
    private static void message( String szMsg ) {
        try {
            System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding")));
        } catch(Exception e ) {
            // Conversion errors are deliberately ignored in this test program.
        }
    }

    // Read a whole file into a String; returns "" if anything goes wrong.
    public static String openFile( String szFileName ) {
        try {
            BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
            String szContent = "";
            String szTemp;

            while ( (szTemp = bis.readLine()) != null) {
                szContent += szTemp + "\n";
            }
            bis.close();
            return szContent;
        }
        catch( Exception e ) {
            return "";
        }
    }

    public static void main(String[] args) {
        try {
            // Parse the test page served by a local web server.
            Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );

            // Iterate over the top-level Nodes and print each content function's result.
            for (NodeIterator i = parser.elements (); i.hasMoreNodes(); ) {
                Node node = i.nextNode();
                message("getText:"+node.getText());
                message("getPlainText:"+node.toPlainTextString());
                message("toHtml:"+node.toHtml());
                message("toHtml(true):"+node.toHtml(true));
                message("toHtml(false):"+node.toHtml(false));
                message("toString:"+node.toString());
                message("=================================================");
            }
        }
        catch( Exception e ) {
            System.out.println( "Exception:"+e );
        }
    }
}
Output result:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (121[0,121],123[1,0]): \n
=================================================
getText:head
getPlainText:白泽居-www.baizeju.com
toHtml:<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toString:HEAD: Tag (123[1,0],129[1,6]): head
Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ...
Tag (197[1,74],204[1,81]): title
Txt (204[1,81],223[1,100]): 白泽居-www.baizeju.com
End (223[1,100],231[1,108]): /title
End (231[1,108],238[1,115]): /head
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (238[1,115],240[2,0]): \n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText: 白泽居-www.baizeju.com 白泽居-www.baizeju.com 白泽居-www.baizeju.com
toHtml:<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml"> <body > <div id="top_main"> <div id="logoindex"> <!--这是注释--> 白泽居-www.baizeju.com <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a> </div> 白泽居-www.baizeju.com </div> </body> </html>
toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283[2,43],285[3,0]): \n
Tag (285[3,0],292[3,7]): body
Txt (292[3,7],294[4,0]): \n
Tag (294[4,0],313[4,19]): div id="top_main"
Txt (313[4,19],316[5,1]): \n\t
Tag (316[5,1],336[5,21]): div id="logoindex"
Txt (336[5,21],340[6,2]): \n\t\t
Rem (340[6,2],351[6,13]): 这是注释
Txt (351[6,13],376[8,0]): \n\t\t白泽居-www.baizeju.com\n
Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com"
Txt (409[8,33],428[8,52]): 白泽居-www.baizeju.com
End (428[8,52],432[8,56]): /a
Txt (432[8,56],435[9,1]): \n\t
End (435[9,1],441[9,7]): /div
Txt (441[9,7],465[11,0]): \n\t白泽居-www.baizeju.com\n
End (465[11,0],471[11,6]): /div
Txt (471[11,6],473[12,0]): \n
End (473[12,0],480[12,7]): /body
Txt (480[12,7],482[13,0]): \n
End (482[13,0],489[13,7]): /html
=================================================
The first Node corresponds to the DOCTYPE declaration on the first line of the file, which is easy to understand.
From this output you can also see the tree structure of the content, or rather the forest structure. The top-level tags in the Page content, such as DOCTYPE, head and html, each form a top-level Node. (Many people may find the second and fourth Nodes a little strange. These two Nodes are in fact two newline characters. HTMLParser converts the line breaks, spaces, tabs and so on in the HTML page into corresponding text Nodes, which is why Nodes like these exist. They hold very little content, but they sit at a high level, haha.)
toPlainTextString includes everything the user can see. One interesting point is that the title content inside the head tag is included in the plain text, as the getPlainText line for the head Node shows.
In addition, you may notice that toHtml(), toHtml(true) and toHtml(false) produce identical results. This really is the case. If you trace the HTMLParser code, you will find that AbstractNode, which implements Node, implements toHtml() by directly calling toHtml(false). The three subclasses of AbstractNode, namely RemarkNode, TagNode and TextNode, do not process the verbatim parameter in their implementations of toHtml(boolean verbatim), so the three functions return exactly the same result. Unless you need to implement some special processing of your own, simply use toHtml().
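The delegation described above can be modelled in a few lines. This is only a sketch of the pattern, not the HTMLParser source: SketchAbstractNode, SketchTextNode and ToHtmlSketch are hypothetical names, and the body of toHtml(boolean) is an assumption chosen to show a parameter that is accepted but never consulted.

```java
// Hypothetical base class that models the delegation: the no-argument
// form simply forwards to the boolean form with verbatim = false.
abstract class SketchAbstractNode {
    String toHtml() { return toHtml(false); }
    abstract String toHtml(boolean verbatim);
}

// Hypothetical subclass whose implementation never looks at verbatim.
class SketchTextNode extends SketchAbstractNode {
    private final String raw;

    SketchTextNode(String raw) { this.raw = raw; }

    // verbatim is ignored, so true and false return the same text.
    @Override String toHtml(boolean verbatim) { return raw; }
}

class ToHtmlSketch {
    // Returns true when all three call forms produce the same string.
    static boolean allEqual(String raw) {
        SketchAbstractNode n = new SketchTextNode(raw);
        String a = n.toHtml();
        String b = n.toHtml(true);
        String c = n.toHtml(false);
        return a.equals(b) && b.equals(c);
    }

    public static void main(String[] args) {
        System.out.println(allEqual("<head></head>"));
    }
}
```

In this model allEqual returns true for any input, which mirrors the identical toHtml lines in the output above.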
