search
HomeWeb Front-endHTML TutorialDetailed explanation of using HTMLParser (1)

Detailed explanation of using HTMLParser (1)

Dec 29, 2016 pm 03:49 PM
htmlparser

In researching the development of search engines, the processing of HTML web pages is a core link. There are many open source codes on the Internet. For Java, HTMLParser is a well-known and widely used one. The homepage of HTMLParser is http://htmlparser.sourceforge.net/, and the last update was version 1.6 in September 2006. But it doesn't matter, the content of HTML has not changed significantly for a long time, and HTMLParser basically has no problem processing it. HTMLParser has the advantages of being compact and fast. The disadvantage is that there are relatively few relevant documents (and few in English), and many functions need to be explored by yourself. For beginners, it still takes some effort, but once you get started, you will find that the structural design of HTMLParser is very clever and very practical, and it can basically meet your various needs.
Here I wrote some introductory stuff based on my experience in the past few months. I hope it will be helpful to friends who are new to HTMLParser. (However, my Chinese language score in the college entrance examination was only one point higher than passing, so I hope everyone will bear with the grammatical issues)

The core module of HTMLParser is the org.htmlparser.Parser class. This class actually completes the processing of HTML pages. analysis work. This class has the following constructors:

public Parser ();
public Parser (Lexer lexer, ParserFeedback fb);
public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser (String resource, ParserFeedback feedback) throws ParserException;
public Parser (String resource) throws ParserException;
public Parser (Lexer lexer);
public Parser (URLConnection connection) throws ParserException;
和一个静态类 public static Parser createParser (String html, String charset);

For most users, the most commonly used method is to initialize Parser through a URLConnection or a string that holds web page content, or use static Function to generate a Parser object. The code of ParserFeedback is very simple and is designed for debugging and tracking analysis processes, and generally does not need to be changed. Using Lexer is a relatively advanced topic and will be discussed later.
The interesting point here is that if you need to set the encoding method of the page, there is only a static function without using Lexer. For most Chinese pages, it seems that this is a method that should be used more often.

The following is an example of initializing Parser.

/**
* @author www.baizeju.com
*/
package com.baizeju.htmlparsertester;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.visitors.TextExtractingVisitor;
import org.htmlparser.Parser;
/**
* @author www.baizeju.com
*/
public class Main {
private static String ENCODE = "GBK";
private static void message( String szMsg ) {
try{ System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch(Exception e ){}
}
public static String openFile( String szFileName ) {
try {
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
String szContent="";
String szTemp;
while ( (szTemp = bis.readLine()) != null) {
szContent+=szTemp+"/n";
}
bis.close();
return szContent;
}
catch( Exception e ) {
return "";
}
}
public static void main(String[] args) {
String szContent = openFile( "E:/My Sites/HTMLParserTester.html");
try{
//Parser parser = Parser.createParser(szContent, ENCODE);
//Parser parser = new Parser( szContent );
Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );
TextExtractingVisitor visitor = new TextExtractingVisitor();
parser.visitAllNodesWith(visitor);
String textInPage = visitor.getExtractedText();
message(textInPage);
}
catch( Exception e ) { 
}
}
}

The emphasized part tests several different initialization methods, and the results are shown later. As long as you can see that Parser can output content, we will discuss how to access Parser content later.

The above is the detailed explanation of using HTMLParser (1). For more related content, please pay attention to the PHP Chinese website (www.php.cn)!


Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
HTML: Is It a Programming Language or Something Else?HTML: Is It a Programming Language or Something Else?Apr 15, 2025 am 12:13 AM

HTMLisnotaprogramminglanguage;itisamarkuplanguage.1)HTMLstructuresandformatswebcontentusingtags.2)ItworkswithCSSforstylingandJavaScriptforinteractivity,enhancingwebdevelopment.

HTML: Building the Structure of Web PagesHTML: Building the Structure of Web PagesApr 14, 2025 am 12:14 AM

HTML is the cornerstone of building web page structure. 1. HTML defines the content structure and semantics, and uses, etc. tags. 2. Provide semantic markers, such as, etc., to improve SEO effect. 3. To realize user interaction through tags, pay attention to form verification. 4. Use advanced elements such as, combined with JavaScript to achieve dynamic effects. 5. Common errors include unclosed labels and unquoted attribute values, and verification tools are required. 6. Optimization strategies include reducing HTTP requests, compressing HTML, using semantic tags, etc.

From Text to Websites: The Power of HTMLFrom Text to Websites: The Power of HTMLApr 13, 2025 am 12:07 AM

HTML is a language used to build web pages, defining web page structure and content through tags and attributes. 1) HTML organizes document structure through tags, such as,. 2) The browser parses HTML to build the DOM and renders the web page. 3) New features of HTML5, such as, enhance multimedia functions. 4) Common errors include unclosed labels and unquoted attribute values. 5) Optimization suggestions include using semantic tags and reducing file size.

Understanding HTML, CSS, and JavaScript: A Beginner's GuideUnderstanding HTML, CSS, and JavaScript: A Beginner's GuideApr 12, 2025 am 12:02 AM

WebdevelopmentreliesonHTML,CSS,andJavaScript:1)HTMLstructurescontent,2)CSSstylesit,and3)JavaScriptaddsinteractivity,formingthebasisofmodernwebexperiences.

The Role of HTML: Structuring Web ContentThe Role of HTML: Structuring Web ContentApr 11, 2025 am 12:12 AM

The role of HTML is to define the structure and content of a web page through tags and attributes. 1. HTML organizes content through tags such as , making it easy to read and understand. 2. Use semantic tags such as, etc. to enhance accessibility and SEO. 3. Optimizing HTML code can improve web page loading speed and user experience.

HTML and Code: A Closer Look at the TerminologyHTML and Code: A Closer Look at the TerminologyApr 10, 2025 am 09:28 AM

HTMLisaspecifictypeofcodefocusedonstructuringwebcontent,while"code"broadlyincludeslanguageslikeJavaScriptandPythonforfunctionality.1)HTMLdefineswebpagestructureusingtags.2)"Code"encompassesawiderrangeoflanguagesforlogicandinteract

HTML, CSS, and JavaScript: Essential Tools for Web DevelopersHTML, CSS, and JavaScript: Essential Tools for Web DevelopersApr 09, 2025 am 12:12 AM

HTML, CSS and JavaScript are the three pillars of web development. 1. HTML defines the web page structure and uses tags such as, etc. 2. CSS controls the web page style, using selectors and attributes such as color, font-size, etc. 3. JavaScript realizes dynamic effects and interaction, through event monitoring and DOM operations.

The Roles of HTML, CSS, and JavaScript: Core ResponsibilitiesThe Roles of HTML, CSS, and JavaScript: Core ResponsibilitiesApr 08, 2025 pm 07:05 PM

HTML defines the web structure, CSS is responsible for style and layout, and JavaScript gives dynamic interaction. The three perform their duties in web development and jointly build a colorful website.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

SublimeText3 English version

SublimeText3 English version

Recommended: Win version, supports code prompts!

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.