Detailed explanation of using HTMLParser (1)-HTML Tutorial-php.cn

Home

Web Front-end

HTML Tutorial

Detailed explanation of using HTMLParser (1)

黄舟

Dec 29, 2016 pm 03:49 PM

htmlparser

In researching the development of search engines, the processing of HTML web pages is a core link. There are many open source codes on the Internet. For Java, HTMLParser is a well-known and widely used one. The homepage of HTMLParser is http://htmlparser.sourceforge.net/, and the last update was version 1.6 in September 2006. But it doesn't matter, the content of HTML has not changed significantly for a long time, and HTMLParser basically has no problem processing it. HTMLParser has the advantages of being compact and fast. The disadvantage is that there are relatively few relevant documents (and few in English), and many functions need to be explored by yourself. For beginners, it still takes some effort, but once you get started, you will find that the structural design of HTMLParser is very clever and very practical, and it can basically meet your various needs.
Here I wrote some introductory stuff based on my experience in the past few months. I hope it will be helpful to friends who are new to HTMLParser. (However, my Chinese language score in the college entrance examination was only one point higher than passing, so I hope everyone will bear with the grammatical issues)

The core module of HTMLParser is the org.htmlparser.Parser class. This class actually completes the processing of HTML pages. analysis work. This class has the following constructors:

public Parser ();
public Parser (Lexer lexer, ParserFeedback fb);
public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
public Parser (String resource, ParserFeedback feedback) throws ParserException;
public Parser (String resource) throws ParserException;
public Parser (Lexer lexer);
public Parser (URLConnection connection) throws ParserException;
和一个静态类 public static Parser createParser (String html, String charset);

For most users, the most commonly used method is to initialize Parser through a URLConnection or a string that holds web page content, or use static Function to generate a Parser object. The code of ParserFeedback is very simple and is designed for debugging and tracking analysis processes, and generally does not need to be changed. Using Lexer is a relatively advanced topic and will be discussed later.
The interesting point here is that if you need to set the encoding method of the page, there is only a static function without using Lexer. For most Chinese pages, it seems that this is a method that should be used more often.

The following is an example of initializing Parser.

/**
* @author www.baizeju.com
*/
package com.baizeju.htmlparsertester;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.visitors.TextExtractingVisitor;
import org.htmlparser.Parser;
/**
* @author www.baizeju.com
*/
public class Main {
private static String ENCODE = "GBK";
private static void message( String szMsg ) {
try{ System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding"))); } catch(Exception e ){}
}
public static String openFile( String szFileName ) {
try {
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
String szContent="";
String szTemp;
while ( (szTemp = bis.readLine()) != null) {
szContent+=szTemp+"/n";
}
bis.close();
return szContent;
}
catch( Exception e ) {
return "";
}
}
public static void main(String[] args) {
String szContent = openFile( "E:/My Sites/HTMLParserTester.html");
try{
//Parser parser = Parser.createParser(szContent, ENCODE);
//Parser parser = new Parser( szContent );
Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );
TextExtractingVisitor visitor = new TextExtractingVisitor();
parser.visitAllNodesWith(visitor);
String textInPage = visitor.getExtractedText();
message(textInPage);
}
catch( Exception e ) { 
}
}
}

The emphasized part tests several different initialization methods, and the results are shown later. As long as you can see that Parser can output content, we will discuss how to access Parser content later.

The above is the detailed explanation of using HTMLParser (1). For more related content, please pay attention to the PHP Chinese website (www.php.cn)!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

What is the difference between an HTML tag and an HTML attribute?May 14, 2025 am 12:01 AM

HTMLtagsdefinethestructureofawebpage,whileattributesaddfunctionalityanddetails.1)Tagslike,,andoutlinethecontent'splacement.2)Attributessuchassrc,class,andstyleenhancetagsbyspecifyingimagesources,styling,andmore,improvingfunctionalityandappearance.

The Future of HTML: Evolution and TrendsMay 13, 2025 am 12:01 AM

The future of HTML will develop in a more semantic, functional and modular direction. 1) Semanticization will make the tag describe the content more clearly, improving SEO and barrier-free access. 2) Functionalization will introduce new elements and attributes to meet user needs. 3) Modularity will support component development and improve code reusability.

Why are HTML attributes important for web development?May 12, 2025 am 12:01 AM

HTMLattributesarecrucialinwebdevelopmentforcontrollingbehavior,appearance,andfunctionality.Theyenhanceinteractivity,accessibility,andSEO.Forexample,thesrcattributeintagsimpactsSEO,whileonclickintagsaddsinteractivity.Touseattributeseffectively:1)Usese

What is the purpose of the alt attribute? Why is it important?May 11, 2025 am 12:01 AM

The alt attribute is an important part of the tag in HTML and is used to provide alternative text for images. 1. When the image cannot be loaded, the text in the alt attribute will be displayed to improve the user experience. 2. Screen readers use the alt attribute to help visually impaired users understand the content of the picture. 3. Search engines index text in the alt attribute to improve the SEO ranking of web pages.

HTML, CSS, and JavaScript: Examples and Practical ApplicationsMay 09, 2025 am 12:01 AM

The roles of HTML, CSS and JavaScript in web development are: 1. HTML is used to build web page structure; 2. CSS is used to beautify the appearance of web pages; 3. JavaScript is used to achieve dynamic interaction. Through tags, styles and scripts, these three together build the core functions of modern web pages.

How do you set the lang attribute on the tag? Why is this important?May 08, 2025 am 12:03 AM

Setting the lang attributes of a tag is a key step in optimizing web accessibility and SEO. 1) Set the lang attribute in the tag, such as. 2) In multilingual content, set lang attributes for different language parts, such as. 3) Use language codes that comply with ISO639-1 standards, such as "en", "fr", "zh", etc. Correctly setting the lang attribute can improve the accessibility of web pages and search engine rankings.

What is the purpose of HTML attributes?May 07, 2025 am 12:01 AM

HTMLattributesareessentialforenhancingwebelements'functionalityandappearance.Theyaddinformationtodefinebehavior,appearance,andinteraction,makingwebsitesinteractive,responsive,andvisuallyappealing.Attributeslikesrc,href,class,type,anddisabledtransform

How do you create a list in HTML?May 06, 2025 am 12:01 AM

TocreatealistinHTML,useforunorderedlistsandfororderedlists:1)Forunorderedlists,wrapitemsinanduseforeachitem,renderingasabulletedlist.2)Fororderedlists,useandfornumberedlists,customizablewiththetypeattributefordifferentnumberingstyles.

See all articles