Introduction to two methods of parsing HTML under C#

高洛峰

Jan 13, 2017 pm 05:21 PM

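When developing a search engine, we need to search the HTML content of web pages, which inevitably means parsing the HTML: splitting it into individual nodes and retrieving the content between them. This article introduces two methods of parsing HTML in C#.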

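The first method of parsing HTML in C#:
Use System.Net.WebClient to download the web page into a local file or a string, then analyze it with regular expressions. This method suits applications such as web crawlers that need to analyze many web pages.
It is probably also the most direct method, and the first one most people think of.
Here is an example taken from the web that extracts every href: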

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace HttpGet
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            // Download the raw bytes of the page and decode them as UTF-8.
            string content;
            using (WebClient client = new WebClient())
            {
                byte[] page = client.DownloadData("http://www.google.com");
                content = Encoding.UTF8.GetString(page);
            }

            // Match quoted href attributes whose value starts with http://, ./, / or a bare name.
            string pattern = @"href=[""'](http://|\./|/)?\w+(\.\w+)*(/\w+(\.\w+)?)*(/|\?\w*=\w*(&\w*=\w*)*)?[""']";
            Regex re = new Regex(pattern);

            // Print every match, one per line.
            foreach (Match match in re.Matches(content))
            {
                Console.WriteLine(match.Value);
            }
        }
    }
}
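
The matches printed by the example above still include the href= prefix and the surrounding quotes. As a minimal sketch of how a capture group can return only the URL, the following uses a looser pattern than the one in the article (any text between matching quotes). The HrefExtractor class, the ExtractHrefs helper, and the sample markup are all invented for illustration, and the HTML is hard-coded so the sketch runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns only the URL part of each quoted href attribute.
    public static List<string> ExtractHrefs(string html)
    {
        // Group "q" remembers the opening quote; group "url" captures the text up to
        // the same closing quote (\k<q> is a backreference to the opening quote).
        var re = new Regex(@"href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>", RegexOptions.IgnoreCase);
        var urls = new List<string>();
        foreach (Match m in re.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string html = "<a href=\"http://example.com/a\">A</a> <a href='/b?x=1'>B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Like any regular-expression approach, this only sees the text of the page, so it can also match href-like strings inside comments or scripts. The parser-based method below avoids that.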

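The second method of parsing HTML in C#:
Use Winista.Htmlparser.Net to parse the HTML. It is an open-source HTML parser for the .NET platform. The source code can be downloaded online and is easy to find with a search, so no link is given here. It also comes with English help documentation. If you cannot find it, leave your email address.

In my opinion this is a good solution for parsing HTML on the .NET platform, and it basically covers our HTML parsing needs.
Here is an example I wrote: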

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Util;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Filters;

namespace HTMLParser
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            AddUrl();
        }

        private void btnParser_Click(object sender, EventArgs e)
        {
            #region Fetch the HTML of the page
            try
            {
                txtHtmlWhole.Text = "";
                string url = CBUrl.SelectedItem.ToString().Trim();
                System.Net.WebClient aWebClient = new System.Net.WebClient();
                aWebClient.Encoding = System.Text.Encoding.Default;
                string html = aWebClient.DownloadString(url);
                txtHtmlWhole.Text = html;
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
            #endregion

            #region Parse the HTML nodes of the page
            Lexer lexer = new Lexer(this.txtHtmlWhole.Text);
            Parser parser = new Parser(lexer);
            NodeList htmlNodes = parser.Parse(null);
            this.treeView1.Nodes.Clear();
            this.treeView1.Nodes.Add("root");
            TreeNode treeRoot = this.treeView1.Nodes[0];
            for (int i = 0; i < htmlNodes.Count; i++)
            {
                this.RecursionHtmlNode(treeRoot, htmlNodes[i], false);
            }
            #endregion
        }

        private void RecursionHtmlNode(TreeNode treeNode, INode htmlNode, bool siblingRequired)
        {
            if (htmlNode == null || treeNode == null) return;

            TreeNode current = treeNode;
            TreeNode content;

            // The current node: add a tree entry for every opening tag.
            if (htmlNode is ITag)
            {
                ITag tag = (htmlNode as ITag);
                if (!tag.IsEndTag())
                {
                    string nodeString = tag.TagName;
                    if (tag.Attributes != null && tag.Attributes.Count > 0)
                    {
                        if (tag.Attributes["ID"] != null)
                        {
                            nodeString = nodeString + " { id=\"" + tag.Attributes["ID"].ToString() + "\" }";
                        }
                        if (tag.Attributes["HREF"] != null)
                        {
                            nodeString = nodeString + " { href=\"" + tag.Attributes["HREF"].ToString() + "\" }";
                        }
                    }

                    current = new TreeNode(nodeString);
                    treeNode.Nodes.Add(current);
                }
            }

            // Get the content between nodes: recurse into the first child, then add its text.
            if (htmlNode.Children != null && htmlNode.Children.Count > 0)
            {
                this.RecursionHtmlNode(current, htmlNode.FirstChild, true);
                content = new TreeNode(htmlNode.FirstChild.GetText());
                treeNode.Nodes.Add(content);
            }

            // The sibling nodes
            if (siblingRequired)
            {
                INode sibling = htmlNode.NextSibling;
                while (sibling != null)
                {
                    this.RecursionHtmlNode(treeNode, sibling, false);
                    sibling = sibling.NextSibling;
                }
            }
        }

        private void AddUrl()
        {
            CBUrl.Items.Add("http://www.hao123.com");
            CBUrl.Items.Add("http://www.sina.com");
            CBUrl.Items.Add("http://www.heuet.edu.cn");
        }
    }
}

Result when run:

[Screenshot: Introduction to two methods of parsing HTML under C#]

It is easy to implement. Working from the Winista.Htmlparser source code, you can quickly get the result you want.
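
The core idea in RecursionHtmlNode is a depth-first walk over the parsed node tree. As a rough sketch of just that idea, without the Windows Forms or Winista.Htmlparser dependencies, the following walks a small hand-built tree and indents each node by its depth. The SimpleNode and TreeWalk types are hypothetical stand-ins invented for this illustration. They are not part of Winista.Htmlparser, and the traversal is simplified to children only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed HTML node; not a Winista type.
public class SimpleNode
{
    public string Name;
    public List<SimpleNode> Children = new List<SimpleNode>();

    public SimpleNode(string name, params SimpleNode[] children)
    {
        Name = name;
        Children.AddRange(children);
    }
}

public static class TreeWalk
{
    // Depth-first walk: record the node, then recurse into each child one level deeper.
    public static void Walk(SimpleNode node, int depth, List<string> lines)
    {
        if (node == null) return;
        lines.Add(new string(' ', depth * 2) + node.Name);
        foreach (SimpleNode child in node.Children)
        {
            Walk(child, depth + 1, lines);
        }
    }

    public static void Main()
    {
        // A tiny tree built by hand for this sketch.
        var root = new SimpleNode("html",
            new SimpleNode("head", new SimpleNode("title")),
            new SimpleNode("body", new SimpleNode("a"), new SimpleNode("p")));

        var lines = new List<string>();
        Walk(root, 0, lines);
        foreach (string line in lines) Console.WriteLine(line);
    }
}
```

In the article's example the same role is played by TreeNode entries in the TreeView instead of indented lines.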

Summary:
This article briefly introduced two methods of parsing HTML in C#. If you know of other good methods, please share them.
