在搜索引擎的开发中,我们需要对Html进行解析。本文介绍C#解析HTML的两种方法。
AD:
在搜索引擎的开发中,我们需要对网页的Html内容进行检索,难免的就需要对Html进行解析。拆分每一个节点并且获取节点间的内容。此文介绍两种C#解析Html的方法。
C#解析Html的第一种方法:
用System.Net.WebClient下载Web Page存到本地文件或者String中,用正则表达式来分析。这个方法可以用在Web Crawler等需要分析很多Web Page的应用中。
估计这也是大家最直接,最容易想到的一个方法。
转自网上的一个实例:所有的href都抽取出来:
using System; using System.Net; using System.Text; using System.Text.RegularExpressions; namespace HttpGet { class Class1 { [STAThread] static void Main(string[] args) { System.Net.WebClient client = new WebClient(); byte[] page = client.DownloadData("http://www.google.com"); string content = System.Text.Encoding.UTF8.GetString(page); string regex = "href=[\\\"\\\'](http:\\/\\/|\\.\\/|\\/)?\\w+(\\.\\w+)*(\\/\\w+(\\.\\w+)?)*(\\/|\\?\\w*=\\w*(&\\w*=\\w*)*)?[\\\"\\\']"; Regex re = new Regex(regex); MatchCollection matches = re.Matches(content); System.Collections.IEnumerator enu = matches.GetEnumerator(); while (enu.MoveNext() && enu.Current != null) { Match match = (Match)(enu.Current); Console.Write(match.Value + "\r\n"); } } } }
C#解析Html的第二种方法:
利用Winista.Htmlparser.Net 解析Html。这是.NET平台下解析Html的开源代码,网上有源码下载,百度一下就能搜到,这里就不提供了。并且有英文的帮助文档。找不到的留下邮箱。
个人认为这是.net平台下解析html不错的解决方案,基本上能够满足我们对html的解析工作。
自己做了个实例:
using System; using System.Collections.Generic; using System.ComponentModel; using System.Data; using System.Drawing; using System.Linq; using System.Text; using System.Windows.Forms; using Winista.Text.HtmlParser; using Winista.Text.HtmlParser.Lex; using Winista.Text.HtmlParser.Util; using Winista.Text.HtmlParser.Tags; using Winista.Text.HtmlParser.Filters; namespace HTMLParser { public partial class Form1 : Form { public Form1() { InitializeComponent(); AddUrl(); } private void btnParser_Click(object sender, EventArgs e) { #region 获得网页的html try { txtHtmlWhole.Text = ""; string url = CBUrl.SelectedItem.ToString().Trim(); System.Net.WebClient aWebClient = new System.Net.WebClient(); aWebClient.Encoding = System.Text.Encoding.Default; string html = aWebClient.DownloadString(url); txtHtmlWhole.Text = html; } catch (Exception ex) { MessageBox.Show(ex.Message); } #endregion #region 分析网页html节点 Lexer lexer = new Lexer(this.txtHtmlWhole.Text); Parser parser = new Parser(lexer); NodeList htmlNodes = parser.Parse(null); this.treeView1.Nodes.Clear(); this.treeView1.Nodes.Add("root"); TreeNode treeRoot = this.treeView1.Nodes[0]; for (int i = 0; i < htmlNodes.Count; i++) { this.RecursionHtmlNode(treeRoot, htmlNodes[i], false); } #endregion } private void RecursionHtmlNode(TreeNode treeNode, INode htmlNode, bool siblingRequired) { if (htmlNode == null || treeNode == null) return; TreeNode current = treeNode; TreeNode content ; //current node if (htmlNode is ITag) { ITag tag = (htmlNode as ITag); if (!tag.IsEndTag()) { string nodeString = tag.TagName; if (tag.Attributes != null && tag.Attributes.Count > 0) { if (tag.Attributes["ID"] != null) { nodeString = nodeString + " { id=\"" + tag.Attributes["ID"].ToString() + "\" }"; } if (tag.Attributes["HREF"] != null) { nodeString = nodeString + " { href=\"" + tag.Attributes["HREF"].ToString() + "\" }"; } } current = new TreeNode(nodeString); treeNode.Nodes.Add(current); } } //获取节点间的内容 if (htmlNode.Children != null && htmlNode.Children.Count > 0) { this.RecursionHtmlNode(current, htmlNode.FirstChild, true); content = new TreeNode(htmlNode.FirstChild.GetText()); treeNode.Nodes.Add(content); } //the sibling nodes if (siblingRequired) { INode sibling = htmlNode.NextSibling; while (sibling != null) { this.RecursionHtmlNode(treeNode, sibling, false); sibling = sibling.NextSibling; } } } private void AddUrl() { CBUrl.Items.Add("http://www.hao123.com"); CBUrl.Items.Add("http://www.sina.com"); CBUrl.Items.Add("http://www.heuet.edu.cn"); } } }
运行效果:
实现取来很容易,结合Winista.Htmlparser源码很快就可以实现想要的效果。
小结:
简单介绍了两种C#解析Html的的方法,大家有什么其他好的方法还望指教。
更多Introduction to two methods of parsing HTML under C#相关文章请关注PHP中文网!

C# is not always tied to .NET. 1) C# can run in the Mono runtime environment and is suitable for Linux and macOS. 2) In the Unity game engine, C# is used for scripting and does not rely on the .NET framework. 3) C# can also be used for embedded system development, such as .NETMicroFramework.

C# plays a core role in the .NET ecosystem and is the preferred language for developers. 1) C# provides efficient and easy-to-use programming methods, combining the advantages of C, C and Java. 2) Execute through .NET runtime (CLR) to ensure efficient cross-platform operation. 3) C# supports basic to advanced usage, such as LINQ and asynchronous programming. 4) Optimization and best practices include using StringBuilder and asynchronous programming to improve performance and maintainability.

C# is a programming language released by Microsoft in 2000, aiming to combine the power of C and the simplicity of Java. 1.C# is a type-safe, object-oriented programming language that supports encapsulation, inheritance and polymorphism. 2. The compilation process of C# converts the code into an intermediate language (IL), and then compiles it into machine code execution in the .NET runtime environment (CLR). 3. The basic usage of C# includes variable declarations, control flows and function definitions, while advanced usages cover asynchronous programming, LINQ and delegates, etc. 4. Common errors include type mismatch and null reference exceptions, which can be debugged through debugger, exception handling and logging. 5. Performance optimization suggestions include the use of LINQ, asynchronous programming, and improving code readability.

C# is a programming language, while .NET is a software framework. 1.C# is developed by Microsoft and is suitable for multi-platform development. 2..NET provides class libraries and runtime environments, and supports multilingual. The two work together to build modern applications.

C#.NET is a powerful development platform that combines the advantages of the C# language and .NET framework. 1) It is widely used in enterprise applications, web development, game development and mobile application development. 2) C# code is compiled into an intermediate language and is executed by the .NET runtime environment, supporting garbage collection, type safety and LINQ queries. 3) Examples of usage include basic console output and advanced LINQ queries. 4) Common errors such as empty references and type conversion errors can be solved through debuggers and logging. 5) Performance optimization suggestions include asynchronous programming and optimization of LINQ queries. 6) Despite the competition, C#.NET maintains its important position through continuous innovation.

The future trends of C#.NET are mainly focused on three aspects: cloud computing, microservices, AI and machine learning integration, and cross-platform development. 1) Cloud computing and microservices: C#.NET optimizes cloud environment performance through the Azure platform and supports the construction of an efficient microservice architecture. 2) Integration of AI and machine learning: With the help of the ML.NET library, C# developers can embed machine learning models in their applications to promote the development of intelligent applications. 3) Cross-platform development: Through .NETCore and .NET5, C# applications can run on Windows, Linux and macOS, expanding the deployment scope.

The latest developments and best practices in C#.NET development include: 1. Asynchronous programming improves application responsiveness, and simplifies non-blocking code using async and await keywords; 2. LINQ provides powerful query functions, efficiently manipulating data through delayed execution and expression trees; 3. Performance optimization suggestions include using asynchronous programming, optimizing LINQ queries, rationally managing memory, improving code readability and maintenance, and writing unit tests.

How to build applications using .NET? Building applications using .NET can be achieved through the following steps: 1) Understand the basics of .NET, including C# language and cross-platform development support; 2) Learn core concepts such as components and working principles of the .NET ecosystem; 3) Master basic and advanced usage, from simple console applications to complex WebAPIs and database operations; 4) Be familiar with common errors and debugging techniques, such as configuration and database connection issues; 5) Application performance optimization and best practices, such as asynchronous programming and caching.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SublimeText3 Chinese version
Chinese version, very easy to use

Dreamweaver CS6
Visual web development tools

Atom editor mac version download
The most popular open source editor
