Home >Web Front-end >JS Tutorial >How to Retrieve Dynamically Generated HTML Code Using .NET without Limitations?
Introduction
Retrieving HTML code dynamically generated by web pages is a common task in web automation and scraping scenarios. .NET provides two options for achieving this: the System.Windows.Forms.WebBrowser class and the mshtml.HTMLDocument interface. However, using them effectively can be challenging.
The WebBrowser Class
The System.Windows.Forms.WebBrowser class is designed to embed web pages within your application. While it supports custom navigation and document events, it's limited in its ability to capture dynamically generated HTML.
The following code snippet illustrates using WebBrowser:
<code class="csharp">using System.Windows.Forms; using mshtml; namespace WebBrowserTest { public class Program { public static void Main() { WebBrowser wb = new WebBrowser(); wb.Navigate("https://www.google.com/#q=where+am+i"); wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e) { mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument; foreach (IHTMLElement element in doc.all) { System.Diagnostics.Debug.WriteLine(element.outerHTML); } }; Form f = new Form(); f.Controls.Add(wb); Application.Run(f); } } }</code>
The mshtml.HTMLDocument Interface
The mshtml.HTMLDocument interface provides direct access to the underlying HTML document object. However, it requires manual navigation and rendering, making it less convenient for dynamically generated content.
The following code snippet illustrates using mshtml.HTMLDocument:
<code class="csharp">using mshtml; namespace HTMLDocumentTest { public class Program { public static void Main() { mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument(); doc.write(new System.Net.WebClient().DownloadString("https://www.google.com/#q=where+am+i")); foreach (IHTMLElement e in doc.all) { System.Diagnostics.Debug.WriteLine(e.outerHTML); } } } }</code>
A More Robust Approach
To overcome the limitations of WebBrowser and mshtml.HTMLDocument, you can use the following approach:
Example Code
The following C# code demonstrates this approach:
<code class="csharp">using System; using System.ComponentModel; using System.Threading; using System.Threading.Tasks; using System.Windows.Forms; using mshtml; namespace DynamicHTMLFetcher { public partial class MainForm : Form { public MainForm() { InitializeComponent(); this.webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted; this.Load += MainForm_Load; } private async void MainForm_Load(object sender, EventArgs e) { try { var cts = new CancellationTokenSource(10000); // cancel in 10s var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token); MessageBox.Show(html.Substring(0, 1024) + "..."); // it's too long! } catch (Exception ex) { MessageBox.Show(ex.Message); } } private async Task<string> LoadDynamicPage(string url, CancellationToken token) { var tcs = new TaskCompletionSource<bool>(); WebBrowserDocumentCompletedEventHandler handler = (s, arg) => tcs.TrySetResult(true); using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true)) { this.webBrowser.DocumentCompleted += handler; try { this.webBrowser.Navigate(url); await tcs.Task; // wait for DocumentCompleted } finally { this.webBrowser.DocumentCompleted -= handler; } } var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0]; var html = documentElement.OuterHtml; while (true) { await Task.Delay(500, token); if (this.webBrowser.IsBusy) continue; var htmlNow = documentElement.OuterHtml; if (html == htmlNow) break; // no changes detected, end the poll loop html = htmlNow; } token.ThrowIfCancellationRequested(); return html; } private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) { // Intentional no-op handler, we receive the DocumentCompleted event in the calling method. } } }</code>
This approach ensures that you obtain the fully rendered HTML code even if it is dynamically generated.
The above is the detailed content of How to Retrieve Dynamically Generated HTML Code Using .NET without Limitations?. For more information, please follow other related articles on the PHP Chinese website!