Example of using C# to obtain the HTML source code of a web page
I've been working on a project recently, and one of its features is fetching the source code of a web page from a URL. ASP.NET (C#) offers several ways to do this; I went with a simple WebClient call, which is easy to use. An annoying problem soon appeared, though: the Chinese characters came out garbled.
After some investigation, I found that Chinese web pages are almost always encoded as either GB2312 or UTF-8, which led to the following code:
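```csharp
// Place this method inside a class of your own.
using System;
using System.Net;
using System.Text;

/// <summary>
/// Downloads the HTML source of the page at the given URL.
/// </summary>
/// <param name="url">Address of the page to download.</param>
/// <returns>The decoded HTML, or null if the download fails.</returns>
public static string GetHtmlByUrl(string url)
{
    using (WebClient wc = new WebClient())
    {
        try
        {
            // Use the current user's credentials for the server and the proxy.
            wc.UseDefaultCredentials = true;
            wc.Proxy = new WebProxy();
            wc.Proxy.Credentials = CredentialCache.DefaultCredentials;
            wc.Credentials = CredentialCache.DefaultCredentials;

            // Download raw bytes so the encoding can be chosen afterwards.
            byte[] bt = wc.DownloadData(url);

            // First pass: decode as GB2312, then look for a declared charset.
            // On .NET Core / .NET 5+, GB2312 is only available after calling
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            string txt = Encoding.GetEncoding("GB2312").GetString(bt);

            switch (GetCharset(txt).ToUpper())
            {
                case "UTF-8":
                    txt = Encoding.UTF8.GetString(bt);
                    break;
                case "UNICODE":
                    txt = Encoding.Unicode.GetString(bt);
                    break;
                default:
                    break; // keep the GB2312 result
            }
            return txt;
        }
        catch (Exception)
        {
            // Any failure (network error, bad URL, missing encoding) yields null.
            return null;
        }
    }
}
```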
A brief explanation: the code creates a WebClient object named wc (not the most elegant name), calls its DownloadData method with the URL, and gets back a byte array. The bytes are first decoded as GB2312. The resulting string is then searched for the page's encoding declaration, such as charset="utf-8", to determine the page's actual encoding. This works because the declaration consists of ASCII characters, which read the same under GB2312 and UTF-8. If the page declares UTF-8 or Unicode, the bytes are decoded again with that encoding; otherwise the GB2312 result is kept.
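To see why decoding as GB2312 first is usually good enough for finding the declaration, here is a minimal sketch. The class and method names (ProbeDecodeSketch, ProbeDecode, Demo) are made up for illustration, and the code assumes .NET Core 3.0 or later, where GB2312 has to be registered through CodePagesEncodingProvider before use.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original article's code.
public static class ProbeDecodeSketch
{
    // Decodes raw page bytes as GB2312 purely to read the ASCII markup.
    public static string ProbeDecode(byte[] raw)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where GB2312 is not
        // available until the code pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312").GetString(raw);
    }

    public static void Demo()
    {
        // A tiny UTF-8 page: ASCII markup followed by Chinese text.
        byte[] raw = Encoding.UTF8.GetBytes(
            "<meta charset=\"utf-8\"><p>中文</p>");

        // The probe garbles the Chinese text, but the ASCII declaration
        // at the start of the page is still readable.
        string probe = ProbeDecode(raw);
        Console.WriteLine(probe.Contains("charset=\"utf-8\""));

        // Decoding the same bytes again as UTF-8 restores the text.
        Console.WriteLine(Encoding.UTF8.GetString(raw));
    }
}
```

The probe string is only used to look for the declaration; the text that is finally returned comes from decoding the original bytes a second time.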
The GetCharset function is used to obtain the encoding format of the current web page. The specific code is as follows:
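```csharp
using System.Text.RegularExpressions;

/// <summary>
/// Extracts the charset declared in an HTML document.
/// </summary>
/// <param name="html">HTML source to inspect.</param>
/// <returns>The declared charset, or an empty string if none is found.</returns>
public static string GetCharset(string html)
{
    string charset = "";

    // Form 1: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    // [^>]*? keeps the match inside one tag; the original greedy .* could
    // run on to a later "charset=" on the same line.
    Regex regCharset = new Regex(
        @"content=[""'][^>]*?charset\b\s*=\s*""?(?<charset>[^""']*)",
        RegexOptions.IgnoreCase);
    Match match = regCharset.Match(html);
    if (match.Success)
    {
        charset = match.Groups["charset"].Value;
    }

    if (charset.Equals(""))
    {
        // Form 2: <meta charset="utf-8"> (quotes are optional in HTML5).
        regCharset = new Regex(
            @"<\s*meta\s+charset\s*=\s*[""']?(?<charset>[^""'\s>]*)",
            RegexOptions.IgnoreCase);
        match = regCharset.Match(html);
        if (match.Success)
        {
            charset = match.Groups["charset"].Value;
        }
    }
    return charset;
}
```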
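GetCharset returns whatever name the page declares, while GetHtmlByUrl only acts on UTF-8 and UNICODE. One possible way to handle other declared names is to pass them to Encoding.GetEncoding and fall back when the name is not recognized. The sketch below is an illustration under that assumption, not the article's method; EncodingResolverSketch and ResolveEncoding are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original article's code.
public static class EncodingResolverSketch
{
    // Maps a declared charset name to an Encoding, or returns the
    // fallback when the name is empty or not supported.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // GetEncoding matches names case-insensitively.
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Thrown when the name is not a supported encoding.
            return fallback;
        }
    }
}
```

With a helper like this, the switch in GetHtmlByUrl could be replaced by something like ResolveEncoding(GetCharset(txt), Encoding.GetEncoding("GB2312")).GetString(bt), though which encodings are available depends on the runtime.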