Example of using C# to obtain the HTML source code of a web page

高洛峰 (Original)
2017-01-14 13:29:51

I have been working on a project recently, and one of its features is fetching the source code of a web page from a URL. In ASP.NET (C#) there are several ways to do this. I went with WebClient, which is simple and convenient. An annoying problem soon appeared, though: Chinese characters came out garbled.

After some investigation, I found that Chinese web pages are, for the most part, encoded in either GB2312 or UTF-8. That leads to the following code:

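       // Requires: using System; using System.Net;

       /// <summary>
       /// Gets the HTML source code of the page at the given URL.
       /// </summary>
       /// <param name="url">Address of the page to download.</param>
       /// <returns>The decoded HTML, or null if the download fails.</returns>
       public static string GetHtmlByUrl(string url)
       {
           using (WebClient wc = new WebClient())
           {
               try
               {
                   // Authenticate with the current user's default credentials.
                   wc.UseDefaultCredentials = true;
                   wc.Proxy = new WebProxy();
                   wc.Proxy.Credentials = CredentialCache.DefaultCredentials;

                   // Download the raw bytes so the encoding can be chosen afterwards.
                   byte[] bt = wc.DownloadData(url);

                   // Decode as GB2312 first; the ASCII meta tag is readable either way.
                   string txt = System.Text.Encoding.GetEncoding("GB2312").GetString(bt);

                   // Decode again if the page declares a different encoding.
                   switch (GetCharset(txt).Trim().ToUpperInvariant())
                   {
                       case "UTF-8":
                           txt = System.Text.Encoding.UTF8.GetString(bt);
                           break;
                       case "UNICODE":
                           txt = System.Text.Encoding.Unicode.GetString(bt);
                           break;
                       default:
                           // Keep the GB2312 result.
                           break;
                   }
                   return txt;
               }
               catch (Exception)
               {
                   // Any failure (network error, invalid URL, ...) yields null.
                   return null;
               }
           }
       }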

A brief explanation: a WebClient object named wc is created (not the most elegant name), and its DownloadData method is called with the URL, which returns a byte array. The byte array is first decoded as GB2312 into a string. GetCharset then searches that string for the encoding the page declares, such as charset="utf-8". This works because the declaration consists of ASCII characters, which decode the same way under GB2312 and UTF-8. If the declared encoding is UTF-8 or Unicode, the byte array is decoded again with that encoding; otherwise the GB2312 result is kept.
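The switch statement above handles only two charset names. As a rough sketch of how the same "detect, then decode again" step might be generalized, the hypothetical helper below (ResolveEncoding and Decode are illustrative names, not part of the original code) asks Encoding.GetEncoding for whatever name was found and falls back to a caller-supplied encoding when the name is missing or unsupported:

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: maps a charset name found in the page to an
    // Encoding, or returns the fallback when the name is empty or unknown.
    public static Encoding ResolveEncoding(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(charsetName.Trim());
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: keep the fallback.
            return fallback;
        }
    }

    // Decodes the downloaded bytes with the resolved encoding.
    public static string Decode(byte[] bytes, string charsetName, Encoding fallback)
    {
        return ResolveEncoding(charsetName, fallback).GetString(bytes);
    }
}
```

In GetHtmlByUrl this could replace the switch with something like EncodingSketch.Decode(bt, GetCharset(txt), System.Text.Encoding.GetEncoding("GB2312")). Note that on .NET Core and later, GB2312 is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); on the .NET Framework it is built in.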

The GetCharset function extracts the declared encoding from the page source. Its code is as follows:

       // Requires: using System.Text.RegularExpressions;

       /// <summary>
       /// Gets the charset declared in the HTML.
       /// </summary>
       /// <param name="html">HTML source to search.</param>
       /// <returns>The declared charset, or an empty string if none is found.</returns>
       public static string GetCharset(string html)
       {
           // Form 1: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
           // The lazy, quote-bounded match stays inside a single content attribute.
           Regex regCharset = new Regex(@"content\s*=\s*[""'][^""'>]*?charset\s*=\s*(?<charset>[^""'\s;>]+)", RegexOptions.IgnoreCase);
           Match match = regCharset.Match(html);
           if (!match.Success)
           {
               // Form 2: <meta charset="utf-8">
               regCharset = new Regex(@"<\s*meta\s+charset\s*=\s*[""']?(?<charset>[^""'\s/>]+)", RegexOptions.IgnoreCase);
               match = regCharset.Match(html);
           }
           return match.Success ? match.Groups["charset"].Value : "";
       }
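
GetCharset relies on the page declaring its encoding in a meta tag. Servers often declare it in the Content-Type response header as well, so a complementary check is possible. The sketch below is an assumption layered on top of the article's approach, not part of it: CharsetFromContentType is a hypothetical helper that uses System.Net.Mime.ContentType to pull the charset parameter out of a header value.

```csharp
using System;
using System.Net.Mime;

public static class HeaderCharsetSketch
{
    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=utf-8".
    // Returns an empty string when the header is missing, malformed,
    // or carries no charset parameter.
    public static string CharsetFromContentType(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return "";
        }
        try
        {
            return new ContentType(contentType).CharSet ?? "";
        }
        catch (FormatException)
        {
            // Malformed header value: report no charset.
            return "";
        }
    }
}
```

After wc.DownloadData(url) returns, the header value should be available as wc.ResponseHeaders["Content-Type"], so one option is to try HeaderCharsetSketch.CharsetFromContentType on it first and fall back to GetCharset when it returns an empty string.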

