A Detailed Code Walkthrough of a C# Web Crawler and Search Engine Experiment

黄舟

Mar 03, 2017, 01:12 PM

Result page (screenshot not included):

General idea:

Start from an entry link, for example www.sina.com.cn, and crawl outward from it. For each page, parse the content and check whether it contains the keyword the user entered; if it does, put the link and the relevant page text into a cache. The links found on the page are put into a cache as well, and the process repeats recursively.

The task is fairly simple, so here is my own summary.

Start 10 threads at the same time, each with its own link-pool cache. Every link whose page contains the keyword goes into one shared cache. A service page is polled at a fixed interval and displays the current results. This is only a simulation: a real search engine first segments the query into words, then analyzes page content and saves the matching pages and links to files, and later searches are answered from those files while its crawlers run 24 hours a day. Let's look at the implementation.
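
The crawl idea described above can be tried out without any networking. The following is a minimal sketch, assuming the "web" is an in-memory dictionary from URL to page; the `Page` type, `CrawlSketch.Crawl`, and the `maxPages` parameter are hypothetical names, not part of the original project. It uses a queue instead of recursion, which is a deliberate substitution to keep the example short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a fetched page: its text and outgoing links.
public class Page
{
    public string Text;
    public List<string> Links = new List<string>();
}

public static class CrawlSketch
{
    // Visits pages reachable from startUrl (at most maxPages of them) and
    // returns the URLs whose text contains the keyword, in visiting order.
    public static List<string> Crawl(Dictionary<string, Page> web, string startUrl,
                                     string keyword, int maxPages)
    {
        var hits = new List<string>();
        var seen = new HashSet<string> { startUrl };
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);
        int visited = 0;

        while (pending.Count > 0 && visited < maxPages)
        {
            string url = pending.Dequeue();
            Page page;
            if (!web.TryGetValue(url, out page))
            {
                continue; // stands in for a failed request
            }
            visited++;

            if (page.Text.IndexOf(keyword, StringComparison.Ordinal) >= 0)
            {
                hits.Add(url);
            }

            foreach (string link in page.Links)
            {
                // HashSet.Add returns false for a URL that was already queued
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }
        return hits;
    }
}
```

The `seen` set is what keeps the crawl from revisiting a page when two pages link to each other.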

Entity class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Threading;

namespace SpiderDemo.Entity
{
    // Crawler thread: a worker thread together with its own pool of links to visit
    public class ClamThread
    {
        public Thread _thread { get; set; }
        public List<Link> lnkPool { get; set; }
    }

    // A crawled link
    public class Link
    {
        public string Href { get; set; }
        public string LinkName { get; set; }
        public string Context { get; set; }

        public int TheadId { get; set; }
    }
}
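
Whether a link is already in a pool depends on how two `Link` objects are compared. `List<T>.Contains` uses the element type's `Equals`, which for a plain class is reference equality, so two separately created objects with the same `Href` count as different. A minimal sketch, assuming links should be treated as equal when their `Href` strings match exactly (that rule and the `HrefLink` class are assumptions for illustration, not part of the original code):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant of Link that compares by Href only.
public class HrefLink : IEquatable<HrefLink>
{
    public string Href { get; set; }
    public string LinkName { get; set; }

    public bool Equals(HrefLink other)
    {
        return other != null && string.Equals(Href, other.Href, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as HrefLink);
    }

    // Equal objects must produce equal hash codes, so hash the same field
    public override int GetHashCode()
    {
        return Href == null ? 0 : Href.GetHashCode();
    }
}
```

With this in place, `pool.Contains(new HrefLink { Href = "http://www.sina.com.cn" })` finds an earlier entry with the same address even though it is a different object.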

Cache class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using SpiderDemo.Entity;
using System.Threading;

namespace SpiderDemo.SearchUtil
{
    public static class CacheHelper
    {
        public static bool EnableSearch;

        /// <summary>
        /// Start URL
        /// </summary>
        public const string StartUrl = "http://www.sina.com.cn";

        /// <summary>
        /// Maximum number of links to cache per pool. With some performance work
        /// (releasing resources promptly) the crawl could run indefinitely.
        /// </summary>
        public const int MaxNum = 300;

        /// <summary>
        /// Collect at most 1000 results
        /// </summary>
        public const int MaxResult = 1000;

        /// <summary>
        /// Number of matching pages found so far
        /// </summary>
        public static int SpideNum;

        /// <summary>
        /// Keyword
        /// </summary>
        public static string KeyWord;

        /// <summary>
        /// Running time
        /// </summary>
        public static int RuningTime;

        /// <summary>
        /// Maximum running time
        /// </summary>
        public static int MaxRuningtime;

        /// <summary>
        /// 10 threads crawl at the same time
        /// </summary>
        public static ClamThread[] ThreadList = new ClamThread[10];

        /// <summary>
        /// Links found by the first crawl (the initial link pool)
        /// </summary>
        public static List<Link> LnkPool = new List<Link>();

        /// <summary>
        /// Valid links, i.e. pages that contain the keyword
        /// </summary>
        public static List<Link> validLnk = new List<Link>();

        /// <summary>
        /// Lock object, so that threads do not take the same link
        /// </summary>
        public static readonly object syncObj = new object();
    }
}
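
These static fields are shared by all crawler threads and by the page that displays results, and neither `List<T>` nor `++` on an `int` is safe under concurrent writes. The following is a minimal sketch of one way to guard such state, using a `lock` for the list and `Interlocked` for the counter. `ResultStore` and its members are hypothetical names; the original project keeps this state in `CacheHelper`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ResultStore
{
    private static readonly object Sync = new object();
    private static readonly List<string> Results = new List<string>();
    private static int _count;

    // Called by crawler threads when a page matches
    public static void Add(string href)
    {
        lock (Sync)
        {
            Results.Add(href);
        }
        Interlocked.Increment(ref _count);
    }

    // Number of results added so far
    public static int Count
    {
        get { return Volatile.Read(ref _count); }
    }

    // Copy taken under the lock, so readers never enumerate a list that is being modified
    public static string[] Snapshot()
    {
        lock (Sync)
        {
            return Results.ToArray();
        }
    }
}
```

A reader such as a status page would call `Snapshot()` and format the copy, rather than iterating the shared list directly.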

HTTP request class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text;
using System.Net;
using System.IO;
using System.Threading;

namespace SpiderDemo.SearchUtil
{
    public static class HttpPostUtility
    {
        /// <summary>
        /// Synchronous for now; to be optimized later
        /// </summary>
        /// <param name="url">Address to request</param>
        /// <returns>The response stream, or null if the request could not be made</returns>
        public static Stream SendReq(string url)
        {
            try
            {
                if (string.IsNullOrEmpty(url))
                {
                    return null;
                }
                // WebProxy wp = new WebProxy("10.0.1.33:8080");
                // wp.Credentials = new System.Net.NetworkCredential("*****", "******", "feinno"); // a proxy was needed previously

                HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
                // myRequest.Proxy = wp;
                HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();

                return myResponse.GetResponseStream();
            }
            // Requests to some sites fail because access is restricted
            catch (Exception)
            {
                return null;
            }
        }
    }
}
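
`SendReq` hands back the raw response stream, so whoever receives it is responsible for reading and closing it. A minimal sketch of a helper that reads a stream into a string with an explicit encoding and disposes it afterwards; `StreamText.ReadAll` is a hypothetical name, and passing the encoding as a parameter is an assumption about how a caller might want to use it:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamText
{
    // Reads the whole stream as text and closes it; returns null for a null stream.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        if (stream == null)
        {
            return null;
        }
        // Disposing the StreamReader also disposes the underlying stream
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, `StreamText.ReadAll(HttpPostUtility.SendReq(url), Encoding.UTF8)` would return the page text for a UTF-8 page, or null when the request failed.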

Page-parsing class. It uses the HtmlAgilityPack component (HtmlAgilityPack.dll), which is very easy to use. Download link: http://www.php.cn/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Threading;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack;
using System.IO;
using SpiderDemo.Entity;

namespace SpiderDemo.SearchUtil
{
    public static class UrlAnalysisProcessor
    {
        public static void GetHrefs(Link url, Stream s, List<Link> lnkPool)
        {
            try
            {
                // No HTML stream: return immediately
                if (s == null)
                {
                    return;
                }

                // Parsed links are cached until a thread picks them up. Each thread caches
                // at most 300 links; beyond that, stop storing, because they are consumed too slowly.
                if (lnkPool.Count >= CacheHelper.MaxNum)
                {
                    return;
                }

                // Load the HTML with HtmlAgilityPack
                HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();

                // The original note says UTF-8 is specified here to avoid garbled Chinese,
                // but Encoding.Default is the system's default code page.
                // Encoding.UTF8 is the explicit choice for pages served as UTF-8.
                doc.Load(s, Encoding.Default);

                // Get all links. For this way of selecting links, see the Stack Overflow discussion:
                // http://www.php.cn/
                // SelectNodes returns null when nothing matches.
                IEnumerable<HtmlNode> nodeList = doc.DocumentNode.SelectNodes("//a[@href]");

                // Remove scripts
                foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
                    script.Remove();

                // Remove styles
                foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
                    style.Remove();

                string allText = doc.DocumentNode.InnerText;
                int index = 0;
                // A page that contains the keyword is a matching link
                if ((index = allText.IndexOf(CacheHelper.KeyWord)) != -1)
                {
                    // Take the context around the keyword: 20 characters on each side
                    if (index > 20 && index < allText.Length - 20 - CacheHelper.KeyWord.Length)
                    {
                        // Substring takes (start, length), so the leading part has length 20
                        string keyText = allText.Substring(index - 20, 20) +
                            "<span style='color:green'>" + allText.Substring(index, CacheHelper.KeyWord.Length) + "</span> " +
                            allText.Substring(index + CacheHelper.KeyWord.Length, 20) + "<br />"; // highlight the keyword

                        url.Context = keyText;
                    }

                    // validLnk and SpideNum are shared by all crawler threads
                    lock (CacheHelper.syncObj)
                    {
                        CacheHelper.validLnk.Add(url);
                        //RecordUtility.AppendLog(url.LinkName + "<br />");
                        // Found a matching link: counter + 1
                        CacheHelper.SpideNum++;
                    }
                }

                if (nodeList == null)
                {
                    return;
                }

                foreach (HtmlNode node in nodeList)
                {
                    if (node.Attributes["href"] == null)
                    {
                        continue;
                    }

                    Link lk = new Link()
                    {
                        Href = node.Attributes["href"].Value,
                        LinkName = "<a href='" + node.Attributes["href"].Value +
                            "' target='blank' >" + node.InnerText + "  " +
                            node.Attributes["href"].Value + "</a>" + "<br />"
                    };

                    if (lk.Href.StartsWith("javascript"))
                    {
                        continue;
                    }
                    else if (lk.Href.StartsWith("#"))
                    {
                        continue;
                    }
                    // Link does not override Equals, so compare by Href rather than by reference
                    else if (lnkPool.Any(l => l.Href == lk.Href))
                    {
                        continue;
                    }
                    else
                    {
                        // Add to the given link pool
                        lnkPool.Add(lk);
                    }
                }
            }
            catch (Exception)
            {
                // Parsing errors are ignored; the page is simply skipped
            }
        }
    }
}
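
The keyword-context step takes a fixed number of characters on each side of the match. `Substring(start, length)` expects a length as its second argument, and the window has to be clamped at both ends of the text or short pages will throw. A minimal sketch of that step as a standalone helper; `KeywordSnippet.Build` and the `radius` parameter are hypothetical names, and unlike the listing above it also produces a snippet when the match is near the start or end of the text:

```csharp
using System;

public static class KeywordSnippet
{
    // Returns up to `radius` characters before and after the first occurrence of
    // keyword, with the keyword wrapped in a green span; null if there is no match.
    public static string Build(string text, string keyword, int radius)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword) || radius < 0)
        {
            return null;
        }

        int index = text.IndexOf(keyword, StringComparison.Ordinal);
        if (index < 0)
        {
            return null;
        }

        // Clamp the window to the bounds of the text
        int start = Math.Max(0, index - radius);
        int afterKeyword = index + keyword.Length;
        int end = Math.Min(text.Length, afterKeyword + radius);

        return text.Substring(start, index - start)
            + "<span style='color:green'>" + keyword + "</span>"
            + text.Substring(afterKeyword, end - afterKeyword);
    }
}
```

The sketch does not HTML-encode the surrounding text, so it is only suitable for text that is already safe to display.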

Search page code-behind:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using SpiderDemo.SearchUtil;
using System.Threading;
using System.IO;
using SpiderDemo.Entity;

namespace SpiderDemo
{
    public partial class SearchPage : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                InitSetting();
            }
        }

        private void InitSetting()
        {
        }

        private void StartWork()
        {
            CacheHelper.EnableSearch = true;
            CacheHelper.KeyWord = txtKeyword.Text;

            // The first request goes to Sina; get the returned HTML stream
            Stream htmlStream = HttpPostUtility.SendReq(CacheHelper.StartUrl);

            Link startLnk = new Link()
            {
                Href = CacheHelper.StartUrl,
                LinkName = "<a href ='" + CacheHelper.StartUrl + "' > Sina " + CacheHelper.StartUrl + " </a>"
            };

            // Parse out the links
            UrlAnalysisProcessor.GetHrefs(startLnk, htmlStream, CacheHelper.LnkPool);

            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                CacheHelper.ThreadList[i] = new ClamThread();
                CacheHelper.ThreadList[i].lnkPool = new List<Link>();
            }

            // Distribute the links evenly among the threads
            for (int i = 0; i < CacheHelper.LnkPool.Count; i++)
            {
                int tIndex = i % CacheHelper.ThreadList.Length;
                CacheHelper.ThreadList[tIndex].lnkPool.Add(CacheHelper.LnkPool[i]);
            }

            Action<ClamThread> clamIt = new Action<ClamThread>((clt) =>
            {
                // Nothing was assigned to this thread
                if (clt.lnkPool.Count == 0)
                {
                    return;
                }

                Stream s = HttpPostUtility.SendReq(clt.lnkPool[0].Href);
                DoIt(clt, s, clt.lnkPool[0]);
            });

            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                // Copy the element to a local: the lambda must not capture the loop
                // variable i, which may already have changed when the thread runs
                ClamThread worker = CacheHelper.ThreadList[i];
                worker._thread = new Thread(new ThreadStart(() =>
                {
                    clamIt(worker);
                }));

                // Sleep 100 ms after starting each thread
                worker._thread.Start();
                Thread.Sleep(100);
            }
        }

        // Note: this method calls itself once per link, so the call depth grows with every page crawled
        private void DoIt(ClamThread thread, Stream htmlStream, Link url)
        {
            if (!CacheHelper.EnableSearch)
            {
                return;
            }

            if (CacheHelper.SpideNum > CacheHelper.MaxResult)
            {
                return;
            }

            // Parse the page: cache the URL if it matches, and cache the links found on the page
            UrlAnalysisProcessor.GetHrefs(url, htmlStream, thread.lnkPool);

            // If there are links, request the first one; otherwise finish, since this is resource-hungry anyway
            if (thread.lnkPool.Count > 0)
            {
                Link firstLnk;
                firstLnk = thread.lnkPool[0];
                // Remove the link from the cache once it has been taken
                thread.lnkPool.Remove(firstLnk);

                firstLnk.TheadId = Thread.CurrentThread.ManagedThreadId;
                Stream content = HttpPostUtility.SendReq(firstLnk.Href);

                DoIt(thread, content, firstLnk);
            }
            else
            {
                // No links left: return so that this thread ends; the other threads keep going
                return;
            }
        }

        protected void btnSearch_Click(object sender, EventArgs e)
        {
            this.StartWork();
        }

        protected void btnShow_Click(object sender, EventArgs e)
        {
        }

        protected void btnStop_Click(object sender, EventArgs e)
        {
            CacheHelper.EnableSearch = false;
            foreach (var t in CacheHelper.ThreadList)
            {
                // Entries are null until a search has been started
                if (t == null || t._thread == null)
                {
                    continue;
                }
                t._thread.Abort();
                t._thread.DisableComObjectEagerCleanup();
            }
            CacheHelper.LnkPool.Clear();
            CacheHelper.validLnk.Clear();
        }
    }
}
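
Two details of starting the worker threads are easy to get wrong. A lambda that captures the variable of a `for` loop sees the value the variable has when the lambda runs, not the value it had when the lambda was created. And a crawl that calls itself once per link deepens the call stack with every page. The following is a minimal sketch of both points under simplifying assumptions: the pools hold plain strings, "visiting" a link just records it, and `WorkerStart.Run` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class WorkerStart
{
    // Starts one thread per pool. Each thread drains its own pool in a loop
    // and records what it took; the per-thread records are returned.
    public static List<string>[] Run(List<string>[] pools)
    {
        var visited = new List<string>[pools.Length];
        var threads = new Thread[pools.Length];

        for (int i = 0; i < pools.Length; i++)
        {
            int index = i; // local copy: each lambda captures its own index
            visited[index] = new List<string>();
            threads[index] = new Thread(() =>
            {
                // Iterative drain instead of one recursive call per link
                while (pools[index].Count > 0)
                {
                    string next = pools[index][0];
                    pools[index].RemoveAt(0);
                    visited[index].Add(next);
                }
            });
            threads[index].Start();
        }

        foreach (Thread t in threads)
        {
            t.Join(); // wait for all workers before reading their results
        }
        return visited;
    }
}
```

Each thread touches only its own pool and its own result list, so no locking is needed in this sketch.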

Search page front-end code:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SearchPage.aspx.cs" Inherits="SpiderDemo.SearchPage" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <p>
        Keyword: <asp:TextBox runat="server" ID="txtKeyword"></asp:TextBox>
        <asp:Button runat="server" ID="btnSearch" Text="Search" onclick="btnSearch_Click" />

        <asp:Button runat="server" ID="btnStop" Text="Stop" onclick="btnStop_Click" />
    </p>
    <p>
        <iframe width="800px" height="700px" src="ShowPage.aspx">
        </iframe>
    </p>
    </form>
</body>
</html>

ShowPage.aspx (embedded in SearchPage; polls a handler via AJAX):

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="ShowPage.aspx.cs" Inherits="SpiderDemo.ShowPage" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
    <script src="js/jquery-1.6.js"></script>
</head>
<body>
    <form id="form1" runat="server">
    <p>
    </p>
    <p id="pRet">
    </p>
    <script type="text/javascript">
        $(document).ready(function () {
            // Ask the handler for the current state every 2 seconds
            var timer = setInterval(function () {
                $.ajax({
                    type: "POST",
                    url: "http://localhost:26820/StateServicePage.ashx",
                    data: "op=info",
                    success: function (msg) {
                        $("#pRet").html(msg);
                    }
                });
            }, 2000);
        });
    </script>
    </form>
</body>
</html>

StateServicePage.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text;
using SpiderDemo.SearchUtil;
using SpiderDemo.Entity;

namespace SpiderDemo
{
    /// <summary>
    /// Summary description of StateServicePage
    /// </summary>
    public class StateServicePage : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";

            if (context.Request["op"] != null && context.Request["op"] == "info")
            {
                context.Response.Write(ShowState());
            }
        }

        public string ShowState()
        {
            StringBuilder sbRet = new StringBuilder(100);
            string ret = GetValidLnkStr();

            // Total number of links waiting in all thread pools
            int count = 0;
            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                if (CacheHelper.ThreadList[i] != null && CacheHelper.ThreadList[i].lnkPool != null)
                {
                    count += CacheHelper.ThreadList[i].lnkPool.Count;
                }
            }

            sbRet.AppendLine("Service running: " + CacheHelper.EnableSearch + "<br />");
            sbRet.AppendLine("Total links in pools: " + count + "<br />");
            sbRet.AppendLine("Search results:<br /> " + ret);

            return sbRet.ToString();
        }

        private string GetValidLnkStr()
        {
            StringBuilder sb = new StringBuilder(120);
            Link[] cloneLnk;

            // Copy under the lock: crawler threads may be adding results at the same time
            lock (CacheHelper.syncObj)
            {
                cloneLnk = CacheHelper.validLnk.ToArray();
            }

            for (int i = 0; i < cloneLnk.Length; i++)
            {
                sb.AppendLine("<br/>" + cloneLnk[i].LinkName + "<br />" + cloneLnk[i].Context);
            }

            return sb.ToString();
        }

        public bool IsReusable
        {
            get
            {
                return false;
            }
        }
    }
}
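
The handler writes link text taken from crawled pages straight into the HTML it returns, so markup inside a page title or address ends up in the result page as markup. A minimal sketch of building the link HTML with encoding, using `System.Net.WebUtility.HtmlEncode`; `LinkMarkup.Build` is a hypothetical helper, not part of the original project:

```csharp
using System;
using System.Net;

public static class LinkMarkup
{
    // Builds an anchor element whose address and visible text are HTML-encoded.
    public static string Build(string href, string text)
    {
        string safeHref = WebUtility.HtmlEncode(href ?? string.Empty);
        string safeText = WebUtility.HtmlEncode(text ?? string.Empty);
        return "<a href='" + safeHref + "'>" + safeText + "</a><br />";
    }
}
```

Encoding turns characters such as `<`, `>`, `&` and quotes into entities, so text from a crawled page is shown literally instead of being interpreted by the browser.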
