Home  >  Article  >  Backend Development  >  The basic components and writing methods of golang crawler

The basic components and writing methods of golang crawler

PHPz
PHPzOriginal
2023-04-25 16:19:25611browse

With the popularization of the Internet and the accelerated development of informatization, more and more data are stored on the Internet, so web crawlers have become an indispensable tool for many people. Among them, golang crawler has become the preferred crawler writing language for many programmers due to its simplicity, efficiency and scalability.

This article will introduce the basic components and writing methods of golang crawler.

1. The basic components of golang crawler

  1. URL Manager (UrlManager)

URL Manager is mainly responsible for managing the URL queue that needs to be crawled , as well as related operations such as deduplication. It mainly includes the following functions:

  • Add URL: Add the URL to be crawled to the queue;
  • Get URL: Get the URL to be crawled from the URL queue;
  • Storage URL: Store the crawled URL;
  • Deduplication: Prevent repeated crawling of the same URL.
  1. Webpage Downloader (Downloader)

The webpage downloader is mainly responsible for downloading the webpage corresponding to the URL to the local. It can use different download methods according to different characteristics of the URL, such as HTTP, HTTPS, FTP, etc. In golang, web pages can be downloaded by using third-party libraries such as net/http.

  1. Webpage Parser (Parser)

The webpage parser is mainly responsible for parsing the downloaded webpage, obtaining the required data and saving it. In golang, web pages can be parsed through regular expressions, html5 parser, goquery and other methods.

  1. Storage (Storage)

Storage is mainly responsible for storing the parsed data. There are generally two methods of database storage and local file storage. Third-party libraries such as GORM, orm, etc. can be used in golang for data storage.

2. How to write golang crawler

  1. Create a URL manager

The URL manager is mainly used to manage the resources to be crawled/crawled URL, provides operations such as adding URL, obtaining URL, and determining whether URL exists.

type UrlManager struct {
    Urls map[string]bool
}
// 新建URL管理器
func NewUrlManager() *UrlManager {
    return &UrlManager{Urls: make(map[string]bool)}
}
// 添加URL到管理器队列
func (um *UrlManager) AddUrl(url string) bool {
    if um.Urls[url] {
        // URL已经存在
        return false
    }
    um.Urls[url] = true
    return true
}
// 添加URL列表到管理器队列
func (um *UrlManager) AddUrls(urls []string) bool {
    added := false
    for _, url := range urls {
        if um.AddUrl(url) {
            added = true
        }
    }
    return added
}
// 判断URL是否存在
func (um *UrlManager) HasUrl(url string) bool {
    return um.Urls[url]
}
// 获取待爬取的URL
func (um *UrlManager) GetUrl() string {
    for url := range um.Urls {
        delete(um.Urls, url)
        return url
    }
    return ""
}
// 获取URL数量
func (um *UrlManager) UrlCount() int {
    return len(um.Urls)
}
  1. Create a webpage downloader

The webpage downloader is mainly used to download the webpage content corresponding to the specified URL and return it.

type Downloader struct {
    client *http.Client
}
// 新建网页下载器
func NewDownloader() *Downloader {
    return &Downloader{client: &http.Client{}}
}
// 网页下载
func (d *Downloader) Download(url string) ([]byte, error) {
    req, err := http.NewRequest("GET", url, nil)
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    resp, err := d.client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    // 读取响应正文内容
    contents, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return contents, nil
}
  1. Create a web page parser

The web page parser is mainly used to parse the downloaded web page content and extract the required data. The following is an example of a parser using goquery as an example:

type Parser struct{}
// 新建网页解析器
func NewParser() *Parser {
    return &Parser{}
}
// 网页解析
func (parser *Parser) Parse(content []byte) []string {
    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(content))
    if err != nil {
        log.Fatal(err)
    }
    var urls []string
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if exists && !strings.HasPrefix(href, "javascript") && len(href) > 1 {
            // 绝对路径和相对路径都考虑
            u, err := url.Parse(href)
            if err != nil {
                return
            }
            if u.IsAbs() {
                urls = append(urls, href)
                return
            }
            // 补全相对路径,例如:./abc --> http://example.com/abc
            base, _ := url.Parse(contentUrl)
            urls = append(urls, base.ResolveReference(u).String())
        }
    })
    return urls
}
  1. Create storage

Storage is mainly used to store parsed data locally or in a database. Here, use MySQL database is an example:

type Storage struct {
    db *gorm.DB
}
//新建数据存储器
func NewStorage() *Storage{
    db, _ := gorm.Open("mysql", "root:password@tcp(localhost:3306)/mydb?charset=utf8&parseTime=True&loc=Local")
    return &Storage{db:db}
}
// 保存数据到数据库
func (storage *Storage) SaveData(data []string) {
    for _, item := range data {
        storage.db.Create(&MyModel{Name: item})
    }
}
  1. Crawler Controller

The crawler controller mainly implements the scheduling and coordination functions of crawlers. The main process is:

  • Initialize the URL manager, web page downloader, web page parser, and storage;
  • Add the URL to be crawled to the URL manager queue;
  • Loop to obtain the URL to be crawled;
  • Determine whether the URL has been crawled, if it has been crawled, skip the URL;
  • Download the corresponding URL Web page;
  • Parse the web page and retrieve the data;
  • Store the data in the database;
  • Add the URL to the list of crawled URLs.
func Run() {
    // 初始化URL管理器、网页下载器、网页解析器、存储器
    urlManager := NewUrlManager()
    downLoader := NewDownloader()
    parser := NewParser()
    storage := NewStorage()
    // 添加待爬取的URL
    urlManager.AddUrl("http://example.com")
    // 爬虫运行
    for urlManager.UrlCount() > 0 {
        // 获取待爬取的URL
        url := urlManager.GetUrl()
        // 判断URL是否已爬取过
        if downLoader.IsCrawled(url) {
            continue
        }
        // 下载网页
        contents, err := downLoader.Download(url)
        if err != nil {
            continue
        }
        // 解析网页
        urls := parser.Parse(contents)
        // 存储数据
        storage.SaveData(urls)
        // 将URL添加到已爬取过的URL列表
        downLoader.AddCrawled(url)
        // 将解析出来的URL添加到URL队列中
        urlManager.AddUrls(urls)
    }
}
  1. Complete code
package main
import (
    "bytes"
    "github.com/PuerkitoBio/goquery"
    "github.com/jinzhu/gorm"
    _ "github.com/jinzhu/gorm/dialects/mysql"
    "io/ioutil"
    "log"
    "net/http"
    "net/url"
    "strings"
)
type UrlManager struct {
    Urls map[string]bool
}
// 新建URL管理器
func NewUrlManager() *UrlManager {
    return &UrlManager{Urls: make(map[string]bool)}
}
// 添加URL到管理器队列
// 添加URL到管理器队列
func (um *UrlManager) AddUrl(url string) bool {
    if um.Urls[url] {
        // URL已经存在
        return false
    }
    um.Urls[url] = true
    return true
}
// 添加URL列表到管理器队列
func (um *UrlManager) AddUrls(urls []string) bool {
    added := false
    for _, url := range urls {
        if um.AddUrl(url) {
            added = true
        }
    }
    return added
}
// 判断URL是否存在
func (um *UrlManager) HasUrl(url string) bool {
    return um.Urls[url]
}
// 获取待爬取的URL
func (um *UrlManager) GetUrl() string {
    for url := range um.Urls {
        delete(um.Urls, url)
        return url
    }
    return ""
}
// 获取URL数量
func (um *UrlManager) UrlCount() int {
    return len(um.Urls)
}
type Downloader struct {
    client *http.Client
    crawledUrls map[string]bool
}
// 新建网页下载器
func NewDownloader() *Downloader {
    return &Downloader{client: &http.Client{}, crawledUrls: make(map[string]bool)}
}
// 网页下载
func (d *Downloader) Download(url string) ([]byte, error) {
    req, err := http.NewRequest("GET", url, nil)
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    resp, err := d.client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    // 读取响应正文内容
    contents, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return contents, nil
}
// 判断URL是否已爬取
func (d *Downloader) IsCrawled(url string) bool {
    return d.crawledUrls[url]
}
// 将URL添加到已爬取列表中
func (d *Downloader) AddCrawled(url string) {
    d.crawledUrls[url] = true
}
type Parser struct{}
// 新建网页解析器
func NewParser() *Parser {
    return &Parser{}
}
// 网页解析
func (parser *Parser) Parse(content []byte,contentUrl string) []string {
    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(content))
    if err != nil {
        log.Fatal(err)
    }
    var urls []string
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if exists && !strings.HasPrefix(href, "javascript") && len(href) > 1 {
            // 绝对路径和相对路径都考虑
            u, err := url.Parse(href)
            if err != nil {
                return
            }
            if u.IsAbs() {
                urls = append(urls, href)
                return
            }
            // 补全相对路径
            base, _ := url.Parse(contentUrl)
            urls = append(urls, base.ResolveReference(u).String())
        }
    })
    return urls
}

type MyModel struct {
    gorm.Model
    Name string
}
type Storage struct {
    db *gorm.DB
}

//新建数据存储器
func NewStorage() *Storage{
    db, _ := gorm.Open("mysql", "root:password@tcp(localhost:3306)/mydb?charset=utf8&parseTime=True&loc=Local")
    db.AutoMigrate(&MyModel{})
    return &Storage{db:db}
}

// 保存数据到数据库
func (storage *Storage) SaveData(data []string) {
    for _, item := range data {
        storage.db.Create(&MyModel{Name: item})
    }
}
func Run() {
    // 初始化URL管理器、网页下载器、网页解析器、存储器
    urlManager := NewUrlManager()
    downLoader := NewDownloader()
    parser := NewParser()
    storage := NewStorage()
    // 添加待爬取的URL
    urlManager.AddUrl("http://example.com")
    // 爬虫运行
    for urlManager.UrlCount() > 0 {
        // 获取待爬取的URL
        url := urlManager.GetUrl()
        // 判断URL是否已爬取过
        if downLoader.IsCrawled(url) {
            continue
        }
        // 下载网页
        contents, err := downLoader.Download(url)
        if err != nil {
            continue
        }
        // 解析网页
        urls := parser.Parse(contents,url)
        // 存储数据
        storage.SaveData(urls)
        // 将URL添加到已爬取过的URL列表
        downLoader.AddCrawled(url)
        // 将解析出来的URL添加到URL队列中
        urlManager.AddUrls(urls)
    }
}

func main(){
    Run()
}

3. Summary

The golang crawler has the characteristics of simplicity, efficiency and scalability, and due to its The natural concurrency advantage can greatly improve the speed of crawling data. This article introduces the basic composition and writing method of golang crawler, hoping to be helpful to readers. Readers are also welcome to accumulate more experience in practice.

The above is the detailed content of The basic components and writing methods of golang crawler. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn