Home > Article > Backend Development > The basic components and writing methods of golang crawler
With the popularization of the Internet and the accelerated development of informatization, more and more data are stored on the Internet, so web crawlers have become an indispensable tool for many people. Among them, golang crawler has become the preferred crawler writing language for many programmers due to its simplicity, efficiency and scalability.
This article will introduce the basic components and writing methods of golang crawler.
1. The basic components of golang crawler
URL Manager is mainly responsible for managing the URL queue that needs to be crawled , as well as related operations such as deduplication. It mainly includes the following functions:
The webpage downloader is mainly responsible for downloading the webpage corresponding to the URL to the local. It can use different download methods according to different characteristics of the URL, such as HTTP, HTTPS, FTP, etc. In golang, web pages can be downloaded by using third-party libraries such as net/http.
The webpage parser is mainly responsible for parsing the downloaded webpage, obtaining the required data and saving it. In golang, web pages can be parsed through regular expressions, html5 parser, goquery and other methods.
Storage is mainly responsible for storing the parsed data. There are generally two methods of database storage and local file storage. Third-party libraries such as GORM, orm, etc. can be used in golang for data storage.
2. How to write golang crawler
The URL manager is mainly used to manage the resources to be crawled/crawled URL, provides operations such as adding URL, obtaining URL, and determining whether URL exists.
type UrlManager struct { Urls map[string]bool } // 新建URL管理器 func NewUrlManager() *UrlManager { return &UrlManager{Urls: make(map[string]bool)} } // 添加URL到管理器队列 func (um *UrlManager) AddUrl(url string) bool { if um.Urls[url] { // URL已经存在 return false } um.Urls[url] = true return true } // 添加URL列表到管理器队列 func (um *UrlManager) AddUrls(urls []string) bool { added := false for _, url := range urls { if um.AddUrl(url) { added = true } } return added } // 判断URL是否存在 func (um *UrlManager) HasUrl(url string) bool { return um.Urls[url] } // 获取待爬取的URL func (um *UrlManager) GetUrl() string { for url := range um.Urls { delete(um.Urls, url) return url } return "" } // 获取URL数量 func (um *UrlManager) UrlCount() int { return len(um.Urls) }
The webpage downloader is mainly used to download the webpage content corresponding to the specified URL and return it.
type Downloader struct { client *http.Client } // 新建网页下载器 func NewDownloader() *Downloader { return &Downloader{client: &http.Client{}} } // 网页下载 func (d *Downloader) Download(url string) ([]byte, error) { req, err := http.NewRequest("GET", url, nil) req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36") resp, err := d.client.Do(req) if err != nil { return nil, err } defer resp.Body.Close() // 读取响应正文内容 contents, err := ioutil.ReadAll(resp.Body) if err != nil { return nil, err } return contents, nil }
The web page parser is mainly used to parse the downloaded web page content and extract the required data. The following is an example of a parser using goquery as an example:
type Parser struct{} // 新建网页解析器 func NewParser() *Parser { return &Parser{} } // 网页解析 func (parser *Parser) Parse(content []byte) []string { doc, err := goquery.NewDocumentFromReader(bytes.NewReader(content)) if err != nil { log.Fatal(err) } var urls []string doc.Find("a").Each(func(i int, s *goquery.Selection) { href, exists := s.Attr("href") if exists && !strings.HasPrefix(href, "javascript") && len(href) > 1 { // 绝对路径和相对路径都考虑 u, err := url.Parse(href) if err != nil { return } if u.IsAbs() { urls = append(urls, href) return } // 补全相对路径,例如:./abc --> http://example.com/abc base, _ := url.Parse(contentUrl) urls = append(urls, base.ResolveReference(u).String()) } }) return urls }
Storage is mainly used to store parsed data locally or in a database. Here, use MySQL database is an example:
type Storage struct { db *gorm.DB } //新建数据存储器 func NewStorage() *Storage{ db, _ := gorm.Open("mysql", "root:password@tcp(localhost:3306)/mydb?charset=utf8&parseTime=True&loc=Local") return &Storage{db:db} } // 保存数据到数据库 func (storage *Storage) SaveData(data []string) { for _, item := range data { storage.db.Create(&MyModel{Name: item}) } }
The crawler controller mainly implements the scheduling and coordination functions of crawlers. The main process is:
func Run() { // 初始化URL管理器、网页下载器、网页解析器、存储器 urlManager := NewUrlManager() downLoader := NewDownloader() parser := NewParser() storage := NewStorage() // 添加待爬取的URL urlManager.AddUrl("http://example.com") // 爬虫运行 for urlManager.UrlCount() > 0 { // 获取待爬取的URL url := urlManager.GetUrl() // 判断URL是否已爬取过 if downLoader.IsCrawled(url) { continue } // 下载网页 contents, err := downLoader.Download(url) if err != nil { continue } // 解析网页 urls := parser.Parse(contents) // 存储数据 storage.SaveData(urls) // 将URL添加到已爬取过的URL列表 downLoader.AddCrawled(url) // 将解析出来的URL添加到URL队列中 urlManager.AddUrls(urls) } }
package main import ( "bytes" "github.com/PuerkitoBio/goquery" "github.com/jinzhu/gorm" _ "github.com/jinzhu/gorm/dialects/mysql" "io/ioutil" "log" "net/http" "net/url" "strings" ) type UrlManager struct { Urls map[string]bool } // 新建URL管理器 func NewUrlManager() *UrlManager { return &UrlManager{Urls: make(map[string]bool)} } // 添加URL到管理器队列 // 添加URL到管理器队列 func (um *UrlManager) AddUrl(url string) bool { if um.Urls[url] { // URL已经存在 return false } um.Urls[url] = true return true } // 添加URL列表到管理器队列 func (um *UrlManager) AddUrls(urls []string) bool { added := false for _, url := range urls { if um.AddUrl(url) { added = true } } return added } // 判断URL是否存在 func (um *UrlManager) HasUrl(url string) bool { return um.Urls[url] } // 获取待爬取的URL func (um *UrlManager) GetUrl() string { for url := range um.Urls { delete(um.Urls, url) return url } return "" } // 获取URL数量 func (um *UrlManager) UrlCount() int { return len(um.Urls) } type Downloader struct { client *http.Client crawledUrls map[string]bool } // 新建网页下载器 func NewDownloader() *Downloader { return &Downloader{client: &http.Client{}, crawledUrls: make(map[string]bool)} } // 网页下载 func (d *Downloader) Download(url string) ([]byte, error) { req, err := http.NewRequest("GET", url, nil) req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36") resp, err := d.client.Do(req) if err != nil { return nil, err } defer resp.Body.Close() // 读取响应正文内容 contents, err := ioutil.ReadAll(resp.Body) if err != nil { return nil, err } return contents, nil } // 判断URL是否已爬取 func (d *Downloader) IsCrawled(url string) bool { return d.crawledUrls[url] } // 将URL添加到已爬取列表中 func (d *Downloader) AddCrawled(url string) { d.crawledUrls[url] = true } type Parser struct{} // 新建网页解析器 func NewParser() *Parser { return &Parser{} } // 网页解析 func (parser *Parser) Parse(content []byte,contentUrl string) []string { doc, err := goquery.NewDocumentFromReader(bytes.NewReader(content)) if err != nil { log.Fatal(err) } var urls []string doc.Find("a").Each(func(i int, s *goquery.Selection) { href, exists := s.Attr("href") if exists && !strings.HasPrefix(href, "javascript") && len(href) > 1 { // 绝对路径和相对路径都考虑 u, err := url.Parse(href) if err != nil { return } if u.IsAbs() { urls = append(urls, href) return } // 补全相对路径 base, _ := url.Parse(contentUrl) urls = append(urls, base.ResolveReference(u).String()) } }) return urls } type MyModel struct { gorm.Model Name string } type Storage struct { db *gorm.DB } //新建数据存储器 func NewStorage() *Storage{ db, _ := gorm.Open("mysql", "root:password@tcp(localhost:3306)/mydb?charset=utf8&parseTime=True&loc=Local") db.AutoMigrate(&MyModel{}) return &Storage{db:db} } // 保存数据到数据库 func (storage *Storage) SaveData(data []string) { for _, item := range data { storage.db.Create(&MyModel{Name: item}) } } func Run() { // 初始化URL管理器、网页下载器、网页解析器、存储器 urlManager := NewUrlManager() downLoader := NewDownloader() parser := NewParser() storage := NewStorage() // 添加待爬取的URL urlManager.AddUrl("http://example.com") // 爬虫运行 for urlManager.UrlCount() > 0 { // 获取待爬取的URL url := urlManager.GetUrl() // 判断URL是否已爬取过 if downLoader.IsCrawled(url) { continue } // 下载网页 contents, err := downLoader.Download(url) if err != nil { continue } // 解析网页 urls := parser.Parse(contents,url) // 存储数据 storage.SaveData(urls) // 将URL添加到已爬取过的URL列表 downLoader.AddCrawled(url) // 将解析出来的URL添加到URL队列中 urlManager.AddUrls(urls) } } func main(){ Run() }
3. Summary
The golang crawler has the characteristics of simplicity, efficiency and scalability, and due to its The natural concurrency advantage can greatly improve the speed of crawling data. This article introduces the basic composition and writing method of golang crawler, hoping to be helpful to readers. Readers are also welcome to accumulate more experience in practice.
The above is the detailed content of The basic components and writing methods of golang crawler. For more information, please follow other related articles on the PHP Chinese website!