
How to write a crawler in golang

WBOY | Original | 2023-05-10

With the popularity of the Internet, we need to obtain large amounts of information, and much of it has to be crawled from websites. Among the many ways to do this, a crawler written in Golang can gather that information especially efficiently.

Golang is an intuitive, concise, and efficient programming language that is well suited to high-concurrency, high-performance applications. Since crawling is exactly that kind of workload, Golang is a very good fit for writing crawlers. In this article, we introduce the basic workflow, commonly used libraries, and core techniques for writing crawlers in Golang, to help beginners quickly master the fundamentals.

1. Basic steps for writing crawlers in golang

Before diving into the code, note that a Golang crawler generally involves three steps: sending HTTP requests, parsing the returned HTML (which requires a basic understanding of HTML structure), and storing the extracted data.

  1. HTTP request

Golang's standard library already provides the functions needed for HTTP requests. We only need to set the URL, request headers, cookies, and request parameters to construct the request we need. The main code is as follows:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Send a simple GET request.
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Always close the response body to avoid leaking connections.
    defer resp.Body.Close()

    // Read the whole response body into memory.
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(string(body))
}

This code uses the http.Get function to issue an HTTP request and then reads the response body. The key point is the defer statement, which runs when the function returns and closes the response body, avoiding a resource leak.
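The snippet above only issues a plain GET. When you need the request headers, cookies, and parameters mentioned earlier, a sketch like the following shows how to build the request with http.NewRequest from the standard library; the search path, header value, and cookie are placeholders chosen for illustration, not anything the target site requires:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

func main() {
    // Build the query string instead of concatenating it by hand.
    params := url.Values{}
    params.Set("wd", "golang")

    req, err := http.NewRequest("GET", "https://www.baidu.com/s?"+params.Encode(), nil)
    if err != nil {
        fmt.Println(err)
        return
    }

    // Custom request header and cookie (values are placeholders).
    req.Header.Set("User-Agent", "Mozilla/5.0 (example crawler)")
    req.AddCookie(&http.Cookie{Name: "example", Value: "1"})

    // A client with a timeout avoids hanging on slow sites.
    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(resp.Status, len(body), "bytes")
}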

  2. Parsing the HTML page

The response data obtained by the HTTP request is an HTML document, which we need to parse in order to obtain the required data. In golang, we can use the GoQuery library to parse HTML documents. This library is based on jQuery's syntax and is easy to use.

The main parsing functions provided by GoQuery are Find, Filter, Each, and Attr. Find locates child elements that match a selector, Filter narrows a selection to matching elements, Each iterates over all matched elements, and Attr reads an element's attributes. Taking Baidu's homepage as an example, the code is as follows:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    body := resp.Body
    defer body.Close()

    doc, err := goquery.NewDocumentFromReader(body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("title").Each(func(i int, s *goquery.Selection) {
        fmt.Println(s.Text())
    })
}

In the above code, the goquery.NewDocumentFromReader function builds the document object, the Find method locates the title element, the Each method iterates over all matching elements, and the text of each one is printed.
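To illustrate the Attr function mentioned above, here is a small sketch (using the same Baidu homepage purely as an example) that prints the text and href attribute of every link on the page:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Collect the text and href attribute of every link on the page.
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if exists {
            fmt.Printf("%s -> %s\n", s.Text(), href)
        }
    })
}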

  3. Storing data

The last step is to save the obtained data. For data storage, we have many ways to choose from, such as databases, files, caches, etc.

For example, we want to save the crawled data into a CSV file. The steps are as follows:

package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    // Create the output file.
    file, err := os.Create("data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Flush buffered rows to disk before the program exits.
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // Header row followed by the data rows.
    writer.Write([]string{"name", "address", "tel"})
    writer.Write([]string{"John Smith", "123 Main St, Los Angeles, CA 90012", "123-456-7890"})
    writer.Write([]string{"Jane Smith", "456 Oak Ave, San Francisco, CA 94107", "123-456-7891"})
}

The above code uses the os.Create function to create a file named data.csv, then creates a CSV writer with csv.NewWriter. Finally, the writer.Write method writes each row into the CSV file.
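In a real crawler the parsing and storing steps are usually combined: rows are written as they are extracted, and the writer is flushed and checked at the end. The following sketch (the output file name and the link selector are chosen arbitrarily) stores the links found by goquery into a CSV file:

package main

import (
    "encoding/csv"
    "log"
    "net/http"
    "os"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    file, err := os.Create("links.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    writer.Write([]string{"text", "href"})

    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Write each row as soon as it is extracted.
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            writer.Write([]string{s.Text(), href})
        }
    })

    // Flush buffered rows and report any write error.
    writer.Flush()
    if err := writer.Error(); err != nil {
        log.Fatal(err)
    }
}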

2. Commonly used libraries for writing crawlers in golang

Writing a crawler in Golang does not mean writing all the low-level code yourself; several libraries handle that work for you. The most common ones are:

  1. Gocolly

Gocolly is a lightweight crawler framework for Golang that provides many convenient methods for scraping data. It automatically handles redirects, cookies, proxies, rate limits, and so on, letting us focus on defining the data-extraction rules. The following code demonstrates how to use Gocolly to get Baidu's title:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    
    c.OnHTML("head", func(e *colly.HTMLElement) {
        title := e.ChildText("title")
        fmt.Println(title)
    })
    
    c.Visit("http://www.baidu.com")
}
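Gocolly's request handling can also be observed and customized through callbacks. The sketch below is only an illustration; the User-Agent and Accept-Language values are arbitrary examples. It sets a request header with OnRequest and logs failed requests with OnError:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.UserAgent("Mozilla/5.0 (example crawler)"), // placeholder UA
    )

    // Inspect or modify every outgoing request.
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Accept-Language", "zh-CN,zh;q=0.9")
        fmt.Println("visiting", r.URL)
    })

    // Handle failed requests instead of silently dropping them.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("request to", r.Request.URL, "failed:", err)
    })

    c.OnHTML("head", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("title"))
    })

    c.Visit("http://www.baidu.com")
}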
  2. beautifulsoup4go

beautifulsoup4go is a Golang-based HTML parser, similar to the well-known Python library BeautifulSoup4, and it can parse HTML pages fetched from the Internet. The following code demonstrates how to use beautifulsoup4go to get Baidu's title:

package main

import (
    "fmt"
    "github.com/sundy-li/go_commons/crawler"
)

func main() {
    html := crawler.FetchHTML("http://www.baidu.com", "GET", nil, "")

    bs := crawler.NewSoup(html)

    title := bs.Find("title").Text()
    
    fmt.Println(title)
}

  3. goquery

The goquery library was introduced earlier. It is an HTML parser based on CSS selectors that supports chained operations and is very practical. The following code demonstrates how to use goquery to get Baidu's title:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    body := resp.Body
    defer body.Close()

    doc, err := goquery.NewDocumentFromReader(body)
    if err != nil {
        log.Fatal(err)
    }

    title := doc.Find("title").Text()
    
    fmt.Println(title)
}

Each of the three libraries has its own strengths; choosing the one that suits your task makes the crawler easier to build.

3. Core technologies for writing crawlers in golang

  1. Concurrency

A very important feature when implementing a crawler is concurrency, that is, visiting multiple websites or URLs at the same time. In Golang, we can run tasks concurrently with goroutines, for example:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    urls := []string{
        "http://www.baidu.com",
        "http://www.sogou.com",
        "http://www.google.com",
    }

    ch := make(chan string, len(urls))

    for _, url := range urls {
        go func(url string) {
            c := colly.NewCollector()

            c.OnHTML("head", func(e *colly.HTMLElement) {
                title := e.ChildText("title")
                ch <- title
            })

            // If the visit fails (e.g. the site is unreachable), still send
            // something so the receive loop below does not block forever.
            if err := c.Visit(url); err != nil {
                ch <- fmt.Sprintf("error visiting %s: %v", url, err)
            }
        }(url)
    }

    for range urls {
        title := <-ch
        fmt.Println(title)
    }
}

In the above code, we use goroutines to visit multiple URLs concurrently, extract the title from each site's head tag, and print it.
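Instead of starting goroutines by hand, Gocolly can also run requests concurrently itself when the collector is created in async mode. A minimal sketch using the same three URLs:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    urls := []string{
        "http://www.baidu.com",
        "http://www.sogou.com",
        "http://www.google.com",
    }

    // Async mode lets one collector fetch several URLs in parallel.
    c := colly.NewCollector(colly.Async(true))

    // Cap parallelism so the target sites are not hammered.
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    c.OnHTML("head", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("title"))
    })

    for _, url := range urls {
        c.Visit(url)
    }

    // Wait blocks until all pending requests have finished.
    c.Wait()
}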

  2. Anti-crawler mechanisms

As we all know, many websites adopt anti-crawler mechanisms to limit crawler access, such as restricting request frequency, adding CAPTCHAs, and detecting common crawler tools. To deal with these mechanisms, we need some technical means to avoid being blocked. Here are two of them:

(1) Access frequency control

To avoid being restricted by a website, we can set an interval between requests, use proxy IPs, crawl in a distributed fashion, and so on, so that the anti-crawler mechanisms are less likely to flag us.

For example, in the Gocolly framework we can use the Limit method with a LimitRule (its Delay, RandomDelay, and Parallelism fields) to control crawling frequency and concurrency:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
    "time"
)

func main() {
    c := colly.NewCollector()

    // At most 2 concurrent requests per matching domain, with a random
    // delay of up to 5 seconds before each request.
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        RandomDelay: 5 * time.Second,
    })

    c.OnHTML("head", func(e *colly.HTMLElement) {
        title := e.ChildText("title")
        fmt.Println(title)
    })

    c.Visit("http://www.baidu.com")
}

The above code limits concurrency to 2 requests at a time and adds a random delay of up to 5 seconds before each request, which helps avoid being throttled by the website. In practice, we also need to tune the interval for each site we crawl.
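For the proxy approach mentioned above, Gocolly ships a proxy switcher that rotates requests across several proxy addresses. A minimal sketch follows; the proxy URLs are placeholders and must be replaced with real, working proxies:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func main() {
    c := colly.NewCollector()

    // Rotate requests across several proxies (addresses are placeholders).
    rp, err := proxy.RoundRobinProxySwitcher(
        "http://127.0.0.1:8080",
        "socks5://127.0.0.1:1080",
    )
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(rp)

    c.OnHTML("head", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("title"))
    })

    c.Visit("http://www.baidu.com")
}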

(2) Distributed crawling

Distributed crawling can effectively avoid website restrictions and improve crawling efficiency. The basic idea is to assign different tasks to different nodes or machines, have them work independently, and then merge the results. Distributed crawling requires scheduling, communication, and other infrastructure, so it is relatively complex; in practice, we can use third-party libraries or cloud services to implement it.
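A complete distributed crawler is beyond the scope of this article, but the task-queue idea behind it can be sketched on a single machine with Gocolly's queue package: consumers pull URLs from a shared store, and swapping the in-memory storage for a shared backend (such as Redis) is what turns this into a multi-node setup. Treat the following as a sketch of the idea, not a full distributed system:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
)

func main() {
    c := colly.NewCollector()

    c.OnHTML("head", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("title"))
    })

    // Two consumer goroutines share one URL queue; replacing the in-memory
    // storage with a shared backend is the step toward a multi-node crawler.
    q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        fmt.Println(err)
        return
    }

    q.AddURL("http://www.baidu.com")
    q.AddURL("http://www.sogou.com")

    // Run blocks until the queue is drained.
    q.Run(c)
}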

Conclusion

This article introduced how to write a crawler in Golang, covering the basic steps, commonly used libraries, and core techniques. Golang is a high-performance, concise, and clear language that meets the needs of crawling well. Still, real-world crawling demands broader knowledge, and we need to keep up with evolving anti-crawler techniques to complete crawling tasks successfully.
