Writing a Non-Directional Crawler in Golang
1. Foreword
With the development of the Internet, web crawlers are used in more and more scenarios. In daily life we can obtain all kinds of information through them, such as news, stocks, weather, movies and music, and they play an especially important role in big data analysis and artificial intelligence. This article explains how to use the golang language to write a non-directional crawler (that is, one with no specific target website) to obtain information from the Internet.
2. Introduction to golang
Golang is a programming language developed by Google. Due to its concurrency, high performance, simplicity and ease of learning, it is increasingly favored by programmers. The golang version used in this article is 1.14.2.
3. Implementation ideas
This crawler is mainly divided into the following steps:
Obtain the starting URL. It can be entered manually, read from a file, read from a database, and so on.
Send an HTTP GET or POST request to obtain the response data.
Parse the data with regular expressions or a third-party library, according to the format of the response data.
Store the data in files, in a database, or with another storage method, depending on your needs.
Parse new URLs from the hyperlinks and other information in the response data; these become the next URLs to crawl.
For each new URL, send the HTTP request again, parse the response data, store the data and extract further URLs; repeat until there are no new URLs left. A minimal sketch of this loop follows.
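To make these steps concrete, here is a minimal, simplified sketch of the crawl loop. The names fetch, parseLinks and save are placeholders for the concrete functions implemented in the next section, not part of the original code.

// Minimal sketch of the crawl loop; fetch, parseLinks and save are
// placeholders for the concrete functions shown in section 4.
func crawl(start string, fetch func(string) (string, error),
    parseLinks func(url, html string) []string, save func(html string)) {
    queue := []string{start}
    visited := make(map[string]bool)
    for len(queue) > 0 {
        url := queue[0]
        queue = queue[1:]
        if visited[url] {
            continue
        }
        visited[url] = true
        html, err := fetch(url)
        if err != nil {
            continue // skip pages that cannot be fetched
        }
        save(html)
        for _, next := range parseLinks(url, html) {
            if !visited[next] {
                queue = append(queue, next)
            }
        }
    }
}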
4. Code Implementation
In golang, we use the net/http package to send HTTP requests, and either the regexp package or a third-party library to parse the response data. This article uses the goquery library.
First, we define an init function that obtains the starting URL, sets up the HTTP client, and declares the package-level variables these operations use.
var (
    startUrl string
    client   *http.Client
)

func init() {
    // obtain the starting URL from the command line
    flag.StringVar(&startUrl, "url", "", "please enter the starting URL")
    flag.Parse()

    // set up the HTTP client: 30-second timeout, do not follow redirects
    client = &http.Client{
        Timeout: 30 * time.Second,
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
}
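With this in place, the crawler can be started with a command such as go run main.go -url https://example.com (the file name and address here are only examples); the -url flag comes from the flag definition above.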
Define a function responsible for sending http requests and obtaining response data.
func GetHtml(url string) (string, error) {
    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return "", err
    }
    defer resp.Body.Close()

    // read the whole response body and return it as a string
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return "", err
    }
    return string(body), nil
}
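Some sites reject requests that do not carry a browser-like User-Agent header. If you run into this, GetHtml can be extended to build the request explicitly. The variant below is only a sketch: GetHtmlWithUA is a hypothetical name and the header value is just an example.

// GetHtmlWithUA is a possible variant of GetHtml that sets a custom
// User-Agent header; the header value is only an example.
func GetHtmlWithUA(url string) (string, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; my-crawler/0.1)")

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}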
Use the goquery library to parse the response data. The specific implementation is as follows:
func ParseSingleHTML(html string, query string) []string {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Println(err)
        return nil
    }
    result := make([]string, 0)
    doc.Find(query).Each(func(i int, selection *goquery.Selection) {
        href, ok := selection.Attr("href")
        if ok {
            result = append(result, href)
        } else {
            // elements without an href attribute (for example <title>)
            // contribute their text instead, so page titles can be saved later
            result = append(result, selection.Text())
        }
    })
    return result
}
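As a usage example, the hypothetical helper below prints the href of every link on a page; it assumes html holds a page already downloaded with GetHtml and that fmt is imported.

// printLinks is a hypothetical helper showing how ParseSingleHTML is called.
func printLinks(html string) {
    for _, link := range ParseSingleHTML(html, "a") {
        fmt.Println(link)
    }
}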
Define a function responsible for storing data into a file.
func SaveData(data []string) error {
    file, err := os.OpenFile("data.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Println(err)
        return err
    }
    defer file.Close()

    writer := bufio.NewWriter(file)
    for _, line := range data {
        // write one entry per line
        _, err := writer.WriteString(line + "\n")
        if err != nil {
            log.Println(err)
            return err
        }
    }
    writer.Flush()
    return nil
}
Use a regular expression to extract new URLs from hyperlinks and convert relative links into absolute ones.
func ParseHref(url, html string) []string {
    re := regexp.MustCompile(`<a[\s\S]+?href="(.*?)"[\s\S]*?>`)
    matches := re.FindAllStringSubmatch(html, -1)
    result := make([]string, 0)
    for _, match := range matches {
        href := match[1]
        if strings.HasPrefix(href, "//") {
            // protocol-relative link
            href = "http:" + href
        } else if strings.HasPrefix(href, "/") {
            // root-relative link
            href = strings.TrimSuffix(url, "/") + href
        } else if strings.HasPrefix(href, "http://") || strings.HasPrefix(href, "https://") {
            // already absolute, keep as-is
        } else {
            // treat anything else as relative to the current URL
            href = url + "/" + href
        }
        result = append(result, href)
    }
    return result
}
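The hand-written prefix checks above cover the common cases, but the standard net/url package resolves relative references more robustly. The following is only a sketch of that alternative: ParseHrefResolved is a hypothetical name, and the package is imported under the alias neturl because url is already used as a variable name in the functions above.

import neturl "net/url" // aliased because "url" is used as a variable name above

// ParseHrefResolved resolves raw hrefs against a base URL using net/url.
func ParseHrefResolved(base string, hrefs []string) []string {
    baseURL, err := neturl.Parse(base)
    if err != nil {
        return nil
    }
    result := make([]string, 0, len(hrefs))
    for _, h := range hrefs {
        ref, err := neturl.Parse(h)
        if err != nil {
            continue
        }
        // ResolveReference handles "//", "/", "../" and plain relative paths
        result = append(result, baseURL.ResolveReference(ref).String())
    }
    return result
}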
Finally, we need to define a main function to implement the entire crawler process.
func main() {
    // make sure the starting URL is not empty
    if startUrl == "" {
        fmt.Println("please specify a starting URL")
        return
    }

    // initialize the queue of URLs to visit
    queue := list.New()
    queue.PushBack(startUrl)

    // initialize the set of visited URLs
    visited := make(map[string]bool)

    // crawl loop
    for queue.Len() > 0 {
        // pop a URL from the front of the queue
        elem := queue.Front()
        queue.Remove(elem)
        url, ok := elem.Value.(string)
        if !ok {
            log.Println("invalid URL format")
            continue
        }

        // skip URLs that have already been visited
        if visited[url] {
            continue
        }
        visited[url] = true

        // send the HTTP request and get the response data
        html, err := GetHtml(url)
        if err != nil {
            continue
        }

        // parse the response data to get new URLs
        hrefs := ParseHref(url, html)
        for _, href := range hrefs {
            if !visited[href] {
                hrefHtml, err := GetHtml(href)
                if err != nil {
                    continue
                }
                hrefUrls := ParseSingleHTML(hrefHtml, "a")
                // add the new URLs to the queue
                for _, hrefUrl := range hrefUrls {
                    queue.PushBack(hrefUrl)
                }
            }
        }

        // store the page data (here, the title) to a file
        data := ParseSingleHTML(html, "title")
        err = SaveData(data)
        if err != nil {
            continue
        }
    }
}
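The crawler above is single-threaded. If you later fetch pages concurrently with goroutines, note that Go maps are not safe for concurrent writes, so the visited set needs synchronization. The following is a minimal sketch of one possible approach, assuming the sync package is imported; visitedSet and its methods are hypothetical names, not part of the original code.

// visitedSet is a sketch of a mutex-protected set of visited URLs,
// useful if the crawl loop is later parallelized with goroutines.
type visitedSet struct {
    mu   sync.Mutex
    seen map[string]bool
}

func newVisitedSet() *visitedSet {
    return &visitedSet{seen: make(map[string]bool)}
}

// markIfNew records the URL and reports whether it was seen for the first time.
func (v *visitedSet) markIfNew(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}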
5. Summary
The above is the basic process and implementation of a non-directional crawler written in golang. Of course, this is just a simple example; in real development you also need to consider anti-crawler strategies, thread safety and other issues. I hope it is helpful to readers.