Golang with Colly: Use Random Fake User-Agents When Scraping

Barbara Streisand
2025-01-11 07:57:49

Scrapers are often blocked because they send a default or obviously non-browser user-agent. This article shows a simple way to reduce that risk by sending randomized fake user-agents from your Go Colly scrapers.

Understanding Fake User-Agents

User-agents are strings identifying the client making a web request. They convey information about the application, operating system (Windows, macOS, Linux), and browser (Chrome, Firefox, Safari). Websites use this information for various purposes, including security and analytics.

A typical user-agent string might look like this (Chrome on Android):

<code>'User-Agent': 'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36'</code>

Go Colly's default user-agent:

<code>"User-Agent": "colly - https://github.com/gocolly/colly",</code>

easily identifies your scraper, increasing the risk of being blocked. Therefore, employing a custom, randomized user-agent is crucial.
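To make the blocking risk concrete, here is a minimal sketch of the kind of check a site could run on incoming requests. The function name looksLikeBot and the marker list are hypothetical and chosen only for illustration; real sites apply their own, usually more elaborate, rules.

```go
package main

import (
    "fmt"
    "strings"
)

// botMarkers is a hypothetical list of substrings that a site might treat
// as signs of an automated client. Real block lists differ from site to site.
var botMarkers = []string{"colly", "python-requests", "curl", "go-http-client"}

// looksLikeBot reports whether the user-agent string is empty or contains
// one of the markers above (case-insensitive).
func looksLikeBot(userAgent string) bool {
    ua := strings.ToLower(strings.TrimSpace(userAgent))
    if ua == "" {
        return true
    }
    for _, marker := range botMarkers {
        if strings.Contains(ua, marker) {
            return true
        }
    }
    return false
}

func main() {
    samples := []string{
        "colly - https://github.com/gocolly/colly",
        "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36",
    }
    for _, ua := range samples {
        // Print the verdict followed by the user-agent that was checked.
        fmt.Printf("%-5t %s\n", looksLikeBot(ua), ua)
    }
}
```

A user-agent that names the scraping library matches a check like this immediately, while a browser-style string passes it.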

Implementing a Fake User-Agent with Go Colly

To send a custom user-agent, set the request header inside the OnRequest() callback, which Colly runs before every request. The example below sends the same fixed user-agent string each time:

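<code class="language-go">package main

import (
    "bytes"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // AllowURLRevisit lets the collector request the same URL more than once.
    c := colly.NewCollector(colly.AllowURLRevisit())

    // Runs before every request: replace the default Colly user-agent.
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")
    })

    // Print the response body on a single line.
    c.OnResponse(func(r *colly.Response) {
        log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
    })

    for i := 0; i < 5; i++ {
        // Use a full URL with an explicit scheme and log any request error.
        if err := c.Visit("https://httpbin.org/headers"); err != nil {
            log.Println(err)
        }
    }
}</code>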

This sets a single user-agent for all requests. For more robust scraping, use a randomized approach.
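Before pointing a scraper at a real site, it can help to confirm which user-agent the server actually receives. The sketch below is an assumption-laden stand-in for the httpbin.org check: it uses only the standard library (net/http and net/http/httptest) rather than Colly, and the helper name echoUserAgent is hypothetical.

```go
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "net/http/httptest"
)

// echoUserAgent starts a throwaway local server that replies with the
// User-Agent header it received, sends one GET request carrying userAgent,
// and returns what the server saw.
func echoUserAgent(userAgent string) (string, error) {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, r.UserAgent())
    }))
    defer srv.Close()

    req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
    if err != nil {
        return "", err
    }
    // Same header that the OnRequest callback sets in the Colly example.
    req.Header.Set("User-Agent", userAgent)

    resp, err := srv.Client().Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    seen, err := echoUserAgent("Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("server saw:", seen)
}
```

Because the server runs locally, the check does not depend on an external service being reachable.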

Rotating Through Random User-Agents

The github.com/lib4u/fake-useragent package simplifies random user-agent selection.

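<code class="language-go">package main

import (
    "bytes"
    "log"

    "github.com/gocolly/colly"
    uaFake "github.com/lib4u/fake-useragent"
)

func main() {
    // Create the user-agent source; stop on failure, since ua would be unusable.
    ua, err := uaFake.New()
    if err != nil {
        log.Fatal(err)
    }
    c := colly.NewCollector(colly.AllowURLRevisit())

    // Pick a new random user-agent before every request.
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", ua.Filter().GetRandom())
    })

    // Print the response body on a single line.
    c.OnResponse(func(r *colly.Response) {
        log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
    })

    for i := 0; i < 5; i++ {
        // Use a full URL with an explicit scheme and log any request error.
        if err := c.Visit("https://httpbin.org/headers"); err != nil {
            log.Println(err)
        }
    }
}</code>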

This code snippet retrieves a random user-agent for each request.
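If an extra dependency is not wanted, the same idea can be approximated with a small hand-maintained pool. The following is a minimal sketch using only the standard library; the names userAgentPool, randomUserAgent, and inPool are hypothetical, and the three strings are illustrative samples rather than a curated list.

```go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// userAgentPool is a small sample for illustration only.
var userAgentPool = []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
}

// randomUserAgent returns a picker that draws uniformly from pool using its
// own seeded generator. The pool must be non-empty, and the picker is not
// safe for concurrent use.
func randomUserAgent(pool []string, seed int64) func() string {
    rng := rand.New(rand.NewSource(seed))
    return func() string {
        return pool[rng.Intn(len(pool))]
    }
}

// inPool reports whether ua is one of the entries in pool.
func inPool(pool []string, ua string) bool {
    for _, candidate := range pool {
        if candidate == ua {
            return true
        }
    }
    return false
}

func main() {
    next := randomUserAgent(userAgentPool, time.Now().UnixNano())
    for i := 0; i < 5; i++ {
        ua := next()
        // Print whether the pick came from the pool, then the pick itself.
        fmt.Println(inPool(userAgentPool, ua), ua)
    }
}
```

In a Colly scraper, the returned picker could be called inside the OnRequest callback in place of ua.Filter().GetRandom(). A hand-maintained pool goes stale as browsers release new versions, which is the main reason to prefer a maintained package.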

Using Specific Fake User-Agents

The github.com/lib4u/fake-useragent package also provides filtering options. For example, to use a random desktop Chrome user-agent, replace the header line in the OnRequest() callback with:

<code class="language-go">r.Headers.Set("User-Agent", ua.Filter().Chrome().Platform(uaFake.Desktop).Get())</code>

Remember to always respect a website's robots.txt and terms of service when scraping. Using random user-agents is one technique among many for responsible web scraping; consider using proxies and other header management strategies as well.
