Writing a Non-Directional Crawler in Golang
1. Foreword
With the development of the Internet, web crawlers are used in more and more scenarios. In daily life we can obtain all kinds of information through them, such as news, stocks, weather, movies and music, and they play an especially important role in big data analysis and artificial intelligence. This article explains how to use the golang language to write a non-directional crawler (that is, one with no specific target website) to obtain information from the Internet.
2. Introduction to golang
Golang is a programming language developed by Google. Due to its concurrency, high performance, simplicity and ease of learning, it is increasingly favored by programmers. The golang version used in this article is 1.14.2.
3. Implementation ideas
This crawler is mainly divided into the following steps:
Obtain the starting URL. It can be entered manually, read from a file, read from a database, and so on.
Send an HTTP GET or POST request to obtain the response data.
Parse the data with regular expressions or a third-party library, according to the format of the response data.
Store the data in files, in a database, or with another storage method, depending on your needs.
Parse new URLs from the hyperlinks and other information in the response data; these become the next URLs to crawl.
For each new URL, send the HTTP request again, parse the response data, store the data and extract further URLs; repeat until there are no new URLs left. A minimal sketch of this loop follows.
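To make these steps concrete, here is a minimal, simplified sketch of the crawl loop. The names fetch, parseLinks and save are placeholders for the concrete functions implemented in the next section, not part of the original code.

// Minimal sketch of the crawl loop; fetch, parseLinks and save are
// placeholders for the concrete functions shown in section 4.
func crawl(start string, fetch func(string) (string, error),
    parseLinks func(url, html string) []string, save func(html string)) {
    queue := []string{start}
    visited := make(map[string]bool)
    for len(queue) > 0 {
        url := queue[0]
        queue = queue[1:]
        if visited[url] {
            continue
        }
        visited[url] = true
        html, err := fetch(url)
        if err != nil {
            continue // skip pages that cannot be fetched
        }
        save(html)
        for _, next := range parseLinks(url, html) {
            if !visited[next] {
                queue = append(queue, next)
            }
        }
    }
}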
4. Code Implementation
In golang, we use the net/http package to send HTTP requests, and either the regexp package or a third-party library to parse the response data. This article uses the goquery library.
First, we define an init function that obtains the starting URL, sets up the HTTP client, and declares the package-level variables these operations use.
var (
    startUrl string
    client   *http.Client
)

func init() {
    // obtain the starting URL from the command line
    flag.StringVar(&startUrl, "url", "", "please enter the starting URL")
    flag.Parse()

    // set up the HTTP client: 30-second timeout, do not follow redirects
    client = &http.Client{
        Timeout: 30 * time.Second,
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
}
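With this in place, the crawler can be started with a command such as go run main.go -url https://example.com (the file name and address here are only examples); the -url flag comes from the flag definition above.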
Define a function responsible for sending http requests and obtaining response data.
func GetHtml(url string) (string, error) {
    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return "", err
    }
    defer resp.Body.Close()

    // read the whole response body and return it as a string
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return "", err
    }
    return string(body), nil
}
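Some sites reject requests that do not carry a browser-like User-Agent header. If you run into this, GetHtml can be extended to build the request explicitly. The variant below is only a sketch: GetHtmlWithUA is a hypothetical name and the header value is just an example.

// GetHtmlWithUA is a possible variant of GetHtml that sets a custom
// User-Agent header; the header value is only an example.
func GetHtmlWithUA(url string) (string, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; my-crawler/0.1)")

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}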
Use the goquery library to parse the response data. The specific implementation is as follows:
func ParseSingleHTML(html string, query string) []string {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Println(err)
        return nil
    }
    result := make([]string, 0)
    doc.Find(query).Each(func(i int, selection *goquery.Selection) {
        href, ok := selection.Attr("href")
        if ok {
            result = append(result, href)
        } else {
            // elements without an href attribute (for example <title>)
            // contribute their text instead, so page titles can be saved later
            result = append(result, selection.Text())
        }
    })
    return result
}
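As a usage example, the hypothetical helper below prints the href of every link on a page; it assumes html holds a page already downloaded with GetHtml and that fmt is imported.

// printLinks is a hypothetical helper showing how ParseSingleHTML is called.
func printLinks(html string) {
    for _, link := range ParseSingleHTML(html, "a") {
        fmt.Println(link)
    }
}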
Define a function responsible for storing data into a file.
func SaveData(data []string) error {
    file, err := os.OpenFile("data.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Println(err)
        return err
    }
    defer file.Close()

    writer := bufio.NewWriter(file)
    for _, line := range data {
        // write one entry per line
        _, err := writer.WriteString(line + "\n")
        if err != nil {
            log.Println(err)
            return err
        }
    }
    writer.Flush()
    return nil
}
Use a regular expression to extract new URLs from hyperlinks and convert relative links into absolute ones.
func ParseHref(url, html string) []string {
    re := regexp.MustCompile(`<a[\s\S]+?href="(.*?)"[\s\S]*?>`)
    matches := re.FindAllStringSubmatch(html, -1)
    result := make([]string, 0)
    for _, match := range matches {
        href := match[1]
        if strings.HasPrefix(href, "//") {
            // protocol-relative link
            href = "http:" + href
        } else if strings.HasPrefix(href, "/") {
            // root-relative link
            href = strings.TrimSuffix(url, "/") + href
        } else if strings.HasPrefix(href, "http://") || strings.HasPrefix(href, "https://") {
            // already absolute, keep as-is
        } else {
            // treat anything else as relative to the current URL
            href = url + "/" + href
        }
        result = append(result, href)
    }
    return result
}
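The hand-written prefix checks above cover the common cases, but the standard net/url package resolves relative references more robustly. The following is only a sketch of that alternative: ParseHrefResolved is a hypothetical name, and the package is imported under the alias neturl because url is already used as a variable name in the functions above.

import neturl "net/url" // aliased because "url" is used as a variable name above

// ParseHrefResolved resolves raw hrefs against a base URL using net/url.
func ParseHrefResolved(base string, hrefs []string) []string {
    baseURL, err := neturl.Parse(base)
    if err != nil {
        return nil
    }
    result := make([]string, 0, len(hrefs))
    for _, h := range hrefs {
        ref, err := neturl.Parse(h)
        if err != nil {
            continue
        }
        // ResolveReference handles "//", "/", "../" and plain relative paths
        result = append(result, baseURL.ResolveReference(ref).String())
    }
    return result
}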
Finally, we need to define a main function to implement the entire crawler process.
func main() {
    // make sure the starting URL is not empty
    if startUrl == "" {
        fmt.Println("please specify a starting URL")
        return
    }

    // initialize the queue of URLs to visit
    queue := list.New()
    queue.PushBack(startUrl)

    // initialize the set of visited URLs
    visited := make(map[string]bool)

    // crawl loop
    for queue.Len() > 0 {
        // pop a URL from the front of the queue
        elem := queue.Front()
        queue.Remove(elem)
        url, ok := elem.Value.(string)
        if !ok {
            log.Println("invalid URL format")
            continue
        }

        // skip URLs that have already been visited
        if visited[url] {
            continue
        }
        visited[url] = true

        // send the HTTP request and get the response data
        html, err := GetHtml(url)
        if err != nil {
            continue
        }

        // parse the response data to get new URLs
        hrefs := ParseHref(url, html)
        for _, href := range hrefs {
            if !visited[href] {
                hrefHtml, err := GetHtml(href)
                if err != nil {
                    continue
                }
                hrefUrls := ParseSingleHTML(hrefHtml, "a")
                // add the new URLs to the queue
                for _, hrefUrl := range hrefUrls {
                    queue.PushBack(hrefUrl)
                }
            }
        }

        // store the page data (here, the title) to a file
        data := ParseSingleHTML(html, "title")
        err = SaveData(data)
        if err != nil {
            continue
        }
    }
}
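The crawler above is single-threaded. If you later fetch pages concurrently with goroutines, note that Go maps are not safe for concurrent writes, so the visited set needs synchronization. The following is a minimal sketch of one possible approach, assuming the sync package is imported; visitedSet and its methods are hypothetical names, not part of the original code.

// visitedSet is a sketch of a mutex-protected set of visited URLs,
// useful if the crawl loop is later parallelized with goroutines.
type visitedSet struct {
    mu   sync.Mutex
    seen map[string]bool
}

func newVisitedSet() *visitedSet {
    return &visitedSet{seen: make(map[string]bool)}
}

// markIfNew records the URL and reports whether it was seen for the first time.
func (v *visitedSet) markIfNew(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}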
5. Summary
The above is the basic process and implementation of a non-directional crawler written in golang. Of course, this is just a simple example; in real development you also need to consider anti-crawler strategies, thread safety and other issues. I hope it is helpful to readers.