
golang query html

WBOY
2023-05-19 10:46:07

Preface

Go is a modern programming language valued for its efficiency, simplicity, and cross-platform support. It is widely used in server-side programming, cloud computing, container tooling, and other fields. This article introduces how to use third-party libraries to query HTML documents in Go.

1. Go language and HTML

HTML is a markup language used to build web pages. It defines the structure of a page's elements and works together with technologies such as CSS and JavaScript to provide styling and interactivity. Go is a compiled, statically typed language with built-in concurrency support, known for its efficiency. The Go standard library does not include an HTML parser, but we can accomplish this task with packages outside the standard library.

2. HTML parsing in Go language

In Go, several packages can parse HTML documents, such as golang.org/x/net/html and github.com/PuerkitoBio/goquery. These packages provide functions and types for parsing, traversing, and modifying HTML documents.

2.1 Using golang.org/x/net/html

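golang.org/x/net/html is a package maintained by the Go project in its supplementary golang.org/x repositories. It is not part of the standard library, so it must be added as a module dependency. It provides an HTML5-compliant tokenizer and parser. Next, we'll demonstrate how to use the package to query node data in an HTML document.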

The following is a simple HTML document:

<!DOCTYPE html>
<html>
  <head>
    <title>A Simple HTML Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>

We now want to query the text content of all paragraph nodes (<p> tags) in this document. First, we parse the HTML document into a DOM tree, and then we query the node data by recursively traversing that tree.

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "strings"
)

var htmlString = `
<!DOCTYPE html>
<html>
  <head>
    <title>A Simple HTML Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
`

func main() {
    reader := strings.NewReader(htmlString)
    doc, err := html.Parse(reader)
    if err != nil {
        fmt.Println("Failed to parse HTML string:", err)
        return
    }
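    // find walks the tree depth-first and prints the text of each <p> element.
    var find func(*html.Node)
    find = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "p" {
            // Guard against empty paragraphs and non-text first children,
            // which would otherwise panic or print a tag name.
            if c := n.FirstChild; c != nil && c.Type == html.TextNode {
                fmt.Println(c.Data)
            }
            return
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            find(c)
        }
    }
    find(doc)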
}

In the above code, we use strings.NewReader() to wrap the string in an io.Reader and pass it to the html.Parse() function, which parses the HTML document and returns the root node of the tree. Then, we define a recursive function named find() to traverse the DOM tree and find nodes that meet the condition. When it reaches a paragraph node, it prints the data of the node's first child, which for this sample is the paragraph's text. Finally, we call find() on the root node to output the text of all paragraph nodes.

2.2 Using github.com/PuerkitoBio/goquery

github.com/PuerkitoBio/goquery is a popular Go library that offers a jQuery-like API for parsing HTML and querying it with CSS selectors. With goquery, we can traverse and query HTML documents without walking the DOM tree by hand.

This section uses the same sample HTML document as section 2.1.

We now want to query the text content of all paragraph nodes in the document, which can be easily achieved using goquery:

package main

import (
    "fmt"
    "github.com/PuerkitoBio/goquery"
    "strings"
)

var htmlString = `
<!DOCTYPE html>
<html>
  <head>
    <title>A Simple HTML Document</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
  </body>
</html>
`

func main() {
    reader := strings.NewReader(htmlString)
    doc, err := goquery.NewDocumentFromReader(reader)
    if err != nil {
        fmt.Println("Failed to parse HTML string:", err)
        return
    }
    doc.Find("p").Each(func(i int, s *goquery.Selection) {
        fmt.Println(s.Text())
    })
}

In the above code, we use strings.NewReader() to wrap the string in an io.Reader and pass it to the goquery.NewDocumentFromReader() function, which parses the HTML document. Then, we use doc.Find("p") to select all paragraph nodes, and Each() calls our function once per match, where s.Text() returns the text content of that node.

3. Summary

This article introduced how to query the content of HTML documents in Go. We explored two approaches: golang.org/x/net/html, which parses a document into a node tree that we traverse ourselves, and github.com/PuerkitoBio/goquery, which lets us select nodes with CSS selectors. Both can parse HTML documents and provide an API for traversing and manipulating the resulting tree, so either one can be used to extract data from HTML in your applications.
