
How to remove html in golang

PHPz
2023-04-27 09:08:05

Go language practice: How to remove HTML tags?

In web development, we often need to strip HTML tags to obtain plain text, for example when analyzing or processing comments and articles. Go offers several ways to do this, and this article introduces two of them.

Method 1: Use string replacement

The Go standard library provides the strings package for working with strings. We can use the strings.ReplaceAll() function to replace HTML tags with whitespace characters to get plain text content. The specific implementation code is as follows:

package main

import (
    "fmt"
    "strings"
)

func main() {
    html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"

    // Find one "<...>" span at a time and use strings.ReplaceAll()
    // to replace every occurrence of that tag with a space.
    text := html
    for {
        start := strings.Index(text, "<")
        if start == -1 {
            break // no tag left
        }
        end := strings.Index(text[start:], ">")
        if end == -1 {
            break // unterminated "<": leave the rest as it is
        }
        tag := text[start : start+end+1]
        text = strings.ReplaceAll(text, tag, " ")
    }
    // Collapse the whitespace left behind by the removed tags.
    text = strings.Join(strings.Fields(text), " ")

    fmt.Println(text)
}

In the above code, each pass of the loop uses strings.Index() to find the next left angle bracket ("<") and the first right angle bracket (">") after it, treats everything between them as one tag, and calls strings.ReplaceAll() to replace every occurrence of that tag with a space. The space keeps the text on either side of a tag from running together. The loop ends when no complete tag is left. Finally, strings.Fields() splits the result on whitespace and strings.Join() joins the pieces with single spaces, which collapses the extra whitespace and trims both ends, giving the final plain text content.

Running the above code produces the following output:

Test Page Hello, Go!

The above code is simple to implement, but there are several problems:

  1. The code only looks at angle brackets and does not parse HTML. For a tag with attributes, such as a link around the text "Google", a ">" inside a quoted attribute value ends the tag too early, and a stray "<" in the text is treated as the start of a tag, so fragments are left in the output or text is lost.
  2. The contents of script and style elements are not tags, so JavaScript and CSS end up in the result. If the HTML contains a lot of such content, the repeated string replacement also becomes slow.
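The speed concern in the second point can be eased without leaving the standard library. The following is a minimal sketch, not a full solution: stripTags is a hypothetical helper name, and it walks the input once with a strings.Builder instead of rewriting the string repeatedly. It still only looks at angle brackets, so the first problem and the script/style problem remain.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTags is a hypothetical helper: it copies every character that lies
// outside a "<...>" span and writes one space in place of each tag.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
			b.WriteByte(' ') // keep the text on either side of a tag apart
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Collapse the whitespace left behind by the removed tags.
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"
	fmt.Println(stripTags(html))
}
```

Because the input is read only once, the cost grows linearly with the size of the document.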

Considering these issues, we can use the second method.

Method 2: Use the Goquery library

Goquery is an HTML parsing and manipulation library in the Go language, providing a convenient and flexible API. We can use the Goquery library to parse HTML and filter text nodes to obtain plain text content. The specific implementation code is as follows:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // Collect the text of leaf elements (elements with no child elements),
    // skipping script and style.
    var text string
    doc.Find(":not(script):not(style)").Each(func(_ int, sel *goquery.Selection) {
        if sel.Children().Length() == 0 {
            text += sel.Text() + " "
        }
    })

    fmt.Println(strings.TrimSpace(text))
}

In the above code, we use the goquery.NewDocumentFromReader() function to parse the HTML into a goquery.Document object. Next, we use the doc.Find() method to select all elements except script and style, and use sel.Children().Length() to check whether the current element is a leaf, that is, whether it has no child elements. If so, its text is appended to the text variable, followed by a space. Note that text placed directly inside an element that also has child elements is skipped by this check. Finally, the strings.TrimSpace() function removes the whitespace at both ends of the string to obtain the final plain text content.

Running the above code produces the following output:

Test Page Hello, Go!

Because Goquery parses the document instead of matching angle brackets, it copes with tags in various formats, including tags with attributes, and the code is easier to read and maintain.

This article introduced two methods for removing HTML tags: string replacement and the Goquery library. Regular expressions are another commonly used option. In practical applications, we can choose the method that best suits the specific situation.
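Since regular expressions are mentioned but not shown, here is a minimal sketch of that approach using the standard regexp package. The names tagPattern and stripTagsRegexp are hypothetical, and the pattern is an assumption about what counts as a tag: "<", any characters other than ">", then ">". Like Method 1, it does not parse HTML, so it keeps script and style contents and can be confused by a ">" inside an attribute value.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagPattern matches "<", any run of characters other than ">", then ">".
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripTagsRegexp is a hypothetical helper that replaces every match of
// tagPattern with a space and then collapses the resulting whitespace.
func stripTagsRegexp(s string) string {
	text := tagPattern.ReplaceAllString(s, " ")
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"
	fmt.Println(stripTagsRegexp(html))
}
```

Compiling the pattern once at package level avoids recompiling it on every call.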
