How to Handle Non-ASCII Characters in Go's Regular Expression Boundaries?

Susan Sarandon
2024-10-30 02:24:02

Golang Regular Expression Boundary and Non-ASCII Characters

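Go's regular expression word boundary (\b) is ASCII-only: it matches at a position where a word character from the ASCII class \w ([0-9A-Za-z_]) sits on one side and a non-word character, or the start or end of the text, sits on the other. As a result, it may not behave as expected when accented Latin letters or other non-ASCII characters are involved.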

The Problem

In Go, \b treats only ASCII letters, digits, and the underscore as word characters. For instance, the regex \b(vis)\b is intended to match "vis" only when it stands alone as a word. However, in "révisé" the letters on either side of "vis" are "é", which \b does not count as word characters, so it reports a boundary on both sides and the pattern matches.

Consider the following Go code:

<code class="go">package main

import (
    "fmt"
    "regexp"
)

func main() {
    // MustCompile panics on an invalid pattern instead of discarding the error.
    r := regexp.MustCompile(`\b(vis)\b`)
    fmt.Println(r.MatchString("re vis e")) // Expected true
    fmt.Println(r.MatchString("revise"))   // Expected false
    fmt.Println(r.MatchString("révisé"))   // Expected false
}</code>

Running this code produces:

true
false
true

Notice that the last line matches "révisé", even though "vis" is not a standalone word there.

The Solution

To handle cases with non-ASCII characters, you can define your own custom boundary pattern. One approach is to replace \b with the following regex:

(?:\A|\s)(vis)(?:\s|\z)

This pattern means:

  • (?:\A|\s): Matches the start of the string or a whitespace character.
  • (vis): Captures the word "vis".
  • (?:\s|\z): Matches a whitespace character or the end of the string.

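This custom boundary treats only whitespace and the ends of the string as boundaries, so accented letters such as "é" no longer count as a break between words. Unlike \b, it consumes the whitespace it matches, and it does not treat punctuation as a boundary, so "vis," would not match.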
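Since the surrounding groups consume whitespace, the full match can be wider than the word itself. A minimal sketch of reading capture group 1 instead is shown here; the names visWord and captureWord are hypothetical and not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as above: whitespace or string ends act as the boundary.
var visWord = regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)

// captureWord (a hypothetical helper) returns the text of capture group 1
// and whether the pattern matched at all.
func captureWord(text string) (string, bool) {
	m := visWord.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// The full match includes the whitespace consumed on both sides.
	fmt.Printf("%q\n", visWord.FindString("re vis e")) // " vis "

	// Capture group 1 holds only the word.
	word, ok := captureWord("re vis e")
	fmt.Printf("%q %v\n", word, ok) // "vis" true

	_, ok = captureWord("révisé")
	fmt.Println(ok) // false
}
```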

By incorporating this custom pattern into the regex, you can obtain the desired result:

<code class="go">package main

import (
    "fmt"
    "regexp"
)

func main() {
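    // MustCompile panics on an invalid pattern instead of discarding the error.
    r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
    fmt.Println(r.MatchString("vis"))      // Added this case; expected true
    fmt.Println(r.MatchString("re vis e")) // Expected true
    fmt.Println(r.MatchString("revise"))   // Expected false
    fmt.Println(r.MatchString("révisé"))   // Expected false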
}</code>

Running this code now gives:

true
true
false
false

As you can see, "révisé" is correctly excluded as a match.
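If punctuation should also count as a boundary, one possible variation (an assumption on top of the approach above, not part of the original answer) is to define the boundary as "anything that is not a Unicode letter, mark, digit, or underscore" using Go's \p{...} classes. The function name containsWord is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonWord matches one character that is not a Unicode letter (\p{L}),
// combining mark (\p{M}), number (\p{N}), or underscore.
const nonWord = `[^\p{L}\p{M}\p{N}_]`

// containsWord (a hypothetical helper) reports whether word occurs in text
// with a non-word character, or a string end, on each side.
func containsWord(text, word string) bool {
	// QuoteMeta escapes any regex metacharacters in word.
	pattern := `(?:\A|` + nonWord + `)` + regexp.QuoteMeta(word) + `(?:\z|` + nonWord + `)`
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	fmt.Println(containsWord("vis", "vis"))       // true
	fmt.Println(containsWord("re vis e", "vis"))  // true
	fmt.Println(containsWord("vis, vidi", "vis")) // true: the comma acts as a boundary
	fmt.Println(containsWord("revise", "vis"))    // false
	fmt.Println(containsWord("révisé", "vis"))    // false: "é" is a Unicode letter
}
```

Like the whitespace version, this pattern consumes the boundary characters, so adjacent occurrences separated by a single character may need to be searched for one at a time.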
