Home >Backend Development >Golang >Why Does Go\'s Regex \\b Boundary Fail with Non-ASCII Characters?
Golang Regex Boundary Issue with Non-ASCII Characters
In Go, the b boundary option is expected to match at the boundary of ASCII characters, excluding accented characters such as é. This behavior can lead to unexpected results when working with strings containing non-ASCII characters. For instance, consider the following code:
<code class="go">package main import ( "fmt" "regexp" ) func main() { r, _ := regexp.Compile(`\b(vis)\b`) fmt.Println(r.MatchString("re vis e")) // True fmt.Println(r.MatchString("revise")) // False fmt.Println(r.MatchString("révisé")) // True }</code>
In this example, the b(vis)b regex matches the substring "vis" at word boundaries. However, when applied to "révisé", it incorrectly returns True because é is not considered a word character. To address this issue, you can employ an alternative approach:
<code class="go">r, _ := regexp.Compile(`(?:\A|\s)(vis)(?:\s|\z)`) fmt.Println(r.MatchString("vis")) // True fmt.Println(r.MatchString("re vis e")) // True fmt.Println(r.MatchString("revise")) // False fmt.Println(r.MatchString("révisé")) // False</code>
This solution utilizes a non-capturing group (?:A|s)(vis)(?:s|z) to match any of the following characters:
This mimics the behavior of b but includes non-ASCII characters as potential word boundaries. By combining these components, it successfully matches "vis" at the beginning or end of a word, regardless of the surrounding characters.
The above is the detailed content of Why Does Go\'s Regex \\b Boundary Fail with Non-ASCII Characters?. For more information, please follow other related articles on the PHP Chinese website!