Using regular expressions to find similar text

Reposted by 王林, 2024-02-14 19:03:08

From PHP editor Youzi: regular expressions are a powerful text-matching tool, used in string processing, data extraction, and input validation. This article looks at how they can be used to find text that is similar to, rather than identical to, a given pattern.

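Question

I have lists of text recognized (via OCR) from various PDF documents. Now I need to extract some values from each text using regular expressions. Some of my patterns look like this: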

some text[ -]?(.+)[ ,-]+some other text

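The problem is that some letters may come out wrong after recognition ("0" instead of "o", "i" instead of "l", and so on), so my patterns fail to match.

I would like to use something like a regular expression with Jaro-Winkler or Levenshtein similarity, so that I can extract my_value from text such as: s0me text my_value, some otner text

I know this may sound far-fetched, but maybe there is a solution to this problem.

By the way, I am using Java, but solutions in other languages are acceptable.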

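Solution

If you use the regex module in Python (a third-party package, not the built-in re), you can use fuzzy matching. The following regular expression allows at most 2 errors in each phrase. More specific error constraints (separate limits for insertions, substitutions, and deletions) are also possible; see the regex module documentation for details.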

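import regex  # third-party module: pip install regex

# OCR output containing two recognition errors: "s0me" and "otner"
txt = 's0me text my_value, some otner text'
# {e<=2} permits up to 2 errors (insertions, deletions, or substitutions) in the preceding group
pattern = regex.compile(r'(?:some text){e<=2}[ -]?(.+?)[ ,-]+(?:some other text){e<=2}')

m = pattern.search(txt)
if m is not None:
    print(m.group(1))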

Output:

my_value
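Where a fuzzy-matching engine is not available (Python's built-in re and java.util.regex have no {e<=2} quantifier), one workaround is to expand each character that OCR commonly confuses into a character class. The following is only a minimal sketch of that different technique: the CONFUSABLE table and the helper name tolerant are hypothetical, and the approach covers substituted characters only, not inserted or dropped ones.

```python
import re

# Hypothetical table of characters that OCR tends to confuse; extend as needed.
CONFUSABLE = ["o0", "il1", "hn", "s5"]


def tolerant(literal):
    """Turn a literal phrase into a regex that accepts confusable characters."""
    parts = []
    for ch in literal:
        group = next((g for g in CONFUSABLE if ch.lower() in g), None)
        # Character class for confusable characters, escaped literal otherwise.
        parts.append("[" + group + "]" if group else re.escape(ch))
    return "".join(parts)


# Same structure as the question's pattern, with tolerant phrases on both sides.
pattern = re.compile(
    tolerant("some text") + r"[ -]?(.+?)[ ,-]+" + tolerant("some other text"),
    re.IGNORECASE,
)

m = pattern.search("s0me text my_value, some otner text")
if m is not None:
    print(m.group(1))  # my_value
```

Because the result is an ordinary pattern string, the same expansion can be generated once and passed to java.util.regex.Pattern.compile in a Java code base.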

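The same idea can be implemented in Go by combining a regular expression with a Levenshtein-distance check (here using the github.com/agnivade/levenshtein package):

package main

import (
    "fmt"
    "regexp"
    "strings"

    "github.com/agnivade/levenshtein"
)

// wordRe locates individual words; every pair of adjacent words is a candidate prefix.
var wordRe = regexp.MustCompile(`\w+`)

// isClose reports whether text is within threshold edits of expected, ignoring case.
func isClose(text, expected string, threshold int) bool {
    return levenshtein.ComputeDistance(strings.ToLower(text), strings.ToLower(expected)) <= threshold
}

// findMatches returns the value that follows each phrase resembling "some text",
// up to the next comma.
func findMatches(text string, threshold int) []string {
    const expectedPattern = "some text" // The phrase we expect to find
    words := wordRe.FindAllStringIndex(text, -1)

    var validMatches []string
    for i := 0; i+1 < len(words); i++ {
        // Candidate prefix: two adjacent words, including whatever separates them
        prefix := text[words[i][0]:words[i+1][1]]
        if !isClose(prefix, expectedPattern, threshold) {
            continue
        }
        // The value is everything after the prefix, up to the next comma
        rest := text[words[i+1][1]:]
        if end := strings.IndexByte(rest, ','); end >= 0 {
            rest = rest[:end]
        }
        if value := strings.Trim(rest, " -"); value != "" {
            validMatches = append(validMatches, value)
        }
    }

    return validMatches
}

func main() {
    text := "This is a sample text with s0me text MY_VALUE, some otner text."
    threshold := 2 // Maximum edit distance tolerated for the prefix

    matches := findMatches(text, threshold)
    fmt.Println("Matches found:", matches)
}

The regular expression \w+ splits the text into words. Each pair of adjacent words is compared with "some text" using the Levenshtein distance, and when the pair is close enough, the characters that follow it, up to the next comma, are returned as the value. For the sample text this prints: Matches found: [MY_VALUE]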
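The Levenshtein distance used for the comparison above is the number of single-character insertions, deletions, and substitutions needed to turn one string into another. As an illustration only (the function name levenshtein is chosen here and is not part of any library mentioned in the answers), it can be computed with a short dynamic-programming routine:

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("s0me text", "some text"))    # 1
print(levenshtein("sample text", "some text"))  # 3
```

With a threshold of 2, for instance, "s0me text" would be accepted as a misread "some text", while the unrelated phrase "sample text" would be rejected.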


Notice:

This article is reposted from stackoverflow.com. In case of infringement, please contact admin@php.cn for removal.
