
The HybridSimilarity Algorithm

Linda Hamilton · 2025-01-21

A Deep Dive into the HybridSimilarity Algorithm

This article explores the HybridSimilarity algorithm, a composite neural network designed to assess the similarity between pairs of texts. The hybrid model integrates lexical, phonetic, semantic, and syntactic comparisons into a single comprehensive similarity score.

<code class="language-python">import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn

class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        self.fc = nn.Sequential(
            nn.Linear(6, 256),  # six scalar features are produced by _extract_features
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def _extract_features(self, text1, text2):
        features = {}

        # Lexical analysis: character-level edit ratio and word-level Jaccard overlap
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        words1, words2 = set(text1.split()), set(text2.split())
        features['jaccard'] = len(words1 & words2) / len(words1 | words2)

        # Phonetic analysis: exact match of Metaphone encodings
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0

        # Semantic analysis: cosine similarity of sentence-transformer embeddings
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()

        # Syntactic analysis: LSA via TF-IDF + TruncatedSVD, then cosine of the projections
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)
        features['lsa_cosine'] = float(
            np.dot(lsa[0], lsa[1]) /
            (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-8)
        )

        # Attention mechanism: emb1 attends to emb2; shapes are (seq_len=1, batch=1, 384)
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()

        return torch.tensor(list(features.values()), dtype=torch.float32).unsqueeze(0)

    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features).item()

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    model.eval()
    with torch.no_grad():
        return model(text1, text2)</code>

Core Components

The HybridSimilarity model relies on the following key components:

  • Sentence Transformers: a pre-trained transformer model used to generate semantic embeddings.
  • Levenshtein distance: computes lexical similarity from character-level edits.
  • Metaphone: determines phonetic similarity.
  • TF-IDF and TruncatedSVD: apply Latent Semantic Analysis (LSA) for syntactic similarity.
  • PyTorch: provides the framework for the custom neural network, including its attention mechanism and fully connected layers.
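
A quick way to verify the environment is to exercise the two smaller dependencies directly. This is a minimal smoke test, assuming the packages are installed as python-Levenshtein and phonetics; the sample words are illustrative only.

<code class="language-python"># Minimal dependency smoke test (package names are assumptions about the install)
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone

print(levenshtein_ratio("kitten", "sitting"))   # character-level similarity, ~0.615
print(metaphone("night"), metaphone("knight"))  # phonetic codes; expected to be identical</code>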

Detailed Breakdown

1. Model Setup

The HybridSimilarity class, which extends nn.Module, initializes:

  • A BERT-based sentence embedding model (all-MiniLM-L6-v2).
  • A TF-IDF vectorizer.
  • A multi-head attention mechanism.
  • A fully connected network that aggregates the features and produces the final similarity score.
<code class="language-python">self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),  # six scalar features feed the fusion network
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)</code>
2. Feature Extraction

The _extract_features method computes several similarity features; a combined worked example follows the list below.

  • Lexical similarity:
    • Levenshtein ratio: quantifies the character-level edits (insertions, deletions, substitutions) needed to turn one text into the other.
    • Jaccard index: measures the overlap of unique words between the two texts.
<code class="language-python">features['levenshtein'] = levenshtein_ratio(text1, text2)
words1, words2 = set(text1.split()), set(text2.split())
features['jaccard'] = len(words1 & words2) / len(words1 | words2)</code>
  • Phonetic similarity:
    • Metaphone encoding: compares the phonetic representations of the two texts.
<code class="language-python">features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0</code>
  • Semantic similarity:
    • BERT embeddings are generated, and their cosine similarity is computed.
<code class="language-python">emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()</code>
  • Syntactic similarity:
    • The texts are vectorized with TF-IDF, and LSA is applied via TruncatedSVD.
<code class="language-python">tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)
features['lsa_cosine'] = float(
    np.dot(lsa[0], lsa[1]) /
    (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-8)
)</code>
  • Attention-based features:
    • Multi-head attention processes the two embeddings, and the mean of the attention output serves as a feature.
<code class="language-python">att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),  # query: (seq_len=1, batch=1, 384)
    emb2.unsqueeze(0).unsqueeze(0),  # key
    emb2.unsqueeze(0).unsqueeze(0)   # value
)
features['attention_score'] = att_output.mean().item()</code>
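
To make the non-neural signals concrete, here is a self-contained sketch that exercises the lexical, phonetic, and LSA features on toy sentences; the strings, and hence the printed values, are illustrative only.

<code class="language-python">import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone

t1, t2 = "the cat sat on the mat", "the cat slept on a mat"

# Lexical: edit ratio plus Jaccard = |intersection| / |union| over unique words
w1, w2 = set(t1.split()), set(t2.split())
print(levenshtein_ratio(t1, t2))
print(len(w1 & w2) / len(w1 | w2))  # 4 shared / 7 unique words, about 0.571

# Phonetic: the feature is 1.0 only when whole-string Metaphone codes match
print(metaphone("smith") == metaphone("smyth"))  # expected True

# Syntactic: cosine of the 1-D LSA projections (degenerates to ±1 with n_components=1)
m = TfidfVectorizer().fit_transform([t1, t2])
lsa = TruncatedSVD(n_components=1).fit_transform(m)
print(float(np.dot(lsa[0], lsa[1]) /
            (np.linalg.norm(lsa[0]) * np.linalg.norm(lsa[1]) + 1e-8)))</code>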
3. Neural Network Fusion

The six extracted features are combined into a single vector and fed into the fully connected network, which outputs a similarity score between 0 and 1.

<code class="language-python">def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features).item()</code>
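
The fusion layers are randomly initialized, so their raw output is not yet meaningful; in practice they would be trained on labeled pairs. Below is a minimal training sketch, assuming a toy list of (text1, text2, label) tuples; the data, epoch count, and learning rate are assumptions for illustration. Only the fc layers receive gradients here, because the extracted features are plain Python floats and thus detached from the attention module.

<code class="language-python"># Hypothetical training loop; pairs, epochs, and lr are illustrative assumptions
model = HybridSimilarity()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

pairs = [
    ("the cat sat on the mat", "a cat was sitting on the mat", 1.0),
    ("the cat sat on the mat", "interest rates rose sharply", 0.0),
]

model.train()
for epoch in range(10):
    for t1, t2, label in pairs:
        pred = model.fc(model._extract_features(t1, t2)).squeeze()
        loss = loss_fn(pred, torch.tensor(label))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()</code>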

Practical Usage

The similarity_coefficient function instantiates the model and computes the similarity between two input texts.

<code class="language-python">def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    model.eval()
    with torch.no_grad():
        return model(text1, text2)</code>

This returns a float between 0 and 1 that represents how similar the two texts are.
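
For example (the sentences are illustrative, and because the fully connected layers start out untrained, the exact value will vary between runs):

<code class="language-python">score = similarity_coefficient(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.3f}")  # a float in (0, 1); arbitrary until the model is trained</code>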

Conclusion

By integrating multiple facets of text comparison, the HybridSimilarity algorithm offers a robust approach to measuring text similarity. Combining lexical, phonetic, semantic, and syntactic analysis yields a more comprehensive and nuanced view of similarity, making it suitable for applications such as duplicate detection, text clustering, and information retrieval.
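
As a sketch of the duplicate-detection use case, a hypothetical helper might reuse a single model instance across all pairs; the find_duplicates name and the 0.8 threshold are assumptions for illustration, not part of the original algorithm.

<code class="language-python">def find_duplicates(texts, threshold=0.8):
    # One model instance, so the sentence transformer is loaded only once
    model = HybridSimilarity()
    model.eval()
    duplicates = []
    with torch.no_grad():
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if model(texts[i], texts[j]) >= threshold:
                    duplicates.append((texts[i], texts[j]))
    return duplicates</code>

Reusing one instance avoids reloading the sentence transformer for every pair, which calling the one-shot similarity_coefficient helper in a loop would otherwise do.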

