
An easy way to remove PII before sending to LLMs

Barbara Streisand | Original: 2024-11-25 20:20:18

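Not every scenario requires perfect anonymization. In less critical cases, a lightweight anonymization pipeline is enough. Here I share a Python-based approach that uses GLiNER, Faker, and rapidfuzz to anonymize text by replacing sensitive entities with realistic placeholders.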

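The code first uses GLiNER to identify sensitive entities such as names, countries, and professions. It then replaces them with fake ones generated by Faker. Approximate string matching (rapidfuzz) ensures that variations of an entity in the text are anonymized as well. After the LLM has processed the text, the original entities are restored.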
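The matching-and-restoring idea can be tried without any model or API key. The following is a minimal sketch, not the article's code: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, and the helper names `best_window` and `fuzzy_replace`, the sample names, and the punctuation-stripping step are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def best_window(entity: str, text: str) -> tuple[str, float]:
    """Return the word window of `text` most similar to `entity`, with a 0-100 score."""
    words = text.split()
    n = len(entity.split())
    # Windows have as many words as the entity, so the entity can appear in full.
    windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not windows:
        return "", 0.0

    def score(candidate: str) -> float:
        # Case-insensitive similarity ratio, scaled to 0-100 (difflib stand-in for rapidfuzz).
        return 100 * SequenceMatcher(None, entity.lower(), candidate.lower()).ratio()

    best = max(windows, key=score)
    return best, score(best)


def fuzzy_replace(text: str, mapping: dict[str, str], threshold: float = 80) -> str:
    """Replace the closest match of each key with its value if the score clears the threshold."""
    for original, substitute in mapping.items():
        match, match_score = best_window(original, text)
        # Drop surrounding punctuation so "ladha." is replaced without eating the period.
        core = match.strip(".,;:!?\"'()")
        if core and match_score > threshold:
            # re.escape treats the match literally; the lambda keeps the substitute literal too.
            text = re.sub(re.escape(core), lambda _: substitute, text, flags=re.IGNORECASE)
    return text


mapping = {"Mayank Laddha": "John Smith", "India": "Peru"}
# The name is misspelled and lowercased in the text, yet still close enough to match.
masked = fuzzy_replace("I am mayank ladha. I live in India.", mapping)
# Reversing the mapping restores the original entities after the LLM call.
restored = fuzzy_replace(masked, {v: k for k, v in mapping.items()})
print(masked)
print(restored)
```

Like the article's approach, this only looks at the single best window per entity, so a second, differently spelled variant of the same name would be left untouched.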

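This approach is designed for non-critical use cases where perfect anonymity is not mandatory. For example, analyzing reviews or answering a query sent to a chatbot on a website, without saving the data, generally falls under less critical cases. The code is not perfect, but it is good enough to get you started.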

from gliner import GLiNER
from faker import Faker
from faker.providers import job
import google.generativeai as genai
import re
import warnings
from rapidfuzz import process, utils
warnings.filterwarnings("ignore")

genai.configure(api_key="key")  # replace "key" with your own API key
model_llm = genai.GenerativeModel("gemini-1.5-flash-002")
fake = Faker()
fake.add_provider(job)
model_gliner = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Suppose we have this prompt, including context, that we want to anonymize before sending it to the LLM
prompt = """Given the context, answer the question. \n context: Hi, I am Mayank Laddha. I live in India. I love my country. But I would like to go to Singapore once. I am a software developer.\n question: Where does Mayank Laddha want to go?
"""
# Perform entity prediction
labels = ["Person", "Country", "Profession"]
entities = model_gliner.predict_entities(prompt, labels, threshold=0.4)
print(entities)

# Create a replacement dictionary mapping each detected entity to a fake value
replacement = {}
for entity in entities:
    text, label = entity["text"], entity["label"]
    if text in replacement:
        continue
    if "Person" in label:
        candidates = {fake.name() for _ in range(3)}
    elif "Country" in label:
        candidates = {fake.country() for _ in range(10)}
    elif "Profession" in label:
        # Keep only single-word job titles
        candidates = {title for title in (fake.job() for _ in range(20)) if len(title.split()) == 1}
    else:
        continue
    # Never replace an entity with itself
    candidates.discard(text)
    # Keep fake values unique so the dictionary can be reversed later
    candidates -= set(replacement.values())
    # The set can be empty (e.g. no single-word job was generated); skip instead of raising KeyError
    if candidates:
        replacement[text] = candidates.pop()

# Also create a reverse dictionary to restore the originals later
replacement_reversed = {v: k for k, v in replacement.items()}

# Perform the replacement
for k, v in replacement.items():
    # Split the text into a list of words
    words = prompt.split()
    n = len(k.split())
    # Build n-word windows so the key can appear in full among the choices
    choices = [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
    matches = process.extract(k, choices, limit=1, processor=utils.default_process)
    for match in matches:
        if match[1] > 80:
            # Trim surrounding punctuation and escape the rest so it is matched literally
            target = re.sub(r"^\W+|\W+$", "", match[0])
            if target:
                prompt = re.sub(re.escape(target), v, prompt, flags=re.IGNORECASE)

#prompt
response = model_llm.generate_content(prompt)
content = response.text
print("llm response",content)

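# Perform the replacement again, this time restoring the original entities
for k, v in replacement_reversed.items():
    words = content.split()
    n = len(k.split())
    choices = [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
    matches = process.extract(k, choices, limit=1, processor=utils.default_process)
    for match in matches:
        if match[1] > 80:
            # Trim surrounding punctuation and escape the rest so it is matched literally
            target = re.sub(r"^\W+|\W+$", "", match[0])
            if target:
                content = re.sub(re.escape(target), v, content, flags=re.IGNORECASE)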

print("final result", content)
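Restoring the answer depends on the replacement dictionary being reversible: if two entities receive the same fake value, the reversed dictionary silently loses one of them. Below is a minimal sketch of one way to guard against that. The helper `build_replacements`, its `max_tries` parameter, and the stand-in generators are hypothetical; in the setup above the generators would be `fake.name`, `fake.country`, and `fake.job`, and the entity dictionaries mirror the `"text"` and `"label"` keys used by the code above.

```python
from itertools import cycle
from typing import Callable


def build_replacements(
    entities: list[dict],
    generators: dict[str, Callable[[], str]],
    max_tries: int = 20,
) -> dict[str, str]:
    """Map each entity text to a unique substitute from the generator for its label."""
    replacement: dict[str, str] = {}
    used: set[str] = set()
    for entity in entities:
        text, label = entity["text"], entity["label"]
        if text in replacement or label not in generators:
            continue
        for _ in range(max_tries):
            candidate = generators[label]()
            # Reject substitutes equal to the original or already assigned,
            # so the reversed dictionary stays one-to-one.
            if candidate.lower() != text.lower() and candidate not in used:
                replacement[text] = candidate
                used.add(candidate)
                break
    return replacement


# Deterministic stand-in generators (assumption: Faker methods would be used in practice).
names = cycle(["Mayank Laddha", "Jane Doe", "John Roe"])
countries = cycle(["India", "Peru", "Chile"])
generators = {"Person": lambda: next(names), "Country": lambda: next(countries)}

entities = [
    {"text": "Mayank Laddha", "label": "Person"},
    {"text": "India", "label": "Country"},
    {"text": "Singapore", "label": "Country"},
]
replacement = build_replacements(entities, generators)
replacement_reversed = {v: k for k, v in replacement.items()}
print(replacement)
```

An entity that finds no acceptable substitute within `max_tries` is simply left out of the mapping, so callers that need strict coverage should check for missing keys.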
