首頁 >後端開發 >Python教學 >使用 Regex 和 spaCy 屏蔽提示中的機密數據

使用 Regex 和 spaCy 屏蔽提示中的機密數據

Barbara Streisand
Barbara Streisand原創
2024-12-05 04:07:08538瀏覽

Masking confidential data in prompts using Regex and spaCy

人們對 OpenAI、Gemini、Claude 等流行的法學碩士存在隱私問題。除非它是開源模型,否則我們真的不知道螢幕後面發生了什麼。所以,我們必須要小心。

第一件事是處理我們傳遞給法學碩士的資訊。專家建議避免在提示中包含機密資訊或個人識別碼。聽起來更容易,但隨著法學碩士上下文大小的增加,我們可以將大文本傳遞給模型。因此,它可能會變得嚴格審查並掩蓋所有標識符。 

因此,我嘗試建立 python 腳本來偵測和封鎖識別碼和機密資訊。正規表示式很神奇,可以識別不同的機密資訊並用掩碼替換它。也使用 spacy 庫來檢測常見標識符,例如名稱、地點等,

注意:目前,這適用於印度語境,但仍可偵測到通用識別碼。 

那麼讓我們看看實作(我已經在LLM的幫助下實現了)
如果你想跳過解釋。 

這是程式碼庫的連結:aditykris/prompt-masker-Indian-context
導入必要的模組/函式庫

import re 

from typing import Dict, List, Tuple

import spacy

nlp = spacy.load("en_core_web_sm")

您必須使用以下程式碼段手動安裝「en_core_web_sm」

python -m spacy download en_core_web_sm

設定印度共同機密資訊。

class IndianIdentifier:
    '''Regex for common Indian identifiers'''
    PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}'
    AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}'
    INDIAN_PASSPORT = r'[A-PR-WYa-pr-wy][1-9]\d\s?\d{4}[1-9]'
    DRIVING_LICENSE = r'(([A-Z]{2}[0-9]{2})( )|([A-Z]{2}-[0-9]{2}))((19|20)[0-9][0-9])[0-9]{7}'
    UPI_ID = r'[\.\-a-z0-9]+@[a-z]+'
    INDIAN_BANK_ACCOUNT = r'\d{9,18}'
    IFSC_CODE = r'[A-Z]{4}0[A-Z0-9]{6}'
    INDIAN_PHONE_NUMBER = r'(\+91|\+91\-|0)?[789]\d{9}'
    EMAIL = r'[\w\.-]+@[\w\.-]+\.\w+'

    @classmethod
    def get_all_patterns(cls) -> Dict[str, str]:
        """Returns all regex patterns defined in the class"""
        return {
            name: pattern 
            for name, pattern in vars(cls).items() 
            if isinstance(pattern, str) and not name.startswith('_')
        }

所以,我正在修改 python 類別和方法,因此在這裡實現它。 
我從 DebugPointer 中找到了這些標識符的正規表示式,非常有幫助。
現在介紹檢測功能。簡單的 re.finditer() 用於循環不同的模式以查找匹配項。匹配項儲存在清單中。

def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]:
    """
    Find all matches of a pattern in text and return their positions and matched text
    """
    matches = []
    for match in re.finditer(pattern, text):
        matches.append((match.start(), match.end(), match.group()))
    return matches

使用簡單的字典來儲存替換文字。將其包裝在函數中以返回替換文字。

def get_replacement_text(identifier_type: str) -> str:
    """
    Returns appropriate replacement text based on the type of identifier
    """
    replacements = {
        'PAN': '[PAN_NUMBER]',
        'AADHAR': '[AADHAR_NUMBER]',
        'INDIAN_PASSPORT': '[PASSPORT_NUMBER]',
        'DRIVING_LICENSE': '[DL_NUMBER]',
        'UPI_ID': '[UPI_ID]',
        'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]',
        'IFSC_CODE': '[IFSC_CODE]',
        'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]',
        'EMAIL': '[EMAIL_ADDRESS]',
        'PERSON': '[PERSON_NAME]',
        'ORG': '[ORGANIZATION]',
        'GPE': '[LOCATION]'
    }
    return replacements.get(identifier_type, '[MASKED]')

啊!主要部分開始。

def analyze_identifiers(text: str) -> Tuple[str, Dict[str, List[str]]]:
    """
    Function to identify and hide sensitive information.
    Returns:
        - masked_text: Text with all sensitive information masked
        - found_identifiers: Dictionary containing all identified sensitive information
    """
    # Initialize variables
    masked_text = text
    found_identifiers = {}
    positions_to_mask = []

    # First, find all regex matches
    for identifier_name, pattern in IndianIdentifier.get_all_patterns().items():
        matches = find_matches(text, pattern)
        if matches:
            found_identifiers[identifier_name] = [match[2] for match in matches]
            positions_to_mask.extend(
                (start, end, identifier_name) for start, end, _ in matches
            )

    # Then, process named entities using spaCy
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            positions_to_mask.append((ent.start_char, ent.end_char, ent.label_))
            if ent.label_ not in found_identifiers:
                found_identifiers[ent.label_] = []
            found_identifiers[ent.label_].append(ent.text)

    # Sort positions by start index in reverse order to handle overlapping matches
    positions_to_mask.sort(key=lambda x: x[0], reverse=True)

    # Apply masking
    for start, end, identifier_type in positions_to_mask:
        replacement = get_replacement_text(identifier_type)
        masked_text = masked_text[:start] + replacement + masked_text[end:]

    return masked_text, found_identifiers

此函數將提示作為輸入,並將屏蔽的提示與識別的元素一起作為字典傳回。

讓我一一解釋一下。

以下循環透過不同標識符的正規表示式來尋找提示中的匹配項。如果找到,那麼它將:
 1. 將辨識的資訊儲存在字典中,以標識符類型作為鍵來追蹤。
 2. 記下位置並將其儲存在positions_to_mask中,以便我們稍後可以應用遮罩。

import re 

from typing import Dict, List, Tuple

import spacy

nlp = spacy.load("en_core_web_sm")

現在是空閒時間。它是一個很棒的自然語言處理 (nlp) 任務庫。我們可以使用 nlp 模組從文字中提取標識符。
目前,我已經習慣了它檢測姓名、組織和位置。
這與上面的循環相同,用於識別和儲存位置。

class IndianIdentifier:
    '''Regex for common Indian identifiers'''
    PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}'
    AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}'
    INDIAN_PASSPORT = r'[A-PR-WYa-pr-wy][1-9]\d\s?\d{4}[1-9]'
    DRIVING_LICENSE = r'(([A-Z]{2}[0-9]{2})( )|([A-Z]{2}-[0-9]{2}))((19|20)[0-9][0-9])[0-9]{7}'
    UPI_ID = r'[\.\-a-z0-9]+@[a-z]+'
    INDIAN_BANK_ACCOUNT = r'\d{9,18}'
    IFSC_CODE = r'[A-Z]{4}0[A-Z0-9]{6}'
    INDIAN_PHONE_NUMBER = r'(\+91|\+91\-|0)?[789]\d{9}'
    EMAIL = r'[\w\.-]+@[\w\.-]+\.\w+'

    @classmethod
    def get_all_patterns(cls) -> Dict[str, str]:
        """Returns all regex patterns defined in the class"""
        return {
            name: pattern 
            for name, pattern in vars(cls).items() 
            if isinstance(pattern, str) and not name.startswith('_')
        }

在一些測試案例中,我注意到一些掩碼丟失了,這主要是由於標識符重疊造成的。所以,逆序排序有助於解決這個問題。

 

def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]:
    """
    Find all matches of a pattern in text and return their positions and matched text
    """
    matches = []
    for match in re.finditer(pattern, text):
        matches.append((match.start(), match.end(), match.group()))
    return matches

最後,我們使用來自found_identifiers和positions_to_mask的資料來屏蔽發生。

def get_replacement_text(identifier_type: str) -> str:
    """
    Returns appropriate replacement text based on the type of identifier
    """
    replacements = {
        'PAN': '[PAN_NUMBER]',
        'AADHAR': '[AADHAR_NUMBER]',
        'INDIAN_PASSPORT': '[PASSPORT_NUMBER]',
        'DRIVING_LICENSE': '[DL_NUMBER]',
        'UPI_ID': '[UPI_ID]',
        'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]',
        'IFSC_CODE': '[IFSC_CODE]',
        'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]',
        'EMAIL': '[EMAIL_ADDRESS]',
        'PERSON': '[PERSON_NAME]',
        'ORG': '[ORGANIZATION]',
        'GPE': '[LOCATION]'
    }
    return replacements.get(identifier_type, '[MASKED]')

程式的範例輸入為:

輸入:

def analyze_identifiers(text: str) -> Tuple[str, Dict[str, List[str]]]:
    """
    Function to identify and hide sensitive information.
    Returns:
        - masked_text: Text with all sensitive information masked
        - found_identifiers: Dictionary containing all identified sensitive information
    """
    # Initialize variables
    masked_text = text
    found_identifiers = {}
    positions_to_mask = []

    # First, find all regex matches
    for identifier_name, pattern in IndianIdentifier.get_all_patterns().items():
        matches = find_matches(text, pattern)
        if matches:
            found_identifiers[identifier_name] = [match[2] for match in matches]
            positions_to_mask.extend(
                (start, end, identifier_name) for start, end, _ in matches
            )

    # Then, process named entities using spaCy
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            positions_to_mask.append((ent.start_char, ent.end_char, ent.label_))
            if ent.label_ not in found_identifiers:
                found_identifiers[ent.label_] = []
            found_identifiers[ent.label_].append(ent.text)

    # Sort positions by start index in reverse order to handle overlapping matches
    positions_to_mask.sort(key=lambda x: x[0], reverse=True)

    # Apply masking
    for start, end, identifier_type in positions_to_mask:
        replacement = get_replacement_text(identifier_type)
        masked_text = masked_text[:start] + replacement + masked_text[end:]

    return masked_text, found_identifiers

輸出:
蒙版文本:

for identifier_name, pattern in IndianIdentifier.get_all_patterns().items():
        matches = find_matches(text, pattern)
        if matches:
            found_identifiers[identifier_name] = [match[2] for match in matches]
            positions_to_mask.extend(
                (start, end, identifier_name) for start, end, _ in matches
            )

以上是使用 Regex 和 spaCy 屏蔽提示中的機密數據的詳細內容。更多資訊請關注PHP中文網其他相關文章!

陳述:
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn