人們對 OpenAI、Gemini、Claude 等流行的法學碩士存在隱私問題。除非它是開源模型,否則我們真的不知道螢幕後面發生了什麼。所以,我們必須要小心。
第一件事是處理我們傳遞給法學碩士的資訊。專家建議避免在提示中包含機密資訊或個人識別碼。聽起來更容易,但隨著法學碩士上下文大小的增加,我們可以將大文本傳遞給模型。因此,它可能會變得嚴格審查並掩蓋所有標識符。
因此,我嘗試建立 python 腳本來偵測和封鎖識別碼和機密資訊。正規表示式很神奇,可以識別不同的機密資訊並用掩碼替換它。也使用 spacy 庫來檢測常見標識符,例如名稱、地點等,
注意:目前,這適用於印度語境,但仍可偵測到通用識別碼。
那麼讓我們看看實作(我已經在LLM的幫助下實現了)
如果你想跳過解釋。
這是程式碼庫的連結:aditykris/prompt-masker-Indian-context
導入必要的模組/函式庫
import re from typing import Dict, List, Tuple import spacy nlp = spacy.load("en_core_web_sm")
您必須使用以下程式碼段手動安裝「en_core_web_sm」
python -m spacy download en_core_web_sm
設定印度共同機密資訊。
class IndianIdentifier: '''Regex for common Indian identifiers''' PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}' AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}' INDIAN_PASSPORT = r'[A-PR-WYa-pr-wy][1-9]\d\s?\d{4}[1-9]' DRIVING_LICENSE = r'(([A-Z]{2}[0-9]{2})( )|([A-Z]{2}-[0-9]{2}))((19|20)[0-9][0-9])[0-9]{7}' UPI_ID = r'[\.\-a-z0-9]+@[a-z]+' INDIAN_BANK_ACCOUNT = r'\d{9,18}' IFSC_CODE = r'[A-Z]{4}0[A-Z0-9]{6}' INDIAN_PHONE_NUMBER = r'(\+91|\+91\-|0)?[789]\d{9}' EMAIL = r'[\w\.-]+@[\w\.-]+\.\w+' @classmethod def get_all_patterns(cls) -> Dict[str, str]: """Returns all regex patterns defined in the class""" return { name: pattern for name, pattern in vars(cls).items() if isinstance(pattern, str) and not name.startswith('_') }
所以,我正在修改 python 類別和方法,因此在這裡實現它。
我從 DebugPointer 中找到了這些標識符的正規表示式,非常有幫助。
現在介紹檢測功能。簡單的 re.finditer() 用於循環不同的模式以查找匹配項。匹配項儲存在清單中。
def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]: """ Find all matches of a pattern in text and return their positions and matched text """ matches = [] for match in re.finditer(pattern, text): matches.append((match.start(), match.end(), match.group())) return matches
使用簡單的字典來儲存替換文字。將其包裝在函數中以返回替換文字。
def get_replacement_text(identifier_type: str) -> str: """ Returns appropriate replacement text based on the type of identifier """ replacements = { 'PAN': '[PAN_NUMBER]', 'AADHAR': '[AADHAR_NUMBER]', 'INDIAN_PASSPORT': '[PASSPORT_NUMBER]', 'DRIVING_LICENSE': '[DL_NUMBER]', 'UPI_ID': '[UPI_ID]', 'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]', 'IFSC_CODE': '[IFSC_CODE]', 'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]', 'EMAIL': '[EMAIL_ADDRESS]', 'PERSON': '[PERSON_NAME]', 'ORG': '[ORGANIZATION]', 'GPE': '[LOCATION]' } return replacements.get(identifier_type, '[MASKED]')
啊!主要部分開始。
def analyze_identifiers(text: str) -> Tuple[str, Dict[str, List[str]]]: """ Function to identify and hide sensitive information. Returns: - masked_text: Text with all sensitive information masked - found_identifiers: Dictionary containing all identified sensitive information """ # Initialize variables masked_text = text found_identifiers = {} positions_to_mask = [] # First, find all regex matches for identifier_name, pattern in IndianIdentifier.get_all_patterns().items(): matches = find_matches(text, pattern) if matches: found_identifiers[identifier_name] = [match[2] for match in matches] positions_to_mask.extend( (start, end, identifier_name) for start, end, _ in matches ) # Then, process named entities using spaCy doc = nlp(text) for ent in doc.ents: if ent.label_ in ["PERSON", "ORG", "GPE"]: positions_to_mask.append((ent.start_char, ent.end_char, ent.label_)) if ent.label_ not in found_identifiers: found_identifiers[ent.label_] = [] found_identifiers[ent.label_].append(ent.text) # Sort positions by start index in reverse order to handle overlapping matches positions_to_mask.sort(key=lambda x: x[0], reverse=True) # Apply masking for start, end, identifier_type in positions_to_mask: replacement = get_replacement_text(identifier_type) masked_text = masked_text[:start] + replacement + masked_text[end:] return masked_text, found_identifiers
此函數將提示作為輸入,並將屏蔽的提示與識別的元素一起作為字典傳回。
讓我一一解釋一下。
以下循環透過不同標識符的正規表示式來尋找提示中的匹配項。如果找到,那麼它將:
1. 將辨識的資訊儲存在字典中,以標識符類型作為鍵來追蹤。
2. 記下位置並將其儲存在positions_to_mask中,以便我們稍後可以應用遮罩。
import re from typing import Dict, List, Tuple import spacy nlp = spacy.load("en_core_web_sm")
現在是空閒時間。它是一個很棒的自然語言處理 (nlp) 任務庫。我們可以使用 nlp 模組從文字中提取標識符。
目前,我已經習慣了它檢測姓名、組織和位置。
這與上面的循環相同,用於識別和儲存位置。
class IndianIdentifier: '''Regex for common Indian identifiers''' PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}' AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}' INDIAN_PASSPORT = r'[A-PR-WYa-pr-wy][1-9]\d\s?\d{4}[1-9]' DRIVING_LICENSE = r'(([A-Z]{2}[0-9]{2})( )|([A-Z]{2}-[0-9]{2}))((19|20)[0-9][0-9])[0-9]{7}' UPI_ID = r'[\.\-a-z0-9]+@[a-z]+' INDIAN_BANK_ACCOUNT = r'\d{9,18}' IFSC_CODE = r'[A-Z]{4}0[A-Z0-9]{6}' INDIAN_PHONE_NUMBER = r'(\+91|\+91\-|0)?[789]\d{9}' EMAIL = r'[\w\.-]+@[\w\.-]+\.\w+' @classmethod def get_all_patterns(cls) -> Dict[str, str]: """Returns all regex patterns defined in the class""" return { name: pattern for name, pattern in vars(cls).items() if isinstance(pattern, str) and not name.startswith('_') }
在一些測試案例中,我注意到一些掩碼丟失了,這主要是由於標識符重疊造成的。所以,逆序排序有助於解決這個問題。
def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]: """ Find all matches of a pattern in text and return their positions and matched text """ matches = [] for match in re.finditer(pattern, text): matches.append((match.start(), match.end(), match.group())) return matches
最後,我們使用來自found_identifiers和positions_to_mask的資料來屏蔽發生。
def get_replacement_text(identifier_type: str) -> str: """ Returns appropriate replacement text based on the type of identifier """ replacements = { 'PAN': '[PAN_NUMBER]', 'AADHAR': '[AADHAR_NUMBER]', 'INDIAN_PASSPORT': '[PASSPORT_NUMBER]', 'DRIVING_LICENSE': '[DL_NUMBER]', 'UPI_ID': '[UPI_ID]', 'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]', 'IFSC_CODE': '[IFSC_CODE]', 'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]', 'EMAIL': '[EMAIL_ADDRESS]', 'PERSON': '[PERSON_NAME]', 'ORG': '[ORGANIZATION]', 'GPE': '[LOCATION]' } return replacements.get(identifier_type, '[MASKED]')
程式的範例輸入為:
輸入:
def analyze_identifiers(text: str) -> Tuple[str, Dict[str, List[str]]]: """ Function to identify and hide sensitive information. Returns: - masked_text: Text with all sensitive information masked - found_identifiers: Dictionary containing all identified sensitive information """ # Initialize variables masked_text = text found_identifiers = {} positions_to_mask = [] # First, find all regex matches for identifier_name, pattern in IndianIdentifier.get_all_patterns().items(): matches = find_matches(text, pattern) if matches: found_identifiers[identifier_name] = [match[2] for match in matches] positions_to_mask.extend( (start, end, identifier_name) for start, end, _ in matches ) # Then, process named entities using spaCy doc = nlp(text) for ent in doc.ents: if ent.label_ in ["PERSON", "ORG", "GPE"]: positions_to_mask.append((ent.start_char, ent.end_char, ent.label_)) if ent.label_ not in found_identifiers: found_identifiers[ent.label_] = [] found_identifiers[ent.label_].append(ent.text) # Sort positions by start index in reverse order to handle overlapping matches positions_to_mask.sort(key=lambda x: x[0], reverse=True) # Apply masking for start, end, identifier_type in positions_to_mask: replacement = get_replacement_text(identifier_type) masked_text = masked_text[:start] + replacement + masked_text[end:] return masked_text, found_identifiers
輸出:
蒙版文本:
for identifier_name, pattern in IndianIdentifier.get_all_patterns().items(): matches = find_matches(text, pattern) if matches: found_identifiers[identifier_name] = [match[2] for match in matches] positions_to_mask.extend( (start, end, identifier_name) for start, end, _ in matches )
以上是使用 Regex 和 spaCy 屏蔽提示中的機密數據的詳細內容。更多資訊請關注PHP中文網其他相關文章!