Hey everyone!
You know what keeps me up at night? Thinking about how to make our AI systems smarter and more efficient. Today I want to talk about something that sounds basic but is absolutely critical when building robust AI applications: chunking ✨.
Think of chunking as the way your AI breaks large amounts of information into manageable pieces. Just like you wouldn't try to stuff a whole pizza into your mouth at once (or maybe you would, no judgment here!), your AI needs to split long texts into smaller segments to process them effectively.
This matters especially for what we call RAG (Retrieval-Augmented Generation) models. These bad boys don't just make facts up, they actually pull real information from external sources. Pretty neat, right?
Look, if you're building anything that works with text, whether it's a customer-support chatbot or a fancy knowledge-base search, getting chunking right is the difference between an AI that gives accurate answers and one that just gives answers.
Chunks too big? Your model misses the point.
Chunks too small? It gets lost in the details.
First, let's look at a Python example that uses LangChain for semantic chunking:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

def semantic_chunk(file_path):
    # Load the document
    loader = TextLoader(file_path)
    document = loader.load()

    # Create a text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )

    # Split the document into chunks
    chunks = text_splitter.split_documents(document)
    return chunks

# Example usage
chunks = semantic_chunk('knowledge_base.txt')
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.page_content[:50]}...")
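A quick note on the design: RecursiveCharacterTextSplitter tries its separators in order, so it first splits on paragraph breaks, then falls back to single newlines, spaces, and finally individual characters. And that chunk_overlap of 200 characters means neighbouring chunks share some context, which helps retrieval when an answer straddles a chunk boundary.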
Now let's build something real: a serverless knowledge base using AWS CDK and Node.js!
First, the CDK infrastructure (this is where the magic happens):
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as opensearch from 'aws-cdk-lib/aws-opensearch';

export class KnowledgeBaseStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // S3 bucket to store our documents
    const documentBucket = new s3.Bucket(this, 'DocumentBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // OpenSearch domain for storing our chunks
    const openSearchDomain = new opensearch.Domain(this, 'DocumentSearch', {
      version: opensearch.EngineVersion.OPENSEARCH_2_5,
      capacity: {
        dataNodes: 1,
        dataNodeInstanceType: 't3.small.search',
      },
      ebs: {
        volumeSize: 10,
      },
    });

    // Lambda function for processing documents
    const processorFunction = new lambda.Function(this, 'ProcessorFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      environment: {
        OPENSEARCH_DOMAIN: openSearchDomain.domainEndpoint,
      },
      timeout: cdk.Duration.minutes(5),
    });

    // Grant permissions
    documentBucket.grantRead(processorFunction);
    openSearchDomain.grantWrite(processorFunction);
  }
}
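One thing the stack above doesn't wire up is the trigger itself: the Lambda expects S3 events, but nothing tells the bucket to invoke it. Here's a minimal sketch of how that connection could look using CDK's aws-s3-notifications module; it would go inside the constructor, after the permission grants:

import * as s3n from 'aws-cdk-lib/aws-s3-notifications';

// Invoke the processor Lambda whenever a new document lands in the bucket
documentBucket.addEventNotification(
  s3.EventType.OBJECT_CREATED,
  new s3n.LambdaDestination(processorFunction),
);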
Now, the Lambda function that does the chunking and indexing:
import { S3Event } from 'aws-lambda';
import { S3 } from 'aws-sdk';
import { Client } from '@opensearch-project/opensearch';
import { defaultProvider } from '@aws-sdk/credential-provider-node';
import { AwsSigv4Signer } from '@opensearch-project/opensearch/aws';

const s3 = new S3();
const CHUNK_SIZE = 1000;
const CHUNK_OVERLAP = 200;

// Create OpenSearch client
const client = new Client({
  ...AwsSigv4Signer({
    region: process.env.AWS_REGION,
    service: 'es',
    getCredentials: () => {
      const credentialsProvider = defaultProvider();
      return credentialsProvider();
    },
  }),
  node: `https://${process.env.OPENSEARCH_DOMAIN}`,
});

export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    // Get the document from S3
    const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const text = Body.toString('utf-8');

    // Chunk the document
    const chunks = chunkText(text);

    // Index chunks in OpenSearch
    for (const [index, chunk] of chunks.entries()) {
      await client.index({
        index: 'knowledge-base',
        body: {
          content: chunk,
          documentKey: key,
          chunkIndex: index,
          timestamp: new Date().toISOString(),
        },
      });
    }
  }
};

function chunkText(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + CHUNK_SIZE, text.length);
    let chunk = text.slice(start, end);

    // Try to break at a sentence boundary
    const lastPeriod = chunk.lastIndexOf('.');
    if (lastPeriod !== -1 && lastPeriod !== chunk.length - 1) {
      chunk = chunk.slice(0, lastPeriod + 1);
    }

    chunks.push(chunk);
    // Step forward, keeping an overlap with the previous chunk
    start = Math.max(start + chunk.length - CHUNK_OVERLAP, start + 1);
  }

  return chunks;
}
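By the way, the handler assumes the knowledge-base index already exists (OpenSearch will create it with dynamic mappings on the first write). If you'd rather control the field types explicitly, a one-time setup call along these lines could do it; the mapping shown here is my own assumption, not something the stack requires:

// Hypothetical one-time setup: create the index with explicit field mappings
async function ensureIndex() {
  const exists = await client.indices.exists({ index: 'knowledge-base' });
  if (!exists.body) {
    await client.indices.create({
      index: 'knowledge-base',
      body: {
        mappings: {
          properties: {
            content: { type: 'text' },        // full-text searchable chunk body
            documentKey: { type: 'keyword' }, // exact-match source document key
            chunkIndex: { type: 'integer' },
            timestamp: { type: 'date' },
          },
        },
      },
    });
  }
}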
And here's a quick example of how to query this knowledge base:
async function queryKnowledgeBase(query: string) {
  const response = await client.search({
    index: 'knowledge-base',
    body: {
      query: {
        multi_match: {
          query: query,
          fields: ['content'],
        },
      },
    },
  });

  return response.body.hits.hits.map(hit => ({
    content: hit._source.content,
    documentKey: hit._source.documentKey,
    score: hit._score,
  }));
}
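Putting it to work is as simple as this (the question string is just an example):

// Example usage: fetch the top-scoring chunks for a question
const results = await queryKnowledgeBase('How do I configure chunk overlap?');
for (const result of results) {
  console.log(`[${result.score.toFixed(2)}] ${result.documentKey}: ${result.content.slice(0, 80)}...`);
}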
Using AWS services like S3, Lambda, and OpenSearch lets us:

- Store documents durably in S3 without worrying about capacity
- Process and chunk them on demand with Lambda, no servers to manage
- Search the indexed chunks at scale with OpenSearch
- Pay only for what we actually use
And there you have it, folks! A real-world example of implementing chunking in a serverless knowledge base. The best part? It scales automatically and can handle documents of any size.
Remember, the keys to good chunking are:

- A chunk size that fits your content (not too big, not too small)
- Enough overlap between chunks so context isn't lost at the boundaries
- Splitting at natural boundaries like sentences and paragraphs whenever possible
What's your experience with building knowledge bases? Have you experimented with different chunking strategies? Let me know in the comments below!