Building a tiny vector store from scratch

王林
2024-08-27 06:34:02

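As the generative AI landscape evolves, vector databases are playing a crucial role in powering generative AI applications. Many are available today: open-source options such as Chroma and Milvus, and widely used proprietary ones such as Pinecone and SingleStore. You can read a detailed comparison of different vector databases on this site.

But have you ever wondered how these vector databases work behind the scenes?

One of the best ways to learn something is to understand how it works internally. In this article, we will build a tiny in-memory vector store, "Pixie", from scratch in Python, with NumPy as the store's only dependency (the embedding model is supplied separately).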


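Before diving into the code, let's briefly look at what a vector store is.

What is a vector store?

A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (often text, but also images, audio, and so on) that capture semantic meaning in a high-dimensional space. The key capability of a vector store is efficient similarity search: finding the most relevant data points based on their vector representations. Vector stores can be used for many tasks, such as:

  1. Semantic search
  2. Retrieval-augmented generation (RAG)
  3. Recommendation systems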
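
To make the idea of "similar meaning, nearby vectors" concrete, here is a minimal sketch with hand-written 3-dimensional vectors. The words and numbers are made up for illustration; a real embedding model would produce its own values across hundreds of dimensions.

```python
import numpy as np

# Hypothetical, hand-written "embeddings" (real models emit hundreds of dimensions).
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.3, 0.0]),
    "spaceship": np.array([0.0, 0.2, 0.9]),
}


def cosine(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(word):
    # Return the other word whose vector points in the most similar direction.
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


print(nearest("cat"))   # kitten
```

A similarity search in a vector store is the same lookup, run against every stored embedding at once.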

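Let's code

In this article, we will build a tiny in-memory vector store called "Pixie". It won't have all the optimizations of a production-grade system, but it demonstrates the core concepts. Pixie has two main capabilities:

  1. Storing document embeddings
  2. Performing similarity search

Setting up the vector store

First, we create a class called Pixie.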

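from typing import Optional

import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity  # defined later in this article


class Pixie:
    def __init__(self, embedder: SentenceTransformer) -> None:
        # Document embeddings, one row per document; None until from_docs() runs.
        self.store: Optional[np.ndarray] = None
        # Original documents, kept so search results can return their text.
        self.docs: Optional[np.ndarray] = None
        # Model that turns documents and queries into vectors.
        self.embedder: SentenceTransformer = embedder

  1. First, we import numpy for efficient numerical operations and for storing multi-dimensional arrays.
  2. We also import SentenceTransformer from the sentence_transformers library. We use SentenceTransformer to generate embeddings, but any embedding model that converts text into vectors would work. This article focuses on the vector store itself, not on embedding generation.
  3. Next, we initialize the Pixie class with an embedder. The embedder could live outside the vector store, but for simplicity we keep it inside the vector store class.
  4. self.store holds the document embeddings as a NumPy array.
  5. self.embedder holds the embedding model used to convert documents and queries into vectors.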
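
Since the store only ever calls encode on the embedder, any object that offers a compatible encode method could stand in for SentenceTransformer. The sketch below shows that interface with a toy embedder; KeywordEmbedder and VOCAB are hypothetical names invented for this example, and counting keywords captures no real semantics.

```python
import numpy as np

# Hypothetical fixed vocabulary; a real model learns its own representation.
VOCAB = ("fleet", "colonies", "alien", "signal", "peace", "galaxy")


class KeywordEmbedder:
    """Toy stand-in for an embedding model: counts vocabulary words in the text."""

    def _embed(self, text):
        words = str(text).lower().split()
        return np.array([words.count(w) for w in VOCAB], dtype=float)

    def encode(self, texts):
        # Convention assumed here: one string gives a 1-D vector,
        # a sequence of strings gives a 2-D array with one row per text.
        if isinstance(texts, str):
            return self._embed(texts)
        return np.array([self._embed(t) for t in texts])


embedder = KeywordEmbedder()
print(embedder.encode("alien signal"))                   # [0. 0. 1. 1. 0. 0.]
print(embedder.encode(["alien fleet", "peace"]).shape)   # (2, 6)
```

A toy embedder like this is handy for testing the store's logic without downloading a model.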

Ingesting documents

To ingest documents into the vector store, we implement a from_docs method.

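    def from_docs(self, docs):
        # Keep the raw documents so search results can return their text.
        self.docs = np.array(docs)
        # Encode every document; the result has one embedding row per document.
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"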

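This method does a few key things:

  1. Takes a list of documents and stores them as a NumPy array in self.docs.
  2. Uses the embedder model to convert each document into a vector embedding. These embeddings are stored in self.store.
  3. Returns a message confirming the number of documents ingested.

Here the embedder's encode method does the heavy lifting, converting each text document into a high-dimensional vector representation.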
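
To see what ingestion leaves in memory, the sketch below splits a text on blank lines (the same chunking the example script later in this article uses) and checks that the store ends up with one row per document. fake_encode is a hypothetical placeholder for the embedder, and the sample text is abbreviated for illustration.

```python
import numpy as np


def fake_encode(texts, dim=4):
    # Hypothetical embedder: NOT semantic, it only derives numbers from text length
    # so that the array shapes can be inspected without downloading a model.
    return np.array([[len(t) % (i + 2) for i in range(dim)] for t in texts], dtype=float)


text = "Chapter 1\nThe Warning\n\nChapter 3\nThe Invasion\n\nChapter 4\nTurning the Tide"
docs = np.array(text.split("\n\n"))   # one document per blank-line-separated block
store = fake_encode(docs)             # one embedding row per document

print(docs.shape)    # (3,)
print(store.shape)   # (3, 4)
```

Row i of the store always corresponds to docs[i], which is what lets the search step map indices back to text.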

Performing similarity search

The heart of our vector store is the similarity search function.

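    def similarity_search(self, query, top_k=3):
        matches = list()
        # Embed the query with the same model used for the documents.
        q_embedding = self.embedder.encode(query)
        # Indices of the top_k documents most similar to the query.
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches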

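Let's break this down:

  1. We start by creating an empty list called matches to hold the results.
  2. We encode the user's query with the same embedder model used to ingest the documents. This ensures the query vector lives in the same space as the document vectors.
  3. We call the cosine_similarity function (defined next) to find the most similar documents.
  4. We use the returned indices to fetch the actual documents from self.docs.
  5. Finally, we return the list of matching documents.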
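
Because self.docs is a NumPy array, the lookup loop could also be written as a single indexing operation. This is an optional variation rather than what Pixie does; the documents and indices below are made up.

```python
import numpy as np

docs = np.array(["doc a", "doc b", "doc c", "doc d"])
top_k_indices = np.array([2, 0])   # made-up output of the similarity ranking

# Equivalent to appending docs[i] for each i in top_k_indices, in the same order.
matches = docs[top_k_indices].tolist()
print(matches)   # ['doc c', 'doc a']
```

Calling tolist() also converts the NumPy string scalars into plain Python strings.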

Implementing cosine similarity

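import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    # Dot product of the query with every stored embedding (one value per row).
    dot_product = np.dot(store_embeddings, query_embedding)
    # Euclidean norm of each stored embedding, and of the query.
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    # argsort is ascending, so reverse it to put the most similar first.
    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices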

This function does a few important things:

  1. It calculates the cosine similarity using the formula: cos(θ) = (A · B) / (||A|| * ||B||)
  2. First, we calculate the dot product between the query embedding and all document embeddings in the store.
  3. Then, we compute the magnitudes (Euclidean norms) of all vectors.
  4. Lastly, we sort the similarities and return the indices of the top-k most similar documents.

We use cosine similarity because it measures the angle between vectors and ignores their magnitudes. Two documents therefore count as similar when their embeddings point in the same direction, whatever the length of those vectors.

There are other similarity metrics that you can explore, such as:

  1. Euclidean distance
  2. Dot product similarity

You can read more about cosine similarity here.
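
A small worked example may help show what "ignoring magnitude" means in practice. The vectors below are arbitrary values chosen for easy arithmetic, not real embeddings.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction, ten times longer
c = np.array([3.0, -1.5, 0.0])    # perpendicular to a (dot product is 0)


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(round(cosine(a, b), 6))        # 1.0: length does not matter
print(round(cosine(a, c), 6))        # 0.0: perpendicular directions

# The two alternative metrics DO react to length:
print(np.dot(a, a), np.dot(a, b))    # 14.0 140.0
print(np.linalg.norm(a - b) > 0)     # True, although a and b point the same way
```

If every embedding is scaled to unit length first, the dot product and cosine similarity give identical scores.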

Piecing everything together

Now that we have built all the pieces, let's understand how they work together:

  1. When we create a Pixie instance, we provide it with an embedding model.
  2. When we ingest documents, we create vector embeddings for each document and store them in self.store.
  3. For a similarity search:
    1. We create an embedding for the query.
    2. We calculate cosine similarity between the query embedding and all document embeddings.
    3. We return the most similar documents.

All the magic happens inside the cosine similarity calculation. Because the embedding model places texts with similar meaning close together, comparing the angle between vectors rather than their magnitude lets us find semantically similar documents even if they use different words or phrasing.
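
The whole flow can be exercised end to end without downloading a model. The sketch below is a condensed restatement of the class from this article, not the repository code: MiniPixie, KeywordEmbedder, cosine_top_k and VOCAB are hypothetical names, and the keyword-counting embedder only matches exact words.

```python
import numpy as np

VOCAB = ("fleet", "colonies", "alien", "signal", "peace", "galaxy")  # hypothetical


class KeywordEmbedder:
    """Toy embedder (an assumption for this sketch): counts vocabulary words."""

    def _embed(self, text):
        words = str(text).lower().split()
        return np.array([words.count(w) for w in VOCAB], dtype=float)

    def encode(self, texts):
        if isinstance(texts, str):
            return self._embed(texts)
        return np.array([self._embed(t) for t in texts])


def cosine_top_k(store, query, top_k):
    # Cosine similarity of the query against every row, best matches first.
    scores = store @ query / (np.linalg.norm(store, axis=1) * np.linalg.norm(query))
    return np.argsort(scores)[::-1][:top_k]


class MiniPixie:
    """Condensed restatement of the Pixie class from this article."""

    def __init__(self, embedder):
        self.embedder = embedder
        self.docs = None
        self.store = None

    def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)   # step 2: embed and store

    def similarity_search(self, query, top_k=1):
        q = self.embedder.encode(query)                # step 3.1: embed the query
        idx = cosine_top_k(self.store, q, top_k)       # step 3.2: rank by cosine
        return self.docs[idx].tolist()                 # step 3.3: return documents


pixie = MiniPixie(KeywordEmbedder())                   # step 1: supply an embedder
pixie.from_docs([
    "the fleet defends the colonies",
    "the admiral decoded the alien signal",
    "peace returned to the galaxy",
])
results = pixie.similarity_search("who sent the alien signal")
print(results)   # ['the admiral decoded the alien signal']
```

Note that this toy embedder yields a zero vector, and so a division by zero, for any text containing none of the vocabulary words; a real embedding model does not have that limitation.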

Seeing it in action

Now let's implement a simple RAG system on top of our Pixie vector store. We'll ingest a short story about a space battle and an alien invasion, then ask questions about it to see how the model generates an answer.

import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
from sentence_transformers import SentenceTransformer

current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie


# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt") as f:
    content = f.read()
    # ingesting the data into vector store
    ingested = pixie.from_docs(docs=content.split("\n\n"))
    print(ingested)

# prompt template (sent to the model as the user message)
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context. 
If you don't find any answer in the given context then just say 'I don't have answer for that'. 
In the final answer, do not add "according to the context or as per the context". 
You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")
    if len(query) == 0:
        # Empty input: ask again instead of exiting.
        print("Ask a question to continue...")
        continue

    if query == "/bye":
        break

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))

Here is the output:

Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies.Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.
----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.
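
The step that turns retrieved documents into a prompt can be pulled out into a small function, which makes it testable without a running model. In the sketch below, build_prompt and TEMPLATE are hypothetical names, the template is a shortened stand-in for PROMPT, and documents are joined with blank lines rather than the commas used in the script.

```python
# Shortened, hypothetical version of the PROMPT template used in the script above.
TEMPLATE = (
    "Answer the question using only the context below.\n"
    "CONTEXT: {0}\n"
    "QUESTION: {1}\n"
    "ANSWER HERE:"
)


def build_prompt(matches, query):
    # Join retrieved documents with blank lines so chapter boundaries stay visible.
    context = "\n\n".join(matches)
    return TEMPLATE.format(context, query)


prompt = build_prompt(
    ["Chapter 3: The Invasion", "Epilogue: A New Dawn"],
    "What was the invasion about?",
)
print(prompt)
```

The resulting string is what would be passed to generate_answer in place of PROMPT.format(context, query).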

We have successfully built a tiny in-memory vector store from scratch using Python and NumPy. While it is very basic, it demonstrates core concepts such as vector storage and similarity search. Production-grade vector stores are much more optimized and feature-rich.

GitHub repo: Pixie

Happy coding, and may your vectors always point in the right direction!
