Building a tiny vector store from scratch

王林
2024-08-27 06:34:02

As the generative AI landscape evolves, vector databases play a crucial role in powering generative AI applications. Many open-source vector databases are available today, such as Chroma and Milvus, alongside popular proprietary ones such as Pinecone and SingleStore. You can read a detailed comparison of the different vector databases on this site.

But have you ever wondered how these vector databases work behind the scenes?

A great way to learn something is to understand how it works under the hood. In this article, we will build a tiny in-memory vector store, "Pixie", from scratch in Python. The store itself depends only on NumPy; the embedding model comes from a separate library.


Before diving into the code, let's briefly discuss what a vector store is.

What is a vector store?

A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (often text, but also images, audio, and so on) that capture semantic meaning in a high-dimensional space. The key feature of a vector store is its ability to perform efficient similarity searches, finding the most relevant data points based on their vector representations. Vector stores can be used in many tasks, such as:

  1. Semantic search
  2. Retrieval-augmented generation (RAG)
  3. Recommendation systems
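
To make the idea of a similarity search concrete, here is a minimal sketch using hand-written 3-dimensional vectors. The labels and numbers are invented for illustration; a real embedding model would produce vectors with hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy "embeddings": the values are made up for illustration only.
labels = ["cat", "kitten", "spaceship"]
vectors = np.array([
    [0.9, 0.1, 0.0],  # "cat"
    [0.8, 0.2, 0.1],  # "kitten"
    [0.0, 0.1, 0.9],  # "spaceship"
])

# Pretend this is the embedding of a query about felines.
query = np.array([0.9, 0.05, 0.0])

# Cosine similarity between the query and every stored vector.
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Rank the labels from most to least similar.
ranked = [labels[i] for i in np.argsort(scores)[::-1]]
print(ranked)
```

The two animal vectors point in nearly the same direction as the query, so they rank ahead of the unrelated "spaceship" vector.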

Let's code

In this article, we will build a tiny in-memory vector store called "Pixie". Although it lacks the optimizations of a production system, it demonstrates the core concepts. Pixie will have two main features:

  1. Storing document embeddings
  2. Performing similarity searches

Setting up the vector store

First, we create a class called Pixie:

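from typing import Optional

import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity


class Pixie:
    def __init__(self, embedder: SentenceTransformer) -> None:
        # Document embeddings, one row per document; None until from_docs() runs.
        self.store: Optional[np.ndarray] = None
        # Model used to embed both documents and queries.
        self.embedder: SentenceTransformer = embedder
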
  1. First, we import numpy for efficient numerical operations and storage of multidimensional arrays.
  2. We also import SentenceTransformer from the sentence_transformers library. We use SentenceTransformer to generate embeddings, but you can use any embedding model that converts text to vectors. This article focuses on the vector store itself, not on embedding generation.
  3. Next, we initialize the Pixie class with an embedder. The embedder could live entirely outside the vector store, but for simplicity we pass it in and keep a reference to it on the vector store class.
  4. self.store will hold our document embeddings as a NumPy array.
  5. self.embedder will hold the embedding model we use to convert documents and queries into vectors.
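
Since the store only calls encode on its embedder, any object with a compatible encode method can stand in for SentenceTransformer. The sketch below is a hypothetical HashingEmbedder (not part of Pixie or of sentence_transformers) that counts hashed words. It captures word overlap rather than meaning, so it is only useful for experimenting without downloading a model.

```python
import hashlib

import numpy as np


class HashingEmbedder:
    """Hypothetical stand-in for SentenceTransformer: exposes encode() only."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def _embed_one(self, text: str) -> np.ndarray:
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in text.lower().split():
            # MD5 is used only as a stable hash, so a token always maps to the same bucket.
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        return vec

    def encode(self, texts):
        # A single string gives a 1-D vector; a list or array of strings
        # gives a 2-D array with one row per text.
        if isinstance(texts, str):
            return self._embed_one(texts)
        return np.stack([self._embed_one(str(t)) for t in texts])


embedder = HashingEmbedder(dim=16)
single = embedder.encode("hello world")
batch = embedder.encode(["the fleet moved", "the krell arrived"])
print(single.shape, batch.shape)
```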

Ingesting documents

To ingest documents/data into our vector store, we implement the from_docs method:

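    def from_docs(self, docs):
        # Keep the raw documents so search results can be mapped back to text.
        self.docs = np.array(docs)
        # Embed all documents at once; the result has one row per document.
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"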

This method does a few key things:

  1. It takes a list of documents and stores them as a NumPy array in self.docs.
  2. It uses the embedding model to convert each document into a vector embedding. These embeddings are stored in self.store.
  3. It returns a message confirming the number of documents ingested.

The encode method of our embedder does the heavy lifting here, converting each text document into a high-dimensional vector representation.
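
The sketch below illustrates what ingestion produces: a 2-D array with one row per document, aligned by index with the document array. StubEmbedder is a hypothetical placeholder that turns each text into four simple character counts; it is not a meaningful embedding, it only makes the shapes visible.

```python
import numpy as np


class StubEmbedder:
    """Hypothetical embedder: four character statistics per text."""

    def encode(self, texts):
        return np.array(
            [[len(t), t.count(" "), t.count("e"), t.count("a")] for t in texts],
            dtype=float,
        )


docs = np.array(["The fleet moved.", "Krell ships arrived.", "Peace returned."])
store = StubEmbedder().encode(docs)

# One row per document, one column per embedding dimension.
print(store.shape)
# Row i of the store is the embedding of docs[i].
print(docs[0], store[0])
```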

Performing a similarity search

The heart of our vector store is the similarity search function:

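    def similarity_search(self, query, top_k=3):
        matches = list()
        # Embed the query with the same model used for the documents.
        q_embedding = self.embedder.encode(query)
        # Indices of the top_k most similar documents, best match first.
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches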

Let's break this down:

  1. We start by creating an empty list called matches to store our results.
  2. We encode the user's query with the same embedding model we used to ingest the documents. This ensures the query vector lives in the same space as our document vectors.
  3. We call a cosine_similarity function (which we define next) to find the most similar documents.
  4. We use the returned indices to fetch the actual documents from self.docs.
  5. Finally, we return the list of matching documents.
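
Because self.docs is a NumPy array, the index-to-document step can also be written with integer-array indexing instead of a loop. This is an optional variation, not the article's code; the index values below are hypothetical.

```python
import numpy as np

docs = np.array(["doc A", "doc B", "doc C", "doc D"])

# Hypothetical result of cosine_similarity: best match first.
top_k_indices = np.array([2, 0])

# Equivalent to looping over the indices and appending docs[i] one at a time.
matches = docs[top_k_indices].tolist()
print(matches)  # ['doc C', 'doc A']
```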

Implementing cosine similarity

import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices

This function does several important things:

  1. It calculates the cosine similarity using the formula: cos(θ) = (A · B) / (||A|| * ||B||)
  2. First, we calculate the dot product between the query embedding and every document embedding in the store.
  3. Then, we compute the magnitudes (Euclidean norms) of all vectors.
  4. Lastly, we sort the similarities and return the indices of the top-k most similar documents.

We use cosine similarity because it measures the angle between vectors and ignores their magnitudes. This means it can find semantically similar documents regardless of their length.

There are other similarity metrics that you can explore, such as:

  1. Euclidean distance
  2. Dot product similarity

You can read more about cosine similarity here.
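
A small worked example shows how the three metrics can disagree. The vectors are arbitrary illustrative values: b points in the same direction as a but is ten times longer, while c has the same length as a but points elsewhere.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                     # same direction as a, ten times longer
c = np.array([3.0, 2.0, 1.0])  # same length as a, different direction


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean(u, v):
    return float(np.linalg.norm(u - v))


# Cosine ignores magnitude: b is a perfect match, c is not.
print("cosine   ", cosine(a, b), cosine(a, c))
# Euclidean distance is sensitive to magnitude: c is closer to a than b is.
print("euclidean", euclidean(a, b), euclidean(a, c))
# The raw dot product grows with magnitude, so it favours the longer vector b.
print("dot      ", float(np.dot(a, b)), float(np.dot(a, c)))
```

Cosine similarity treats a and b as identical in direction, whereas Euclidean distance ranks c as the nearer neighbour.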

Piecing everything together

Now that we have built all the pieces, let's understand how they work together:

  1. When we create a Pixie instance, we provide it with an embedding model.
  2. When we ingest documents, we create vector embeddings for each document and store them in self.store.
  3. For a similarity search:
    1. We create an embedding for the query.
    2. We calculate cosine similarity between the query embeddings and all document embeddings.
    3. We return the most similar documents.

The core of the search is the cosine similarity calculation. Because the embedding model places texts with similar meaning close together, comparing the angle between vectors rather than their magnitude lets us find semantically similar documents even if they use different words or phrasing.
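
The flow described above can be sketched end to end in a single file. To keep it runnable without downloading a model, this version swaps in a hypothetical KeywordEmbedder that counts words from a made-up VOCAB list. It matches on shared keywords only, so it does not reproduce the semantic behaviour of a real embedding model.

```python
import numpy as np

VOCAB = ["krell", "fleet", "peace", "weapon", "earth"]  # hypothetical toy vocabulary


class KeywordEmbedder:
    """Hypothetical stand-in for a real embedding model: counts vocabulary words."""

    def _one(self, text):
        words = text.lower().split()
        return np.array([words.count(w) for w in VOCAB], dtype=float)

    def encode(self, texts):
        if isinstance(texts, str):
            return self._one(texts)
        return np.stack([self._one(str(t)) for t in texts])


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitudes = np.linalg.norm(store_embeddings, axis=1) * np.linalg.norm(query_embedding)
    similarity = dot_product / magnitudes
    return np.argsort(similarity)[::-1][:top_k]


class Pixie:
    def __init__(self, embedder):
        self.store = None
        self.docs = None
        self.embedder = embedder

    def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"

    def similarity_search(self, query, top_k=3):
        q_embedding = self.embedder.encode(query)
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        return [str(self.docs[i]) for i in top_k_indices]


pixie = Pixie(KeywordEmbedder())
status = pixie.from_docs([
    "the krell fleet attacked earth",
    "peace returned to earth",
    "the weapon destroyed the krell fleet",
])
results = pixie.similarity_search("krell weapon", top_k=2)
print(status)
print(results)
```

Note that every document and the query contain at least one vocabulary word here; an all-zero vector would cause a division by zero in the similarity calculation.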

Seeing it in action

Now let's implement a simple RAG system using our Pixie vector store. We'll ingest a story about a space battle and an alien invasion, then ask questions about it to see how the system generates answers.

import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
import numpy as np
from sentence_transformers import SentenceTransformer

current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie


# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt") as f:
    content = f.read()
    # ingesting the data into vector store
    ingested = pixie.from_docs(docs=content.split("\n\n"))
    print(ingested)

# prompt template (sent to the model as a user message)
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context. 
If you don't find any answer in the given context then just say 'I don't have answer for that'. 
In the final answer, do not add "according to the context or as per the context". 
You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")
    if len(query.strip()) == 0:
        # Empty input: ask again instead of exiting
        print("Ask a question to continue...")
        continue

    if query == "/bye":
        quit()

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))

    continue

Here is the output:

Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies.Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.
----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.
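
The script above chunks the story by splitting on blank lines. One thing to watch with that approach is that consecutive blank lines or a trailing newline yield empty chunks, which would then be embedded like any other document. The helper below, split_into_chunks, is a hypothetical addition that filters them out; whether the example file actually contains such gaps is not known from the article.

```python
def split_into_chunks(text: str) -> list[str]:
    """Split on blank lines and drop empty or whitespace-only chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


# Made-up sample with extra blank lines and a trailing blank line.
sample = "Chapter 1\nIt began with a whisper.\n\n\n\nChapter 2\nThe Krell arrived.\n\n"

raw_parts = sample.split("\n\n")
chunks = split_into_chunks(sample)

print(len(raw_parts), len(chunks))
```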

We have successfully built a tiny in-memory vector store from scratch using Python and NumPy. While it is very basic, it demonstrates the core concepts of vector storage and similarity search. Production-grade vector stores are much more optimized and feature-rich.
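
As a small taste of the kind of optimization meant here, the sketch below shows two common tweaks to a brute-force search: normalizing the stored vectors once at ingestion time, and using np.argpartition so that only the top-k scores are sorted. It uses random data and is an illustration under those assumptions, not a description of how any particular production store works.

```python
import numpy as np

rng = np.random.default_rng(0)
store = rng.normal(size=(1000, 32))  # random stand-in for document embeddings
query = rng.normal(size=32)          # random stand-in for a query embedding

# Normalize once at ingestion time...
unit_store = store / np.linalg.norm(store, axis=1, keepdims=True)

# ...so each query needs a single matrix-vector product to get cosine scores.
unit_query = query / np.linalg.norm(query)
scores = unit_store @ unit_query

top_k = 5
# argpartition selects the k largest scores without sorting the whole array.
candidates = np.argpartition(scores, -top_k)[-top_k:]
# Sort only those k candidates, best match first.
top_k_indices = candidates[np.argsort(scores[candidates])[::-1]]
print(top_k_indices)
```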

GitHub repo: Pixie

Happy coding, and may your vectors always point in the right direction!
