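As generative AI continues to evolve, vector databases play a crucial role in powering generative AI applications. Many vector databases are available today, including open-source ones such as Chroma and Milvus and popular proprietary ones such as Pinecone and SingleStore. You can read a detailed comparison of different vector databases on this site.
But have you ever wondered how these vector databases work behind the scenes?
A great way to learn something is to understand how it works under the hood. In this article, we will build a tiny in-memory vector store, "Pixie", from scratch in Python. The store itself depends only on NumPy; the embeddings come from a sentence-transformers model.
Before diving into the code, let's briefly discuss what a vector store is.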
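A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (usually text, but also images, audio, and so on) that capture semantic meaning in a high-dimensional space. The key feature of a vector store is efficient similarity search: finding the most relevant data points based on their vector representations. Vector stores are used for many tasks, such as semantic search and retrieval-augmented generation (RAG).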
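To make "similar embeddings" concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The words, the numbers, and the `cosine` helper are invented for illustration; real embeddings come from a model and typically have hundreds of dimensions.

```python
import numpy as np

# Hypothetical, hand-made "embeddings"; real ones are produced by an embedding model.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.1]),
    "rocket": np.array([0.0, 0.1, 0.9]),
}


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


cat_vs_kitten = cosine(embeddings["cat"], embeddings["kitten"])
cat_vs_rocket = cosine(embeddings["cat"], embeddings["rocket"])

# Related items should score close to 1, unrelated items close to 0.
print(f"cat vs kitten: {cat_vs_kitten:.3f}")
print(f"cat vs rocket: {cat_vs_rocket:.3f}")
```

A similarity search is this comparison repeated between one query vector and every stored vector, keeping the best scores.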
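In this article, we will create a tiny in-memory vector store called "Pixie". It won't have all the optimizations of a production-grade system, but it will demonstrate the core concepts. Pixie has two main functions: ingesting documents (from_docs) and searching them by similarity (similarity_search).
First, we create a class called Pixie: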
import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity


class Pixie:
    def __init__(self, embedder) -> None:
        self.store: np.ndarray = None
        self.embedder: SentenceTransformer = embedder
To ingest documents/data into our vector store, we implement the from_docs method:
    def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"
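This method does a few key things: it converts the input documents into a NumPy array and keeps it in self.docs, it uses the embedder to encode every document into an embedding and keeps the resulting matrix (one row per document) in self.store, and it returns a short message reporting how many documents were ingested.
The heart of our vector store is the similarity search function: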
    def similarity_search(self, query, top_k=3):
        matches = list()
        q_embedding = self.embedder.encode(query)
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches
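Let's break it down: the query is encoded with the same embedder used for the documents, cosine_similarity returns the indices of the top_k stored embeddings closest to the query embedding, and the documents at those indices are collected and returned. The cosine_similarity helper, imported from helpers, looks like this: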
import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices
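This function does several important things: it computes the dot product between every stored embedding and the query embedding, computes the magnitudes (L2 norms) of the stored embeddings and of the query, divides the dot products by the product of the magnitudes to get the cosine similarity for each document, and finally sorts the scores and returns the indices of the top_k most similar embeddings in descending order of similarity.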
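A small worked example may help to check the behaviour of this helper. The sketch below repeats the same logic in compact form and runs it on four invented 2-dimensional vectors, chosen so that the ranking is easy to verify by hand.

```python
import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    # Same logic as the helper above: score every stored row against the query.
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)
    similarity = dot_product / (magnitude_a * magnitude_b)
    # argsort is ascending, so reverse it to put the best match first.
    return np.argsort(similarity)[::-1][:top_k]


# Toy data (invented for illustration): four stored vectors and one query.
store = np.array([
    [1.0, 0.0],   # index 0: same direction as the query, similarity 1
    [0.0, 1.0],   # index 1: orthogonal to the query, similarity 0
    [1.0, 1.0],   # index 2: 45 degrees from the query, similarity ~0.707
    [-1.0, 0.0],  # index 3: opposite direction, similarity -1
])
query = np.array([2.0, 0.0])

top = cosine_similarity(store, query, top_k=2)
print(top.tolist())  # [0, 2]
```

Note that the query's length (2.0 rather than 1.0) does not affect the ranking, because cosine similarity only depends on direction.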
You can read more about cosine similarity here.
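Now that we have built all the pieces, let's understand how they work together: a Pixie instance is created with an embedder; from_docs encodes the documents and keeps both the texts and their embeddings in memory; similarity_search encodes the query with the same embedder, scores it against every stored embedding with cosine_similarity, and returns the top_k most similar documents.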
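As a rough end-to-end sketch of that flow, the example below runs without downloading a model. `ToyEmbedder` is a hypothetical stand-in for SentenceTransformer that simply counts words from a fixed vocabulary, and the `Pixie` class is a compact copy of the one in this article with the similarity helper inlined. The documents and vocabulary are made up.

```python
import re

import numpy as np


class ToyEmbedder:
    """Hypothetical stand-in for SentenceTransformer: counts vocabulary words."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def _embed(self, text):
        words = re.findall(r"[a-z]+", str(text).lower())
        return [float(words.count(term)) for term in self.vocabulary]

    def encode(self, texts):
        # A single string (a query) gives a 1-D vector; a sequence gives a 2-D matrix.
        if isinstance(texts, str):
            return np.array(self._embed(texts))
        return np.array([self._embed(t) for t in texts])


class Pixie:
    """Compact copy of the class from this article, with the helper inlined."""

    def __init__(self, embedder):
        self.embedder = embedder
        self.docs = None
        self.store = None

    def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"

    def similarity_search(self, query, top_k=3):
        q = self.embedder.encode(query)
        norms = np.linalg.norm(self.store, axis=1) * np.linalg.norm(q)
        scores = self.store @ q / norms
        return [str(self.docs[i]) for i in np.argsort(scores)[::-1][:top_k]]


# Every document contains at least one vocabulary word, so no embedding is all zeros.
docs = [
    "The fleet defended the colony",
    "The alien signal reached the fleet",
    "Peace returned to the colony",
]
pixie = Pixie(ToyEmbedder(["fleet", "colony", "alien", "signal", "peace"]))
status = pixie.from_docs(docs)
matches = pixie.similarity_search("Where did the alien signal come from?", top_k=1)
print(status)
print(matches)
```

Swapping `ToyEmbedder` for a real embedding model changes the quality of the matches, not the flow: ingest once, then encode each query and rank the stored rows.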
Now let's implement a simple RAG system using our Pixie vector store. We'll ingest a short story about a "space battle & alien invasion" and then ask questions about it to see how it generates an answer.
import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
import numpy as np
from sentence_transformers import SentenceTransformer

# make the project root importable so that `pixie` can be found
current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie

# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt", encoding="utf-8") as f:
    content = f.read()

# ingesting the data into vector store
ingested = pixie.from_docs(docs=content.split("\n\n"))
print(ingested)

# system prompt
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context.
    If you don't find any answer in the given context then just say 'I don't have answer for that'.
    In the final answer, do not add "according to the context or as per the context".
    You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")

    # an empty question is not an exit request: prompt again
    if len(query) == 0:
        print("Ask a question to continue...")
        continue

    if query == "/bye":
        quit()

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))
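The step that turns retrieved documents into a prompt can be isolated so that it is easy to inspect without a running model. In the sketch below, `build_prompt` and the shortened `TEMPLATE` are hypothetical names introduced for illustration; they are not part of Pixie or of the script above.

```python
# Shortened, hypothetical template; the script above uses a longer PROMPT.
TEMPLATE = "CONTEXT: {0}\nQUESTION: {1}\nANSWER HERE:"


def build_prompt(matched_docs, query, template=TEMPLATE):
    """Join the retrieved documents into one context string and fill the template."""
    context = ",".join(str(doc) for doc in matched_docs)
    return template.format(context, query)


prompt = build_prompt(["The Krell arrived.", "The fleet scrambled."], "Who arrived?")
print(prompt)
```

The resulting string is what would be passed to the language model as the user message, so printing it is a quick way to see exactly which context the model receives.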
Here is the output:
Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies. Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.

----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.
We have successfully built a tiny in-memory vector store from scratch using Python and NumPy. While it is very basic, it demonstrates core concepts such as vector storage and similarity search. Production-grade vector stores are far more optimized and feature-rich.
GitHub repo: Pixie
Happy coding, and may your vectors always point in the right direction!