
Choosing the Embedding Model That Best Fits Your Data: A Comparative Test of OpenAI and Open-Source Multilingual Embeddings

2024-02-26 18:10:15

OpenAI recently announced its latest generation of embedding models, embedding v3, which it describes as its most performant embedding models, with higher multilingual performance. The models come in two sizes: the smaller text-embedding-3-small and the larger, more powerful text-embedding-3-large.


Little information has been disclosed about how these models are designed and trained, and they are accessible only through a paid API. Many open-source embedding models exist as alternatives. But how do these open-source models compare with OpenAI's closed-source models?

This article will empirically compare the performance of these new models with open source models. We plan to build a data retrieval workflow where the key task is to find the most relevant documents from the corpus based on the user's query.

Our corpus is the European Artificial Intelligence Act, which is currently in the validation phase. The Act is the world's first legal framework on artificial intelligence, and the corpus is unusual in that it is available in 24 languages. This lets us compare retrieval accuracy across different languages.


We will create a custom synthetic question/answer dataset from this multilingual text corpus and use it to compare the accuracy of OpenAI's models with that of state-of-the-art open-source embedding models. We share the full code, since the approach can easily be adapted to other corpora.

Generate a custom Q/A data set

We start by creating a custom question/answer (Q/A) dataset. The advantage is that the dataset cannot have been part of any model's training data, which avoids the bias that can affect public benchmarks such as MTEB. A custom dataset also lets us tailor the evaluation to a specific corpus, which matters for applications such as retrieval-augmented generation (RAG).

We follow the simple process suggested in the Llama Index documentation. First, the corpus is split into chunks. Then, for each chunk, a large language model (LLM) generates a set of synthetic questions whose answers lie in that chunk.


Implementing this strategy with an LLM data framework such as Llama Index is straightforward, as the code below shows.

from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter

language = "EN"
url_doc = "https://eur-lex.europa.eu/legal-content/"+language+"/TXT/HTML/?uri=CELEX:52021PC0206"

documents = SimpleWebPageReader(html_to_text=True).load_data([url_doc])

parser = SentenceSplitter(chunk_size=1000)
nodes = parser.get_nodes_from_documents(documents, show_progress=True)

The corpus is the English version of the EU Artificial Intelligence Act, fetched directly from the web at the official URL above. We use the draft version from April 2021, because the final version is not yet available in all European languages. With this version, replacing the language code in the URL with any of the other 23 official EU languages retrieves the text in that language (BG for Bulgarian, ES for Spanish, CS for Czech, etc.).
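The language substitution can be sketched as follows. This is only an illustration: the helper name build_act_url and the constants are assumptions introduced here, not part of the article's code, and the sketch only builds the URL strings without fetching anything.

```python
# Hypothetical helper (not from the article): build the EUR-Lex URL of the
# April 2021 draft for a given two-letter language code.
BASE_URL = "https://eur-lex.europa.eu/legal-content/"
DOC_SUFFIX = "/TXT/HTML/?uri=CELEX:52021PC0206"


def build_act_url(language: str) -> str:
    """Return the URL of the draft AI Act for one language code, e.g. 'EN'."""
    return BASE_URL + language.upper() + DOC_SUFFIX


# The four languages evaluated later in the article.
languages = ["EN", "FR", "CS", "HU"]
urls = {language: build_act_url(language) for language in languages}

for language, url in urls.items():
    print(language, url)
```

Each URL could then be passed to SimpleWebPageReader in place of url_doc to build one corpus per language.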


A SentenceSplitter object splits the document into chunks of 1000 tokens each. For English, this yields about 100 chunks. Each chunk is then supplied as context to the following prompt (the default prompt suggested in the Llama Index library):

prompts={}
prompts["EN"] = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge, generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."
"""

This prompt generates questions about a document chunk. The number of questions per chunk is passed as the parameter "num_questions_per_chunk", which we set to 2. The questions are then generated by calling generate_qa_embedding_pairs from the Llama Index library:

from llama_index.llms import OpenAI
from llama_index.legacy.finetuning import generate_qa_embedding_pairs

qa_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo-0125", additional_kwargs={'seed': 42}),
    nodes=nodes,
    qa_generate_prompt_tmpl=prompts[language],
    num_questions_per_chunk=2,
)

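We rely on OpenAI's GPT-3.5-turbo-0125 for this task. The resulting object 'qa_dataset' contains the question and answer (chunk) pairs. As an example of the generated questions, here are the first two (for which the "answer" is the first chunk of text):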

  1. What are the main objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) according to the explanatory memorandum?
  2. How does the proposal for a Regulation on artificial intelligence aim to address the risks associated with the use of AI while promoting the uptake of AI in the European Union, as outlined in the context information?


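The evaluation function also follows the Llama Index documentation. First, the embeddings of all answers (document chunks) are stored in a VectorStoreIndex for efficient retrieval. The function then loops over all queries, retrieves the top k most similar documents, and scores retrieval accuracy with MRR (Mean Reciprocal Rank). The code is as follows: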

import numpy as np
from tqdm import tqdm
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

def evaluate(dataset, embed_model, insert_batch_size=1000, top_k=5):
    # Get corpus, queries, and relevant documents from the qa_dataset object
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Create TextNode objects for each document in the corpus and create a
    # VectorStoreIndex to efficiently store and retrieve embeddings
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(nodes, embed_model=embed_model, insert_batch_size=insert_batch_size)
    retriever = index.as_retriever(similarity_top_k=top_k)

    # Prepare to collect evaluation results
    eval_results = []

    # Iterate over each query in the dataset to evaluate retrieval performance
    for query_id, query in tqdm(queries.items()):
        # Retrieve the top_k most similar documents for the current query
        # and extract the IDs of the retrieved documents
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]

        # Check if the expected document was among the retrieved documents
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc per query

        # Compute the reciprocal rank for this query (0 on a miss) and append to results
        if is_hit:
            rank = retrieved_ids.index(expected_id) + 1
            mrr = 1 / rank
        else:
            mrr = 0
        eval_results.append(mrr)

    # Return the average MRR across all queries as the final evaluation metric
    return np.average(eval_results)
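To make the metric concrete, here is a minimal, self-contained sketch of the MRR computation on made-up retrieval results. The chunk IDs and the helper names reciprocal_rank and mean_reciprocal_rank are illustrative assumptions. As in the evaluation function, it assumes a single relevant document per query.

```python
def reciprocal_rank(retrieved_ids, expected_id):
    """Return 1/rank of expected_id in retrieved_ids, or 0.0 if it is absent."""
    if expected_id in retrieved_ids:
        return 1.0 / (retrieved_ids.index(expected_id) + 1)
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal ranks over (retrieved_ids, expected_id) pairs."""
    scores = [reciprocal_rank(ids, expected) for ids, expected in runs]
    return sum(scores) / len(scores)


# Made-up retrieval results for three queries (top_k = 5).
toy_runs = [
    (["c1", "c7", "c3", "c9", "c2"], "c1"),  # hit at rank 1 -> 1.0
    (["c4", "c5", "c8", "c6", "c0"], "c6"),  # hit at rank 4 -> 0.25
    (["c2", "c3", "c4", "c5", "c6"], "c9"),  # miss          -> 0.0
]

print(mean_reciprocal_rank(toy_runs))
```

The score rewards placing the expected chunk near the top: a hit at rank 1 counts four times as much as a hit at rank 4, and a miss within the top k counts as zero.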

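The embedding model is passed to the evaluation function through the 'embed_model' argument. For OpenAI models, this is an OpenAIEmbedding object initialized with the model name and the model dimensions.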

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                              dimensions=model_spec['dimensions'])
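As a rough illustration of what a reduced dimension setting means, the sketch below truncates a vector and rescales it to unit length, which is the manual shortening procedure OpenAI's embeddings guide describes for the v3 models. The function name shorten_embedding and the toy vector are assumptions for illustration. In practice the API's dimensions parameter does this server-side, and this sketch is not claimed to reproduce its output exactly.

```python
import math


def shorten_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        # Nothing to rescale; return the truncated (all-zero) vector as is.
        return truncated
    return [x / norm for x in truncated]


# Toy 4-dimensional unit vector standing in for a real embedding.
full = [0.6, 0.0, 0.8, 0.0]
short = shorten_embedding(full, 2)
print(short)
```

Shortening discards the information carried by the dropped components, which is why a 256-dimensional setting can trade accuracy for storage and speed.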







import pandas as pd
from llama_index.legacy.finetuning import EmbeddingQAFinetuneDataset

embeddings_model_spec = {}

embeddings_model_spec['OAI-Large-256'] = {'model_name': 'text-embedding-3-large', 'dimensions': 256}
embeddings_model_spec['OAI-Large-3072'] = {'model_name': 'text-embedding-3-large', 'dimensions': 3072}
embeddings_model_spec['OAI-Small'] = {'model_name': 'text-embedding-3-small', 'dimensions': 1536}
embeddings_model_spec['OAI-ada-002'] = {'model_name': 'text-embedding-ada-002', 'dimensions': None}

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all languages
for language in languages:

    # Load dataset
    file_name = language+"_dataset.json"
    qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

    # Loop through all models
    for model_name, model_spec in embeddings_model_spec.items():

        # Get model
        embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                                      dimensions=model_spec['dimensions'])

        # Assess embedding score (in terms of MRR)
        score = evaluate(qa_dataset, embed_model)

        results.append([language, model_name, score])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR"])




Open-source research on embeddings is also very active, and the MTEB leaderboard hosted on Hugging Face regularly lists the latest embedding models.






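nomic-embed-text-v1 (Nomic-Embed): designed by Nomic, this model is reported to outperform OpenAI Ada-002 and text-embedding-3-small while being only 0.55 GB in size. It is the first fully reproducible and auditable model of its kind (open data and open-source training code).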


import time

import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer

embeddings_model_spec = {}

embeddings_model_spec['E5-mistral-7b'] = {'model_name': 'intfloat/e5-mistral-7b-instruct', 'max_length': 32768, 'pooling_type': 'last_token', 'normalize': True, 'batch_size': 1, 'kwargs': {'load_in_4bit': True, 'bnb_4bit_compute_dtype': torch.float16}}
embeddings_model_spec['ML-E5-large'] = {'model_name': 'intfloat/multilingual-e5-large', 'max_length': 512, 'pooling_type': 'mean', 'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['BGE-M3'] = {'model_name': 'BAAI/bge-m3', 'max_length': 8192, 'pooling_type': 'cls', 'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['Nomic-Embed'] = {'model_name': 'nomic-ai/nomic-embed-text-v1', 'max_length': 8192, 'pooling_type': 'mean', 'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'trust_remote_code': True}}

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all models
for model_name, model_spec in embeddings_model_spec.items():

    print("Processing model : "+str(model_spec))

    # Get model
    tokenizer = AutoTokenizer.from_pretrained(model_spec['model_name'])
    embed_model = AutoModel.from_pretrained(model_spec['model_name'], **model_spec['kwargs'])

    if model_name == "Nomic-Embed":
        embed_model.to('cuda')

    # Loop through all languages
    for language in languages:

        # Load dataset
        file_name = language+"_dataset.json"
        qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

        start_time_assessment = time.time()

        # Assess embedding score (in terms of MRR). Note: this call uses a variant
        # of evaluate that takes a tokenizer and pooling settings; it is not shown here.
        score = evaluate(qa_dataset, tokenizer, embed_model, model_spec['normalize'], model_spec['max_length'], model_spec['pooling_type'])

        # Get duration of score assessment
        duration_assessment = time.time()-start_time_assessment

        results.append([language, model_name, score, duration_assessment])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR", "Duration"])
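The 'pooling_type' values in the model specifications name the way per-token vectors are collapsed into a single embedding. The evaluation code that consumes them is not reproduced in this text, so the following is only a plain-Python sketch of what 'cls', 'last_token', and 'mean' commonly mean. The function pool and the toy data are assumptions, not the article's implementation.

```python
def pool(token_vectors, attention_mask, pooling_type):
    """Collapse per-token vectors into one vector, ignoring padded positions."""
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if pooling_type == "cls":
        # Use the first real token (the [CLS]-style token).
        return list(kept[0])
    if pooling_type == "last_token":
        # Use the last real token, as decoder-style models typically do.
        return list(kept[-1])
    if pooling_type == "mean":
        # Average each dimension over the real tokens.
        dim = len(kept[0])
        return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    raise ValueError(f"unknown pooling_type: {pooling_type}")


# Toy sequence: three real tokens followed by one padding position.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

for pooling_type in ("cls", "last_token", "mean"):
    print(pooling_type, pool(token_vectors, attention_mask, pooling_type))
```

When 'normalize' is True, the pooled vector would additionally be scaled to unit length before similarities are computed.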



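BGE-M3 performs best, followed by ML-E5-Large, E5-mistral-7b, and Nomic-Embed. BGE-M3 has not yet been benchmarked on the MTEB leaderboard, and our results suggest that it may rank higher than the other models. Although BGE-M3 is optimized for multilingual data, it also outperforms the other models on English.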




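OpenAI's large (3072), small, and ada models perform very similarly. Reducing the embedding size of the large model to 256 degrades performance, and the result is not better than ada, contrary to what OpenAI claims.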








