Gemini Embedding: Multilingual text embedding model under Google Gemini AI framework
Word embeddings are crucial for natural language processing (NLP) tasks in Hindi, such as machine translation, question answering, and information retrieval. These embeddings capture the semantic properties of words, enabling more accurate and context-aware NLP applications. Given the large number of Hindi speakers and the growing volume of Hindi-language content, high-quality embeddings are critical to improving NLP performance in the language. Customized embeddings can address the distinctive linguistic characteristics and resource limitations of the Indic language family. The newly released Gemini Embedding model represents a significant advancement in multilingual text embedding, leveraging Google's Gemini AI framework to achieve state-of-the-art performance in over 100 languages.
The Gemini Embedding model performs well on tasks such as classification, retrieval, and semantic search. By supporting longer inputs and higher-dimensional outputs, Gemini Embedding provides richer text representations, which makes it suitable for a wide variety of applications.
*This article is published as part of the Data Science Blog Marathon.*
In March 2025, Google released a new experimental Gemini Embedding text model (gemini-embedding-exp-03-07) that can be used in the Gemini API.
This embedding model is derived from the Gemini model and is said to inherit Gemini's understanding of linguistic nuance and subtle context. At the time of writing, it ranks first on the MTEB Multilingual leaderboard.
Gemini Embedding represents text as dense vectors where text inputs with similar semantics are mapped to vectors in vector space that are close to each other. Currently, it supports over 100 languages, and its embedding can be used for a variety of tasks such as retrieval and classification.
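To make "close to each other in vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are hypothetical toy values, not real Gemini outputs, and cosine similarity is one common closeness measure assumed here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real Gemini vectors have up to 3072 dimensions
query = [0.9, 0.1, 0.0]
similar_doc = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(query, similar_doc)
sim_far = cosine_similarity(query, unrelated_doc)
print(sim_close > sim_far)  # the semantically similar text scores higher
```

A retrieval system ranks documents by a score of this kind, so texts with similar meaning surface first.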
Gemini Embedding is built on the Transformer architecture and initialized from the Gemini LLM, which gives the model a deep understanding of language structure and semantics. The model uses a bidirectional attention mechanism to process input sequences, so it can take into account the full context of a word or phrase when generating an embedding.
Loss function: The Gemini Embedding model is trained using a noise-contrastive estimation (NCE) loss with in-batch negatives. The exact loss varies slightly depending on the training phase. In general, a training example consists of a query, a positive target, and (optionally) a hard negative target.
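The idea of in-batch negatives can be sketched as follows. This is an illustrative softmax-style contrastive loss, not the exact Gemini Embedding objective: the dot-product scoring, the temperature value, and the omission of hard negatives are all assumptions made to keep the example small.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, positives, temperature=0.05):
    """Mean softmax cross-entropy where positives[i] is the target for queries[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy unit-length vectors (hypothetical values)
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each positive matches its query
swapped = [[0.0, 1.0], [1.0, 0.0]]  # positives paired with the wrong query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query scores highest against its own positive and large when another item in the batch scores higher, which is what pushes matching pairs together during training.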
We compare Hindi document retrieval using the newly released Gemini embeddings against Jina AI embeddings and Multilingual-e5-large embeddings. As the following table shows, Gemini and Jina AI embeddings both accept a high maximum number of tokens, allowing them to handle long documents or complex queries. Gemini embeddings also have a higher embedding dimension, which can capture more detailed and nuanced semantic relationships between words.
Model | Number of parameters | Embedding dimensions | Maximum tokens | Number of languages | Matryoshka embedding |
gemini-embedding-exp-03-07 | unknown | 3072 | 8192 | 100 | Supports truncation of embeddings to various sizes, such as 2048, 1024, 512, 256, and 128 dimensions |
jinaai/jina-embeddings-v3 | 572 million | 1024 | 8194 | 100 | Supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), allowing truncated embeddings to fit your application |
multilingual-e5-large-instruct | 560 million | 1024 | 514 | 94 | NA |
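Truncating an embedding to a smaller size, as the last column of the table describes, amounts to keeping the leading components of the vector. The sketch below also rescales the shortened vector to unit length, which is a common practice assumed here rather than something the source specifies; the 8-dimensional vector is a hypothetical stand-in.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]  # hypothetical 8-dim embedding
small = truncate_embedding(full, 4)
print(len(small))
```

Smaller vectors reduce storage and speed up similarity search, at the cost of some representational detail.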
The following hands-on tutorial runs this comparison step by step, starting with Gemini embeddings and then repeating the retrieval with Jina AI and Multilingual-e5-large embeddings.
Step 1. Install the necessary libraries
<code>!pip install langchain-community
!pip install chromadb
# Assumed: the google-genai SDK provides the `client` used in Step 4
!pip install google-genai</code>
Step 2. Load the data
We use a Hindi-language web page to evaluate the performance of Gemini embeddings on Hindi retrieval.
<code>from langchain_community.document_loaders import WebBaseLoader

# Download the page and parse it into LangChain Document objects
loader = WebBaseLoader("https://ckbirlahospitals.com/rbh/blog/pregnancy-early-symptoms-in-hindi")
data = loader.load()</code>
Step 3. Chunk the data
The following code uses RecursiveCharacterTextSplitter to split a large text document into chunks of up to 500 characters with no overlap. It applies this split to the data variable and stores the result in all_splits. Because of the rate limits of the Gemini Embedding API, we use only the first 10 splits.
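<code>from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of at most 500 characters, with no overlap between neighbors
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

# Keep only the first 10 chunks to stay within the API rate limits
all_splits = all_splits[:10]</code>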
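For intuition about non-overlapping chunks, here is a deliberately simplified fixed-size splitter. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraph and line boundaries); the function name and the sample text are hypothetical.

```python
def split_fixed(text, chunk_size=500):
    """Cut text into consecutive, non-overlapping pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "क" * 1200  # hypothetical stand-in for the page text
chunks = split_fixed(sample)
print([len(c) for c in chunks])  # [500, 500, 200]
```

Note that Python counts Unicode code points, so a naive character split like this one can separate a Devanagari consonant from its vowel sign. Splitting on separators, as the LangChain splitter tries to do first, makes that less likely.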
Step 4. Store the data in the vector database
We first create a class called "GeminiEmbeddingFunction", which queries the Gemini Embedding API and returns the embedding values for the input text. We then create a function called "create_chroma_db" to create a collection in ChromaDB that stores the documents along with their embeddings.
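<code>import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
from google import genai

# Assumption: the original snippet never defines `client`; a google-genai
# client is assumed here. Replace the placeholder with your own API key.
client = genai.Client(api_key="YOUR_API_KEY")

class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        response = client.models.embed_content(
            model="gemini-embedding-exp-03-07",
            contents=input,
        )
        # Return one vector per input document
        return [e.values for e in response.embeddings]

def create_chroma_db(documents, name):
    chroma_client = chromadb.Client()
    db = chroma_client.create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    # Add chunks one at a time; each call triggers one embedding request
    for i, d in enumerate(documents):
        db.add(documents=[d.page_content], ids=[str(i)])
    return db

db = create_chroma_db(all_splits, "datab")</code>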
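Because the Gemini Embedding API is rate limited, embedding calls can fail when too many are sent in a short time. One common way to cope, sketched here as an assumption rather than something the original tutorial does, is to retry with exponential backoff. The helper `call_with_backoff` and the stand-in `flaky_embed` are hypothetical names used only for this illustration.

```python
import time

def call_with_backoff(fn, *args, retries=5, base_delay=1.0):
    """Call fn(*args); on an exception, wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:  # real code should catch the SDK's specific rate-limit error
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical stand-in for an embedding call that is rate limited twice
calls = {"count": 0}

def flaky_embed(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return [0.0, 1.0]

vector = call_with_backoff(flaky_embed, "नमस्ते", base_delay=0.01)
print(calls["count"])  # 3: two failures, then success
```

Wrapping the embedding request in a helper like this would let more than 10 chunks be indexed, at the cost of longer run time.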
Step 5. Query the database
<code>def get_relevant_passage(query, db):
    # Return the text of the single closest chunk
    passage = db.query(query_texts=[query], n_results=1)['documents'][0][0]
    return passage

passage = get_relevant_passage("आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए?", db)
print(passage)</code>
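Conceptually, the query step embeds the question and returns the stored chunk whose vector is nearest to it. The brute-force sketch below illustrates that idea with toy two-dimensional vectors; it is not ChromaDB's implementation, and the cosine distance, the function names, and the data are assumptions for illustration.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest_passages(query_vec, passages, vectors, n_results=1):
    """Return the n_results passages whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(passages, vectors),
        key=lambda pv: cosine_distance(query_vec, pv[1]),
    )
    return [p for p, _ in ranked[:n_results]]

# Hypothetical stored chunks and their toy embeddings
passages = ["passage A", "passage B", "passage C"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

best = nearest_passages([0.1, 0.9], passages, vectors)
print(best)  # ['passage B']
```

With only 10 chunks a linear scan like this is sufficient; vector databases use index structures to keep the search fast on much larger collections.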
Step 6. Compare with Jina AI Embedding
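The following code uses a Hugging Face Transformers model to define a custom embedding function, which tokenizes the input text and mean-pools the token representations into a single embedding per text. To use it, pass CustomHuggingFace() as the embedding_function when creating the Chroma collection, as with GeminiEmbeddingFunction in Step 4.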
<code>import torch
from transformers import AutoTokenizer, AutoModel
from chromadb import EmbeddingFunction

# jina-embeddings-v3 ships custom model code, so trust_remote_code=True is needed to load it
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

# The model returns one hidden state per token, so we aggregate them into one vector per document
def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then divide the sum by the number of real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

class CustomHuggingFace(EmbeddingFunction):
    # Chroma expects the parameter to be named `input`
    def __call__(self, input):
        batch_dict = tokenizer(input, max_length=512, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**batch_dict)
        embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings.tolist()</code>
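The pooling step can be checked by hand on a tiny example. The following pure-Python sketch mirrors the masked averaging for a single sequence; the token vectors and mask are hypothetical, and the function name is introduced only for this illustration.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dims = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would dominate the average, which is why the attention mask is applied before summing.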
Query
<code>def get_relevant_passage(query, db):
    # Return the text of the single closest chunk
    passage = db.query(query_texts=[query], n_results=1)['documents'][0][0]
    return passage

passage = get_relevant_passage("आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए?", db)
print(passage)</code>
To use Multilingual-e5-large embeddings instead, we simply replace the tokenizer and model with "intfloat/multilingual-e5-large-instruct".
Question number | Query | Gemini Embedding | jinaai/jina-embeddings-v3 | intfloat/multilingual-e5-large-instruct |
1 | आपको प्रेगनेंसी टेस्ट कब करवाना चाहिए? | If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test? (Incorrect) | If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test? (Incorrect) | If you want to learn more about the early symptoms of pregnancy, this blog post is perfect for you. When should you have a pregnancy test? (Incorrect) |
2 | Pregnancy के kuch symbols क्या होते हैं? | What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post. (Correct) | Signs of pregnancy: Complete information on early symptoms! Home Quick Consultation Patient Login Contact Us: 08062136530 Emergency Phone: 07340054470 Open the main menu to serve patients and visitors International Patients About Us Make an appointment to call back WhatsApp to learn about the early symptoms of pregnancy. Obstetrics and Gynecology / Author: Dr. CP Dadhich / Release Date: February 6, 2025 Contents When should you have a pregnancy test? What are the early symptoms of pregnancy? Early symptoms of pregnancy Pregnancy (Incorrect) | What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post. (Correct) |
3 | गर्भावस्था के दौरान एंटीबायोटिक दवा लेने से कब बचा हिए? | During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as (Correct) | During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as (Correct) | What every woman should know. For any pregnancy-related questions, we recommend that you contact our gynecologist to eliminate all complications. (Incorrect) |
4 | कब गर्भावस्था में एंटीबायोटिक दवा लेने से बचाया जाए? | During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as (Correct) | During the first few days of pregnancy, eggs and sperm are fertilized, causing symptoms such as bleeding and abdominal pain. During this period, for a healthy pregnancy, women are advised to avoid taking antibiotics, as this can be dangerous to mothers and babies. Early symptoms of pregnancy are not always delayed menstruation or vomiting. In addition, other symptoms may occur and require special attention, such as (Correct) | What every woman should know. For any pregnancy-related questions, we recommend that you contact our gynecologist to eliminate all complications. (Incorrect) |
5 | गर्भधारण का सबसे पहला सामान्य लक्षण क्या है? | Delayed menstruation: This is the earliest and most common symptom of pregnancy. Confirmation of pregnancy based solely on this symptom is not entirely correct. However, if menstruation is delayed for one week or more, pregnancy tests are recommended. Breast changes: During pregnancy, the breasts will swell, become tender or change in color. It mainly changes in the size and color of the nipple (areola). (Correct) | With this in mind, how to confirm pregnancy? How to take care of the first month of pregnancy? How to do pregnancy checkups? How should I sit during pregnancy? Should sex occur during pregnancy? What fruits should you eat during pregnancy? How much water should you drink during pregnancy? The joy of becoming a mother is the greatest happiness in the world. During pregnancy, women undergo many physical and psychological changes. You call these changes early symptoms of pregnancy (Incorrect) | What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post. (Correct) |
6 | गर्भधारण के पहले संकेत क्या होते हैं? | Signs of pregnancy: Complete information on early symptoms! Home Quick Consultation Patient Login Contact Us: 08062136530 Emergency Phone: 07340054470 Open the main menu to serve patients and visitors International Patients About Us Make an appointment to call back WhatsApp to learn about the early symptoms of pregnancy. Obstetrics and Gynecology / Author: Dr. CP Dadhich / Release Date: February 6, 2025 Contents When should you have a pregnancy test? What are the early symptoms of pregnancy? Early symptoms of pregnancy Pregnancy (Incorrect) | With this in mind, how to confirm pregnancy? How to take care of the first month of pregnancy? How to do pregnancy checkups? How should I sit during pregnancy? Should sex occur during pregnancy? What fruits should you eat during pregnancy? How much water should you drink during pregnancy? The joy of becoming a mother is the greatest happiness in the world. During pregnancy, women undergo many physical and psychological changes. You call these changes early symptoms of pregnancy (Incorrect) | What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post. (Correct) |
7 | गर्भावस्था की पुष्टि के लिए कौन से हार्मोन का पता लगाना होता है? | The best time to have a pregnancy test is after menstruation is delayed by at least 7 days. You can use the home pregnancy testing tool to detect hCG levels at home. During pregnancy, the levels of this hormone will increase significantly. One thing you need to note is that premature testing can also lead to wrong results, so if your period is delayed and the test is negative, it is recommended that you wait at least 3 more days before you test again. (Correct) | There is also a correct way to do this, which you can also see on the test tool manual. To get accurate results, you should use the first urine in the morning, as the correct level of hCG hormone can be measured. Also, if you experience early symptoms of pregnancy and the test results are negative, see your doctor for a blood test immediately. In any case, you must consult a doctor if you have any questions. (Correct) | What are the early symptoms of pregnancy? During pregnancy, many hormonal changes occur in women. Early symptoms of pregnancy include nausea, vomiting, frequent urination and fatigue, which we will discuss in this blog post. (Incorrect) |
As the Hindi outputs above show, Gemini embeddings returned the correct passage for 5 of the 7 queries, while Jina AI embeddings and Multilingual-e5-large each returned only 3 correct passages.
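The counts above can be tallied directly from the judgments in the results table. The short sketch below encodes each judgment as 1 (correct) or 0 (incorrect), in query order, and reports the share of correct retrievals per model; the dictionary layout is just one convenient way to hold the labels.

```python
# 1 = retrieved passage judged correct, 0 = judged incorrect (queries 1 to 7)
judgments = {
    "gemini-embedding-exp-03-07": [0, 1, 1, 1, 1, 0, 1],
    "jinaai/jina-embeddings-v3": [0, 0, 1, 1, 0, 0, 1],
    "intfloat/multilingual-e5-large-instruct": [0, 1, 0, 0, 1, 1, 0],
}

accuracy = {model: sum(marks) / len(marks) for model, marks in judgments.items()}

for model, acc in accuracy.items():
    correct = sum(judgments[model])
    print(f"{model}: {correct}/{len(judgments[model])} correct ({acc:.0%})")
```

With only seven queries drawn from a single page, these percentages are indicative rather than statistically robust.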
In this small test, Gemini embeddings handled Hindi retrieval better than the other two embedding models, which is consistent with their ranking on the MTEB benchmark.
In short, Gemini Embedding represents a significant advancement in multilingual NLP, especially for Indic languages such as Hindi. With its strong multilingual capabilities, support for long inputs, and strong results on benchmarks such as MTEB, Gemini Embedding is well suited to tasks such as retrieval, classification, and semantic search. In our practical comparison, it retrieved the correct passage more often than the other models, making it a valuable tool for multilingual NLP.
The media shown in this article are not owned by Analytics Vidhya and can be used at the discretion of the author.
Q1. What is the Gemini Embedding model? A: The Gemini Embedding model is based on Google's Gemini AI and provides multilingual text embeddings for more than 100 languages, including Hindi.
Q2. What is unique about Gemini Embedding compared to other models? A: Gemini Embedding offers broad multilingual support, accepts inputs of up to 8,192 tokens, and outputs 3,072-dimensional embeddings, which makes it effective for classification, retrieval, and semantic search.
Q3. How does Gemini Embedding perform in multilingual tasks? A: Gemini Embedding performs well both in high-resource languages such as English and in low-resource languages such as Assamese and Macedonian. It ranks first on the MTEB Multilingual leaderboard, demonstrating its multilingual capabilities.
Q4. What is the architecture of the Gemini Embedding model? A: The model is initialized from the Gemini LLM and uses a Transformer architecture with bidirectional attention to generate high-quality text embeddings that capture context and meaning.
Q5. How is the Gemini Embedding model trained? A: Gemini Embedding is trained using a noise-contrastive estimation (NCE) loss with in-batch negatives. It goes through two training phases: pre-finetuning on a large dataset, followed by fine-tuning on task-specific datasets to improve NLP performance.