Pixtral-12B: Mistral AI's First Multimodal Model

Pixtral-12B: Mistral AI's First Multimodal Model - Analytics Vidhya

Apr 13, 2025, 11:20 AM

Introduction

Mistral has released its first multimodal model, Pixtral-12B-2409. It is built on Mistral’s 12-billion-parameter model, Nemo 12B. What sets it apart? It accepts both images and text as input. Let’s look at the model, how to use it, how well it performs, and what else you need to know.

In this article, you will learn about the Pixtral-12B model. It is a deep learning model that pairs a language model with a vision adapter so that it can understand images alongside text. We will look at its technical features, how to download and run it, and how it performs on image description and story generation.

Overview

Discover Mistral’s new Pixtral-12B, a multimodal model combining text and image processing for versatile AI applications.
Learn how to use Pixtral-12B, Mistral’s latest AI model, designed to handle both text and high-resolution images.
Explore the capabilities and use cases of the Pixtral-12B model, featuring a vision adapter for enhanced image understanding.
Understand Pixtral-12B’s multimodal features and its potential applications in image captioning, story generation, and more.
Get insights into Pixtral-12B’s design, performance, and how to fine-tune it for specific multimodal tasks.

What is Pixtral-12B?
How to Use Pixtral-12B-2409?

What is Pixtral-12B?

Pixtral-12B is a multimodal model derived from Mistral’s Nemo 12B, with an added 400M-parameter vision adapter. The model can be downloaded via a torrent file or from Hugging Face under an Apache 2.0 license. Let’s look at some of the technical features of the Pixtral-12B model:

Feature	Details
Model Size	12 billion parameters
Layers	40 Layers
Vision Adapter	400 million parameters, utilizing GeLU activation
Image Input	Accepts 1024 x 1024 images via URL or base64, segmented into 16 x 16 pixel patches
Vision Encoder	2D RoPE (Rotary Position Embeddings) enhances spatial understanding
Vocabulary Size	Up to 131,072 tokens
Special Tokens	img, img_break, and img_end
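
To make the patching figures in the table concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical, and the token layout (one delimiter token closing each row of patches, img_break between rows and img_end after the last one) is an assumption for illustration, not something the table confirms.

```python
import math

PATCH_SIZE = 16  # pixels per patch side, taken from the table above


def patch_grid(width: int, height: int, patch: int = PATCH_SIZE) -> tuple[int, int]:
    """Return (columns, rows) of patches needed to cover a width x height image."""
    # Ceiling division: a partial patch at the edge is assumed to count as a full one.
    return math.ceil(width / patch), math.ceil(height / patch)


def image_token_count(width: int, height: int) -> int:
    """Estimate the image token count under the assumed row-delimited layout."""
    cols, rows = patch_grid(width, height)
    # Assumption: each row holds `cols` img tokens and is closed by one delimiter
    # token (img_break, or img_end for the final row).
    return (cols + 1) * rows


cols, rows = patch_grid(1024, 1024)
print(cols, rows, cols * rows)        # 64 64 4096
print(image_token_count(1024, 1024))  # 4160
```

Under these assumptions, a full 1024 x 1024 image maps to a 64 x 64 grid of patches, which helps explain why a large context length is useful when working with images.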

How to Use Pixtral-12B-2409?

As of September 15, 2024, the model is not yet available on Mistral’s Le Chat or La Plateforme, so it cannot be used through the chat interface or the API. However, we can download the weights through a torrent link and run them locally, or even fine-tune them to suit our needs. We can also use the model through Hugging Face. Let’s look at both options in detail:

Torrent link to Use:

magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a&dn=pixtral-12b-240910&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.ipv6tracker.org%3A80%2Fannounce
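
Before pasting a magnet link into a torrent client, it can help to check that it decodes cleanly. The sketch below uses only the Python standard library. The helper name parse_magnet is hypothetical, and the percent-encoded tracker URLs are an assumed clean form of the link above.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed clean, percent-encoded form of the magnet link given above.
MAGNET = (
    "magnet:?xt=urn:btih:7278e625de2b1da598b23954c13933047126238a"
    "&dn=pixtral-12b-240910"
    "&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce"
    "&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce"
    "&tr=http%3A%2F%2Ftracker.ipv6tracker.org%3A80%2Fannounce"
)


def parse_magnet(link: str) -> dict:
    """Split a magnet link into its info hash, display name, and tracker list."""
    parts = urlsplit(link)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet link")
    params = parse_qs(parts.query)  # also decodes %3A, %2F, and so on
    return {
        "info_hash": params["xt"][0].removeprefix("urn:btih:"),
        "name": params.get("dn", [""])[0],
        "trackers": params.get("tr", []),
    }


info = parse_magnet(MAGNET)
print(info["name"])  # pixtral-12b-240910
for tracker in info["trackers"]:
    print(tracker)
```

If the tracker URLs print as ordinary udp:// and http:// addresses, the link was copied without line breaks or stray characters.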

I’m using an Ubuntu laptop, so I’ll use the Transmission application (it’s pre-installed in most Ubuntu computers). You can use any other application to download the torrent link for the open-source model.

Click “File” at the top left and select the open URL option. Then, you can paste the link that you copied.

Click “Open” to start downloading the Pixtral-12B model. The download is a folder containing the model files.

Hugging Face

This model needs a high-end GPU, so I suggest using the paid version of Google Colab or a Jupyter Notebook on RunPod. I’ll be using RunPod for this demo of the Pixtral-12B model. If you’re using a RunPod instance with a 40 GB disk, I suggest the A100 PCIe GPU.
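
As a rough check on why a large GPU is needed, the weight footprint can be estimated from the parameter counts given earlier. This is a back-of-the-envelope sketch that assumes the weights are stored in 16-bit precision (2 bytes per parameter). The function name is hypothetical.

```python
def weight_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


LANGUAGE_PARAMS = 12e9  # 12 billion parameters, from the feature table
VISION_PARAMS = 400e6   # 400 million parameter vision adapter

total = weight_size_gb(LANGUAGE_PARAMS + VISION_PARAMS)
print(f"{total:.1f} GB")  # 24.8 GB
```

Under that assumption the weights alone come to about 25 GB, which is consistent with the download size mentioned later in this article. Inference also needs memory for activations and the key-value cache, so the GPU should have comfortably more than that.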

We’ll be using the Pixtral-12B with the help of vllm. Make sure to do the following installations.

!pip install vllm
!pip install --upgrade mistral_common

Go to the model page on Hugging Face (mistralai/Pixtral-12B-2409) and agree to the terms to access the model. Then go to your profile and click on “access_tokens.” If you don’t already have an access token, create one and make sure the required permission boxes are checked.

Now run the following code and paste the Access Token to authenticate with Hugging Face:

from huggingface_hub import notebook_login

notebook_login()

This will take a while as the 25 GB model gets downloaded for use:

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(model=model_name, tokenizer_mode="mistral", max_model_len=70000)

prompt = "Describe this image"
image_url = "https://images.news18.com/ibnlive/uploads/2024/07/suryakumar-yadav-catch-1-2024-07-4a496281eb830a6fc7ab41e92a0d295e-3x2.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

I asked the model to describe the image at the URL above, which is from the T20 World Cup 2024:

outputs = llm.chat(messages, sampling_params=sampling_params)
print('\n' + outputs[0].outputs[0].text)

Output

Processed prompts: 100%|██████████| 1/1 [00:06 input: 429.80 toks/s, output: 51.54 toks/s]
The image is a composite of three frames showing a cricket player in action,
 likely from the T20 World Cup. Here's a detailed summary:

1. **Left Frame**:
   - The player is seen mid-jump, looking upwards with his arms outstretched,
 attempting to catch a cricket ball.
   - He is wearing a blue jersey with the name "Surya" and the number "63"
 printed on the back.
   - His expression is intense and focused, emphasizing the concentration
 required for the catch.

2. **Middle Frame**:
   - The player has his hands cupped together, measurably closer to catching
 the ball.
   - He wears the same blue jersey with red sleeves, which also features the
 "India" team logo and the sponsor's branding, specifically from Adidas.
   - The ball is now very close to his hands, indicating a high likelihood of
 a successful catch.

3. **Right Frame**:
   - The player has successfully caught the ball, raising both hands above
 his head with the ball clasped in his left hand.
   - He displays a triumphant expression with a broad smile, typically
 associated with successfully making a catch.
   - His uniformity and raised hands convey a sense of victory and great
 effort.

The background in all frames features billboards indicating the "T20 World
 Cup," and specifically, a prominent sign for "Emirates," which has
 presumably sponsored the event. There is green grass on the field,
 reflecting the actual playing conditions during the match.

Overall, the image illustrates a key moment in a cricket game, showcasing the
 player's athletic skill and team spirit.

From the output, we can see that the model was able to identify the image from the T20 World Cup, and it was able to distinguish the frames in the same image to explain what was happening.

prompt = "Write a story describing the whole event that might have happened"
image_url = "https://images.news18.com/ibnlive/uploads/2024/07/suryakumar-yadav-catch-1-2024-07-4a496281eb830a6fc7ab41e92a0d295e-3x2.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print('\n' + outputs[0].outputs[0].text)

Output

Processed prompts: 100%|██████████| 1/1 [00:06 input: 401.77 toks/s, output: 52.92 toks/s]
In the heart of a bustling stadium, the T20 World Cup is in full swing. The
 atmosphere is electric, and the crowd is on the edge of their seats, eagerly
 watching the match between India and one of their formidable rivals. The
 focus shifts to the Indian fielders, specifically Suryakumar Yadav, who is
 positioned strategically on the boundary line.

As the ball sails into the air after a powerful shot from the opposing
 batsman, Suryakumar's eyes follow its trajectory. With age-old instincts
 honed by countless hours of practice, he knows that this is his moment. The
 ball is lofted high, giving him just enough time to gauge its path and plan
 his next move.

In the first snapshot, Suryakumar is Hat-cheaper crouched, eyes fixed on the
 ball as it peaks. The anticipation is palpable. His teammates watch
 intently, hoping he can execute the catch perfectly.

The second snapshot captures a moment of pure athleticism. With a burst of
 energy and agility, Suryakumar springs into action. He takes a few swift
 steps forward, his body eager to reach the height required to make the
 catch. The ball is slightly above his head, and he extends his arms, fingers
 spread wide, ready to secure the prize.

In the final snapshot, Suryakumar's face lights up with accomplishment. His
 eyes are focused on the ball, now safely nestled in his palm. The crowd
 explodes with cheers, acknowledging the outstanding effort. His teammates
 rush towards him, celebrating the crucial catch that could turn the tide of
 the match.

This sequence of successful plays not only highlights Suryakumar's individual
 skill but also underscores the strategic teamwork and determined spirit that
 define the Indian cricket team in the prestigious T20 World Cup.

When asked to write a story about the image, the model could gather context on the environment’s characteristics and what exactly happened in the frame.
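
The feature table notes that images can be supplied by URL or as base64. The examples above use a URL, so here is a minimal sketch of the base64 route for a local file, using only the standard library. The helper names image_to_data_url and build_messages are hypothetical, and it is an assumption that the "image_url" field accepts a data URL in this setup.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(prompt: str, image_url: str) -> list:
    """Build a single-turn message list in the same shape as the examples above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]


# Usage ("catch.jpg" is a hypothetical local file):
# messages = build_messages("Describe this image", image_to_data_url("catch.jpg"))
# outputs = llm.chat(messages, sampling_params=sampling_params)
```

The build_messages helper also removes the need to repeat the message dictionary for every new prompt.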

Conclusion

The Pixtral-12B model significantly advances Mistral’s AI capabilities, blending text and image processing to expand its use cases. Its ability to handle high-resolution 1024 x 1024 images with a detailed understanding of spatial relationships and its strong language capabilities make it an excellent tool for multimodal tasks such as image captioning, story generation, and more.

Beyond these out-of-the-box capabilities, the model can be fine-tuned to meet specific needs, whether improving image recognition, enhancing language generation, or adapting it to more specialized domains. This flexibility is a crucial advantage for developers and researchers who want to tailor the model to their use cases.

Q1. What is vLLM?

A. vLLM is a library optimized for efficient inference of large language models, improving speed and memory usage during model execution.

Q2. What’s the use of SamplingParams?

A. SamplingParams in vLLM control how the model generates text, specifying parameters like the maximum number of tokens and sampling techniques for text generation.

Q3. Will the model be available on Mistral’s Le Chat?

A. Yes. Sophia Yang, Head of Developer Relations at Mistral, mentioned that the model would soon be available on Le Chat and La Plateforme.
