Musk's new work! Shocking release of Grok-1.5V multi-modal model: perfect integration of digital and physical worlds-AI-php.cn


王林

Apr 15, 2024 pm 04:01 PM


Musk’s first-generation multi-modal model Grok-1.5V is finally here!

xAI says Grok-1.5V can connect the digital world and the physical world.

In addition to its text capabilities, it can process a wide range of visual information, including documents, charts, screenshots and photos.

Grok-1.5V will soon be available to early testers and existing Grok users.

Jim Fan, a senior scientist at NVIDIA, singled out what he sees as the biggest highlight of Grok-1.5V: in his view, the most interesting part is its potential to solve edge cases in autonomous driving.

Using language for "chain of thought" reasoning can help the car work through complex scenarios, reason with rules and counterfactuals, and explain its decisions. Grok-1.5V can take perception down to the pixel level, which in turn supports a pixel->action mapping.

If so, Tesla FSD V13 may well come to understand language tokens!

Tesla's advantage is its highly mature data pipeline, so labeling a large number of edge cases with high-quality human explanation traces should not be difficult. A Grok fine-tuned for multi-modal FSD reasoning is likely to perform far better than GPT-4V and Gemini.

You might ask: didn't Wayve's LINGO-1 explore similar ideas earlier? But don't forget that Tesla is spinning a data flywheel of unparalleled scale, far larger than anything before it.

Musk also showed up in the replies at the right moment to add:

Yes. Two data sources scale without limit: synthetic data and real-world video. The former raises the question of whether it is true; the latter does not.

Some netizens commented that this sounds like the most feasible path to AGI: multi-modal models that use explicit linguistic reasoning to consistently infer world models.

Model capability

The xAI official website introduces Grok-1.5V as follows.

It can compete with the top multi-modal models such as GPT-4V, Claude 3 Sonnet, Claude 3 Opus, etc.

Whether it is multi-disciplinary reasoning, or document understanding, scientific diagrams, tables, screenshots and photos, Grok-1.5V can compete with them.

What is particularly surprising is its outstanding ability to understand the physical world.

Grok was evaluated on the following datasets in a zero-shot setting without chain-of-thought prompting. On the RealWorldQA benchmark, Grok outperforms comparable models.

The following 7 representative examples illustrate the good performance of Grok 1.5V in different scenarios.

1. Write code from a diagram

The following figure shows the flowchart of a simple guessing game. Given the content of the flowchart, Grok can implement the process in Python.

User:

Can you translate it into Python code?

Grok:

Of course! This flowchart depicts a simple guessing game in which the computer generates a random number and the user needs to guess what it is. Here is the Python code that represents the logic in the flowchart:

    import random

    def guess_number():
        # Generate a random number between 1 and 10
        target = random.randint(1, 10)
        while True:
            guess = int(input("Guess a number between 1 and 10: "))
            if guess == target:
                print("You won!")
                break
            else:
                print("Wrong guess, try again!")

    if __name__ == "__main__":
        guess_number()

This code will run the guessing game described in the flowchart. The user is prompted to enter the guessed number until the user guesses the correct number.

2. Calculate calories

From the food's nutrition label, Grok reads that one serving is 3 slices containing 60 calories, and works out that the 5 slices the user asks about come to 100 calories.

User:

How many calories are in 5 slices?

Grok:

If one serving is 3 slices and contains 60 calories in total, then 5 slices contain about 100 calories.

First, determine the calories per slice: 60 calories ÷ 3 slices = 20 calories/slice.

Next, multiply the calories per slice by the number of slices: 20 calories x 5 slices = 100 calories.
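The same two-step arithmetic can be written as a short function. This is only a sketch of the calculation in Grok's answer; the function name `calories_for` and its parameter names are made up for illustration, and the defaults (3 slices, 60 calories per serving) are the label values quoted above.

```python
def calories_for(slices: int, serving_slices: int = 3,
                 serving_calories: float = 60.0) -> float:
    """Scale the label's per-serving calories to a given number of slices."""
    # Step 1: calories per slice (60 / 3 = 20 with the default label values)
    per_slice = serving_calories / serving_slices
    # Step 2: multiply by the number of slices actually eaten
    return per_slice * slices


print(calories_for(5))  # 100.0
```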

3. From a drawing to a bedtime story

Grok can tell a short bedtime story based on the content of a drawing.

For example, take the following doodle:

User:

This is what my son drew. Can you tell a bedtime story based on his drawing?

Grok:

A long time ago, in a land full of adventure and wonder, there lived a brave little boy named Timmy. He lived in a warm cabin with his family and loved nothing more than exploring the world around him.

On a sunny day, Timmy decided to explore the nearby forest. He packed his favorite snacks, a compass, and his trusty notebook to record all the amazing things he was about to see.

After going deep into the forest, Timmy discovered a magical river that sparkled in the sun. The charming scenery of the river made him decide to make a small paper boat and place it on the water as a special greeting to the forest spirits.

Timmy carefully folded the paper boat and drew a smiley face on it, expressing his love and blessing for the forest elves. Then he gently placed the paper boat in the river and watched it drift away with the water.

As the sun set, Timmy knew it was time to go home. He thought back to his day in the forest, knowing that this wonderful memory would stay with him forever.

Back home, Timmy told his family about his adventures. The family members were very happy and believed that he would experience more exciting adventures in the future.

So, with a heart full of love and dreams, Timmy slowly drifted off to sleep, looking forward to new adventures in his imaginary world.

4. Understand the meme

In the meme below, the left side is labeled as a startup and the right side as a big company. What does it mean?

The editor did not get it at first glance; it only clicked after reading Grok's explanation.

User:

I don’t quite understand this picture, please explain it.

Grok:

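This image humorously illustrates the difference in how startups and big companies work.

On the left, under "Startups", a group of people work together to dig a deep hole, and everyone takes an active part.

On the right, under "Big companies", only one person is digging while the others just stand by and watch.

The humor comes from exaggerating the difference between the two kinds of organization.

In a startup, everyone is usually directly involved in the core task, whereas in a big company the sheer size of the structure and its bureaucracy can keep team members from being involved in every detail of a project.

5. Convert a table to CSV

Grok can convert the content of a table into CSV format.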

User:

Please convert this table to CSV format.

Grok:

The table in the image can be converted to CSV format as follows:

    medal,name,sport,event,date
    1 gold,El Amin Chentouf,Athletics,Men's marathon - T13,11 September
    1 gold,Mohamed Amguoun,Athletics,Men's 400m - T13,15 September
    1 gold,Azeddine Nouiri,Athletics,Men's shot put - T34,15 September
    2 silver,Mahdi Afri,Athletics,Men's 400m - T12,9 September
    2 silver,El Amin Chentouf,Athletics,Men's 5000m - T13,15 September
    3 bronze,Mahdi Afri,Athletics,Men's 200m - T12,17 September
    3 bronze,Mohamed Lahna,Paratriathlon,Men's PT2,10 September

Note that CSV does not support styling or formatting, so the information is presented as plain text.
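Plain-text CSV like this is easy to check or reuse downstream with Python's standard `csv` module. The following is a minimal sketch, assuming the rows are exactly those in Grok's reply; the helper name `medals_by_athlete` is hypothetical and not part of the original example.

```python
import csv
import io

# The CSV text from Grok's reply, one record per line.
CSV_TEXT = """medal,name,sport,event,date
1 gold,El Amin Chentouf,Athletics,Men's marathon - T13,11 September
1 gold,Mohamed Amguoun,Athletics,Men's 400m - T13,15 September
1 gold,Azeddine Nouiri,Athletics,Men's shot put - T34,15 September
2 silver,Mahdi Afri,Athletics,Men's 400m - T12,9 September
2 silver,El Amin Chentouf,Athletics,Men's 5000m - T13,15 September
3 bronze,Mahdi Afri,Athletics,Men's 200m - T12,17 September
3 bronze,Mohamed Lahna,Paratriathlon,Men's PT2,10 September
"""


def medals_by_athlete(text: str) -> dict:
    """Count how many medal rows each athlete has (hypothetical helper)."""
    counts = {}
    # DictReader uses the first line as the column names.
    for row in csv.DictReader(io.StringIO(text)):
        counts[row["name"]] = counts.get(row["name"], 0) + 1
    return counts


print(medals_by_athlete(CSV_TEXT))
```

Because none of the fields contain commas, no quoting is needed here; a table with commas inside cells would need quoted fields to survive the round trip.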

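6. Advice on rotting wood in a deck

What should you do when the wood in your deck is rotting?

Send Grok a photo and it can help judge the extent of the rot and suggest a reasonable way to deal with it.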

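User:

These holes are appearing around the screws in my deck. Is it rotten? Should I replace the boards?

Grok:

The boards appear to have started rotting. The holes around the screws are a typical sign of wood decay.

To keep the deck safe and extend its life, it is advisable to replace the damaged boards. Doing so prevents further damage to the deck and keeps its structure sound.

If you are not sure how to replace the boards, you can ask a professional deck builder or contractor for help. They can assess the extent of the damage and give you expert advice on maintaining the deck.

7. Solve a coding problem

Coding problems are no trouble for Grok-1.5V either.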

User:

Can you write Python code that solves this problem?

Grok:

    def find_triplets(nums):
        result = []
        nums.sort()
        for i in range(len(nums) - 2):
            if i > 0 and nums[i] == nums[i - 1]:
                continue
            target = -nums[i]
            left = i + 1
            right = len(nums) - 1
            while left
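The listing above breaks off at the `while left` loop. What follows is a sketch of how a sorted two-pointer solution of this shape is commonly completed, assuming the problem in the screenshot is the classic task of finding all unique triplets that sum to zero, which the visible `target = -nums[i]` setup suggests. Everything from the `while` loop onward is an assumption, not Grok's verbatim output.

```python
def find_triplets(nums):
    """Return all unique triplets in nums that sum to zero."""
    result = []
    nums.sort()  # sorts in place, as in the visible part of the listing
    for i in range(len(nums) - 2):
        # Skip repeated anchor values so each triplet is reported once.
        if i > 0 and nums[i] == nums[i - 1]:
            continue
        target = -nums[i]
        left = i + 1
        right = len(nums) - 1
        # Assumed continuation: classic two-pointer scan over the sorted tail.
        while left < right:
            pair_sum = nums[left] + nums[right]
            if pair_sum == target:
                result.append([nums[i], nums[left], nums[right]])
                left += 1
                right -= 1
                # Move both pointers past duplicates of the values just used.
                while left < right and nums[left] == nums[left - 1]:
                    left += 1
                while left < right and nums[right] == nums[right + 1]:
                    right -= 1
            elif pair_sum < target:
                left += 1   # sum too small: move to a larger value
            else:
                right -= 1  # sum too large: move to a smaller value
    return result


print(find_triplets([-1, 0, 1, 2, -1, -4]))  # [[-1, -1, 2], [-1, 0, 1]]
```

Sorting first is what makes both the duplicate skipping and the pointer movement valid; the overall cost is quadratic in the length of the list.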

Understanding of the real world

To develop useful real-world AI assistants, it is crucial to improve a model's understanding of the physical world.

To that end, the research team also introduced a new benchmark, RealWorldQA.

The benchmark is designed to evaluate the basic real-world spatial understanding of multi-modal models.

While many examples in the current benchmark are relatively easy for humans, they are still challenging for cutting-edge models.

The initial release of RealWorldQA contains over 700 images, each with a question and easily verifiable answer.

For example, if we ask it: Which object is bigger, the pizza cutter or the scissors in the picture?

The answer given by Grok is: they are about the same size.

Where can I go from the current lane? A. Turn left B. Go straight C. Turn left and go straight D. Turn right

Grok will choose A.

Judging from the camera in front of the car, is there enough space to get around the gray car in front?

The answer given by Grok is: yes.

From this picture, which direction is the dinosaur facing?

Grok answered: East.

This dataset includes anonymized images taken from vehicles, in addition to other real-world images.

The research team has released the dataset publicly and hopes to keep expanding it as its multi-modal models improve.
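A benchmark of this shape, where each image comes with one question and one easily verifiable answer, is usually scored by comparing the model's answer with the reference answer. The sketch below shows one simple way to do that with exact matching after light normalization. It is not xAI's evaluation code: the `QAItem` structure, the file names, the questions and the reference answers are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    image_path: str  # hypothetical file name, not from the real dataset
    question: str
    answer: str      # reference answer, e.g. "A" or "Yes"


def normalize(text: str) -> str:
    """Lower-case and trim an answer so 'Yes.' and 'yes' compare equal."""
    return text.strip().lower().rstrip(".")


def accuracy(items, predictions) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    if not items:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(item.answer)
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Placeholder items and model outputs, for illustration only.
items = [
    QAItem("img_001.jpg", "Placeholder multiple-choice question", "A"),
    QAItem("img_002.jpg", "Placeholder yes/no question", "Yes"),
    QAItem("img_003.jpg", "Placeholder multiple-choice question", "C"),
]
predictions = ["a", "Yes.", "B"]

print(f"accuracy = {accuracy(items, predictions):.2f}")  # accuracy = 0.67
```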

Finally, the xAI website concludes that advancing both multi-modal understanding and generation capabilities is an important step toward building beneficial AGI that can understand the universe.

Over the next few months, xAI expects significant improvements in both capabilities across modalities such as images, audio and video.

Statement

This article is reproduced from 51CTO.COM. In case of infringement, please contact admin@php.cn to have it removed.
