Hands-on review of Claude 3.5: Is it really better than GPT-4o at memes, medical questions, social savvy, and math problems?

王林
2024-06-22 07:46:19
Machine Power Report
Editor: Yang Wen
Memes, medical questions, social savvy, and math problems: is the "new king" Claude 3.5 really as impressive as it sounds?

It's here, it's here: Claude 3.5 Sonnet has arrived!

After three months of quiet, OpenAI's "strong rival" Anthropic launched its new-generation model just last night:

Claude 3.5 Sonnet!

What’s unique about this large model?

First, it is better at grasping nuance, humor, and complex instructions, and it writes in a more natural, friendly tone.

It is also Anthropic's strongest vision model, good at tasks such as interpreting charts and graphs or transcribing text from imperfect images.

Additionally, it performs exceptionally well on multiple assessment benchmarks including reasoning, reading comprehension, math, science, and coding.

In short, according to the official introduction, Claude 3.5 Sonnet is the smartest model so far and beats GPT-4o in many areas.

So, without further ado, let's pit Claude 3.5 Sonnet against GPT-4o head to head and see which one is better.


First round: Social savvy

In daily life, you inevitably run into awkward situations.

For example, at a dinner party you serve your boss a bowl of rice. He takes it and says: "You served so much, are you feeding a pig?" How would a person with high emotional intelligence respond in this situation?

We put this question to the two large models.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

Both know how to flatter.

Claude 3.5 gave five examples in one go, but the second one, "My eyesight is not good, so I took you for the pillar of our organization," would probably backfire.

GPT-4o has a better grasp of "the ways of the world": "Seeing how well you keep your figure, I have to ask you for weight-loss tips." This flattery is just right.

It is worth mentioning that Claude 3.5 Sonnet also comes with a new feature: prompt re-editing.

Users can edit the original prompt directly instead of copying and pasting it over and over again.


Second round: Generating a recipe from a picture of a dish

We uploaded a picture of "Fried Eggs with Tomatoes" and asked the two large models to describe how to make it.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

Both know this classic Chinese dish well, from ingredients to steps. Most interesting of all, both grasp the essence of Chinese cooking, "a little bit," and both recommend adding a little sugar to balance the acidity.

When it comes to cooking, the two large models are comparable.


Third round: Doing math problems

In the official evaluation table, GPT-4o's math score is slightly higher than Claude 3.5 Sonnet's: 76.6% versus 71.1%.

We picked two questions from Paper I of the 2024 New College Entrance Examination, one multiple-choice and one free-response, and "fed" them to the two large models as images.

The first question is an easy "giveaway" question, and the correct answer is A.

[Image: the exam question]

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

The two large models were in sync here: both gave the correct answer along with detailed problem-solving steps.

Next, we gave them the first part of the free-response question and asked them to show the solution process.

[Image: the exam question]

The correct answer is: B = π/3.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

In fact, this is a very basic question, but after a flurry of "fierce as a tiger" calculation, both large models ended up with the wrong answer.

What's even funnier is that the wrong answer did not come out of thin air: each model reached it through a chain of reasoning, and they even made the same mistake.

In terms of mathematical ability, these two large models are evenly matched.


Fourth round: Understanding trending internet memes

This year, AI video has blossomed everywhere: not only have new "players" such as Keling, Luma, and Jimeng broken in, but Runway, the former "top dog" of AI video, has also staged a "return of the king."

As a result, netizens made this meme to poke fun at the status of major AI video applications today.

[Image: the meme]

We uploaded this meme to each of the two large models with the prompt "What does this picture mean?" to test their image interpretation capabilities.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

Claude 3.5 Sonnet described the characters, scene, and atmosphere in detail, but it did not seem to understand the point of the meme or to recognize these AI video applications. It only stated vaguely that "this is a comment on the power structure in online communities, artificial intelligence systems, or virtual worlds."

GPT-4o got the meaning at a glance: "This picture may symbolize Runway's recognized superiority or leadership in the field of artificial intelligence and creative tools. Compared with other applications mentioned, Runway is highly regarded."

Obviously, this round, GPT-4o wins.


Fifth round: Understanding world-famous paintings

We showed them "Springtime" (Le Printemps), painted by Pierre-Auguste Cot in 1873, and asked them to identify the painting and offer an appreciation of it.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

These two large models could pass for "experts" in the art world. Both recognized the painting, stated its basic information correctly, and discussed it from different angles.

Both mentioned market value. However, Claude 3.5 Sonnet declined to give an estimate, noting only that "art valuation requires expert evaluation, considering multiple factors, and prices may fluctuate significantly over time."

GPT-4o believes the painting could fetch millions of dollars. Isn't that an underestimate for such a classic?

In this round, the two large models are tied.


Sixth round: AI as doctor

Recently, netizens have been experimenting with using large AI models to diagnose ailments. We took a dental X-ray of a 6-year-old and asked the models to infer the child's age from the teeth and identify any problems.

Claude 3.5 Sonnet:

[Screenshot: Claude 3.5 Sonnet's response]

GPT-4o:

[Screenshot: GPT-4o's response]

Based on the development of the deciduous and permanent teeth, Claude 3.5 Sonnet concluded that these are the teeth of a child about 6-7 years old, that the lower teeth are somewhat crowded, that the permanent teeth appear to be impacted, and that the darker areas of the teeth may indicate decay.

GPT-4o believes that these are the teeth of a child aged 7-9 years old. The main dental problems include crowding of permanent teeth and potential impaction.

Both also noted that a professional dental examination is required.

Of the two, Claude 3.5 Sonnet judged the age more accurately.

In this round, Claude 3.5 comes out slightly ahead.

In addition, many netizens have been trying the model out online and coming up with interesting uses.

For example, EverArt founder Pietro Schirano cloned the Mario game using geometric shapes with the help of Claude 3.5 Sonnet, and the entire process took only 3 minutes.

He said, "The crazy part is that it also animates the characters and the shapes look so original."

Video link: https://www.php.cn/link/a412963e013751a90654aa344bc26efe

Dear readers, do you think Claude 3.5 Sonnet has delivered the "knockout blow" to GPT-4o this time?
