search
HomeTechnology peripheralsAIFirst-hand review of Claude 3.5: Is it really better than GPT-4o for playing tricks, seeing a doctor, playing tricks, and doing math problems?

Machine Power Report
Editor: Yang Wen
Playing tricks, seeing doctors, playing tricks, and doing math problems. Is "New King" Claude's 3.5 ability really so mysterious?

It’s coming, it’s coming, it’s coming with the Claude 3.5 Sonnet!

After three months of dormancy, just last night, OpenAI’s “strong rival” Anthropic launched a new generation model -

Claude 3.5 Sonnet!

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

What’s unique about this large model?

First of all, it can better grasp the nuances, humor and complex instructions, and the writing tone is more natural and friendly.

It is also Anthropic’s strongest visual model, good at tasks such as interpreting charts, graphs, or transcribing text from imperfect images.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Additionally, it performs exceptionally well on multiple assessment benchmarks including reasoning, reading comprehension, math, science, and coding.

In short, according to the official introduction, Claude 3.5 Sonnet is the smartest model so far, beating GPT-4o in many aspects.

Speaking of which, let’s not be polite and let Claude 3.5 Sonnet and GPT-4o compete directly to see which one is better.


First round: mind-eye training

In daily life, you will always encounter some embarrassing scenes.

For example, at a dinner party, you help the leader serve the rice. After the leader takes it, he says: "How about feeding the pigs after serving so much?" How would a person with high emotional intelligence respond in this situation?

We throw this problem to these two large models.

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

They know how to flatter you.

Claude 3.5 gave 5 examples in one breath, but the second sentence, "My eyesight is not good, so I regard you as the pillar of our unit." This is probably a slap in the face.

GPT-4o understands "the ways of the world" better, "Seeing that you maintain such a good figure, I have to ask you for weight loss tips", this flattery is just right.

It is worth mentioning that Claude 3.5 Sonnet has also launched a new function - the prompt word re-editing function.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Users can directly edit and modify the original prompt words without having to copy and paste them over and over again.


Second round: Generating recipes based on dishes

We uploaded a picture of "Fried Eggs with Tomatoes" and let the two large models introduce the production process.

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

They have a lot of experience with this classic Chinese dish, from ingredients to steps, and the most interesting thing is, it Both of them understand the essence of Chinese cooking, "a little bit", and both emphasize adding a little sugar to balance the acidity.

When it comes to cooking, the two large models are comparable.


The third game: Do math problems

In the official evaluation table, the math score of GPT-4o is slightly higher than Claude 3.5 Sonnet. Among them, GPT-4o is 76.6%, and Claude 3.5 Sonnet is 71.1%.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

We extracted two questions from Paper I of the 2024 New College Entrance Examination, one is a multiple-choice question and the other is an answer question, and they are "fed" to these two large models in the form of pictures.

The first question is a scoring question, and the correct answer is A.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

These two large models are "in tune", not only giving the correct answer, but also giving detailed information problem-solving steps.

We gave them the first question and asked them to give the solution process.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

The correct answer is: B=3/π.

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

In fact, this question is the most basic question, but the two large models are "as fierce as a tiger in one operation", and finally given got the wrong answer.

What’s even more funny is that this wrong answer did not come out of thin air, but after a series of reasoning, and even the mistakes were the same.

In terms of mathematical ability, these two large models are evenly matched.


The fourth game: Playing hot memes on the Internet

This year, the field of AI video has blossomed everywhere, not only breaking into new "players" - Keling, Luma, Jimeng, etc., the former AI The video "carries the handle" Runway is also "the return of the king".

As a result, netizens made this meme to poke fun at the status of major AI video applications today.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

We uploaded this meme to two large models respectively, and entered the prompt word "What does this picture mean?" to test their image interpretation capabilities.

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Claude 3.5 Sonnet has a detailed description in terms of screen characters, scenes and atmosphere, but it doesn’t seem to be Understand I don’t know the connotation of this meme, and I don’t know these AI video applications. I just vaguely stated that “this is a comment on the power structure in online communities, artificial intelligence systems, or virtual worlds.”

GPT-4o Take a look Just understand the meaning, "This picture may symbolize Runway's recognized superiority or leadership in the field of artificial intelligence and creative tools. Compared with other applications mentioned, Runway is highly regarded."

Obviously, this round, GPT-4o wins.


The fifth round: Understanding world famous paintings

We took out the picture "Spring Light" painted by Pierre-Auguste Coote in 1873 and asked them to identify the painting and appreciate it .

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

These two large models can be called "experts" in the art world. They both recognized the painting, expressed the basic information correctly, and appreciated it from different angles.

They all mentioned market value, however, Claude 3.5 Sonnet declined to comment, only reminding that "art valuation requires expert evaluation, considering multiple factors, and prices may fluctuate significantly over time."

GPT -4o believes that the painting may fetch millions of dollars. Is this too underestimated for this classic painting?

In this game, the two large models are tied.


The sixth round: AI doctoring

Recently, netizens have been playing with using large AI models to treat doctors. We took an X-ray of a 6-year-old's teeth and asked the models to use the teeth to infer age and what problems were present.

Claude 3.5 Sonnet:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

GPT-4o:

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Claude 3.5 Sonnet Based on the development of deciduous teeth and permanent teeth, we concluded that this is a child about 6-7 years old The child's teeth, the lower teeth are somewhat crowded, the permanent teeth appear to be impacted, and there may be decay in the darker areas of the teeth.

GPT-4o believes that these are the teeth of a child aged 7-9 years old. The main dental problems include crowding of permanent teeth and potential impaction.

At the same time, they all mentioned that this requires professional dental examination.

Compared between the two, Claude 3.5 Sonnet is more accurate in judging age.

In this game, Claude 3.5 is slightly better.

In addition, many netizens are also working online and coming up with many interesting ways to play.

For example, EverArt founder Pietro Schirano cloned the Mario game using geometric shapes with the help of Claude 3.5 Sonnet, and the entire process only lasted 3 minutes.

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

He said, "The crazy part is that it also animates the characters and the shapes look so original."

一手测评Claude 3.5:玩梗、看病、耍心眼 、做数学题,它真比GPT-4o强吗?

Video link: https://www.php. cn/link/a412963e013751a90654aa344bc26efe

Dear readers, do you think Claude 3.5 Sonnet has completed the "defeat" against GPT-4o this time?

The above is the detailed content of First-hand review of Claude 3.5: Is it really better than GPT-4o for playing tricks, seeing a doctor, playing tricks, and doing math problems?. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
DSA如何弯道超车NVIDIA GPU?DSA如何弯道超车NVIDIA GPU?Sep 20, 2023 pm 06:09 PM

你可能听过以下犀利的观点:1.跟着NVIDIA的技术路线,可能永远也追不上NVIDIA的脚步。2.DSA或许有机会追赶上NVIDIA,但目前的状况是DSA濒临消亡,看不到任何希望另一方面,我们都知道现在大模型正处于风口位置,业界很多人想做大模型芯片,也有很多人想投大模型芯片。但是,大模型芯片的设计关键在哪,大带宽大内存的重要性好像大家都知道,但做出来的芯片跟NVIDIA相比,又有何不同?带着问题,本文尝试给大家一点启发。纯粹以观点为主的文章往往显得形式主义,我们可以通过一个架构的例子来说明Sam

阿里云通义千问14B模型开源!性能超越Llama2等同等尺寸模型阿里云通义千问14B模型开源!性能超越Llama2等同等尺寸模型Sep 25, 2023 pm 10:25 PM

2021年9月25日,阿里云发布了开源项目通义千问140亿参数模型Qwen-14B以及其对话模型Qwen-14B-Chat,并且可以免费商用。Qwen-14B在多个权威评测中表现出色,超过了同等规模的模型,甚至有些指标接近Llama2-70B。此前,阿里云还开源了70亿参数模型Qwen-7B,仅一个多月的时间下载量就突破了100万,成为开源社区的热门项目Qwen-14B是一款支持多种语言的高性能开源模型,相比同类模型使用了更多的高质量数据,整体训练数据超过3万亿Token,使得模型具备更强大的推

ICCV 2023揭晓:ControlNet、SAM等热门论文斩获奖项ICCV 2023揭晓:ControlNet、SAM等热门论文斩获奖项Oct 04, 2023 pm 09:37 PM

在法国巴黎举行了国际计算机视觉大会ICCV(InternationalConferenceonComputerVision)本周开幕作为全球计算机视觉领域顶级的学术会议,ICCV每两年召开一次。ICCV的热度一直以来都与CVPR不相上下,屡创新高在今天的开幕式上,ICCV官方公布了今年的论文数据:本届ICCV共有8068篇投稿,其中有2160篇被接收,录用率为26.8%,略高于上一届ICCV2021的录用率25.9%在论文主题方面,官方也公布了相关数据:多视角和传感器的3D技术热度最高在今天的开

复旦大学团队发布中文智慧法律系统DISC-LawLLM,构建司法评测基准,开源30万微调数据复旦大学团队发布中文智慧法律系统DISC-LawLLM,构建司法评测基准,开源30万微调数据Sep 29, 2023 pm 01:17 PM

随着智慧司法的兴起,智能化方法驱动的智能法律系统有望惠及不同群体。例如,为法律专业人员减轻文书工作,为普通民众提供法律咨询服务,为法学学生提供学习和考试辅导。由于法律知识的独特性和司法任务的多样性,此前的智慧司法研究方面主要着眼于为特定任务设计自动化算法,难以满足对司法领域提供支撑性服务的需求,离应用落地有不小的距离。而大型语言模型(LLMs)在不同的传统任务上展示出强大的能力,为智能法律系统的进一步发展带来希望。近日,复旦大学数据智能与社会计算实验室(FudanDISC)发布大语言模型驱动的中

百度文心一言全面向全社会开放,率先迈出重要一步百度文心一言全面向全社会开放,率先迈出重要一步Aug 31, 2023 pm 01:33 PM

8月31日,文心一言首次向全社会全面开放。用户可以在应用商店下载“文心一言APP”或登录“文心一言官网”(https://yiyan.baidu.com)进行体验据报道,百度计划推出一系列经过全新重构的AI原生应用,以便让用户充分体验生成式AI的理解、生成、逻辑和记忆等四大核心能力今年3月16日,文心一言开启邀测。作为全球大厂中首个发布的生成式AI产品,文心一言的基础模型文心大模型早在2019年就在国内率先发布,近期升级的文心大模型3.5也持续在十余个国内外权威测评中位居第一。李彦宏表示,当文心

AI技术在蚂蚁集团保险业务中的应用:革新保险服务,带来全新体验AI技术在蚂蚁集团保险业务中的应用:革新保险服务,带来全新体验Sep 20, 2023 pm 10:45 PM

保险行业对于社会民生和国民经济的重要性不言而喻。作为风险管理工具,保险为人民群众提供保障和福利,推动经济的稳定和可持续发展。在新的时代背景下,保险行业面临着新的机遇和挑战,需要不断创新和转型,以适应社会需求的变化和经济结构的调整近年来,中国的保险科技蓬勃发展。通过创新的商业模式和先进的技术手段,积极推动保险行业实现数字化和智能化转型。保险科技的目标是提升保险服务的便利性、个性化和智能化水平,以前所未有的速度改变传统保险业的面貌。这一发展趋势为保险行业注入了新的活力,使保险产品更贴近人民群众的实际

致敬TempleOS,有开发者创建了启动Llama 2的操作系统,网友:8G内存老电脑就能跑致敬TempleOS,有开发者创建了启动Llama 2的操作系统,网友:8G内存老电脑就能跑Oct 07, 2023 pm 10:09 PM

不得不说,Llama2的「二创」项目越来越硬核、有趣了。自Meta发布开源大模型Llama2以来,围绕着该模型的「二创」项目便多了起来。此前7月,特斯拉前AI总监、重回OpenAI的AndrejKarpathy利用周末时间,做了一个关于Llama2的有趣项目llama2.c,让用户在PyTorch中训练一个babyLlama2模型,然后使用近500行纯C、无任何依赖性的文件进行推理。今天,在Karpathyllama2.c项目的基础上,又有开发者创建了一个启动Llama2的演示操作系统,以及一个

快手黑科技“子弹时间”赋能亚运转播,打造智慧观赛新体验快手黑科技“子弹时间”赋能亚运转播,打造智慧观赛新体验Oct 11, 2023 am 11:21 AM

杭州第19届亚运会不仅是国际顶级体育盛会,更是一场精彩绝伦的中国科技盛宴。本届亚运会中,快手StreamLake与杭州电信深度合作,联合打造智慧观赛新体验,在击剑赛事的转播中,全面应用了快手StreamLake六自由度技术,其中“子弹时间”也是首次应用于击剑项目国际顶级赛事。中国电信杭州分公司智能亚运专班组长芮杰表示,依托快手StreamLake自研的4K3D虚拟运镜视频技术和中国电信5G/全光网,通过赛场内部署的4K专业摄像机阵列实时采集的高清竞赛视频,

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.