Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked

WBOY

Mar 06, 2024 pm 01:00 PM

Has the text-only race among large models finally run its course?

Last night, Anthropic, OpenAI's biggest competitor, released a new generation of large AI models: the Claude 3 series.

The series contains three models, ranked from least to most capable: Claude 3 Haiku, Claude 3 Sonnet and Claude 3 Opus. Opus, the most capable, scores higher than GPT-4 and Gemini 1.0 Ultra on multiple benchmarks, setting new industry marks in mathematics, programming, multilingual understanding, and vision.

Anthropic states that Claude 3 Opus possesses knowledge at the level of a human undergraduate.

With this release, Claude supports multimodal input for the first time (Opus scores 59.4% on MMMU, exceeding GPT-4V and on par with Gemini 1.0 Ultra). Users can now upload photos, charts, documents and other unstructured data for the model to analyze and answer questions about.

In addition, all three models retain the Claude series' long-standing advantage: a long context window. At launch they support a context window of 200K tokens, but Anthropic says all three can accept inputs of 1 million tokens (for specific customers), roughly the length of the English edition of "Moby Dick" or "Harry Potter and the Deathly Hallows".

On pricing, however, the most capable Claude 3 model is also much more expensive than GPT-4 Turbo: GPT-4 Turbo charges $10 per million input tokens and $30 per million output tokens, while Claude 3 Opus charges $15 and $75.
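
To make the price gap concrete, here is a minimal Python sketch that applies the per-million-token prices quoted above to a single request. The `Pricing` class, the `request_cost` helper and the example workload (a 100,000-token document with a 2,000-token summary) are illustrative assumptions, not part of any official SDK or price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD price per one million tokens, as quoted in the article."""
    input_per_million: float
    output_per_million: float


# Prices taken from the paragraph above.
GPT4_TURBO = Pricing(input_per_million=10.0, output_per_million=30.0)
CLAUDE3_OPUS = Pricing(input_per_million=15.0, output_per_million=75.0)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under the given pricing."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 100,000-token document and a 2,000-token summary.
    for name, pricing in (("GPT-4 Turbo", GPT4_TURBO), ("Claude 3 Opus", CLAUDE3_OPUS)):
        print(f"{name}: ${request_cost(pricing, 100_000, 2_000):.2f}")
```

Under these assumptions the request costs about $1.06 on GPT-4 Turbo and about $1.65 on Claude 3 Opus. The gap widens for output-heavy workloads, since the output price differs by 2.5x while the input price differs by 1.5x.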

Opus and Sonnet are now available on claude.ai and through the Claude API, with Haiku coming soon. Amazon Web Services has announced that the new model is now available on Amazon Bedrock. Anthropic also released an official demo.

After Anthropic's announcement, many researchers who got to try the models shared their experiences. Some say that Claude 3 Sonnet solved a puzzle that previously only GPT-4 could solve.

However, others say that in actual use Claude 3 does not completely beat GPT-4.

Hands-on testing of Claude 3

Address: https://claude.ai/

Does Claude 3 really outperform GPT-4, as officially claimed? So far, most people think there is something to it.

Here are some of our test results:

First, a brain teaser: which month has twenty-eight days? The correct answer is every month. It seems Claude 3 is not yet good at this kind of question.

Next we tested areas where Claude 3 is supposed to excel. According to the official introduction, Claude is good at "understanding and processing images", including extracting text from images, converting UI to front-end code, understanding complex equations, transcribing handwritten notes, and more.

Large models often struggle to tell fried chicken from teddy dogs. When we gave it a picture containing both, Claude 3 answered: "This image is a collage of dogs and chicken nuggets or nuggets that bear a striking resemblance to the dogs themselves..." A passing answer.

Then, asked how many characters were in a picture, Claude 3 also answered correctly: "This animation depicts seven small cartoon characters."

Claude 3 can extract text from photos, and even vertically written Chinese and Japanese is recognized correctly:

How does it respond to memes from the Internet? Shown a picture of an optical illusion, GPT-4 and Claude 3 gave opposite guesses:

Which one is correct?

Beyond understanding images, Claude is also strong at processing long texts. All the models released this time offer a 200K context window and can accept inputs of more than 1 million tokens.

How well does it work? We gave it the recent paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", published by Microsoft and the University of Chinese Academy of Sciences, and asked it to summarize the main points as a numbered list. We timed it: the full answer took about 15 seconds.

This is only the result from Claude 3 Sonnet, though. The Claude Pro version would be faster, but it costs $20 a month.

Note that Claude currently requires uploaded files to be no larger than 10MB. If a file exceeds that, a prompt appears:

In the Claude 3 blog post, Anthropic says the new models' coding capabilities have improved greatly. Someone fed plain ASCII codes directly to Claude and found it handled them without difficulty:

It seems fair to conclude that Claude 3's coding ability is stronger than GPT-4's.

Not long ago, Karpathy, who had just left OpenAI, proposed a "tokenizer" challenge: feed his 2-hour-13-minute tutorial video into an LLM and have it converted into a book chapter or blog post about tokenizers.

Claude 3 took on the task. Here are the results posted by Anthropic research engineer Emmanuel Ameisen:

Perhaps because he no longer has a stake in the matter, Karpathy gave a fairly thorough and objective assessment:

From a style point of view, it is indeed quite good! If you look closely, you'll notice some subtle issues/hallucinations. Regardless, it's impressive to have a system that works almost out of the box. I'm looking forward to playing more with Claude 3; it looks like a strong model.

If there's anything relevant I have to say, it's that people should be extremely careful when making evaluation comparisons, not only because the evaluations themselves are worse than you think, but also because many of them are overfitted in undefined ways, and because the comparisons made can be misleading. GPT-4's coding score (HumanEval) is not 67%. Whenever I see this comparison used as a stand-in for coding performance, the corner of my eye starts to twitch.

Based on these assorted tricky tests, some people are already shouting "Anthropic is so back".

Finally, Anthropic also launched a prompt library with prompts for many different use cases. If you want to learn more about what Claude 3 can do, give it a try.

Link: https://docs.anthropic.com/claude/prompt-library

The Claude 3 model series

The three models in the Claude 3 series are Claude 3 Opus, Claude 3 Sonnet and Claude 3 Haiku.

Claude 3 Opus is the most intelligent model, supporting a 200K-token context window and achieving current SOTA performance on highly complex tasks. It handles open-ended prompts and unseen scenarios with excellent fluency and human-level understanding. Claude 3 Opus shows us the limits of what is possible with generative AI.

Claude 3 Sonnet strikes an ideal balance between intelligence and speed, especially for enterprise workloads. It delivers strong performance at a lower cost than comparable models and is built for high endurance in large-scale AI deployments. Claude 3 Sonnet supports a context window of 200K tokens.

Claude 3 Haiku is the fastest and most compact model with near real-time responsiveness. Interestingly, the context window it supports is also 200k. The model is able to answer simple queries and requests at unparalleled speed, allowing users to build seamless AI experiences that mimic human interactions.

Let’s take a closer look at the features and performance of the Claude 3 series models.

Surpassing GPT-4 across the board with a new SOTA level of intelligence

As the most intelligent model in the Claude 3 series, Opus outperforms competing models on most evaluation benchmarks for AI systems, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), basic mathematics (GSM8K) and more. Opus also demonstrates near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.

Additionally, all Claude 3 models, including Opus, show enhanced capabilities in analysis and forecasting, nuanced content creation, code generation, and conversing in non-English languages such as Spanish, Japanese, and French.

The following figure shows the comparison between the Claude 3 model and competing models on multiple performance benchmarks. It can be seen that the strongest Opus is better than OpenAI's GPT-4.

Near real-time response

Claude 3 models can power live customer chats, auto-completions, and data extraction tasks where responses must be immediate and in real time.

Haiku is the fastest and most cost-effective model on the market in its intelligence category. It can read an arXiv paper (~10K tokens) dense with charts and graphs in less than three seconds.

For the vast majority of workloads, Sonnet is 2x faster than Claude 2 and Claude 2.1, with a higher level of intelligence. It excels at tasks that demand fast responses, such as knowledge retrieval or sales automation. Opus is similar in speed to Claude 2 and 2.1, but with a higher level of intelligence.

Powerful visual capabilities

Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process a wide range of visual formats, including photos, charts, graphs, and technical diagrams.

Anthropic says some of its customers have more than 50% of their knowledge bases encoded in various formats such as PDFs, flowcharts or presentation slides, so the new models' strong vision capabilities are very helpful.

Fewer refusals

Previous Claude models often made unnecessary refusals, suggesting a lack of contextual understanding. Anthropic has made meaningful progress in this area: Opus, Sonnet, and Haiku are significantly less likely than previous generations to refuse to answer prompts that border on the system's guardrails. As shown below, the Claude 3 models show a more nuanced understanding of requests, recognize truly harmful prompts, and refuse harmless prompts much less often.

Accuracy improvement

To evaluate accuracy, Anthropic used a large set of complex, factual questions that target known weaknesses in current models. Answers are classified as correct, incorrect (hallucinations), or uncertain, where the model says it does not know the answer rather than giving incorrect information. Compared to Claude 2.1, Opus doubled its accuracy (correct answers) on these challenging open-ended questions while also reducing incorrect answers.
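
The three-way grading described above can be tallied with a few lines of Python. This is only a sketch: the label names, the `answer_rates` helper and the sample grades are made up for illustration and are not Anthropic's evaluation code or data.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical label names for the three answer categories.
LABELS = ("correct", "incorrect", "unsure")


def answer_rates(grades: Iterable[str]) -> Dict[str, float]:
    """Return the share of answers that fall into each category."""
    counts = Counter(grades)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no graded answers")
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Made-up grades for ten questions, for illustration only.
    sample = ["correct"] * 6 + ["incorrect"] * 1 + ["unsure"] * 3
    print(answer_rates(sample))
```

Tracking "unsure" separately matters because a model can lower its hallucination rate either by answering correctly more often or by declining to answer, and a single accuracy number hides the difference.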

In addition to producing more trustworthy responses, Anthropic will enable citations in the Claude 3 models so that they can point to precise sentences in reference material to back up their answers.

Long context and near-perfect recall

The Claude 3 models offer a 200K context window at launch. However, Anthropic says all three models can accept inputs of more than 1 million tokens, and this capability will be made available to select customers who need enhanced processing power.

To process long-context prompts effectively, a model needs robust recall. The Needle In A Haystack (NIAH) evaluation measures a model's ability to accurately recall information from a large corpus of data. Anthropic made the benchmark more robust by using one of 30 random needle/question pairs per prompt and testing on a diverse crowdsourced corpus of documents. Claude 3 Opus achieved near-perfect recall, surpassing 99% accuracy. In some cases it even identified limitations of the evaluation itself, recognizing that the "needle" sentence appeared to have been artificially inserted into the original text.
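
The general shape of a needle-in-a-haystack test can be sketched in Python as follows. This is a simplified illustration, not Anthropic's harness: `build_niah_prompt`, `recall_rate`, the substring-based scoring and the `ask_model` callable (a stand-in for whatever function sends a prompt to a model and returns its reply) are all assumptions made for the example.

```python
import random
from typing import Callable, Sequence, Tuple

# (needle sentence, question about it, expected answer substring)
NeedlePair = Tuple[str, str, str]


def build_niah_prompt(documents: Sequence[str], needle: str,
                      question: str, depth: float) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) and append the question."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(documents))
    haystack = list(documents[:position]) + [needle] + list(documents[position:])
    return "\n\n".join(haystack) + f"\n\nQuestion: {question}"


def recall_rate(ask_model: Callable[[str], str], documents: Sequence[str],
                pairs: Sequence[NeedlePair], trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the reply contains the expected answer."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One random needle/question pair and one random depth per prompt.
        needle, question, expected = rng.choice(pairs)
        prompt = build_niah_prompt(documents, needle, question, depth=rng.random())
        if expected.lower() in ask_model(prompt).lower():
            hits += 1
    return hits / trials


if __name__ == "__main__":
    docs = [f"Filler paragraph number {i}." for i in range(50)]
    pairs = [("The secret code is 4821.", "What is the secret code?", "4821")]
    # Stub "model" that simply repeats the prompt, so the needle is always found.
    print(recall_rate(lambda prompt: prompt, docs, pairs, trials=20))
```

Drawing the needle from a pool of pairs, rather than reusing one fixed sentence, makes it harder for a model to score well by memorizing a single well-known needle.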

Safe and easy to use

Anthropic says it has established dedicated teams to track and mitigate safety risks. The company is also developing methods such as Constitutional AI to improve the safety and transparency of its models and to mitigate privacy concerns that new models may raise.

Although the Claude 3 model family has advanced on key measures of biological knowledge, cyber-related knowledge and autonomy compared to previous models, the research finds that the new models remain at AI Safety Level 2 (ASL-2).

In terms of user experience, Claude 3 is better than previous models at following complex multi-step instructions and at adhering to brand voice and response guidelines, which makes it easier to build trustworthy applications. Additionally, Anthropic says the Claude 3 models are better at producing popular structured output in formats like JSON, making it simpler to instruct Claude for use cases like natural language classification and sentiment analysis.
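
As a rough illustration of why JSON output helps with classification, the Python sketch below builds a sentiment prompt and validates the reply. The prompt wording, the `sentiment` field, the allowed labels and both helper functions are assumptions made for this example. No API call is made here, and the replies are hard-coded strings standing in for model output.

```python
import json

# Hypothetical label set for a sentiment classifier.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def build_sentiment_prompt(review: str) -> str:
    """Ask for a reply that is a single JSON object and nothing else."""
    return (
        "Classify the sentiment of the review below. Reply with only a JSON "
        'object of the form {"sentiment": "positive" | "negative" | "neutral"}.\n\n'
        f"Review: {review}"
    )


def parse_sentiment(reply: str) -> str:
    """Validate a model reply and return the sentiment label it contains."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {reply!r}") from exc
    sentiment = data.get("sentiment") if isinstance(data, dict) else None
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment value: {sentiment!r}")
    return sentiment


if __name__ == "__main__":
    print(build_sentiment_prompt("The battery died after two days."))
    # Hard-coded stand-in for a model reply.
    print(parse_sentiment('{"sentiment": "negative"}'))
```

Validating the reply before using it keeps downstream code from silently accepting free-form text when the model strays from the requested format.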

What the technical report says

Currently, Anthropic has released a 42-page technical report "The Claude 3 Model Family: Opus, Sonnet, Haiku".

Report address: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

The report covers the training data, evaluation criteria and more detailed experimental results for the Claude 3 models.

In terms of training data, the Claude 3 models are trained on a proprietary mix of data publicly available on the Internet as of August 2023, non-public data from third parties, data provided by data labeling services and paid contractors, and data generated internally at Anthropic.

Claude 3 Series models have been extensively evaluated on multiple metrics including:

Reasoning ability
Multi-language ability
Long context
Reliability/factuality
Multi-modal ability

First are the results on reasoning, programming and question-answering tasks. The Claude 3 models were compared with competing models on a series of industry-standard benchmarks covering reasoning, reading comprehension, mathematics, science and programming. The results show that they not only surpass their predecessors but also achieve new SOTA results in most cases.

Anthropic evaluated the Claude 3 models on the Law School Admission Test (LSAT), the Multistate Bar Examination (MBE), the American Mathematics Competition 2023 contests, and the Graduate Record Examination (GRE) General Test. The specific results are shown in Table 2 below.

The Claude 3 models have multimodal (image and video-frame input) capabilities and have made significant progress on complex multimodal reasoning challenges that go beyond simple text understanding.

A typical example is the Claude 3 models' performance on the AI2D science diagram benchmark, a visual question-answering evaluation that involves parsing diagrams and answering corresponding questions in multiple-choice format.

Claude 3 Sonnet achieved SOTA in the 0-shot setting with 89.2%, followed by Claude 3 Opus (88.3%) and Claude 3 Haiku (80.6%). The specific results are shown in Table 3 below.

Fu Yao, a doctoral student at the University of Edinburgh, offered his own analysis of the technical report right away.

First, in his view, the models evaluated are essentially indistinguishable on benchmarks such as MMLU / GSM8K / HumanEval. What really deserves attention is why the best model still has a 5% error rate on GSM8K.

He believes that what truly separates the models are MATH and GPQA. These extremely hard problems are what AI models should target next.

The areas where improvements are greater compared to Claude’s previous model are finance and medicine.

On vision, Claude 3's visual OCR capabilities show its huge potential for data collection.

In addition, he also found some other trends:

Judging from current benchmarks and hands-on experience, Claude 3 has made great strides in intelligence, multimodal capability and speed. As the new models are further optimized and deployed, we may see a more diverse large-model ecosystem.

Blog address: https://www.anthropic.com/news/claude-3-family

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
