


Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked
Has the text-only direction of large models been pushed to its limit?
Last night, OpenAI's biggest competitor Anthropic released a new generation of large AI models: the Claude 3 series.
The series contains three models, ranked from weakest to strongest: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Opus, the most capable, scores higher than GPT-4 and Gemini 1.0 Ultra on multiple benchmarks, setting new industry marks across dimensions such as mathematics, programming, multilingual understanding, and vision.
Anthropic states that Claude 3 Opus possesses knowledge at the level of a human undergraduate.
With this release, Claude supports multimodal input for the first time (the Opus version scores 59.4% on MMMU, exceeding GPT-4V and on par with Gemini 1.0 Ultra). Users can now upload photos, charts, documents, and other unstructured data for the AI to analyze and answer questions about.
In addition, all three models retain the Claude series' signature advantage: a long context window. They launch with a 200K-token context window, but Anthropic says all three can accept inputs of 1 million tokens (for select customers), roughly the length of the English text of Moby-Dick or Harry Potter and the Deathly Hallows.
In terms of pricing, however, the most powerful Claude 3 model is also much more expensive than GPT-4 Turbo: GPT-4 Turbo charges $10/$30 per million tokens of input/output, while Claude 3 Opus charges $15/$75.
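As a rough back-of-the-envelope comparison, here is a minimal sketch using the rates quoted above; the token counts in the example are made up for illustration:

```python
# Rough per-request cost comparison at the quoted per-million-token rates.
# Prices are taken from the article; the token counts below are made-up examples.

PRICES = {  # (input, output) in USD per 1M tokens
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3-opus": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example: a 5,000-token prompt that produces a 1,000-token reply.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 5_000, 1_000):.4f}")
```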
Opus and Sonnet are now available on claude.ai and through the Claude API, with Haiku coming soon. Amazon Web Services announced that the new models are also available on Amazon Bedrock. Anthropic has also published an official demo.
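For developers, a minimal sketch of calling Opus through Anthropic's Python SDK might look like the following; the model ID reflects Anthropic's launch-time naming and is an assumption here:

```python
# Minimal Claude API call via Anthropic's Python SDK (pip install anthropic).
# Assumes the ANTHROPIC_API_KEY environment variable is set; the model ID
# follows Anthropic's launch-time naming and may change.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Which months have 28 days?"}],
)
print(message.content[0].text)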
After Anthropic's official announcement, many researchers who got early access shared their experiences. Some say Claude 3 Sonnet solved a puzzle that previously only GPT-4 could solve.
However, others say that in actual use, Claude 3 does not completely beat GPT-4.
Hands-on testing of Claude 3
Address: https://claude.ai/
Does Claude 3 really surpass GPT-4, as officially claimed? So far, most early testers think the claim holds some water.
Here are some of the hands-on results:
First, a brain teaser: which month has twenty-eight days? The correct answer is every month, since every month has at least twenty-eight. It seems Claude 3 is not yet good at this kind of question.
Next we tested the areas Claude 3 is said to excel at. The official introduction says Claude is good at understanding and processing images, including extracting text from images, converting UI mockups into front-end code, understanding complex equations, transcribing handwritten notes, and more.
Large models often have trouble distinguishing fried chicken from curly-haired "teddy" poodles. When we fed in a picture containing both, Claude 3 answered: "This image is a collage of dogs and chicken nuggets or nuggets that bear a striking resemblance to the dogs themselves..." A passing grade.
When then asked how many figures were in another image, Claude 3 also answered correctly: "This animation depicts seven small cartoon characters."
Claude 3 can extract text from photos; even vertically set Chinese and Japanese text is recognized correctly:
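To reproduce tests like this through the API, an image is passed as a base64-encoded content block alongside the text prompt. A minimal sketch follows; the content-block shape matches Anthropic's documented format, and "photo.png" is a hypothetical local file:

```python
# Send an image plus a question to Claude 3 (vision input).
# The base64 content-block format follows Anthropic's API docs;
# "photo.png" is a hypothetical local file used for illustration.
import base64
import anthropic

client = anthropic.Anthropic()
with open("photo.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text",
             "text": "Extract all text in this image, preserving the reading order."},
        ],
    }],
)
print(message.content[0].text)
```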
How does it handle internet memes? On an optical-illusion image, GPT-4 and Claude 3 gave opposite guesses:
Which one is correct?
Beyond understanding images, Claude is also capable with long texts. The full series released this time provides a 200K context window and can accept inputs of more than 1 million tokens.
How well does it work? We gave it a recent paper from Microsoft and the University of Chinese Academy of Sciences, "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits," and asked it to summarize the main points as a numbered list. We timed it: the full answer took about 15 seconds to produce.
That is just Claude 3 Sonnet's output speed; the Claude Pro version is faster, but costs $20 a month.
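A sketch of how such a timing test might be scripted against the API; the paper text is assumed to be already extracted into a local file, and "bitnet_paper.txt" is a hypothetical file name:

```python
# Time a long-document summarization request end to end.
# "bitnet_paper.txt" is a hypothetical file holding the paper's extracted text.
import time
import anthropic

client = anthropic.Anthropic()
with open("bitnet_paper.txt", encoding="utf-8") as f:
    paper = f.read()

start = time.perf_counter()
message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Summarize the key points of this paper as a numbered list:\n\n{paper}",
    }],
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f} s\n{message.content[0].text}")
```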
Note that Claude currently limits uploads to 10MB per file; anything larger triggers a prompt:
In the Claude 3 blog post, Anthropic said the new models' coding capabilities are greatly improved. One tester threw basic ASCII code straight at Claude and found it handled it with ease:
Results like this suggest Claude 3's coding ability may indeed be stronger than GPT-4's.
A little while ago, Karpathy, freshly departed from OpenAI, proposed a "tokenizer" challenge: feed his 2-hour-13-minute tokenizer tutorial video to an LLM and have it translated into the format of a book chapter or blog post about tokenizers.
Claude 3 rose to the challenge. Below are the results posted by Anthropic research engineer Emmanuel Ameisen:
Perhaps because he no longer has a stake in the matter, Karpathy gave a fairly thorough and objective evaluation:
Stylistically, it's actually quite good! If you look closely, you'll notice some subtle issues/hallucinations. Regardless, it's impressive that a system nearly works out of the box here. I'm looking forward to playing with Claude 3 more; it looks like a strong model.
If I have anything relevant to add, it's that people should be extremely careful with evaluation comparisons, not only because the evaluations themselves are worse than you think, but also because many evaluation results are overfitted in unspecified ways, and because the comparisons drawn can be misleading. GPT-4's coding score (HumanEval) is not 67%. Every time I see that number used to stand in for coding performance, my eye starts to twitch.
Based on all these tricky test results, some people are already shouting "Anthropic is so back."
Finally, Anthropic has also launched a prompt library covering prompts in many areas. If you want to explore Claude 3's new features, give it a try.
Link: https://docs.anthropic.com/claude/prompt-library
The Claude 3 model series
The Claude 3 series comes in three versions: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.
Claude 3 Opus is the most intelligent of the three, supporting a 200K-token context window and achieving current SOTA performance on highly complex tasks. The model handles open-ended prompts and unseen scenarios with excellent fluency and human-level understanding. Claude 3 Opus shows the outer limits of what is possible with generative AI.
Claude 3 Sonnet strikes the ideal balance between intelligence and speed, especially for enterprise workloads. It delivers strong performance at a lower cost than comparable models and is engineered for high endurance in large-scale AI deployments. It, too, supports a 200K-token context window.
Claude 3 Haiku is the fastest and most compact model, with near-real-time responsiveness. Notably, it also supports a 200K context window. The model answers simple queries and requests with unmatched speed, letting users build seamless AI experiences that mimic human interaction.
Let’s take a closer look at the features and performance of the Claude 3 series models.
Comprehensively surpass GPT-4 and achieve a new SOTA level of intelligence
As the most intelligent model in the Claude 3 series, Opus outperforms competing products on most evaluation benchmarks, including undergraduate-level expert knowledge (MMLU), graduate-level expert reasoning (GPQA), and basic mathematics (GSM8K). Moreover, Opus demonstrates near-human-level understanding and fluency on complex tasks, leading the frontier of general intelligence.
Additionally, all Claude 3 series models, including Opus, show enhanced capabilities in analysis and forecasting, nuanced content creation, code generation, and conversing in non-English languages such as Spanish, Japanese, and French.
The figure below compares the Claude 3 models with competing models across multiple benchmarks; the strongest, Opus, beats OpenAI's GPT-4.
Near real-time response
The Claude 3 models can power live customer chats, auto-completions, and data-extraction tasks, where responses must be immediate and in real time.
Haiku is the fastest and most cost-effective model on the market for its intelligence class. It can read an information-dense arXiv paper (~10K tokens), charts and graphs included, in under three seconds.
For the vast majority of workloads, Sonnet is twice as fast as Claude 2 and Claude 2.1 while being more intelligent. It excels at tasks demanding rapid responses, such as knowledge retrieval or sales automation. Opus delivers speeds similar to Claude 2 and 2.1, but at a much higher level of intelligence.
Powerful visual capabilities
The Claude 3 models have sophisticated vision capabilities on par with other leading models. They can process data in a variety of visual formats, including photos, charts, graphs, and technical diagrams.
Anthropic says that for some of its customers, more than 50% of the knowledge base is encoded in formats such as PDFs, flowcharts, or presentation slides, so the new models' strong visual capabilities are a real help.
Fewer unnecessary refusals
Previous Claude models often refused unnecessarily, suggesting a lack of contextual understanding. Anthropic has made meaningful progress here: Opus, Sonnet, and Haiku are significantly less likely to refuse to answer than previous generations, even when prompts brush up against the system's guardrails. As shown below, the Claude 3 models exhibit a more nuanced understanding of requests, recognize genuinely harmful prompts, and refuse harmless ones far less often.
Accuracy improvement
To evaluate model accuracy, Anthropic used a large set of complex, factual questions targeting known weaknesses in current models. Answers are classified as correct, incorrect (i.e., hallucinated), or unsure, where the model admits it does not know rather than supplying wrong information. Compared to Claude 2.1, Opus doubled accuracy (correct answers) on these challenging open-ended questions while also reducing incorrect answers.
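A minimal sketch of such a three-way scoring scheme follows; the naive string matching here is purely illustrative, since the article does not describe Anthropic's actual grading method:

```python
# Three-way grading: correct / incorrect (hallucination) / unsure.
# The matching logic below is a naive illustration, not Anthropic's method.
def grade(answer: str, gold: str) -> str:
    text = answer.lower()
    if any(p in text for p in ("i don't know", "i'm not sure", "uncertain")):
        return "unsure"  # the model admits it does not know
    return "correct" if gold.lower() in text else "incorrect"

examples = [
    ("The Eiffel Tower is 330 metres tall.", "330"),
    ("It is 500 metres tall.", "330"),
    ("I'm not sure of its exact height.", "330"),
]
for ans, gold in examples:
    print(grade(ans, gold), "-", ans)
```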
Beyond producing more trustworthy responses, Anthropic will soon enable citations in the Claude 3 models so they can point to the precise sentences in reference material that substantiate their answers.
Long context and near-perfect recall
The Claude 3 series launches with a 200K context window. However, Anthropic says all three models can accept inputs of more than 1 million tokens, a capability it will offer to select customers who need enhanced processing power.
Handling long contextual prompts effectively requires strong recall. The Needle In A Haystack (NIAH) evaluation measures a model's ability to accurately recall information buried in a large corpus. Anthropic strengthened this benchmark's robustness by using one of 30 random needle/question pairs per prompt and testing on a diverse crowdsourced set of documents. Claude 3 Opus achieved near-perfect recall, surpassing 99% accuracy. In some cases, it even identified limitations of the evaluation itself, recognizing that the "needle" sentence appeared to have been artificially inserted into the original text.
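A minimal sketch of how a needle-in-a-haystack probe works; the filler text and needle below are toy examples, and real NIAH runs use far longer documents and many needle/question pairs:

```python
# Toy needle-in-a-haystack probe: bury one fact in filler text,
# then ask the model to recall it. Filler and needle are toy examples.
import anthropic

client = anthropic.Anthropic()
filler = "The quick brown fox jumps over the lazy dog. " * 2000
needle = "The secret passphrase is 'violet meridian'."
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2:]

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": f"{haystack}\n\nWhat is the secret passphrase?",
    }],
)
print(message.content[0].text)  # should recall 'violet meridian'
```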
Safe and easy to use
Anthropic says it has dedicated teams that track and mitigate safety risks. The company is also developing methods such as Constitutional AI to improve model safety and transparency and to mitigate privacy issues that new models may raise.
While the Claude 3 series has progressed on key indicators such as biological knowledge, cyber-related knowledge, and autonomy compared with previous models, the new models remain within AI Safety Level 2 (ASL-2) according to the company's research.
On the user-experience side, Claude 3 follows complex multi-step instructions better than previous models and adheres more reliably to brand and response guidelines, making it easier to build trustworthy applications. Anthropic also says the Claude 3 models are better at producing popular structured output formats like JSON, simplifying use cases such as natural-language classification and sentiment analysis, as the sketch below illustrates.
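A sketch of nudging Claude toward JSON output for sentiment classification; the prompt pattern is a common convention, not an official Anthropic recipe:

```python
# Ask Claude for structured JSON output (sentiment classification).
# The prompt pattern is a common convention, not an official recipe.
import json
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": (
            'Classify the sentiment of this review and reply with only JSON '
            'of the form {"sentiment": "positive|negative|neutral", "confidence": 0-1}.\n\n'
            "Review: The battery life is fantastic, but the screen scratches easily."
        ),
    }],
)
# Assumes the model complies; production code should handle parse failures.
result = json.loads(message.content[0].text)
print(result["sentiment"], result["confidence"])
```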
What the technical report says
Currently, Anthropic has released a 42-page technical report "The Claude 3 Model Family: Opus, Sonnet, Haiku".
Report address: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
It lays out the training data, evaluation criteria, and more detailed experimental results for the Claude 3 models.
On training data: the Claude 3 models were trained on a proprietary mix of publicly available internet data as of August 2023, non-public data from third parties, data provided by data-labeling services and paid contractors, and data generated internally.
The Claude 3 models were extensively evaluated along several dimensions, including:
- Reasoning ability
- Multi-language ability
- Long context
- Reliability/factuality
- Multi-modal ability
First come the results on reasoning, coding, and question-answering tasks. The Claude 3 models were compared against competitors on a set of industry-standard benchmarks covering reasoning, reading comprehension, math, science, and coding; they not only beat their predecessors but set new SOTA results in most cases.
Anthropic also evaluated the Claude 3 models on the Law School Admission Test (LSAT), the Multistate Bar Examination (MBE), the 2023 American Mathematics Competition (AMC), and the Graduate Record Examination (GRE) General Test; the results are shown in Table 2 below.
The Claude 3 models accept multimodal input (images and video frames) and have made significant progress on complex multimodal reasoning challenges that go beyond simple text understanding.
A typical example is their performance on the AI2D science-diagram benchmark, a visual question-answering evaluation that involves parsing a diagram and answering a multiple-choice question about it.
Claude 3 Sonnet set the SOTA in the 0-shot setting at 89.2%, followed by Claude 3 Opus (88.3%) and Claude 3 Haiku (80.6%); see Table 3 below for details.
In response to the technical report, Yao Fu, a PhD student at the University of Edinburgh, quickly offered his own analysis.
First, in his view, the evaluated models are basically indistinguishable on metrics like MMLU / GSM8K / HumanEval; what really deserves attention is why the best model still shows roughly 5% error on GSM8K.
He believes the benchmarks that genuinely separate the models are MATH and GPQA; these extremely hard problems are what AI models should target next.
The areas with the biggest gains over previous Claude models are finance and medicine.
On vision, Claude 3's OCR capabilities hint at huge potential for data collection.
In addition, he also found some other trends:
Judging from current benchmarks and hands-on experience, Claude 3 has made big strides in intelligence, multimodal capability, and speed. As the new series is further optimized and deployed, we may see a more diverse large-model ecosystem.
Blog address: https://www.anthropic.com/news/claude-3-family