Home  >  Article  >  Technology peripherals  >  Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked

Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked

WBOY
WBOYforward
2024-03-06 13:00:18332browse

The plain text direction of the large model has been rolled to the end?

Last night, OpenAI’s biggest competitor Anthropic released a new generation of AI large model series - Claude 3.

This series contains three models, ranked from weakest to strongest, namely Claude 3 Haiku, Claude 3 Sonnet and Claude 3 Opus. Among them, Opus, the most capable, has scored higher than GPT-4 and Gemini 1.0 Ultra in multiple benchmark tests, setting new industry benchmarks in multiple dimensions such as mathematics, programming, multi-language understanding, and vision.

Anthropic states that Claude 3 Opus possesses knowledge at the level of a human undergraduate.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

After the release of the new model, Claude brings support for multi-modal capabilities for the first time (the Opus version has an MMMU score of 59.4%, exceeding GPT-4V, on par with Gemini 1.0 Ultra). Users can now upload photos, charts, documents and other types of unstructured data for AI to analyze and answer.

In addition, these three models also retain the consistent advantages of the Claude series models, namely the long context window. The initial stage supports a context window of 200K tokens, but Anthropic said that all three models support a context input of 1 million tokens (for specific customers), which is equivalent to the English version of "Moby Dick" or "Harry Potter and the Deathly Hallows" 》length.

However, in terms of pricing, the most powerful Claude 3 is also much more expensive than GPT-4 Turbo: GPT-4 Turbo has an input/output charge of 10/per million tokens. $30; while the Claude 3 Opus is $15/75.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Opus and Sonnet models are now available in claude.ai and the Claude API, with Haiku models coming soon. Amazon Cloud Technologies has announced that their new model is now available on Amazon Bedrock. Anthropic announced the official demo, the details are as follows:

After Anthropic’s official announcement, many researchers who got the opportunity to try it out also shared their experiences. Some say that Claude 3 Sonnet has solved a puzzle that only GPT-4 could solve before.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

However, some people say that in terms of actual experience, Claude 3 did not completely defeat GPT-4.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

First-hand actual measurement of Claude3

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Address: https ://claude.ai/

Does Claude 3 really surpass GPT-4 in performance as officially claimed? At present, most people think that it does have some meaning.

The following are some of the actual measurement results:

First of all, let’s do a brain teaser. Which month has twenty-eight days? The actual correct answer is every month. It seems that Claude 3 is not good at doing this kind of questions yet.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Then we tested the areas that Claude 3 is good at. From the official introduction, we can see that Claude is good at "understanding and processing images", including Extract text from images, convert UI to front-end code, understand complex equations, transcribe handwritten notes, and more.

For large models, it is often difficult to distinguish between fried chicken and teddy. When we input a picture containing teddy and fried chicken, Claude 3 gave this The answer "This image is a collage of dogs and chicken nuggets or nuggets that bear a striking resemblance to the dogs themselves..." is a passing question.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Then asked how many people were in it, Claude 3 also answered correctly, "This animation depicts seven small cartoon characters."

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 can extract text from photos, even the vertical sequence of Chinese and Japanese can be correctly recognized:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

If I use memes from the Internet, how will it respond? Regarding the picture of visual error, GPT-4 and Claude3 gave opposite guesses:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Which one is correct?

In addition to understanding images, Claude is also capable of processing long texts. The full series of large models released this time can provide 200k context windows and accept more than 1 million token inputs.

What is the effect? We gave it a recent paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" published by Microsoft and the National University of Science and Technology, and asked it to summarize the main points of the article in the form of 1, 2, and 3. We recorded it. Time, the time to output the overall answer is about 15 seconds.

But this is only the output effect of Claude 3 Sonnet. If you use the Claude Pro version, it will be faster, but it will cost $20 a month.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

It is worth noting that Claude now requires that the size of the uploaded article does not exceed 10MB. If it exceeds, there will be a prompt:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In Claude 3's blog, Anthropic proposed that the coding capabilities of the new model have been greatly improved. Someone directly threw the basic ASCII code to Claude and found that it was stress-free:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

We should be able to confirm that Claude 3 has stronger coding capabilities than GPT-4.

Some time ago, Karpathy, who had just resigned from OpenAI, proposed a "word segmenter" challenge. Specifically, he put his 2 hour and 13 minute tutorial video into LLM and had it translated into the format of a book chapter or blog post about tokenizers.

Faced with this task, Claude 3 took it. The following are the results posted by AnthropicAI research engineer Emmanuel Ameisen:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

GPT-4时代已过?全球网友实测Claude 3,只有震撼


Perhaps it is no longer related to interests, Karpathy gave a relatively full and objective evaluation:

From a style point of view, it is indeed quite good! If you look closely, you'll notice some subtle issues/illusions. Regardless, it's impressive to have a system that works almost out of the box. I'm looking forward to playing more with the Claude 3, it looks like a strong model.

If there's anything relevant I have to say, it's that people should be extremely careful when making assessment comparisons, and not just because the assessments themselves are worse than you think , but also because many evaluation results are overfitted in undefined ways, and because the comparisons made can be misleading. The encoding rate (HumanEval) of GPT-4 is not 67%. Whenever I see this comparison used in place of coding performance, the corners of my eyes start to twitch.

Based on the above various tricky test results, some people have already shouted "Anthropic is so back".

Finally, anthropopic also launched a prompt library that contains prompt content in multiple directions. If you want to learn more about Claude 3’s new features, give it a try.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Link: https://docs.anthropic.com/claude/prompt-library

Claude 3 Series Model

## The three versions of the #Claude 3 series models are Claude 3 Opus, Claude 3 Sonnet and Claude 3 Haiku.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Among them, Claude 3 Opus is the most intelligent model, supporting a 200k tokens context window and achieving current SOTA performance on highly complex tasks. . The model handles open prompts and unseen scenes with excellent fluency and human-level understanding. Claude 3 Opus shows us the limits of what is possible with generative AI.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 Sonnet delivers the ideal balance between intelligence and speed, especially for enterprise workloads. It delivers powerful performance at a lower cost than similar models and is designed for high durability in large-scale AI deployments. Claude 3 Sonnet supports a context window of 200k tokens.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 Haiku is the fastest and most compact model with near real-time responsiveness. Interestingly, the context window it supports is also 200k. The model is able to answer simple queries and requests at unparalleled speed, allowing users to build seamless AI experiences that mimic human interactions.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Let’s take a closer look at the features and performance of the Claude 3 series models.

Comprehensively surpass GPT-4 and achieve a new SOTA level of intelligence

As the model with the highest level of intelligence in the Claude 3 series, Opus has the highest level of intelligence in the AI ​​system It is better than competing products on most evaluation benchmarks, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K) and other benchmarks. Moreover, Opus demonstrates near-human-level understanding and fluency on complex tasks, leading the frontier of general intelligence.

Additionally, all Claude 3 Series models, including Opus, feature performance in analytics and predictions, granular content creation, code generation, and conversation in non-English languages ​​such as Spanish, Japanese, and French Enhanced capabilities.

The following figure shows the comparison between the Claude 3 model and competing models on multiple performance benchmarks. It can be seen that the strongest Opus is better than OpenAI's GPT-4.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Near real-time response

Claude 3 model can support real-time customer chat , automated replenishment, and data extraction are tasks where response must be immediate and real-time.

Haiku is the fastest and most cost-effective model on the market in the smart category. It can read an arXiv platform paper (~10k tokens) containing dense chart and graphical information in less than three seconds.

For the vast majority of jobs, Sonnet is 2x faster and more intelligent than Claude 2 and Claude 2.1. It excels at tasks that require fast responses, such as knowledge retrieval or sales automation. The Opus is similar in speed to the Claude 2 and 2.1, but with a higher level of intelligence.

Powerful visual capabilities

Claude 3 has features comparable to other head models Complex visual functions. They can process data in a variety of visual formats, including photos, charts, graphs, and technical diagrams.

Anthropic says some of their customers have more than 50% of their knowledge bases programmed in various data formats, such as PDFs, flowcharts or presentation slides. Therefore, the new model's powerful visual capabilities are very helpful.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Fewer rejection replies

The previous Claude model often made unnecessary rejections, indicating a lack of contextual understanding by the model. Anthropic has made meaningful progress in this area: Opus, Sonnet, and Haiku are significantly less likely to reject an answer than previous generations of models, even when user prompts are close to the system's bottom line. As shown below, the Claude 3 model exhibits a more nuanced understanding of requests, is able to identify truly harmful prompts, and refuses to answer harmless prompts much less frequently.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Accuracy improvement

To evaluate the accuracy of the model, Anthropic A large number of complex, factual questions are used to address known weaknesses in the current model. Anthropic classifies answers into correct answers, incorrect answers (or hallucinations), and uncertain answers, where the model does not know the answer, rather than providing incorrect information. Compared to Claude 2.1, Opus doubled the accuracy (or correct answers) on these challenging open-ended questions while also reducing incorrect answers.

In addition to producing more trustworthy responses, Anthropic will enable citations in the Claude 3 model so that the model can point to precise sentences in reference material to substantiate responses.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

##Long context and near-perfect recall

Claude 3 Series Models will initially offer 200K context windows at launch. However, officials say that all three models are capable of receiving inputs of more than 1 million tokens, and this capability will be provided to specific users who require enhanced processing capabilities.

In order to effectively handle long contextual cues, the model needs strong recall capabilities. The Needle In A Haystack (NIAH) assessment measures a model's ability to accurately recall information from large amounts of data. Anthropic enhanced the robustness of this benchmark by testing it on a different crowdsourced document base using 30 random Needle/question pairs in each prompt. Claude 3 Opus not only achieves near-perfect recall but also exceeds 99% accuracy. And in some cases, it even identified limitations in the assessment itself, realizing that the "needle" sentences appeared to have been artificially inserted into the original text.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Safe and easy to use

Anthropic said , which has established dedicated teams to track and mitigate security risks. The company is also developing methods such as Constitutional AI to improve model security and transparency and mitigate privacy concerns that new models may raise.

While the Claude 3 model series has made progress in key indicators of biological knowledge, network-related knowledge and autonomy compared to previous models, according to the research, the new model is at the forefront of AI Within Security Level 2 (ASL-2).

In terms of user experience, Claude 3 is better at following complex multi-step instructions than previous models, and is better able to adhere to brand and response guidelines, so that it can better develop trustworthy applications. Additionally, Anthropic says Claude 3 models are now better at producing popular structured output in formats like JSON, making it easier to guide Claude for use cases like natural language classification and sentiment analysis.

What is written in the technical report

Currently, Anthropic has released a 42-page technical report "The Claude 3 Model Family: Opus, Sonnet, Haiku".

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Report address: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

We saw the training data, evaluation criteria and more detailed experimental results of the Claude 3 series models.

In terms of training data, Claude 3 series models are trained on a proprietary mix of data publicly available on the Internet as of August 2023, as well as non-public data from third-party, data labeling services Data provided by vendors and paid contractors, data within Claude.

Claude 3 Series models have been extensively evaluated on multiple metrics including:

  • Reasoning ability
  • Multi-language ability
  • Long context
  • Reliability/factuality
  • Multi-modal ability

The first is the evaluation results on reasoning, programming and question and answer tasks , Claude 3 series models were compared with competing models on a series of industry-standard benchmarks for reasoning, reading comprehension, mathematics, science and programming. The results showed that they not only surpassed their previous models, but also achieved new SOTA in most cases. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Anthropic on the Law School Admission Test (LSAT), Multistate Bar Examination (MBE), American Mathematical Competition 2023 Math Competition, and Graduate Record Examination The Claude 3 series models were evaluated on the (GRE) General Examination, and the specific results are shown in Table 2 below.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 series models have multi-modal (image and video frame input) capabilities and are great at solving complex multi-modal problems beyond simple text understanding Significant progress has been made on inference challenges.

A typical example is the performance of the Claude 3 model on the AI2D Scientific Chart Benchmark, a visual question-and-answer assessment that involves chart parsing and answering corresponding questions in a multiple-choice format .

Claude 3 Sonnet achieved SOTA level in 0-shot setting - 89.2%, followed by Claude 3 Opus (88.3%) and Claude 3 Haiku (80.6%), specific results As shown in Table 3 below.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

## In response to this technical report, Fu Yao, a doctoral student at the University of Edinburgh, gave his own analysis immediately.

First of all, in his opinion, the several models evaluated have basically no distinction in several indicators such as MMLU / GSM8K / HumanEval. What really needs to be concerned about is why the best one is The model still has 5% error on GSM8K.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

He believes that what can really distinguish the models is MATH and GPQA. These super difficult problems are the goals that AI models should aim for next. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

The areas where improvements are greater compared to Claude’s previous model are finance and medicine.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In terms of vision, the visual OCR capabilities of Claude 3 make people see its huge potential in data collection. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In addition, he also found some other trends:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Judging from the current evaluation benchmarks and experience, Claude 3 has made great strides in terms of intelligence level, multi-modal capabilities and speed. improvement. With the further optimization and application of the new series of models, we may see a more diversified large model ecosystem.

Blog address: https://www.anthropic.com/news/claude-3-family

The above is the detailed content of Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:51cto.com. If there is any infringement, please contact admin@php.cn delete