Introduction
Currently, new evaluations of RAG (Retrieval-Augmented Generation) systems seem to be released every day, and many of them focus on the retrieval stage of the framework. However, the generative aspect, i.e., how the model synthesizes and expresses the retrieved information, may be equally important in practice. Many practical use cases show that such a system must not only return data from the context, but also transform that information into a more sophisticated response.
To this end, we conducted several experiments to evaluate and compare the generation capabilities of three models: GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methods and results, covers the nuances of these models that we encountered along the way, and explains why all of this matters to those building with generative AI.
Readers who want to reproduce these experiments can find everything they need in the GitHub repository (https://github.com/Arize-ai/LLMTest_NeedleInAHaystack).
Supplementary Notes
- Although initial findings suggested that Claude outperformed GPT-4, subsequent testing showed that, with strategic prompt engineering, GPT-4 performed better across a broader set of evaluations. In short, model behavior and prompt engineering remain thorny, inherent issues in RAG systems.
- Simply adding "Please explain yourself, then answer the question" to the prompt template significantly improves (more than doubles) GPT-4's performance. When an LLM talks its way to an answer, articulating the reasoning appears to help it develop the idea further; by explaining, the model may reinforce the correct answer in embedding/attention space. A minimal sketch of this prompt adjustment is shown below.
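To make the adjustment concrete, here is a minimal sketch of the two prompt variants in Python. The template wording and the build_prompt helper are illustrative assumptions, not the exact templates used in these experiments:

```python
# Illustrative prompt templates; the exact wording in the experiments may
# differ. The key change is the added "explain yourself" instruction.
BASE_TEMPLATE = """{context}

{question}
Answer with only the final result."""

EXPLAIN_TEMPLATE = """{context}

{question}
Please explain yourself, then answer the question."""


def build_prompt(context: str, question: str, explain: bool = True) -> str:
    """Render the prompt; the explain variant more than doubled GPT-4's score."""
    template = EXPLAIN_TEMPLATE if explain else BASE_TEMPLATE
    return template.format(context=context, question=question)
```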
The importance of the generation stage in RAG
Figure 1: Chart created by the author
While the retrieval component of a retrieval-augmented generation system is responsible for identifying and fetching the most relevant information, it is the generation stage that takes this raw data and transforms it into a coherent, meaningful, and contextually appropriate response. The generation step synthesizes the retrieved information, fills in the gaps, and presents it in a way that is easy to understand and relevant to the user's query. By doing so, it turns fragments of retrieved context into a complete, understandable answer and gives users a way to explore and interrogate the relevant information in more depth.
In many real-world applications, the value of RAG systems lies not only in their ability to locate specific facts or information, but also in their ability to integrate and contextualize information within a broader framework. The generation phase enables RAG systems to go beyond simple fact retrieval and provide truly intelligent and adaptive responses.
Test #1: Date Mapping
The initial test we ran involved generating a date string from two randomly retrieved numbers: one determining the month and the other the day. The model's task was to:
- Retrieve random number #1
- Isolate its last digit and increment it by 1
- Use the result as the month of our date string
- Retrieve random number #2
- Use random number #2 as the day of the date string
For example, random numbers 4827143 and 17 represent April 17th.
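For reference, here is a minimal sketch of how the ground-truth answer for this task can be computed. The helper is illustrative and is not taken from the benchmark repository:

```python
import calendar


def expected_date(number1: int, number2: int) -> str:
    """Ground truth for the date-mapping task: the month comes from the
    last digit of number1 plus one; the day is number2 itself."""
    month_index = number1 % 10 + 1                 # isolate last digit, increment by 1
    month_name = calendar.month_name[month_index]  # 1 -> "January", ..., 12 -> "December"
    return f"{month_name} {number2}"


print(expected_date(4827143, 17))  # -> "April 17"
```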
The numbers were placed at different depths within contexts of varying lengths. Initially, the models had a rather difficult time with this task.
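As a rough illustration of this setup, the following sketch places a "needle" at a chosen depth in a filler context. It is a simplified, character-level version; a real harness, such as the one in the linked repository, would typically work at the token level and respect sentence boundaries:

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Place the needle at roughly the given depth of the context
    (0 = very beginning, 100 = very end)."""
    position = int(len(haystack) * depth_percent / 100)
    return haystack[:position] + needle + " " + haystack[position:]


# Example: a 25%-depth placement inside a long filler context.
context = insert_needle("Lorem ipsum dolor sit amet. " * 400,
                        "The first magic number is 4827143. ", 25)
```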
Figure 2: Initial test results
While both models performed poorly, Claude 2.1 did significantly better than GPT-4 in our initial tests, with a success rate nearly four times higher. Here, the verbose nature of the Claude model, which provides detailed, explanatory answers, seemed to give it a clear advantage, yielding more accurate results than GPT-4's initially terse answers.
Motivated by these unexpected results, we introduced a new variable into the experiment: we instructed GPT-4 to "explain yourself, then answer the question," a prompt that encouraged the kind of detailed responses the Claude model produces naturally. The impact of this small adjustment proved far-reaching.
Figure 3: Initial test results with the targeted prompt
The performance of the GPT-4 model improved significantly, achieving perfect results in subsequent tests. The Claude model's results also improved.
This experiment not only highlights differences in how language models handle generation tasks, but also demonstrates the potential impact of prompt engineering on their performance. Verbosity appears to be Claude's strength, and it turned out to be a strategy that GPT-4 could replicate, suggesting that the way a model works through and presents its reasoning can significantly affect its accuracy on generation tasks. Overall, across all of our experiments, including the seemingly small "explain yourself" sentence in the prompt played a role in improving model performance.
Further tests and results
Figure 4: Four further tests used to evaluate the generation
We ran four more tests to evaluate the ability of mainstream models to synthesize and convert retrieved information into various formats (reference implementations for several of these transformations are sketched after the list):
- String concatenation: Combine text fragments into coherent strings, testing the model's basic text-manipulation skills.
- Currency formatting: Format numbers as currency, round them, and calculate percentage changes, evaluating the model's precision and ability to handle numeric data.
- Date mapping: Convert numeric representations into month names and days, which requires mixing retrieval with contextual understanding.
- Modulo arithmetic: Perform more complex numeric operations, testing the model's mathematical generation capabilities.
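For reference, here is a minimal sketch of what the ground-truth transformations for three of these checks might look like. The function names, output formats, and rounding rules are illustrative assumptions rather than the benchmark's actual code:

```python
def format_currency(value: float) -> str:
    """Round to two decimals and render as a dollar string."""
    return f"${value:,.2f}"


def percent_change(old: float, new: float) -> str:
    """Percentage change between two retrieved numbers, one decimal place."""
    return f"{(new - old) / old * 100:.1f}%"


def modulo_answer(a: int, b: int) -> int:
    """Ground truth for a simple modulo-arithmetic question."""
    return a % b


print(format_currency(1234.5))  # -> "$1,234.50"
print(percent_change(80, 92))   # -> "15.0%"
print(modulo_answer(17, 5))     # -> 2
```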
As expected, every model performed strongly on string concatenation, reinforcing the prior understanding that text manipulation is a fundamental strength of language models.
Figure 5: Currency formatting test results
In the currency formatting test, Claude 3 and GPT-4 performed almost flawlessly, while Claude 2.1's performance was generally poorer. Accuracy did not vary much across token lengths, but it was generally lower when the needle sat closer to the beginning of the context window.
Figure 6: Results of the official Needle in a Haystack test
Despite its excellent results in the first generation test, Claude 3's accuracy decreased in a retrieval-only experiment. In theory, simply retrieving numbers should be easier than manipulating them, which makes this drop in performance surprising and an area we plan to test further. If anything, this counterintuitive drop further confirms the idea that both retrieval and generation should be tested when developing with RAG.
Conclusion
By testing various generation tasks, we observed that while both Claude and GPT-4 are good at trivial tasks such as string manipulation, their strengths and weaknesses become obvious in more complex scenarios (https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/). LLMs are still not very good at math! Another key result is that introducing an "explain yourself" prompt significantly improves GPT-4's performance, underscoring how important it is to prompt the model well and to have it articulate its reasoning in order to obtain accurate results.
These findings have broader implications for LLM evaluation. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes clear that RAG evaluation criteria (https://arize.com/blog-course/rag-evaluation/) must go beyond a narrow emphasis on correctness alone. The verbosity of model responses introduces a variable that can significantly affect their perceived performance. This nuance suggests that future model evaluations should consider average response length as a noteworthy factor, to better understand a model's capabilities and ensure fairer comparisons.
Translator Introduction
Zhu Xianzhong, 51CTO community editor, 51CTO expert blogger and lecturer, computer teacher at a university in Weifang, and a veteran freelance programmer.
Original title: Tips for Getting the Generation Part Right in Retrieval Augmented Generation, Author: Aparna Dhinakaran
Link: https://towardsdatascience.com/tips-for-getting-the-generation-part-right-in-retrieval-augmented-generation-7deaa26f28dc