How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent-AI-php.cn

Home

Technology peripherals

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

May 19, 2023 pm 07:17 PM

aiModel

Some time ago, a leaked document from Google attracted widespread attention. In this document, a researcher within Google expressed an important point: Google does not have a moat, and neither does OpenAI.

The researcher said that although on the surface it seems that OpenAI and Google are chasing each other on large AI models, the real winner may not emerge from the two because one The three-party forces are quietly rising.

This power is called "open source". Centering on open source models such as Meta's LLaMA, the entire community is rapidly building models with capabilities similar to those of OpenAI and Google's large models. Moreover, open source models are iterative faster, more customizable, and more private... "When it is free, People will not pay for a restricted model when unrestricted alternatives are of equal quality," the authors write.

These views have caused a lot of controversy on social media. One of the bigger controversies is whether those open source models can really achieve the same level as commercial closed sources such as OpenAI ChatGPT or Google Bard. Similar level to large models? How big is the gap between the two camps at this stage?

In order to explore this issue, a Medium blogger named Marco Tulio Ribeiro tested some models (Vicuna-13B, MPT-7b-Chat VS. ChatGPT 3.5 on some complex tasks )taking the test.

Among them, Vicuna-13B is an open source model proposed by researchers from the University of California, Berkeley, Carnegie Mellon University, Stanford University, and the University of California, San Diego. This model is based on LLaMA was built with a 13B parameter version and performed very well in a test scored by GPT-4 (see "$300 to reproduce ChatGPT's nine-successful power, GPT-4 personally proctored the test, 13 billion parameter open source model" Alpaca" is coming").

MPT-7B is a large language model released by MosaicML, which follows the training scheme of meta's LLaMA model. MosaicML says MPT-7B performs on par with meta’s 7 billion parameter LLaMA model.

Compared with them is naturally ChatGPT, the benchmark for large language models.

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

Marco Tulio Ribeiro is a researcher currently working in the Adaptive Systems and Interaction group at Microsoft Research. He is also a joint assistant professor at the University of Washington. The work was done by him and Scott Lundberg, another researcher at Microsoft. In testing, they used Microsoft's guidance library to help design prompts.

Warm-up: Solving Equations

The first task is to solve simple polynomial equations. These questions have standard answers, making it easier to evaluate whether they are right or wrong.

For the three specified models, the question given by the tester is to find the solution to the binary linear equation "x^2 3x=0". They used the following prompt:

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

#The three models performed as follows.

ChatGPT:

<code>equation = 'x^2 + 3.0x = 0'roots = [0, -3]answer_gpt = find_roots (llm=chatgpt, equatinotallow=equation)</code>

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

##Vicuna:

<code>answer_vicuna = find_roots (llm=vicuna, equatinotallow=equation)</code>

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

##MPT:

<code>answer_mpt = find_roots (llm=mpt, equatinotallow=equation)</code>

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

Obviously, the correct answer should be [-3, 0], and only ChatGPT answered it correctly (Vicuna didn't even answer in the specified format).

在这篇文章附带的 notebook 中，测试者编写了一个函数，用于生成具有整数根的随机二次方程，根的范围在 - 20 到 20 之间，并且对每个模型运行了 20 次 prompt。三个模型的准确率结果如下：

<code>╔═══════════╦══════════╦║ Model ║ Accuracy ║ ╠═══════════╬══════════╬║ ChatGPT ║ 80%║║ Vicuna║0%║ ║ MPT ║0%║╚═══════════╩══════════╩</code>

在二元一次方程的测试中，虽然 GPT 做错了一些题，但 Vicuna 和 MPT 一道都没做对，经常在中间步骤中犯错（MPT 甚至经常不写中间步骤）。下面是一个 ChatGPT 错误的例子：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

ChatGPT 在最后一步计算错误，(13 +- 25)/2 应该得到 [19，-6] 而不是 [19.5，-6.5]。

由于 Vicuna 和 MPT 实在不会解二元一次方程，测试者就找了一些更简单的题让他们做，比如 x-10=0。对于这些简单的方程，他们得到了以下统计结果：

<code>╔═══════════╦══════════╦║ Model ║ Accuracy ║ ╠═══════════╬══════════╬║ ChatGPT ║ 100% ║║ Vicuna║85% ║ ║ MPT ║30% ║╚═══════════╩══════════╩</code>

下面是一个 MPT 答错的例子：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

结论

在这个非常简单的测试中，测试者使用相同的问题、相同的 prompt 得出的结论是：ChatGPT 在准确性方面远远超过了 Vicuna 和 MPT。

任务：提取片段 + 回答会议相关的问题

这个任务更加现实，而且在会议相关的问答中，出于安全性、隐私等方面考虑，大家可能更加倾向于用开源模型，而不是将私有数据发送给 OpenAI。

以下是一段会议记录（翻译结果来自 DeepL，仅供参考）：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

测试者给出的第一个测试问题是：「Steven 如何看待收购一事？」，prompt 如下：

<code>qa_attempt1 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Please answer the following question:Question: {{query}}Extract from the transcript the most relevant segments for the answer, and then answer the question.{{/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')</code>

ChatGPT 给出了如下答案：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

虽然这个回答是合理的，但 ChatGPT 并没有提取任何对话片段作为答案的支撑（因此不符合测试者设定的规范）。测试者在 notebook 中迭代了 5 个不同的 prompt，以下是一些例子：

<code>qa_attempt3 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.As an example of output format, here is a fictitious answer to a question about another meeting transcript.CONVERSATION SEGMENTS:Segment 1: Peter and John discuss the weather.Peter: John, how is the weather today?John: It's raining.Segment 2: Peter insults JohnPeter: John, you are a bad person.Segment 3: BlankANSWER: Peter and John discussed the weather and Peter insulted John.{{/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')</code>

在这个新的 prompt 中，ChatGPT 确实提取了相关的片段，但它没有遵循测试者规定的输出格式（它没有总结每个片段，也没有给出对话者的名字）。

不过，在构建出更复杂的 prompt 之后，ChatGPT 终于听懂了指示：

<code>qa_attempt5 = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: What were the main things that happened in the meeting?Here is a meeting transcript:----Peter: HeyJohn: HeyPeter: John, how is the weather today?John: It's raining.Peter: That's too bad. I was hoping to go for a walk later.John: Yeah, it's a shame.Peter: John, you are a bad person.----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.{{/user}}{{#assistant~}}CONVERSATION SEGMENTS:Segment 1: Peter and John discuss the weather.Peter: John, how is the weather today?John: It's raining.Segment 2: Peter insults JohnPeter: John, you are a bad person.Segment 3: BlankANSWER: Peter and John discussed the weather and Peter insulted John.{{~/assistant~}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}Here is a meeting transcript:----{{transcript}}----Based on the above, please answer the following question:Question: {{query}}Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns.Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.{{~/user}}{{#assistant~}}{{gen 'answer'}}{{~/assistant~}}''')</code>

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

测试者表示，他们之所以要多次迭代 prompt，是因为 OpenAI API 不允许他们做部分输出补全（即他们不能指定 AI 助手如何开始回答），因此他们很难引导输出。

相反，如果使用一个开源模型，他们就可以更清楚地指导输出，迫使模型使用他们规定的结构。

新一轮测试使用如下 prompt：

<code>qa_guided = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}You will read a meeting transcript, then extract the relevant segments to answer the following question:Question: {{query}}----{{transcript}}----Based on the above, please answer the following question:Question: {{query}}Please extract the three segment from the transcript that are the most relevant for the answer, and then answer the question.Note that conversation segments can be of any length, e.g. including multiple conversation turns. If you need less than three segments, you can leave the rest blank.As an example of output format, here is a fictitious answer to a question about another meeting transcript:CONVERSATION SEGMENTS:Segment 1: Peter and John discuss the weather.Peter: John, how is the weather today?John: It's raining.Segment 2: Peter insults JohnPeter: John, you are a bad person.Segment 3: BlankANSWER: Peter and John discussed the weather and Peter insulted John.{{/user}}{{#assistant~}}CONVERSATION SEGMENTS:Segment 1: {{gen'segment1'}}Segment 2: {{gen'segment2'}}Segment 3: {{gen'segment3'}}ANSWER: {{gen 'answer'}}{{~/assistant~}}''')</code>

如果用 Vicuna 运行上述 prompt，他们第一次就会得到正确的格式，而且格式总能保持正确：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

当然，也可以在 MPT 上运行相同的 prompt：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

虽然 MPT 遵循了格式要求，但它没有针对给定的会议资料回答问题，而是从格式示例中提取了片段。这显然是不行的。

接下来比较 ChatGPT 和 Vicuna。

测试者给出的问题是「谁想卖掉公司？」两个模型看起来答得都不错。

以下是 ChatGPT 的回答：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

以下是 Vicuna 的回答：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

接下来，测试者换了一段材料。新材料是马斯克和记者的一段对话：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

测试者提出的问题是：「Elon Musk 有没有侮辱（insult）记者？」

ChatGPT 给出的答案是：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

Vicuna 给出的答案是：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

Vicuna 给出了正确的格式，甚至提取的片段也是对的。但令人意外的是，它最后还是给出了错误的答案，即「Elon musk does not accuse him of lying or insult him in any way」。

测试者还进行了其他问答测试，得出的结论是：Vicuna 在大多数问题上与 ChatGPT 相当，但比 ChatGPT 更经常答错。

用 bash 完成任务

测试者尝试让几个 LLM 迭代使用 bash shell 来解决一些问题。每当模型发出命令，测试者会运行这些命令并将输出插入到 prompt 中，迭代进行这个过程，直到任务完成。

ChatGPT 的 prompt 如下所示：

<code>terminal = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}Please complete the following task:Task: list the files in the current directoryYou can give me one bash command to run at a time, using the syntax:COMMAND: commandI will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.{{/user}}{{#assistant~}}COMMAND: ls{{~/assistant~}}{{#user~}}Output: guidance project{{/user}}{{#assistant~}}The files or folders in the current directory are:- guidance- projectDONE{{~/assistant~}}{{#user~}}Please complete the following task:Task: {{task}}You can give me one bash command to run at a time, using the syntax:COMMAND: commandI will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.{{/user}}{{#geneach 'commands' stop=False}}{{#assistant~}}{{gen 'this.command'}}{{~/assistant~}}{{~#user~}}Output: {{shell this.command)}}{{~/user~}}{{/geneach}}''')</code>

测试者在～/work/project 中创建了一个虚拟存储库，其中包含文件 license.txt，但不是标准的 LICENSE 文件名。

然后测试者尝试在不与 ChatGPT 沟通的情况下，看它是否能完成任务 ——「找出位于～/work/project 中的开源项目正在使用的 license」（Find out what license the open source project located in ~/work/project is using）。

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

ChatGPT 遵循一个非常自然的顺序，并解决了这个问题。

对于开源模型，测试者编写了一个更简单的（引导式）prompt，其中包含一系列命令输出：

<code>guided_terminal = guidance ('''{{#system~}}{{llm.default_system_prompt}}{{~/system}}{{#user~}}Please complete the following task:Task: list the files in the current directoryYou can run bash commands using the syntax:COMMAND: commandOUTPUT: outputOnce you are done with the task, use the COMMAND: DONE.{{/user}}{{#assistant~}}COMMAND: lsOUTPUT: guidance projectCOMMAND: DONE {{~/assistant~}}{{#user~}}Please complete the following task:Task: {{task}}You can run bash commands using the syntax:COMMAND: commandOUTPUT: outputOnce you are done with the task, use the COMMAND: DONE.{{~/user}}{{#assistant~}}{{#geneach 'commands' stop=False ~}}COMMAND: {{gen 'this.command' stop='\\n'}}OUTPUT: {{shell this.command)}}{{~/geneach}}{{~/assistant~}}''')</code>

我们来看一下 Vicuna 和 MPT 执行该任务的情况。

Vicuna：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

MPT：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

在一个有趣的转折中，Vicuna 无法解决这个任务，但 MPT 却成功了。除了保密性之外，开源模型在这里有一个显著的优势：整个 prompt 被作为一个输入传递给一个 LLM 模型（测试者甚至通过不让它生成像 COMMAND 这样的输出结构 token 来加速它）。

相比之下，他们必须为每个命令重新调用 ChatGPT，这更慢，开销也更大。

接下来，他们又尝试了一个不同的命令：「在～/work/guidance 目录下找到当前未被 git 跟踪的所有 jupyter notebook 文件」

以下是 ChatGPT 的回答：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

测试者再次遇到一个问题：ChatGPT 没有遵循他们指定的输出结构（这样就使得它无法在无人干预的情况下在程序内使用）。该程序只是执行命令，因此在上面最后一条 ChatGPT 信息之后就停止了。

测试者怀疑空输出会导致 ChatGPT 关闭，因此他们通过在没有输出时更改信息来解决这个特殊问题。然而，他们无法解决「无法强迫 ChatGPT 遵循指定的输出结构」这一普遍问题。

在做了这个小小的修改后，ChatGPT 就能解决这个问题：让我们看看 Vicuna 是怎么做的：

How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent

Vicuna 遵循了输出结构，但不幸的是，它运行了错误的命令来完成任务。MPT 反复调用 git status，所以它也失败了。

测试者还对其他各种指令运行了这些程序，发现 ChatGPT 几乎总是能产生正确的指令序列，但有时并不遵循指定的格式（因此需要人工干预）。此处开源模型的效果不是很好（或许可以通过更多的 prompt 工程来改进它们，但它们在大多数较难的指令上都失败了）。

归纳总结

测试者还尝试了一些其他任务，包括文本摘要、问题回答、创意生成和 toy 字符串操作，评估了几种模型的准确性。以下是主要的评估结果：

Task quality: For every task, ChatGPT (3.5) is better than Vicuna, while MPT performs poorly on almost all tasks, which even makes the test team suspect that there is a problem with its usage method . It is worth noting that Vicuna's performance is generally close to ChatGPT.
Ease of use: ChatGPT has difficulty following the specified output format, so using it in a program requires writing a regular expression parser for the output. In contrast, being able to specify the output structure is a significant advantage of open source models, so much so that Vicuna is sometimes easier to use than ChatGPT, even if it performs worse on tasks.
Efficiency: The local deployment model means we can solve tasks in a single LLM run (guidance maintains LLM state while the program executes), faster and cheaper. This is especially true when any sub-step involves calling other APIs or functions (e.g. search, terminal, etc.), which always requires a new call to the OpenAI API. Guidance also speeds up generation by not letting the model generate output structure tags, which can sometimes make a big difference.

Overall, the conclusion of this test is that MPT is not ready for real-world use, while Vicuna is better than ChatGPT (3.5) for many tasks ). These findings currently apply only to the tasks and inputs (or prompt types) attempted by this test, which is an initial exploration rather than a formal evaluation.

For more results, see notebook: https://github.com/microsoft/guidance/blob/main/notebooks/chatgpt_vs_open_source_on_harder_tasks.ipynb

The above is the detailed content of How much is the difference between the alpaca series large model and ChatGPT? After detailed evaluation, I fell silent. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

从VAE到扩散模型：一文解读以文生图新范式Apr 08, 2023 pm 08:41 PM

1 前言在发布DALL·E的15个月后，OpenAI在今年春天带了续作DALL·E 2，以其更加惊艳的效果和丰富的可玩性迅速占领了各大AI社区的头条。近年来，随着生成对抗网络（GAN）、变分自编码器（VAE）、扩散模型（Diffusion models）的出现，深度学习已向世人展现其强大的图像生成能力；加上GPT-3、BERT等NLP模型的成功，人类正逐步打破文本和图像的信息界限。在DALL·E 2中，只需输入简单的文本（prompt），它就可以生成多张1024*1024的高清图像。这些图像甚至

找不到中文语音预训练模型？中文版 Wav2vec 2.0和HuBERT来了Apr 08, 2023 pm 06:21 PM

Wav2vec 2.0 [1]，HuBERT [2] 和 WavLM [3] 等语音预训练模型，通过在多达上万小时的无标注语音数据（如 Libri-light ）上的自监督学习，显著提升了自动语音识别（Automatic Speech Recognition, ASR），语音合成（Text-to-speech, TTS）和语音转换（Voice Conversation，VC）等语音下游任务的性能。然而这些模型都没有公开的中文版本，不便于应用在中文语音研究场景。 WenetSpeech [4] 是

普林斯顿陈丹琦：如何让「大模型」变小Apr 08, 2023 pm 04:01 PM

“Making large models smaller”这是很多语言模型研究人员的学术追求，针对大模型昂贵的环境和训练成本，陈丹琦在智源大会青源学术年会上做了题为“Making large models smaller”的特邀报告。报告中重点提及了基于记忆增强的TRIME算法和基于粗细粒度联合剪枝和逐层蒸馏的CofiPruning算法。前者能够在不改变模型结构的基础上兼顾语言模型困惑度和检索速度方面的优势；而后者可以在保证下游任务准确度的同时实现更快的处理速度，具有更小的模型结构。陈丹琦普

解锁CNN和Transformer正确结合方法，字节跳动提出有效的下一代视觉TransformerApr 09, 2023 pm 02:01 PM

由于复杂的注意力机制和模型设计，大多数现有的视觉 Transformer（ViT）在现实的工业部署场景中不能像卷积神经网络（CNN）那样高效地执行。这就带来了一个问题：视觉神经网络能否像 CNN 一样快速推断并像 ViT 一样强大？近期一些工作试图设计 CNN-Transformer 混合架构来解决这个问题，但这些工作的整体性能远不能令人满意。基于此，来自字节跳动的研究者提出了一种能在现实工业场景中有效部署的下一代视觉 Transformer——Next-ViT。从延迟 / 准确性权衡的角度看，

Stable Diffusion XL 现已推出—有什么新功能，你知道吗？Apr 07, 2023 pm 11:21 PM

3月27号，Stability AI的创始人兼首席执行官Emad Mostaque在一条推文中宣布，Stable Diffusion XL 现已可用于公开测试。以下是一些事项：“XL”不是这个新的AI模型的官方名称。一旦发布稳定性AI公司的官方公告，名称将会更改。与先前版本相比，图像质量有所提高与先前版本相比，图像生成速度大大加快。示例图像让我们看看新旧AI模型在结果上的差异。Prompt: Luxury sports car with aerodynamic curves, shot in a

五年后AI所需算力超100万倍！十二家机构联合发表88页长文：「智能计算」是解药Apr 09, 2023 pm 07:01 PM

人工智能就是一个「拼财力」的行业，如果没有高性能计算设备，别说开发基础模型，就连微调模型都做不到。但如果只靠拼硬件，单靠当前计算性能的发展速度，迟早有一天无法满足日益膨胀的需求，所以还需要配套的软件来协调统筹计算能力，这时候就需要用到「智能计算」技术。最近，来自之江实验室、中国工程院、国防科技大学、浙江大学等多达十二个国内外研究机构共同发表了一篇论文，首次对智能计算领域进行了全面的调研，涵盖了理论基础、智能与计算的技术融合、重要应用、挑战和未来前景。论文链接：https://spj.scien

什么是Transformer机器学习模型？Apr 08, 2023 pm 06:31 PM

译者 | 李睿审校 | 孙淑娟近年来， Transformer 机器学习模型已经成为深度学习和深度神经网络技术进步的主要亮点之一。它主要用于自然语言处理中的高级应用。谷歌正在使用它来增强其搜索引擎结果。OpenAI 使用 Transformer 创建了著名的 GPT-2和 GPT-3模型。自从2017年首次亮相以来，Transformer 架构不断发展并扩展到多种不同的变体，从语言任务扩展到其他领域。它们已被用于时间序列预测。它们是 DeepMind 的蛋白质结构预测模型 AlphaFold

AI模型告诉你，为啥巴西最可能在今年夺冠！曾精准预测前两届冠军Apr 09, 2023 pm 01:51 PM

说起2010年南非世界杯的最大网红，一定非「章鱼保罗」莫属！这只位于德国海洋生物中心的神奇章鱼，不仅成功预测了德国队全部七场比赛的结果，还顺利地选出了最终的总冠军西班牙队。不幸的是，保罗已经永远地离开了我们，但它的「遗产」却在人们预测足球比赛结果的尝试中持续存在。在艾伦图灵研究所（The Alan Turing Institute），随着2022年卡塔尔世界杯的持续进行，三位研究员Nick Barlow、Jack Roberts和Ryan Chan决定用一种AI算法预测今年的冠军归属。预测模型图

See all articles