


Can GPT-4 pass the Turing test?
When a powerful enough model is born, people often use the Turing test to measure the intelligence of this LLM.
Recently, researchers from the Department of Cognitive Science at UCSD discovered that:
In the Turing test, people simply cannot tell the difference GPT-4 and humans!
##Paper address: https://arxiv.org/pdf/2405.08007
In the Turing test, GPT-4 was judged to be human 54% of the time.
The experimental results show that this is the first time that a system has been empirically tested in the "interactive" two-person Turing test.
Researcher Cameron R.Jones recruited 500 volunteers, who were divided into 5 roles: 4 evaluators, namely GPT -4, GPT-3.5, ELIZA and humans, the other character "plays" humans themselves, hiding on the other side of the screen, waiting for the evaluator to discover.
The following is an excerpt from the game. Can you tell which dialog box is human?
Figure 1: Part of the conversation between the human interrogator (green) and the witness (grey)
In fact, among these four conversations, one was with a human witness, and the rest were with artificial intelligence.
The controlled "Turing test" was launched for the first time
In the past 74 years, people have made many attempts at Turing tests, but very few Few controlled experiments have been conducted.
The famous Loebner Award is an annual competition. However, from the first time it was held in 1990 to 2020, no system has passed the test.
#A recent large-scale study found that humans can identify a series of LLMs with 60% accuracy during a 2-minute online conversation.
To date, there have been no "controlled" experimental demonstrations where any machine has passed the test.
To test whether it is possible for people to see through the deception of current AI systems, the researchers used GPT-4 to conduct a randomly controlled two-person Turing test.
The test requirements are very simple, that is, human participants start a 5-minute conversation with humans and artificial intelligence, and determine whether the interlocutor is a human.
As mentioned at the beginning, the researchers evaluated a total of 3 models-GPT-4, GPT-3.5, and ELIZA.
For the first two LLMs, the researchers adapted the best-performing prompts from the GPT-4 and GPT-3.5 exploratory studies to indicate How the large model responds to messages.
Researchers say a core challenge in interpreting the Turing test is that a system's superior performance may stem from two reasons:
- One is an excellent imitation of humans
-The second is that the judge is too gullible
The latter situation is often called the ELIZA effect, that is, participation Authors will humanize even simple systems.
The ELIZA system is very familiar to everyone.
It was a simple rules-based chatbot from the 1960s, but users found themselves attributing human characteristics to it as well.
Therefore, in the experiment, the researchers used ELIZA as a baseline, on the one hand to empirically quantify the ELIZA effect, and on the other hand to conduct operational checks in the experiment .
So, what are the specific results of this Turing test?
GPT-4 pass rate is 54%
First look at an important indicator: pass rate.
The researchers define the pass rate as the proportion of evaluation objects in the game that are judged to be human by the evaluators.
The pass rate of GPT-4 is 54%, exceeding GPT-3.5 (50%) and ELIZA baseline (22%), but lower than human subjects (67%) .
Evaluator pass rate (left) and evaluator trust (right). Each point in the picture on the right represents a game
This result is actually not surprising and is basically consistent with the assumptions in the initial stage of the research.
Before the experiment, the researchers made several assumptions: First, it was assumed that humans’ ability to detect GPT-4 would not exceed chance (50%).
To test this, the researchers fit a logistic regression model to predict the passing rate of only the intercept.
There is no significant difference between the intercept and 0 (z=0.14, p=0.482), which indicates that the pass rate of GPT-4 can be approximately equal to 50%. The Bayesian model predicts a 95% confidence interval for the GPT-4 intercept of -0.24-0.55 (equivalent to a pass rate of 44%-63%).
This means that when humans face GPT-4, the probability of successful discrimination is about the same as tossing a coin!
The results also confirmed the researchers’ second and third hypotheses: the human pass rate was significantly higher than the GPT-4 pass rate (z=2.42, p=0.017), The pass rate of GPT-4 is higher than that of ELIZA (z=4.11, p
However, what is inconsistent with the researcher’s experimental assumptions is that the performance of GPT-4 and GPT-3.5 is almost the same (z=0.497, p=0.619).
In terms of "pretending to be human", it's a tie.
Also unexpectedly, participants who said they had a better understanding of how LLMs work were not more successful in identifying LLMs (z = -0.023, p = 0.869).
In exploratory analyses, the researchers found that age had a negative impact on interrogator accuracy (z = -2.24, p = 0.025), but other demographic variables, including Gender, education level, and frequency of interaction with chatbots) had no significant effect.
Why is this the result? Let us analyze the reasons behind it.
To gain insight into what exactly influences evaluators’ decisions, we categorized the strategies evaluators used and the reasons they provided for their decisions.
36% of evaluators would chat with the person being evaluated and ask about some personal details or daily activities.
The most commonly used strategies in Categories 2 and 3 are social and emotional (25%)—for example, asking about the assessee’s opinions and experiences and observing whether they have a sense of humor.
Others focus on knowledge and reasoning (13%), including questions about mathematics, logic and general knowledge.
Ask for strategy (left) and average accuracy (right)
To compare each strategy For effectiveness, the researchers calculated the average accuracy using these strategies.
Usually when the evaluator asks the other person about human experience (75%), logic and mathematics (78%) or directly "accuses" the other person of artificial intelligence (80%) Get closer to the "correct answer".
In games, the reason why artificial intelligence "shows off" is usually because the evaluator feels that they seem to be forced to play a role, either too casually, or lack personality.
And ELIZA is easily "exposed", mainly due to unnatural reactions and lack of contextual awareness.
It can be seen that both in terms of the strategies used by the volunteers participating in the game and the reasons for their decision-making, they pay more attention to language style and social-emotional factors rather than more traditional Intellectual categories (knowledge and reasoning).
This seems to reflect an underlying assumption of the volunteers—social intelligence is the most inimitable human characteristic for AI.
Turing test, passed
How to determine when the system has passed the Turing test?
Turing predicted in his 1950 paper that by the end of the century, AI systems would be able to play the imitation game so well that examiners, after a 5-minute exchange, The chance of correct identification will not exceed 70%.
This conclusion has led to whether the pass rate of AI can reach 30%, which has become the discussion goal of many researchers.
However, this benchmark seems arbitrary, and it is unclear whether Turing intended it as a definition of success.
In fact, the 50% baseline is more reasonable because it better proves that human reviewers are no better than chance at identifying AI.
This definition is particularly relevant to the goal of discovering whether users can reliably identify other humans during online interactions.
In the investigators' pre-publication analysis, this issue was addressed by using the ELIZA baseline as a manipulation check.
Only when the analysis shows that the pass rate of ELIZA is lower than the probability, and the pass rate of GPT-4 is not lower than the probability, the model can be judged to have passed the test.
According to this definition, GPT-4 clearly passed this version of the Turing test.
In addition, the participants’ confidence scores and decision-making basis showed that they were not randomly guessing: GPT-4 has an average human confidence level of 73% (see Figure 2).
Netizen discussion
Some netizens said that if we want to equate the Turing test with intelligence, The Turing test is not a good test. However, the fact that AI can almost always “fool” humans is a bit concerning.
Some people said that they were doubtful about this kind of test. Because GPT-4 will outperform most humans, it will be easy to distinguish who is human and who is artificial intelligence.
Researchers said that this is indeed a problem we have encountered. For example, GPT-4’s knowledge reserve is “too rich” or it masters too many languages. We explicitly prompt the model to avoid this situation, which is effective to a certain extent.
The above is the detailed content of GPT-4 passed the Turing test with a winning rate of 54%! UCSD new work: Humans cannot recognize GPT-4. For more information, please follow other related articles on the PHP Chinese website!

一觉醒来,工作的方式被彻底改变。微软把AI神器GPT-4全面接入Office,这下ChatPPT、ChatWord、ChatExcel一家整整齐齐。CEO纳德拉在发布会上直接放话:今天,进入人机交互的新时代,重新发明生产力。新功能名叫Microsoft 365 Copilot(副驾驶),与改变了程序员的代码助手GitHub Copilot成为一个系列,继续改变更多人。现在AI不光能自动做PPT,而且能根据Word文档的内容一键做出精美排版。甚至连上台时对着每一页PPT应该讲什么话,都给一起安排

集成GPT-4的Github Copilot X还在小范围内测中,而集成GPT-4的Cursor已公开发行。Cursor是一个集成GPT-4的IDE,可以用自然语言编写代码,让编写代码和聊天一样简单。 GPT-4和GPT-3.5在处理和编写代码的能力上差别还是很大的。官网的一份测试报告。前两个是GPT-4,一个采用文本输入,一个采用图像输入;第三个是GPT3.5,可以看出GPT-4的代码能力相较于GPT-3.5有较大能力的提升。集成GPT-4的Github Copilot X还在小范围内测中,而

作者 | 云昭3月9日,微软德国CTO Andreas Braun在AI kickoff会议上带来了一个期待已久的消息:“我们将于下周推出GPT-4,届时我们将推出多模式模式,提供完全不同的可能性——例如视频。”言语之中,他将大型语言模型(LLM)比作“游戏改变者”,因为他们教机器理解自然语言,然后机器以统计的方式理解以前只能由人类阅读和理解的东西。与此同时,这项技术已经发展到“适用于所有语言”:你可以用德语提问,也可以用意大利语回答。借助多模态,微软(-OpenAI)将“使模型变得全面”。那

近段时间,人工智能聊天机器人ChatGPT刷爆网络,网友们争先恐后去领略它的超高情商和巨大威力。参加高考、修改代码、构思小说……它在广大网友的“鞭策”下不断突破自我,甚至可以用一整段程序,为你拼接出一只小狗。而这些技能只是基于GPT-3.5开发而来,在3月15日,AI世界再次更新,最新版本的GPT-4也被OpenAI发布了出来。与之前相比,GPT-4不仅展现了更加强大的语言理解能力,还能够处理图像内容,在考试中的得分甚至能超越90%的人类。那么,如此“逆天”的GPT-4还具有哪些能力?它又是如何

GPT-4在发布之时公布了一项医学知识测试结果,该测试由美国医师学会开发,最终它答对了75%的问题,相比GPT3.5的53%有很大的飞跃。 这两天,一篇关于“GPT-4救了我狗的命”的帖子属实有点火:短短一两天就有数千人转发,上万人点赞,网友在评论区讨论得热火朝天。△ 是真狗命,not人的“狗命”(Doge)乍一听,大家想必很纳闷:这俩能扯上什么关系?GPT-4还能长眼睛发现狗有什么危险吗?真实的经过是这样子的:当兽医说无能为力时,他问了GPT-4发帖人名叫Cooper。他自述自己养的一条狗子,

人工智能在过去几十年里发展势头强劲,像GPT-4这样的大型语言模型引起了用户的更多兴趣,他们想知道GPT-4如何支持数字化转型。根据行业媒体的预测,到2024年,GPT-4所基于的ChatGPT深度学习堆栈将产生10亿美元的收入。GPT-4的普及是由于人工智能技术的力量,以及高用户可访问性和广泛的通用性。科技行业的许多不同领域都可以利用GPT-4来自动化和个性化许多任务,使企业员工能够专注于更复杂的任务。以下是GPT-4在几个不同领域促进数字化转型的一些例子。1、个性化员工培训像GPT-4这样的

GPT-4,火爆,非常火爆。不过家人们,在铺天盖地的叫好声中,有件事可能你是“万万没想到”——在OpenAI公布的技术论文里,竟然藏着九大隐秘的线索!这些线索是由国外博主AI Explained发现并整理。他宛如一位细节狂魔,从长达98页论文中,逐个揭秘这些“隐匿的角落”,包括:GPT-5可能已经完成训练GPT-4出现过“挂掉”的情况OpenAI两年内或实现接近AGI……发现一:GPT4出现过“挂掉”的情况在GPT-4技术论文的第53页处,OpenAI提到了这样一个机构——Alignment R

3 月 15 日消息,今天 OpenAI 发布了全新的 GPT-4 大型语言模型,随后微软官方宣布,Bing Chat 此前已经升级使用 OpenAI 的 GPT-4 技术。微软公司副总裁兼消费者首席营销官 Yusuf Mehdi 确认 Bing Chat 聊天机器人 AI 已经在 GPT-4 上运行,ChatGPT 基于最新版本 GPT-4,由 OpenAI 开发 。微软 Bing 博客网站上的一篇帖子进一步证实了这一消息。微软表示,如果用户在过去五周内的任何时间使用过新的 Bing 预览版,


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 English version
Recommended: Win version, supports code prompts!

Zend Studio 13.0.1
Powerful PHP integrated development environment

Atom editor mac version download
The most popular open source editor

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Dreamweaver Mac version
Visual web development tools