Comprehensive comparison of four 'ChatGPT search' models! Hand-annotated by a Chinese PhD student at Stanford: New Bing has the lowest fluency, and nearly half of the sentences lack full citation support.

Not long after the release of ChatGPT, Microsoft successfully launched the "New Bing". Not only did its stock price surge, it even threatened to replace Google and usher in a new era of search engines.

But is New Bing really the right way to use a large language model? Are the generated answers actually useful to users? How credible are the citations attached to each sentence?

Recently, Stanford researchers collected a large number of user queries from different sources and conducted a human evaluation of four popular generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat.

Paper link: https://arxiv.org/pdf/2304.09848.pdf

The experiments found that responses from existing generative search engines are fluent and informative, but often contain unsupported statements and inaccurate citations.

On average, only 51.5% of generated sentences are fully supported by citations, and only 74.5% of citations support their associated sentence.

The researchers consider these figures too low for systems that may become a primary tool for information-seeking users, especially since some generated sentences merely sound plausible. Generative search engines still need further improvement.

Personal homepage: https://cs.stanford.edu/~nfliu/

The first author, Nelson Liu, is a fourth-year PhD student in the Natural Language Processing Group at Stanford University, advised by Percy Liang. He received his bachelor's degree from the University of Washington. His main research direction is building practical NLP systems, especially for information-seeking applications.

Don’t Trust Generative Search Engines

In March 2023, Microsoft reported that “approximately one-third of daily preview users use [Bing] Chat every day”, and that Bing Chat served 45 million chats in the first month of its public preview. In other words, integrating large language models into search engines has clear market appeal and is very likely to change how people enter the Internet through search.

At present, however, generative search engines built on large language models still suffer from low accuracy. Their accuracy has not yet been thoroughly evaluated, and the limitations of these new search engines are not yet fully understood.

Verifiability is key to making search engines trustworthy: each sentence in a generated answer should come with citations linking to external sources as supporting evidence, so that users can more easily verify the accuracy of the answer.

The researchers collected questions of different types from different sources and conducted a manual evaluation of four commercial generative search engines (Bing Chat, NeevaAI, perplexity.ai, YouChat).

The evaluation metrics are: fluency, that is, whether the generated text is coherent; perceived utility, that is, whether the search engine's response is helpful to the user and whether the information in it answers the query; citation recall, that is, the proportion of generated sentences about the external world that are fully supported by citations; and citation precision, that is, the proportion of generated citations that support their associated sentence.

Fluency

Annotators were shown the user query, the generated response, and the statement "The response is fluent and semantically coherent", and rated their agreement with it on a five-point Likert scale.

Perceived utility

As with fluency, annotators were asked to rate their agreement with the statement that the response is useful and informative for the user's query.
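The per-engine scores reported later are averages of such Likert ratings. A minimal sketch of that averaging follows; the label names and the `mean_likert` helper are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean

# Hypothetical labels for the five-point Likert scale (names are assumptions).
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def mean_likert(ratings):
    """Average a list of Likert labels on the 1-5 scale."""
    if not ratings:
        raise ValueError("no ratings to average")
    return mean(LIKERT[r] for r in ratings)


# Made-up annotator ratings for one engine's fluency.
fluency_ratings = ["agree", "strongly agree", "agree", "neutral"]
print(f"mean fluency: {mean_likert(fluency_ratings):.2f}")
```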

Citation recall

Citation recall is the proportion of verification-worthy sentences that are fully supported by their associated citations. Computing this metric therefore requires identifying the sentences in a response that are worth verifying, and assessing whether each such sentence is supported by its associated citations.

When identifying sentences worth verifying, the researchers treat every generated sentence about the external world as worth verifying, even those that may seem obvious or trivial, because what looks like obvious “common sense” to some readers may not actually be correct.

The goal of a search engine system should be to provide a reference source for every generated sentence about the external world, so that readers can easily verify any claim in the generated response. Verifiability should not be sacrificed for the sake of brevity.

In practice, annotators therefore verify all generated sentences except those in which the system speaks in the first person, such as "As a language model, I am not capable of...", and questions addressed to the user, such as "Do you want to know more?".

Assessing whether a verification-worthy sentence is adequately supported by its associated citations follows the Attributable to Identified Sources (AIS) evaluation framework. Annotators make a binary judgment: if a generic listener would agree that "based on the cited web page, it can be concluded that...", then the citation fully supports the sentence.
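The recall computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Sentence` record, its field names, and the example response are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    verification_worthy: bool  # False for first-person or user-directed sentences
    fully_supported: bool      # annotator's binary AIS-style judgment


def citation_recall(sentences):
    """Share of verification-worthy sentences fully supported by their citations."""
    worthy = [s for s in sentences if s.verification_worthy]
    if not worthy:
        return None  # nothing to verify in this response
    return sum(s.fully_supported for s in worthy) / len(worthy)


# Made-up annotated response.
response = [
    Sentence("The Eiffel Tower is in Paris [1].", True, True),
    Sentence("It was completed in 1889 [2].", True, True),
    Sentence("It is repainted regularly.", True, False),    # no citation at all
    Sentence("Do you want to know more?", False, False),    # excluded from recall
]
print(f"citation recall: {citation_recall(response):.3f}")
```

Note that the question addressed to the user is dropped from the denominator, so an uncited sentence only hurts recall if it makes a claim about the external world.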

Citation precision

To measure citation precision, annotators judge whether each citation provides full, partial, or no support for its associated sentence.

Full support : All information in the sentence is supported by the citation.

Partial support : Some information in the sentence is supported by the citation, but other parts may be missing or contradictory.

No support : The cited web page is completely irrelevant to the sentence or contradicts it.

For sentences with multiple associated citations, annotators are additionally asked to use the AIS evaluation framework to judge whether all the cited web pages, taken as a whole, provide sufficient support for the sentence (a binary judgment).
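A minimal sketch of how these labels could be aggregated into citation precision is given below. The article does not spell out how partially supporting citations are counted, so the rule used here is an assumption: a partial citation counts as precise only when no single citation fully supports the sentence but the citations together do. The `CitedSentence` record and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CitedSentence:
    labels: list                     # "full", "partial", or "none" for each citation
    jointly_supported: bool = False  # binary judgment over all citations together


def citation_precision(sentences):
    """Share of citations counted as supporting their associated sentence."""
    total = sum(len(s.labels) for s in sentences)
    if total == 0:
        return None  # no citations were generated
    precise = 0
    for s in sentences:
        full = s.labels.count("full")
        precise += full
        # ASSUMPTION: partial citations count only when no single citation is
        # fully supporting but the citations jointly support the sentence.
        if full == 0 and s.jointly_supported:
            precise += s.labels.count("partial")
    return precise / total


# Made-up annotations: 5 citations, 3 of them counted as precise.
annotated = [
    CitedSentence(["full", "none"], jointly_supported=True),
    CitedSentence(["partial", "partial"], jointly_supported=True),
    CitedSentence(["partial"], jointly_supported=False),
]
print(f"citation precision: {citation_precision(annotated):.2f}")
```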

Experimental results

In the fluency and perceived utility evaluation, every search engine was able to generate fluent and useful responses.

In the specific search engine evaluation, you can see that Bing Chat has the lowest fluency/usefulness rating (4.40/4.34), followed by NeevaAI (4.43/4.48), perplexity.ai (4.51/4.56), and YouChat (4.59/4.62).

Across categories of user queries, responses to shorter extractive questions are usually more fluent than responses to long questions, since such questions usually call only for factual knowledge; harder questions often require aggregating information from different tables or web pages, and this synthesis reduces overall fluency.

In the citation evaluation, existing generative search engines often fail to cite web pages comprehensively or correctly: on average, only 51.5% of generated sentences are fully supported by citations (recall), and only 74.5% of citations fully support their associated sentence (precision).

These figures are unacceptable for search engine systems that already have millions of users, especially since generated responses often contain a large amount of information.

Citation recall and precision also differ widely across generative search engines: perplexity.ai achieves the highest recall (68.7), while NeevaAI (67.6), Bing Chat (58.7), and YouChat (11.1) are lower.

Bing Chat, on the other hand, achieves the highest precision (89.5), followed by perplexity.ai (72.7), NeevaAI (72.0), and YouChat (63.6).

Across different user queries, the citation recall gap between NaturalQuestions queries with long answers and non-NaturalQuestions queries is close to 11 percentage points (58.5 and 47.8, respectively).

Similarly, the citation recall gap between NaturalQuestions queries with short answers and those without short answers is nearly 10 percentage points (63.4 for queries with short answers, 53.6 for queries with only long answers, and 53.4 for queries with neither long nor short answers).

Citation recall is lower for questions that lack supporting web pages. For example, on open-ended AllSouls essay questions, the citation recall of generative search engines is only 44.3.

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.