How fluent is GPT-4? Can it surpass human writing?
Summary generation is a natural language generation (NLG) task whose main purpose is to compress long texts into short summaries. It can be applied to many kinds of content, such as news articles, source code, and cross-lingual texts.
With the emergence of large language models (LLMs), the traditional approach of fine-tuning on task-specific datasets no longer seems applicable. This naturally raises the question: how effective are LLMs at generating summaries?
To answer this question, researchers from Peking University conducted a detailed investigation in the paper "Summarization is (Almost) Dead". Using human-generated evaluation datasets, they assessed the performance of LLMs on a variety of summarization tasks: single-news, multi-news, dialogue, source-code, and cross-lingual summarization. Quantitative and qualitative comparisons of LLM-generated summaries, human-written summaries, and summaries produced by fine-tuned models revealed that human evaluators significantly preferred the LLM-generated summaries.
After sampling and examining 100 summarization-related papers published in ACL, EMNLP, NAACL, and COLING over the past three years, the researchers found that the main contribution of roughly 70% of them was to propose a summarization method and verify its effectiveness on standard datasets. Hence the study's claim that "summarization is (almost) dead".
Even so, the researchers note that the field still faces challenges: issues such as the need for higher-quality reference datasets and improved evaluation methods remain to be resolved.
Paper link: https://arxiv.org/pdf/2309.09558.pdf
Methods and results
The study constructed its datasets from the latest available data, with each dataset consisting of 50 samples.
For the single-news, multi-news, and dialogue summarization tasks, the construction simulates the methods used to build the CNN/DailyMail and Multi-News datasets. For the cross-lingual summarization task, the study adopts the same strategy proposed by Zhu et al., and for the code summarization task it follows the method proposed by Bahrami et al.
With the datasets constructed, the next step is the baseline models. Specifically, the paper uses BART and T5 for the single-news task; Pegasus and BART for the multi-news task; T5 and BART for the dialogue task; mT5 and mBART for the cross-lingual task; and CodeT5 for the source-code task.
In the experiments, the study asked human evaluators to compare the overall quality of the different summaries. According to the results in Figure 1, the summaries generated by the LLM outperform both human-written summaries and those generated by fine-tuned models on all tasks.
This raises the question of why LLMs can outperform human-written summaries, which are traditionally assumed to be flawless. Preliminary observations indicate that LLM-generated summaries are remarkably fluent and coherent.
The paper further recruited annotators to identify hallucinations in the sentences of human-written and LLM-generated summaries. As Table 1 shows, human-written summaries exhibit the same number of hallucinations as GPT-4-generated summaries, or more. On specific tasks such as multi-news and code summarization, human-written summaries show markedly worse factual consistency.
Table 2 shows the proportion of hallucinations in human-written summaries and in GPT-4-generated summaries.
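The proportions reported in Table 2 come from per-sentence hallucination annotations. As a minimal sketch of how such proportions could be aggregated (the annotation data below is invented for illustration and is not from the paper):

```python
def hallucination_rate(annotations):
    """Fraction of summary sentences that annotators flagged as hallucinated.

    `annotations` maps each summary id to a list of booleans,
    one per sentence (True = hallucinated).
    """
    flags = [flag for sents in annotations.values() for flag in sents]
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical annotations for two summaries of three sentences each.
human_ann = {"doc1": [True, False, False], "doc2": [False, True, False]}
gpt4_ann = {"doc1": [False, False, False], "doc2": [False, True, False]}

print(f"human: {hallucination_rate(human_ann):.2f}")  # 2 of 6 sentences
print(f"gpt4:  {hallucination_rate(gpt4_ann):.2f}")   # 1 of 6 sentences
```

Pooling all sentences before dividing (rather than averaging per-summary rates) weights longer summaries more heavily; either choice is defensible, but it should be stated when reporting such a table.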
The paper also found that human-written reference summaries can lack fluency. As shown in Figure 2(a), human-written reference summaries sometimes contain incomplete information, and as shown in Figure 2(b), some of them exhibit hallucinations.
The study also found that summaries generated by fine-tuned models tend to have a fixed, rigid length, whereas the LLM can adjust its output length to the input. Moreover, when the input covers multiple topics, the summaries generated by fine-tuned models cover those topics poorly, as shown in Figure 3, while the LLM captures all of them.
The results in Figure 4 show that the human preference score for the large model exceeds 50%, indicating that people strongly prefer its summaries and highlighting the capability of LLMs in text summarization.
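A preference score like the one in Figure 4 is typically derived from pairwise comparisons, where each evaluator picks the summary they prefer. A hedged sketch of one common aggregation, with ties split evenly between the two sides (the vote data here is invented for illustration):

```python
from collections import Counter

def preference_score(votes):
    """Percentage of pairwise comparisons won by the LLM.

    `votes` is a list of judgments: "llm", "human", or "tie".
    A tie contributes half a win to each side.
    """
    counts = Counter(votes)
    wins = counts["llm"] + counts["tie"] / 2
    return 100 * wins / len(votes)

# Hypothetical judgments from five evaluators on the same summary pair.
votes = ["llm", "llm", "human", "tie", "llm"]
print(f"LLM preference: {preference_score(votes):.0f}%")  # 70%
```

Under this convention, a score above 50% means the LLM's summaries were preferred more often than not, which is the threshold the article's conclusion rests on.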