


High-scoring paper at COLM, the first conference dedicated to large language models: the preference search algorithm PairS makes LLM-based text evaluation more efficient

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work you would like to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The authors of the paper are all from the Language Technology Lab at the University of Cambridge. Liu Yinhong is a third-year doctoral student supervised by Professors Nigel Collier and Ehsan Shareghi; his research interests include large model and text evaluation and data generation. Zhou Han is a second-year doctoral student in the same lab, supervised by Professors Anna Korhonen and Ivan Vulić; his research focuses on efficient large models.
Large language models exhibit excellent instruction-following and task-generalization capabilities. These abilities stem from the use of instruction-following data and reinforcement learning from human feedback (RLHF) during LLM training. In the RLHF paradigm, the reward model is aligned with human preferences using ranking comparison data, which strengthens the alignment of LLMs with human values and yields responses that better assist humans and adhere to those values.
Recently, COLM, the first conference dedicated to large language models, announced its acceptance results. One high-scoring paper analyzes the scoring biases that are difficult to avoid or correct when an LLM is used as a text evaluator, and proposes recasting evaluation as a preference-ranking problem. Based on this idea, the authors design PairS, an algorithm that searches and sorts from pairwise preferences. By exploiting preference uncertainty and an assumption of LLM transitivity, PairS produces efficient and accurate preference rankings and shows higher agreement with human judgment across multiple test sets.
Paper link: https://arxiv.org/abs/2403.16950
Paper title: Aligning with Human Judgment: The Role of Pairwise Preference in Large Language Model Evaluators
Github address: https://github.com/cambridgeltl/PairS
What are the problems with large model evaluation?
Many recent works have demonstrated the strong performance of LLMs at evaluating text quality, establishing a new reference-free evaluation paradigm for generation tasks that avoids expensive human annotation. However, LLM evaluators are highly sensitive to prompt design and can be affected by multiple biases, including position bias, verbosity bias, and context bias. These biases prevent LLM evaluators from being fair and trustworthy, leading to inconsistencies and misalignment with human judgment.
To reduce biased predictions, previous work developed calibration techniques for LLM outputs. We first conduct a systematic analysis of how effective calibration is at aligning pointwise LLM evaluators. As shown in Figure 2 of the paper, existing calibration methods still fail to align the LLM evaluator well, even when supervision data is provided.
As shown in Formula 1, we argue that the main cause of evaluation misalignment is not the LLM's biased prior over the score distribution, but the misalignment of its evaluation criterion, that is, the LLM evaluator's likelihood. We believe LLM evaluators apply criteria more consistent with humans when making pairwise judgments, so we explore a new LLM evaluation paradigm to promote better-aligned judgments.
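The formula itself is not reproduced in this article; as a hedged sketch of the kind of factorization the text refers to (in our own notation, not necessarily the paper's), the pointwise score distribution can be read as a product of a score prior and a likelihood:

```latex
% Hedged reconstruction, our notation: x = candidate text, c = evaluation
% instruction/context, s = score. The prior captures the biased score
% distribution; the likelihood captures the evaluation criterion.
P(s \mid x, c) \;\propto\;
  \underbrace{P(x \mid s, c)}_{\text{likelihood (evaluation criterion)}}
  \;\cdot\;
  \underbrace{P(s \mid c)}_{\text{prior over scores}}
```

Under this reading, calibration only adjusts the prior term, which is why it cannot fix a misaligned likelihood.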
Inspiration brought by RLHF
As shown in Figure 1 of the paper, and inspired by how RLHF aligns reward models through preference data, we believe the LLM evaluator can produce more human-aligned predictions by generating a preference ranking. Some recent work obtains preference rankings by asking the LLM to perform pairwise comparisons, but the complexity and scalability of ranking-based evaluation have been largely overlooked: these methods ignore the transitivity assumption, so the number of comparisons grows as O(N^2), making the evaluation process expensive and infeasible at scale.
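To give a rough sense of scale (an illustrative calculation, not a figure from the paper), exhaustively comparing N = 100 candidates requires every pair, whereas a comparison-efficient sorting procedure needs only on the order of N log N comparisons:

```latex
% Illustrative comparison counts for N = 100 candidates (order of magnitude only)
\underbrace{\binom{100}{2} = 4950}_{\text{all pairs, } O(N^2)}
\quad\text{vs.}\quad
\underbrace{100 \log_2 100 \approx 664}_{\text{sorting-based, } O(N \log N)}
```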
PairS: Efficient Preference Search Algorithm
In this work, we propose two pairwise preference search algorithms, PairS-greedy and PairS-beam. PairS-greedy assumes complete transitivity and is based on merge sort; it obtains a global preference ranking with only O(N log N) complexity. The transitivity assumption means that, for any three candidates, if the LLM judges A ≻ B and B ≻ C, then it also judges A ≻ C. Under this assumption we can directly apply a classical sorting algorithm to build a preference ranking from pairwise preferences, as in the sketch below.
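Here is a minimal sketch of this idea, not the authors' implementation: merge sort where the comparator is a pairwise LLM judgment. The helper `llm_prefers` is a hypothetical stand-in for prompting the LLM to compare two candidates.

```python
from typing import Callable, List

def pairs_greedy(items: List[str], llm_prefers: Callable[[str, str], bool]) -> List[str]:
    """Rank candidates with merge sort, using an LLM pairwise comparison as the
    comparator. Assumes full transitivity, so only O(N log N) comparisons are
    needed. `llm_prefers(a, b)` is a hypothetical helper that returns True when
    the LLM judges a to be better than b."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = pairs_greedy(items[:mid], llm_prefers)
    right = pairs_greedy(items[mid:], llm_prefers)
    # Merge two already-ranked halves using pairwise comparisons.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if llm_prefers(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```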
However, LLMs do not exhibit perfect transitivity, so we also design the PairS-beam algorithm. Under a looser transitivity assumption, we derive and simplify a likelihood function for preference rankings. PairS-beam performs a beam search guided by this likelihood within each merge operation of the merge sort, and prunes the pairwise comparison space using the uncertainty of preferences. PairS-beam can trade off comparison complexity against ranking quality, and efficiently provides a maximum likelihood estimate (MLE) of the preference ranking. Figure 3 of the paper shows an example of how PairS-beam performs a merge operation; a simplified sketch of such a beam merge appears below.
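The following is our simplified illustration of a beam-search merge, not the paper's exact likelihood or pruning rule. It assumes a hypothetical helper `prob_prefers(a, b)` returning the LLM's estimated probability that a is preferred over b; each beam state tracks a partial merge and its cumulative log-likelihood, and near-certain comparisons expand only the likely branch, which prunes the comparison space.

```python
import math
from typing import Callable, List, Tuple

def beam_merge(left: List[str], right: List[str],
               prob_prefers: Callable[[str, str], float],
               beam_width: int = 4, tau: float = 0.9) -> List[str]:
    """Merge two ranked lists by beam search over their interleavings.
    A state is (i, j, merged, logp): items consumed from left/right, the
    partial merge, and its accumulated log-likelihood under the pairwise
    preference model. When a preference probability is confident
    (>= tau or <= 1 - tau), only the likely branch is expanded."""
    beams: List[Tuple[int, int, List[str], float]] = [(0, 0, [], 0.0)]
    while True:
        expanded, done = [], True
        for i, j, merged, logp in beams:
            if i == len(left) and j == len(right):
                expanded.append((i, j, merged, logp))   # already complete
                continue
            done = False
            if i == len(left):       # only right items remain
                expanded.append((i, len(right), merged + right[j:], logp))
                continue
            if j == len(right):      # only left items remain
                expanded.append((len(left), j, merged + left[i:], logp))
                continue
            p = prob_prefers(left[i], right[j])  # P(left[i] preferred over right[j])
            if p >= tau:             # confident: take left head only
                expanded.append((i + 1, j, merged + [left[i]], logp + math.log(p)))
            elif p <= 1 - tau:       # confident: take right head only
                expanded.append((i, j + 1, merged + [right[j]], logp + math.log(1 - p)))
            else:                    # uncertain: keep both hypotheses
                expanded.append((i + 1, j, merged + [left[i]], logp + math.log(p)))
                expanded.append((i, j + 1, merged + [right[j]], logp + math.log(1 - p)))
        # Keep only the highest-likelihood partial merges.
        beams = sorted(expanded, key=lambda s: s[3], reverse=True)[:beam_width]
        if done:
            break
    return beams[0][2]   # best-scoring merge among the surviving beams
```

Replacing the merge step of the greedy sketch above with such a beam merge conveys the flavor of PairS-beam: the beam width and the confidence threshold jointly trade off comparison cost against ranking quality.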
Experimental results
We tested on several representative datasets, including the closed-ended summarization tasks NewsRoom and SummEval and the open-ended story generation task HANNA, and compared against multiple baseline methods for pointwise LLM evaluation, including unsupervised direct scoring, G-Eval, and GPTScore, as well as the supervised UniEval and BARTScore. As shown in Table 1 of the paper, PairS achieves higher agreement with human ratings than these baselines on every task, and with GPT-4-turbo it even reaches state-of-the-art results.
The paper also compares two baseline methods for preference ranking, win rate and ELO rating; PairS achieves preference rankings of the same quality with only about 30% of the number of comparisons. The paper further offers insights into how pairwise preferences can be used to quantitatively measure the transitivity of LLM evaluators, and how pairwise evaluators benefit from calibration.
For more research details, please refer to the original paper.