The latest Stanford research cautions against placing too much faith in the emergent abilities of large models: they may simply be an artifact of metric selection.

"Don't be too superstitious about the emergence of large models. Where are there so many miracles in the world?" Researchers at Stanford University found that the emergence of large models is strongly related to the evaluation indicators of the task. It is not that the behavior of the model is related to specific tasks and tasks. For basic changes in scale, after replacing some more continuous and smooth indicators, the emergence phenomenon is less obvious and closer to linearity.

Recently, researchers have observed that large language models (LLMs), such as GPT, PaLM, and LaMDA, can exhibit so-called "emergent abilities" on various tasks. The term has received a lot of attention in the field of machine learning:


In fact, emergent properties have long been a focus of research in physics, biology, mathematics, and other disciplines.

One point worth noting is that Nobel laureate P.W. Anderson put forward the idea of "More Is Different". This view holds that as the complexity of a system increases, new properties may materialize even if they cannot (easily, or at all) be predicted from a precise quantitative understanding of the system's microscopic details.

How to define "emergence" in the field of large models? A colloquial way of saying this is "capabilities that are not present in small-scale models but are present in large-scale models", and therefore they cannot be predicted by simply extrapolating the performance improvements of small-scale models.

This emergent ability may have first been discovered in the GPT-3 family. Subsequent work highlighted the finding: "While model performance is predictable at a general level, on specific tasks its performance sometimes emerges at a scale that is quite unpredictable." In fact, these emergent capabilities were so surprising that "sudden, specific capability scaling" has been cited as one of the two most defining characteristics of LLMs. Terms such as "breakthrough capabilities" and "sharp left turns" have also been used.

To sum up, we can identify two defining attributes of the emergent abilities of LLMs:

1. Sharpness: the transition from "not present" to "present" appears to be instantaneous;

2. Unpredictability: the transition occurs at model scales that seem impossible to foresee.

Meanwhile, some questions remain open: What controls which capabilities emerge? What controls when they emerge? How can we make desirable capabilities emerge sooner and ensure that undesirable capabilities never emerge?

These questions are highly relevant to AI safety and alignment, because emergent capabilities suggest that larger models may one day, without warning, acquire dangerous capabilities that humans do not want them to have.

In a recent paper, researchers at Stanford University questioned the claim that LLMs have emergent capabilities.


Paper: https://arxiv.org/pdf/2304.15004.pdf

Specifically, the challenge is directed at the claim that model outputs change emergently and unpredictably as a function of model scale on specific tasks.

Their skepticism is based on the observation that emergent abilities seem to appear only under metrics that nonlinearly or discontinuously transform the model's per-token error rate. For example, on BIG-Bench tasks, more than 92% of claimed emergent abilities appear under one of two such metrics: multiple-choice grade and exact string match.


This raises the possibility of another explanation for the origin of LLMs' emergent abilities: although a model family's per-token error rate changes smoothly, continuously, and predictably as model scale increases, seemingly sharp and unpredictable changes may be produced by the researcher's choice of measurement.

That is, emergent abilities may be a mirage: caused mainly by the researcher choosing a metric that transforms the per-token error rate nonlinearly or discontinuously, partly by having too little test data to accurately estimate the performance of smaller models (making the smaller models appear completely unable to perform the task), and partly by evaluating too few large-scale models.
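
To see why, consider a minimal numerical sketch (purely illustrative; the per-token success probabilities and model scales below are made up, not taken from the paper): a per-token success rate that improves smoothly with scale looks sharp under exact-match accuracy on a multi-token target, but smooth under a per-token metric such as edit distance.

```python
import numpy as np

# Hypothetical model scales and a per-token success probability that
# improves smoothly (here: linearly in log-scale). Illustrative numbers only.
scales = np.logspace(8, 11, 7)                          # 1e8 .. 1e11 parameters
p_token = 0.30 + 0.65 * (np.log10(scales) - 8) / 3      # 0.30 -> 0.95, smoothly

target_len = 5  # e.g. a 5-digit answer string

# Exact-match accuracy requires every one of the target_len tokens to be right,
# so it behaves like p_token ** target_len: nearly geometric in the target length,
# which makes the curve across scales look sharp and "emergent".
exact_match = p_token ** target_len

# An edit-distance-style metric is roughly linear in the per-token error rate,
# so it improves smoothly and predictably with scale.
expected_edit_distance = target_len * (1.0 - p_token)

for n, p, acc, ed in zip(scales, p_token, exact_match, expected_edit_distance):
    print(f"{n:9.0e} params | p_token={p:.2f} | exact-match={acc:.3f} | edit distance~{ed:.2f}")
```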

To illustrate this explanation, the researchers present it in a simple mathematical model and demonstrate how it quantitatively reproduces the evidence offered in support of the emergent abilities of LLMs. They then test this explanation in three complementary ways:

1. Using the InstructGPT [24]/GPT-3 [3] family of models, making, testing, and confirming three predictions based on the alternative hypothesis.

2. Conducting a meta-analysis of previously published results and showing that, in the space of task-metric-model-family triplets, emergent abilities appear only under certain metrics rather than for particular model families on particular tasks. The study further shows that, with the model outputs held fixed, changing the metric makes the emergence phenomenon disappear.

3. Deliberately inducing emergent abilities in deep neural networks of different architectures across multiple vision tasks (where emergence has never before been claimed), showing how similar metric choices can induce seemingly emergent abilities.

Test 1: InstructGPT/GPT-3 model series analysis

The researchers chose the GPT family of models for further analysis because it is publicly queryable, unlike other model families (such as PaLM, LaMDA, Gopher, and Chinchilla). In previous research, the GPT family was thought to exhibit emergent abilities on integer arithmetic tasks, so the researchers again chose integer arithmetic as the task here.


## Figure 2: The claimed emergent abilities of large language models are artifacts of the researcher's analysis rather than fundamental changes in model outputs with scale.

As explained mathematically and graphically in Section 2, the alternative explanation proposed by the researchers predicts three outcomes:

1. As model scale increases, changing the metric from a nonlinear/discontinuous one (Figure 2CD) to a linear/continuous one (Figure 2EF) should reveal smooth, continuous, and predictable performance improvements.

2. For nonlinear metrics, improving the resolution of measured model performance by increasing the test dataset size should reveal smooth, continuous, and predictable model improvements, in proportion to the predictable nonlinear effect of the chosen metric.

3. Regardless of the metric, increasing the target string length should affect model performance as a function of the length-1 target performance: almost geometrically for accuracy, and almost quasi-linearly for token edit distance.

To test these three predictions, the researchers collected the string outputs of the InstructGPT/GPT-3 family on two arithmetic tasks via the OpenAI API: multiplication of two two-digit integers and addition of two four-digit integers.


## Figure 3: As model scale increases, changing the metric reveals smooth, continuous, and predictable performance changes.

From left to right: mathematical model, multiplication of two two-digit integers, addition of two four-digit integers. Top row: model performance measured with a nonlinear metric (accuracy); the InstructGPT/GPT-3 family's performance appears sharp and unpredictable at longer target lengths. Bottom row: performance measured with a linear metric (token edit distance); the same family shows smooth, predictable improvements on the very ability previously claimed to be emergent.

Prediction: Emergent ability disappears under linear measures

On both the integer multiplication and integer addition tasks, the GPT family exhibits emergent arithmetic abilities when the target string is 4 or 5 digits long and performance is measured by accuracy (top row of Figure 3). However, if the metric is changed from nonlinear to linear while keeping the model outputs fixed, the family's performance improves smoothly, continuously, and predictably. This confirms the prediction and suggests that the source of the sharpness and unpredictability is the researcher's chosen metric, not changes in the model's outputs. It can also be seen that, under token edit distance, increasing the target string length from 1 to 5 causes the family's performance to decline in a foreseeable, almost quasi-linear way, consistent with the first half of the third prediction.
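
For intuition, here is a small sketch of the two kinds of metric applied to arithmetic string outputs (a character-level edit distance is used as a stand-in for the paper's token edit distance, and the example answers are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the classic dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution / match
    return dp[-1]

targets = ["7742", "10278"]   # hypothetical ground-truth answers
outputs = ["7742", "10378"]   # hypothetical model outputs (second is off by one digit)

# Accuracy is all-or-nothing: the near-miss counts exactly as much as gibberish would.
exact_match = sum(o == t for o, t in zip(outputs, targets)) / len(targets)

# Edit distance gives partial credit, so it varies smoothly as outputs improve.
mean_edit_distance = sum(levenshtein(o, t) for o, t in zip(outputs, targets)) / len(targets)

print(f"exact-match accuracy: {exact_match:.2f}")          # 0.50
print(f"mean edit distance:   {mean_edit_distance:.2f}")   # 0.50 (one wrong character)
```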

Prediction: Emergent capabilities disappear under higher-resolution evaluation

The second prediction: even under a nonlinear metric such as accuracy, smaller models' accuracy is not zero but a nonzero value above chance, at a level consistent with choosing accuracy as the metric. To increase resolution and estimate model accuracy more precisely, the researchers generated additional test data and found that, on both the integer multiplication and integer addition tasks, all models in the InstructGPT/GPT-3 family achieved positive, above-chance accuracy (Figure 4). This confirms the second prediction. It can also be seen that accuracy decays almost geometrically as the target string length increases, consistent with the second half of the third prediction. These results further indicate that the chosen accuracy metric has (approximately) the effect we should expect, namely an almost geometric decay with target length.
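
The resolution argument can be illustrated with a quick simulation (the accuracy value and test-set sizes are made up): a small model whose true accuracy is tiny but above chance will often score exactly zero on a small test set, and only a much larger test set reveals the nonzero value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.001   # hypothetical small-model accuracy: tiny, but above chance

for n_test in (100, 1_000, 100_000):
    # Each test prompt is answered correctly with probability true_accuracy.
    n_correct = rng.binomial(n_test, true_accuracy)
    print(f"n_test={n_test:>7}: estimated accuracy = {n_correct / n_test:.5f}")
```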


## Figure 4: Using more test data yields better accuracy estimates, revealing that performance changes are smooth, continuous, and predictable.

From left to right: mathematical model, multiplication of two two-digit integers, addition of two four-digit integers. Increasing resolution by generating more test data reveals that even under the accuracy metric, the InstructGPT/GPT-3 family performs above chance, and that its improvement on both supposedly emergent capabilities is smooth, continuous, and predictable, qualitatively consistent with the mathematical model.

Test 2: Meta-analysis of model emergence

Since the GPT family of models is publicly queryable, it can be analyzed directly. However, other models also claimed to have emergent capabilities (such as PaLM, Chinchilla, and Gopher) are not publicly available, and the outputs they generate are not public either, which limits the researchers to analyzing published results. Based on their alternative hypothesis, the researchers made two predictions:

  • First, at the "population level" of "task-metric-model family" triplets, emergent abilities should appear predominantly when nonlinear and/or discontinuous metrics are chosen to evaluate model performance on a task.
  • Second, for a specific "task-metric-model family" triplet that exhibits an emergent ability, changing the metric to a linear and/or continuous one should eliminate the emergent ability.

To test these two predictions, the researchers examined claims of emergent abilities on the BIG-Bench evaluation suite, whose benchmarks are publicly available and well documented.

Prediction: Emergent capabilities should appear primarily on non-linear/discontinuous measures

To test the first prediction, the researchers analyzed under which metrics different "task-model family" pairings exhibit emergent abilities. To determine whether a "task-metric-model family" triplet is likely to exhibit emergent abilities, they borrowed the definition introduced in the paper "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models". Let y_i ∈ R denote model performance at model scale x_i ∈ R, with the x_i sorted in ascending order; an emergence score is then computed over the resulting sequence of (scale, performance) pairs.
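
The formula itself is not reproduced here; as a hedged sketch, the score can be read as the signed overall range of performance across scales divided by the typical (root-median-square) step-to-step change, so that one large jump among many small steps yields a high score. The exact normalization in the cited definition may differ.

```python
import numpy as np

def emergence_score(y):
    """Hedged sketch of an emergence score: signed range of performance across
    (sorted) model scales, normalized by the root-median-square of consecutive
    differences. Details may differ from the cited definition."""
    y = np.asarray(y, dtype=float)   # y[i] = performance at the i-th smallest scale
    diffs = np.diff(y)
    signed_range = np.sign(np.argmax(y) - np.argmin(y)) * (y.max() - y.min())
    return signed_range / np.sqrt(np.median(diffs ** 2))

smooth_family = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]   # steady improvement with scale
sharp_family  = [0.01, 0.01, 0.02, 0.02, 0.03, 0.60]   # sudden jump at the largest scale

print(f"smooth family score: {emergence_score(smooth_family):.1f}")   # small
print(f"sharp family score:  {emergence_score(sharp_family):.1f}")    # large
```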


Results: The researchers found that under most of the metrics used in BIG-Bench, no "task-model family" pairing exhibits emergent abilities: of the 39 preferred BIG-Bench metrics, at most 5 show emergence (Figure 5A). Most of these 5 are nonlinear and/or discontinuous, such as exact string match, multiple-choice grade, and ROUGE-L-Sum. Notably, since BIG-Bench usually evaluates a model's task performance with multiple metrics, the absence of emergence under the other metrics suggests that emergent abilities do not appear when the same model outputs are evaluated with those other metrics.

Since the emergence score only indicates possible emergence, the researchers further analyzed the hand-annotated "task-metric-model family" triplets from the paper "137 emergent abilities of large language models". The hand-annotated data show that only 4 of the 39 metrics exhibit emergent abilities (Figure 5B), and 2 of them account for more than 92% of the claimed emergent abilities (Figure 5C): multiple-choice grade and exact string match. Multiple-choice grade is discontinuous, and exact string match is nonlinear (performance varies nearly geometrically with target length). Overall, these results suggest that emergent abilities appear only under a very small number of nonlinear and/or discontinuous metrics.

## Figure 5: Emergent abilities appear under only a few metrics. (A) Of the 39 preferred BIG-Bench metrics, possible emergent abilities appear under at most 5. (B) Hand-annotated data from the cited paper show emergent abilities under only 4 preferred metrics. (C) More than 92% of claimed emergent abilities occur under one of two metrics: multiple-choice grade and exact string match.

Prediction: Emergent abilities should disappear when the nonlinear/discontinuous metric is replaced

For the second prediction, the researchers analyzed the hand-annotated emergent abilities from the paper cited above. They focused on the LaMDA family because its outputs are available through BIG-Bench, whereas outputs from other model families are not. The smallest published LaMDA model has 2 billion parameters, but many LaMDA models listed in BIG-Bench are much smaller; because the researchers could not determine the provenance of these smaller models, they excluded them from the analysis. The researchers identified tasks on which LaMDA exhibits emergent abilities under the multiple-choice grade metric and then asked: does LaMDA still exhibit emergent abilities on the same tasks under a different BIG-Bench metric, the Brier score? The Brier score is a strictly proper scoring rule for predictions over mutually exclusive outcomes; for a binary outcome, it reduces to the mean squared error between the outcome and its predicted probability mass.
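
To make the contrast concrete, here is a hedged sketch of the two metrics on a single multiple-choice question (the probabilities are invented, and the multi-class Brier form below is one common generalization of the binary case described above, not necessarily BIG-Bench's exact implementation):

```python
import numpy as np

def multiple_choice_grade(probs, correct_idx):
    """Discontinuous: 1 if the highest-probability option is the correct one, else 0."""
    return float(np.argmax(probs) == correct_idx)

def brier_score(probs, correct_idx):
    """Continuous: mean squared error between the predicted probabilities and the
    one-hot outcome (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.zeros_like(probs)
    onehot[correct_idx] = 1.0
    return float(np.mean((probs - onehot) ** 2))

# Two hypothetical models answering the same 4-way question (option 2 is correct).
smaller = np.array([0.30, 0.28, 0.27, 0.15])   # nearly right, but the argmax is wrong
larger  = np.array([0.20, 0.25, 0.35, 0.20])   # slightly better, argmax now correct

for name, p in (("smaller", smaller), ("larger", larger)):
    print(f"{name}: grade={multiple_choice_grade(p, 2):.0f}, brier={brier_score(p, 2):.3f}")
```

The grade jumps from 0 to 1 even though the underlying probabilities barely changed, while the Brier score improves only modestly.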

The researchers found that when the discontinuous metric (multiple-choice grade) was replaced with the continuous metric (Brier score), LaMDA's emergent abilities disappeared (Figure 6). This further illustrates that the cause of the emergent ability is not an essential change in model behavior with scale, but the use of a discontinuous metric.


## Figure 6: With the task and model family held fixed, changing the BIG-Bench metric makes the emergent ability disappear. Top row: the LaMDA model family exhibits emergent abilities under a discontinuous metric (multiple-choice grade). Bottom row: under a continuous BIG-Bench metric (Brier score), the LaMDA model family no longer shows emergence on the same tasks.

Test 3: Inducing emergent capabilities in deep neural networks

The researchers' third test shows that metric choice alone can produce emergent capabilities: they demonstrate how to make deep neural networks of different architectures (fully connected, convolutional, self-attentional) appear to exhibit emergent abilities. They focus on vision tasks for two reasons. First, attention is currently centered on the emergent abilities of large language models, because in vision models a sudden shift from no ability to ability has not yet been claimed. Second, some vision tasks can be solved with modestly sized networks, so the researchers can build complete model families spanning multiple orders of magnitude.

A convolutional network emerges with the ability to classify MNIST handwritten digits

The researchers first induced emergent classification ability in a family of LeNet convolutional neural networks trained on the MNIST handwritten digit dataset. This family shows a smooth increase in test accuracy as the number of parameters grows (Figure 7B). To mimic the accuracy metrics used in papers on emergence, they use subset accuracy: a network's subset accuracy is 1 if it correctly classifies all K of K (independent) test samples, and 0 otherwise. Under this definition of accuracy, as K increases from 1 to 5, the model family appears to "emerge" with the ability to correctly classify sets of MNIST digits, especially when combined with sparse sampling of model sizes (Figure 7C). The emergent classification ability of this convolutional family is qualitatively consistent with emergent abilities in published papers, such as results on the BIG-Bench topographic mapping task (Figure 7A).
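
A hedged sketch of the subset-accuracy construction (the grouping and per-image accuracies are invented): if per-image accuracy improves smoothly across a model family, requiring all K images in a group to be correct yields a curve close to the per-image accuracy raised to the K-th power, which looks far sharper.

```python
import numpy as np

def subset_accuracy(per_item_correct, K):
    """Split the test items into groups of K; a group scores 1 only if every
    item in it is classified correctly, otherwise 0."""
    c = np.asarray(per_item_correct, dtype=bool)
    c = c[: (len(c) // K) * K]            # drop the remainder so all groups are full
    return c.reshape(-1, K).all(axis=1).mean()

rng = np.random.default_rng(0)
for per_image_acc in (0.60, 0.80, 0.95, 0.99):      # a smoothly improving (hypothetical) family
    correct = rng.random(10_000) < per_image_acc    # simulated per-image outcomes
    line = ", ".join(f"K={K}: {subset_accuracy(correct, K):.3f}" for K in (1, 3, 5))
    print(f"per-image accuracy {per_image_acc:.2f} -> {line}")
```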

## Figure 7: Inducing emergent MNIST classification ability in a convolutional network. (A) A published emergent ability on the BIG-Bench topographic mapping task. (B) LeNet trained on MNIST shows a predictable, commonly observed sigmoidal increase in test accuracy as the number of model parameters grows. (C) When accuracy is redefined as correctly classifying K out of K independent test samples, the newly defined metric induces a seemingly unpredictable change.


A nonlinear autoencoder emerges with reconstruction ability on the CIFAR100 natural image set

To emphasize that the sharpness comes from the researcher's chosen metric, and that this sharpness is not limited to metrics such as accuracy, the researchers also induced shallow (i.e., single-hidden-layer) nonlinear autoencoders trained on the CIFAR100 natural image dataset to emerge with the ability to reconstruct image inputs. To this end, they deliberately defined a new discontinuous metric of model capability: the average number of test examples whose squared reconstruction error falls below a fixed threshold c:

Reconstruction ability ≜ (1/N) Σ_{n=1}^{N} I[ ||x_n − x̂_n||² < c ]     (Equation 2)

where I(·) is an indicator variable and x̂_n is the autoencoder's reconstruction of x_n. The researchers swept the number of bottleneck units in the autoencoder and found that, as model size increases, the network's mean squared reconstruction error declines smoothly (Figure 8B), but under the newly defined reconstruction metric, for the chosen c, this autoencoder family's ability to reconstruct the dataset appears sharp and nearly unpredictable (Figure 8C). This result is qualitatively consistent with emergent abilities in published papers, such as the BIG-Bench periodic elements task (Figure 8A).
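
A hedged sketch of this thresholded metric (the threshold c, the error distribution, and the bottleneck sizes are invented): mean squared reconstruction error can fall smoothly across the family while the fraction of images under the threshold moves abruptly from near 0 to near 1.

```python
import numpy as np

def reconstruction_ability(squared_errors, c):
    """Equation-2-style metric: fraction of test images whose squared
    reconstruction error falls below the fixed threshold c."""
    return float((np.asarray(squared_errors) < c).mean())

rng = np.random.default_rng(0)
c = 1.0   # arbitrary threshold, chosen only for illustration

# Hypothetical autoencoder family: mean squared error shrinks smoothly with capacity.
for n_bottleneck, mean_err in [(2, 4.0), (8, 2.0), (32, 1.2), (128, 0.7), (512, 0.4)]:
    # Simulated per-image squared errors concentrated around mean_err.
    errs = rng.gamma(shape=20.0, scale=mean_err / 20.0, size=5_000)
    print(f"{n_bottleneck:>4} bottleneck units | MSE~{errs.mean():.2f} | "
          f"fraction below c: {reconstruction_ability(errs, c):.3f}")
```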


## Figure 8: Inducing emergent reconstruction ability in a shallow nonlinear autoencoder. (A) A published emergent ability on the BIG-Bench periodic elements task. (B) A shallow nonlinear autoencoder trained on CIFAR100 exhibits a smoothly declining mean squared reconstruction error. (C) The newly defined reconstruction metric (Equation 2) induces a seemingly unpredictable change.

An autoregressive Transformer emerges with classification ability on the Omniglot character set

Finally, the researchers induced emergent abilities in Transformers that autoregressively classify Omniglot handwritten characters. The experimental setup is similar: Omniglot images are first embedded by a convolutional layer, then a decoder-only Transformer receives a sequence of [embedded image, image class label] pairs and is trained to predict the Omniglot class labels. The researchers measured image classification performance on sequences of length L ∈ [1, 5], again using subset accuracy: the subset accuracy is 1 if all L images are classified correctly (Figure 9B) and 0 otherwise. The causal Transformer appears to exhibit an emergent ability to correctly classify Omniglot handwritten characters (Figure 9C), a result qualitatively consistent with emergent abilities in published papers, such as on large-scale multi-task language understanding (Figure 9A).


## Figure 9: Inducing emergent classification ability in an autoregressive Transformer. (A) A published emergent ability on the MMLU benchmark. (B) As model parameters increase, the test accuracy of a Transformer that autoregressively classifies Omniglot handwritten characters also increases. (C) When accuracy is redefined as correctly classifying every image in a sequence, the metric becomes harder to predict, seemingly indicating an induced emergent ability.
