
Latest Stanford research cautions against too much faith in the emergent abilities of large models: they may be just an artifact of metric choice.


"Don't be too superstitious about the emergence of large models. Where are there so many miracles in the world?" Researchers at Stanford University found that the emergence of large models is strongly related to the evaluation indicators of the task. It is not that the behavior of the model is related to specific tasks and tasks. For basic changes in scale, after replacing some more continuous and smooth indicators, the emergence phenomenon is less obvious and closer to linearity.

Recently, researchers have observed that large language models (LLMs) such as GPT, PaLM, and LaMDA exhibit so-called "emergent abilities" on a variety of tasks. The term has attracted considerable attention in machine learning.


In fact, emergent properties have long been a focus of research in physics, biology, mathematics, and other disciplines.

Notably, Nobel laureate P.W. Anderson made this case in "More Is Different": as the complexity of a system increases, new properties may materialize that cannot (easily, or at all) be predicted even from a precise quantitative understanding of the system's microscopic details.

How should "emergence" be defined for large models? Colloquially, emergent abilities are "capabilities that are not present in small-scale models but are present in large-scale models", and therefore cannot be predicted by simply extrapolating the performance improvements of smaller models.

Such emergent abilities may have first been observed in the GPT-3 family. Some subsequent work emphasized the finding: "while model performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably at scale." Indeed, these emergent abilities are so surprising that "abrupt, specific capability scaling" has been cited as one of the two most defining characteristics of LLMs. Terms such as "breakthrough capabilities" and "sharp left turns" have also been used.

To sum up, two defining attributes of LLM emergent abilities can be identified:

1. Sharpness: the transition from "absent" to "present" appears to be instantaneous;

2. Unpredictability: the transition occurs at model scales that seem impossible to foresee.

Meanwhile, some questions remain open: What controls which capabilities emerge? What controls when they emerge? How can we make desirable capabilities emerge sooner, and ensure that undesirable ones never emerge?

These questions bear on AI safety and alignment, since emergent capabilities would signal that larger models may one day, without warning, acquire dangerous capabilities that humans do not want them to have.

In a recent paper, researchers at Stanford University challenged the claim that LLMs possess emergent abilities.


Paper: https://arxiv.org/pdf/2304.15004.pdf

Specifically, the challenge targets the claim that model outputs change sharply and unpredictably as a function of model scale on specific tasks.

Their skepticism rests on an observation: models appear to show emergence only under metrics that nonlinearly or discontinuously scale the model's per-token error rate. For example, on BIG-Bench tasks, more than 92% of claimed emergent abilities appear under just two such metrics: Multiple Choice Grade and Exact String Match.


This suggests an alternative explanation for the origin of LLMs' emergent abilities: although a model family's per-token error rate changes smoothly, continuously, and predictably as model scale increases, seemingly sharp and unpredictable changes in performance may be produced by the researcher's choice of measurement.

That is, emergent abilities may be a mirage: caused mainly by the researcher choosing a metric that transforms the per-token error rate nonlinearly or discontinuously, partly by having too little test data to accurately estimate the performance of smaller models (making them appear entirely incapable of the task), and partly by evaluating too few large-scale models.
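To see the mechanism concretely, consider a toy sketch (our illustration with made-up numbers, not the paper's code) of a model family whose per-token success probability improves smoothly with scale. An all-or-nothing metric like exact string match bends that smooth curve into an apparent discontinuity, while a per-token metric does not:

```python
import numpy as np

# Per-token success probability rising smoothly with scale (a toy
# power law in parameter count; the shape, not the numbers, matters).
params = np.logspace(8, 11, 7)                 # 1e8 .. 1e11 parameters
p_token = np.exp(-(1e9 / params) ** 0.5)       # smooth, monotone in scale

L = 5                                          # target string length (tokens)
exact_match = p_token ** L                     # nonlinear metric: all-or-nothing
edit_distance = L * (1 - p_token)              # linear metric: per-token credit

for n, p, em, ed in zip(params, p_token, exact_match, edit_distance):
    print(f"{n:12.0f}  p_token={p:.3f}  exact_match={em:.3f}  edit_dist={ed:.2f}")
# exact_match hugs zero for small models, then shoots up, looking
# 'emergent', while edit_distance improves smoothly across all scales.
```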

To illustrate this explanation, the researchers cast it as a simple mathematical model and show that it quantitatively reproduces the evidence offered in support of LLMs' emergent abilities. They then test the explanation in three complementary ways:

1. Using the InstructGPT [24]/GPT-3 [3] model family, they formulate, test, and confirm three predictions based on the alternative hypothesis.

2. They conduct a meta-analysis of previously published results and show that, in the space of task-metric-model-family triplets, emergent abilities appear only for certain metrics, not for model families on particular tasks. They further show that, with the model outputs held fixed, changing the metric makes the emergence phenomenon disappear.

3. They deliberately induce seemingly emergent abilities in deep neural networks of several architectures across multiple vision tasks (where emergence has never before been claimed), demonstrating how similar metric choices can produce them.

Test 1: InstructGPT/GPT-3 model series analysis

The researchers chose the GPT family for further analysis because, unlike other model families (such as PaLM, LaMDA, Gopher, and Chinchilla), it is publicly queryable. In previous research, the GPT family was reported to exhibit emergent abilities on integer arithmetic tasks, so the researchers chose integer arithmetic here as well.


Figure 2: The emergent abilities of large language models are artifacts of the analyst's choices rather than fundamental changes in model outputs with scale.

As explained mathematically and graphically in Section 2, the alternative explanation proposed by the researchers predicts three outcomes:

1. As model scale increases, changing the metric from a nonlinear/discontinuous one (Figure 2CD) to a linear/continuous one (Figure 2EF) should reveal smooth, continuous, predictable performance improvements.

2. For nonlinear metrics, increasing the resolution at which model performance is measured, by enlarging the test dataset, should reveal smooth, continuous, predictable model improvements, in proportion to the predictable nonlinear effect of the chosen metric.

3. Regardless of the metric, increasing the target string length should affect model performance as a function of the length-1 target performance: almost geometrically for accuracy, and almost quasi-linearly for token edit distance (formalized just below).
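Both functional forms follow directly from the paper's simple mathematical model. If each of the L target tokens is produced correctly independently with per-token probability p, and errors are treated as substitutions (a simplification, stated as our reading of the model), then

$$\mathbb{E}[\text{Accuracy}] \approx p^{L}, \qquad \mathbb{E}[\text{Token Edit Distance}] \approx L\,(1 - p),$$

so accuracy decays nearly geometrically in target length while token edit distance degrades nearly linearly.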

To test these three predictions, the researchers collected string outputs from the InstructGPT/GPT-3 model family on two arithmetic tasks, using the OpenAI API: multiplication of two two-digit integers and addition of two four-digit integers.


Figure 3: As model scale increases, changing the metric yields smooth, continuous, predictable changes in performance.

From left to right: the mathematical model, the two-digit multiplication task, and the four-digit addition task. Top: measured with a nonlinear metric such as accuracy, the InstructGPT/GPT-3 family's performance appears sharp and unpredictable at longer target lengths. Bottom: measured with a linear metric such as token edit distance, the same family shows smooth, predictable performance improvements on the very abilities claimed to be emergent.

Prediction: Emergent abilities disappear under linear metrics

On both the integer multiplication and addition tasks, the GPT family exhibits emergent arithmetic ability when the target string is 4 or 5 digits long and performance is measured by accuracy (Figure 3, top row). However, changing the metric from nonlinear to linear while keeping the model outputs fixed makes the family's performance improve smoothly, continuously, and predictably. This confirms the prediction, suggesting that the source of the sharpness and unpredictability is the researcher's chosen metric, not changes in the model outputs. It can also be seen that under token edit distance, increasing the target string length from 1 to 5 degrades the family's performance predictably, in an almost quasi-linear downward trend, consistent with the first half of the third prediction.

Prediction: Emergent abilities disappear under higher-resolution evaluations

Next, the second prediction: even under a nonlinear metric such as accuracy, smaller models' accuracy is not zero but a nonzero value above chance, at a scale set by the choice of accuracy as the metric. To increase the resolution and estimate model accuracy more precisely, the researchers generated additional test data, and found that on both the integer multiplication and integer addition tasks, every model in the InstructGPT/GPT-3 family achieved positive accuracy above chance (Figure 4). This confirms the second prediction. It can also be seen that accuracy decays almost geometrically as target string length increases, consistent with the second half of the third prediction. These results further confirm that the chosen accuracy metric has the (approximately) geometric decay with target length that we should expect.
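A quick toy simulation (ours, not the paper's data) shows why test-set size matters here: a small but genuinely nonzero accuracy is frequently reported as exactly zero when the test set is small, making small models look entirely incapable:

```python
import numpy as np

rng = np.random.default_rng(0)

true_acc = 0.004           # small but decidedly above zero
small_set, big_set = 200, 100_000

# Repeat a 200-item evaluation 1,000 times and count all-zero runs.
draws = rng.random((1_000, small_set)) < true_acc
print("share of 200-item evals reporting exactly 0%:",
      (draws.sum(axis=1) == 0).mean())          # roughly 0.45
print("estimate from a 100k-item eval:",
      (rng.random(big_set) < true_acc).mean())  # close to 0.004
# With only 200 test items, a model with 0.4% accuracy is reported
# as 0% in nearly half of evaluations; a larger test set resolves
# the nonzero floor.
```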


Figure 4: More test data yields better accuracy estimates, revealing that performance changes are smooth, continuous, and predictable.

From left to right: the mathematical model, the two-digit multiplication task, and the four-digit addition task. Increasing the resolution with additional test data reveals that the InstructGPT/GPT-3 family's performance is above chance even under the accuracy metric, and that its improvement on both claimed emergent abilities is smooth, continuous, and predictable, qualitatively matching the mathematical model.

Test 2: Meta-analysis of model emergence

Because the GPT family is publicly queryable, it can be analyzed directly. However, other models claimed to have emergent abilities (such as PaLM, Chinchilla, and Gopher) are not publicly accessible, and their generated outputs are not public, so the researchers were limited to analyzing previously published results. They made two predictions based on their alternative hypothesis:

  • First, at the "population level" of task-metric-model-family triplets, emergent abilities should appear mostly on tasks evaluated with nonlinear and/or discontinuous metrics.
  • Second, for a specific task-metric-model-family triplet that exhibits an emergent ability, switching to a linear and/or continuous metric should eliminate the emergence.

To test these two predictions, the researchers examined claims of emergent abilities on the BIG-Bench evaluation suite, whose benchmarks are publicly available and well documented.

Prediction: Emergent abilities should appear mainly under nonlinear/discontinuous metrics

To test the first prediction, the researchers analyzed which metrics give rise to emergent abilities for the various task-model-family pairs. To decide whether a task-metric-model-family triplet plausibly exhibits emergence, they borrowed the definition introduced in the paper "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models". Let y_i ∈ R denote model performance at model scale x_i ∈ R, ordered so that x_i < x_{i+1}; the emergence score is then

$$\text{Emergence Score}\left(\{(x_i, y_i)\}_{i=1}^{n}\right) \equiv \operatorname{sign}\left(\arg\max_i y_i - \arg\min_i y_i\right)\frac{\max_i y_i - \min_i y_i}{\sqrt{\operatorname{Median}\left(\{(y_i - y_{i-1})^2\}_i\right)}}$$
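A direct transcription of this score in Python (assuming the reconstruction above matches the BIG-Bench definition): the full performance range, signed by whether the best-performing model is larger than the worst, and normalized by the typical step between adjacent scales.

```python
import numpy as np

def emergence_score(y):
    """Emergence score over performances y ordered by model scale, as
    reconstructed above: large when one jump dwarfs the typical step."""
    y = np.asarray(y, dtype=float)
    sign = np.sign(np.argmax(y) - np.argmin(y))   # +1 if the biggest model is best
    steps = np.diff(y) ** 2                       # squared adjacent-scale steps
    return sign * (y.max() - y.min()) / np.sqrt(np.median(steps))

print(emergence_score([0.02, 0.03, 0.04, 0.05, 0.06]))  # smooth curve: ~4
print(emergence_score([0.02, 0.02, 0.03, 0.03, 0.60]))  # one big jump: ~82
```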


Results: the researchers found that for most metrics used in BIG-Bench, no task-model-family pair exhibits emergence. Of the 39 preferred BIG-Bench metrics, at most 5 show emergent abilities (Figure 5A), and most of those 5 are nonlinear and/or discontinuous, e.g., exact string match, multiple-choice grade, and ROUGE-L-Sum. Notably, since BIG-Bench usually evaluates a model's task performance with several metrics, the absence of emergence under the other metrics shows that emergent abilities do not appear when the same model outputs are evaluated differently.

Since the emergence score only indicates possible emergence, the researchers further analyzed the hand-annotated task-metric-model-family triplets from the paper "137 emergent abilities of large language models". The hand-annotated data show that only 4 of the 39 metrics exhibit emergent abilities (Figure 5B), and 2 of them account for more than 92% of the claimed emergent abilities (Figure 5C): multiple-choice grade and exact string match. Multiple-choice grade is discontinuous, and exact string match is nonlinear (it changes nearly geometrically with target length). Overall, these results suggest that emergent abilities appear only under a very small number of nonlinear and/or discontinuous metrics.

Figure 5: Emergent abilities appear only under a few metrics. (A) Of the 39 preferred BIG-Bench metrics, at most 5 show possible emergent abilities. (B) Hand-annotated data from the cited paper show that only 4 preferred metrics exhibit emergence. (C) More than 92% of claimed emergent abilities appear under one of two metrics: multiple-choice grade and exact string match.

Prediction: Replacing nonlinear/discontinuous metrics should eliminate emergent abilities

For the second prediction, the researchers analyzed the hand-annotated emergent abilities from the paper cited above. They focused on the LaMDA family because its outputs are available through BIG-Bench, whereas other model families' outputs are not. The smallest published LaMDA model has 2 billion parameters, but many LaMDA models in BIG-Bench are much smaller; because the researchers could not determine the provenance of these smaller models, they excluded them from the analysis. The researchers identified tasks on which LaMDA exhibits emergence under the multiple-choice grade metric, then asked whether LaMDA still exhibits emergence on the same tasks under another BIG-Bench metric, the Brier score. The Brier score is a strictly proper scoring rule for predictions of mutually exclusive outcomes; for a binary outcome, it reduces to the mean squared error between the outcome and its predicted probability mass.
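To make the contrast concrete, here is a minimal sketch (our illustration, not the paper's code) of why the Brier score responds continuously to model confidence while thresholded correctness does not:

```python
import numpy as np

def brier_score(prob, outcome):
    """Brier score for binary outcomes: mean squared error between the
    predicted probability and the 0/1 outcome (lower is better)."""
    prob, outcome = np.asarray(prob, float), np.asarray(outcome, float)
    return np.mean((prob - outcome) ** 2)

def accuracy(prob, outcome, threshold=0.5):
    """Discontinuous counterpart: thresholded correctness."""
    return np.mean((np.asarray(prob) > threshold) == np.asarray(outcome).astype(bool))

# A model whose confidence in the correct answer creeps up with scale:
outcomes = np.ones(4)                       # the correct label is 1 in each case
for p in [0.45, 0.49, 0.51, 0.55]:
    probs = np.full(4, p)
    print(p, brier_score(probs, outcomes), accuracy(probs, outcomes))
# The Brier score improves smoothly (0.3025 -> 0.2025) while accuracy
# jumps from 0 to 1 the instant confidence crosses 0.5.
```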

The researchers found that when the discontinuous multiple-choice grade metric is swapped for the continuous Brier score (Figure 6), LaMDA's emergent abilities disappear. This further indicates that the cause of emergence is not an essential change in model behavior with scale, but the use of discontinuous metrics.


Figure 6: With the task and model family held fixed, changing the BIG-Bench metric makes the emergent ability disappear. Top row: under a discontinuous metric (multiple-choice grade), the LaMDA model family appears to exhibit emergent abilities. Bottom row: under a continuous BIG-Bench metric (Brier score), the LaMDA model family no longer exhibits emergence on the same tasks.

Test 3: Inducing emergent capabilities in deep neural networks

The researchers' final test: deep neural networks of different architectures (fully connected, convolutional, self-attention) can be made to produce seemingly emergent capabilities. They focused on vision tasks for two reasons. First, attention currently centers on the emergent abilities of large language models, since for vision models no sudden shift from absent to present ability has yet been claimed. Second, some vision tasks can be solved by modestly sized networks, so complete model families spanning multiple orders of magnitude can be built.

Convolutional networks emerge with MNIST classification ability

The researchers first induced emergent classification ability in a family of LeNet convolutional networks trained on the MNIST handwritten digit dataset. This family shows smoothly increasing test accuracy as the number of parameters grows (Figure 7B). To mimic the accuracy metric used in papers on emergence, subset accuracy is used here (see the toy sketch after Figure 7): a network scores 1 if it correctly classifies all K out of K (independent) test items, and 0 otherwise. Under this definition, as K increases from 1 to 5, the family appears to "emerge" the ability to classify MNIST digits correctly, especially when combined with sparse sampling of model sizes (Figure 7C). The emergent classification ability of this convolutional family is qualitatively consistent with emergence claims in published papers, such as results on the BIG-Bench topographic mapping task (Figure 7A).

Figure 7: Inducing emergent MNIST classification ability in a convolutional network. (A) A published emergence claim, based on the BIG-Bench topographic mapping task. (B) LeNet trained on MNIST shows a predictable, sigmoidal increase in test accuracy as model parameters grow. (C) When accuracy is redefined as correctly classifying K out of K independent test items, the redefined metric induces a seemingly unpredictable change.
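A toy sketch (ours, not the paper's code) of how the K-of-K subset accuracy deforms a smooth per-item curve:

```python
import numpy as np

rng = np.random.default_rng(0)

def subset_accuracy(per_item_acc, K, n_trials=10_000):
    """Probability that all K independent test items are classified
    correctly: the 'K-of-K' metric used to induce apparent emergence."""
    correct = rng.random((n_trials, K)) < per_item_acc
    return correct.all(axis=1).mean()

# A model family whose per-item accuracy improves smoothly with scale:
for acc in [0.55, 0.70, 0.85, 0.95, 0.99]:
    print(f"{acc:.2f}", [round(subset_accuracy(acc, K), 3) for K in (1, 3, 5)])
# With K = 1 the trend is the smooth curve itself; with K = 5 the same
# family looks as if it suddenly 'acquires' the ability near the top end,
# since E[subset accuracy] = per_item_acc ** K.
```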


Nonlinear autoencoders emerge with reconstruction ability on the CIFAR100 natural image set

To emphasize that the sharpness comes from the chosen metric, and that it is not confined to accuracy-like metrics, the researchers also induced emergent image-reconstruction ability in shallow (single-hidden-layer) nonlinear autoencoders trained on the CIFAR100 natural image set. To this end, they deliberately defined a discontinuous capability metric: the fraction of test examples whose squared reconstruction error falls below a fixed threshold c:

$$\text{Reconstruction}_c \equiv \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}\left[\lVert x_n - \hat{x}_n \rVert^2 < c\right] \qquad (2)$$

where 𝕀(·) is the indicator function and x̂_n is the autoencoder's reconstruction of x_n. Sweeping the number of bottleneck units, the researchers found that the network's mean squared reconstruction error declines smoothly as model size increases (Figure 8B), but under the newly defined reconstruction metric, for the chosen c, the family's ability to reconstruct the dataset changes sharply and almost unpredictably (Figure 8C). This result is qualitatively consistent with emergence claims in published papers, such as the BIG-Bench Periodic Elements task (Figure 8A).
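A minimal implementation of this thresholded metric (our sketch; the data here are synthetic stand-ins, not CIFAR100 reconstructions) makes its discontinuous character plain:

```python
import numpy as np

def reconstruction_c(x, x_hat, c):
    """Fraction of test examples whose squared reconstruction error falls
    below threshold c (Equation 2): a thresholded, discontinuous metric."""
    err = np.sum((x - x_hat) ** 2, axis=1)   # per-example squared error
    return np.mean(err < c)

# Smoothly shrinking reconstruction error across a stand-in model family...
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 32))
for noise in [1.2, 1.0, 0.9, 0.8]:           # smaller noise ~ larger model
    x_hat = x + rng.normal(scale=noise, size=x.shape)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    print(f"mse={mse:6.1f}  metric={reconstruction_c(x, x_hat, c=30.0):.3f}")
# ...makes the thresholded metric rise steeply around c while the
# underlying error curve declines smoothly.
```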


Figure 8: Inducing emergent reconstruction ability in shallow nonlinear autoencoders. (A) A published emergence claim, based on the BIG-Bench Periodic Elements task. (B) Shallow nonlinear autoencoders trained on CIFAR100 show a smoothly declining mean squared reconstruction error. (C) The newly defined reconstruction metric (Equation 2) induces a seemingly unpredictable change.

Autoregressive Transformers emerge with classification ability on the Omniglot character set

Finally, the researchers induced emergent abilities in Transformers that classify Omniglot handwritten characters autoregressively. The setup is similar: Omniglot images are first embedded by a convolutional layer, then a decoder-only Transformer receives sequences of [embedded image, image class label] pairs and is trained to predict the Omniglot class labels. The researchers measured image classification performance on sequences of length L ∈ [1, 5], again via subset accuracy: 1 if all L images are classified correctly (Figure 9B), and 0 otherwise. Under this metric, the causal Transformer appears to exhibit an emergent ability to classify Omniglot characters correctly (Figure 9C), a result qualitatively consistent with published emergence claims, such as on massive multitask language understanding (MMLU) (Figure 9A).


Figure 9: Inducing emergent classification ability in an autoregressive Transformer. (A) A published emergence claim, based on the MMLU benchmark. (B) Transformers that classify Omniglot handwritten characters autoregressively show rising test accuracy as model parameters increase. (C) When accuracy is redefined as correctly classifying every image in a sequence, the metric becomes far harder to predict, seemingly indicating an induced emergent ability.

