
Princeton DeepMind used mathematics to prove: LLM is not a random parrot! "The bigger the scale, the stronger the capability" has a theoretical basis

王林 · 2024-02-19 09:30:25

The protagonists of today’s story are two scientists, Sanjeev Arora and Anirudh Goyal.

Arora is from Princeton University, while Goyal is from Google DeepMind.

They came together to explore a single question.

Namely: is an LLM merely a random parrot that chatters without understanding, or has it really learned something and become an intelligent agent with emergent capabilities?

AI pioneers Geoffrey Hinton and Andrew Ng have also discussed this question, but at the time they did not reach any clear conclusion.

Hinton pointed out that if a consensus cannot be reached on this issue, it will also be difficult to reach a consensus on the potential harm that AI may bring.

Arora and Goyal believe that an LLM is not merely imitating through rote, repetitive learning. They point out that an LLM's output is not simply stitched together at random from a large amount of training data, and that this point deserves further exploration.

The two co-authored a paper on the question.


Paper address: https://arxiv.org/abs/2307.15936

The truth is that after extensive training, as LLMs grow larger and larger, their existing capabilities improve markedly and new capabilities emerge.

This is not something that ordinary permutation and recombination of the training data can achieve.

The "big" of large models

As we all know, an LLM is a huge artificial neural network made of interconnected "neurons".

In fact, what we are really talking about is the model's parameters: the more parameters there are, the larger the LLM.

Let's first look at the mechanism by which an LLM is trained.

The training process includes a step like this: give the LLM a single sentence with the last word hidden, and have it predict, based on probability, which word should fill the gap.

If the LLM knows 1,000 words, it will produce 1,000 probabilities; the word with the highest probability is chosen to fill the blank.
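To make the mechanism concrete, here is a minimal Python sketch of that prediction step, using an invented four-word vocabulary and invented scores (a real LLM's vocabulary and scores would of course be far larger):

```python
import numpy as np

# Toy vocabulary and made-up model scores, one score per word.
vocabulary = ["cat", "mat", "dog", "sat"]
scores = np.array([1.2, 3.5, 0.3, 2.1])

# Softmax turns the scores into a probability distribution over the vocabulary.
probabilities = np.exp(scores) / np.exp(scores).sum()

# The word with the highest probability fills the blank.
predicted = vocabulary[int(np.argmax(probabilities))]
print(dict(zip(vocabulary, probabilities.round(3))), "->", predicted)
```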

At the beginning, the LLM may not pick the correct word, and the algorithm computes a loss value: the "distance", in a high-dimensional mathematical space, between the answer the LLM gave and the correct word in the original sentence. This value is then used to fine-tune the parameters.

Afterwards, for the same sentence, the LLM computes a more accurate probability distribution, and the loss value decreases slightly.

Billions of sentences in the training data are run through this process until the LLM's overall loss drops to an acceptable level.
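In practice the "distance" is usually a cross-entropy loss: the negative log of the probability the model assigned to the correct word. Here is a hedged illustration with hypothetical numbers:

```python
import numpy as np

vocabulary = ["cat", "mat", "dog", "sat"]
probabilities = np.array([0.08, 0.72, 0.03, 0.17])  # hypothetical model output
correct_word = "mat"                                 # the word that was hidden

# Cross-entropy loss for this single prediction: smaller when the model
# puts more probability on the correct word.
loss = -np.log(probabilities[vocabulary.index(correct_word)])
print(f"loss = {loss:.3f}")
```

Training nudges the parameters after each such prediction so that, averaged over billions of sentences, this loss keeps falling.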

Similarly, testing the LLM follows the same process, and the test result is read off from the loss value (of course, the sentences used for testing must not be in the training data, otherwise it would be cheating).

After training and testing, when the LLM encounters a new text prompt, it has a high probability of generating the most appropriate word. Each generated word is appended to the prompt, and the next word is generated in turn.

Word by word, a seemingly coherent answer takes shape on the page.
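A minimal sketch of that generate-and-append loop is below; `predict_next_word` is a hypothetical stand-in for a trained LLM and just returns canned words:

```python
def predict_next_word(prompt_words):
    # Placeholder: a real LLM would return its highest-probability next word.
    canned = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}
    return canned.get(prompt_words[-1], "<end>")

prompt = ["The"]
while prompt[-1] != "<end>" and len(prompt) < 10:
    # Each generated word is appended to the prompt and fed back in.
    prompt.append(predict_next_word(prompt))

print(" ".join(w for w in prompt if w != "<end>"))  # "The cat sat on the mat"
```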

However, nothing in this process suggests that a larger LLM should perform better on questions that require reasoning.

Follow the train of thought here: "nothing suggests" means there is nothing in the training procedure itself that points to this outcome; yet judging from the observed facts, the conclusion is correct.

In other words, a larger LLM performs better than a smaller one at reasoning, even though the two are trained in exactly the same way; the only difference between them is scale.

Arora was puzzled: where does this ability come from?

This is the starting point of Arora and Goyal's research - trying to build a theoretical framework to analyze how these new capabilities emerge.

So, they turned their attention to the field of mathematics and took aim at something called a random graph. Simply put, this term lies at the intersection of graph theory and probability theory.

In a random graph, whether an edge connects any two nodes is random, like tossing a coin.

If the coin lands heads, an edge is drawn between the two nodes, and heads comes up with probability p.

When the value of p changes, the properties of the entire random graph may suddenly change. For example, if the p value exceeds a certain threshold, some isolated nodes (that is, points that are not connected to other nodes) will suddenly disappear.
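A small simulation (not from the paper, just an illustration) shows this threshold behaviour for an Erdős–Rényi random graph: below roughly p = ln(n)/n, isolated nodes are common; above it, they all but vanish.

```python
import math
import random

def isolated_fraction(n, p, trials=20):
    """Average fraction of nodes with no edges in G(n, p)."""
    total = 0.0
    for _ in range(trials):
        degree = [0] * n
        for i in range(n):
            for j in range(i + 1, n):
                if random.random() < p:   # flip the biased coin for this pair
                    degree[i] += 1
                    degree[j] += 1
        total += sum(d == 0 for d in degree) / n
    return total / trials

n = 200
threshold = math.log(n) / n
for p in (0.2 * threshold, threshold, 3 * threshold):
    print(f"p = {p:.4f}: isolated fraction ≈ {isolated_fraction(n, p):.3f}")
```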

The two scientists realized that this feature of random graphs could be an intuitive way to simulate large language models.

Although neural networks are unspeakably complex and nearly impossible to analyze directly, random graphs have been studied by mathematicians for a long time, and a variety of tools have been developed to analyze them.

Perhaps, through the related theories of random graphs, neural network researchers can try to understand and analyze some characteristics of large language models.

Here, the two researchers focused on the bipartite graph, which contains two types of nodes.

In their model, one type of node represents a text fragment. Note that the fragment here must be at least a paragraph in terms of length, and may even be several pages long, rather than a single word.

These nodes are lined up along one side of the graph.

The second type of node represents the skills required to understand a given text: for example, understanding logical relationships, the ability to calculate, or, more concretely, the ability to understand sarcasm.

The point of these examples is to make clear that this second type of node covers a wide variety of abilities, all of which are included.

Arora said that if the LLM can see that a text contains irony, its understanding of the whole text may change significantly.

However, as mentioned above, the capabilities represented by the second type of node are not what the LLM is explicitly trained for. In other words, during training the LLM is only trained to predict the next likely word.

In other words, the capabilities represented by the second type of node were introduced by Arora and Goyal from the perspective of outcomes, in order to better understand the capabilities the LLM displays.

Once set up, the two types of nodes are connected to each other. A connection represents which skills the LLM needs in order to understand a given piece of text; the relationship may be one-to-one, one-to-many, or many-to-one.

Take the example of understanding irony. This skill point will establish a connection with all texts containing ironic elements.
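A toy version of that bipartite structure, with invented texts and skills purely for illustration:

```python
# Each text node is connected to the skill nodes it requires.
skills_needed = {
    "text about a duel":             {"irony", "physics"},
    "text about an election":        {"statistics", "logic"},
    "text about a sarcastic review": {"irony"},
}

# Invert the mapping: for each skill, which texts require it?
texts_per_skill = {}
for text, skills in skills_needed.items():
    for skill in skills:
        texts_per_skill.setdefault(skill, set()).add(text)

# The "irony" skill node links to every text containing ironic elements.
print(texts_per_skill["irony"])
```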

Making these connections is not so simple. Big companies like OpenAI and DeepMind do not disclose their training or test data, so the two researchers could not rely on them.

In addition, what they want to understand is the relationship between scale, behavior, and ability.

Since 2021, researchers studying the performance of LLMs and other neural networks have observed a common characteristic.

They noticed that as a model grows larger, both in size and in the amount of training data, its loss on test data (the gap between its predictions on new text after training and the correct answers) decreases in a very specific way.

These observations have been encoded into an equation called the neural scaling law.
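One commonly cited form of such a scaling law is sketched below; the constants and exponents are fitted per model family, and this particular form is an illustration rather than necessarily the one the paper relies on.

```latex
% Illustrative neural scaling law: test loss L as a function of
% parameter count N and training-data size D.
% E, A, B, \alpha, \beta are fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```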


Therefore, Arora and Goyal stated that their theory does not depend on any single LLM or any specific set of training and test data; rather, it rests on a kind of universal law: the loss predicted by the scaling law.

The key to their further research is the relationship between the neural scaling law and the bipartite graph introduced above.

Borrowing the bipartite graph

First, the researchers assume there is a bipartite graph corresponding to the LLM's behavior on the test data.

To make use of the LLM's loss changes on the test data, they imagined the following way of describing how the LLM acquires skills.

Let’s take the skill of being able to understand irony as an example -

This skill is represented by a skill node, so the researchers look at which text nodes that skill node is connected to.

If almost all of these connected text nodes are successes, meaning the LLM's predictions on the texts requiring this specific skill are highly accurate, then the LLM is competent at this skill.

But if more than a certain proportion of the text nodes connected to a skill node are failures, then the LLM fails at that skill.
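A hedged sketch of this pass/fail criterion; the 10% threshold and the data are invented for illustration:

```python
def has_skill(connected_texts, failed_texts, max_fail_fraction=0.1):
    """A skill counts as acquired if at most a small fraction of the
    text nodes connected to it were failures on the test data."""
    failures = sum(1 for t in connected_texts if t in failed_texts)
    return failures / len(connected_texts) <= max_fail_fraction

irony_texts = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
failed = {"t4"}  # texts the model predicted poorly

print(has_skill(irony_texts, failed))  # True: only 1 of 10 connected texts failed
```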

The connections between these bipartite graphs and LLMs enable Arora and Goyal to analyze the behavior of LLMs using the tools of random graph theory.

Studying these graphs reveals certain relationships between nodes. These relationships are then transformed into a logical and testable method to explain how large language models acquire some unexpected capabilities.

Here, Arora and Goyal first explain a key behavior—why larger LLMs are more proficient at individual skills than relatively smaller models.

They started with lower test losses predicted by the neural scaling laws.

If there are fewer failed text nodes, there are fewer connections between failed text nodes and skill nodes. Hence more skill nodes are connected only to successful text nodes, indicating that the model's competence in those skills has increased.

Next, the two researchers found a way to explain the abilities gained by larger models: as the LLM's size increases and its test loss decreases, random combinations of skill nodes begin connecting to individual text nodes.

This suggests that the LLM also becomes better at using several skills at the same time, and begins to generate text that draws on multiple skills, even if those exact skill combinations never appeared in any text in the training data.

For example, suppose an LLM can already use one skill to generate text. If we expand its parameter count or training data by an order of magnitude, it becomes equally good at generating text that requires two skills.

Go up another order of magnitude, and the LLM can now perform tasks that require four skills at the same time, with the same level of proficiency in each.

Therefore, larger LLMs have more ways to combine skills together, leading to a significant improvement in the performance of the LLM itself.

And as the LLM grows larger, the likelihood that it has seen all of these skill combinations in its training data becomes smaller and smaller, approaching zero.

According to random graph theory, each combination comes from a random sample of the possible skills. So if there are about a thousand basic single-skill nodes in the graph, and we want to combine, say, four skills, there are roughly 1,000 to the fourth power possibilities: a full trillion combinations.
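The back-of-the-envelope count can be checked directly (the unordered count is shown for comparison):

```python
from math import comb

n_skills, k = 1000, 4
print(f"ordered 4-tuples:       {n_skills ** k:,}")      # 1,000,000,000,000
print(f"unordered combinations: {comb(n_skills, k):,}")  # 41,417,124,750
```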

In other words, if an LLM can really perform these tasks by combining four of the 1,000 skills, then the model must have generalization ability; furthermore, it is very probably not a random parrot.

But Arora and Goyal wanted to go beyond theory and test their idea that LLMs get better at combining more skills, and therefore perform better overall, as their scale and training data increase.

Together with other members of the team, they designed a method called "skill-mix" to evaluate an LLM's ability to generate text using multiple skills.

To put an LLM to the test, the research team asked it to generate three sentences on a randomly chosen topic, sentences that had to demonstrate a set of randomly chosen skills.

For example, they asked GPT-4 to write about swordsmanship and dueling, and asked the model to demonstrate skills from four areas: self-serving bias, metaphor, statistics, and physics.

The output of GPT-4 is like this:

In this dance with steel, my victory is (to use a metaphor) as certain as a freely falling object (drawing on physics).

And as a renowned duelist, I am naturally agile, as most who know me will attest (using statistics). Defeat? It could only be because the battlefield tilted toward my enemy, not because of any shortcoming of mine (self-serving bias).
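A hedged sketch of how such a skill-mix prompt might be assembled; the skill and topic lists below are invented and are not the authors' exact ones:

```python
import random

skills = ["self-serving bias", "metaphor", "statistical reasoning", "physics", "irony"]
topics = ["dueling and swordsmanship", "gardening", "chess"]

def skill_mix_prompt(k=4):
    """Pick a random topic and k random skills, and ask for a short text
    that demonstrates all of them at once."""
    chosen_skills = random.sample(skills, k)
    topic = random.choice(topics)
    return (f"Write three sentences about {topic} that together demonstrate "
            f"the following skills: {', '.join(chosen_skills)}.")

print(skill_mix_prompt())
```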

The actual result, as the mathematics predicts, is that GPT-4 far outperforms GPT-3.5.

Arora ventures a bold guess: within a year, will there be a model that far surpasses GPT-4?


