Princeton & DeepMind use mathematics to prove: LLMs are not stochastic parrots! 'The bigger the scale, the stronger the capability' has a theoretical basis

The protagonists of today’s story are two scientists, Sanjeev Arora and Anirudh Goyal.

Arora is from Princeton University, while Goyal is from Google DeepMind.

They came together to explore one question.

That is: are LLMs mere stochastic parrots that chatter without understanding, or have they genuinely learned something and become intelligent agents with emergent capabilities?

AI pioneers Geoffrey Hinton and Andrew Ng have also discussed this issue, but neither reached a clear conclusion at the time.

Hinton pointed out that if no consensus can be reached on this question, it will also be difficult to reach consensus on the potential harms AI may bring.

Arora and Goyal believe LLMs do more than rote imitation. They argue that an LLM's output is not simply a random recombination of its training data, and that this point deserves further exploration.

The two co-wrote a paper on the question.


Paper address: https://arxiv.org/abs/2307.15936

The truth: after extensive training, as LLMs grow larger and larger, their related capabilities improve measurably and new capabilities emerge.

This is not something that mere memorized permutations and combinations could achieve.

The "big" of large models

As we all know, LLM is a huge artificial neural network, connecting "neurons" one by one.

In fact, what we are talking about is the parameters of the model. The more parameters there are, the larger the LLM is.

Let's first look at how an LLM is trained.

The training process includes this step: give the LLM a sentence with its last word hidden, then have it predict, based on probability, what the missing word should be.

If the LLM's vocabulary contains 1,000 words, it produces 1,000 probabilities, then fills in the word with the highest probability.

At first, the LLM may not pick the correct word. The algorithm then computes a loss value: the "distance", in a high-dimensional mathematical space, between the LLM's initial answer and the correct word of the original sentence. That value is used to fine-tune the parameters.

Afterwards, given the same sentence, the LLM computes a more accurate probability distribution, and the loss value drops slightly.

In this way, billions of sentences in the training data are run through this process until the LLM's overall loss falls to an acceptable level.
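The loop described above can be sketched with a toy model. This is only an illustration of the next-word objective and the loss value, using simple bigram counts instead of a real neural network; every name and sentence here is made up for the example.

```python
import math
from collections import defaultdict

# Toy next-word predictor: count bigrams, turn them into conditional
# probabilities, and measure cross-entropy loss on a held-out sentence.
# This illustrates the training objective only -- it is not a real LLM.

def train_bigram(sentences):
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities P(next | prev).
    model = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        model[prev] = {w: c / total for w, c in nxts.items()}
    return model

def loss(model, sentence, floor=1e-6):
    # Cross-entropy: average negative log-probability of each true next word.
    # Unseen transitions get a tiny floor probability instead of zero.
    words = sentence.split()
    nll = 0.0
    for prev, nxt in zip(words, words[1:]):
        p = model.get(prev, {}).get(nxt, floor)
        nll -= math.log(p)
    return nll / (len(words) - 1)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram(corpus)
print(model["cat"])  # the only word ever seen after "cat" is "sat"
print(loss(model, "the cat sat on the rug"))
```

Fine-tuning a real model nudges millions of parameters to shrink exactly this kind of loss; here the "training" is just counting.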

Testing an LLM follows the same process, and results are scored by the loss value (the test sentences are, of course, held out of the training data; otherwise it would be cheating).

After training and testing, when the LLM encounters a new text prompt, it will, with high probability, generate the most appropriate word. That word is appended to the prompt, and the next word is generated.

Word by word, a seemingly coherent answer takes shape on the page.

However, nothing in this process suggests that a larger LLM should perform better on questions requiring reasoning.

Pay attention to the train of thought here. "No indication" means nothing in the training setup predicts this outcome; yet, judging from the observed facts, the conclusion holds.

In other words, a larger-scale LLM performs better than a smaller one at reasoning, even though the two are trained in exactly the same way; the only difference between them is scale.

Arora was puzzled: where does this ability come from?

This is the starting point of Arora and Goyal's research - trying to build a theoretical framework to analyze how these new capabilities emerge.

So, they turned their attention to the field of mathematics and took aim at something called a random graph. Simply put, this term lies at the intersection of graph theory and probability theory.

In a random graph, whether an edge connects any given pair of nodes is random, like tossing a coin.

If the coin lands heads, which happens with probability p, an edge is placed between the two nodes.

When the value of p changes, the properties of the entire random graph can change suddenly. For example, once p exceeds a certain threshold, isolated nodes (nodes connected to no other node) abruptly disappear.
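This coin-flip construction is the classic Erdős–Rényi random graph, and the threshold behavior is easy to see in a quick simulation. The sketch below is illustrative only; the graph size and the two probability values (one well below the ln(n)/n isolation threshold, one well above it) were chosen for the demo.

```python
import random

# Erdős–Rényi random graph G(n, p): each of the n*(n-1)/2 possible edges
# exists independently with probability p -- a biased coin flip per pair.
# Around p ~ ln(n)/n, isolated nodes (degree 0) abruptly vanish.

def count_isolated(n, p, seed=0):
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # coin lands heads: add edge (i, j)
                degree[i] += 1
                degree[j] += 1
    return sum(1 for d in degree if d == 0)

n = 500
low_p = 0.2 / n    # far below the ln(n)/n threshold (~0.0124 here)
high_p = 20.0 / n  # far above it
print(count_isolated(n, low_p))   # many isolated nodes remain
print(count_isolated(n, high_p))  # isolated nodes have vanished
```

The sharp flip between the two regimes, rather than a gradual fade, is exactly the kind of sudden property change the article describes.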

The two scientists realized that this feature of random graphs could be an intuitive way to simulate large language models.

Although neural networks are almost hopelessly complex to analyze, mathematicians have studied random graphs for a long time and have developed many tools for analyzing them.

Perhaps, through the related theories of random graphs, neural network researchers can try to understand and analyze some characteristics of large language models.

Here, the two researchers focused on the bipartite graph, which contains two types of nodes.

In their model, one type of node represents a text fragment. Note that the fragment here must be at least a paragraph in terms of length, and may even be several pages long, rather than a single word.

These nodes line up along one side of the graph.

The second type of node represents the skills required to understand the given text: for example, grasping logical relationships, the ability to calculate, or, more specifically, the ability to detect sarcasm.

These examples are meant to make clear that the second type of node covers a wide variety of abilities, all of which are included.

Arora said that if LLM can see that a certain text contains irony, the overall understanding may change significantly.

However, as mentioned above, the capabilities represented by these skill nodes are not what the LLM was trained for. During training, the LLM learns only to predict the next likely word.

In other words, the capabilities represented by the second type of nodes were designed by Arora and Goyal from the perspective of results, in order to better understand the capabilities displayed by LLM.

Once set up, the two types of nodes connect to each other. Each connection indicates a capability the LLM needs in order to understand a given piece of text; connections may be one-to-one, one-to-many, or many-to-one.

Take the example of understanding irony. This skill point will establish a connection with all texts containing ironic elements.

Making these connections is not so simple, though: big companies like OpenAI and DeepMind do not disclose their training or test data, so the two researchers could not rely on them.

In addition, what they want to understand is the relationship between scale, behavior, and ability.

Since 2021, researchers studying the performance of LLMs and other neural networks have observed a common characteristic.

They noticed that as a model grows, both in parameter count and in the amount of training data, its loss on test data (the gap between its predictions on new text and the correct answers) decreases in a very specific way.

These observations have been encoded into an equation called the neural scaling law.
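A scaling law of this kind says test loss falls as a power of model size. The sketch below uses the power-law form popularized by Kaplan et al.-style scaling laws; the constant names and values here are illustrative assumptions, not values fitted to any particular model.

```python
# Neural scaling law (power-law form): test loss falls smoothly as model
# size N grows, roughly  L(N) ~ (Nc / N) ** alpha.
# Nc and alpha below are illustrative placeholders, not fitted values.

def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

# Each 10x increase in parameters shaves a predictable amount off the loss:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  predicted loss {predicted_loss(n):.3f}")
```

The key point the researchers lean on is not the exact constants but the shape: loss declines predictably and monotonically with scale.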


Therefore, Arora and Goyal state that their theory does not depend on any single LLM, or on any specific set of training and test data. Instead, it rests on a kind of universal law: the loss predicted by the scaling laws.

The key to their further research is the relationship between the neural scaling law and the bipartite graph introduced above.

Borrowing the bipartite graph

First, the researchers assume there is a bipartite graph corresponding to the LLM's behavior on test data.

In order to take advantage of the loss changes of LLM on the test data, they imagined the following way to describe how LLM acquires skills.

Let’s take the skill of being able to understand irony as an example -

This concept is represented by a skill node, so the researchers looked at which text nodes that skill node connects to.

If almost all of those connected text nodes are successes, meaning the LLM's predictions on the texts requiring this skill are highly accurate, then the LLM is competent at the skill.

But if more than a certain proportion of the skill node's connections lead to failed text nodes, then the LLM fails at this skill.
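This success/failure criterion can be rendered as a toy bipartite graph. The graph, the set of failed texts, and the 30% failure threshold below are all invented for illustration; the paper's actual data and threshold differ.

```python
# Toy bipartite graph: skill nodes on one side, text nodes on the other.
# A model counts as "competent" at a skill if at most a small fraction of
# the text nodes connected to that skill are failures.
# All nodes, edges, and the threshold here are made up for the example.

skill_to_texts = {
    "irony":    ["t1", "t2", "t3", "t4"],
    "physics":  ["t2", "t5"],
    "metaphor": ["t3", "t4", "t5", "t6"],
}
failed_texts = {"t5", "t6"}  # texts the model predicted poorly

def competent(skill, max_fail_frac=0.3):
    texts = skill_to_texts[skill]
    fails = sum(1 for t in texts if t in failed_texts)
    return fails / len(texts) <= max_fail_frac

for skill in skill_to_texts:
    print(skill, competent(skill))  # irony passes; physics and metaphor fail
```

Lower test loss shrinks the `failed_texts` set, which is exactly how, in the next step of the argument, scaling flips more skills into the competent column.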

The connections between these bipartite graphs and LLMs enable Arora and Goyal to analyze the behavior of LLMs using the tools of random graph theory.

Studying these graphs reveals certain relationships between nodes. These relationships are then transformed into a logical and testable method to explain how large language models acquire some unexpected capabilities.

Here, Arora and Goyal first explain a key behavior—why larger LLMs are more proficient at individual skills than relatively smaller models.

They started from the lower test losses predicted by the neural scaling laws.

Fewer failed text nodes means fewer connections between failed text nodes and skill nodes. And when more skill nodes connect only to successful text nodes, the model is competent at more skills.

Next, the two researchers found a way to explain the abilities gained by larger models: as an LLM's size increases and its test loss decreases, random combinations of skill nodes begin to connect to individual text nodes.

This suggests the LLM also gets better at using several skills at once, generating text with combinations of skills that never appeared together in any text in the training data.

For example, if an LLM can already use one skill to generate text, then expanding its parameter count or training data by an order of magnitude makes it equally good at generating text that requires two skills.

Scale up by another order of magnitude, and the LLM can now perform tasks that require four skills at once, with the same level of proficiency in each.

Therefore, larger LLMs have more ways to combine skills together, leading to a significant improvement in the performance of the LLM itself.

As the LLM grows larger, the chance that all of these skill combinations appeared in its training data shrinks toward zero.

By the rules of random graph theory, each combination arises from a random sampling of the possible skills. So if there are about a thousand basic single-skill nodes in the graph, and we want to combine four skills, there are about 1000 to the fourth power, a full trillion, possible combinations.
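The counting argument is simple arithmetic. The snippet below reproduces the article's trillion figure and, purely for comparison, also computes the number of unordered sets of four distinct skills, a quantity the article does not give.

```python
import math

# With ~1000 basic skills, how many ways can four of them be combined?
n_skills, k = 1000, 4

ordered = n_skills ** k            # ordered picks with repetition, as in the article
distinct = math.comb(n_skills, k)  # unordered sets of 4 distinct skills

print(f"{ordered:,}")   # 1,000,000,000,000 -- a full trillion
print(f"{distinct:,}")  # ~41.4 billion distinct 4-skill sets
```

Either way of counting, the number of combinations dwarfs the amount of training text, which is what makes memorization an implausible explanation.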

In other words, if an LLM can really perform tasks by combining four of the 1,000 skills, the model must be generalizing; more than that, it is very likely not a stochastic parrot.

But Arora and Goyal wanted to go beyond theory and test their claim: that as scale and training data increase, LLMs become better at combining more skills and therefore at generalizing.

Together with other members of the team, they designed a method called "skill-mix" to evaluate an LLM's ability to generate text using multiple skills.

To put the LLM to the test, the team asked it to generate three sentences on a randomly selected topic, and the generated sentences had to demonstrate randomly selected skill points.

For example, they asked GPT-4 to write about swordsmanship while demonstrating skills from four areas: self-serving bias, metaphor, statistical reasoning, and physics knowledge.

The output of GPT-4 is like this:

In this dance of steel, my victory (metaphor) is as certain as an object in free fall (physics).

And as a famous duelist, I am naturally agile, as most who know me will attest (statistical reasoning). Defeat? It could only be because the battlefield tilted toward my enemy, not because of any shortcoming of mine (self-serving bias).

The actual result, as the mathematics predicts, is that GPT-4 far outperforms GPT-3.5.
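A skill-mix-style query can be sketched as a prompt template: pick a random topic and k random skills, and ask the model to weave all of them into a few sentences. The wording, skill list, and topic list below are hypothetical stand-ins, not the research team's actual materials.

```python
import random

# Sketch of a skill-mix style evaluation prompt. The skills and topics
# are invented examples for illustration only.

skills = ["self-serving bias", "metaphor", "statistical reasoning", "physics knowledge"]
topics = ["dueling", "sewing", "gardening", "cooking"]

def skill_mix_prompt(k=4, seed=42):
    rng = random.Random(seed)
    topic = rng.choice(topics)            # random topic
    chosen = rng.sample(skills, k)        # k distinct random skills
    return (f"Write three sentences about {topic} that together "
            f"demonstrate these skills: {', '.join(chosen)}.")

print(skill_mix_prompt())
```

Because topic and skills are sampled at random, almost every generated query demands a skill combination the model is unlikely to have seen verbatim in training, which is the point of the test.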

Arora ventures a bold guess: within a year, will there be a model that far surpasses GPT-4?


Statement: this article is reproduced from 51CTO.COM.