Are large language models wrong for coding?

Reinforcement learning models beat generative AI when the goal is accuracy, consistency, game mastery, or finding one correct answer.

Large language models such as GPT-4 are impressive because they can generate fluent, natural-sounding text that is extremely convincing. Unfortunately, so is the hype: Microsoft researchers breathlessly described the Microsoft-funded OpenAI GPT-4 model as demonstrating "sparks of artificial general intelligence."

Of course, unless Microsoft means a tendency to hallucinate, that is, to generate erroneous text, that claim is simply wrong. GPT-4 is not good at playing games such as chess and Go, it is not good at mathematics, and the code it writes may contain errors and subtle bugs.

This does not mean that large language models are all hype. We need new ways to talk about generative AI (GenAI) without exaggerating how different it is from other technologies.

As detailed in an IEEE Spectrum article, some experts, such as OpenAI's Ilya Sutskever, believe that adding reinforcement learning with human feedback can eliminate LLM hallucinations. Others, such as Meta's Yann LeCun and Geoff Hinton (who recently retired from Google), think the problem is a more fundamental flaw in large language models: both argue that large language models lack the non-linguistic knowledge that is crucial to understanding the underlying reality that language describes.

Diffblue CEO Mathew Lodge argued in an interview that there is a better option. "Small, fast, and cheap-to-run reinforcement learning models can easily defeat large language models with hundreds of billions of parameters on a wide variety of tasks, from playing games to writing code," he said.

Are we looking for AI gold in the wrong places?

Lodge's point is that generative AI certainly has its uses, but we may be trying to force it into areas where reinforcement learning is better suited. Take games, for example.

Levy Rozman, a chess International Master, posted a video of himself playing against ChatGPT. The model made a series of absurd and illegal moves, including capturing its own pieces. Stockfish, the best open-source chess engine (which is not built on a large language model), beat ChatGPT in fewer than 10 moves because the large language model could not keep track of legal moves. This shows just how far large language models fall short of claims of general artificial intelligence, and it is not an isolated example.

Thanks to its reinforcement learning algorithms, Google's AlphaGo is currently the best-performing Go AI. Reinforcement learning works by generating candidate solutions to a problem, trying them, using the results to improve the next suggestion, and repeating that process thousands of times to find the best result.

In the case of AlphaGo, the AI tries different moves and predicts whether each is a good move and whether it is likely to win the game from that position. It uses that feedback to follow promising sequences of moves and to generate further candidate moves. The effect is a directed search over possible moves.

This process is called probabilistic search. You cannot try every possible move, because there are far too many, but you can concentrate the search on the areas where the best moves are most likely to be found. This works extremely well for games: AlphaGo has defeated Go masters. It is not infallible, but it plays Go far better than the best large language models available today.
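To make the loop concrete, here is a minimal sketch of the iterate-evaluate-refine pattern described above. It is a toy illustration only, not AlphaGo's actual algorithm (which combines Monte Carlo tree search with deep neural networks); the move scores and the "hidden" move quality are invented for the example.

```python
import random

def move_strength(move):
    """Hidden 'true' quality of a move; the search only ever sees noisy estimates."""
    return 1.0 / (1 + abs(move - 42))  # pretend move 42 is the best move

def estimate_value(move, simulations=20):
    """Stand-in for a value estimate: average several noisy rollouts of the move."""
    return sum(random.random() * move_strength(move) for _ in range(simulations)) / simulations

def probabilistic_search(candidate_moves, rounds=50, keep=5):
    """Repeatedly score candidates, keep the promising ones, and explore near them."""
    pool = list(candidate_moves)
    for _ in range(rounds):
        scored = sorted(pool, key=estimate_value, reverse=True)
        # Keep the most promising moves and generate new candidates close to them.
        pool = scored[:keep] + [m + random.choice([-1, 1]) for m in scored[:keep]]
    return max(pool, key=lambda m: estimate_value(m, simulations=200))

print("Best move found:", probabilistic_search(range(100)))
```

The point of the sketch is the feedback loop: each round's results steer where the next round searches, rather than generating an answer in a single pass.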

Probability vs. Accuracy

Proponents argue that even if large language models currently lag well behind other types of AI on such tasks, they keep getting better. But Lodge argues that if we are to accept that claim, we need to understand why they would get better at this kind of task. The difficulty, he continued, is that nobody can predict exactly how GPT-4 will respond to a specific prompt; the model is beyond human explanation. This, he believes, is "the reason why 'prompt engineering' doesn't really exist." He adds that AI researchers also struggle to prove that "emergent properties" of large language models exist at all, let alone predict them.

The best argument on offer is essentially inductive: GPT-4 is better than GPT-3 at some language tasks because it is larger, therefore an even larger model would be better still.

The problem, in Lodge's view, is that GPT-4 still struggles with the same challenges GPT-3 did. Math is one of them: GPT-4 is better than GPT-3 at addition, but it still stumbles over multiplication and other mathematical operations.

Making language models bigger does not magically solve these problems, and according to OpenAI, larger models are not the answer. The reason comes down to the fundamental nature of large language models, as the OpenAI forum points out: "Large language models are probabilistic in nature and operate by generating likely outputs based on the patterns they have observed in the training data. In mathematics and physics problems, the likelihood of finding the single correct answer is slim."

Reinforcement-learning-driven approaches to AI can produce more accurate results because the process is goal-seeking: reinforcement learning iterates toward the answer that best satisfies the goal. Large language models, Lodge points out, "are not designed to iterate or to seek goals. They are designed to give a 'good enough' answer in one shot, or a few shots."

A "one-shot" answer is the first answer produced by the model, obtained by predicting a sequence of words in the prompt. "Few-shot learning" involves providing additional samples or cues to the model to assist it in generating better predictions.. Large language models often also add some randomness (that is, they are "randomized") to increase the likelihood of a better answer, so they will give different answers to the same question.

It is not that the large language model world ignores reinforcement learning. GPT-4 incorporates "reinforcement learning from human feedback" (RLHF): the core model is trained with help from human operators to favor certain answers, but that does not fundamentally change the answers the model generates in the first place. Lodge gave an example: a large language model might offer the following options to complete the sentence "Wayne Gretzky likes ice..."

1. Wayne Gretzky loves ice cream.

2. Wayne Gretzky loves ice hockey.

3. Wayne Gretzky loves ice fishing.

4. Wayne Gretzky loves skating.

5. Wayne Gretzky likes ice wine.

Human operators ranked the answers and might have concluded that the legendary Canadian hockey player was more likely to prefer ice hockey and skating, despite the broad appeal of ice cream. The human rankings, plus more human-written responses, are used to train the model. Note that GPT-4 does not claim to know Wayne Gretzky's preferences accurately; it just does the best job it can when prompted.
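A rough sketch of how such a ranking can be turned into training data is below. It assumes a simple pairwise-preference scheme, where every higher-ranked completion is treated as "preferred" over every lower-ranked one; this is a simplification of the idea behind RLHF, not OpenAI's actual pipeline.

```python
# Hypothetical human ranking of the completions above, best first.
ranked_completions = [
    "Wayne Gretzky loves ice hockey.",
    "Wayne Gretzky loves skating.",
    "Wayne Gretzky loves ice cream.",
    "Wayne Gretzky loves ice fishing.",
    "Wayne Gretzky likes ice wine.",
]

def preference_pairs(ranking):
    """Yield (preferred, rejected) pairs implied by the ranking."""
    for i, preferred in enumerate(ranking):
        for rejected in ranking[i + 1:]:
            yield preferred, rejected

pairs = list(preference_pairs(ranked_completions))
print(f"{len(pairs)} training pairs, for example:")
print(pairs[0])
# A reward model is trained to score the preferred completion above the rejected
# one; the base model is then fine-tuned to favor answers that reward model likes.
```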

Finally, large language models are not designed to be maximally accurate or consistent; they trade accuracy and deterministic behavior for generality. For Lodge, all of this means that when it comes to applying AI at scale, reinforcement learning beats generative AI.

Applying Reinforcement Learning to Software

What about software development? As I write this, GenAI already offers developers who use tools such as GitHub Copilot or Amazon CodeWhisperer the chance to boost their productivity. That is not speculation; it is already happening. These tools predict what code is likely to come next based on the code before and after the insertion point in the integrated development environment.
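For intuition only, the snippet below shows the crudest possible version of "predict what code comes next from the surrounding code": a frequency table of which token follows which in a tiny corpus. Real tools such as Copilot use large transformer models over far richer context; this toy exists only to make the prediction idea tangible.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus of tokenized code.
corpus = """
for ( int i = 0 ; i < n ; i ++ ) { sum += values [ i ] ; }
for ( int j = 0 ; j < m ; j ++ ) { total += items [ j ] ; }
""".split()

# Count which token most often follows each token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def suggest(prev_token):
    """Return the most frequent next token seen after prev_token, if any."""
    candidates = follows.get(prev_token)
    return candidates.most_common(1)[0][0] if candidates else None

print(suggest("for"))  # -> '(' : the token that always followed 'for' in the corpus
print(suggest("int"))  # -> 'i' or 'j', the tokens that followed 'int'
```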

In fact, as David Ramel of Visual Studio Magazine reported, the latest version of Copilot already generates 61% of Java code. For those worried that this will shrink the work of software developers, remember that these tools require diligent human supervision to check and edit the completions so that the code compiles and runs correctly. Autocompletion has been a staple of IDEs since their earliest days, and Copilot and other code generators make it far more useful. Autonomous coding at scale, where that 61% of Java code would be written without such supervision, is a different matter.

However, Lodge says, reinforcement learning makes precise autonomous coding at scale possible. He has a vested interest in saying so, of course: in 2019 his company Diffblue released Cover, a commercial unit-test-writing tool based on reinforcement learning. Cover writes complete unit test suites without human intervention, making it possible to automate complex, error-prone tasks at scale.

Is Lodge biased? Absolutely. But he also has plenty of experience to back up his belief that reinforcement learning outperforms GenAI for software development. Today, Diffblue uses reinforcement learning to search the space of all possible test methods, write test code for each method automatically, and select the best of the tests it has written. The reinforcement learning reward function is based on a variety of criteria, including test coverage and aesthetics, one of which is conformance to human coding styles. The tool creates tests for each method in an average of one second.
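The shape of that approach can be sketched in a few lines. This is a conceptual toy, not Diffblue Cover's implementation: it generates candidate test inputs for a deliberately simple method, scores each candidate by how many branches it exercises, and keeps refining around the best candidate found so far.

```python
import random

def method_under_test(x):
    """Toy method with three branches we would like a test to cover."""
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    return "positive"

def branches_covered(inputs):
    """Reward: number of distinct branches the candidate test inputs hit."""
    return len({method_under_test(x) for x in inputs})

def search_for_test_inputs(rounds=200, test_size=3):
    best_inputs, best_reward = None, -1
    for _ in range(rounds):
        if best_inputs is None or random.random() < 0.3:
            candidate = [random.randint(-10, 10) for _ in range(test_size)]
        else:
            # Explore near the best candidate found so far.
            candidate = [x + random.randint(-2, 2) for x in best_inputs]
        reward = branches_covered(candidate)
        if reward > best_reward:
            best_inputs, best_reward = candidate, reward
    return best_inputs, best_reward

inputs, covered = search_for_test_inputs()
print(f"Chosen test inputs {inputs} cover {covered} of 3 branches")
```

The reward here is bare-bones branch coverage; a production tool would weigh many more criteria, as the paragraph above notes.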

Lodge believes that if the goal is to automatically write 10,000 unit tests for a program no single person understands, reinforcement learning is the only real solution. "Large language models can't compete; there is no way for humans to effectively supervise them and correct their code at that scale, and making the models bigger and more complicated doesn't fix that."

Conclusion: the greatest strength of large language models is that they are general-purpose language processors. They can perform language tasks they were not explicitly trained for, which means they can do a great deal in content generation (copywriting) and many other areas. Lodge stressed: "But that does not make large language models a substitute for the AI models, often based on reinforcement learning, that are more accurate, more consistent, and able to run at scale."
