Are large language models wrong for coding?
Reinforcement learning models beat generative AI when the goal is accuracy, consistency, game mastery, or finding one correct answer.
Large language models such as GPT-4 are impressive because they can generate fluent, natural-sounding text that is extremely convincing. Unfortunately, so is the hype: Microsoft researchers breathlessly describe the Microsoft-funded OpenAI GPT-4 model as showing "sparks of artificial general intelligence."
Unless, of course, Microsoft is referring to GPT-4's tendency to hallucinate, confidently generating text that is simply wrong. GPT is not good at playing games such as chess and Go, it is not good at mathematics, and the code it writes can contain errors and subtle flaws.
This does not mean that large language models are all hype. We just need a way to talk about generative AI (GenAI) without exaggerating how different it is from other technologies.
As detailed in an IEEE Spectrum article, some experts, such as OpenAI's Ilya Sutskever, believe that adding reinforcement learning with human feedback can eliminate LLM hallucinations. But others, such as Meta's Yann LeCun and Geoff Hinton (who recently retired from Google), think a more fundamental flaw in large language models is at work. Both believe that large language models lack the non-linguistic knowledge that is crucial for understanding the underlying reality that language describes.
Diffblue CEO Mathew Lodge argues there is a better alternative. "Small, fast, and cheap-to-run reinforcement learning models can handily beat large language models with hundreds of billions of parameters at all kinds of tasks, from playing games to writing code," he said in an interview.
Lodge's point is that generative AI certainly has its uses, but we may be forcing it into areas where reinforcement learning is better suited. Take games, for example.
Levy Rozman, a chess International Master, posted a video of himself playing against ChatGPT. The model made a series of absurd and illegal moves, including capturing its own pieces. The best open-source chess engine, Stockfish (which doesn't use neural networks at all), would beat ChatGPT in under 10 moves because the large language model cannot reliably find legal moves. This is a reminder of how far large language models fall short of the claims of artificial general intelligence, and it is not an isolated example.
Google's AlphaGo, currently the best-performing Go AI, owes its strength to reinforcement learning. Reinforcement learning works by generating different candidate solutions to a problem, trying them, using the results to improve the next suggestion, and then repeating the process thousands of times to home in on the best result.
In AlphaGo's case, the AI tries different moves and predicts whether each is a good move and whether it is likely to win the game from that position. It uses the feedback to follow promising sequences of moves and to generate other possible moves. The effect is a search through the space of possible moves.
The process is called probabilistic search. You can't try every possible move (there are far too many), but you can spend your time searching the areas of the board where the best moves are likely to be found. This works extremely well for games: AlphaGo has defeated top Go masters in the past. It is not infallible, but it currently performs better at Go than the best large language models available today.
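To make that generate-evaluate-refine loop concrete, here is a minimal, self-contained sketch of a probabilistic search. It is a toy (a cross-entropy-style search over a one-dimensional "move", with an invented reward function), not how AlphaGo actually works, but it shows the same pattern: propose candidates, score them, and bias the next round of proposals toward the results that scored best.

```python
import random

def reward(x: float) -> float:
    """Hidden 'game outcome' the searcher can only sample, not inspect."""
    return -(x - 7.3) ** 2  # the best possible "move" is near x = 7.3

def probabilistic_search(rounds: int = 50, samples_per_round: int = 20) -> float:
    mean, spread = 5.0, 3.0  # initial guess about where good moves live
    for _ in range(rounds):
        # Propose candidate moves, then score each one.
        candidates = [random.gauss(mean, spread) for _ in range(samples_per_round)]
        ranked = sorted(candidates, key=reward, reverse=True)
        elites = ranked[: samples_per_round // 4]  # keep the most promising moves
        mean = sum(elites) / len(elites)           # refocus the search on them
        spread = max(0.1, spread * 0.9)            # gradually narrow the search
    return mean

if __name__ == "__main__":
    best = probabilistic_search()
    print(f"best move found: {best:.2f}, reward: {reward(best):.3f}")
```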
Proponents argue that even if there is evidence that large language models lag well behind other kinds of AI, they keep getting better. However, Lodge points out, if we are to accept that argument we need some understanding of why they will get better at this kind of task. The difficulty, he continued, is that no one can predict exactly how GPT-4 will respond to a specific prompt; the model is beyond human explanation. That, he believes, is why "prompt engineering isn't really a thing." He stresses that AI researchers also struggle to prove that the "emergent properties" of large language models exist at all, let alone predict them.
Arguably the best argument available is induction: GPT-4 is better than GPT-3 at some language tasks because it is larger, therefore an even larger model will be better still.
The problem, in Lodge's view, is that GPT-4 still struggles with the same challenges GPT-3 faced. One of them is math: GPT-4 is better than GPT-3 at addition, but it still stumbles over multiplication and other mathematical operations.
Making language models bigger does not magically solve these problems, and according to OpenAI, larger models are not the answer. The reason comes down to the fundamental nature of large language models, as an OpenAI forum post explains: "Large language models are probabilistic in nature and operate by generating likely outputs based on patterns they have observed in the training data. In the case of math and physics problems, there may be only a single correct answer, and the probability of generating that answer can be very low."
Reinforcement learning-driven AI, by contrast, can produce more accurate results because it is a goal-seeking process: it iterates toward the answer that gets closest to the desired goal. Large language models, Lodge points out, "are not designed to iterate or to seek goals. They are designed to give a 'good enough' answer in one shot, or a few shots."
A "one-shot" answer is the first answer produced by the model, obtained by predicting a sequence of words in the prompt. "Few-shot learning" involves providing additional samples or cues to the model to assist it in generating better predictions.. Large language models often also add some randomness (that is, they are "randomized") to increase the likelihood of a better answer, so they will give different answers to the same question.
It's not that the large language model world ignores reinforcement learning. GPT-4 incorporates "reinforcement learning from human feedback" (RLHF): the core model is trained with input from human operators to favor certain answers, but that does not fundamentally change the answers the model generates in the first place. For example, Lodge noted, a large language model might offer the following options to complete the sentence "Wayne Gretzky likes ice...":
1. Wayne Gretzky likes ice cream.
2. Wayne Gretzky likes ice hockey.
3. Wayne Gretzky likes ice fishing.
4. Wayne Gretzky likes ice skating.
5. Wayne Gretzky likes ice wine.
Human operators rank the answers; they would probably decide that the legendary Canadian hockey player prefers ice hockey and ice skating, despite the broad appeal of ice cream. The human rankings, together with additional human-written responses, are used to train the model. Note that GPT-4 does not pretend to know Wayne Gretzky's preferences with certainty; it just does the best job it can when prompted.
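As a rough, simplified sketch of how such rankings become a training signal (this is an assumption-laden toy, not OpenAI's implementation): the ranking is expanded into pairwise preferences, and a per-answer reward score is nudged so that preferred answers score higher, in the spirit of the pairwise objective used to train reward models. Real RLHF trains a neural reward model and then fine-tunes the LLM against it.

```python
import math

# Toy reward-model update (illustration only, not OpenAI's implementation):
# one scalar score per answer, nudged so human-preferred answers score higher.
ranked_answers = [  # best first, following the human ranking above
    "Wayne Gretzky likes ice hockey.",
    "Wayne Gretzky likes ice skating.",
    "Wayne Gretzky likes ice cream.",
    "Wayne Gretzky likes ice fishing.",
    "Wayne Gretzky likes ice wine.",
]
reward = {answer: 0.0 for answer in ranked_answers}
LEARNING_RATE = 0.5

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(100):  # a few passes over the pairwise preferences
    for i, preferred in enumerate(ranked_answers):
        for rejected in ranked_answers[i + 1:]:
            # Probability the current scores assign to the human preference.
            p = sigmoid(reward[preferred] - reward[rejected])
            # Push the two scores apart in proportion to how wrong we are.
            reward[preferred] += LEARNING_RATE * (1.0 - p)
            reward[rejected] -= LEARNING_RATE * (1.0 - p)

for answer, score in sorted(reward.items(), key=lambda item: -item[1]):
    print(f"{score:6.2f}  {answer}")
```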
Ultimately, large language models are not designed to be highly accurate or consistent; they trade accuracy and deterministic behavior for generality. For Lodge, all of this means that reinforcement learning beats generative AI when it comes to applying AI accurately at scale.
What about software development? As I write, GenAI already offers developers who use tools like GitHub Copilot or Amazon CodeWhisperer a way to increase their productivity. This is not speculation; it has already happened. These tools predict what code is likely to come next based on the code before and after the insertion point in the integrated development environment.
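Conceptually, the completion task looks like "fill in the middle": the code before and after the cursor is packed into one prompt and the model predicts the missing span. The sketch below is purely illustrative; the sentinel strings are invented and do not match any particular tool's real prompt format.

```python
# Hypothetical "fill in the middle" prompt construction; the sentinel
# strings are made up for illustration and do not match any real tool.
prefix = "def average(values):\n    total = sum(values)\n"
suffix = "\n    return result\n"

prompt = f"<PREFIX>{prefix}<SUFFIX>{suffix}<MIDDLE>"
# A code model sampled on this prompt would be expected to produce the
# missing middle, e.g. "    result = total / len(values)".
print(prompt)
```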
In fact, as David Ramel of Visual Studio Magazine reports, the latest version of Copilot already generates 61% of the Java code written by developers who use it. For those worried that this will put software developers out of work, remember that these tools require diligent human supervision to check the completions and edit them so that the code compiles and runs correctly. Autocompletion has been a staple of IDEs since their earliest days, and Copilot and other code generators make it far more useful. Autonomously writing code at scale, however, is a different matter from autocompleting 61% of a developer's Java code.
However, reinforcement learning enables precise autonomous coding at scale, Lodge said. Of course, he has a vested interest in saying this: In 2019, his company Diffblue released Cover, a commercial unit test writing tool based on reinforcement learning. Cover writes complete unit test suites without human intervention, making it possible to automate complex, error-prone tasks at scale.
Is Lodge biased? Absolutely, but he also has plenty of experience behind his belief that reinforcement learning outperforms GenAI in software development. Today, Diffblue uses reinforcement learning to search the space of all possible test methods, automatically writes test code for each method, and selects the best of the tests it has written. The reward function is based on a variety of criteria, including test coverage and aesthetics, such as conformance to a human-written coding style. The tool creates tests for each method in about one second on average.
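A hypothetical sketch of that kind of search-and-score loop (this is not Diffblue's implementation; the candidate inputs, reward weights, and scoring functions below are all invented for illustration): candidate tests are proposed, scored with a reward that mixes coverage and style, and the best candidate is kept.

```python
import random

CANDIDATE_INPUTS = [0, 1, -1, 42, 10**6]  # inputs a test generator might try

def coverage_score(inputs: list) -> float:
    """Stand-in for running the test and measuring branch coverage."""
    branches_hit = {("neg" if x < 0 else "zero" if x == 0 else "pos") for x in inputs}
    return len(branches_hit) / 3.0

def style_score(inputs: list) -> float:
    """Stand-in for 'aesthetics': shorter tests with small literals read better."""
    return 1.0 / (1 + len(inputs)) + (0.5 if all(abs(x) < 100 for x in inputs) else 0.0)

def reward(inputs: list) -> float:
    # Invented weighting of coverage vs. style.
    return 0.8 * coverage_score(inputs) + 0.2 * style_score(inputs)

def search_for_test(rounds: int = 200) -> list:
    best, best_reward = [], float("-inf")
    for _ in range(rounds):
        size = random.randint(1, 4)
        candidate = random.sample(CANDIDATE_INPUTS, size)  # propose a test
        r = reward(candidate)                              # evaluate it
        if r > best_reward:                                # keep the best so far
            best, best_reward = candidate, r
    return best

if __name__ == "__main__":
    print("best test inputs found:", search_for_test())
```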
Lodge believes that if the goal is to automatically write 10,000 unit tests for a program that no one understands, then reinforcement learning is the only real solution. "Large language models cannot compete; humans have no way to effectively supervise them and correct their code at this scale, and making models larger and more complex does not solve this problem."
Conclusion: The most powerful thing about large language models is that they are general-purpose language processors; they can perform language tasks they were not explicitly trained for. That makes them great at content generation (copywriting) and many other things. But, Lodge emphasized, "that doesn't make large language models a substitute for AI models, often based on reinforcement learning, that are more accurate, more consistent, and can be used at scale."