


A brief discussion of the generalization ability of deep learning
1. The issue of DNN generalization ability
The paper mainly discusses why over-parameterized neural network models can still generalize well: rather than simply memorizing the training set, they extract general rules from it that also apply to the test set (generalization ability).
Take the classic decision tree model as an example. In the good case, the first few splits of the tree already separate samples with different labels fairly well: the tree stays shallow and each leaf still holds enough samples (so the statistics behind each rule are backed by a reasonable amount of data), and the learned rules are more likely to generalize to other data (i.e. good fit and good generalization).
In the worse case, the tree cannot find general rules, so in order to fit this particular data set it grows deeper and deeper, until each leaf node holds only a handful of samples (and statistics computed from so little data may be nothing but noise). In the end the tree memorizes the data by rote (i.e. overfitting, no generalization). We can see that tree models that are too deep easily overfit.
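As a minimal sketch of this contrast (my own example, assuming scikit-learn is available, not something from the paper), compare a depth-limited tree with an unconstrained one on the same noisy data: the unconstrained tree fits the training set almost perfectly, but its test accuracy does not keep up, i.e. the generalization gap widens.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), so rote memorization is possible but useless.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (3, None):  # small depth = broad rules; None = grow until the data is memorized
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")
```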
So how can an over-parameterized neural network achieve good generalization?
2. Reasons for the generalization ability of DNN
The paper explains this from a simple and general perspective: it looks for the source of generalization in the gradient descent optimization process of neural networks.
The core idea is the gradient coherence theory: the coherence of gradients from different samples is what gives neural networks good generalization. When the per-sample gradients are well aligned during training, that is, when they are coherent, gradient descent is stable, converges quickly, and the resulting model generalizes well. Otherwise, for example when there are too few samples or training runs for too long, the model may fail to generalize.
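As a toy illustration of what "coherence" means (my own numpy sketch, not code from the paper), we can compute the per-sample gradients of a tiny logistic model and measure how well they align via their average pairwise cosine similarity; gradients computed on labels that follow a real rule align noticeably better than gradients computed on random labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y_real = (X @ w_true > 0).astype(float)              # labels produced by a real rule
y_noise = rng.integers(0, 2, size=n).astype(float)   # labels that are pure noise

def per_sample_grads(w, X, y):
    """Gradient of the logistic loss for each sample separately, shape (n, d)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def mean_pairwise_cosine(G):
    """Average pairwise cosine similarity between per-sample gradients."""
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    iu = np.triu_indices(len(G), k=1)
    return (Gn @ Gn.T)[iu].mean()

w = np.zeros(d)
print("coherence, real labels :", mean_pairwise_cosine(per_sample_grads(w, X, y_real)))
print("coherence, noise labels:", mean_pairwise_cosine(per_sample_grads(w, X, y_noise)))
```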
Based on this theory, we can explain the following phenomena.
2.1 Generalization of Wide Neural Networks
Wider neural network models generalize better. This is because a wider network contains more sub-networks than a thinner one, so it is more likely to contain a highly coherent sub-network, which leads to better generalization. In other words, gradient descent acts as a feature selector that prioritizes well-generalizing (coherent) features, and wider networks may simply have better features because they have more features to choose from.
- Original paper: Generalization and width. Neyshabur et al. [2018b] found that wider networks generalize better. Can we now explain this? Intuitively, wider networks have more sub-networks at any given level, and so the sub-network with maximum coherence in a wider network may be more coherent than its counterpart in a thinner network, and hence generalize better. In other words, since—as discussed in Section 10—gradient descent is a feature selector that prioritizes well-generalizing (coherent) features, wider networks are likely to have better features simply because they have more features. In this connection, see also the Lottery Ticket Hypothesis [Frankle and Carbin, 2018]
- Paper link: https://github.com/aialgorithm/Blog
Personally, though, I think we should distinguish between the width of the input layer and the width of the hidden layers. For the input layer of data mining tasks in particular, where features are usually hand-designed, feature selection (i.e. reducing the width of the input layer) is still worthwhile; otherwise, feeding noisy features directly into the network will interfere with gradient coherence.
2.2 Generalization of Deep Neural Networks
The deeper the network, the more the gradient coherence effect is amplified, and the better the generalization ability.
In a deep model, the feedback between layers reinforces coherent gradients, so the relative difference between features with coherent gradients (e.g. W6) and features with incoherent gradients (e.g. W1) is amplified exponentially during training. Deeper networks therefore favor coherent gradients, which leads to better generalization.
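The "exponentially amplified" claim can be made concrete with a toy calculation (my own illustration, with made-up per-step reinforcement factors, not numbers from the paper): if a feature backed by coherent gradients is strengthened slightly more per step than one backed by incoherent gradients, the relative gap between them grows exponentially with the number of steps.

```python
# Assumed per-step reinforcement factors: 1.10 for the coherent feature, 1.01 for the incoherent one.
gain_coherent, gain_incoherent = 1.10, 1.01

for step in (1, 10, 50):
    ratio = (gain_coherent ** step) / (gain_incoherent ** step)
    print(f"after {step:2d} steps, coherent/incoherent strength ratio ~ {ratio:.1f}")
```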
2.3 Early-stopping
Through early stopping we can limit the excessive influence of incoherent gradients and improve generalization.
During training, easy samples are fitted earlier than other (hard) samples. In the early stage of training, the coherent gradients of these easy samples dominate the average gradient g(wt), so they are fitted quickly. In the later stage, the incoherent gradients of the hard samples dominate the average gradient g(wt), which hurts generalization; this is the point at which we should stop early.
- (Note: easy samples are those whose gradients are shared with many other samples in the data set; since most gradients therefore work in their favor, they also converge faster.)
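A minimal early-stopping sketch, assuming a PyTorch-style model and two hypothetical helpers `train_one_epoch(model)` and `validate(model)` (placeholders for your own training and validation loops): stop once the validation loss has not improved for a few epochs, and roll back to the best checkpoint.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validate,
                            max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())   # remember the best weights
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                print(f"early stop at epoch {epoch}, best val loss {best_loss:.4f}")
                break
    if best_state is not None:
        model.load_state_dict(best_state)   # roll back to the best checkpoint
    return model
```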
2.4 Full gradient descent VS learning rate
The paper finds that full-batch gradient descent can also generalize well, and careful experiments show that stochastic gradient descent does not necessarily generalize better; this does not, however, rule out that stochastic gradients may help escape local minima, act as a form of regularization, and so on.
- Based on our theory, finite learning rate and mini-batch stochasticity are not necessary for generalization
We believe that a lower learning rate may not reduce the generalization error, because a lower learning rate means more iterations (the opposite of early stopping).
- Assuming a small enough learning rate, as training progresses, the generalization gap cannot decrease. This follows from the iterative stability analysis of training: with more steps, stability can only degrade. If this is violated in a practical setting, it would point to an interesting limitation of the theory.
2.5 L2 and L1 regularization
When L1 or L2 regularization is added to the objective function, a corresponding term is also added to the gradient: the L1 term contributes sign(w) and the L2 term contributes w. Taking L2 regularization as an example (with regularization coefficient λ and learning rate η), the weight update becomes: w(t+1) = w(t) - η·(∇L(w(t)) + λ·w(t)).
We can think of L2 regularization (weight decay) as a kind of "background force" that pushes every parameter toward a data-independent value of zero (L1 tends to yield sparse solutions, L2 tends to yield smooth solutions close to 0), wiping out the influence of weak gradient directions. Only in directions where the gradients are coherent can the parameters break away from this "background force" and be updated according to the data.
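A minimal numpy sketch of the update rule above (my own illustration; `lr`, `l2` and `l1` stand for the learning rate and the regularization coefficients): the regularizer adds a data-independent "background force" to the data gradient at every step, and only a strong, coherent data gradient overcomes it.

```python
import numpy as np

def sgd_step(w, data_grad, lr=0.1, l2=0.0, l1=0.0):
    """w(t+1) = w(t) - lr * (data_grad + l2 * w(t) + l1 * sign(w(t)))"""
    return w - lr * (data_grad + l2 * w + l1 * np.sign(w))

w = np.array([0.5, -0.5])
weak_grad = np.array([0.01, 0.01])    # weak / incoherent gradient direction
strong_grad = np.array([1.0, -1.0])   # strong / coherent gradient direction
print(sgd_step(w, weak_grad, l2=0.1))    # weight decay dominates: w shrinks toward 0
print(sgd_step(w, strong_grad, l2=0.1))  # data gradient dominates the decay term
```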
2.6 Advanced gradient descent algorithms
- Momentum, Adam and other gradient descent algorithms
In gradient descent algorithms such as Momentum and Adam, the update direction of the parameters w is determined not only by the current gradient but also by the accumulated direction of previous gradients (that is, the effect of accumulated coherent gradients is retained). Parameters are therefore updated faster along dimensions where the gradient direction barely changes, and with smaller steps along dimensions where the gradient direction changes a lot, which accelerates convergence and reduces oscillation.
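A minimal numpy sketch of the momentum idea (my own illustration, not the exact Adam update): the update direction is an accumulated sum of past gradients, so a dimension where the per-step gradients keep agreeing (coherent) builds up a large update, while a dimension whose gradient keeps flipping sign largely cancels out.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad          # accumulate past gradients with exponential decay
    return w - lr * v, v

w, v = np.zeros(2), np.zeros(2)
for t in range(10):
    # dimension 0: consistent gradient every step; dimension 1: sign flips every step
    grad = np.array([1.0, (-1.0) ** t])
    w, v = momentum_step(w, v, grad)

print("parameter after 10 steps:", w)
# dimension 0 has accumulated a large update; dimension 1 has barely moved
```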
- Suppress gradient descent in weak gradient directions
We can further improve generalization by modifying batch gradient descent to suppress updates in weak gradient directions. For example, winsorized gradient descent clips gradient outliers before averaging; alternatively, taking the median of the per-sample gradients instead of the mean also reduces the impact of gradient outliers.
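A minimal numpy sketch of these two aggregation tricks (my own illustration, assuming the per-sample gradients are available as rows of a matrix): winsorize each coordinate before averaging, or take a coordinate-wise median instead of the mean, so that a few outlier gradients cannot dominate the update.

```python
import numpy as np

def winsorized_mean_grad(per_sample_grads, pct=10):
    """Clip each coordinate to its [pct, 100-pct] percentile range, then average."""
    lo = np.percentile(per_sample_grads, pct, axis=0)
    hi = np.percentile(per_sample_grads, 100 - pct, axis=0)
    return np.clip(per_sample_grads, lo, hi).mean(axis=0)

def median_grad(per_sample_grads):
    """Coordinate-wise median of the per-sample gradients."""
    return np.median(per_sample_grads, axis=0)

# 99 well-behaved per-sample gradients plus one extreme outlier.
G = np.vstack([np.random.default_rng(0).normal(0.5, 0.1, size=(99, 3)),
               [[50.0, -50.0, 50.0]]])
print("plain mean      :", G.mean(axis=0))
print("winsorized mean :", winsorized_mean_grad(G))
print("median          :", median_grad(G))
```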
Summary
A few words to close this article: if you are interested in deep learning theory, the related research cited in the paper is worth reading.