Google Brain, 2017 ICLR Best Paper
Authors | Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals |
---|---|
openreview | https://openreview.net/forum?id=Sy8gdB9xx&noteId=Sy8gdB9xx |
paper link | https://openreview.net/pdf?id=Sy8gdB9xx |
Through a series of experiments, the authors show how traditional approaches to generalization fail to explain why large neural networks generalize well in practice, and argue that deep learning therefore requires rethinking generalization.
Going into the non-parametric randomization tests, the natural expectation is that random labels destroy any learnable signal, so the network should fail to converge or at least slow down substantially. Surprisingly, the network converges just as fast on random labels as on true labels. Small generalization error, then, might not be something intrinsic to the model.
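A minimal sketch of this randomization test, assuming torchvision's CIFAR-10 and a small MLP as a stand-in for the architectures actually used in the paper (Inception, AlexNet, MLPs):

```python
# Randomization test: replace the true labels with uniform random labels and
# check that training error still goes to zero.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
# Randomize every label: this destroys any relationship between images and labels.
train_set.targets = torch.randint(0, 10, (len(train_set.targets),)).tolist()
loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
                      nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):                      # enough epochs to interpolate the random labels
    correct = 0
    for x, y in loader:
        opt.zero_grad()
        logits = model(x)
        loss_fn(logits, y).backward()
        opt.step()
        correct += (logits.argmax(1) == y).sum().item()
    print(f"epoch {epoch}: train acc {correct / len(train_set):.3f}")
# Training accuracy approaches 1.0 even though test accuracy stays at chance (~10%).
```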
Machine learning is, at its core, about generalization guarantees.
In the regime where the network's capacity far exceeds the size of the training set, the network can fit virtually any random function. Conventional measures such as VC dimension therefore cannot distinguish between the two cases:

- $F$ fits an image-classification function on input $X$ and generalizes well
- $F$ fits a random function on $X$ and achieves close-to-chance generalization error
This is why understanding generalization in deep learning requires new thinking (theory).
This regime is reached as soon as the number of parameters exceeds the number of training data points, and in practice we are usually in this regime.
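A quick back-of-the-envelope check of this claim; the small MLP below is an illustrative stand-in, not one of the paper's architectures:

```python
import torch.nn as nn

# A modest MLP for 32x32x3 images already has far more parameters than CIFAR-10
# has training examples (n = 50,000).
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)           # ~1.84M parameters
print(n_params > 50_000)  # True: already in the overparameterized regime
```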
Some neural networks generalize well, some don't. Normally, if a model has large capacity, you would expect the generalization error to be large as well because the model overfits. Yet deep networks with huge numbers of parameters generalize surprisingly well.
Existing learning theory does not offer an answer to this question.
- VC dimension
- Rademacher complexity (definition recalled after this list)
- Uniform stability
- Regularization, with early stopping acting as a form of implicit regularization
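For reference, the empirical Rademacher complexity of a hypothesis class $\mathcal{H}$ on a sample $x_1, \dots, x_n$ is

$$
\hat{\mathfrak{R}}_n(\mathcal{H}) = \mathbb{E}_{\sigma}\left[ \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i) \right],
$$

where $\sigma_1, \dots, \sigma_n$ are i.i.d. uniform $\pm 1$ variables. If the class can fit arbitrary random $\pm 1$ labelings of the sample, as the experiments show deep networks can, then $\hat{\mathfrak{R}}_n(\mathcal{H}) \approx 1$ and the corresponding generalization bounds are vacuous.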
The randomization test used in this work comes from non-parametric statistics.
model | train labels | train error | generalization |
---|---|---|---|
neural net | random | zero | random guess |
neural net | true (image classification) | small | small error |
By smoothly varying the amount of label noise, the authors show the network can memorize the noisy part of the training set while still extracting the remaining useful signal.
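A sketch of that label-corruption knob, using a hypothetical helper `corrupt_labels` (not from the paper's code): with probability `p` each label is independently replaced by a uniformly random class.

```python
import random

def corrupt_labels(labels, p, num_classes=10, seed=0):
    """Independently replace each label with a uniform random class with probability p."""
    rng = random.Random(seed)
    return [rng.randrange(num_classes) if rng.random() < p else y for y in labels]

# Example on a toy label list. In the experiment, p is swept from 0 (true labels)
# to 1 (fully random labels) while training and test error are tracked.
clean = list(range(10)) * 10
print(corrupt_labels(clean, p=0.5)[:10])
```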
Deep neural networks easily fit random labels.
Explicit forms of regularization (weight decay, dropout, data augmentation) do not explain this.
Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.
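For concreteness, the ablations amount to toggling the usual explicit regularizers on and off; a rough PyTorch-flavored sketch (the hyperparameter values here are illustrative, not the paper's):

```python
import torch
import torch.nn as nn
from torchvision import transforms

def build_setup(weight_decay=False, dropout=False, augmentation=False):
    """Return (model, optimizer, transform) with the chosen explicit regularizers."""
    layers = [nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU()]
    if dropout:
        layers.append(nn.Dropout(p=0.5))
    layers.append(nn.Linear(512, 10))
    model = nn.Sequential(*layers)

    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                          weight_decay=5e-4 if weight_decay else 0.0)

    aug = [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
    transform = transforms.Compose((aug if augmentation else []) + [transforms.ToTensor()])
    return model, opt, transform

# Observation from the paper: with all of these turned off, the model still
# generalizes reasonably well, so explicit regularization alone cannot be what
# controls generalization error.
```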
Dave: L2 regularization prevents the network from learning the random labels during training. Need to take a closer look.
SGD itself acts as a form of implicit regularization. Under SGD, even non-deep models can generalize well; the authors tested Gaussian kernel machines. This doesn't explain why one architecture is better than another, but it does suggest that SGD itself deserves closer investigation.
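A minimal NumPy sketch of the linear-model version of this argument: started from zero, SGD stays in the span of the data and ends up at the minimum-$\ell_2$-norm interpolating solution (the kernel case replaces $X$ with the lifted feature matrix). The data here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # overparameterized: d > n, so many exact fits exist
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain SGD on the squared loss, started at zero. Every update is a multiple of a
# data row, so the iterate always stays in the row span of X.
w = np.zeros(d)
for _ in range(20000):
    i = rng.integers(n)
    w -= 0.01 * (X[i] @ w - y[i]) * X[i]

# The minimum-norm interpolating solution: w* = X^T alpha with (X X^T) alpha = y.
alpha = np.linalg.solve(X @ X.T, y)
w_min_norm = X.T @ alpha

print(np.linalg.norm(w - w_min_norm))   # ~0: SGD picked out the min-norm solution
```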
Previous work on expressivity focuses on expressivity over the entire input domain. The authors instead focus on expressivity over a finite dataset.
The authors also demonstrate that a shallow two-layer ReLU network can achieve this finite-sample expressivity simply by increasing its width (a sketch of the construction follows below).
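A small NumPy sketch of that construction: one shared direction $a$, one ReLU kink placed just before each projected data point, and a triangular linear solve for the output weights. The random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)          # arbitrary (here random) targets

# Project the data to a line, put one ReLU "kink" just before each projected point,
# and solve the resulting lower-triangular system for the output weights.
a = rng.standard_normal(d)
z = X @ a
order = np.argsort(z)
X, y, z = X[order], y[order], z[order]                     # sort so z_1 < ... < z_n
b = np.concatenate([[z[0] - 1.0], (z[:-1] + z[1:]) / 2])   # b_j just below z_j
A = np.maximum(z[:, None] - b[None, :], 0.0)               # A_ij = ReLU(a.x_i - b_j)
w = np.linalg.solve(A, y)                                  # output weights fitting y exactly

print(np.max(np.abs(A @ w - y)))    # ~0: a width-n two-layer ReLU net fits any labeling
```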
For MNIST, $d \ll n$, and the kernel model is linear in the "lifted" feature space.
Behnam Neyshabur et al., "In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning": https://arxiv.org/pdf/1412.6614v4.pdf