
Awesome Deep Phenomena

Our understanding of modern neural networks lags behind their practical successes. This growing gap poses a challenge to the pace of progress in machine learning because fewer pillars of knowledge are available to designers of models and algorithms (Hanie Sedghi). Inspired by the ICML 2019 workshop Identifying and Understanding Deep Learning Phenomena, I collect papers and related resources that present interesting empirical studies and insights into the nature of deep learning.


Empirical Study


Empirical Study: 2024

  • Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling. [paper]

    • Lechao Xiao.
    • Key Word: Regularization; Generalization; Neural Scaling Law.
    • Digest This paper explores the shift in machine learning from focusing on minimizing generalization error to reducing approximation error, particularly in the context of large language models (LLMs) and scaling laws. It questions whether traditional regularization principles, like L2 regularization and small batch sizes, remain relevant in this new paradigm. The authors introduce the concept of “scaling law crossover,” where techniques effective at smaller scales may fail as model size increases. The paper raises two key questions: what new principles should guide model scaling, and how can models be effectively compared at large scales where only single experiments are feasible?
  • AI models collapse when trained on recursively generated data. [paper]

    • Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal. Nature
    • Key Word: Model Collapse; Generative Model.
    • Digest The paper demonstrates that AI models trained on recursively generated data suffer from a collapse in performance. This finding highlights the critical need for diverse and high-quality data sources to maintain the robustness and reliability of AI systems.
  • Not All Language Model Features Are Linear. [paper]

    • Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark.
    • Key Word: Large Language Model; Linear Representation Hypothesis.
    • Digest This paper challenges the linear representation hypothesis in language models by proposing that some representations are inherently multi-dimensional. Using sparse autoencoders, the authors identify interpretable multi-dimensional features in GPT-2 and Mistral 7B, such as circular features for days and months, and demonstrate their computational significance through intervention experiments.
  • LoRA Learns Less and Forgets Less. [paper]

    • Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham.
    • Key Word: LoRA; Fine-Tuning; Learning-Forgetting Trade-off.
    • Digest The study compares the performance of Low-Rank Adaptation (LoRA), a parameter-efficient finetuning method for large language models, with full finetuning in programming and mathematics domains. While LoRA generally underperforms compared to full finetuning, it better maintains the base model's performance on tasks outside the target domain, providing stronger regularization than common techniques like weight decay and dropout. Full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, which may explain some performance gaps. The study concludes with best practices for finetuning with LoRA.
  • The Platonic Representation Hypothesis. [paper]

    • Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola.
    • Key Word: Foundation Models; Representational Convergence.
    • Digest The paper argues that representations in AI models, especially deep networks, are converging. This convergence is observed across time, multiple domains, and different data modalities. As models get larger, they measure distance between data points in increasingly similar ways. The authors hypothesize that this convergence is moving towards a shared statistical model of reality, which they term the "platonic representation." They discuss potential selective pressures towards this representation and explore the implications, limitations, and counterexamples to their analysis.
  • The Unreasonable Ineffectiveness of the Deeper Layers. [paper]

    • Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts.
    • Key Word: Large Language Model; Pruning.
    • Digest This study explores a straightforward layer-pruning approach on widely-used pretrained large language models (LLMs), showing that removing up to half of the layers results in only minimal performance decline on various question-answering benchmarks. The method involves selecting the best layers to prune based on layer similarity, followed by minimal finetuning to mitigate any loss in performance. Specifically, it employs parameter-efficient finetuning techniques like quantization and Low Rank Adapters (QLoRA), enabling experiments on a single A100 GPU. The findings indicate that layer pruning could both reduce finetuning computational demands and enhance inference speed and memory efficiency. Moreover, the resilience of LLMs to layer removal raises questions about the effectiveness of current pretraining approaches or highlights the significant knowledge-storing capacity of the models' shallower layers.
  • Unfamiliar Finetuning Examples Control How Language Models Hallucinate. [paper]

    • Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine.
    • Key Word: Large Language Model; Hallucination; Supervised Fine-Tuning.
    • Digest This study investigates the propensity of large language models (LLMs) to produce plausible but factually incorrect responses, focusing on their behavior with unfamiliar concepts. The research identifies a pattern where LLMs resort to hedged predictions for unfamiliar inputs, influenced by the supervision of such examples during fine-tuning. By adjusting the supervision of these examples, it's possible to direct LLM responses towards acknowledging their uncertainty (e.g., by saying "I don't know").
  • When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method. [paper]

    • Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat.
    • Key Word: Neural Scaling Laws; Large Language Model; Fine-Tuning.
    • Digest This study investigates how scaling factors—model size, pretraining data size, finetuning parameter size, and finetuning data size—affect finetuning performance of large language models (LLMs) across two methods: full-model tuning (FMT) and parameter efficient tuning (PET). Experiments on bilingual LLMs for translation and summarization tasks reveal that finetuning performance scales multiplicatively with data size and other factors, favoring model scaling over pretraining data scaling, with PET parameter scaling showing limited effectiveness. These insights suggest the choice of finetuning method is highly task- and data-dependent, offering guidance for optimizing LLM finetuning strategies.
  • Rethink Model Re-Basin and the Linear Mode Connectivity. [paper]

    • Xingyu Qu, Samuel Horvath.
    • Key Word: Linear Mode Connectivity; Model Merging; Re-Normalization; Pruning.
    • Digest The paper discusses the "model re-basin regime," where most solutions found by stochastic gradient descent (SGD) in sufficiently wide models converge to similar states, impacting model averaging. It identifies limitations in current strategies due to a poor understanding of the mechanisms involved. The study critiques existing matching algorithms for their inadequacies and proposes that proper re-normalization can address these issues. By adopting a more analytical approach, the paper reveals how matching algorithms and re-normalization interact, offering clearer insights and improvements over previous work. This includes a connection between linear mode connectivity and pruning, leading to a new lightweight post-pruning method that enhances existing pruning techniques.
  • How Good is a Single Basin? [paper]

    • Kai Lion, Lorenzo Noci, Thomas Hofmann, Gregor Bachmann.
    • Key Word: Linear Mode Connectivity; Deep Ensembles.
    • Digest This paper investigates the assumption that the multi-modal nature of neural loss landscapes is key to the success of deep ensembles. By creating "connected" ensembles that are confined to a single basin, the study finds that this limitation indeed reduces performance. However, it also discovers that distilling knowledge from multiple basins into these connected ensembles can offset the performance deficit, effectively creating multi-basin deep ensembles within a single basin. This suggests that while knowledge from outside a given basin exists within it, it is not readily accessible without learning from other basins.
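The last two entries above both rest on linear mode connectivity: whether the loss stays low along the straight line between two trained solutions. A minimal sketch of that check follows, using NumPy and toy one-dimensional losses. The names `loss_barrier`, `double_well`, and `quadratic` are hypothetical, and the barrier definition used here (the largest gap between the loss on the path and the linear interpolation of the endpoint losses) is one common convention, not necessarily the exact one used in either paper.

```python
import numpy as np


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess loss along the straight line from theta_a to theta_b.

    Assumption: the barrier is measured against the linear interpolation
    of the two endpoint losses, so it is 0 when the path never rises
    above that baseline.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    # Loss evaluated at each interpolated parameter vector.
    path = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    # Straight line between the two endpoint losses.
    baseline = (1 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline))


def double_well(theta):
    # Toy loss with two separate minima (basins) at theta = -1 and theta = +1.
    return float(np.sum((theta ** 2 - 1.0) ** 2))


def quadratic(theta):
    # Toy convex loss with a single basin.
    return float(np.sum(theta ** 2))


# Two solutions in different basins: the path crosses a ridge at theta = 0.
barrier_two_basins = loss_barrier(double_well, np.array([-1.0]), np.array([1.0]))

# Two points in the same convex basin: no barrier.
barrier_one_basin = loss_barrier(quadratic, np.array([1.0]), np.array([3.0]))
```

With real networks, `theta_a` and `theta_b` would be flattened weight vectors and `loss_fn` would evaluate the model on held-out data. A barrier near zero suggests the two solutions are linearly connected.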

Empirical Study: 2023

  • Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction. [paper] [code]

    • Pratyusha Sharma, Jordan T. Ash, Dipendra Misra.
    • Key Word: Large Language Models; Reasoning.
    • Digest Transformer-based Large Language Models (LLMs) have become a fixture in modern machine learning. Correspondingly, significant resources are allocated towards research that aims to further advance this technology, typically resulting in models of increasing size that are trained on increasing amounts of data. This work, however, demonstrates the surprising result that it is often possible to significantly improve the performance of LLMs by selectively removing higher-order components of their weight matrices. This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed, and requires no additional parameters or data. We show extensive experiments demonstrating the generality of this finding across language models and datasets, and provide in-depth analyses offering insights into both when LASER is effective and the mechanism by which it operates.
  • The Transient Nature of Emergent In-Context Learning in Transformers. [paper]

    • Aaditya K. Singh, Stephanie C.Y. Chan, Ted Moskovitz, Erin Grant, Andrew M. Saxe, Felix Hill. NeurIPS 2023
    • Key Word: In-Context Learning.
    • Digest This paper shows that in-context learning (ICL) in transformers, where models exhibit abilities not explicitly trained for, is often transient rather than persistent during training. The authors find ICL emerges then disappears, giving way to in-weights learning (IWL). This occurs across model sizes and datasets, raising questions around stopping training early for ICL vs later for IWL. They suggest L2 regularization may lead to more persistent ICL, removing the need for early stopping based on ICL validation. The transience may be caused by competition between emerging ICL and IWL circuits in the model.
  • What do larger image classifiers memorise? [paper]

    • Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar.
    • Key Word: Large Model; Memorization.
    • Digest This paper explores the relationship between memorization and generalization in modern neural networks. It discusses Feldman's metric for measuring memorization and applies it to ResNet models for image classification. The paper then investigates whether larger neural models memorize more and finds that memorization trajectories vary across different training examples and model sizes. Additionally, it notes that knowledge distillation, a model compression technique, tends to inhibit memorization while improving generalization, particularly on examples with increasing memorization trajectories.
  • Can Neural Network Memorization Be Localized? [paper]

    • Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang. ICML 2023
    • Key Word: Atypical Example Memorization; Location of Memorization; Task Specific Neurons.
    • Digest The paper demonstrates that memorization in deep overparametrized networks is not limited to individual layers but rather confined to a small set of neurons across various layers of the model. Through experimental evidence from gradient accounting, layer rewinding, and retraining, the study reveals that most layers are redundant for example memorization, and the contributing layers are typically not the final layers. Additionally, the authors propose a new form of dropout called example-tied dropout, which allows them to selectively direct memorization to a pre-defined set of neurons, effectively reducing memorization accuracy while also reducing the generalization gap.
  • No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths. [paper]

    • Charles Guille-Escuret, Hiroki Naganuma, Kilian Fatras, Ioannis Mitliagkas.
    • Key Word: Restricted Secant Inequality; Error Bound; Loss Landscape Geometry.
    • Digest The paper explores the geometric properties of optimization paths in neural networks and reveals that the quantities related to the restricted secant inequality and error bound exhibit consistent behavior during training, suggesting that optimization trajectories encounter no significant obstacles and maintain stable dynamics, leading to linear convergence and supporting commonly used learning rate schedules.
  • Sharpness-Aware Minimization Leads to Low-Rank Features. [paper]

    • Maksym Andriushchenko, Dara Bahri, Hossein Mobahi, Nicolas Flammarion.
    • Key Word: Sharpness-Aware Minimization; Low-Rank Features.
    • Digest Sharpness-aware minimization (SAM) is a method that minimizes the sharpness of the training loss of a neural network. It improves generalization and reduces the feature rank at different layers of a neural network. This low-rank effect occurs for different architectures and objectives. A significant number of activations get pruned by SAM, contributing to rank reduction. This effect can also occur in deep networks.
  • A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation. [paper]

    • Florian Bordes, Samuel Lavoie, Randall Balestriero, Nicolas Ballas, Pascal Vincent.
    • Key Word: Pretraining; Fine-Tuning; Information Bottleneck.
    • Digest A commonly used trick in self-supervised learning (SSL), shown to make deep networks more robust to pretraining bias, is the addition of a small projector (usually a 2 or 3 layer multi-layer perceptron) on top of a backbone network during training. In contrast to previous work that studied the impact of the projector architecture, we here focus on a simpler, yet overlooked lever to control the information in the backbone representation. We show that merely changing its dimensionality -- by changing only the size of the backbone's very last block -- is a remarkably effective technique to mitigate the pretraining bias.
  • Why is the winner the best? [paper]

    • Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Sharib Ali, Vincent Andrearczyk, Marc Aubreville, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano, Jorge Bernal, Sebastian Bodenstedt, Alessandro Casella, Veronika Cheplygina, Marie Daum, Marleen de Bruijne, Adrien Depeursinge, Reuben Dorent, Jan Egger, David G. Ellis, Sandy Engelhardt, Melanie Ganz, Noha Ghatwary, Gabriel Girard, Patrick Godau, Anubha Gupta, Lasse Hansen, Kanako Harada, Mattias Heinrich, Nicholas Heller, Alessa Hering, Arnaud Huaulmé, Pierre Jannin, Ali Emre Kavur, Oldřich Kodym, Michal Kozubek, Jianning Li, Hongwei Li, Jun Ma, Carlos Martín-Isla, Bjoern Menze, Alison Noble, Valentin Oreiller, Nicolas Padoy, Sarthak Pati, Kelly Payette, Tim Rädsch, Jonathan Rafael-Patiño, Vivek Singh Bawa, Stefanie Speidel, Carole H. Sudre, Kimberlin van Wijnen, Martin Wagner, Donglai Wei, Amine Yamlahi, Moi Hoon Yap, Chun Yuan, Maximilian Zenk, Aneeq Zia, David Zimmerer, Dogu Baran Aydogan, Binod Bhattarai, Louise Bloch, Raphael Brüngel, Jihoon Cho, Chanyeol Choi, Qi Dou, Ivan Ezhov, Christoph M. Friedrich, Clifton Fuller, Rebati Raman Gaire, Adrian Galdran, Álvaro García Faura, Maria Grammatikopoulou, SeulGi Hong, Mostafa Jahanifar, Ikbeom Jang, Abdolrahim Kadkhodamohammadi, Inha Kang, Florian Kofler, Satoshi Kondo, Hugo Kuijf, Mingxing Li, Minh Huan Luu, Tomaž Martinčič, Pedro Morais, Mohamed A. Naser, Bruno Oliveira, David Owen, Subeen Pang, Jinah Park, Sung-Hong Park, Szymon Płotka, Elodie Puybareau, Nasir Rajpoot, Kanghyun Ryu, Numan Saeed, Adam Shephard, Pengcheng Shi, Dejan Štepec, Ronast Subedi, Guillaume Tochon, Helena R. Torres, Helene Urien, João L. Vilaça, Kareem Abdul Wahid, Haojie Wang, Jiacheng Wang, Liansheng Wang, Xiyue Wang, Benedikt Wiestler, Marek Wodzinski, Fangfang Xia, Juanying Xie, Zhiwei Xiong, Sen Yang, Yanwu Yang, Zixuan Zhao, Klaus Maier-Hein, Paul F. Jäger, Annette Kopp-Schneider, Lena Maier-Hein.
    • Key Word: Benchmarking Competitions; Medical Imaging.
    • Digest The article addresses how little has been investigated about what can be learned from international benchmarking competitions for image analysis methods. To close this gap, the authors ran a multi-center study of 80 competitions held in the scope of IEEE ISBI 2021 and MICCAI 2021. Based on comprehensive descriptions of the submitted algorithms and their rankings, as well as participation strategies, statistical analyses revealed common characteristics of winning solutions. These typically include the use of multi-task learning and/or multi-stage pipelines, and a focus on augmentation, image preprocessing, data curation, and postprocessing.
  • Sparks of Artificial General Intelligence: Early experiments with GPT-4. [paper]

    • Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang.
    • Key Word: Artificial General Intelligence; Benchmarking; GPT.
    • Digest We discuss the rising capabilities and implications of GPT-4 and the new cohort of large language models it belongs to. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT.
  • Is forgetting less a good inductive bias for forward transfer? [paper]

    • Jiefeng Chen, Timothy Nguyen, Dilan Gorur, Arslan Chaudhry. ICLR 2023
    • Key Word: Continual Learning; Catastrophic Forgetting; Forward Transfer; Inductive Bias.
    • Digest One of the main motivations of studying continual learning is that the problem setting allows a model to accrue knowledge from past tasks to learn new tasks more efficiently. However, recent studies suggest that the key metric that continual learning algorithms optimize, reduction in catastrophic forgetting, does not correlate well with the forward transfer of knowledge. We believe that the conclusion previous works reached is due to the way they measure forward transfer. We argue that the measure of forward transfer to a task should not be affected by the restrictions placed on the continual learner in order to preserve knowledge of previous tasks.
  • Dropout Reduces Underfitting. [paper] [code]

    • Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, Trevor Darrell.
    • Key Word: Dropout; Overfitting.
    • Digest In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training. During the early phase, we find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient. This helps counteract the stochasticity of SGD and limit the influence of individual batches on model training.
  • The Role of Pre-training Data in Transfer Learning. [paper]

    • Rahim Entezari, Mitchell Wortsman, Olga Saukh, M. Moein Shariatnia, Hanie Sedghi, Ludwig Schmidt.
    • Key Word: Pre-training; Transfer Learning.
    • Digest We investigate the impact of pre-training data distribution on the few-shot and full fine-tuning performance using 3 pre-training methods (supervised, contrastive language-image and image-image), 7 pre-training datasets, and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of the pre-training data source is essential for the few-shot transfer, but its role decreases as more data is made available for fine-tuning.
  • The Dormant Neuron Phenomenon in Deep Reinforcement Learning. [paper] [code]

    • Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, Utku Evci.
    • Key Word: Dormant Neuron; Deep Reinforcement Learning.
    • Digest The paper identifies the dormant neuron phenomenon in deep reinforcement learning, where inactive neurons increase and hinder network expressivity, affecting learning. To address this, they propose a method called ReDo, which recycles dormant neurons during training. ReDo reduces the number of dormant neurons, maintains network expressiveness, and leads to improved performance.
  • Cliff-Learning. [paper]

    • Tony T. Wang, Igor Zablotchi, Nir Shavit, Jonathan S. Rosenfeld.
    • Key Word: Foundation Models; Fine-Tuning.
    • Digest We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot).
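The cliff-learning entry above describes regions where error falls faster than a power law, which appear as concave stretches on a log-log plot. A minimal sketch of how such stretches could be flagged from measured points follows. The helper names `loglog_slopes` and `concave_segments`, the tolerance `tol`, and the synthetic error values are all hypothetical, and this is not the procedure used in the paper.

```python
import numpy as np


def loglog_slopes(n, err):
    """Local slope of log(err) versus log(n) between consecutive points.

    A pure power law err = c * n**(-k) gives a constant slope of -k.
    """
    log_n, log_e = np.log(n), np.log(err)
    return np.diff(log_e) / np.diff(log_n)


def concave_segments(n, err, tol=1e-6):
    """Flag places where the log-log slope gets steeper (more negative).

    Assumption: a drop in slope larger than tol between neighbouring
    segments is treated as concavity, i.e. faster-than-power-law improvement.
    """
    slopes = loglog_slopes(n, err)
    return np.diff(slopes) < -tol


# Synthetic downstream dataset sizes (not real measurements).
n = np.array([10.0, 100.0, 1000.0, 10000.0])

# Pure power law: constant slope, so nothing is flagged.
power_law_err = n ** -0.5
power_law_flags = concave_segments(n, power_law_err)

# Synthetic curve whose decay accelerates: every interior point is flagged.
cliff_err = np.array([1.0, 0.5, 0.05, 0.0005])
cliff_flags = concave_segments(n, cliff_err)
```

On noisy real measurements the finite differences would likely need smoothing or a fitted curve before the slopes are compared.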

Empirical Study: 2022

  • ModelDiff: A Framework for Comparing Learning Algorithms. [paper] [code]

    • Harshay Shah, Sung Min Park, Andrew Ilyas, Aleksander Madry.
    • Key Word: Representation-based Comparison; Example-level Comparisons; Comparing Feature Attributions.
    • Digest We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data.
  • Overfreezing Meets Overparameterization: A Double Descent Perspective on Transfer Learning of Deep Neural Networks. [paper]

    • Yehuda Dar, Lorenzo Luzi, Richard G. Baraniuk.
    • Key Word: Transfer Learning; Deep Double Descent; Overfreezing.
    • Digest We study the generalization behavior of transfer learning of deep neural networks (DNNs). We adopt the overparameterization perspective -- featuring interpolation of the training data (i.e., approximately zero train error) and the double descent phenomenon -- to explain the delicate effect of the transfer learning setting on generalization performance. We study how the generalization behavior of transfer learning is affected by the dataset size in the source and target tasks, the number of transferred layers that are kept frozen in the target DNN training, and the similarity between the source and target tasks.
  • How to Fine-Tune Vision Models with SGD. [paper]

    • Ananya Kumar, Ruoqi Shen, Sébastien Bubeck, Suriya Gunasekar.
    • Key Word: Fine-Tuning; Out-of-Distribution Generalization.
    • Digest We show that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: merely freezing the embedding layer (less than 1% of the parameters) leads to SGD performing competitively with AdamW while using less memory.
  • What Images are More Memorable to Machines? [paper] [code]

    • Junlin Han, Huangying Zhan, Jie Hong, Pengfei Fang, Hongdong Li, Lars Petersson, Ian Reid.
    • Key Word: Self-Supervised Memorization Quantification.
    • Digest This paper studies the problem of measuring and predicting how memorable an image is to pattern recognition machines, as a path to explore machine intelligence. Firstly, we propose a self-supervised machine memory quantification pipeline, dubbed "MachineMem measurer", to collect machine memorability scores of images. Similar to humans, machines also tend to memorize certain kinds of images, whereas the types of images that machines and humans memorize are different.
  • Harmonizing the object recognition strategies of deep neural networks with humans. [paper] [code]

    • Thomas Fel, Ivan Felipe, Drew Linsley, Thomas Serre.
    • Key Word: Interpretation; Neural Harmonizer; Psychophysics.
    • Digest Across 84 different DNNs trained on ImageNet and three independent datasets measuring the where and the how of human visual strategies for object recognition on those images, we find a systematic trade-off between DNN categorization accuracy and alignment with human visual strategies for object recognition. State-of-the-art DNNs are progressively becoming less aligned with humans as their accuracy improves. We rectify this growing issue with our neural harmonizer: a general-purpose training routine that both aligns DNN and human visual strategies and improves categorization accuracy.
  • Pruning's Effect on Generalization Through the Lens of Training and Regularization. [paper]

    • Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite.
    • Key Word: Pruning; Regularization.
    • Digest We show that size reduction cannot fully account for the generalization-improving effect of standard pruning algorithms. Instead, we find that pruning leads to better training at specific sparsities, improving the training loss over the dense model. We find that pruning also leads to additional regularization at other sparsities, reducing the accuracy degradation due to noisy examples over the dense model. Pruning extends model training time and reduces model size. These two factors improve training and add regularization respectively. We empirically demonstrate that both factors are essential to fully explaining pruning's impact on generalization.
  • What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries. [paper] [code]

    • Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel S. Schoenholz.
    • Key Word: Class Manifold; Linear Region; Out-of-Distribution Generalization.
    • Digest Deep neural network classifiers partition input space into high confidence regions for each class. The geometry of these class manifolds (CMs) is widely studied and intimately related to model performance; for example, the margin depends on CM boundaries. We exploit the notions of Gaussian width and Gordon's escape theorem to tractably estimate the effective dimension of CMs and their boundaries through tomographic intersections with random affine subspaces of varying dimension. We show several connections between the dimension of CMs, generalization, and robustness.
  • In What Ways Are Deep Neural Networks Invariant and How Should We Measure This? [paper]

    • Henry Kvinge, Tegan H. Emerson, Grayson Jorgenson, Scott Vasquez, Timothy Doster, Jesse D. Lew. NeurIPS 2022
    • Key Word: Invariance and Equivariance.
    • Digest We explore the nature of invariance and equivariance of deep learning models with the goal of better understanding the ways in which they actually capture these concepts on a formal level. We introduce a family of invariance and equivariance metrics that allows us to quantify these properties in a way that disentangles them from other metrics such as loss or accuracy.
  • Relative representations enable zero-shot latent space communication. [paper]

    • Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, Emanuele Rodolà.
    • Key Word: Representation Similarity; Model stitching.
    • Digest Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations. Ideally, the distribution of the data points in the latent space should depend only on the task, the data, the loss, and other architecture-specific constraints. However, factors such as the random weights initialization, training hyperparameters, or other sources of randomness in the training phase may induce incoherent latent spaces that hinder any form of reuse. Nevertheless, we empirically observe that, under the same data and modeling choices, distinct latent spaces typically differ by an unknown quasi-isometric transformation: that is, in each space, the distances between the encodings do not change. In this work, we propose to adopt pairwise similarities as an alternative data representation, that can be used to enforce the desired invariance without any additional training.
  • Minimalistic Unsupervised Learning with the Sparse Manifold Transform. [paper]

    • Yubei Chen, Zeyu Yun, Yi Ma, Bruno Olshausen, Yann LeCun.
    • Key Word: Self-Supervision; Sparse Manifold Transform.
    • Digest We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10 and 53.2% on CIFAR-100.
  • A Review of Sparse Expert Models in Deep Learning. [paper]

    • William Fedus, Jeff Dean, Barret Zoph.
    • Key Word: Mixture-of-Experts.
    • Digest Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
  • A Data-Based Perspective on Transfer Learning. [paper] [code]

    • Saachi Jain, Hadi Salman, Alaa Khaddaj, Eric Wong, Sung Min Park, Aleksander Madry.
    • Key Word: Transfer Learning; Influence Function; Data Leakage.
    • Digest It is commonly believed that in transfer learning including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness as well as detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset.
  • When Does Re-initialization Work? [paper]

    • Sheheryar Zaidi, Tudor Berariu, Hyunjik Kim, Jörg Bornschein, Claudia Clopath, Yee Whye Teh, Razvan Pascanu.
    • Key Word: Re-initialization; Regularization.
    • Digest We conduct an extensive empirical comparison of standard training with a selection of re-initialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, re-initialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of re-initialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, re-initialization significantly improves upon standard training, even in the presence of other carefully tuned regularization techniques.
  • How You Start Matters for Generalization. [paper]

    • Sameera Ramasinghe, Lachlan MacDonald, Moshiur Farazi, Hemanth Sartachandran, Simon Lucey.
    • Key Word: Implicit regularization; Fourier Spectrum.
    • Digest We promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization. Through a Fourier lens, we derive a general result for the spectral bias of neural networks and show that the generalization of neural networks is heavily tied to their initialization. Further, we empirically solidify the developed theoretical insights using practical, deep networks.
  • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [paper] [code]

    • Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer.
    • Key Word: Natural Language Processing; In-Context Learning.
    • Digest We show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.
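The demonstration-label ablation described in the in-context learning entry above can be sketched in a few lines. This is only an illustration, not the authors' code: the `Review:` / `Sentiment:` template, the function name `build_prompt`, and the toy examples are all hypothetical. The sketch keeps the inputs, the label space, and the format fixed, and changes only whether each demonstration carries its gold label or a label drawn uniformly at random from the label space.

```python
import random

def build_prompt(demos, query, label_space, randomize_labels=False, seed=0):
    """Build a k-shot prompt; optionally replace gold labels with random ones.

    The template below is a hypothetical example format, not taken from the paper.
    """
    rng = random.Random(seed)
    blocks = []
    for text, gold in demos:
        # Only the label changes; input text and format stay the same.
        label = rng.choice(label_space) if randomize_labels else gold
        blocks.append(f"Review: {text}\nSentiment: {label}\n")
    # The query is left unlabeled for the model to complete.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

if __name__ == "__main__":
    demos = [("great film", "positive"), ("dull plot", "negative")]
    labels = ["positive", "negative"]
    print(build_prompt(demos, "fine acting", labels))
    print("---")
    print(build_prompt(demos, "fine acting", labels, randomize_labels=True, seed=1))
```

Feeding both prompt variants to the same model and comparing accuracy is the kind of comparison the digest summarizes.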

Empirical Study: 2021

  • Masked Autoencoders Are Scalable Vision Learners. [paper] [code]

    • Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. CVPR 2022
    • Key Word: Self-Supervision; Autoencoders.
    • Digest This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
  • Learning in High Dimension Always Amounts to Extrapolation. [paper]

    • Randall Balestriero, Jerome Pesenti, Yann LeCun.
    • Key Word: Interpolation and Extrapolation.
    • Digest The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample x whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when x falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets; in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (>100) dataset, interpolation almost surely never happens.
  • Understanding Dataset Difficulty with V-Usable Information. [paper] [code]

    • Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta. ICML 2022
    • Key Word: Dataset Difficulty Measures; Information Theory.
    • Digest Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model V -- as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V. We further introduce pointwise V-information (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution.
  • Exploring the Limits of Large Scale Pre-training. [paper]

    • Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi. ICLR 2022
    • Key Word: Pre-training.
    • Digest We investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomena and captures the nonlinear relationship in performance of upstream and downstream tasks.
  • Stochastic Training is Not Necessary for Generalization. [paper] [code]

    • Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein. ICLR 2022
    • Key Word: Stochastic Gradient Descent; Regularization.
    • Digest It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline.
  • Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. [paper]

    • Chiyuan Zhang, Maithra Raghu, Jon Kleinberg, Samy Bengio.
    • Key Word: Out-of-Distribution Generalization.
    • Digest In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture.
  • What can linear interpolation of neural network loss landscapes tell us? [paper]

    • Tiffany Vlaar, Jonathan Frankle. ICML 2022
    • Key Word: Linear Interpolation; Loss Landscapes.
    • Digest We put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model.
  • Can Vision Transformers Learn without Natural Images? [paper] [code]

    • Kodai Nakashima, Hirokatsu Kataoka, Asato Matsumoto, Kenji Iwata, Nakamasa Inoue. AAAI 2022
    • Key Word: Formula-driven Supervised Learning; Vision Transformer.
    • Digest We pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent.
  • The Low-Rank Simplicity Bias in Deep Networks. [paper]

    • Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, Phillip Isola.
    • Key Word: Low-Rank Embedding; Inductive Bias.
    • Digest We make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutions with lower effective rank embeddings. We conjecture that this bias exists because the volume of functions that maps to low effective rank embedding increases with depth. We show empirically that our claim holds true on finite width linear and non-linear models on practical learning paradigms and show that on natural data, these are often the solutions that generalize well.
  • Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. [paper] [code]

    • Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar. ICLR 2021
    • Key Word: Edge of Stability.
    • Digest We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value 2/(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training.
  • Pre-training without Natural Images. [paper] [code]

    • Hirokatsu Kataoka, Kazushige Okayasu, Asato Matsumoto, Eisuke Yamagata, Ryosuke Yamada, Nakamasa Inoue, Akio Nakamura, Yutaka Satoh. ACCV 2020
    • Key Word: Formula-driven Supervised Learning.
    • Digest The paper proposes a novel concept, Formula-driven Supervised Learning. We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law existing in the background knowledge of the real world. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinite scale dataset of labeled images. Although the models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, do not necessarily outperform models pre-trained with human annotated datasets in all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models.
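The 2/(step size) threshold in the Edge of Stability entry above has a classical origin that a toy example makes concrete. The sketch below is not the paper's experiment: it runs gradient descent on a one-dimensional quadratic f(x) = 0.5 · a · x², where the curvature a plays the role of the largest Hessian eigenvalue. The function name `run_gd` and all constants are illustrative assumptions. Each step multiplies x by (1 - step_size · a), so the iterates shrink only when a < 2 / step_size.

```python
def run_gd(sharpness, step_size, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * sharpness * x**2; returns final |x|."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # f'(x) for the quadratic
        x -= step_size * grad     # equivalent to x *= (1 - step_size * sharpness)
    return abs(x)

if __name__ == "__main__":
    step_size = 0.1
    threshold = 2.0 / step_size   # stability threshold, 20 for this step size
    for sharpness in (19.0, 21.0):
        side = "below" if sharpness < threshold else "above"
        print(f"sharpness={sharpness} ({side} 2/step_size): final |x| = {run_gd(sharpness, step_size):.4g}")
```

On a quadratic, curvature above the threshold simply diverges. The digest's observation is that real networks instead hover near this threshold while the loss still decreases over long timescales.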

Empirical Study: 2020

  • When Do Curricula Work? [paper] [code]

    • Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur. ICLR 2021
    • Key Word: Curriculum Learning.
    • Digest We set out to investigate the relative benefits of ordered learning. We first investigate the implicit curricula resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of explicit curricula, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered.
  • In Search of Robust Measures of Generalization. [paper] [code]

    • Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, Daniel M. Roy. NeurIPS 2020
    • Key Word: Generalization Measures.
    • Digest One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories -- such as those based on the VC dimension of the class of predictors induced by modern neural network architectures -- are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically.
  • The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers. [paper] [code]

    • Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi. ICLR 2021
    • Key Word: Online Learning; Finite-Sample Deviations.
    • Digest We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning.
  • Characterising Bias in Compressed Models. [paper]

    • Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton.
    • Key Word: Pruning; Fairness.
    • Digest The popularity and widespread use of pruning and quantization is driven by the severe resource constraints of deploying deep neural networks to environments with strict latency, memory and energy requirements. These techniques achieve high levels of compression with negligible impact on top-line metrics (top-1 and top-5 accuracy). However, overall accuracy hides disproportionately high errors on a small subset of examples; we call this subset Compression Identified Exemplars (CIE).
  • Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. [paper] [code]

    • Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi. EMNLP 2020
    • Key Word: Training Dynamics; Data Map; Curriculum Learning.
    • Digest Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps, a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps.
  • What is being transferred in transfer learning? [paper] [code]

    • Behnam Neyshabur, Hanie Sedghi, Chiyuan Zhang. NeurIPS 2020
    • Key Word: Transfer Learning.
    • Digest We provide new tools and analyses to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from learning low-level statistics of data and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape, and different instances of such models are similar in feature space and close in parameter space.
  • Deep Isometric Learning for Visual Recognition. [paper] [code]

    • Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, Jitendra Malik. ICML 2020
    • Key Word: Isometric Networks.
    • Digest This paper shows that deep vanilla ConvNets without normalization or skip connections can also be trained to achieve surprisingly good performance on standard image recognition benchmarks. This is achieved by enforcing the convolution kernels to be near isometric during initialization and training, as well as by using a variant of ReLU that is shifted towards being isometric.
  • On the Generalization Benefit of Noise in Stochastic Gradient Descent. [paper]

    • Samuel L. Smith, Erich Elsen, Soham De. ICML 2020
    • Key Word: Stochastic Gradient Descent.
    • Digest In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses.
  • Do CNNs Encode Data Augmentations? [paper]

    • Eddie Yan, Yanping Huang.
    • Key Word: Data Augmentations.
    • Digest Surprisingly, neural network features not only predict data augmentation transformations, but they predict many transformations with high accuracy. After validating that neural networks encode features corresponding to augmentation transformations, we show that these features are primarily encoded in the early layers of modern CNNs.
  • Do We Need Zero Training Loss After Achieving Zero Training Error? [paper] [code]

    • Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama. ICML 2020
    • Key Word: Regularization.
    • Digest Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. We experimentally show that flooding improves performance and as a byproduct, induces a double descent curve of the test loss.
  • Understanding Why Neural Networks Generalize Well Through GSNR of Parameters. [paper]

    • Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang. ICLR 2020
    • Key Word: Generalization Indicators.
    • Digest In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance, over the data distribution.
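The "one line of code" mentioned in the flooding entry above (Do We Need Zero Training Loss After Achieving Zero Training Error?) is commonly written as |loss - b| + b, where b is the flooding level. The sketch below shows that objective on a toy problem, loss(w) = w², rather than on a neural network. The names `flooded_loss` and `simulate_flooding` and all constants are illustrative assumptions. When the loss is above b the update is ordinary gradient descent; when it is below b the sign flips and the update becomes gradient ascent, so the loss floats around b instead of going to zero.

```python
def flooded_loss(loss, flood_level):
    """Flooded objective |loss - b| + b for a (mini-batch) loss value."""
    return abs(loss - flood_level) + flood_level

def simulate_flooding(flood_level=0.25, lr=0.1, steps=200, w=1.0):
    """Minimize the toy loss w**2 under flooding; returns the raw loss per step."""
    losses = []
    for _ in range(steps):
        loss = w * w
        losses.append(loss)
        grad = 2.0 * w
        # Gradient of |loss - b| + b w.r.t. w: same as grad above b, negated below b.
        direction = -1.0 if loss < flood_level else 1.0
        w -= lr * direction * grad
    return losses

if __name__ == "__main__":
    tail = simulate_flooding()[-5:]
    print("last raw losses:", [round(v, 3) for v in tail])
```

In a deep learning framework the same idea is applied to the mini-batch loss tensor before calling the backward pass, which is why it composes with any stochastic optimizer.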

Empirical Study: 2019

  • Angular Visual Hardness. [paper]

    • Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar. ICML 2020
    • Key Word: Calibration; Example Hardness Measures.
    • Digest We propose a novel measure for CNN models known as Angular Visual Hardness. Our comprehensive empirical studies show that AVH can serve as an indicator of generalization abilities of neural networks, and improving SOTA accuracy entails improving accuracy on hard examples.
  • Fantastic Generalization Measures and Where to Find Them. [paper] [code]

    • Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio. ICLR 2020
    • Key Word: Complexity Measures; Spurious Correlations.
    • Digest We present the first large scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
  • Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory. [paper] [code]

    • Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein. ICLR 2020
    • Key Word: Local Minima.
    • Digest The authors take a closer look at widely held beliefs about neural networks. Using a mix of analysis and experiment, they shed some light on the ways these assumptions break down.
  • Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. [paper] [code]

    • Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals. ICLR 2020
    • Key Word: Meta Learning.
    • Digest Despite MAML's popularity, a fundamental open question remains -- is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor.
  • Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias. [paper] [code]

    • Stéphane d'Ascoli, Levent Sagun, Joan Bruna, Giulio Biroli. NeurIPS 2019
    • Key Word: Architectural Bias.
    • Digest In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.
  • Adversarial Training Can Hurt Generalization. [paper]

    • Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, Percy Liang.
    • Key Word: Adversarial Examples.
    • Digest While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data.
  • Bad Global Minima Exist and SGD Can Reach Them. [paper] [code]

    • Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas. NeurIPS 2020
    • Key Word: Stochastic Gradient Descent.
    • Digest Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can be easily made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling on the data, before switching to properly training with the correct labels.
  • Deep ReLU Networks Have Surprisingly Few Activation Patterns. [paper]

    • Boris Hanin, David Rolnick. NeurIPS 2019
    • Digest In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns.
  • Sensitivity of Deep Convolutional Networks to Gabor Noise. [paper] [code]

    • Kenneth T. Co, Luis Muñoz-González, Emil C. Lupu.
    • Key Word: Robustness.
    • Digest Deep Convolutional Networks (DCNs) have been shown to be sensitive to Universal Adversarial Perturbations (UAPs): input-agnostic perturbations that fool a model on large portions of a dataset. These UAPs exhibit interesting visual patterns, but this phenomenon is, as yet, poorly understood. Our work shows that visually similar procedural noise patterns also act as UAPs. In particular, we demonstrate that different DCN architectures are sensitive to Gabor noise patterns. This behaviour, its causes, and implications deserve further in-depth study.
  • Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks. [paper]

    • Guangyong Chen, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, Shengyu Zhang.
    • Key Word: Batch Normalization; Dropout.
    • Digest Our work is based on an excellent idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent.
  • A critical analysis of self-supervision, or what we can learn from a single image. [paper] [code]

    • Yuki M. Asano, Christian Rupprecht, Andrea Vedaldi. ICLR 2020
    • Key Word: Self-Supervision.
    • Digest We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training.
  • Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. [paper] [code]

    • Wieland Brendel, Matthias Bethge. ICLR 2019
    • Key Word: Bag-of-Features.
    • Digest Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet.
  • Transfusion: Understanding Transfer Learning for Medical Imaging. [paper] [code]

    • Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio. NeurIPS 2019
    • Key Word: Transfer Learning; Medical Imaging.
    • Digest We explore properties of transfer learning for medical imaging. A performance evaluation on two large scale medical imaging tasks shows that, surprisingly, transfer offers little benefit to performance, and simple, lightweight models can perform comparably to ImageNet architectures.
  • Identity Crisis: Memorization and Generalization under Extreme Overparameterization. [paper]

    • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer. ICLR 2020
    • Key Word: Memorization.
    • Digest We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task.
  • Are All Layers Created Equal? [paper]

    • Chiyuan Zhang, Samy Bengio, Yoram Singer. JMLR
    • Key Word: Robustness.
    • Digest We show that the layers can be categorized as either "ambient" or "critical". Resetting the ambient layers to their initial values has no negative consequence, and in many cases they barely change throughout training. On the contrary, resetting the critical layers completely destroys the predictor and the performance drops to chance.

Empirical Study: 2018

  • Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. [paper] [code]

    • Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf. CVPR 2019
    • Key Word: ReLU.
    • Digest Classifiers used in the wild, in particular for safety-critical systems, should not only have good generalization properties but also should know when they don't know, in particular make low confidence predictions far away from the training data. We show that ReLU type neural networks which yield a piecewise linear classifier function fail in this regard as they produce almost always high confidence predictions far away from the training data.
  • An Empirical Study of Example Forgetting during Deep Neural Network Learning. [paper] [code]

    • Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon. ICLR 2019
    • Key Word: Curriculum Learning; Sample Weighting; Example Forgetting.
    • Digest We define a 'forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
  • On Implicit Filter Level Sparsity in Convolutional Neural Networks. [paper]

    • Dushyant Mehta, Kwang In Kim, Christian Theobalt. CVPR 2019
    • Key Word: Regularization; Sparsification.
    • Digest We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. We conduct an extensive experimental study casting our initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study allows new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice.
  • Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. [paper] [code]

    • Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem. ICML 2019
    • Key Word: Disentanglement.
    • Digest Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.
  • Insights on representational similarity in neural networks with canonical correlation. [paper] [code]

    • Ari S. Morcos, Maithra Raghu, Samy Bengio. NeurIPS 2018
    • Key Word: Representational Similarity.
    • Digest Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA.
  • Layer rotation: a surprisingly powerful indicator of generalization in deep networks? [paper] [code]

    • Simon Carbonnelle, Christophe De Vleeschouwer.
    • Key Word: Weight Evolution.
    • Digest Our work presents extensive empirical evidence that layer rotation, i.e. the evolution across training of the cosine distance between each layer's weight vector and its initialization, constitutes an impressively consistent indicator of generalization performance. In particular, larger cosine distances between final and initial weights of each layer consistently translate into better generalization performance of the final model.
  • Sensitivity and Generalization in Neural Networks: an Empirical Study. [paper]

    • Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein. ICLR 2018
    • Key Word: Sensitivity.
    • Digest In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization.
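
The layer rotation entry above describes its metric as the cosine distance between each layer's weight vector and its initialization. A minimal sketch of that quantity in plain Python follows; the function names, the dictionary layout, and the toy weights are illustrative assumptions and are not taken from the paper's code.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle) between two flattened weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def layer_rotation(initial_weights, current_weights):
    """Map each layer name to the cosine distance from its initialization."""
    return {
        name: cosine_distance(initial_weights[name], current_weights[name])
        for name in initial_weights
    }


# Hypothetical toy "network": two layers with flattened weights.
w_init = {"layer1": [1.0, 0.0], "layer2": [1.0, 1.0]}
w_final = {"layer1": [0.0, 2.0], "layer2": [2.0, 2.0]}

# layer1 became orthogonal to its initialization; layer2 was only rescaled.
rotation = layer_rotation(w_init, w_final)
```

Rescaling a layer leaves the distance at zero, so the metric tracks changes in direction rather than in norm.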

Empirical Study: 2017

  • Deep Image Prior. [paper] [code]

    • Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky.
    • Key Word: Low-Level Vision.
    • Digest In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.
  • Critical Learning Periods in Deep Neural Networks. [paper]

    • Alessandro Achille, Matteo Rovere, Stefano Soatto. ICLR 2019
    • Key Word: Memorization.
    • Digest Our findings indicate that the early transient is critical in determining the final solution of the optimization associated with training an artificial neural network. In particular, the effects of sensory deficits during a critical period cannot be overcome, no matter how much additional training is performed.
  • A Closer Look at Memorization in Deep Networks. [paper]

    • Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien. ICML 2017
    • Key Word: Memorization.
    • Digest In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data.

Empirical Study: 2016

  • Understanding deep learning requires rethinking generalization. [paper]
    • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. ICLR 2017
    • Key Word: Memorization.
    • Digest Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data.

Neural Collapse


Neural Collapse: 2024

  • Formation of Representations in Neural Networks. [paper]

    • Liu Ziyin, Isaac Chuang, Tomer Galanti, Tomaso Poggio.
    • Key Word: Canonical Representation Hypothesis; Neural Representation.
    • Digest This paper proposes the Canonical Representation Hypothesis (CRH), which posits that during neural network training, a set of six alignment relations universally governs the formation of representations in hidden layers. Specifically, it suggests that the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned, leading to compact representations that are invariant to task-irrelevant transformations. The paper also introduces the Polynomial Alignment Hypothesis (PAH), which arises when CRH is broken, resulting in power-law relations between R, W, and G. The authors suggest that balancing gradient noise and regularization is key to the emergence of these canonical representations and propose that these hypotheses could unify major deep learning phenomena like neural collapse and the neural feature ansatz.
  • Neural Collapse Meets Differential Privacy: Curious Behaviors of NoisyGD with Near-perfect Representation Learning. [paper]

    • Chendi Wang, Yuqing Zhu, Weijie J. Su, Yu-Xiang Wang.
    • Key Word: Neural Collapse; Differential Privacy.
    • Digest Motivated by the finding of De et al. (2022) that large-scale representation learning improves differentially private (DP) learning, this paper studies DP fine-tuning through the lens of Neural Collapse (NC) in deep learning and transfer learning. It establishes an error bound within the NC framework, evaluates feature quality, shows that DP fine-tuning is less robust than non-private fine-tuning, and suggests strategies to improve its robustness, with empirical evidence supporting these findings.
  • Average gradient outer product as a mechanism for deep neural collapse. [paper]

    • Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin.
    • Key Word: Neural Collapse; Average Gradient Outer Product.
    • Digest This paper investigates the phenomenon of Deep Neural Collapse (DNC), where the final layers of Deep Neural Networks (DNNs) exhibit a highly structured representation of data. The study presents significant evidence that DNC primarily occurs through the process of deep feature learning, facilitated by the average gradient outer product (AGOP). This approach marks a departure from previous explanations that relied on feature-agnostic models. The authors highlight the role of the right singular vectors and values of the network weights in reducing within-class variability, a key aspect of DNC. They establish a link between this singular structure and the AGOP, further demonstrating experimentally and theoretically that AGOP can induce DNC even in randomly initialized neural networks. The paper also discusses Deep Recursive Feature Machines, a conceptual method representing AGOP feature learning in convolutional neural networks, and shows its capability to exhibit DNC.
  • Pushing Boundaries: Mixup's Influence on Neural Collapse. [paper]

    • Quinn Fisher, Haoming Meng, Vardan Papyan.
    • Key Word: Mixup; Neural Collapse.
    • Digest This paper investigates "Mixup," a technique enhancing deep neural network robustness by blending training data and labels, focusing on its success through geometric configurations of network activations. It finds that mixup leads to a unique alignment of last-layer activations that challenges prior expectations, with mixed examples of the same class aligning with the classifier and different classes marking distinct boundaries. This unexpected behavior suggests mixup affects deeper network layers in a novel way, diverging from simple convex combinations of class features. The study connects these findings to improved model calibration and supports them with a theoretical analysis, highlighting the role of a specific geometric pattern (simplex equiangular tight frame) in optimizing last-layer features for better performance.
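
The Mixup entry above refers to blending training data and labels. As a rough illustration, the standard mixup rule forms a convex combination of two examples and of their one-hot labels, with the mixing weight usually drawn from a Beta distribution. The sketch below assumes plain Python lists as inputs; the helper names and the default `alpha` are hypothetical choices, not values from the paper.

```python
import random


def sample_lambda(alpha=1.0):
    """Draw a mixing weight from Beta(alpha, alpha); alpha is an assumed default."""
    return random.betavariate(alpha, alpha)


def mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two examples and their one-hot labels with weight lam in [0, 1]."""
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return mixed_x, mixed_y


# Two toy examples from different classes, mixed with a fixed weight.
x_mixed, y_mixed = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)

# In training, the weight would typically be sampled for each batch.
lam_sampled = sample_lambda()
```

The mixed label is a soft target, which is why mixed examples from different classes can end up between the class means in feature space.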

Neural Collapse: 2023

  • Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations. [paper]

    • Yongyi Yang, Jacob Steinhardt, Wei Hu. ICML 2023
    • Key Word: Neural Collapse.
    • Digest The paper challenges the notion of "Neural Collapse" in well-trained neural networks, arguing that while the last-layer representations may appear to collapse, there is still fine-grained structure present in the representations that captures the intrinsic structure of the input distribution.
  • Neural (Tangent Kernel) Collapse. [paper]

    • Mariia Seleznova, Dana Weitzner, Raja Giryes, Gitta Kutyniok, Hung-Hsu Chou.
    • Key Word: Neural Collapse; Neural Tangent Kernel.
    • Digest This paper investigates how the Neural Tangent Kernel (NTK), which tracks how deep neural networks (DNNs) change during training, and the Neural Collapse (NC) phenomenon, which refers to the symmetry and structure in the last-layer features of trained classification DNNs, are related. They assume that the empirical NTK has a block structure that matches the class labels, meaning that samples of the same class are more correlated than samples of different classes. They show how this assumption leads to the dynamics of DNNs trained with mean squared (MSE) loss and the emergence of NC in DNNs with block-structured NTK. They support their theory with large-scale experiments on three DNN architectures and three datasets.
  • Neural Collapse Inspired Feature-Classifier Alignment for Few-Shot Class Incremental Learning. [paper] [code]

    • Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, Dacheng Tao. ICLR 2023
    • Key Word: Few-Shot Class Incremental Learning; Neural Collapse.
    • Digest We deal with this misalignment dilemma in FSCIL inspired by the recently discovered phenomenon named neural collapse, which reveals that the last-layer features of the same class will collapse into a vertex, and the vertices of all classes are aligned with the classifier prototypes, which are formed as a simplex equiangular tight frame (ETF). It corresponds to an optimal geometric structure for classification due to the maximized Fisher Discriminant Ratio.
  • Neural Collapse in Deep Linear Network: From Balanced to Imbalanced Data. [paper]

    • Hien Dang, Tan Nguyen, Tho Tran, Hung Tran, Nhat Ho.
    • Key Word: Neural Collapse; Imbalanced Learning.
    • Digest We take a step further and prove the Neural Collapse occurrence for deep linear network for the popular mean squared error (MSE) and cross entropy (CE) loss. Furthermore, we extend our research to imbalanced data for MSE loss and present the first geometric analysis for Neural Collapse under this setting.

Neural Collapse: 2022

  • Principled and Efficient Transfer Learning of Deep Models via Neural Collapse. [paper]

    • Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, Qing Qu.
    • Key Word: Neural Collapse; Transfer Learning.
    • Digest This work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, so that it leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task.
  • Perturbation Analysis of Neural Collapse. [paper]

    • Tom Tirer, Haoxiang Huang, Jonathan Niles-Weed.
    • Key Word: Neural Collapse.
    • Digest We propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models.
  • Imbalance Trouble: Revisiting Neural-Collapse Geometry. [paper]

    • Christos Thrampoulidis, Ganesh R. Kini, Vala Vakilian, Tina Behnia.
    • Key Word: Neural Collapse; Class Imbalance.
    • Digest Neural Collapse refers to the remarkable structural properties characterizing the geometry of class embeddings and classifier weights, found by deep nets when trained beyond zero training error. However, this characterization only holds for balanced data. Here we thus ask whether it can be made invariant to class imbalances. Towards this end, we adopt the unconstrained-features model (UFM), a recent theoretical model for studying neural collapse, and introduce Simplex-Encoded-Labels Interpolation (SELI) as an invariant characterization of the neural collapse phenomenon.
  • Neural Collapse: A Review on Modelling Principles and Generalization. [paper]

    • Vignesh Kothapalli, Ebrahim Rasromani, Vasudev Awatramani.
    • Key Word: Neural Collapse.
    • Digest We analyse the principles which aid in modelling such a phenomenon from the ground up and show how they can build a common understanding of the recently proposed models that try to explain NC. We hope that our analysis presents a multifaceted perspective on modelling NC and aids in forming connections with the generalization capabilities of neural networks. Finally, we conclude by discussing the avenues for further research and propose potential research problems.
  • Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [paper]

    • Yibo Yang, Liang Xie, Shixiang Chen, Xiangtai Li, Zhouchen Lin, Dacheng Tao.
    • Key Word: Neural Collapse.
    • Digest We study the potential of training a network with the last-layer linear classifier randomly initialized as a simplex ETF and fixed during training. This practice enjoys theoretical merits under the layer-peeled analytical framework. We further develop a simple loss function specifically for the ETF classifier. Its advantage gets verified by both theoretical and experimental results.
  • Limitations of Neural Collapse for Understanding Generalization in Deep Learning. [paper]

    • Like Hui, Mikhail Belkin, Preetum Nakkiran.
    • Key Word: Neural Collapse.
    • Digest We point out that Neural Collapse is primarily an optimization phenomenon, not a generalization one, by investigating the train collapse and test collapse on various dataset and architecture combinations. We propose more precise definitions — "strong" and "weak" Neural Collapse for both the train set and the test set — and discuss their theoretical feasibility.

Neural Collapse: 2021

  • On the Role of Neural Collapse in Transfer Learning. [paper]

    • Tomer Galanti, András György, Marcus Hutter. ICLR 2022
    • Key Word: Neural Collapse; Transfer Learning.
    • Digest We provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse.
  • An Unconstrained Layer-Peeled Perspective on Neural Collapse. [paper]

    • Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, Weijie J. Su. ICLR 2022
    • Key Word: Neural Collapse; Unconstrained Model; Implicit Regularization.
    • Digest We introduce a surrogate model called the unconstrained layer-peeled model (ULPM). We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer. Moreover, we show that the ULPM with the cross-entropy loss has a benign global landscape for its loss function, which allows us to prove that all the critical points are strict saddle points except the global minimizers that exhibit the neural collapse phenomenon.
  • Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. [paper]

    • X.Y. Han, Vardan Papyan, David L. Donoho. ICLR 2022
    • Key Word: Neural Collapse; Gradient Flow.
    • Digest The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
  • A Geometric Analysis of Neural Collapse with Unconstrained Features. [paper] [code]

    • Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, Qing Qu. NeurIPS 2021
    • Key Word: Neural Collapse; Nonconvex Optimization.
    • Digest We provide the first global optimization landscape analysis of Neural Collapse -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. We study the problem based on a simplified unconstrained feature model, which isolates the topmost layers from the classifier of the neural network.
  • Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training. [paper] [code]

    • Cong Fang, Hangfeng He, Qi Long, Weijie J. Su. PNAS
    • Key Word: Neural Collapse; Imbalanced Training.
    • Digest In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. When moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes.

Neural Collapse: 2020

  • Prevalence of Neural Collapse during the terminal phase of deep learning training. [paper] [code]
    • Vardan Papyan, X.Y. Han, David L. Donoho. PNAS
    • Key Word: Neural Collapse.
    • Digest This paper studied the terminal phase of training (TPT) of today’s canonical deepnet training protocol. It documented that during TPT a process called Neural Collapse takes place, involving four fundamental and interconnected phenomena: (NC1)-(NC4).
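
Several digests in this section refer to a simplex equiangular tight frame (ETF) whose vertices are equally and maximally separated. As a small sketch, one common construction for K classes takes the rows of sqrt(K / (K - 1)) * (I - (1/K) * 1 1^T); every row has unit norm and every pair of rows has cosine -1 / (K - 1). The version below omits the optional rotation (it uses the identity), and the function names are illustrative assumptions.

```python
import math


def simplex_etf(num_classes):
    """Return K unit-norm vertices of a simplex ETF as rows of a K x K matrix."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


K = 4
vertices = simplex_etf(K)

# Every distinct pair of vertices should have the same cosine, -1 / (K - 1).
pairwise = [
    cosine(vertices[i], vertices[j])
    for i in range(K)
    for j in range(i + 1, K)
]
```

Comparing the pairwise cosines of centered class means against -1 / (K - 1) is one simple way to check how close a trained network is to this geometry.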

Deep Double Descent


Deep Double Descent: 2023

  • A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning. [paper]

    • Alicia Curth, Alan Jeffares, Mihaela van der Schaar.
    • Key Word: Deep Double Descent.
    • Digest This paper challenges the conventional understanding of the relationship between model complexity and prediction error. It explores the phenomenon of "double descent," which suggests that test error can decrease as the parameter count exceeds the sample size. While this phenomenon has been observed in deep learning and other models like linear regression, trees, and boosting, the paper argues that the interpretation is influenced by multiple complexity axes. It demonstrates that the second descent occurs when the transition between these underlying axes happens and is not inherently tied to the interpolation threshold. The paper proposes a generalized measure for the effective number of parameters, which resolves tensions between double descent and traditional statistical intuition.
  • Dropout Drops Double Descent. [paper]

    • Tian-Le Yang, Joe Suzuki.
    • Key Word: Dropout; Deep Double Descent.
    • Digest The paper finds that adding a dropout layer before the fully-connected linear layer can drop the double descent phenomenon. Double descent is when the prediction error rises and drops as sample or model size increases. Optimal dropout can alleviate this in linear and nonlinear regression models, both theoretically and empirically. Optimal dropout can achieve a monotonic test error curve in nonlinear neural networks. Previous deep learning models do not encounter double-descent because they already apply regularization approaches like dropout.
  • Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle. [paper]

    • Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W. Rocks, Ila Rani Fiete, Oluwasanmi Koyejo.
    • Key Word: Deep Double Descent; Tutorial.
    • Digest We briefly describe double descent, then provide an explanation of why double descent occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent with ordinary linear regression and identify three interpretable factors that, when simultaneously all present, together create double descent.
  • Unifying Grokking and Double Descent. [paper] [code]

    • Xander Davies, Lauro Langosco, David Krueger.
    • Key Word: Deep Double Descent; Grokking.
    • Digest We hypothesize that grokking and double descent can be understood as instances of the same learning dynamics within a framework of pattern learning speeds. We propose that this framework also applies when varying model capacity instead of optimization steps, and provide the first demonstration of model-wise grokking.

Deep Double Descent: 2022

  • Sparse Double Descent: Where Network Pruning Aggravates Overfitting. [paper] [code]

    • Zheng He, Zeke Xie, Quanzhi Zhu, Zengchang Qin. ICML 2022
    • Key Word: Deep Double Descent; Lottery Ticket Hypothesis.
    • Digest While recent studies focused on the deep double descent with respect to model overparameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we have three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation that the curve of ℓ2 learning distance of sparse models (from initialized parameters to final parameters) may correlate with the sparse double descent curve well and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.
  • Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective. [paper] [code]

    • Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping-yeh Chiang, Yehuda Dar, Richard Baraniuk, Micah Goldblum, Tom Goldstein. CVPR 2022
    • Key Word: Deep Double Descent; Manifold.
    • Digest We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associated inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield results with strong similarities, especially in the case of wide architectures. We also use decision boundary methods to visualize double descent phenomena.
  • Phenomenology of Double Descent in Finite-Width Neural Networks. [paper] [code]

    • Sidak Pal Singh, Aurelien Lucchi, Thomas Hofmann, Bernhard Schölkopf. ICLR 2022
    • Key Word: Deep Double Descent.
    • Digest 'Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized. The current theoretical understanding behind the occurrence of this phenomenon is primarily based on linear and kernel regression models, with informal parallels to neural networks via the Neural Tangent Kernel. Such analyses therefore do not adequately capture the mechanisms behind double descent in finite-width neural networks, and they disregard crucial components such as the choice of the loss function. We address these shortcomings by leveraging influence functions in order to derive suitable expressions of the population loss and its lower bound, while imposing minimal assumptions on the form of the parametric model.

Deep Double Descent: 2021

  • Multi-scale Feature Learning Dynamics: Insights for Double Descent. [paper] [code]

    • Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie.
    • Key Word: Deep Double Descent.
    • Digest We investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error.
  • Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks. [paper]

    • Ryumei Nakada, Masaaki Imaizumi.
    • Key Word: Deep Double Descent.
    • Digest We consider a likelihood maximization problem without the model constraints and analyze the upper bound of an asymptotic risk of an estimator with penalization. Technically, we combine a property of the Fisher information matrix with an extended Marchenko-Pastur law and associate the combination with empirical process techniques. The derived bound is general, as it describes both the double descent and the regularized risk curves, depending on the penalization.
  • Distilling Double Descent. [paper]

    • Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou.
    • Key Word: Deep Double Descent; Distillation.
    • Digest Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g. probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches.

Deep Double Descent: 2020

  • Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition. [paper]

    • Ben Adlam, Jeffrey Pennington. NeurIPS 2020
    • Key Word: Deep Double Descent; Bias-Variance.
    • Digest Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels.
  • Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win. [paper] [code]

    • Utku Evci, Yani A. Ioannou, Cem Keskin, Yann Dauphin. AAAI 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). Through our analysis of gradient flow during training we attempt to answer: (1) why training unstructured sparse networks from random initialization performs poorly and; (2) what makes LTs and DST the exceptions?
  • Multiple Descent: Design Your Own Generalization Curve. [paper]

    • Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi. NeurIPS 2021
    • Key Word: Deep Double Descent.
    • Digest This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.
  • Early Stopping in Deep Networks: Double Descent and How to Eliminate it. [paper] [code]

    • Reinhard Heckel, Fatih Furkan Yilmaz. ICLR 2021
    • Key Word: Deep Double Descent; Early Stopping.
    • Digest We show that such epoch-wise double descent arises for a different reason: It is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and eliminating this by proper scaling of stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a two-layer neural network, where the first and second layer each govern a bias-variance tradeoff. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
  • Triple descent and the two kinds of overfitting: Where & why do they appear? [paper] [code]

    • Stéphane d'Ascoli, Levent Sagun, Giulio Biroli.
    • Key Word: Deep Double Descent.
    • Digest In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent.
  • A Brief Prehistory of Double Descent. [paper]

    • Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, David M.J. Tax.
    • Key Word: Deep Double Descent.
    • Digest This letter draws attention to some original, earlier findings, of interest to double descent.
  • Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime. [paper] [code]

    • Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli, Florent Krzakala. ICML 2020
    • Key Word: Deep Double Descent; Bias-Variance.
    • Digest Deep neural networks can achieve remarkable generalization performances while interpolating the training data perfectly. Rather than the U-curve emblematic of the bias-variance trade-off, their test error often follows a "double descent" - a mark of the beneficial role of overparametrization. In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. We obtain a precise asymptotic expression for the bias-variance decomposition of the test error, and show that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
  • Rethinking Bias-Variance Trade-off for Generalization of Neural Networks. [paper] [code]

    • Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, Yi Ma. ICML 2020
    • Key Word: Deep Double Descent; Bias-Variance.
    • Digest The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network.
  • The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization. [paper]

    • Yifei Min, Lin Chen, Amin Karbasi. UAI 2021
    • Key Word: Deep Double Descent.
    • Digest We challenge this conventional belief and show that more training data can hurt the generalization of adversarially robust models in the classification problems. We first investigate the Gaussian mixture classification with a linear loss and identify three regimes based on the strength of the adversary. In the weak adversary regime, more data improves the generalization of adversarially robust models. In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve, which implies the existence of an intermediate stage where more training data hurts the generalization. In the strong adversary regime, more data almost immediately causes the generalization error to increase.

Deep Double Descent: 2019

  • Deep Double Descent: Where Bigger Models and More Data Hurt. [paper]
    • Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever. ICLR 2020
    • Key Word: Deep Double Descent.
    • Digest We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
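
A model-size sweep of the kind described above can be sketched on a toy problem. This is not the deep-network setup of the paper: it swaps in minimum-norm least squares on random ReLU features, and the function name, data model, and default values are all illustrative assumptions. Whether and where a peak in test error appears depends on the seed, the noise level, and the widths chosen.

```python
import numpy as np

def random_feature_test_errors(widths, n_train=40, n_test=500, d=10, noise=0.5, seed=0):
    """Return the test MSE of min-norm least squares on random ReLU features, one value per width."""
    rng = np.random.default_rng(seed)
    # Hypothetical data model: noisy linear teacher (an assumption for illustration).
    w_true = rng.normal(size=d) / np.sqrt(d)
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
    y_te = X_te @ w_true
    errors = []
    for p in widths:
        # Fixed random first layer; only the linear readout is fitted.
        W = rng.normal(size=(d, p)) / np.sqrt(d)
        F_tr = np.maximum(X_tr @ W, 0.0)
        F_te = np.maximum(X_te @ W, 0.0)
        # The pseudo-inverse gives the least-squares fit for p < n_train
        # and the minimum-norm interpolant for p > n_train.
        beta = np.linalg.pinv(F_tr) @ y_tr
        errors.append(float(np.mean((F_te @ beta - y_te) ** 2)))
    return errors

if __name__ == "__main__":
    for p, err in zip([5, 20, 40, 80, 400], random_feature_test_errors([5, 20, 40, 80, 400])):
        print(p, err)
```

Sweeping widths on both sides of `n_train` is the point of the sketch, since the interpolation threshold sits at `p = n_train` in this toy model.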

Deep Double Descent: 2018

  • Reconciling modern machine learning practice and the bias-variance trade-off. [paper]

    • Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. PNAS
    • Key Word: Bias-Variance; Over-Parameterization.
    • Digest In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance.
  • A Modern Take on the Bias-Variance Tradeoff in Neural Networks. [paper]

    • Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, Ioannis Mitliagkas.
    • Key Word: Bias-Variance; Over-Parameterization.
    • Digest The bias-variance tradeoff tells us that as model complexity increases, bias falls and variance increases, leading to a U-shaped test error curve. However, recent empirical results with over-parameterized neural networks are marked by a striking absence of the classic U-shaped test error curve: test error keeps decreasing in wider networks. Motivated by the shaky evidence used to support this claim in neural networks, we measure bias and variance in the modern setting. We find that both bias and variance can decrease as the number of parameters grows. To better understand this, we introduce a new decomposition of the variance to disentangle the effects of optimization and data sampling.
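
Several entries above measure bias and variance empirically. A minimal sketch of the standard squared-loss decomposition follows, assuming predictions from several models trained on independently drawn training sets and noiseless targets; `bias_variance` is a hypothetical helper, not code from any listed paper.

```python
import numpy as np

def bias_variance(predictions, targets):
    """Decompose squared-error risk into bias^2 and variance.

    predictions: (n_models, n_points), one row per model trained on its own training set.
    targets:     (n_points,), noiseless targets at the same test points.
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mean_pred = predictions.mean(axis=0)                    # average predictor
    bias_sq = float(np.mean((mean_pred - targets) ** 2))    # squared bias
    variance = float(np.mean(predictions.var(axis=0)))      # spread across models
    risk = float(np.mean((predictions - targets) ** 2))     # average squared error
    return bias_sq, variance, risk

if __name__ == "__main__":
    b, v, r = bias_variance([[1.0, 2.0], [3.0, 2.0]], [0.0, 2.0])
    print(b, v, r)  # risk equals bias^2 + variance for noiseless targets
```

With noisy targets an irreducible noise term would have to be added to the identity; the sketch leaves that out.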

Lottery Ticket Hypothesis

Lottery Ticket Hypothesis: 2024

  • On the Sparsity of the Strong Lottery Ticket Hypothesis.

    • Emanuele Natale, Davide Ferré, Giordano Giambartolomei, Frédéric Giroire, Frederik Mallmann-Trenn.
    • Key Word: Strong Lottery Ticket Hypothesis.
    • Digest Recent research has explored the Strong Lottery Ticket Hypothesis (SLTH), which suggests that a random neural network contains subnetworks that can approximate smaller networks without training. This builds on the weaker Lottery Ticket Hypothesis, which states that large networks contain sparse subnetworks that can be efficiently trained to perform as well as the full network. However, previous SLTH results lacked guarantees on subnetwork size due to reliance on the Random Subset Sum (RSS) Problem. This paper provides the first proof of SLTH with subnetwork sparsity guarantees in settings like dense and equivariant networks. The key contribution is an improved bound on a variant of the RSS Problem, offering new insights into subnetwork size constraints.
  • No Free Prune: Information-Theoretic Barriers to Pruning at Initialization. [paper]

    • Tanishq Kumar, Kevin Luo, Mark Sellke.
    • Key Word: Lottery Ticket Hypothesis; Overparameterization; Mutual Information.
    • Digest This paper investigates the concept of "lottery tickets" in deep learning, questioning the necessity of large models versus identifying and training sparse networks from the start. Despite attempts, finding these efficient subnetworks without training the full model has largely failed. The study proposes a theoretical reason for this, focusing on the effective parameter count, which includes non-zero weights and the data-related information within the sparsity mask. It extends the Law of Robustness to sparse networks, suggesting that data-dependent masks are crucial for robust performance. The findings indicate that masks created during or after training contain more information than those at initialization, affecting the network's effective capacity. This explains the difficulty in finding lottery tickets without full model training, as confirmed by experimental results on neural networks.
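
The Random Subset Sum problem mentioned above asks how well a target value can be approximated by summing a subset of random samples; in strong-lottery-ticket arguments, keeping or pruning a random weight plays the role of including or excluding a sample. The brute-force search below is only an illustration of the problem statement (exponential in the number of samples), with a hypothetical function name, and is not the construction used in any listed proof.

```python
import itertools

def best_subset_sum(samples, target):
    """Return (indices, error) of the subset whose sum is closest to target.

    The empty subset (sum 0) is the starting candidate, so error <= |target|.
    """
    best_subset, best_error = (), abs(target)
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(range(len(samples)), r):
            error = abs(sum(samples[i] for i in subset) - target)
            if error < best_error:
                best_subset, best_error = subset, error
    return best_subset, best_error

if __name__ == "__main__":
    print(best_subset_sum([0.5, -0.2, 0.3], 0.3))
```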

Lottery Ticket Hypothesis: 2023

  • Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models. [paper]

    • Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, Zhangyang Wang.
    • Key Word: Lottery Tickets; Model Soup.
    • Digest The paper introduces Instant Soup Pruning (ISP), a novel approach that leverages the idea of model soups to generate high-quality subnetworks from large pre-trained models, reducing the computational cost compared to traditional iterative magnitude pruning (IMP) methods.
  • A Three-regime Model of Network Pruning. [paper]

    • Yefan Zhou, Yaoqing Yang, Arin Chang, Michael W. Mahoney.
    • Key Word: Pruning; Linear Mode Connectivity.
    • Digest This paper proposes a model based on statistical mechanics to predict how training hyperparameters affect pruning performance of neural networks. The paper finds a sharp transition phenomenon that depends on two parameters in the pre-pruned and pruned models. The paper also identifies three types of global structures in the pruned loss landscape and applies the model to three practical scenarios.
  • Generalization Bounds for Magnitude-Based Pruning via Sparse Matrix Sketching. [paper]

    • Etash Kumar Guha, Prasanjit Dubey, Xiaoming Huo.
    • Key Word: Magnitude-Based Pruning; Norm-based Generalization Bound; Sparse Matrix Sketching.
    • Digest This paper proposes a new bound on the generalization error of Magnitude-Based pruning, a technique that removes weights with small magnitudes from neural networks. The paper improves on previous bounds by using Sparse Matrix Sketching, a method that compresses pruned matrices into smaller dimensions. The paper also extends the results to Iterative Pruning, a process that prunes and retrains the network multiple times. The paper shows that the new method achieves better generalization than existing methods on some datasets.
  • Pruning at Initialization -- A Sketching Perspective. [paper]

    • Noga Bar, Raja Giryes.
    • Key Word: Pruning at Initialization; Sketching Algorithm; Neural Tangent Kernel.
    • Digest The paper studies how to prune linear neural networks before training. They show that this problem is related to the sketching problem for fast matrix multiplication. They use this connection to analyze the error and data dependence of pruning at initialization. They also propose a general improvement to existing pruning algorithms based on sketching techniques.
  • NTK-SAP: Improving neural network pruning by aligning training dynamics. [paper] [code]

    • Yite Wang, Dawei Li, Ruoyu Sun. ICLR 2023
    • Key Word: Pruning at Initialization; Neural Tangent Kernel.
    • Digest We propose to prune the connections that have the least influence on the spectrum of the NTK. This method can help maintain the NTK spectrum, which may help align the training dynamics to that of its dense counterpart. However, one possible issue is that the fixed-weight-NTK corresponding to a given initial point can be very different from the NTK corresponding to later iterates during the training phase.
  • Sparsity May Cry: Let Us Fail (Current) Sparse Neural Networks Together! [paper] [code]

    • Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Jaiswal, Zhangyang Wang. ICLR 2023
    • Key Word: Sparse Neural Network; Benchmark.
    • Digest In the absence of a carefully crafted evaluation benchmark, most if not all sparse algorithms are evaluated against fairly simple and naive tasks (e.g., CIFAR, ImageNet, GLUE), which can potentially camouflage many advantages as well as unexpected predicaments of SNNs. In pursuit of a more general evaluation and unveiling the true potential of sparse algorithms, we introduce "Sparsity May Cry" Benchmark (SMC-Bench), a collection of carefully-curated 4 diverse tasks with 10 datasets, that accounts for capturing a wide range of domain-specific and sophisticated knowledge.
  • Pruning Deep Neural Networks from a Sparsity Perspective. [paper] [code]

    • Enmao Diao, Ganghua Wang, Jiawei Zhan, Yuhong Yang, Jie Ding, Vahid Tarokh. ICLR 2023
    • Key Word: Theory of Model Compression; Sparsity Measure.
    • Digest Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm.
  • Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Pruning. [paper] [code]

    • Huan Wang, Can Qin, Yue Bai, Yun Fu.
    • Key Word: Pruning; Empirical Study.
    • Digest Two mysteries in pruning represent such a confusing status: the performance-boosting effect of a larger finetuning learning rate, and the no-value argument of inheriting pretrained weights in filter pruning. In this work, we attempt to explain the confusing state of network pruning by demystifying the two mysteries.
  • Theoretical Characterization of How Neural Network Pruning Affects its Generalization. [paper]

    • Hongru Yang, Yingbin Liang, Xiaojie Guo, Lingfei Wu, Zhangyang Wang.
    • Key Word: Lottery Ticket Hypothesis; Generalization Bound.
    • Digest This work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned according to different rates at the initialization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
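
Magnitude-based and iterative pruning recur throughout the entries above: remove the smallest-magnitude weights, retrain, and repeat. A minimal NumPy sketch of that loop follows. The `retrain` callback is a hypothetical placeholder (the default simply re-applies the mask), and both function names are illustrative rather than taken from any listed codebase.

```python
import numpy as np

def magnitude_mask(weights, mask, prune_fraction):
    """Prune the given fraction of still-active weights with the smallest |w|."""
    active = np.flatnonzero(mask)                      # flat indices of surviving weights
    n_prune = int(prune_fraction * active.size)
    new_mask = mask.copy().ravel()
    if n_prune > 0:
        order = np.argsort(np.abs(weights.ravel()[active]), kind="stable")
        new_mask[active[order[:n_prune]]] = False      # drop the smallest magnitudes
    return new_mask.reshape(mask.shape)

def iterative_magnitude_pruning(weights, rounds, prune_fraction,
                                retrain=lambda w, m: w * m):
    """Alternate (placeholder) retraining and magnitude pruning for a number of rounds."""
    mask = np.ones(weights.shape, dtype=bool)
    for _ in range(rounds):
        weights = retrain(weights, mask)               # assumption: caller supplies real training
        mask = magnitude_mask(weights, mask, prune_fraction)
    return weights * mask, mask

if __name__ == "__main__":
    w = np.array([[0.1, -2.0], [3.0, -0.4], [0.05, 1.5]])
    pruned, mask = iterative_magnitude_pruning(w, rounds=2, prune_fraction=0.5)
    print(pruned)
    print(mask)
```

Lottery-ticket style variants would additionally rewind the surviving weights to their initial or early-training values inside `retrain`; the sketch leaves that choice to the caller.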

Lottery Ticket Hypothesis: 2022

  • Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions. [paper] [code]

    • Shaochen Zhong, Guanqun Zhang, Ningjia Huang, Shuai Xu. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We revisit the idea of kernel pruning, a heavily overlooked approach under the context of structured pruning. This is because kernel pruning will naturally introduce sparsity to filters within the same convolutional layer — thus, making the remaining network no longer dense. We address this problem by proposing a versatile grouped pruning framework where we first cluster filters from each convolutional layer into equal-sized groups, prune the grouped kernels we deem unimportant from each filter group, then permute the remaining filters to form a densely grouped convolutional architecture (which also enables the parallel computing capability) for fine-tuning.
  • Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks. [paper]

    • Arthur da Cunha, Emanuele Natale, Laurent Viennot. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent theoretical works proved an even stronger version: every sufficiently overparameterized (dense) neural network contains a subnetwork that, even without training, achieves accuracy comparable to that of the trained large network. These works left as an open problem to extend the result to convolutional neural networks (CNNs). In this work we provide such generalization by showing that, with high probability, it is possible to approximate any CNN by pruning a random CNN whose size is larger by a logarithmic factor.
  • Audio Lottery: Speech Recognition Made Ultra-Lightweight, Noise-Robust, and Transferable. [paper] [code]

    • Shaojin Ding, Tianlong Chen, Zhangyang Wang. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis; Speech Recognition.
    • Digest We investigate the tantalizing possibility of using lottery ticket hypothesis to discover lightweight speech recognition models, that are (1) robust to various noise existing in speech; (2) transferable to fit the open-world personalization; and (3) compatible with structured sparsity.
  • Strong Lottery Ticket Hypothesis with ε-perturbation. [paper]

    • Zheyang Xiong, Fangshuo Liao, Anastasios Kyrillidis.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization.
  • Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers. [paper]

    • Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar.
    • Key Word: Sparse Activation; Large Models; Transformers.
    • Digest This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP.
  • Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask? [paper]

    • Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite.
    • Key Word: Lottery Ticket Hypothesis; Mode Connectivity.
    • Digest First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune.
  • How Erdös and Rényi Win the Lottery. [paper]

    • Advait Gadhikar, Sohum Mukherjee, Rebekka Burkholz.
    • Key Word: Lottery Ticket Hypothesis; Erdös-Rényi Random Graphs.
    • Digest Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erdös-Rényi (ER) random graphs can often compete with dense architectures and state-of-the-art lottery ticket pruning algorithms struggle to outperform them, even though the random baselines do not rely on computationally expensive pruning-training iterations but can be drawn initially without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity 1/log(1/sparsity).
  • SparCL: Sparse Continual Learning on the Edge. [paper]

    • Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, Jennifer Dy. NeurIPS 2022
    • Key Word: Continual Learning; Sparse Training.
    • Digest We propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates.
  • One-shot Network Pruning at Initialization with Discriminative Image Patches. [paper]

    • Yinan Yang, Ying Ji, Yu Wang, Heng Qi, Jien Kato.
    • Key Word: One-Shot Network Pruning.
    • Digest We propose two novel methods, Discriminative One-shot Network Pruning (DOP) and Super Stitching, to prune the network by high-level visual discriminative image patches. Our contributions are as follows. (1) Extensive experiments reveal that OPaI is data-dependent. (2) Super Stitching performs significantly better than the original OPaI method on benchmark ImageNet, especially in a highly compressed model.
  • SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning. [paper] [code]

    • Haoran You, Baopu Li, Zhanyi Sun, Xu Ouyang, Yingyan Lin. ECCV 2022
    • Key Word: Lottery Ticket Hypothesis; Neural Architecture Search.
    • Digest We discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term as SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training.
  • Lottery Ticket Hypothesis for Spiking Neural Networks. [paper]

    • Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, Priyadarshini Panda. ECCV 2022
    • Key Word: Lottery Ticket Hypothesis; Spiking Neural Networks.
    • Digest Spiking Neural Networks (SNNs) have recently emerged as a new generation of low-power deep neural networks where binary spikes convey information across multiple timesteps. Pruning for SNNs is highly important as they become deployed on resource-constrained mobile/edge devices. Previous SNN pruning works focus on shallow SNNs (2~6 layers); however, state-of-the-art SNN works propose deeper SNNs (>16 layers), which are difficult to handle with current pruning methods. To scale up a pruning technique toward deep SNNs, we investigate Lottery Ticket Hypothesis (LTH) which states that dense networks contain smaller subnetworks (i.e., winning tickets) that achieve comparable performance to the dense networks. Our studies on LTH reveal that the winning tickets consistently exist in deep SNNs across various datasets and architectures, providing up to 97% sparsity without huge performance degradation.
  • Winning the Lottery Ahead of Time: Efficient Early Network Pruning. [paper]

    • John Rachwan, Daniel Zügner, Bertrand Charpentier, Simon Geisler, Morgane Ayle, Stephan Günnemann. ICML 2022
    • Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
    • Digest Although state-of-the-art pruning methods extract highly sparse models, they neglect two main challenges: (1) the process of finding these sparse models is often very expensive; (2) unstructured pruning does not provide benefits in terms of GPU memory, training time, or carbon emissions. We propose Early Compression via Gradient Flow Preservation (EarlyCroP), which efficiently extracts state-of-the-art sparse models before or early in training addressing challenge (1), and can be applied in a structured manner addressing challenge (2). This enables us to train sparse networks on commodity GPUs whose dense versions would be too large, thereby saving costs and reducing hardware requirements.
  • "Understanding Robustness Lottery": A Comparative Visual Analysis of Neural Network Pruning Approaches. [paper]

    • Zhimin Li, Shusen Liu, Xin Yu, Bhavya Kailkhura, Jie Cao, James Daniel Diffenderfer, Peer-Timo Bremer, Valerio Pascucci.
    • Key Word: Lottery Ticket Hypothesis; Out-of-Distribution Generalization; Visualization.
    • Digest This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding.
  • Data-Efficient Double-Win Lottery Tickets from Robust Pre-training. [paper] [code]

    • Tianlong Chen, Zhenyu Zhang, Sijia Liu, Yang Zhang, Shiyu Chang, Zhangyang Wang. ICML 2022
    • Key Word: Lottery Ticket Hypothesis; Adversarial Training; Robust Pre-training.
    • Digest We formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts.
  • HideNseek: Federated Lottery Ticket via Server-side Pruning and Sign Supermask. [paper]

    • Anish K. Vallapuram, Pengyuan Zhou, Young D. Kwon, Lik Hang Lee, Hengwei Xu, Pan Hui.
    • Key Word: Lottery Ticket Hypothesis; Federated Learning.
    • Digest We propose HideNseek which employs one-shot data-agnostic pruning at initialization to get a subnetwork based on weights' synaptic saliency. Each client then optimizes a sign supermask multiplied by the unpruned weights to allow faster convergence with the same compression rates as state-of-the-art.
  • Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. [paper] [code]

    • Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite. NeurIPS 2022
    • Key Word: Lottery Ticket Hypothesis; Pre-training.
    • Digest We seek to understand how this early phase of pre-training leads to a good initialization for IMP both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP.
  • Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. [paper]

    • Keitaro Sakamoto, Issei Sato. NeurIPS 2022
    • Key Word: Lottery Ticket Hypothesis; PAC-Bayes.
    • Digest We confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer the PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets.
  • Dual Lottery Ticket Hypothesis. [paper] [code]

    • Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, Yun Fu. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis.
    • Digest This paper articulates a Dual Lottery Ticket Hypothesis (DLTH) as a dual format of the original Lottery Ticket Hypothesis (LTH). Correspondingly, a simple regularization-based sparse network training strategy, Random Sparse Network Transformation (RST), is proposed to validate DLTH and enhance sparse network training.
  • Rare Gems: Finding Lottery Tickets at Initialization. [paper]

    • Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos. NeurIPS 2022
    • Key Word: Lottery Ticket Hypothesis; Sanity Checks; Pruning at Initialization.
    • Digest Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive or better than Iterative Magnitude Pruning (IMP), and does so up to 19× faster.
  • Reconstruction Task Finds Universal Winning Tickets. [paper]

    • Ruichen Li, Binghui Li, Qi Qian, Liwei Wang.
    • Key Word: Lottery Ticket Hypothesis; Self-Supervision.
    • Digest We show that the image-level pretrain task is not capable of pruning models for diverse downstream tasks. To mitigate this problem, we introduce image reconstruction, a pixel-level task, into the traditional pruning framework. Concretely, an autoencoder is trained based on the original model, and then the pruning process is optimized with both autoencoder and classification losses.
  • Finding Dynamics Preserving Adversarial Winning Tickets. [paper] [code]

    • Xupeng Shi, Pengfei Zheng, A. Adam Ding, Yuan Gao, Weizhong Zhang. AISTATS 2022
    • Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
    • Digest Based on recent works on the Neural Tangent Kernel (NTK), we systematically study the dynamics of adversarial training and prove the existence of a trainable sparse sub-network at initialization which can be trained to be adversarially robust from scratch. This theoretically verifies the lottery ticket hypothesis in the adversarial context, and we refer to such a sub-network structure as an Adversarial Winning Ticket (AWT). We also show empirical evidence that AWT preserves the dynamics of adversarial training and achieves equal performance as dense adversarial training.
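
Some entries above describe pruned networks as connected by a linear path with zero error barrier. One common way to quantify this is sketched below, under stated assumptions: the barrier is taken as the largest gap between the loss along the straight line joining two weight vectors and the linear interpolation of the endpoint losses. Conventions differ between papers, and `linear_error_barrier` and `loss_fn` are hypothetical names.

```python
import numpy as np

def linear_error_barrier(loss_fn, w_a, w_b, num_points=11):
    """Max over the path of loss(interpolated weights) minus interpolated endpoint losses."""
    alphas = np.linspace(0.0, 1.0, num_points)
    loss_a, loss_b = loss_fn(w_a), loss_fn(w_b)
    gaps = [
        loss_fn((1 - a) * w_a + a * w_b) - ((1 - a) * loss_a + a * loss_b)
        for a in alphas
    ]
    return float(max(gaps))

if __name__ == "__main__":
    # Toy double-well loss with minima at w = -1 and w = +1 (illustrative only).
    double_well = lambda w: float(np.sum((w ** 2 - 1.0) ** 2))
    print(linear_error_barrier(double_well, np.array([-1.0]), np.array([1.0])))
```

For real networks, `loss_fn` would evaluate test or training error of the model with the given flattened weights; a barrier near zero indicates the two solutions are linearly mode connected under this convention.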

Lottery Ticket Hypothesis: 2021

  • Plant 'n' Seek: Can You Find the Winning Ticket? [paper] [code]

    • Jonas Fischer, Rebekka Burkholz. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks.
  • On the Existence of Universal Lottery Tickets. [paper] [code]

    • Rebekka Burkholz, Nilanjana Laha, Rajarshi Mukherjee, Alkis Gotovos. ICLR 2022
    • Key Word: Lottery Ticket Hypothesis.
    • Digest The lottery ticket hypothesis conjectures the existence of sparse subnetworks of large randomly initialized deep neural networks that can be successfully trained in isolation. Recent work has experimentally observed that some of these tickets can be practically reused across a variety of tasks, hinting at some form of universality. We formalize this concept and theoretically prove that not only do such universal tickets exist but they also do not require further training.
  • Universality of Winning Tickets: A Renormalization Group Perspective. [paper]

    • William T. Redman, Tianlong Chen, Zhangyang Wang, Akshunna S. Dogra. ICML 2022
    • Key Word: Lottery Ticket Hypothesis; Renormalization Group Theory.
    • Digest Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. This has generated broad interest, but methods to study this universality are lacking. We make use of renormalization group theory, a powerful tool from theoretical physics, to address this need. We find that iterative magnitude pruning, the principal algorithm used for discovering winning tickets, is a renormalization group scheme, and can be viewed as inducing a flow in parameter space.
  • How many degrees of freedom do we need to train deep networks: a loss landscape perspective. [paper] [code]

    • Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli. ICLR 2022
    • Key Word: Loss Landscape; Lottery Ticket Hypothesis.
    • Digest A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sublevel set when training within a random subspace of a given training dimensionality.
  • A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness. [paper]

    • James Diffenderfer, Brian R. Bartoldson, Shreya Chaganti, Jize Zhang, Bhavya Kailkhura. NeurIPS 2021
    • Key Word: Lottery Ticket Hypothesis; Out-of-Distribution Generalization.
    • Digest We perform a large-scale analysis of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (e.g., fine tuning and gradual magnitude pruning), we find that "lottery ticket-style" approaches can surprisingly be used to produce CARDs, including binary-weight CARDs. Specifically, we are able to create extremely compact CARDs that, compared to their larger counterparts, have similar test accuracy and matching (or better) robustness -- simply by pruning and (optionally) quantizing.
  • Efficient Lottery Ticket Finding: Less Data is More. [paper] [code]

    • Zhenyu Zhang, Xuxi Chen, Tianlong Chen, Zhangyang Wang. ICML 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest This paper explores a new perspective on finding lottery tickets more efficiently, by doing so only with a specially selected subset of data, called Pruning-Aware Critical set (PrAC set), rather than using the full training set. The concept of PrAC set was inspired by the recent observation, that deep networks have samples that are either hard to memorize during training, or easy to forget during pruning.
  • A Probabilistic Approach to Neural Network Pruning. [paper]

    • Xin Qian, Diego Klabjan. ICML 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We theoretically study the performance of two pruning techniques (random and magnitude-based) on FCNs and CNNs. Given a target network whose weights are independently sampled from appropriate distributions, we provide a universal approach to bound the gap between a pruned and the target network in a probabilistic sense. The results establish that there exist pruned networks with expressive power within any specified bound from the target network.
  • On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. [paper]

    • Marc Aurel Vischer, Robert Tjarko Lange, Henning Sprekeler. ICLR 2022
    • Key Word: Reinforcement Learning; Lottery Ticket Hypothesis.
    • Digest The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how is the performance of winning lottery tickets affected by the distributional shift inherent to reinforcement learning problems? In this work, we address this question by comparing sparse agents who have to address the non-stationarity of the exploration-exploitation problem with supervised agents trained to imitate an expert. We show that feed-forward networks trained with behavioural cloning compared to reinforcement learning can be pruned to higher levels of sparsity without performance degradation.
  • Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network. [paper] [code]

    • James Diffenderfer, Bhavya Kailkhura. ICLR 2021
    • Key Word: Lottery Ticket Hypothesis; Binary Neural Networks.
    • Digest This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained.
  • Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. [paper] [code]

    • Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, Mykola Pechenizkiy. ICML 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training.
  • Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. [paper]

    • Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste.
    • Key Word: Sparsity; Survey.
    • Digest We survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward.
  • A Unified Paths Perspective for Pruning at Initialization. [paper]

    • Thomas Gebhart, Udit Saxena, Paul Schrater.
    • Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
    • Digest Leveraging recent theoretical approximations provided by the Neural Tangent Kernel, we unify a number of popular approaches for pruning at initialization under a single path-centric framework. We introduce the Path Kernel as the data-independent factor in a decomposition of the Neural Tangent Kernel and show the global structure of the Path Kernel can be computed efficiently. This Path Kernel decomposition separates the architectural effects from the data-dependent effects within the Neural Tangent Kernel, providing a means to predict the convergence dynamics of a network from its architecture alone.

Lottery Ticket Hypothesis: 2020

  • PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data. [paper] [code]

    • Shreyas Malakarjun Patil, Constantine Dovrolis. ICML 2021
    • Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
    • Digest Our work is based on a recently proposed decomposition of the Neural Tangent Kernel (NTK) that has decoupled the dynamics of the training process into a data-dependent component and an architecture-dependent kernel - the latter referred to as Path Kernel. That work has shown how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm. We first show that even though Synflow-L2 is optimal in terms of convergence, for a given network density, it results in sub-networks with "bottleneck" (narrow) layers - leading to poor performance as compared to other data-agnostic methods that use the same number of parameters.
  • A Gradient Flow Framework For Analyzing Network Pruning. [paper] [code]

    • Ekdeep Singh Lubana, Robert P. Dick. ICLR 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters.
  • Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. [paper] [code]

    • Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee. NeurIPS 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We conduct sanity checks on two common beliefs about pruning (that pruning methods exploit information from the training data, and that the architecture of the pruned network is crucial for good performance) using several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets"), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.
  • Pruning Neural Networks at Initialization: Why are We Missing the Mark? [paper]

    • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin. ICLR 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
  • ESPN: Extremely Sparse Pruned Networks. [paper] [code]

    • Minsu Cho, Ameya Joshi, Chinmay Hegde.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Deep neural networks are often highly overparameterized, prohibiting their use in compute-limited systems. However, a line of recent works has shown that the size of deep networks can be considerably reduced by identifying a subset of neuron indicators (or mask) that correspond to significant weights prior to training. We demonstrate that a simple iterative mask discovery method can achieve state-of-the-art compression of very deep networks. Our algorithm represents a hybrid approach between single shot network pruning methods (such as SNIP) and Lottery-Ticket type approaches. We validate our approach on several datasets and outperform several existing pruning approaches in both test accuracy and compression ratio.
  • Logarithmic Pruning is All You Need. [paper]

    • Laurent Orseau, Marcus Hutter, Omar Rivasplata. NeurIPS 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.
  • Exploring Weight Importance and Hessian Bias in Model Pruning. [paper]

    • Mingchen Li, Yahya Sattar, Christos Thrampoulidis, Samet Oymak.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Model pruning is an essential procedure for building compact and computationally-efficient machine learning models. A key feature of a good pruning algorithm is that it accurately quantifies the relative importance of the model weights. While model pruning has a rich history, we still don't have a full grasp of the pruning mechanics even for relatively simple problems involving linear models or shallow neural nets. In this work, we provide a principled exploration of pruning by building on a natural notion of importance.
  • Progressive Skeletonization: Trimming more fat from a network at initialization. [paper] [code]

    • Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H.S. Torr, Gregory Rogez, Puneet K. Dokania. ICLR 2021
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx 95%), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration.
  • Pruning neural networks without any data by iteratively conserving synaptic flow. [paper] [code]

    • Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design.
  • Finding trainable sparse networks through Neural Tangent Transfer. [paper] [code]

    • Tianlin Liu, Friedemann Zenke. ICML 2020
    • Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
    • Digest We introduce Neural Tangent Transfer, a method that finds trainable sparse networks in a label-free manner. Specifically, we find sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks in function space. Finally, we evaluate our label-agnostic approach on several standard classification tasks and show that the resulting sparse networks achieve higher classification performance while converging faster.
  • What is the State of Neural Network Pruning? [paper] [code]

    • Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag. MLSys 2020
    • Key Word: Lottery Ticket Hypothesis; Survey.
    • Digest Neural network pruning---the task of reducing the size of a network by removing parameters---has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods.
  • Comparing Rewinding and Fine-tuning in Neural Network Pruning. [paper] [code]

    • Alex Renda, Jonathan Frankle, Michael Carbin. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We compare fine-tuning to alternative retraining techniques. Weight rewinding (as proposed by Frankle et al. (2019)) rewinds unpruned weights to their values from earlier in training and retrains them from there using the original training schedule. Learning rate rewinding (which we propose) trains the unpruned weights from their final values using the same learning rate schedule as weight rewinding. Both rewinding techniques outperform fine-tuning, forming the basis of a network-agnostic pruning algorithm that matches the accuracy and compression ratios of several more network-specific state-of-the-art techniques.
  • Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. [paper] [code]

    • Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu. ICML 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Recent empirical works show that large deep neural networks are often highly redundant and one can find much smaller subnetworks without a significant drop of accuracy. However, most existing methods of network pruning are empirical and heuristic, leaving it open whether good subnetworks provably exist, how to find them efficiently, and if network pruning can be provably better than direct training using gradient descent. We answer these problems positively by proposing a simple greedy selection approach for finding good subnetworks, which starts from an empty network and greedily adds important neurons from the large network.
  • The Early Phase of Neural Network Training. [paper] [code]

    • Jonathan Frankle, David J. Schwab, Ari S. Morcos. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Using the framework of Frankle et al. (2019) to probe the network state during the earliest iterations of training, we find that deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations.
  • Robust Pruning at Initialization. [paper]

    • Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, Yee Whye Teh.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.
  • Picking Winning Tickets Before Training by Preserving Gradient Flow. [paper] [code]

    • Chaoqi Wang, Guodong Zhang, Roger Grosse. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP).
  • Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning. [paper] [code]

    • Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants demonstrated remarkable performances for pruning modern architectures. Based on the observation that magnitude-based pruning indeed minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single layer optimization to a multi-layer optimization.
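
The observation in the Lookahead entry above, that magnitude-based pruning minimizes the Frobenius distortion of a single layer, can be checked numerically. The following is a minimal NumPy sketch, not code from any of the listed papers; the helper names `magnitude_mask` and `frobenius_distortion`, the matrix shape, and the 50% sparsity level are illustrative assumptions.

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-magnitude entries."""
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    if n_prune > 0:
        # argsort on |w| lists the smallest-magnitude entries first
        prune_idx = np.argsort(np.abs(weights).ravel())[:n_prune]
        mask[prune_idx] = 0
    return mask.reshape(weights.shape)

def frobenius_distortion(weights: np.ndarray, mask: np.ndarray) -> float:
    """Frobenius norm of the difference between the dense and the masked layer."""
    return float(np.linalg.norm(weights - weights * mask))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))          # stand-in for one layer's weight matrix (assumption)

mag_mask = magnitude_mask(W, sparsity=0.5)
# Baseline: a random mask that keeps exactly the same number of weights
rand_mask = rng.permutation(mag_mask.ravel()).reshape(W.shape)

mag_err = frobenius_distortion(W, mag_mask)
rand_err = frobenius_distortion(W, rand_mask)
print(f"magnitude pruning distortion: {mag_err:.3f}")
print(f"random pruning distortion:    {rand_err:.3f}")
```

For a fixed number of kept weights, removing the smallest-magnitude entries removes the least squared mass, so the magnitude mask can never give a larger distortion than the random one.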

Lottery Ticket Hypothesis: 2019

  • Linear Mode Connectivity and the Lottery Ticket Hypothesis. [paper]

    • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin. ICML 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy.
  • What's Hidden in a Randomly Weighted Neural Network? [paper] [code]

    • Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari. CVPR 2020
    • Key Word: Lottery Ticket Hypothesis; Neural Architecture Search; Weight Agnostic Neural Networks.
    • Digest Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them.
  • Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. [paper] [code]

    • Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, Yingyan Lin. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We discover for the first time that the winning tickets can be identified at the very early training stage, which we term as early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.
  • Rigging the Lottery: Making All Tickets Winners. [paper] [code]

    • Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen. ICML 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques.
  • The Difficulty of Training Sparse Neural Networks. [paper]

    • Utku Evci, Fabian Pedregosa, Aidan Gomez, Erich Elsen.
    • Key Word: Pruning.
    • Digest We investigate the difficulties of training sparse neural networks and make new observations about optimization dynamics and the energy landscape within the sparse regime. Recent work has shown that sparse ResNet-50 architectures trained on the ImageNet-2012 dataset converge to solutions that are significantly worse than those found by pruning. We show that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution.
  • A Signal Propagation Perspective for Pruning Neural Networks at Initialization. [paper] [code]

    • Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, Philip H. S. Torr. ICLR 2020
    • Key Word: Lottery Ticket Hypothesis; Mean Field Theory.
    • Digest In this work, by noting connection sensitivity as a form of gradient, we formally characterize initialization conditions to ensure reliable connection sensitivity measurements, which in turn yields effective pruning results. Moreover, we analyze the signal propagation properties of the resulting pruned networks and introduce a simple, data-free method to improve their trainability.
  • One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. [paper]

    • Ari S. Morcos, Haonan Yu, Michela Paganini, Yuandong Tian. NeurIPS 2019
    • Key Word: Lottery Ticket Hypothesis.
    • Digest Perhaps surprisingly, we found that, within the natural images domain, winning ticket initializations generalized across a variety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365, often achieving performance close to that of winning tickets generated on the same dataset.
  • Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. [paper] [code]

    • Hattie Zhou, Janice Lan, Rosanne Liu, Jason Yosinski. NeurIPS 2019
    • Key Word: Lottery Ticket Hypothesis.
    • Digest In this paper, we have studied how three components to LT-style network pruning—mask criterion, treatment of kept weights during retraining (mask-1 action), and treatment of pruned weights during retraining (mask-0 action)—come together to produce sparse and performant subnetworks.
  • The State of Sparsity in Deep Neural Networks. [paper] [code]

    • Trevor Gale, Erich Elsen, Sara Hooker.
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results.

Lottery Ticket Hypothesis: 2018

  • SNIP: Single-shot Network Pruning based on Connection Sensitivity. [paper] [code]

    • Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr. ICLR 2019
    • Key Word: Lottery Ticket Hypothesis.
    • Digest In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. [paper] [code]

    • Jonathan Frankle, Michael Carbin. ICLR 2019
    • Key Word: Lottery Ticket Hypothesis.
    • Digest We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations.
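
The train, prune, and reset loop behind iterative magnitude pruning (IMP), which several entries in this section build on, can be sketched on a toy problem. This is a hedged illustration only: it uses a linear regression model rather than a neural network, and the function names, learning rate, number of rounds, and per-round pruning fraction are all hypothetical choices, not values taken from the papers.

```python
import numpy as np

def train(w_init, mask, X, y, lr=0.1, steps=200):
    """Gradient descent on mean squared error; pruned weights are held at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def iterative_magnitude_pruning(w_init, X, y, rounds=3, prune_frac=0.5):
    """Train, prune the smallest surviving weights, reset survivors to w_init, repeat."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # Every round restarts from the ORIGINAL initialization (the "ticket")
        w = train(w_init, mask, X, y)
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        # Among surviving weights, drop those with the smallest trained magnitude
        drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
        mask[drop] = 0.0
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
w_true = np.zeros(16)
w_true[:2] = [3.0, -2.0]               # only the first two features matter
y = X @ w_true
w_init = rng.normal(scale=0.1, size=16)

mask = iterative_magnitude_pruning(w_init, X, y)   # 16 -> 8 -> 4 -> 2 weights
w_ticket = train(w_init, mask, X, y)               # retrain the sparse subnetwork from w_init
print("kept indices:", np.flatnonzero(mask))
print("ticket MSE:", float(np.mean((X @ w_ticket - y) ** 2)))
```

In this toy setting the surviving mask recovers the two informative features, and the sparse model retrained from the original initialization fits the data as well as the dense one. Real experiments differ in scale, architecture, and pruning schedule.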

Emergence and Phase Transitions

  • The Complexity Dynamics of Grokking. [paper]

    • Branton DeMoss, Silvia Sapora, Jakob Foerster, Nick Hawes, Ingmar Posner.
    • Key Word: Grokking; Generalization; Minimum Description Length Principle.
    • Digest This paper explores generalization in neural networks through the lens of compression, focusing on the phenomenon of grokking—where networks transition from memorization to generalization after overfitting. The authors introduce a novel intrinsic complexity measure based on Kolmogorov complexity, revealing a consistent rise-and-fall pattern in complexity during training that corresponds to memorization and generalization phases. Leveraging insights from rate-distortion theory and the minimum description length principle, they propose a new regularization method that penalizes spectral entropy to promote low-rank representations. This approach improves generalization and achieves superior dataset compression compared to baseline methods.
  • Context-Scaling versus Task-Scaling in In-Context Learning. [paper]

    • Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin.
    • Key Word: In-Context Learning; Task-Scaling; Context-Scaling.
    • Digest This paper analyzes two key aspects of In-Context Learning (ICL) in transformers: (1) context-scaling, where performance improves with more in-context examples, and (2) task-scaling, where performance improves with more pre-training tasks. The authors show that while transformers achieve both, standard MLPs only achieve task-scaling. To understand context-scaling, they propose a simplified transformer without key, query, value weights, which performs comparably to GPT-2 in various tasks. They find that a data-dependent feature map enables context-scaling, and when combined with an MLP, it achieves both context and task-scaling, offering a simpler framework to study ICL.
  • A Hitchhiker's Guide to Scaling Law Estimation. [paper]

    • Leshem Choshen, Yang Zhang, Jacob Andreas.
    • Key Word: Neural Scaling Law.
    • Digest The paper investigates how to best estimate and interpret scaling laws that predict machine learning model performance. By analyzing a large dataset of losses and evaluations from 485 pretrained models, the authors estimate over 1000 scaling laws. They propose best practices, such as using intermediate training checkpoints and comparing models of similar sizes, to improve prediction accuracy. While models of different families may show variable scaling behavior, predictions can often be made using scaling estimates from related models. Additionally, training multiple small models can be more effective than training a single large one due to seed variability.
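
As a rough illustration of what estimating a scaling law involves, the sketch below fits a pure power law to synthetic (model size, loss) pairs by linear regression in log-log space. The functional form, the helper name `fit_power_law`, and all numbers are assumptions for illustration; published scaling laws often include further terms (for example an irreducible loss), and this is not the estimation procedure of the paper above.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by least squares in log-log space; return (a, b)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic data generated from a known power law (assumed values)
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = 50.0 * sizes ** -0.2

a, b = fit_power_law(sizes, losses)
# Extrapolate the fitted law to a larger, unseen model size
predicted = a * 1e9 ** -b
print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e9 parameters = {predicted:.4f}")
```

With noisy real measurements the fitted exponent depends on which model sizes and checkpoints are included, which is the kind of sensitivity the best practices in the paper aim to reduce.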

Emergence and Phase Transitions: 2024

  • Grokking at the Edge of Linear Separability. [paper]

    • Alon Beck, Noam Levi, Yohai Bar-Sinai.
    • Key Word: Grokking.
    • Digest This paper investigates the generalization properties of binary logistic classification and the phenomenon of grokking—delayed generalization with non-monotonic test loss. The authors study a random feature model and show that grokking is amplified when the training set is nearly linearly separable. They prove that if the data is linearly separable from the origin, the model overfits due to the implicit bias of logistic loss. However, for non-separable data, the model generalizes perfectly asymptotically, though early overfitting can occur. The study also finds that near the transition to linear separability, the model can overfit for extended periods before generalizing, drawing parallels to critical phenomena in physical systems.
  • Information-Theoretic Foundations for Neural Scaling Laws. [paper]

    • Hong Jun Jeon, Benjamin Van Roy.
    • Key Word: Information Theory; Neural Scaling Laws.
    • Digest This paper develops rigorous information-theoretic foundations for neural scaling laws, which describe how out-of-sample error depends on model and dataset size. The authors find that for data generated by an infinitely wide two-layer neural network, the optimal relationship between data and model size is linear, up to logarithmic factors. These findings support large-scale empirical observations and aim to clarify and guide future research in neural scaling laws.

Emergence and Phase Transitions: 2023

  • More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory. [paper]

    • James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin.
    • Key Word: Overparameterization; Random Feature Regression.
    • Digest This paper highlights the empirical observation that larger model size, more data, and more computation improve performance in deep learning. The paper provides theoretical backing to these observations by showing that these properties hold in random feature regression, equivalent to shallow networks with only the last layer trained. The study demonstrates that the test risk of random feature regression decreases with the number of features and samples, implying that infinite width random feature architectures are preferable. Additionally, it emphasizes the importance of training to near-zero training loss for achieving near-optimal performance, especially for tasks characterized by power law eigenstructure. The findings suggest the benefits of overparameterization, overfitting, and more data in random feature models.
  • Grokking as the Transition from Lazy to Rich Training Dynamics. [paper]

    • Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, Cengiz Pehlevan.
    • Key Word: Grokking; Kernel Dynamics; Feature Learning.
    • Digest The paper explores the phenomenon of "grokking" in neural networks, where the training loss decreases much earlier than the test loss. It suggests that grokking occurs as neural networks transition from lazy training dynamics to rich feature learning. The study uses a simple polynomial regression problem with a two-layer neural network to illustrate this mechanism and identifies key factors contributing to grokking. These factors include the rate of feature learning, alignment of initial features with the target function, dataset size, and the network's initial training regime. The paper argues that this transition from lazy to rich training dynamics can also impact grokking in more general settings, such as MNIST, one-layer Transformers, and student-teacher networks.
  • Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks. [paper]

    • Noa Rubin, Inbar Seroussi, Zohar Ringel.
    • Key Word: Grokking.
    • Digest This paper explores the phenomenon of Grokking in deep neural networks (DNNs) and its connection to feature learning. It applies the adaptive kernel approach to two teacher-student models and provides analytical predictions on feature learning and Grokking properties. The paper suggests that Grokking in DNNs is akin to a phase transition, resulting in distinct internal representations of the teacher after the transition.
  • Explaining grokking through circuit efficiency. [paper]

    • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar.
    • Key Word: Grokking.
    • Digest This paper addresses the phenomenon of "grokking" in neural networks, where a network initially achieves perfect training accuracy but poor generalization, and only later generalizes. The authors propose that grokking happens when a task allows both a generalizing solution (slower to learn but more efficient) and a memorizing solution. They suggest that memorization becomes less efficient with larger training datasets, while generalization remains unaffected. This implies a critical dataset size where both approaches become equally efficient. The paper makes and confirms four novel predictions about grokking, including the surprising observations of "ungrokking" (regression from perfect to low test accuracy) and "semi-grokking" (delayed generalization to partial test accuracy).
  • Scaling Laws Do Not Scale. [paper]

    • Fernando Diaz, Michael Madaio.
    • Key Word: Neural Scaling Laws.
    • Digest This paper challenges the notion of scaling laws in artificial intelligence (AI) models, arguing that as dataset sizes increase, the diverse values and preferences of different communities represented in the data may not align with the metrics used to evaluate model performance.
  • Absorbing Phase Transitions in Artificial Deep Neural Networks. [paper]

    • Keiichi Tamai, Tsuyoshi Okubo, Truong Vinh Truong Duy, Naotake Natori, Synge Todo.
    • Key Word: Phase Transitions; Neural Scaling Laws.
    • Digest This paper presents a framework for understanding the behavior of finite artificial deep neural networks by drawing parallels to universal critical phenomena in absorbing phase transitions. The authors investigate order-to-chaos transitions in fully-connected feedforward and convolutional neural networks, demonstrating that these transitions exist even in finite networks and that the architecture influences the universality class of the transition. Finite-size scaling is also successfully applied, allowing for a semi-quantitative description of signal propagation dynamics.
  • The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets. [paper]

    • Noam Levi, Yaron Oz.
    • Key Word: Neural Scaling Laws; Random Matrix Theory.
    • Digest The paper explores the underlying scaling laws and universal statistical structure of complex datasets, using tools from statistical physics and Random Matrix Theory (RMT). The authors analyze the feature-feature covariance matrix and observe that the power-law scalings of eigenvalues differ between uncorrelated random data and real-world data. They find that introducing long-range correlations can recover the scaling behavior in synthetic data, and that both synthetic and real-world datasets belong to the same universality class as chaotic systems rather than integrable systems. The expected RMT statistical behavior is evident in empirical covariance matrices at smaller dataset sizes than traditionally used for training, and it can be related to the number of samples needed to approximate the population power-law scaling behavior.
  • Hidden symmetries of ReLU networks. [paper]

    • J. Elisenda Grigsby, Kathryn Lindsey, David Rolnick. ICML 2023
    • Key Word: Permutation Symmetries.
    • Digest The paper explores the representation of feedforward ReLU neural networks using their parameter space during training, investigating the existence of hidden symmetries that result in different parameter settings producing the same function. The authors prove that networks without narrow layers have parameter settings without hidden symmetries. They also identify mechanisms that lead to hidden symmetries and conduct experiments indicating that the probability of networks having no hidden symmetries decreases as depth increases, but increases as width and input dimension increase.
  • Are Emergent Abilities of Large Language Models a Mirage? [paper]

    • Rylan Schaeffer, Brando Miranda, Sanmi Koyejo.
    • Key Word: Large Language Models; Neural Scaling Laws; Emergent Abilities.
    • Digest Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, one can choose a metric which leads to the inference of an emergent ability or another metric which does not. We find strong supporting evidence that emergent abilities may not be a fundamental property of scaling AI models.
  • Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. [paper]

    • Fadhel Ayed, Soufiane Hayou.
    • Key Word: Data Pruning; Neural Scaling Laws.
    • Digest In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate "No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
  • Progress measures for grokking via mechanistic interpretability. [paper]

    • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt.
    • Key Word: Grokking; Interpretability.
    • Digest We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks.
  • Grokking modular arithmetic. [paper]

    • Andrey Gromov.
    • Key Word: Grokking; Modular Addition; Interpretability.
    • Digest We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as "grokking". Concretely, we present fully-connected two-layer networks that exhibit grokking on various modular arithmetic tasks under vanilla gradient descent with the MSE loss function in the absence of any regularization.
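
Several of the grokking entries above use modular addition as their test bed. As a rough illustration only, the sketch below builds such a dataset: every pair `(a, b)` with label `(a + b) mod p`, shuffled and split into a training and a held-out part. The function name `modular_addition_dataset`, the modulus `p=7`, and the 50% split are assumptions chosen for this example and are not taken from any of the listed papers.

```python
import random


def modular_addition_dataset(p, train_fraction, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples.

    Hypothetical helper for illustration; the listed papers use their
    own data pipelines, moduli, and split sizes.
    """
    # Enumerate the full table of p * p input pairs with their labels.
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = modular_addition_dataset(p=7, train_fraction=0.5)
print(len(train), len(test))
```

The training fraction is the knob of interest here: the digests above describe delayed generalization as depending on dataset size, so in an experiment one would vary `train_fraction` and track training and test accuracy separately.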

Emergence and Phase Transitions: 2022

  • Feature learning in neural networks and kernel machines that recursively learn features. [paper] [code]

    • Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin.
    • Key Word: Feature Learning; Kernel Machines; Grokking; Lottery Ticket Hypothesis.
    • Digest We isolate the key mechanism driving feature learning in fully connected neural networks by connecting neural feature learning to the average gradient outer product. We subsequently leverage this mechanism to design Recursive Feature Machines (RFMs), which are kernel machines that learn features. We show that RFMs (1) accurately capture features learned by deep fully connected neural networks, (2) close the gap between kernel machines and fully connected networks, and (3) surpass a broad spectrum of models including neural networks on tabular data.
  • Grokking phase transitions in learning local rules with gradient descent. [paper]

    • Bojan Žunkovič, Enej Ilievski.
    • Key Word: Tensor Network; Grokking; Many-Body Quantum Mechanics; Neural Collapse.
    • Digest We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular automata learning task, numerically determine the critical exponent and the grokking time distributions and compare them with the prediction of the proposed grokking model. Finally, we numerically analyse the connection between structure formation and grokking.
  • Broken Neural Scaling Laws. [paper] [code]

    • Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger.
    • Key Word: Neural Scaling Laws.
    • Digest We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, diffusion generative modeling of images, arithmetic, and reinforcement learning.
  • How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. [paper] [code]

    • Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, Andrew Gordon Wilson.
    • Key Word: Data Augmentation; Neural Scaling Laws; Implicit Regularization.
    • Digest Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data.
  • Omnigrok: Grokking Beyond Algorithmic Data. [paper]

    • Ziming Liu, Eric J. Michaud, Max Tegmark.
    • Key Word: Grokking Dynamics.
    • Digest Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the "LU mechanism" because training and test losses (against model weight norm) typically resemble "L" and "U", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc.
  • Revisiting Neural Scaling Laws in Language and Vision. [paper]

    • Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai.
    • Key Word: Neural Scaling Laws; Multi-modal Learning.
    • Digest The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark.
  • On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence. [paper]

    • Yi Ma, Doris Tsao, Heung-Yeung Shum.
    • Key Word: Intelligence; Parsimony; Self-Consistency; Rate Reduction.
    • Digest Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that we believe to be cornerstones for the emergence of Intelligence, artificial or natural. While these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable ways.
  • Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm. [paper]

    • Lechao Xiao, Jeffrey Pennington. ICML 2022
    • Key Word: Synergy; Symmetry; Implicit Bias; Neural Tangent Kernel; Neural Scaling Laws.
    • Digest Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper, we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality.
  • How Much More Data Do I Need? Estimating Requirements for Downstream Tasks. [paper]

    • Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law. CVPR 2022
    • Key Word: Neural Scaling Laws; Active Learning.
    • Digest Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements.
  • Beyond neural scaling laws: beating power law scaling via data pruning. [paper]

    • Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos.
    • Key Word: Dataset Pruning; Ensemble Active Learning.
    • Digest Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet.
  • Exact Phase Transitions in Deep Learning. [paper]

    • Liu Ziyin, Masahito Ueda.
    • Key Word: Phase Transitions; Symmetry Breaking; Mean-Field Analysis; Statistical Physics.
    • Digest The paper presents a theory that demonstrates the existence of first-order and second-order phase transitions in deep learning, similar to those observed in statistical physics, by analyzing the interplay between prediction error and model complexity in the training loss. The findings have implications for neural network optimization and shed light on the origin of the posterior collapse problem in Bayesian deep learning.
  • Towards Understanding Grokking: An Effective Theory of Representation Learning. [paper]

    • Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams.
    • Key Word: Grokking; Physics of Learning; Deep Double Descent.
    • Digest We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion.
  • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. [paper] [code]

    • Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra.
    • Key Word: Grokking; Overfitting.
    • Digest In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting.

Emergence and Phase Transitions: 2021

  • Learning Curve Theory. [paper]

    • Marcus Hutter.
    • Key Word: Neural Scaling Law; Learning Curve Theory.
    • Digest Recently a number of empirical "universal" scaling law papers have been published, most notably by OpenAI. "Scaling laws" refers to power-law decreases of training or test error w.r.t. more data, larger neural networks, and/or more compute. In this work we focus on scaling w.r.t. data size $n$. Theoretical understanding of this phenomenon is largely lacking, except in finite-dimensional models for which error typically decreases with $n^{-1/2}$ or $n^{-1}$, where $n$ is the sample size. We develop and theoretically analyse the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta>0$, and determine whether power laws are universal or depend on the data distribution.
  • Explaining Neural Scaling Laws. [paper] [code]

    • Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma. ICLR 2022
    • Key Word: Scaling Laws; Neural Tangent Kernel.
    • Digest We propose a theory that explains and connects the power-law scaling of test loss with dataset size and with model size. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold.

Emergence and Phase Transitions: 2020

  • A Neural Scaling Law from the Dimension of the Data Manifold. [paper]
    • Utkarsh Sharma, Jared Kaplan.
    • Key Word: Neural Scaling Law; Manifold Intrinsic Dimension; Fractal Dimension.
    • Digest This paper investigates neural network performance with abundant data, finding that the loss of well-trained networks scales as a power law in the number of network parameters $N$, i.e. $L \propto N^{-\alpha}$. The phenomenon applies broadly across diverse data types and scales. The study proposes that this behavior stems from neural models effectively performing regression on a data manifold of intrinsic dimension $d$. The theory predicts scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. The prediction is validated empirically via independent measurements of intrinsic dimension and scaling exponents in a teacher/student framework, where $d$ and $\alpha$ are varied by adjusting random teacher networks. CNN classifiers and GPT-style language models further test the theory's applicability across datasets.
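
To make the relation in the digest above concrete, here is a minimal sketch of how a scaling exponent could be estimated from (model size, loss) pairs and then turned into an implied intrinsic dimension through the predicted relation between the exponent and 4/d. The data are synthetic, and the function names, the prefactor 3.0, and the exponent 0.25 are assumptions made up for this example; real loss curves are noisy and often include an irreducible-loss offset that this sketch ignores.

```python
import math


def fit_power_law_exponent(sizes, losses):
    """Estimate alpha in L = c * N**(-alpha) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the best-fit line through (log N, log L).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


def implied_intrinsic_dimension(alpha):
    """Invert the predicted relation alpha = 4 / d."""
    return 4.0 / alpha


# Synthetic, noise-free losses generated with alpha = 0.25 (illustrative only).
sizes = [1e4, 1e5, 1e6, 1e7]
losses = [3.0 * n ** (-0.25) for n in sizes]

alpha_hat = fit_power_law_exponent(sizes, losses)
d_hat = implied_intrinsic_dimension(alpha_hat)
print(alpha_hat, d_hat)
```

Fitting in log-log space is the simplest choice because a pure power law becomes a straight line there; with measured losses one would typically also report the fit residuals before trusting the implied dimension.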

Interactions with Neuroscience

Interactions with Neuroscience: 2023

  • How deep is the brain? The shallow brain hypothesis. [paper]

    • Mototaka Suzuki, Cyriel M. A. Pennartz, Jaan Aru.
    • Key Word: Shallow Brain Hypothesis.
    • Digest This paper critiques the common assumption in deep learning and predictive coding that neural network inference is hierarchical, pointing out the overlooked neurobiological evidence of direct interactions between all cortical areas and subcortical regions. It challenges the prevalent cortico-centric, hierarchical models in current neural networks, suggesting they miss key computational principles used by the brain. Introducing the "shallow brain hypothesis," the authors propose that hierarchical cortical processing works in tandem with a parallel process significantly involving subcortical areas. They argue this integrated architecture, which leverages the computational abilities of cortical microcircuits and thalamo-cortical loops absent in conventional models, offers crucial advantages for achieving the rapid and flexible computational capabilities seen in mammalian brains.
  • Finding Neurons in a Haystack: Case Studies with Sparse Probing. [paper] [code]

    • Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas.
    • Key Word: Probing; Mechanistic Interpretability; Superposition; Sparse Coding.
    • Digest We seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train k-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of k we study the sparsity of learned representations and how this varies with model scale. With k=1, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs.

Interactions with Neuroscience: 2022

  • Multilevel development of cognitive abilities in an artificial neural network. [paper]

    • Konstantin Volzhenin, Jean-Pierre Changeux, Guillaume Dumas. PNAS
    • Key Word: Global Neuronal Workspace.
    • Digest We introduce a three-level computational model of information processing and acquisition of cognitive abilities. We propose minimal architectural requirements to build these levels, and how the parameters affect their performance and relationships. The first sensorimotor level handles local nonconscious processing, here during a visual classification task. The second level or cognitive level globally integrates the information from multiple local processors via long-ranged connections and synthesizes it in a global, but still nonconscious, manner. The third and cognitively highest level handles the information globally and consciously. It is based on the global neuronal workspace (GNW) theory and is referred to as the conscious level.
  • Deep Problems with Neural Network Models of Human Vision. [paper]

    • Jeffrey S Bowers, Gaurav Malhotra, Marin Dujmović, Milton Llera Montero, Christian Tsvetkov, Valerio Biscione, Guillermo Puebla, Federico G Adolfi, John Hummel, Rachel Flood Heaton, Benjamin Evans, Jeff Mitchell, Ryan Blything.
    • Key Word: Brain-Score; Computational Neuroscience; Convolutional Neural Networks; Representational Similarity Analysis.
    • Digest We show that the good prediction on these datasets may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition.
  • Reassessing hierarchical correspondences between brain and deep networks through direct interface. [paper]

    • Nicholas J Sexton, Bradley C Love. Science Advances
    • Key Word: Neural Interfacing Analysis; Shared Neural Variance.
    • Digest Functional correspondences between deep convolutional neural networks (DCNNs) and the mammalian visual system support a hierarchical account in which successive stages of processing contain ever higher-level information. However, these correspondences between brain and model activity involve shared, not task-relevant, variance. We propose a stricter account of correspondence: If a DCNN layer corresponds to a brain region, then replacing model activity with brain activity should successfully drive the DCNN’s object recognition decision. Using this approach on three datasets, we found that all regions along the ventral visual stream best corresponded with later model layers, indicating that all stages of processing contained higher-level information about object category.
  • Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream. [paper]

    • Franziska Geiger, Martin Schrimpf, Tiago Marques, James J. DiCarlo. ICLR 2022
    • Key Word: Computational Neuroscience; Primate Visual Ventral Stream.
    • Digest We develop biologically-motivated initialization and training procedures to train models with 200x fewer synaptic updates (epochs x labeled images x weights) while maintaining 80% of brain predictivity on a set of neural and behavioral benchmarks.
  • Curriculum learning as a tool to uncover learning principles in the brain. [paper]

    • Daniel R. Kepple, Rainer Engelken, Kanaka Rajan. ICLR 2022
    • Key Word: Curriculum Learning; Neuroscience.
    • Digest We present a novel approach to use curricula to identify principles by which a system learns. Previous work in curriculum learning has focused on how curricula can be designed to improve learning of a model on particular tasks. We consider the inverse problem: what can a curriculum tell us about how a learning system acquired a task? Using recurrent neural networks (RNNs) and models of common experimental neuroscience tasks, we demonstrate that curricula can be used to differentiate learning principles, using target-based and representation-based loss functions as use cases.
  • Building Transformers from Neurons and Astrocytes. [paper]

    • Leo Kozachkov, Ksenia V. Kastanenka, Dmitry Krotov.
    • Key Word: Transformers; Glia; Astrocytes.
    • Digest In this work we hypothesize that neuron-astrocyte networks can naturally implement the core computation performed by the Transformer block in AI. The omnipresence of astrocytes in almost any brain area may explain the success of Transformers across a diverse set of information domains and computational tasks.
  • High-performing neural network models of visual cortex benefit from high latent dimensionality. [paper]

    • Eric Elmoznino, Michael F. Bonner.
    • Key Word: Dimensionality and Alignment in Computational Brain Models.
    • Digest The prevailing view holds that optimal DNNs compress their representations onto low-dimensional manifolds to achieve invariance and robustness, which suggests that better models of visual cortex should have low-dimensional geometries. Surprisingly, we found a strong trend in the opposite direction—neural networks with high-dimensional image manifolds tend to have better generalization performance when predicting cortical responses to held-out stimuli in both monkey electrophysiology and human fMRI data.
  • Constrained Predictive Coding as a Biologically Plausible Model of the Cortical Hierarchy. [paper] [code]

    • Siavash Golkar, Tiberiu Tesileanu, Yanis Bahroun, Anirvan M. Sengupta, Dmitri B. Chklovskii. NeurIPS 2022
    • Key Word: Predictive Coding Theory.
    • Digest The paper presents a modified version of the Predictive Coding (PC) framework, called Constrained Predictive Coding, which addresses unresolved issues and controversies in mapping PC onto the cortical hierarchy. The authors introduce a disentangling-inspired constraint on hidden-layer neural activities, derive an upper bound for the PC objective, and optimize it to develop a biologically plausible network that performs as well as the original PC objective.
  • Painful intelligence: What AI can tell us about human suffering. [paper]

    • Aapo Hyvärinen.
    • Key Word: Neuroscience.
    • Digest This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected.
  • The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. [paper] [code]

    • Lukas S. Huber, Robert Geirhos, Felix A. Wichmann.
    • Key Word: Object Recognition; Out-of-Distribution Generalization; Children.
    • Digest We find, first, that already 4–6 year-olds showed remarkable robustness to image distortions and outperform DNNs trained on ImageNet. Second, we estimated the number of “images” children have been exposed to during their lifetime. Compared to various DNNs, children's high robustness requires relatively little data. Third, when recognizing objects children—like adults but unlike DNNs—rely heavily on shape but not on texture cues. Together our results suggest that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely the result of a mere accumulation of experience with distorted visual input.
  • Finding Biological Plausibility for Adversarially Robust Features via Metameric Tasks. [paper] [code]

    • Anne Harrington, Arturo Deza. ICLR 2022
    • Key Word: Adversarial Robustness; Peripheral Computation; Psychophysics.
    • Digest To understand how adversarially robust optimizations/representations compare to human vision, we performed a psychophysics experiment using a metamer task where we evaluated how well human observers could distinguish between images synthesized to match adversarially robust representations compared to non-robust representations and a texture synthesis model of peripheral vision. We found that the discriminability of robust representation and texture model images decreased to near chance performance as stimuli were presented farther in the periphery.

Interactions with Neuroscience: 2021

  • Relating transformers to models and neural representations of the hippocampal formation. [paper]

    • James C.R. Whittington, Joseph Warren, Timothy E.J. Behrens. ICLR 2022
    • Key Word: Transformers; Hippocampus; Cortex.
    • Digest We show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience.
  • Partial success in closing the gap between human and machine vision. [paper] [code]

    • Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel. NeurIPS 2021
    • Key Word: Out-of-Distribution Generalization; Psychophysical Experiments.
    • Digest A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants.
  • Does enhanced shape bias improve neural network robustness to common corruptions? [paper]

    • Chaithanya Kumar Mummadi, Ranjitha Subramaniam, Robin Hutmacher, Julien Vitay, Volker Fischer, Jan Hendrik Metzen. ICLR 2021
    • Key Word: Shape-Texture; Robustness.
    • Digest We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct.

Interactions with Neuroscience: 2020

  • Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations. [paper]

    • Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David Cox, James J. DiCarlo. NeurIPS 2020
    • Key Word: Robustness; V1 Model.
    • Digest Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system. However, these CNNs can be fooled by imperceptibly small, explicitly crafted perturbations, and struggle to recognize objects in corrupted images that are easily recognized by humans. Here, by making comparisons with primate neural data, we first observed that CNN models with a neural hidden layer that better matches primate primary visual cortex (V1) are also more robust to adversarial attacks. Inspired by this observation, we developed VOneNets, a new class of hybrid CNN vision models. Each VOneNet contains a fixed weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models.
  • On 1/n neural representation and robustness. [paper] [code]

    • Josue Nassar, Piotr Aleksander Sokol, SueYeon Chung, Kenneth D. Harris, Il Memming Park. NeurIPS 2020
    • Key Word: Adversarial Robustness; 1/n Power Law.
    • Digest We investigate how the structure of a neural representation affects its robustness to perturbations by juxtaposing experimental results regarding the covariance spectrum of neural representations in the mouse V1 (Stringer et al) with artificial neural networks. We use adversarial robustness to probe Stringer et al's theory regarding the causal role of a 1/n covariance spectrum. We empirically investigate the benefits such a neural code confers in neural networks, and illuminate its role in multi-layer architectures. Our results show that imposing the experimentally observed structure on artificial neural networks makes them more robust to adversarial attacks. Moreover, our findings complement the existing theory relating wide neural networks to kernel methods, by showing the role of intermediate representations.
  • Shape-Texture Debiased Neural Network Training. [paper] [code]

    • Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie. ICLR 2021
    • Key Word: Shape-Texture; Robustness.
    • Digest Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degrades model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending to a single cue in representation learning, we augment training data with images with conflicting shape and texture information (e.g., an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervisions from shape and texture simultaneously.
  • Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. [paper] [code]

    • Robert Geirhos, Kristof Meding, Felix A. Wichmann.
    • Key Word: Error Consistency.
    • Digest Here we introduce trial-by-trial error consistency, a quantitative analysis for measuring whether two decision making systems systematically make errors on the same inputs. Making consistent errors on a trial-by-trial basis is a necessary condition if we want to ascertain similar processing strategies between decision makers.
  • Biologically Inspired Mechanisms for Adversarial Robustness. [paper]

    • Manish V. Reddy, Andrzej Banburski, Nishka Pant, Tomaso Poggio. NeurIPS 2020
    • Key Word: Robustness; Retinal Fixations.
    • Digest A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations.
  • Five Points to Check when Comparing Visual Perception in Humans and Machines. [paper] [code]

    • Christina M. Funke, Judy Borowski, Karolina Stosio, Wieland Brendel, Thomas S. A. Wallis, Matthias Bethge. JOV
    • Key Word: Model Comparison.
    • Digest With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed towards comparing information processing in humans and machines. These studies are an exciting chance to learn about one system by studying the other. Here, we propose ideas on how to design, conduct and interpret experiments such that they adequately support the investigation of mechanisms when comparing human and machine perception. We demonstrate and apply these ideas through three case studies.
  • Shortcut Learning in Deep Neural Networks. [paper] [code]

    • Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, Felix A. Wichmann. Nature Machine Intelligence
    • Key Word: Out-of-Distribution Generalization; Survey.
    • Digest Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distil how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
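
The trial-by-trial error consistency described in the Geirhos, Meding and Wichmann entry above can be sketched as a Cohen's-kappa style statistic: the observed rate at which two observers are both right or both wrong on the same trials, corrected for the overlap expected by chance from their accuracies alone. This is a minimal illustration under that assumption, not the authors' released code; the function name `error_consistency` and its arguments are hypothetical.

```python
def error_consistency(correct_a, correct_b):
    """Kappa-style error consistency between two decision makers.

    correct_a, correct_b: equal-length sequences of truthy/falsy values,
    one per trial, marking whether each observer answered correctly.
    Returns a value in [-1, 1]; 0 means overlap at chance level.
    """
    a = [bool(x) for x in correct_a]
    b = [bool(x) for x in correct_b]
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(a)
    # Observed consistency: both correct or both wrong on the same trial.
    c_obs = sum(x == y for x, y in zip(a, b)) / n

    # Consistency expected by chance for independent observers
    # with the same accuracies.
    p_a = sum(a) / n
    p_b = sum(b) / n
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    if c_exp == 1.0:
        # Both observers are always right (or always wrong): kappa is undefined.
        return float("nan")
    return (c_obs - c_exp) / (1 - c_exp)
```

Two observers with identical per-trial outcomes score 1, while observers whose errors overlap only as often as their accuracies predict score 0, even if their overall accuracies are equal.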

Interactions with Neuroscience: 2019

  • A deep learning framework for neuroscience. [paper]

    • Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, Colleen J. Gillon, Danijar Hafner, Adam Kepecs, Nikolaus Kriegeskorte, Peter Latham, Grace W. Lindsay, Kenneth D. Miller, Richard Naud, Christopher C. Pack, Panayiota Poirazi, Pieter Roelfsema, João Sacramento, Andrew Saxe, Benjamin Scellier, Anna C. Schapiro, Walter Senn, Greg Wayne, Daniel Yamins, Friedemann Zenke, Joel Zylberberg, Denis Therien, Konrad P. Kording. Nature Neuroscience
    • Key Word: Deep Learning; Neuroscience.
    • Digest The article discusses the similarities and differences between systems neuroscience and artificial intelligence. It argues that the three components in artificial neural networks - objective functions, learning rules, and architectures - are crucial for modeling and optimizing complex artificial learning systems. The authors suggest that a greater focus on these components could benefit systems neuroscience and drive theoretical and experimental progress.
  • White Noise Analysis of Neural Networks. [paper] [code]

    • Ali Borji, Sikun Lin. ICLR 2020
    • Key Word: Spike-Triggered Analysis.
    • Digest A white noise analysis of modern deep neural networks is presented to unveil their biases at the whole network level or the single neuron level. Our analysis is based on two popular and related methods in psychophysics and neurophysiology namely classification images and spike triggered analysis.
  • The Origins and Prevalence of Texture Bias in Convolutional Neural Networks. [paper]

    • Katherine L. Hermann, Ting Chen, Simon Kornblith. NeurIPS 2020
    • Key Word: Shape-Texture; Robustness.
    • Digest Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, CNNs learn to classify by shape at least as easily as by texture. What factors, then, produce the texture bias in CNNs trained on ImageNet? Different unsupervised training objectives and different architectures have small but significant and largely independent effects on the level of texture bias. However, all objectives and architectures still lead to models that make texture-based classification decisions a majority of the time, even if shape information is decodable from their hidden representations. The effect of data augmentation is much larger.
  • Learning From Brains How to Regularize Machines. [paper]

    • Zhe Li, Wieland Brendel, Edgar Y. Walker, Erick Cobos, Taliah Muhammad, Jacob Reimer, Matthias Bethge, Fabian H. Sinz, Xaq Pitkow, Andreas S. Tolias. NeurIPS 2019
    • Key Word: Neural Representation Similarity.
    • Digest Despite impressive performance on numerous visual tasks, Convolutional Neural Networks (CNNs) --- unlike brains --- are often highly sensitive to small perturbations of their input, e.g. adversarial noise leading to erroneous decisions. We propose to regularize CNNs using large-scale neuroscience data to learn more robust neural features in terms of representational similarity. We presented natural images to mice and measured the responses of thousands of neurons from cortical visual areas.
  • A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs. [paper] [code]

    • Jack Lindsey, Samuel A. Ocko, Surya Ganguli, Stephane Deny. ICLR 2019
    • Key Word: Visual System; Convolutional Neural Networks; Efficient Coding; Retina.
    • Digest Neural representations differ sharply between the first stages of visual processing: retinal ganglion cells have center-surround receptive fields, whereas receptive fields in primary visual cortex are tuned to orientation. There is currently no unified theory explaining these differences in representations across layers. Here, using a deep convolutional neural network trained on image recognition as a model of the visual system, we show that such differences in representation can emerge as a direct consequence of different neural resource constraints on the retinal and cortical networks, and we find a single model from which both geometries spontaneously emerge at the appropriate stages of visual processing. The key constraint is a reduced number of neurons at the retinal output, consistent with the anatomy of the optic nerve as a stringent bottleneck.

Interactions with Neuroscience: 2018

  • ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. [paper] [code]

    • Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel. ICLR 2019
    • Key Word: Shape-Texture; Psychophysical Experiments.
    • Digest Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies.
  • Generalisation in humans and deep neural networks. [paper] [code]

    • Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, Felix A. Wichmann. NeurIPS 2018
    • Key Word: Robustness.
    • Digest We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types.

Interactions with Neuroscience: 2017

  • Comparing deep neural networks against humans: object recognition when the signal gets weaker. [paper] [code]
    • Robert Geirhos, David H. J. Janssen, Heiko H. Schütt, Jonas Rauber, Matthias Bethge, Felix A. Wichmann. NeurIPS 2018
    • Key Word: Model Comparison; Robustness.
    • Digest Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations.

Information Bottleneck

Information Bottleneck: 2023

  • How Does Information Bottleneck Help Deep Learning? [paper]

    • Kenji Kawaguchi, Zhun Deng, Xu Ji, Jiaoyang Huang.
    • Key Word: Information Bottleneck; Generalization Bound.
    • Digest This paper shows that information bottleneck, the idea of minimizing unnecessary information while maximizing task-relevant information, can help to reduce generalization errors in deep learning. The paper proves a new learning theory that connects information bottleneck to generalization errors and validates it with experiments. The paper also compares the new bounds with existing bounds based on other complexity measures.
  • To Compress or Not to Compress -- Self-Supervised Learning and Information Theory: A Review. [paper]

    • Ravid Shwartz-Ziv, Yann LeCun.
    • Key Word: Self-Supervised Learning; Information Theory; Survey.
    • Digest We review various approaches to self-supervised learning from an information-theoretic standpoint and present a unified framework that formalizes the self-supervised information-theoretic learning problem. We integrate existing research into a coherent framework, examine recent self-supervised methods, and identify research opportunities and challenges.

Information Bottleneck: 2022

  • Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. [paper]
    • Anirban Samaddar, Sandeep Madireddy, Prasanna Balaprakash.
    • Key Word: Information Bottleneck; Robustness.
    • Digest We present a novel sparsity-inducing spike-slab prior that uses sparsity as a mechanism to provide flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism to learn a joint distribution of the latent variable and the sparsity. Thus, unlike other approaches, it can account for the full uncertainty in the latent space.

Information Bottleneck: 2021

  • Information Bottleneck Disentanglement for Identity Swapping. [paper]

    • Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, Ran He. CVPR 2021
    • Key Word: Information Bottleneck; Identity Swapping.
    • Digest We propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck trade-off, in terms of finding an optimal compression of the pre-trained latent features.
  • PAC-Bayes Information Bottleneck. [paper] [code]

    • Zifeng Wang, Shao-Lun Huang, Ercan E. Kuruoglu, Jimeng Sun, Xi Chen, Yefeng Zheng. ICLR 2022
    • Key Word: Information Bottleneck; PAC-Bayes.
    • Digest There have been a series of theoretical works trying to derive non-vacuous bounds for NNs. Recently, the compression of information stored in weights (IIW) is proved to play a key role in NNs generalization based on the PAC-Bayes theorem. However, no solution of IIW has ever been provided, which builds a barrier for further investigation of the IIW's property and its potential in practical deep learning. In this paper, we propose an algorithm for the efficient approximation of IIW. Then, we build an IIW-based information bottleneck on the trade-off between accuracy and information complexity of NNs, namely PIB.
  • Information Bottleneck: Exact Analysis of (Quantized) Neural Networks. [paper] [code]

    • Stephan Sloth Lorenzen, Christian Igel, Mads Nielsen. ICLR 2022
    • Key Word: Information Bottleneck; Quantization.
    • Digest We study the IB principle in settings where MI is non-trivial and can be computed exactly. We monitor the dynamics of quantized neural networks, that is, we discretize the whole deep learning system so that no approximation is required when computing the MI. This allows us to quantify the information flow without measurement errors.
  • Compressive Visual Representations. [paper] [code]

    • Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, Ian Fischer. NeurIPS
    • Key Word: Self-Supervision; Contrastive Learning; Conditional Entropy Bottleneck; Out-of-Distribution Generalization.
    • Digest We hypothesize that adding explicit information compression to contrastive self-supervised learning algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn.
  • Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization. [paper] [code]

    • Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Yoshua Bengio, Ioannis Mitliagkas, Irina Rish. NeurIPS 2021
    • Key Word: Information Bottleneck; Out-of-Distribution Generalization; Invariant Risk Minimization.
    • Digest We revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address the key failures when invariant features capture all the information about the label and also retains the existing success when they do not.
  • Perturbation Theory for the Information Bottleneck. [paper]

    • Vudtiwat Ngampruetikorn, David J. Schwab. NeurIPS 2021
    • Key Word: Information Bottleneck; Perturbation Theory.
    • Digest Extracting relevant information from data is crucial for all forms of learning. The information bottleneck (IB) method formalizes this, offering a mathematically precise and conceptually appealing framework for understanding learning phenomena. However the nonlinearity of the IB problem makes it computationally expensive and analytically intractable in general. Here we derive a perturbation theory for the IB method and report the first complete characterization of the learning onset, the limit of maximum relevant information per bit extracted from data.
  • A Critical Review of Information Bottleneck Theory and its Applications to Deep Learning. [paper]

    • Mohammad Ali Alomrani.
    • Key Word: Information Bottleneck; Survey.
    • Digest A known information-theoretic method called the information bottleneck theory has emerged as a promising approach to better understand the learning dynamics of neural networks. In principle, IB theory models learning as a trade-off between the compression of the data and the retention of information. The goal of this survey is to provide a comprehensive review of IB theory covering its information-theoretic roots and the recently proposed applications to understand deep learning models.

Information Bottleneck: 2020

  • Graph Information Bottleneck. [paper] [code]

    • Tailin Wu, Hongyu Ren, Pan Li, Jure Leskovec. NeurIPS 2020
    • Key Word: Information Bottleneck; Graph Neural Networks.
    • Digest We introduce Graph Information Bottleneck (GIB), an information-theoretic principle that optimally balances expressiveness and robustness of the learned representation of graph-structured data. Inheriting from the general Information Bottleneck (IB), GIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target, and simultaneously constraining the mutual information between the representation and the input data.
  • Learning Optimal Representations with the Decodable Information Bottleneck. [paper] [code]

    • Yann Dubois, Douwe Kiela, David J. Schwab, Ramakrishna Vedantam. NeurIPS 2020
    • Key Word: Information Bottleneck.
    • Digest We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.
  • Concept Bottleneck Models. [paper] [code]

    • Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang. ICML 2020
    • Key Word: Information Bottleneck.
    • Digest We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction.
  • On Information Plane Analyses of Neural Network Classifiers -- A Review. [paper]

    • Bernhard C. Geiger. TNNLS
    • Key Word: Information Bottleneck; Survey.
    • Digest We review the current literature concerned with information plane analyses of neural network classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated.
  • On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. [paper]

    • Abdellatif Zaidi, Inaki Estella Aguerri, Shlomo Shamai. Entropy
    • Key Word: Information Bottleneck; Survey.
    • Digest This tutorial paper focuses on the variants of the bottleneck problem taking an information theoretic perspective and discusses practical methods to solve it, as well as its connection to coding and learning aspects. The intimate connections of this setting to remote source-coding under logarithmic loss distortion measure, information combining, common reconstruction, the Wyner-Ahlswede-Korner problem, the efficiency of investment information, as well as generalization, variational inference, representation learning, autoencoders, and others are highlighted.
  • Phase Transitions for the Information Bottleneck in Representation Learning. [paper]

    • Tailin Wu, Ian Fischer. ICLR 2020
    • Key Word: Information Bottleneck.
    • Digest Our work provides the first theoretical formula to address IB phase transitions in the most general setting. In addition, we present an algorithm for iteratively finding the IB phase transition points.
  • Restricting the Flow: Information Bottlenecks for Attribution. [paper] [code]

    • Karl Schulz, Leon Sixt, Federico Tombari, Tim Landgraf. ICLR 2020
    • Key Word: Information Bottleneck; Attribution.
    • Digest We adapt the information bottleneck concept for attribution. By adding noise to intermediate feature maps we restrict the flow of information and can quantify (in bits) how much information image regions provide.
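
Several entries above, including the Graph Information Bottleneck one, describe the same trade-off: keep the representation informative about the target while limiting what it retains about the input. In the standard IB Lagrangian this can be written as below, where \(X\) is the input, \(Y\) the target, \(Z\) the learned representation, \(I(\cdot\,;\cdot)\) mutual information, and \(\beta\) an assumed trade-off coefficient. This is the generic form only; each paper's exact objective (for example, the graph-structured constraints in GIB) adds details that are not shown here.

```latex
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X), \qquad \beta > 0
```

A larger \(\beta\) penalizes information about the input more heavily and so favors stronger compression; as \(\beta \to 0\) the objective reduces to maximizing predictive information alone.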

Information Bottleneck: 2019

  • Learnability for the Information Bottleneck. [paper]
    • Tailin Wu, Ian Fischer, Isaac L. Chuang, Max Tegmark. UAI 2019
    • Key Word: Information Bottleneck.
    • Digest We presented theoretical results for predicting the onset of learning, and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise.

Information Bottleneck: 2018

  • On the Information Bottleneck Theory of Deep Learning. [paper] [code]
    • Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, David Daniel Cox. ICLR 2018
    • Key Word: Information Bottleneck.
    • Digest This submission explores [recent theoretical work](https://arxiv.org/abs/1703.00810) by Shwartz-Ziv and Tishby on explaining the generalization ability of deep networks. The paper gives counter-examples that suggest aspects of the theory might not be relevant for all neural networks.

Information Bottleneck: 2017

  • Emergence of Invariance and Disentanglement in Deep Representations. [paper]

    • Alessandro Achille, Stefano Soatto. JMLR
    • Key Word: PAC-Bayes; Information Bottleneck.
    • Digest Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: One with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights.
  • Information-theoretic analysis of generalization capability of learning algorithms. [paper]

    • Aolin Xu, Maxim Raginsky. NeurIPS 2017
    • Key Word: Information Bottleneck.
    • Digest We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The bounds provide an information-theoretic understanding of generalization in learning problems, and give theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information. We propose a number of methods for this purpose, among which are algorithms that regularize the ERM algorithm with relative entropy or with random noise.
  • Opening the Black Box of Deep Neural Networks via Information. [paper]

    • Ravid Shwartz-Ziv, Naftali Tishby.
    • Key Word: Information Bottleneck.
    • Digest [Previous work](https://arxiv.org/abs/1503.02406) proposed to analyze DNNs in the *Information Plane*; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs.
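
The mutual-information generalization bound summarized in the Xu and Raginsky entry above is commonly stated as follows. Here \(S = (Z_1, \dots, Z_n)\) is the training sample drawn i.i.d. from a distribution \(\mu\), \(W\) is the hypothesis returned by the learning algorithm, \(L_\mu(W)\) is the population risk, and \(L_S(W)\) is the empirical risk. The bound assumes the loss is \(\sigma\)-subgaussian under \(\mu\) for every hypothesis; the notation is chosen here for illustration, and the paper should be consulted for the precise conditions.

```latex
\left| \mathbb{E}\!\left[ L_\mu(W) - L_S(W) \right] \right|
\;\le\; \sqrt{\frac{2\sigma^2}{n} \, I(S; W)}
```

The bound tightens as the sample size \(n\) grows or as the algorithm's output depends less on the particular sample, which is why regularizers that limit \(I(S; W)\), such as added noise, can improve generalization.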

Neural Tangent Kernel

Neural Tangent Kernel: 2024

  • The lazy (NTK) and rich (μP) regimes: a gentle tutorial. [paper]

    • Dhruva Karkada.
    • Key Word: Neural Tangent Kernel; Feature Learning; Tensor Programs; Tutorial.
    • Digest The paper discusses the training of wide neural networks and the role of hyperparameters in controlling the richness of training behavior. It highlights that there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights to effectively train wide networks. This degree of freedom determines the richness of training behavior, ranging from lazy training similar to a kernel machine to feature learning in the μP regime. The paper synthesizes recent research results, offers new perspectives and intuitions, and provides empirical evidence supporting these claims. The authors hope that further study of the richness scale will contribute to the development of a scientific theory of feature learning in practical deep neural networks.
  • LoRA Training in the NTK Regime has No Spurious Local Minima. [paper]

    • Uijeong Jang, Jason D. Lee, Ernest K. Ryu.
    • Key Word: Low-Rank Adaptation; Neural Tangent Kernel Regime.
    • Digest The paper provides a theoretical analysis of Low-rank adaptation (LoRA), a method for efficiently fine-tuning large language models (LLMs), in the context of the neural tangent kernel (NTK) regime with \(N\) data points. It reveals three key findings: (i) full fine-tuning without LoRA naturally leads to a low-rank solution with rank approximately \(\sqrt{N}\); (ii) employing LoRA with a rank slightly larger than \(\sqrt{N}\) helps avoid spurious local minima, enabling gradient descent to effectively identify these low-rank solutions; (iii) the low-rank solutions obtained through LoRA exhibit good generalization. This work deepens our understanding of why LoRA is effective for parameter-efficient fine-tuning of LLMs.
  • Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks. [paper]

    • Daniel Beaglehole, Ioannis Mitliagkas, Atish Agarwala.
    • Key Word: Neural Tangent Kernel; Neural Feature Ansatz.
    • Digest This paper explores a key unsolved issue in supervised learning: how neural networks process and learn from the statistical relationships between inputs and labels. It delves into the previously observed phenomenon where the gram matrices of neural network weights align with the model's average gradient outer product, a concept known as the Neural Feature Ansatz (NFA). The study clarifies why these elements correlate during training by linking the NFA to the alignment between the weight matrices' left singular structure and key components of the empirical neural tangent kernels. It introduces the idea of a centered NFA that underscores this alignment and demonstrates that the development speed of the NFA can be analytically predicted early in training based on input and label statistics. Additionally, the paper presents a novel method to enhance NFA correlation in neural networks, significantly improving the learned features' quality.

Neural Tangent Kernel: 2023

  • A Spectral Condition for Feature Learning. [paper]

    • Greg Yang, James B. Simon, Jeremy Bernstein.
    • Key Word: Feature Learning; Spectral Normalization; Parametrizations for Wide Neural Networks.
    • Digest Motivated by the push to train ever larger neural networks, this paper studies how to ensure feature learning at large width. It shows that feature learning is achieved by scaling the spectral norm of weight matrices and their updates like √(fan-out/fan-in), in contrast to common heuristics based on Frobenius norm and entry size. The spectral analysis also gives an elementary derivation of "maximal update parametrization" and aims to provide a solid conceptual understanding of feature learning in neural networks.
  • On the Neural Tangent Kernel of Equilibrium Models. [paper]

    • Zhili Feng, J. Zico Kolter.
    • Key Word: Neural Tangent Kernel; Deep Equilibrium Models.
    • Digest This paper examines the neural tangent kernel (NTK) of the deep equilibrium (DEQ) model, an architecture that computes the infinite-depth limit of a weight-tied network through root-finding. It demonstrates that, unlike fully-connected neural networks, the NTK of the DEQ model remains deterministic even when both width and depth tend to infinity simultaneously, and it can be efficiently computed using root-finding.
  • On the Disconnect Between Theory and Practice of Overparametrized Neural Networks. [paper]

    • Jonathan Wenger, Felix Dangel, Agustinus Kristiadi.
    • Key Word: Overparameterized Neural Networks; Neural Tangent Kernel.
    • Digest This paper explores the theoretical concept of the infinite-width limit of neural networks, which relates them to kernel methods. While previous research suggested potential advantages in optimization, uncertainty quantification, and continual learning, the paper's empirical findings indicate that these benefits do not apply to practical, large-width neural network architectures. This disconnect between theory and practice questions the practical significance of the infinite-width limit.
  • Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs. [paper]

    • Rajat Vadiraj Dwaraknath, Tolga Ergen, Mert Pilanci.
    • Key Word: Neural Tangent Kernel; Convex Optimization.
    • Digest This paper explores theoretical aspects of deep neural networks in two main directions: 1) Understanding neural network training with SGD under specific conditions like infinite hidden-layer width and infinitesimally small learning rates using the Neural Tangent Kernel (NTK), and 2) Optimizing the training objective for ReLU networks through convex reformulations, leading to gated ReLU networks that are globally optimizable. The paper connects these two directions by interpreting the convex program for gated ReLU networks as a Multiple Kernel Learning (MKL) model. It establishes a relationship between the NTK and the optimal MKL kernel for a specific choice of mask weights. The NTK's lack of dependence on learning targets means it can't outperform the optimal MKL kernel on the training data. The paper proposes an iterative reweighting method to improve NTK weights, obtaining the optimal MKL kernel equivalent to the convex reformulation of the gated ReLU network. Numerical simulations support the theory, and the paper also analyzes prediction errors using group lasso consistency results.
  • Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit. [paper]

    • Greg Yang, Etai Littwin.
    • Key Word: Tensor Programs.
    • Digest The paper explores the behavior of wide neural networks trained with adaptive optimizers like Adam beyond stochastic gradient descent (SGD). It reveals that the dichotomy between feature learning and kernel behaviors, observed in SGD, also applies to Adam with a nonlinear notion of "kernel." The study derives the "neural tangent" and "maximal update" limits for any architecture. The paper introduces a new Tensor Program language, NEXORT, to express how adaptive optimizers process gradients into updates, and utilizes bra-ket notation to simplify expressions and calculations in Tensor Programs. The work provides a comprehensive summary and generalization of previous results in the Tensor Programs series of papers.
  • The NTK approximation is valid for longer than you think. [paper]

    • Enric Boix-Adsera, Etai Littwin.
    • Key Word: Neural Tangent Kernel Approximation.
    • Digest We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of α=O(T) suffices for the NTK approximation to be valid until training time T. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of α=O(T²).
  • Effective Theory of Transformers at Initialization. [paper]

    • Emily Dinan, Sho Yaida, Susan Zhang.
    • Key Word: Transformers; Neural Tangent Kernel.
    • Digest We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of initialization and training hyperparameters for these models. We then take up such suggestions, training Vision and Language Transformers in practical setups.
  • Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. [paper]

    • Simone Bombari, Shayan Kiyani, Marco Mondelli.
    • Key Word: Neural Tangent Kernel; Random Feature.
    • Digest Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK).
  • Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. [paper]

    • Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang.
    • Key Word: Neural Tangent Kernel; Feature Learning.
    • Digest We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime.

Neural Tangent Kernel: 2022

  • A Kernel Perspective of Skip Connections in Convolutional Networks. [paper]

    • Daniel Barzilai, Amnon Geifman, Meirav Galun, Ronen Basri.
    • Key Word: Neural Tangent Kernel; Gaussian Process; Understanding Skip Connections.
    • Digest Over-parameterized residual networks (ResNets) are amongst the most successful convolutional neural architectures for image processing. Here we study their properties through their Gaussian Process and Neural Tangent kernels. We derive explicit formulas for these kernels, analyze their spectra, and provide bounds on their implied condition numbers. Our results indicate that (1) with ReLU activation, the eigenvalues of these residual kernels decay polynomially at a similar rate compared to the same kernels when skip connections are not used, thus maintaining a similar frequency bias; (2) however, residual kernels are more locally biased.
  • Transfer Learning with Kernel Methods. [paper]

    • Adityanarayanan Radhakrishnan, Max Ruiz Luyten, Neha Prasad, Caroline Uhler.
    • Key Word: Transfer Learning; Neural Tangent Kernel.
    • Digest We propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. In particular, we show that transferring modern kernels trained on large-scale image datasets can result in substantial performance increase as compared to using the same kernel trained directly on the target task.
  • Neural Tangent Kernel: A Survey. [paper]

    • Eugene Golikov, Eduard Pokonechnyy, Vladimir Korviakov.
    • Key Word: Neural Tangent Kernel; Survey.
    • Digest A seminal work [Jacot et al., 2018] demonstrated that training a neural network under specific parameterization is equivalent to performing a particular kernel method as width goes to infinity. This equivalence opened a promising direction for applying the results of the rich literature on kernel methods to neural nets which were much harder to tackle. The present survey covers key results on kernel convergence as width goes to infinity, finite-width corrections, applications, and a discussion of the limitations of the corresponding method.
  • Limitations of the NTK for Understanding Generalization in Deep Learning. [paper]

    • Nikhil Vyas, Yamini Bansal, Preetum Nakkiran.
    • Key Word: Neural Tangent Kernel.
    • Digest In this work, we study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization. In particular, we demonstrate realistic settings where finite-width neural networks have significantly better data scaling exponents as compared to their corresponding empirical and infinite NTKs at initialization. This reveals a more fundamental difference between the real networks and NTKs, beyond just a few percentage points of test accuracy. Further, we show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling. Finally, we show that the empirical NTK continues to evolve throughout most of the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training. Altogether, our work establishes concrete limitations of the NTK approach in understanding generalization of real networks on natural datasets.
  • Fast Finite Width Neural Tangent Kernel. [paper] [code]

    • Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz. ICML 2022
    • Key Word: Neural Tangent Kernel.
    • Digest In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency.
  • On the Generalization Power of the Overfitted Three-Layer Neural Tangent Kernel Model. [paper]

    • Peizhong Ju, Xiaojun Lin, Ness B. Shroff.
    • Key Word: Neural Tangent Kernel.
    • Digest We study the generalization performance of overparameterized 3-layer NTK models. We show that, for a specific set of ground-truth functions (which we refer to as the "learnable set"), the test error of the overfitted 3-layer NTK is upper bounded by an expression that decreases with the number of neurons of the two hidden layers. Different from 2-layer NTK where there exists only one hidden-layer, the 3-layer NTK involves interactions between two hidden-layers. Our upper bound reveals that, between the two hidden-layers, the test error descends faster with respect to the number of neurons in the second hidden-layer (the one closer to the output) than with respect to that in the first hidden-layer (the one closer to the input).
  • Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. [paper]

    • Blake Bordelon, Cengiz Pehlevan.
    • Key Word: Neural Tangent Kernel; Mean Field Theory.
    • Digest We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training.

Neural Tangent Kernel: 2021

  • Neural Tangent Generalization Attacks. [paper] [code]

    • Chia-Hung Yuan, Shan-Hung Wu. ICML 2021
    • Key Word: Neural Tangent Kernel; Poisoning Attacks.
    • Digest We study the generalization attacks against DNNs, where an attacker aims to slightly modify training data in order to spoil the training process such that a trained network lacks generalizability. These attacks can be performed by data owners and protect data from unexpected use. However, there is currently no efficient generalization attack against DNNs due to the complexity of a bilevel optimization involved. We propose the Neural Tangent Generalization Attack (NTGA) that, to the best of our knowledge, is the first work enabling clean-label, black-box generalization attack against DNNs.
  • On the Equivalence between Neural Network and Support Vector Machine. [paper] [code]

    • Yilan Chen, Wei Huang, Lam M. Nguyen, Tsui-Wei Weng. NeurIPS 2021
    • Key Word: Neural Tangent Kernel; Support Vector Machine.
    • Digest We prove the equivalence between neural network (NN) and support vector machine (SVM), specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NN and a broad family of L2 regularized kernel machines (KMs) with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM.
  • An Empirical Study of Neural Kernel Bandits. [paper] [code]

    • Michal Lisicki, Arash Afkanpour, Graham W. Taylor.
    • Key Word: Neural Tangent Kernel.
    • Digest We propose to directly apply distributions induced by neural kernels (NKs) to guide an upper confidence bound or Thompson sampling-based policy. We show that NK bandits achieve state-of-the-art performance on highly non-linear structured data. Furthermore, we analyze practical considerations such as training frequency and model partitioning.
  • A Neural Tangent Kernel Perspective of GANs. [paper] [code]

    • Jean-Yves Franceschi, Emmanuel de Bézenac, Ibrahim Ayed, Mickaël Chen, Sylvain Lamprier, Patrick Gallinari. ICML 2021
    • Key Word: Neural Tangent Kernel; Generative Adversarial Networks.
    • Digest We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We start by pointing out a fundamental flaw in previous theoretical analyses that leads to ill-defined gradients for the discriminator. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network.
  • Reverse Engineering the Neural Tangent Kernel. [paper] [code]

    • James B. Simon, Sajant Anand, Michael R. DeWeese.
    • Key Word: Neural Tangent Kernel.
    • Digest The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning. To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the conjugate or neural tangent kernel of a fully-connected neural network with only one hidden layer.
  • Out-of-Distribution Generalization in Kernel Regression. [paper] [code]

    • Abdulkadir Canatar, Blake Bordelon, Cengiz Pehlevan. NeurIPS 2021
    • Key Word: Out-of-Distribution Generalization; Neural Tangent Kernel.
    • Digest We study generalization in kernel regression when the training and test distributions are different using methods from statistical physics. Using the replica method, we derive an analytical formula for the out-of-distribution generalization error applicable to any kernel and real datasets. We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel as a key determinant of generalization performance under distribution shift.
  • FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis. [paper]

    • Baihe Huang, Xiaoxiao Li, Zhao Song, Xin Yang. ICML 2021
    • Key Word: Federated Learning; Neural Tangent Kernel.
    • Digest This paper presents a new class of convergence analysis for FL, Federated Learning Neural Tangent Kernel (FL-NTK), which corresponds to overparameterized ReLU neural networks trained by gradient descent in FL and is inspired by the analysis in Neural Tangent Kernel (NTK). Theoretically, FL-NTK converges to a global-optimal solution at a linear rate with properly tuned learning parameters. Furthermore, with proper distributional assumptions, FL-NTK can also achieve good generalization.
  • Random Features for the Neural Tangent Kernel. [paper]

    • Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, Jinwoo Shin.
    • Key Word: Neural Tangent Kernel; Random Features.
    • Digest We propose an efficient feature map construction for the NTK of a fully-connected ReLU network, which enables us to apply it to large-scale datasets. We combine random features of the arc-cosine kernels with a sketching-based algorithm that runs in time linear in both the number of data points and the input dimension. We show, both in theory and in practice, that the resulting features need a much smaller dimension than other baseline feature map constructions to achieve comparable error bounds.
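
As a rough illustration of the "random features of the arc-cosine kernels" ingredient mentioned in the last entry, the sketch below approximates the order-1 arc-cosine kernel with random ReLU features and compares it to the closed form. It is not the paper's NTK feature map: the sketching step is omitted, and the function names, feature count, and seed are assumptions chosen for this example.

```python
import math
import random


def sample_weights(num_features, dim, seed=0):
    """Draw num_features Gaussian directions w ~ N(0, I) (seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def relu_features(x, weights):
    """Map x to sqrt(2/m) * relu(W x), so that feature inner products estimate the kernel."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * max(sum(wi * xi for wi, xi in zip(w, x)), 0.0) for w in weights]


def approx_kernel(x, y, weights):
    """Monte Carlo estimate of the order-1 arc-cosine kernel."""
    fx = relu_features(x, weights)
    fy = relu_features(y, weights)
    return sum(a * b for a, b in zip(fx, fy))


def arc_cosine_kernel_order1(x, y):
    """Closed form: (1/pi) * |x| * |y| * (sin(t) + (pi - t) * cos(t)), t = angle(x, y)."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    cos_t = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    t = math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against rounding error
    return nx * ny * (math.sin(t) + (math.pi - t) * math.cos(t)) / math.pi


if __name__ == "__main__":
    weights = sample_weights(20000, 3)
    x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    print("estimate:", approx_kernel(x, y, weights))
    print("closed form:", arc_cosine_kernel_order1(x, y))
```

The estimate is unbiased, and its error shrinks roughly like one over the square root of the number of features.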

Neural Tangent Kernel: 2020

  • Mathematical Models of Overparameterized Neural Networks. [paper] [code]

    • Cong Fang, Hanze Dong, Tong Zhang. Proceedings of the IEEE
    • Key Word: Neural Tangent Kernel; Mean-Field Theory.
    • Digest Practitioners have long known that overparameterized neural networks are easy to train, and the past few years have seen important theoretical developments in their analysis. In particular, it was shown that such systems behave like convex systems under various restricted settings, such as for two-layer NNs, and when learning is restricted locally in the so-called neural tangent kernel space around specialized initializations. This paper discusses some of this recent progress, which has led to a significantly better understanding of neural networks. We will focus on the analysis of two-layer neural networks, and explain the key mathematical models, with their algorithmic implications.
  • Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel. [paper]

    • Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli. NeurIPS 2020
    • Key Word: Neural Tangent Kernel.
    • Digest In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK.
  • Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks. [paper]

    • Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler.
    • Key Word: Convolutional Neural Tangent Kernel.
    • Digest We demonstrate that the test risk of over-parameterized convolutional networks is a U-shaped curve (i.e. monotonically decreasing, then increasing) with increasing depth. We first provide empirical evidence for this phenomenon via image classification experiments using both ResNets and the convolutional neural tangent kernel (CNTK). We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK.
  • Finite Versus Infinite Neural Networks: an Empirical Study. [paper] [code]

    • Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein. NeurIPS 2020
    • Key Word: Neural Tangent Kernel.
    • Digest We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime.
  • Bayesian Deep Ensembles via the Neural Tangent Kernel. [paper] [code]

    • Bobby He, Balaji Lakshminarayanan, Yee Whye Teh.
    • Key Word: Neural Tangent Kernel.
    • Digest We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK): a recent development in understanding the training dynamics of wide neural networks (NNs). Previous work has shown that even in the infinite width limit, when NNs become GPs, there is no GP posterior interpretation to a deep ensemble trained with squared error loss. We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member, that enables a posterior interpretation in the infinite width limit.
  • The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks. [paper]

    • Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington. NeurIPS 2020
    • Key Word: Neural Tangent Kernel.
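    • Digest Modern neural networks are often regarded as complex black-box functions whose behavior is hard to understand because of their nonlinear dependence on the data and their nonconvex loss landscapes. We show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs.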
  • When Do Neural Networks Outperform Kernel Methods? [paper] [code]

    • Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari. NeurIPS 2020
    • Key Word: Neural Tangent Kernel.
    • Digest Under a certain initialization scaling, wide neural networks (NNs) are well approximated by reproducing kernel Hilbert space (RKHS) methods, yet there are special examples on which NNs provably outperform RKHS methods. To reconcile these observations, the paper asks for which tasks NNs outperform RKHS methods. If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work.
  • A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks. [paper]

    • Zixiang Chen, Yuan Cao, Quanquan Gu, Tong Zhang. NeurIPS 2020
    • Key Word: Neural Tangent Kernel; Mean Field Theory.
    • Digest We provide a generalized neural tangent kernel analysis and show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior. This implies that the training loss converges linearly up to a certain accuracy. We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.

Neural Tangent Kernel: 2019

  • Disentangling Trainability and Generalization in Deep Neural Networks. [paper]

    • Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz. ICML 2020
    • Key Word: Neural Tangent Kernel.
    • Digest We characterize the conditions under which a neural network architecture is trainable, and how well it generalizes, in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence.
  • Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee. [paper]

    • Wei Hu, Zhiyuan Li, Dingli Yu. ICLR 2020
    • Key Word: Neural Tangent Kernel; Regularization.
    • Digest This paper proposes and analyzes two simple and intuitive regularization methods: (i) regularization by the distance between the network parameters to initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels.
  • On Exact Computation with an Infinitely Wide Neural Net. [paper] [code]

    • Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang. NeurIPS 2019
    • Key Word: Neural Tangent Kernel.
    • Digest The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
  • Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation. [paper]

    • Greg Yang.
    • Key Word: Neural Tangent Kernel.
    • Digest Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline *tensor program* that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized.

Neural Tangent Kernel: 2018

  • A Convergence Theory for Deep Learning via Over-Parameterization. [paper]

    • Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song. ICML 2019
    • Key Word: Stochastic Gradient Descent; Neural Tangent Kernel.
    • Digest We prove why stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: polynomial in L, the number of layers and in n, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
  • Neural Tangent Kernel: Convergence and Generalization in Neural Networks. [paper]

    • Arthur Jacot, Franck Gabriel, Clément Hongler. NeurIPS 2018
    • Key Word: Neural Tangent Kernel.
    • Digest We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK).
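
For readers new to the topic, the kernel named in the last entry can be made concrete at finite width as the Gram matrix of parameter gradients, K(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩. The sketch below computes this empirical NTK for a small two-layer ReLU network in plain Python. The architecture, the 1/√m output scaling, and all function names are assumptions made for this illustration, not code from any of the listed papers.

```python
import math
import random


def init_params(width, dim, seed=0):
    """Standard-normal weights for f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)."""
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def forward(params, x):
    """Scalar network output at input x."""
    w, a = params
    return sum(aj * max(dot(wj, x), 0.0) for wj, aj in zip(w, a)) / math.sqrt(len(a))


def grad(params, x):
    """Flattened gradient of the output with respect to all parameters."""
    w, a = params
    scale = 1.0 / math.sqrt(len(a))
    g = []
    for wj, aj in zip(w, a):
        pre = dot(wj, x)
        active = 1.0 if pre > 0.0 else 0.0
        g.extend(scale * aj * active * xi for xi in x)  # entries for d f / d w_j
        g.append(scale * max(pre, 0.0))                 # entry for d f / d a_j
    return g


def empirical_ntk(params, xs):
    """Gram matrix of parameter gradients over the inputs xs."""
    grads = [grad(params, x) for x in xs]
    return [[dot(gi, gj) for gj in grads] for gi in grads]


if __name__ == "__main__":
    params = init_params(width=64, dim=3)
    inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
    for row in empirical_ntk(params, inputs):
        print(row)
```

Because the matrix is a Gram matrix, it is symmetric and positive semidefinite by construction. The infinite-width results listed above concern the limit of this object as the width grows.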

Others

Others: 2024

  • Theoretical limitations of multi-layer Transformer. [paper]

    • Lijie Chen, Binghui Peng, Hongxun Wu.
    • Key Word: Transformer; Chain-of-Thought.
    • Digest This paper establishes the first unconditional lower bound on the expressive power of multi-layer decoder-only transformers, demonstrating that they require polynomially large dimensions to perform sequential composition of L functions over n tokens for any constant L. It reveals that multi-layer transformers face an exponential depth-width trade-off, where fewer layers make tasks exponentially harder, and highlights an advantage of encoders over decoders for certain tasks, as well as the exponential simplification of tasks using chain-of-thought reasoning. The authors introduce a novel multi-party autoregressive communication model and a new proof technique for deriving lower bounds, providing foundational tools for understanding the computational power of transformers.
  • Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers. [paper]

    • Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou.
    • Key Word: Benign Overfitting; Feature Learning.
    • Digest This paper studies benign overfitting in over-parameterized neural networks, specifically focusing on two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers. It finds that the initialization scale of the output layer significantly affects training dynamics. Large initialization scales make training similar to fixed-output scenarios, with the hidden layer growing while the output layer remains stable. Small scales lead to complex interactions where both layers grow proportionally. The paper also provides bounds on test errors, identifying conditions on initialization scale and signal-to-noise ratio (SNR) that determine whether benign overfitting occurs. Numerical experiments support these findings.
  • Simplicity Bias via Global Convergence of Sharpness Minimization. [paper]

    • Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka.
    • Key Word: Simplicity Bias; Sharpness Minimization.
    • Digest This paper investigates the connection between the generalization ability of neural networks, typically attributed to the implicit bias of stochastic gradient descent (SGD), and the simplicity of the final trained model, particularly in relation to low-rank features. The authors focus on label noise SGD, which tends to converge to flatter regions of the loss landscape. They demonstrate that for two-layer neural networks, label noise SGD converges to a solution where all neurons replicate a single linear feature, leading to a rank-one feature matrix. Their key contribution is showing that label noise SGD minimizes sharpness on the zero-loss manifold and discovering a novel property of local geodesic convexity in the trace of the Hessian.
  • Loss Landscape Characterization of Neural Networks without Over-Parametrization. [paper]

    • Rustem Islamov, Niccolò Ajroldi, Antonio Orvieto, Aurelien Lucchi.
    • Key Word: Loss Landscape; Over-Parameterization; Invexity.
    • Digest This paper addresses the challenge of ensuring convergence in optimization methods for deep learning, where the loss landscapes are non-convex. While the Polyak-Lojasiewicz (PL) inequality offers a common structural condition for convergence, it requires impractical over-parametrization in deep networks. The authors propose a new class of functions that describe the loss landscape of modern deep models without needing extensive over-parametrization and can account for saddle points. They prove that gradient-based optimizers converge under this new assumption and support it with theoretical analysis and empirical experiments.
  • Leveraging free energy in pretraining model selection for improved fine-tuning. [paper]

    • Michael Munn, Susan Wei.
    • Key Word: Model Selection; Free Energy.
    • Digest This paper explores the success of the pretrain-then-adapt paradigm in artificial intelligence models, like BERT and GPT, and introduces a Bayesian model selection criterion called downstream free energy. This criterion evaluates a model’s adaptability to downstream tasks by measuring the concentration of favorable parameters near the pretrained checkpoint, without needing access to the downstream data. The authors show that this criterion correlates with improved fine-tuning performance, providing a way to predict how well pretrained models will adapt to new tasks.
  • Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective. [paper]

    • Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma.
    • Key Word: Gradient Descent Dynamics; Loss Landscape.
    • Digest The paper studies the Warmup-Stable-Decay (WSD) learning rate schedule, which allows training language models without a pre-fixed compute budget. WSD uses a constant learning rate for most of the training (stable phase) and then applies a rapidly decaying learning rate (decay phase) to produce strong models. Unlike traditional schedules, WSD creates a loss curve where the loss stays high during the stable phase and sharply drops in the decay phase. The authors explain this using a “river valley” landscape analogy, where large oscillations during the stable phase drive fast progress, and the decay phase fine-tunes the optimization. They propose WSD-S, a variant that reuses decayed checkpoints, outperforming WSD and Cyclic-Cosine in generating language models across different compute budgets.
  • The Optimization Landscape of SGD Across the Feature Learning Strength. [paper]

    • Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan.
    • Key Word: Loss Landscape; Feature Learning.
    • Digest This paper investigates the effect of scaling a neural network's final layer by a hyperparameter γ, which controls feature learning dynamics. The study explores how γ interacts with the learning rate η across various models and datasets in online training. The authors identify optimal learning rate scaling regimes, where η* scales with γ² when γ is small and with γ²/L for deep networks when γ is large. In the under-explored "ultra-rich" γ≫1 regime, networks exhibit distinctive loss curves with plateaus and steps, optimizing similarly across large γ values. The study highlights the importance of tuning γ for optimal performance and suggests further analytical exploration of the large-γ limit.
  • Provable Weak-to-Strong Generalization via Benign Overfitting. [paper]

    • David X. Wu, Anant Sahai.
    • Key Word: Weak-to-Strong Generalization; Benign Overfitting.
    • Digest This paper explores weak-to-strong generalization, where a weak teacher supervises a strong student using imperfect pseudolabels, as introduced by Burns et al. (2023). The authors theoretically analyze this paradigm for binary and multilabel classification in an overparameterized Gaussian model, where the weak teacher’s pseudolabels are nearly random. They identify two outcomes for the student: successful generalization or random guessing. Their results highlight the importance of logits for weak supervision and include a new tight lower bound for the maximum of correlated Gaussians, potentially useful for extending to multiclass classification.
  • Autoregressive Large Language Models are Computationally Universal. [paper]

    • Dale Schuurmans, Hanjun Dai, Francesco Zanini.
    • Key Word: Autoregressive Large Language Model; Universal Turing Machine.
    • Digest This paper demonstrates that autoregressive decoding of a transformer-based language model can achieve universal computation without modifying the model’s weights. The authors introduce a generalization of autoregressive decoding, where emitted tokens extend the context window for processing long inputs. They show this system corresponds to a Lag system, a known computationally universal model. By proving a universal Turing machine can be simulated with 2027 production rules, they test whether a large language model can mimic this behavior. They confirm that gemini-1.5-pro-001, with a specific prompt and greedy decoding, can function as a general-purpose computer under the Church-Turing thesis.
  • On the Geometry of Deep Learning. [paper]

    • Randall Balestriero, Ahmed Imtiaz Humayun, Richard Baraniuk.
    • Key Word: Geometry.
    • Digest This paper explores the mathematical foundations of deep learning, focusing on the connection between deep networks and function approximation using affine splines (piecewise linear functions). It reviews recent work on the geometrical properties of deep networks, particularly how they tessellate input space, demonstrating how this perspective can enhance our understanding and optimization of deep networks.
  • Understanding Finetuning for Factual Knowledge Extraction. [paper]

    • Gaurav Ghosal, Tatsunori Hashimoto, Aditi Raghunathan.
    • Key Word: Fine-Tuning; Factual Knowledge.
    • Digest Fine-tuning question-answering models on lesser-known facts results in worse factuality compared to using well-known facts, as models may produce generic responses rather than accurate ones. This study shows that fine-tuning with well-known data improves performance, highlighting the need to consider the storage of facts in pretrained models for effective fine-tuning.
  • Hardness of Learning Neural Networks under the Manifold Hypothesis. [paper]

    • Bobak T. Kiani, Jason Wang, Melanie Weber.
    • Key Word: Manifold Hypothesis; Hardness of Learning Neural Networks.
    • Digest The paper investigates the difficulty of learning neural networks under the manifold hypothesis, which posits that high-dimensional data lies on or near a low-dimensional manifold. It demonstrates that learning is hard under manifolds of bounded curvature but becomes feasible with additional assumptions on the manifold's volume, suggesting that certain geometric properties can significantly impact the learnability of neural networks.
  • Is In-Context Learning in Large Language Models Bayesian? A Martingale Perspective. [paper]

    • Fabian Falck, Ziyu Wang, Chris Holmes.
    • Key Word: In-Context Learning; Bayesian Inference.
    • Digest The paper examines the hypothesis that in-context learning (ICL) in large language models (LLMs) functions as Bayesian inference by analyzing the martingale property, a key requirement for Bayesian learning with exchangeable data. The authors find that while the martingale property is necessary for unambiguous predictions and principled uncertainty, their experiments reveal violations of this property and deviations from expected Bayesian behavior, thus challenging the hypothesis that ICL is inherently Bayesian.
  • Why Larger Language Models Do In-context Learning Differently? [paper]

    • Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang.
    • Key Word: Large Language Model; In-Context Learning.
    • Digest Large language models (LLMs) exhibit in-context learning (ICL), performing well on new tasks using brief examples without parameter adjustments. This study theoretically explores why larger models are more sensitive to noise, finding that smaller models focus on key features and are more robust, while larger models cover more features and are more easily distracted, supported by preliminary experimental results.
  • Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics. [paper]

    • Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell.
    • Key Word: Large Language Model; Reasoning; Training Dynamics; Reversal Curse.
    • Digest Auto-regressive large language models struggle with simple logical reasoning tasks like inverse search, known as the "reversal curse." Through analyzing the training dynamics of two auto-regressive models, this paper reveals that the asymmetry in the weights is a core reason for the reversal curse and shows the necessity of chain-of-thought for one-layer transformers.
  • Understanding LLMs Requires More Than Statistical Generalization. [paper]

    • Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár.
    • Key Word: Large Language Model; Generalization Measure; Transferability; Inductive Biases.
    • Digest Noting that the last decade has seen blossoming research in deep learning theory attempting to answer "Why does deep learning generalize?", this paper argues for a shift in perspective in understanding the desirable qualities of large language models (LLMs). The authors highlight the non-identifiability of autoregressive (AR) probabilistic models, where models with zero or near-zero KL divergence can exhibit different behaviors. They provide mathematical examples and empirical observations to support their argument and discuss the practical relevance of non-identifiability through three case studies. The paper concludes by reviewing research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
  • Categorical Deep Learning: An Algebraic Theory of Architectures. [paper]

    • Bruno Gavranović, Paul Lessard, Andrew Dudzik, Tamara von Glehn, João G. M. Araújo, Petar Veličković.
    • Key Word: Category Theory.
    • Digest This paper addresses the challenge of creating a universal framework for defining and analyzing deep learning architectures. It criticizes previous efforts for failing to effectively link the theoretical constraints of models with their practical implementations. The authors suggest using category theory, specifically the universal algebra of monads within a 2-category of parametric maps, as a comprehensive theory that can encompass both theoretical and practical aspects of neural network design. They argue that this approach can accurately represent constraints found in geometric deep learning and implementations across various neural network architectures, including Recurrent Neural Networks (RNNs). Additionally, they demonstrate how their theory can naturally express standard concepts in computer science and automata theory.
  • A PAC-Bayesian Link Between Generalisation and Flat Minima. [paper]

    • Maxime Haddouche, Paul Viallard, Umut Şimşekli, Benjamin Guedj.
    • Key Word: PAC-Bayes Generalization; Flat Minima.
    • Digest The paper presents new generalization bounds for machine learning predictors in overparameterized settings, where the number of parameters exceeds dataset size. These bounds, which focus on gradient terms, are derived using the PAC-Bayes framework along with Poincaré and Log-Sobolev inequalities, circumventing the need for explicit consideration of predictor space dimension. The findings emphasize the beneficial impact of flat minima on generalization performance, underscoring the advantage of the optimization phase in enhancing model generalizability without directly depending on the model's complexity or dataset size.
  • Tighter Generalisation Bounds via Interpolation. [paper]

    • Paul Viallard, Maxime Haddouche, Umut Şimşekli, Benjamin Guedj.
    • Key Word: PAC-Bayes Generalization Bounds.
    • Digest This paper introduces a method for creating new PAC-Bayes generalization bounds using the (f,Γ)-divergence. It also presents interpolated PAC-Bayes bounds across various probability divergences, such as KL, Wasserstein, and total variation, tailored to the properties of posterior distributions. The study evaluates the tightness of these bounds and links them to established results in statistical learning, identifying them as specific instances. Furthermore, by implementing these bounds as training objectives, the paper demonstrates their effectiveness in providing significant guarantees and practical performance improvements in machine learning models.
  • Provably learning a multi-head attention layer. [paper]

    • Sitan Chen, Yuanzhi Li.
    • Key Word: Multi-Head Attention; Learning Theory.
    • Digest The paper discusses the multi-head attention layer, a crucial feature of the transformer architecture that differentiates it from conventional feed-forward models. It explores the theoretical aspects of learning a multi-head attention layer through random examples, presenting the first significant upper and lower bounds for this challenge. The findings include a method that can learn the function of the multi-head attention layer with small error under specific conditions, using random labeled examples from a defined set. The study also indicates that an exponential dependency on the number of attention heads (m) is inevitable in the worst-case scenarios. This research uses Boolean inputs to reflect the discrete nature of tokens in large language models but notes that the approach can be adapted to continuous settings. The proposed algorithm, which diverges from traditional methods by focusing on shaping a convex body around unknown parameters, offers a new direction in provable learning algorithms beyond the common reliance on the Gaussian distribution's properties.
  • Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling. [paper]

    • Mingze Wang, Weinan E.
    • Key Word: Expressivity; Transformer; Self-Attention Mechanism; Positional Encoding.
    • Digest We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads, and these insights also provide natural suggestions for alternative architectures.
  • Residual Alignment: Uncovering the Mechanisms of Residual Networks. [paper]

    • Jianing Li, Vardan Papyan. NeurIPS 2023
    • Key Word: ResNet; Neural Collapse; Neural ODE; Optimal Transport.
    • Digest This paper examines the ResNet architecture in deep learning, focusing on understanding its effectiveness through an analysis of its residual blocks. The study uncovers a phenomenon called Residual Alignment (RA), characterized by four properties: (RA1) even distribution of intermediate representations in high-dimensional space; (RA2) alignment of singular vectors in Residual Jacobians across different network depths; (RA3) limitation of Residual Jacobians' rank by the number of classes in fully-connected ResNets; and (RA4) inverse scaling of top singular values with network depth. Residual Alignment is found to be crucial for the model's performance, occurring in well-generalizing models across various architectures and datasets. The absence of RA when skip connections are removed highlights their importance. The paper also proposes a mathematical model supporting these findings.
  • A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models. [paper]

    • Namjoon Suh, Guang Cheng.
    • Key Word: Survey; Learning Theory; Neural Tangent Kernel; Mean-Field Theory; Approximation Theory; Generative Modeling.
    • Digest The paper reviews statistical theories of neural networks, focusing on three areas: first, it analyzes neural network risks and construction within nonparametric frameworks, noting limitations in the analysis of deep networks. Second, it discusses training dynamics, especially how networks trained via gradient-based methods generalize. This section reviews two key paradigms: Neural Tangent Kernel (NTK) and Mean-Field (MF). Finally, it examines advances in generative models, notably Generative Adversarial Networks (GANs), diffusion models, and in-context learning in Large Language Models. The paper concludes with future directions for deep learning theory.
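
As an illustration of the Warmup-Stable-Decay (WSD) schedule described in the river-valley entry of this list, the following is a minimal sketch under stated assumptions. The function name `wsd_lr`, its default values, and the linear shape of both the warmup and the decay are hypothetical choices for illustration; the digest only says the decay is rapid and does not fix a particular shape.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_steps=100,
           decay_steps=200, min_lr=0.0):
    """Learning rate at `step` for a warmup-stable-decay schedule.

    Assumed shapes (not taken from the paper): linear warmup to
    `peak_lr`, a constant stable phase, then a linear decay to
    `min_lr` over the last `decay_steps` steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the learning rate constant.
        return peak_lr
    # Decay phase: interpolate from peak_lr down to min_lr.
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)


# Sample the schedule at a few steps of a 1000-step run.
schedule = [wsd_lr(s, total_steps=1000) for s in (0, 99, 500, 900, 1000)]
print(schedule)
```

Because the stable phase does not depend on `total_steps`, a run can be extended by continuing from a stable-phase checkpoint and only choosing where the decay starts afterwards, which is the property the digest highlights.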

Others: 2023

  • Learning Theory from First Principles. [paper]

    • Francis Bach.
    • Key Word: Learning Theory; Book.
    • Digest The goal of the class (and thus of this textbook) is to present old and recent results in learning theory for the most widely-used learning architectures. This class is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are significant users of learning methods such as computer vision or natural language processing.
  • Understanding the Regularity of Self-Attention with Optimal Transport. [paper]

    • Valérie Castin, Pierre Ablin, Gabriel Peyré.
    • Key Word: Self-Attention; Optimal Transport.
    • Digest This paper analyzes the robustness of self-attention mechanisms in Transformers from a theoretical perspective. It studies the local Lipschitz constant of self-attention as a way to measure robustness agnostic to specific attacks. Using a measure-theoretic framework with the Wasserstein distance, it derives bounds on the Lipschitz constant on compact input spaces, showing it grows exponentially with input radius. It also finds measures with high Lipschitz constants typically have unbalanced mass concentrated in a few locations. Finally, it examines self-attention stability under perturbations changing token numbers, identifying a "mass splitting" phenomenon where duplicating tokens before perturbation can be a more effective attack.
  • Challenges with unsupervised LLM knowledge discovery. [paper]

    • Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah.
    • Key Word: Large Language Model; Unsupervised Knowledge Discovery.
    • Digest The paper demonstrates that current unsupervised methods for large language models do not effectively uncover knowledge, but rather emphasize prominent features of the model's activations. It introduces the concept of consistency structure for knowledge elicitation and presents experiments revealing that unsupervised methods may prioritize a different prominent feature over knowledge. The paper concludes that existing unsupervised methods are inadequate for discovering latent knowledge and suggests sanity checks for evaluating future knowledge elicitation methods. Additionally, it hypothesizes that identification issues, such as distinguishing a model's knowledge from that of a simulated character's, will persist in future unsupervised methods.
  • Proving Linear Mode Connectivity of Neural Networks via Optimal Transport. [paper]

    • Damien Ferbach, Baptiste Goujaud, Gauthier Gidel, Aymeric Dieuleveut.
    • Key Word: Linear Mode Connectivity; Optimal Transport.
    • Digest This paper explores the energy landscape of high-dimensional non-convex optimization problems in deep neural networks. It theoretically explains the empirical observation that different solutions found in stochastic training are often connected by simple continuous paths, such as linear ones. The framework is based on convergence rates in Wasserstein distance, showing that wide two-layer neural networks trained with stochastic gradient descent are linearly connected with high probability. The paper also provides upper and lower bounds on the layer width for linear connectivity in deep neural networks. Empirical evidence supports the approach, linking the dimension of weight distribution support with Wasserstein convergence rates and linear mode connectivity.
  • Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. [paper]

    • Miao Lu, Beining Wu, Xiaodong Yang, Difan Zou.
    • Key Word: Stochastic Gradient Descent; Large Learning Rate; Feature Learning.
    • Digest This paper investigates the generalization of neural networks trained using a stochastic gradient descent (SGD) algorithm with large learning rates. The key finding is that the weight oscillations caused by this training regime, termed "benign oscillation," can improve generalization compared to networks trained with smaller learning rates that converge more smoothly. The theory is based on a feature learning perspective and demonstrates that large learning rate SGD allows networks to effectively learn weak features in the presence of strong features. Experimental results support the concept of "benign oscillation."
  • It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models. [paper]

    • Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar.
    • Key Word: Bias and Variance.
    • Digest This paper challenges the conventional idea that bias and variance in machine learning trade off against each other. Instead, it demonstrates that, in deep learning ensemble models, bias and variance are closely related for correctly classified samples. The paper provides empirical evidence across different deep learning models and datasets. It also explores this phenomenon theoretically from two perspectives: calibration and neural collapse. The findings suggest a connection between bias and variance in these models.
  • Why Does Sharpness-Aware Minimization Generalize Better Than SGD? [paper]

    • Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, Quanquan Gu. NeurIPS 2023
    • Key Word: Sharpness-Aware Minimization.
    • Digest This paper addresses the problem of overfitting in large neural networks and introduces Sharpness-Aware Minimization (SAM) as a method to improve generalization, even in the presence of label noise. It specifically investigates why SAM outperforms Stochastic Gradient Descent (SGD) in certain scenarios, using two-layer convolutional ReLU networks and a nonsmooth loss landscape. The paper's findings suggest that SAM prevents early noise learning, making feature learning more effective. Experimental results on synthetic and real data support these theoretical insights.
  • Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP. [paper]

    • Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu.
    • Key Word: Zero-Shot Transfer; Multi-Modal Foundation Models.
    • Digest This paper focuses on multi-modal learning, which combines information from different data sources like text and images to enhance model performance. It highlights the success of CLIP, a method that learns joint image and text representations through contrastive pretraining. While CLIP has shown practical success, this paper aims to provide a formal theoretical understanding of its representation learning and how it aligns features from different modalities. The paper also analyzes CLIP's performance in zero-shot transfer tasks and introduces a new CLIP-type approach inspired by their analysis, which outperforms CLIP and other state-of-the-art methods on benchmark datasets.
  • Physics of Language Models: Part 3.2, Knowledge Manipulation. [paper]

    • Zeyuan Allen-Zhu, Yuanzhi Li.
    • Key Word: Large Language Models.
    • Digest This paper investigates a language model's ability to use its stored knowledge for various types of logical reasoning, including retrieval, classification, comparison, and inverse search. The study finds that pre-trained language models like GPT2/3/4 perform well in knowledge retrieval but struggle with classification and comparison tasks unless Chain of Thoughts (CoTs) are used during both training and inference. They also perform poorly in inverse knowledge search, regardless of the prompts. The paper's main contribution is a synthetic dataset that confirms these limitations, showing that language models cannot efficiently manipulate their stored knowledge from pre-training data, even when the knowledge is perfectly stored and extractable in the models, and despite fine-tuning with appropriate instructions.
  • Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. [paper]

    • Zeyuan Allen-Zhu, Yuanzhi Li.
    • Key Word: Large Language Models.
    • Digest This paper investigates how large language models answer questions, specifically whether they rely on memorization or genuine knowledge extraction. Using controlled semi-synthetic biography data, the study reveals a connection between the model's knowledge extraction ability and diversity measures of the training data. The paper employs linear probing techniques, showing a strong correlation between this relationship and whether the model encodes knowledge attributes in a linear fashion within the entity names' hidden embeddings or across other tokens in the training text.
  • Fantastic Generalization Measures are Nowhere to be Found. [paper]

    • Michael Gastpar, Ido Nachum, Jonathan Shafer, Thomas Weinberger.
    • Key Word: Generalization Bound; Overparameterization.
    • Digest The paper discusses generalization bounds for neural networks in the overparameterized setting. It highlights that existing generalization bounds are not tight enough to explain neural network performance. The paper examines two common types of generalization bounds: those depending on training data and output and those considering the learning algorithm. It mathematically proves that no bound of the first type can be uniformly tight in the overparameterized setting. For the second type, it shows a trade-off between algorithm performance and bound tightness, suggesting that tight generalization bounds are not possible without suitable assumptions on the population distribution.
  • Implicit regularization of deep residual networks towards neural ODEs. [paper]

    • Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau.
    • Key Word: Implicit Regularization; Neural Ordinary Differential Equations.
    • Digest This paper establishes an implicit regularization connection between deep residual networks and neural ordinary differential equations (ODEs) when trained with gradient flow. It proves that if the network is initially set up as a discretization of a neural ODE, this relationship persists during training. These findings are valid for both finite training times and in the limit of infinite training time under certain conditions. The paper also demonstrates this connection in the context of specific residual network architectures and shows numerical experiments to support the results.
  • On the Implicit Bias of Adam. [paper]

    • Matias D. Cattaneo, Jason M. Klusowski, Boris Shigida.
    • Key Word: Adam; Implicit Bias; Ordinary Differential Equations.
    • Digest This paper explores the concept of implicit regularization in optimization algorithms like RMSProp and Adam, comparing it to previous work on gradient descent trajectories. It demonstrates that these algorithms exhibit implicit regularization effects influenced by their hyperparameters and training stage. Specifically, they either penalize the one-norm of loss gradients or hinder its decrease. The paper supports these findings with numerical experiments and discusses their potential impact on generalization in machine learning.
  • Transformers as Support Vector Machines. [paper]

    • Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak.
    • Key Word: Transformer; Implicit Regularization.
    • Digest This paper establishes a formal equivalence between self-attention in transformers and a hard-margin SVM problem. It characterizes the convergence behavior of 1-layer transformers optimized with gradient descent, showing that it can converge towards locally-optimal directions. The paper also demonstrates that over-parameterization facilitates global convergence and introduces a more general SVM equivalence for nonlinear heads. These findings suggest interpreting transformers as a hierarchy of SVMs for token selection and separation.
  • Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization. [paper]

    • Kaiyue Wen, Tengyu Ma, Zhiyuan Li.
    • Key Word: Sharpness-Aware Minimization.
    • Digest The paper investigates the relationship between flatness and generalization in overparameterized neural networks. It identifies three scenarios for two-layer ReLU networks: (1) flatness implies generalization, (2) non-generalizing flattest models exist, and sharpness minimization algorithms fail to generalize, and (3) non-generalizing flattest models exist, but sharpness minimization algorithms still generalize. These findings indicate that the connection between sharpness and generalization depends on data distributions and model architectures, prompting the need to explore alternative explanations for the generalization of overparameterized neural networks.
  • On the curvature of the loss landscape. [paper]

    • Alison Pouplin, Hrittik Roy, Sidak Pal Singh, Georgios Arvanitidis.
    • Key Word: Loss Landscape; Scalar Curvature; Riemannian Manifold.
    • Digest The paper investigates the challenge of understanding the excellent performance of over-parameterized deep learning models when trained on limited data. It proposes analyzing the generalization abilities of deep neural networks by treating the loss landscape as an embedded Riemannian manifold, focusing on the computable scalar curvature and its connections to potential generalization.
  • Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. [paper]

    • Minhak Song, Chulhee Yun.
    • Key Word: Edge of Stability; Bifurcation Theory.
    • Digest The paper explores the Edge of Stability (EoS) phenomenon observed in the evolution of the largest eigenvalue of the loss Hessian during gradient descent (GD) training. It demonstrates that GD trajectories, when EoS occurs, align on a specific bifurcation diagram, independent of initialization, and provides rigorous proofs for this trajectory alignment in specific network architectures.
  • How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model. [paper]

    • Leonardo Petrini, Francesco Cagnetta, Umberto M. Tomasini, Alessandro Favero, Matthieu Wyart.
    • Key Word: Synonymic Invariance; Random Hierarchy Model.
    • Digest This paper explores how deep convolutional neural networks (CNNs) learn compositional data by investigating the Random Hierarchy Model, demonstrating that the number of training data required by deep CNNs grows asymptotically as a polynomial function of the input dimensionality.
  • Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space. [paper]

    • Zhengdao Chen.
    • Key Word: Reproducing Kernel Hilbert Spaces; Mean-Field Theory.
    • Digest The paper introduces the concept of Neural Hilbert Ladders (NHL), which views a multi-layer neural network as a hierarchy of reproducing kernel Hilbert spaces (RKHSs). It provides a generalized function space and complexity measure for deep neural networks (DNNs) and explores their theoretical properties and implications. The paper establishes a correspondence between L-layer neural networks and L-level NHLs, proves generalization guarantees for learning an NHL, analyzes the dynamics of NHLs in the infinite-width mean-field limit, demonstrates depth separation in NHLs under different activation functions, and supports the theory with numerical results.
  • Sparsity aware generalization theory for deep neural networks. [paper]

    • Ramchandran Muthukumar, Jeremias Sulam.
    • Key Word: Sparse Activation; Sensitivity Analysis; PAC-Bayes Bounds.
    • Digest The paper presents a novel approach to analyzing the generalization capabilities of deep feed-forward ReLU networks by considering the degree of sparsity in the hidden layer activations, revealing trade-offs between sparsity and generalization without strong assumptions about sparsity levels.
  • Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows. [paper]

    • Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré.
    • Key Word: Gradient Flow; Conservation Laws; Lie Algebra.
    • Digest This paper explores the concept of "conservation laws" in gradient flows and their relevance to understanding the implicit bias and generalization properties of over-parameterized machine learning models, presenting a rigorous definition of conservation laws, methods to determine the number of these quantities, and algorithms to compute polynomial and non-polynomial conservation laws.
  • Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima. [paper]

    • Dongkuk Si, Chulhee Yun.
    • Key Word: Sharpness-Aware Minimization; Convergence.
    • Digest This paper explores the convergence properties of the Sharpness-Aware Minimization (SAM) optimizer when used with practical configurations, such as a constant perturbation size and gradient normalization, and finds that SAM has limited capability to converge to global minima or stationary points in many scenarios.
  • Transformers learn through gradual rank increase. [paper]

    • Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind.
    • Key Word: Transformer; Gradual Rank.
    • Digest The paper identifies incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank, supported by theoretical proofs and experimental results.
  • Learning via Wasserstein-Based High Probability Generalisation Bounds. [paper]

    • Paul Viallard, Maxime Haddouche, Umut Simsekli, Benjamin Guedj.
    • Key Word: Generalization Bounds; PAC-Bayes; Wasserstein Distance.
    • Digest This work addresses the limitations of the PAC-Bayesian framework and introduces novel Wasserstein distance-based PAC-Bayesian generalization bounds. Previous bounds relying on the Kullback-Leibler (KL) divergence were limited in capturing the geometric structure of learning problems. The proposed bounds overcome this by utilizing the Wasserstein distance, which offers stronger guarantees in terms of high probability, applicability to unbounded losses, and optimizable training objectives. The derived Wasserstein-based PAC-Bayesian learning algorithms demonstrate empirical advantages in various experiments.
  • Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training. [paper]

    • Rie Johnson, Tong Zhang.
    • Key Word: Generalization Gap; Inconsistency; Instability.
    • Digest The authors study how the stochasticity of training deep neural networks affects their generalization gap. They propose two measures, inconsistency and instability, that can be computed on unlabeled data and show that they are correlated with the generalization gap. They also suggest ways to reduce inconsistency and improve performance. They claim that inconsistency is more informative than loss sharpness for predicting the generalization gap.
  • What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization. [paper]

    • Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, Zhaoran Wang.
    • Key Word: In-Context Learning; Transformer; Bayesian Model Averaging.
    • Digest This paper studies In-Context Learning (ICL), the ability of large language models to learn new tasks from a few examples in the context. The paper answers three questions: (a) How do language models perform ICL? (b) How to measure ICL performance and error rates? (c) What makes the transformer architecture suitable for ICL? The paper shows that ICL can be seen as an implicit Bayesian inference process that leverages the attention mechanism. The paper also analyzes the ICL regret, approximation and generalization bounds from an online learning perspective. The paper provides a comprehensive understanding of the transformer and its ICL ability with theoretical and empirical evidence.
  • Benign Overfitting in Deep Neural Networks under Lazy Training. [paper]

    • Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, Francesco Locatello, Volkan Cevher. ICML 2023
    • Key Word: Benign Overfitting; Lazy Training; Neural Tangent Kernel.
    • Digest The paper studies how gradient descent trains over-parameterized deep ReLU networks to achieve optimal classification performance under certain conditions. The paper connects over-parameterization, benign overfitting, and Lipschitz constant of the networks. The paper also shows that smoother functions and Neural Tangent Kernel regime improve generalization. The paper gives lower bounds on margin and eigenvalue for non-smooth activation functions.
  • Most Neural Networks Are Almost Learnable. [paper]

    • Amit Daniely, Nathan Srebro, Gal Vardi.
    • Key Word: Neural Network Learnability.
    • Digest The authors assume that the network’s weights are initialized randomly using a standard scheme and that the input distribution is uniform on a sphere. They show that random networks with Lipschitz activation functions can be approximated by low-degree polynomials, and use this to derive a polynomial-time approximation scheme (PTAS) for learning them. They also show that for sigmoid and ReLU-like activation functions, the PTAS can be improved to a quasi-polynomial-time algorithm. They support their theory with experiments on three network architectures and three datasets.
  • The Crucial Role of Normalization in Sharpness-Aware Minimization. [paper]

    • Yan Dai, Kwangjun Ahn, Suvrit Sra.
    • Key Word: Sharpness-Aware Minimization; Normalization.
    • Digest Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer that greatly improves the prediction performance of deep neural networks. There has been a surge of interest in explaining its empirical success. We focus on understanding the role played by normalization, a key component of the SAM updates. We study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: (i) it helps stabilize the algorithm, and (ii) it enables the algorithm to drift along a continuum (manifold) of minima. These two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM.
  • From Tempered to Benign Overfitting in ReLU Neural Networks. [paper]

    • Guy Kornowski, Gilad Yehudai, Ohad Shamir.
    • Key Word: Overparameterized neural networks; Benign overfitting; Tempered overfitting.
    • Digest Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on “benign overfitting”, where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as “tempered overfitting”. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions.
  • When are ensembles really effective? [paper]

    • Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, Michael W. Mahoney.
    • Key Word: Ensemble; Disagreement-Error Ratio.
    • Digest Ensembling is a machine learning technique that combines multiple models to improve the overall performance. Ensembling has a long history in statistical data analysis, but its benefits are not always obvious in modern machine learning settings. We study the fundamental question of when ensembling yields significant performance improvements in classification tasks. We prove new results relating the ensemble improvement rate to the disagreement-error ratio. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate.
  • The Hessian perspective into the Nature of Convolutional Neural Networks. [paper]

    • Sidak Pal Singh, Thomas Hofmann, Bernhard Schölkopf. ICML 2023
    • Key Word: Hessian Maps; Convolutional Neural Networks.
    • Digest We provide a novel perspective on Convolutional Neural Networks (CNNs) by studying their Hessian maps, which capture parameter interactions. Using a Toeplitz representation framework, we reveal the Hessian structure and establish tight upper bounds on its rank. Our findings show that the Hessian rank in CNNs grows as the square root of the number of parameters, challenging previous assumptions.
  • Model-agnostic Measure of Generalization Difficulty. [paper] [code]

    • Akhilan Boopathy, Kevin Liu, Jaedong Hwang, Shu Ge, Asaad Mohammedsaleh, Ila Fiete. ICML 2023
    • Key Word: Generalization Difficulty; Information Content of Inductive Biases.
    • Digest The measure of a machine learning algorithm is the difficulty of the tasks it can perform, and sufficiently difficult tasks are critical drivers of strong machine learning models. However, quantifying the generalization difficulty of machine learning benchmarks has remained challenging. We propose what is to our knowledge the first model-agnostic measure of the inherent generalization difficulty of tasks. Our inductive bias complexity measure quantifies the total information required to generalize well on a task minus the information provided by the data.
  • Wasserstein PAC-Bayes Learning: A Bridge Between Generalisation and Optimisation. [paper]

    • Maxime Haddouche, Benjamin Guedj.
    • Key Word: PAC-Bayes Bound; Wasserstein Distances.
    • Digest PAC-Bayes learning is an established framework to assess the generalisation ability of learning algorithms during the training phase. However, it remains challenging to know whether PAC-Bayes is useful to understand, before training, why the output of well-known algorithms generalises well. We positively answer this question by expanding the Wasserstein PAC-Bayes framework, briefly introduced in Amit et al. (2022). We provide new generalisation bounds exploiting geometric assumptions on the loss function. Using our framework, we prove, before any training, that the output of an algorithm from Lambert et al. (2022) has a strong asymptotic generalisation ability. More precisely, we show that it is possible to incorporate optimisation results within a generalisation framework, building a bridge between PAC-Bayes and optimisation algorithms.
  • Do deep neural networks have an inbuilt Occam's razor? [paper]

    • Chris Mingard, Henry Rees, Guillermo Valle-Pérez, Ard A. Louis.
    • Key Word: Kolmogorov Complexity; Inductive Bias; Occam’s Razor; No Free Lunch Theorems.
    • Digest The remarkable performance of overparameterized deep neural networks (DNNs) must arise from an interplay between network architecture, training algorithms, and structure in the data. To disentangle these three components, we apply a Bayesian picture, based on the functions expressed by a DNN, to supervised learning. The prior over functions is determined by the network, and is varied by exploiting a transition between ordered and chaotic regimes. For Boolean function classification, we approximate the likelihood using the error spectrum of functions on data. When combined with the prior, this accurately predicts the posterior, measured for DNNs trained with stochastic gradient descent. This analysis reveals that structured data, combined with an intrinsic Occam's razor-like inductive bias towards (Kolmogorov) simple functions that is strong enough to counteract the exponential growth of the number of functions with complexity, is a key to the success of DNNs.
  • The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning. [paper]

    • Micah Goldblum, Marc Finzi, Keefer Rowan, Andrew Gordon Wilson.
    • Key Word: No Free Lunch Theorem; Kolmogorov Complexity; Model Selection.
    • Digest No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains.
  • The Benefits of Mixup for Feature Learning. [paper]

    • Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu.
    • Key Word: Mixup; Data Augmentation; Feature Learning.
    • Digest We first show that Mixup using different linear interpolation parameters for features and labels can still achieve similar performance to the standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully explain the success of Mixup. Then we perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data). In contrast, standard training can only learn the common features but fails to learn the rare features, thus suffering from bad generalization performance. Moreover, our theoretical analysis also shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose to apply early stopping in Mixup.
  • Bayes Complexity of Learners vs Overfitting. [paper]

    • Grzegorz Głuch, Rudiger Urbanke.
    • Key Word: PAC-Bayes; Bayes Complexity; Overfitting.
    • Digest We introduce a new notion of complexity of functions and we show that it has the following properties: (i) it governs a PAC Bayes-like generalization bound, (ii) for neural networks it relates to natural notions of complexity of functions (such as the variation), and (iii) it explains the generalization gap between neural networks and linear schemes.
  • Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization. [paper]

    • Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro.
    • Key Word: Benign Overfitting; Implicit Bias.
    • Digest Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush-Kuhn-Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data.
  • The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks. [paper]

    • Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro.
    • Key Word: Implicit Bias; Adversarial Robustness.
    • Digest In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples.
  • Why (and When) does Local SGD Generalize Better than SGD? [paper]

    • Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora. ICLR 2023
    • Key Word: Local Stochastic Gradient Descent; Stochastic Differential Equations.
    • Digest This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small learning rate regime, showing how noise drives the iterate to drift and diffuse after it has reached close to the manifold of local minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having a small learning rate and long enough training time enables the generalization improvement over SGD but removing either of the two conditions leads to no improvement.
  • Hiding Data Helps: On the Benefits of Masking for Sparse Coding. [paper]

    • Muthu Chidambaram, Chenwei Wu, Yu Cheng, Rong Ge.
    • Key Word: Sparse Coding; Self-Supervised Learning.
    • Digest We show that for over-realized sparse coding in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the ground-truth dictionary, regardless of the magnitude of the signal in the data-generating process. Furthermore, drawing from the growing body of work on self-supervised learning, we propose a novel masking objective and we prove that minimizing this new objective can recover the ground-truth dictionary.
  • Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width. [paper]

    • Dayal Singh Kalra, Maissam Barkeshli.
    • Key Word: Sharpness; Neural Tangent Kernel.
    • Digest By analyzing the maximum eigenvalue $\lambda_t^H$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late time “edge of stability” regime.
  • Sharpness-Aware Minimization: An Implicit Regularization Perspective. [paper]

    • Kayhan Behdin, Rahul Mazumder.
    • Key Word: Sharpness-Aware Minimization; Implicit Regularization.
    • Digest We study SAM through an implicit regularization lens, and present a new theoretical explanation of why SAM generalizes well. To this end, we study the least-squares linear regression problem and show a bias-variance trade-off for SAM's error over the course of the algorithm. We show SAM has lower bias compared to Gradient Descent (GD), while having higher variance.
  • Modular Deep Learning. [paper]

    • Jonas Pfeiffer, Sebastian Ruder, Ivan Vulić, Edoardo Maria Ponti.
    • Key Word: Parameter-Efficient Fine-Tuning; Mixture-of-Experts; Routing; Model Aggregation.
    • Digest Modular deep learning has emerged as a promising solution to the challenges of developing models that specialise towards multiple tasks without negative interference and that generalise systematically to non-identically distributed tasks. In this framework, units of computation are often implemented as autonomous parameter-efficient modules. Information is conditionally routed to a subset of modules and subsequently aggregated. These properties enable positive transfer and systematic generalisation by separating computation from routing and updating modules locally. We offer a survey of modular architectures, providing a unified view over several threads of research that evolved independently in the scientific literature.
  • mSAM: Micro-Batch-Averaged Sharpness-Aware Minimization. [paper]

    • Key Word: Sharpness-Aware Minimization.
    • Digest We focus on a variant of SAM known as micro-batch SAM (mSAM), which, during training, averages the updates generated by adversarial perturbations across several disjoint shards (micro batches) of a mini-batch. We extend a recently developed and well-studied general framework for flatness analysis to show that distributed gradient computation for sharpness-aware minimization theoretically achieves even flatter minima.
  • Machine Love. [paper]

    • Key Word: Maslow’s Gridworld; Psychology.
    • Digest While ML generates much economic value, many of us have problematic relationships with social media and other ML-powered applications. One reason is that ML often optimizes for what we want in the moment, which is easy to quantify but at odds with what is known scientifically about human flourishing. Thus, through its impoverished models of us, ML currently falls far short of its exciting potential, which is for it to help us to reach ours. While there is no consensus on defining human flourishing, from diverse perspectives across psychology, philosophy, and spiritual traditions, love is understood to be one of its primary catalysts. Motivated by this view, this paper explores whether there is a useful conception of love fitting for machines to embody, as historically it has been generative to explore whether a nebulous concept, such as life or intelligence, can be thoughtfully abstracted and reimagined, as in the fields of machine intelligence or artificial life.
  • PAC-Bayesian Generalization Bounds for Adversarial Generative Models. [paper]

    • Sokhna Diarra Mbacke, Florence Clerc, Pascal Germain.
    • Key Word: PAC-Bayes; Generative Model Generalization Bound.
    • Digest We extend PAC-Bayesian theory to generative models and develop generalization bounds for models based on the Wasserstein distance and the total variation distance. Our first result on the Wasserstein distance assumes the instance space is bounded, while our second result takes advantage of dimensionality reduction. Our results naturally apply to Wasserstein GANs and Energy-Based GANs, and our bounds provide new training objectives for these two.
  • SAM operates far from home: eigenvalue regularization as a dynamical phenomenon. [paper]

    • Atish Agarwala, Yann N. Dauphin.
    • Key Word: Sharpness-Aware Minimization.
    • Digest Our work reveals that SAM provides a strong regularization of the eigenvalues throughout the learning trajectory. We show that in a simplified setting, SAM dynamically induces a stabilization related to the edge of stability (EOS) phenomenon observed in large learning rate gradient descent. Our theory predicts the largest eigenvalue as a function of the learning rate and SAM radius parameters.
  • Interpolation Learning With Minimum Description Length. [paper]

    • Naren Sarayu Manoj, Nathan Srebro.
    • Key Word: Minimum Description Length; Benign Overfitting; Tempered Overfitting.
    • Digest We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.
  • A modern look at the relationship between sharpness and generalization. [paper]

    • Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, Nicolas Flammarion.
    • Key Word: Sharpness; Generalization.
    • Digest We comprehensively explore whether adaptive sharpness really captures generalization in modern practical settings, in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup.
  • A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. [paper]

    • Hongkang Li, Meng Wang, Sijia Liu, Pin-yu Chen. ICLR 2023
    • Key Word: Vision Transformer; Token Sparsification; Sample Complexity Bound.
    • Digest Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error.
  • Tighter PAC-Bayes Bounds Through Coin-Betting. [paper]

    • Kyoungseok Jang, Kwang-Sung Jun, Ilja Kuzborskij, Francesco Orabona.
    • Key Word: PAC-Bayes Bounds.
    • Digest Recently, the PAC-Bayes framework has been proposed as a better alternative to uniform concentration inequalities for problems such as estimating the generalization error of a trained neural network, owing to its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve even tighter guarantees. Our approach is based on the coin-betting framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms.
  • A unified recipe for deriving (time-uniform) PAC-Bayes bounds. [paper]

    • Ben Chugg, Hongjian Wang, Aaditya Ramdas.
    • Key Word: PAC-Bayes Bounds.
    • Digest We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality.
  • The SSL Interplay: Augmentations, Inductive Bias, and Generalization. [paper]

    • Vivien Cabannes, Bobak T. Kiani, Randall Balestriero, Yann LeCun, Alberto Bietti.
    • Key Word: Self-Supervised Learning; Data Augmentation; Inductive Bias.
    • Digest Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.
  • A Stability Analysis of Fine-Tuning a Pre-Trained Model. [paper]

    • Zihao Fu, Anthony Man-Cho So, Nigel Collier.
    • Key Word: Fine-Tuning; Stability Analysis.
    • Digest We propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure.
  • Strong inductive biases provably prevent harmless interpolation. [paper] [code]

    • Michael Aerni, Marco Milanta, Konstantin Donhauser, Fanny Yang.
    • Key Word: Benign Overfitting; Inductive Bias.
    • Digest This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well.

Others: 2022

  • PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. [paper] [code]

    • Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson.
    • Key Word: PAC-Bayes; Model Compression.
    • Digest We develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning.
  • Instance-Dependent Generalization Bounds via Optimal Transport. [paper]

    • Songyan Hou, Parnian Kassraie, Anastasis Kratsios, Jonas Rothfuss, Andreas Krause.
    • Key Word: Generalization Bounds; Optimal Transport; Distribution Shifts.
    • Digest We propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
  • How Does Sharpness-Aware Minimization Minimize Sharpness? [paper]

    • Kaiyue Wen, Tengyu Ma, Zhiyuan Li.
    • Key Word: Sharpness-Aware Minimization.
    • Digest This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes a third notion of sharpness, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of the Hessian when SAM is applied.
  • Augmentation Invariant Manifold Learning. [paper]

    • Shulei Wang.
    • Key Word: Manifold Learning; Data Augmentation.
    • Digest We develop a statistical framework on a low-dimensional product manifold to theoretically understand why unlabeled augmented data can lead to useful data representations. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations.
  • The Curious Case of Benign Memorization. [paper]

    • Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann.
    • Key Word: Memorization; Data Augmentation.
    • Digest We show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers.
  • Symmetries, flat minima, and the conserved quantities of gradient flow. [paper]

    • Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, Nima Dehmamy.
    • Key Word: Conserved Quantities; Mode Connectivity; Flat Minima; Parameter Space Symmetry.
    • Digest The paper presents a general framework that identifies continuous symmetries in the parameter space of deep neural networks, which create low-loss valleys and connect local minima. The framework utilizes equivariances of activation functions and introduces nonlinear, data-dependent symmetries for nonlinear neural networks. The authors demonstrate that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. Additionally, they relate these conserved quantities to convergence rate and sharpness of the minimum, shedding light on the limitations of gradient flow exploration.
  • Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup. [paper]

    • Muthu Chidambaram, Xiang Wang, Chenwei Wu, Rong Ge.
    • Key Word: Mixup; Feature Learning.
    • Digest We try to explain some of the success of Mixup from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes, while training with a specific instantiation of Mixup succeeds in learning both features for every class.
  • A PAC-Bayesian Generalization Bound for Equivariant Networks. [paper]

    • Arash Behboodi, Gabriele Cesa, Taco Cohen. NeurIPS 2022
    • Key Word: PAC-Bayes; Equivariant Networks.
    • Digest We study how equivariance relates to generalization error utilizing PAC-Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in the Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and multiplicity and degree of irreducible representations on the generalization error, and thereby provides a guideline for selecting them.
  • Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty. [paper]

    • Felix Biggs, Benjamin Guedj.
    • Key Word: PAC-Bayes.
    • Digest We introduce a modified version of the excess risk, which can be used to obtain tighter, fast-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for [−1,1]-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around 0. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.
  • How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders. [paper] [code]

    • Qi Zhang, Yifei Wang, Yisen Wang. NeurIPS 2022
    • Key Word: Masked Autoencoders.
    • Digest We propose a theoretical understanding of how masking matters for MAE to learn meaningful features. We establish a close connection between MAE and contrastive learning, which shows that MAE implicitly aligns the mask-induced positive pairs. Built upon this connection, we develop the first downstream guarantees for MAE methods, and analyze the effect of mask ratio. Besides, as a result of the implicit alignment, we also point out the dimensional collapse issue of MAE, and propose a Uniformity-enhanced MAE (U-MAE) loss that can effectively address this issue and bring significant improvements on real-world datasets, including CIFAR-10, ImageNet-100, and ImageNet-1K.
  • On the Importance of Gradient Norm in PAC-Bayesian Bounds. [paper]

    • Itai Gat, Yossi Adi, Alexander Schwing, Tamir Hazan. NeurIPS 2022
    • Key Word: PAC-Bayes.
    • Digest Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform bounds assumptions by using on-average bounded loss and on-average bounded gradient norm assumptions. Following this relaxation, we propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities.
  • SGD with large step sizes learns sparse features. [paper]

    • Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion.
    • Key Word: Stochastic Gradient Descent; Sparse Features.
    • Digest We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics orthogonal to the bouncing directions that biases it implicitly toward simple predictors.
  • The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. [paper]

    • Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar.
    • Key Word: Data Augmentation; Spectral Regularization.
    • Digest We develop a new theoretical framework to characterize the impact of a general class of data augmentation (DA) on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression.
  • Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. [paper]

    • Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa.
    • Key Word: Gradient Regularization; Implicit Bias.
    • Digest We first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of gradient regularization (GR). In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with the finite-difference GR chooses better minima as the ascent step size becomes larger.
  • The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima. [paper]

    • Peter L. Bartlett, Philip M. Long, Olivier Bousquet.
    • Key Word: Sharpness-Aware Minimization.
    • Digest We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence.
  • SAM as an Optimal Relaxation of Bayes. [paper]

    • Thomas Möllenhoff, Mohammad Emtiyaz Khan.
    • Key Word: Sharpness-Aware Minimization; Bayesian Methods.
    • Digest Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
  • Understanding Influence Functions and Datamodels via Harmonic Analysis. [paper]

    • Nikunj Saunshi, Arushi Gupta, Mark Braverman, Sanjeev Arora.
    • Key Word: Influence Functions; Harmonic Analysis.
    • Digest The current paper seeks to provide a better theoretical understanding of influence functions and datamodels. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.
  • Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. [paper]

    • Xiang Wang, Annie N. Wang, Mo Zhou, Rong Ge.
    • Key Word: Monotonic Linear Interpolation; Loss Landscapes.
    • Digest We show that the monotonic linear interpolation (MLI) property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain).
  • Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability. [paper]

    • Alex Damian, Eshaan Nichani, Jason D. Lee.
    • Key Word: Implicit Bias; Edge of Stability.
    • Digest Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored.
  • Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions. [paper]

    • Arthur Jacot.
    • Key Word: Non-Linear Rank; Implicit Bias.
    • Digest We show that the representation cost of fully connected neural networks with homogeneous nonlinearities - which describes the implicit bias in function space of networks with L2-regularization or with losses such as the cross-entropy - converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the "true" rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths which grows with the number of datapoints where the true rank is recovered.
  • Scaling Laws For Deep Learning Based Image Reconstruction. [paper]

    • Tobit Klug, Reinhard Heckel.
    • Key Word: Scaling Laws; Inverse Problems.
    • Digest We study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution and empirically determine the reconstruction quality as a function of training set size, while optimally scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Interpolating those scaling laws suggests that even training on millions of images would not significantly improve performance.
  • Why neural networks find simple solutions: the many regularizers of geometric complexity. [paper]

    • Benoit Dherin, Michael Munn, Mihaela C. Rosca, David G.T. Barrett. NeurIPS 2022
    • Key Word: Regularization; Geometric Complexity; Dirichlet Energy.
    • Digest In many contexts, simpler models are preferable to more complex models and the control of this model complexity is the goal for many methods in machine learning such as regularization, hyperparameter tuning and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, which is a measure of the variability of the model function, computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization and the choice of parameter initialization all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
  • Variational Inference for Infinitely Deep Neural Networks. [paper]

    • Achille Nazaret, David Blei. ICML 2022
    • Key Word: Unbounded Depth Neural Networks; Variational Inference.
    • Digest We develop a novel variational inference algorithm to approximate the posterior of an infinitely deep neural network, optimizing a distribution of the neural network weights and of the truncation depth L, and without any upper limit on L. To this end, the variational family has a special structure: it models neural network weights of arbitrary depth, and it dynamically creates or removes free variational parameters as its distribution of the truncation is optimized.
  • Deep Linear Networks can Benignly Overfit when Shallow Ones Do. [paper]

    • Niladri S. Chatterji, Philip M. Long.
    • Key Word: Benign Overfitting; Double Descent; Implicit Bias.
    • Digest We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum ℓ2-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum ℓ2-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum ℓ2-norm solution.
  • Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). [paper]

    • Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher. NeurIPS 2022
    • Key Word: Lazy Training; Neural Tangent Kernel.
    • Digest We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness in the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness.
  • Git Re-Basin: Merging Models modulo Permutation Symmetries. [paper]

    • Samuel K. Ainsworth, Jonathan Hayase, Siddhartha Srinivasa.
    • Key Word: Mode Connectivity.
    • Digest We argue that neural network loss landscapes contain (nearly) a single basin, after accounting for all possible permutation symmetries of hidden units. We introduce three algorithms to permute the units of one model to bring them into alignment with units of a reference model. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100.
  • Normalization effects on deep neural networks. [paper]

    • Jiahui Yu, Konstantinos Spiliopoulos.
    • Key Word: Normalization.
    • Digest We find that in terms of variance of the neural network's output and test accuracy the best choice is to choose the γi's to be equal to one, which is the mean-field scaling. We also find that this is particularly true for the outer layer, in that the neural network's behavior is more sensitive to the scaling of the outer layer than to the scaling of the inner layers. The mechanism for the mathematical analysis is an asymptotic expansion for the neural network's output.
  • Do Quantum Circuit Born Machines Generalize? [paper]

    • Kaitlin Gili, Mohamed Hibat-Allah, Marta Mauri, Chris Ballance, Alejandro Perdomo-Ortiz.
    • Key Word: Quantum Machine Learning; Quantum Circuit Born Machines; Unsupervised Generative Models.
    • Digest There has been little understanding of a model's generalization performance and the relation between such capability and the resource requirements, e.g., the circuit depth and the amount of training data. In this work, we leverage a recently proposed generalization evaluation framework to begin addressing this knowledge gap. We first investigate the QCBM's learning process of a cardinality-constrained distribution and see an increase in generalization performance while increasing the circuit depth.
  • Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. [paper]

    • Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran.
    • Key Word: Overfitting; Kernel Regression.
    • Digest The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study.
  • Towards understanding how momentum improves generalization in deep learning. [paper]

    • Samy Jelassi, Yuanzhi Li. ICML 2022
    • Key Word: Gradient Descent with Momentum.
    • Digest We adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized.
  • Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent. [paper]

    • Zhiyuan Li, Tianhao Wang, Jason D. Lee, Sanjeev Arora.
    • Key Word: Implicit Bias; Mirror Descent.
    • Digest As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related Legendre function.
  • A law of adversarial risk, interpolation, and label noise. [paper]

    • Daniel Paleka, Amartya Sanyal. ICLR 2023
    • Key Word: Benign Overfitting; Adversarial Robustness.
    • Digest We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm.
  • Integral Probability Metrics PAC-Bayes Bounds. [paper]

    • Ron Amit, Baruch Epstein, Shay Moran, Ron Meir. NeurIPS 2022
    • Key Word: PAC-Bayes Bound.
    • Digest We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and improved bounds in favorable cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
  • Robustness Implies Generalization via Data-Dependent Generalization Bounds. [paper]

    • Kenji Kawaguchi, Zhun Deng, Kyle Luh, Jiaoyang Huang. ICML 2022
    • Key Word: Algorithmic Robustness Bound.
    • Digest This paper proves that robustness implies generalization via data-dependent generalization bounds. As a result, robustness and generalization are shown to be connected closely in a data-dependent manner. Our bounds improve previous bounds in two directions, to solve an open problem that has seen little development since 2010. The first is to reduce the dependence on the covering number. The second is to remove the dependence on the hypothesis space. We present several examples, including ones for lasso and deep learning, in which our bounds are provably preferable.
  • Learning sparse features can lead to overfitting in neural networks. [paper] [code]

    • Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, Matthieu Wyart.
    • Key Word: Sparse Representation; Neural Tangent Kernel.
    • Digest It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images.
  • Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. [paper]

    • Jiachun Pan, Pan Zhou, Shuicheng Yan.
    • Key Word: Mask-Reconstruction Pretraining; Self-Supervision.
    • Digest Supervised fine-tuning of an encoder pretrained with mask-reconstruction pretraining (MRP) remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task.
  • Why do CNNs Learn Consistent Representations in their First Layer Independent of Labels and Architecture? [paper]

    • Rhea Chowers, Yair Weiss.
    • Key Word: Architecture Inductive Bias.
    • Digest It has previously been observed that the filters learned in the first layer of a CNN are qualitatively similar for different networks and tasks. We extend this finding and show a high quantitative similarity between filters learned by different networks. We consider the CNN filters as a filter bank and measure the sensitivity of the filter bank to different frequencies. We show that the sensitivity profile of different networks is almost identical, yet far from initialization. Remarkably, we show that it remains the same even when the network is trained with random labels. To understand this effect, we derive an analytic formula for the sensitivity of the filters in the first layer of a linear CNN. We prove that when the average patch in images of the two classes is identical, the sensitivity profile of the filters in the first layer will be identical in expectation when using the true labels or random labels and will only depend on the second-order statistics of image patches.
  • A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features. [paper]

    • Zhenmei Shi, Junyi Wei, Yingyu Liang. ICLR 2022
    • Key Word: Linearization of Neural Networks; Neural Tangent Kernel.
    • Digest To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution).
  • Realistic Deep Learning May Not Fit Benignly. [paper]

    • Kaiyue Wen, Jiaye Teng, Jingzhao Zhang.
    • Key Word: Benign Overfitting.
    • Digest We examine the benign overfitting phenomenon in real-world settings. We find that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points.
  • A Model of One-Shot Generalization. [paper]

    • Thomas Laurent, James H. von Brecht, Xavier Bresson.
    • Key Word: One-Shot Generalization; PAC Learning; Neural Tangent Kernel.
    • Digest We provide a theoretical framework to study a phenomenon that we call one-shot generalization. This phenomenon refers to the ability of an algorithm to perform transfer learning within a single task, meaning that it correctly classifies a test point that has a single exemplar in the training set. We propose a simple data model and use it to study this phenomenon in two ways. First, we prove a non-asymptotic baseline: kernel methods based on nearest-neighbor classification cannot perform one-shot generalization, independently of the choice of the kernel and the size of the training set. Second, we empirically show that the most direct neural network architecture for our data model performs one-shot generalization almost perfectly. This stark differential leads us to believe that the one-shot generalization mechanism is partially responsible for the empirical success of neural networks.
  • Empirical Evaluation and Theoretical Analysis for Representation Learning: A Survey. [paper]

    • Kento Nozawa, Issei Sato. IJCAI 2022
    • Key Word: Representation Learning; Pre-training; Regularization.
    • Digest Representation learning enables us to automatically extract generic feature representations from a dataset to solve another machine learning task. Recently, feature representations extracted by a representation learning algorithm, combined with a simple predictor, have exhibited state-of-the-art performance on several machine learning tasks. Despite this remarkable progress, there exist various ways to evaluate representation learning algorithms depending on the application because of the flexibility of representation learning. To understand the current state of representation learning, we review evaluation methods of representation learning algorithms and theoretical analyses.
  • The Effects of Regularization and Data Augmentation are Class Dependent. [paper]

    • Randall Balestriero, Leon Bottou, Yann LeCun. NeurIPS 2022
    • Key Word: Data Augmentation.
    • Digest We demonstrate that techniques such as data augmentation (DA) or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes; e.g., on ImageNet with a ResNet-50, the "barn spider" classification test accuracy falls from 68% to 46% only by introducing random crop DA during training. Even more surprisingly, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay.
  • Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum. [paper]

    • Kirby Banman, Liam Peet-Pare, Nidhi Hegde, Alona Fyshe, Martha White. ICLR 2022
    • Key Word: Stochastic Gradient Descent; Covariate Shift.
    • Digest We show that SGD with momentum (SGDm) under covariate shift with a fixed step-size can be unstable and diverge. In particular, we show SGDm under covariate shift is a parametric oscillator, and so can suffer from a phenomenon known as resonance. We approximate the learning system as a time-varying system of ordinary differential equations, and leverage existing theory to characterize the system's divergence/convergence as resonant/nonresonant modes.
  • Data Augmentation as Feature Manipulation. [paper]

    • Ruoqi Shen, Sébastien Bubeck, Suriya Gunasekar.
    • Key Word: Data Augmentation; Feature Learning.
    • Digest In this work we study the effect of data augmentation on the dynamics of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard-to-learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of data augmentation on the learning dynamics for a two-layer convolutional neural network in the recently proposed multi-view data model by Allen-Zhu and Li [2020].
  • How Many Data Are Needed for Robust Learning? [paper]

    • Hongyang Zhang, Yihan Wu, Heng Huang.
    • Key Word: Robustness.
    • Digest In this work, we study the sample complexity of the robust interpolation problem when the data are in a unit ball. We show that both too much and too little data hurt robustness.
  • A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments. [paper]

    • Randall Balestriero, Ishan Misra, Yann LeCun. NeurIPS 2022
    • Key Word: Data Augmentation.
    • Digest We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model's output under a given data augmentation (DA) distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge.
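
As a rough illustration of the last entry above (closed-form moments under a DA distribution versus sampled augmentations), the following toy sketch compares an exact expectation with a Monte Carlo estimate. It is not taken from the paper: the augmentation (a uniform circular shift of a 1-D signal), the function names, and the sample sizes are all assumptions chosen to keep the example small.

```python
import numpy as np


def expected_circular_shift(x):
    """Closed-form E[roll(x, s)] for s uniform on {0, ..., len(x) - 1}."""
    # Each output position sees every input pixel with equal probability,
    # so the expectation equals the mean pixel value at every position.
    return np.full_like(x, x.mean(), dtype=float)


def monte_carlo_circular_shift(x, n_samples, rng):
    """Estimate the same expectation by averaging sampled augmentations."""
    shifts = rng.integers(0, len(x), size=n_samples)
    return np.mean([np.roll(x, int(s)) for s in shifts], axis=0)


rng = np.random.default_rng(0)
signal = np.arange(16, dtype=float)  # hypothetical 1-D "image"
exact = expected_circular_shift(signal)
for n in (10, 10_000):
    estimate = monte_carlo_circular_shift(signal, n, rng)
    err = np.abs(estimate - exact).max()
    print(f"n={n:>6d}  max abs error={err:.4f}")
```

The sampled estimate approaches the closed-form value only slowly as the number of augmented samples grows, which is the kind of gap the digest refers to.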

Others: 2021

  • Discovering and Explaining the Representation Bottleneck of DNNs. [paper]

    • Huiqi Deng, Qihan Ren, Hao Zhang, Quanshi Zhang. ICLR 2022
    • Key Word: Representation Bottleneck; Explanation.
    • Digest This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and humans, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck.
  • Generalization in quantum machine learning from few training data. [paper]

    • Matthias C. Caro, Hsin-Yuan Huang, M. Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, Patrick J. Coles. Nature Communications
    • Key Word: Quantum Machine Learning; Generalization Bounds.
    • Digest We provide a comprehensive study of generalization performance in quantum machine learning (QML) after training on a limited number N of training data points. We also show that classification of quantum states across a phase transition with a quantum convolutional neural network requires only a very small training data set. Other potential applications include learning quantum error correcting codes or quantum dynamical simulation. Our work injects new hope into the field of QML, as good generalization is guaranteed from few training data.
  • The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks. [paper]

    • Yizhang Lou, Chris Mingard, Soufiane Hayou.
    • Key Word: Implicit Regularization.
    • Digest We provide the first explanation for the alignment hierarchy, in which some layers are much more aligned with data labels than others. We introduce and empirically validate the Equilibrium Hypothesis, which states that the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.
  • Understanding Dimensional Collapse in Contrastive Self-supervised Learning. [paper] [code]

    • Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. ICLR 2022
    • Key Word: Self-Supervision; Contrastive Learning; Implicit Regularization; Dimensional Collapse.
    • Digest We show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.
  • Implicit Sparse Regularization: The Impact of Depth and Early Stopping. [paper] [code]

    • Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, Raymond K. W. Wong. NeurIPS 2021
    • Key Word: Implicit Regularization.
    • Digest In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases.
  • The Benefits of Implicit Regularization from SGD in Least Squares Problems. [paper]

    • Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade. NeurIPS 2021
    • Key Word: Implicit Regularization.
    • Digest We show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance.
  • Neural Controlled Differential Equations for Online Prediction Tasks. [paper] [code]

    • James Morrill, Patrick Kidger, Lingyi Yang, Terry Lyons.
    • Key Word: Ordinary Differential Equations.
    • Digest Neural controlled differential equations (Neural CDEs) are state-of-the-art models for irregular time series. However, due to current implementations relying on non-causal interpolation schemes, Neural CDEs cannot currently be used in online prediction tasks; that is, in real-time as data arrives. This is in contrast to similar ODE models such as the ODE-RNN which can already operate in continuous time. Here we introduce and benchmark new interpolation schemes, most notably, rectilinear interpolation, which allows for an online everywhere causal solution to be defined.
  • The Principles of Deep Learning Theory. [paper]

    • Daniel A. Roberts, Sho Yaida, Boris Hanin.
    • Key Word: Bayesian Learning; Neural Tangent Kernel; Statistical Physics; Information Theory; Residual Learning; Book.
    • Digest This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics.
  • Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning. [paper] [code]

    • Colin Wei, Sang Michael Xie, Tengyu Ma. NeurIPS 2021
    • Key Word: Natural Language Processing; Pre-training; Prompting.
    • Digest We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language.
  • Differentiable Multiple Shooting Layers. [paper] [code]

    • Stefano Massaroli, Michael Poli, Sho Sonoda, Taiji Suzuki, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2021
    • Key Word: Ordinary Differential Equations.
    • Digest We detail a novel class of implicit neural models. Leveraging time-parallel methods for differential equations, Multiple Shooting Layers (MSLs) seek solutions of initial value problems via parallelizable root-finding algorithms. MSLs broadly serve as drop-in replacements for neural ordinary differential equations (Neural ODEs) with improved efficiency in number of function evaluations (NFEs) and wall-clock inference time.
  • Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning. [paper] [code]

    • Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal. NeurIPS 2021
    • Key Word: Sample-Wise Self-Attention; Meta Learning; Metric Learning.
    • Digest We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms.
  • Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. [paper]

    • Mikhail Belkin.
    • Key Word: Interpolation; Over-parameterization.
    • Digest In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model.
  • A Universal Law of Robustness via Isoperimetry. [paper]

    • Sébastien Bubeck, Mark Sellke.
    • Key Word: Overparameterized Memorization; Lipschitz Neural Network.
    • Digest A puzzling phenomenon in deep learning is that models are trained with many more parameters than classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension.
  • Noether's Learning Dynamics: Role of Symmetry Breaking in Neural Networks. [paper]

    • Hidenori Tanaka, Daniel Kunin.
    • Key Word: Geometry of Learning Dynamics; Symmetry Breaking.
    • Digest The paper develops a theoretical framework to investigate the "geometry of learning dynamics" in neural networks and uncovers the significance of explicit symmetry breaking in achieving efficiency and stability. It introduces "kinetic symmetry breaking" (KSB) as a condition where the kinetic energy breaks the symmetry of the potential function and applies Noether's theorem to derive "Noether's Learning Dynamics" (NLD) as a result.
  • Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes. [paper]

    • James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse.
    • Key Word: Monotonic Linear Interpolation; Loss Landscapes.
    • Digest We evaluate several hypotheses for the monotonic linear interpolation (MLI) property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network, providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g. network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization.
  • On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs). [paper]

    • Zhiyuan Li, Sadhika Malladi, Sanjeev Arora. NeurIPS 2021
    • Key Word: Stochastic Gradient Descent Dynamics; Stochastic Differential Equations.
    • Digest The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Ito SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
  • MALI: A memory efficient and reverse accurate integrator for Neural ODEs. [paper] [code]

    • Juntang Zhuang, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan. ICLR 2021
    • Key Word: Ordinary Differential Equations.
    • Digest Based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost w.r.t. the number of solver steps in integration, similar to the adjoint method, and guarantees accuracy in the reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy.
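
To make the "accuracy in reverse-time trajectory" idea from the MALI entry above concrete, here is a minimal sketch of a leapfrog-style step that tracks a state `z` and a velocity `v` and can be inverted algebraically. The update rule is written from memory of how the ALF step is usually described and should be treated as an assumption, not as the paper's exact algorithm; the function names and the test dynamics `decay` are hypothetical.

```python
import math


def alf_step(f, z, v, t, h):
    """One forward step of a leapfrog-style update (assumed ALF form)."""
    t_mid = t + h / 2
    k = z + v * h / 2          # half step in z using the tracked velocity
    u = f(k, t_mid)            # derivative evaluated at the midpoint
    v_new = 2 * u - v          # reflect the velocity around u
    z_new = k + v_new * h / 2  # second half step with the new velocity
    return z_new, v_new, t_mid + h / 2


def alf_step_inverse(f, z_new, v_new, t_new, h):
    """Algebraic inverse of alf_step: recovers (z, v, t) from its outputs."""
    t_mid = t_new - h / 2
    k = z_new - v_new * h / 2  # undo the second half step
    u = f(k, t_mid)            # same midpoint evaluation as the forward pass
    v = 2 * u - v_new          # undo the velocity reflection
    z = k - v * h / 2          # undo the first half step
    return z, v, t_mid - h / 2


def decay(z, t):
    """Hypothetical test dynamics dz/dt = -z."""
    return -z


h, n_steps = 0.01, 100
z, v, t = 1.0, decay(1.0, 0.0), 0.0  # velocity initialised to f(z0, t0)
for _ in range(n_steps):
    z, v, t = alf_step(decay, z, v, t, h)
z_end = z
for _ in range(n_steps):
    z, v, t = alf_step_inverse(decay, z, v, t, h)
print("z(1) estimate:", z_end, " exact:", math.exp(-1.0), " recovered z(0):", z)
```

Because the inverse step reconstructs earlier states from the final state alone, intermediate states need not be stored, which is the property the digest links to constant memory cost.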

Others: 2020

  • Understanding the Failure Modes of Out-of-Distribution Generalization. [paper] [code]

    • Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur. ICLR 2021
    • Key Word: Out-of-Distribution Generalization.
    • Digest We identify that spurious correlations during training can induce two distinct skews in the training set, one geometric and another statistical. These skews result in two complementary ways by which empirical risk minimization (ERM) via gradient descent is guaranteed to rely on those spurious correlations.
  • Deep Networks from the Principle of Rate Reduction. [paper] [code]

    • Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, Yi Ma.
    • Key Word: Maximal Coding Rate Reduction.
    • Digest This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction of learned features naturally leads to a multi-layer deep network, one iteration per layer. The layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer in a forward propagation fashion by emulating the gradient scheme.
  • Sharpness-Aware Minimization for Efficiently Improving Generalization. [paper] [code]

    • Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur. ICLR 2021
    • Key Word: Flat Minima.
    • Digest In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently.
  • Implicit Gradient Regularization. [paper]

    • David G.T. Barrett, Benoit Dherin. ICLR 2021
    • Key Word: Implicit Regularization.
    • Digest Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations.
  • Neural Rough Differential Equations for Long Time Series. [paper] [code]

    • James Morrill, Cristopher Salvi, Patrick Kidger, James Foster, Terry Lyons. ICML 2021
    • Key Word: Ordinary Differential Equations.
    • Digest Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN. However, as with RNNs, training can quickly become impractical for long time series. Here we use rough path theory to extend this formulation through application of a pre-existing mathematical tool from rough analysis - the log-ODE method - which allows us to take integration steps larger than the discretisation of the data, resulting in significantly faster training times, with retainment (and often even improvements) in model performance.
  • Optimizing Mode Connectivity via Neuron Alignment. [paper] [code]

    • N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, Rongjie Lai. NeurIPS 2020
    • Key Word: Mode Connectivity; Neuron Alignment; Adversarial Training.
    • Digest We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected. To approximate the optimal permutation, we introduce an inexpensive heuristic referred to as neuron alignment. Neuron alignment promotes similarity between the distribution of intermediate activations of models along the curve.
  • Benign Overfitting and Noisy Features. [paper]

    • Zhu Li, Weijie Su, Dino Sejdinovic.
    • Key Word: Benign Overfitting; Random Feature Approximation; Deep Double Descent.
    • Digest We examine the conditions under which Benign Overfitting occurs in the random feature (RF) models, i.e. in a two-layer neural network with fixed first layer weights. We adopt a new view of random feature and show that benign overfitting arises due to the noise which resides in such features (the noise may already be present in the data and propagate to the features or it may be added by the user to the features directly) and plays an important implicit regularization role in the phenomenon.
  • Expressivity of Deep Neural Networks. [paper]

    • Ingo Gühring, Mones Raslan, Gitta Kutyniok.
    • Key Word: Approximation; Expressivity; Function Classes.
    • Digest In this review paper, we give a comprehensive overview of the large variety of approximation results for neural networks. Approximation rates for classical function spaces as well as benefits of deep neural networks over shallow ones for specifically structured function classes are discussed. While the main body of existing results is for general feedforward architectures, we also depict approximation results for convolutional, residual and recurrent neural networks.
  • How benign is benign overfitting? [paper]

    • Amartya Sanyal, Puneet K Dokania, Varun Kanade, Philip H.S. Torr. ICLR 2021
    • Key Word: Benign Overfitting; Adversarial Robustness.
    • Digest We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don’t fit the noise.
  • On the Theory of Transfer Learning: The Importance of Task Diversity. [paper]

    • Nilesh Tripuraneni, Michael I. Jordan, Chi Jin. NeurIPS 2020
    • Key Word: Transfer Learning; Task Diversity; Generalization Bound.
    • Digest We introduce a problem-agnostic definition of task diversity which can be integrated into a uniform convergence framework to provide generalization bounds for transfer learning problems with general losses, tasks, and features. Our framework puts this notion of diversity together with a common-design assumption across tasks to provide guarantees of a fast convergence rate, decaying with all of the samples for the transfer learning problem.
  • Neural Controlled Differential Equations for Irregular Time Series. [paper] [code]

    • Patrick Kidger, James Morrill, James Foster, Terry Lyons. NeurIPS 2020
    • Key Word: Ordinary Differential Equations.
    • Digest A fundamental issue with neural ordinary differential equations is that the solution to an ordinary differential equation is determined by its initial condition, and there is no mechanism for adjusting the trajectory based on subsequent observations. Here, we demonstrate how this may be resolved through the well-understood mathematics of controlled differential equations.
  • Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime. [paper]

    • Niladri S. Chatterji, Philip M. Long. JMLR
    • Key Word: Benign Overfitting; Finite-Sample Analysis.
    • Digest We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
  • Dissecting Neural ODEs. [paper] [code]

    • Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2020
    • Key Word: Ordinary Differential Equations.
    • Digest Continuous deep learning architectures have recently re-emerged as Neural Ordinary Differential Equations (Neural ODEs). This infinite-depth approach theoretically bridges the gap between deep learning and dynamical systems, offering a novel perspective. However, deciphering the inner working of these models is still an open challenge, as most applications apply them as generic black-box modules. In this work we "open the box", further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics.
  • Proving the Lottery Ticket Hypothesis: Pruning is All You Need. [paper]

    • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir. ICML 2020
    • Key Word: Lottery Ticket Hypothesis.
    • Digest The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.
  • Relative Flatness and Generalization. [paper] [code]

    • Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, Mario Boley. NeurIPS 2021
    • Key Word: Relative Flatness; Loss Landscape.
    • Digest The paper investigates the connection between flatness, a property of the loss curve, and generalization ability in machine learning models, particularly neural networks, providing insights into the conditions under which this connection holds and introducing a novel relative flatness measure that correlates strongly with generalization and resolves the reparameterization issue.
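
The Sharpness-Aware Minimization entry in this list describes a min-max procedure that seeks parameters in neighborhoods of uniformly low loss. A minimal sketch of one commonly described form of the update (ascend to a nearby perturbed point, then descend using the gradient taken there) is given below. The quadratic loss, the helper names, and the values of `lr` and `rho` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM-style update: perturb toward higher loss, then descend."""
    g = grad_fn(w)
    # First-order approximation of the worst-case point within an L2 ball
    # of radius rho; the small constant guards against a zero gradient.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sam = grad_fn(w + eps)  # gradient evaluated at the perturbed weights
    return w - lr * g_sam


# Hypothetical toy problem: an ill-conditioned quadratic loss.
A = np.diag([1.0, 10.0])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


w = np.array([1.0, 1.0])
initial_loss = loss(w)
for _ in range(200):
    w = sam_step(w, grad)
final_loss = loss(w)
print(f"loss: {initial_loss:.3f} -> {final_loss:.5f}")
```

Each step costs two gradient evaluations instead of one. On this toy quadratic the iterates settle into a small neighborhood of the minimum rather than converging exactly, because the perturbation has a fixed radius `rho`.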

Others: 2019

  • Deep Learning via Dynamical Systems: An Approximation Perspective. [paper]

    • Qianxiao Li, Ting Lin, Zuowei Shen.
    • Key Word: Approximation Theory; Controllability.
    • Digest We build on the dynamical systems approach to deep learning, where deep residual networks are idealized as continuous-time dynamical systems, from the approximation perspective. In particular, we establish general sufficient conditions for universal approximation using continuous-time deep residual networks, which can also be understood as approximation theories in Lp using flow maps of dynamical systems.
  • Why bigger is not always better: on finite and infinite neural networks. [paper]

    • Laurence Aitchison. ICML 2020
    • Key Word: Gradient Dynamics.
    • Digest We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network.
  • Deep Learning Theory Review: An Optimal Control and Dynamical Systems Perspective. [paper] [code]

    • Guan-Horng Liu, Evangelos A. Theodorou.
    • Key Word: Mean Field Theory.
    • Digest We provide one possible way to align existing branches of deep learning theory through the lens of dynamical systems and optimal control. By viewing deep neural networks as discrete-time nonlinear dynamical systems, we can analyze how information propagates through layers using mean field theory.
  • Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. [paper] [code]

    • Yuanzhi Li, Colin Wei, Tengyu Ma. NeurIPS 2019
    • Key Word: Regularization.
    • Digest The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart.
  • Are deep ResNets provably better than linear predictors? [paper]

    • Chulhee Yun, Suvrit Sra, Ali Jadbabaie. NeurIPS 2019
    • Key Word: ResNets; Local Minima.
    • Digest We investigate whether local minima of the risk function of a deep ResNet are better than linear predictors. We present two motivating examples that show 1) the advantage of ResNets over fully-connected networks, and 2) the difficulty in analyzing deep ResNets.
  • Benign Overfitting in Linear Regression. [paper]

    • Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. PNAS
    • Key Word: Benign Overfitting.
    • Digest The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
  • Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness. [paper]

    • Fanny Yang, Zuowen Wang, Christina Heinze-Deml. NeurIPS 2019
    • Key Word: Robustness; Regularization.
    • Digest This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). Evaluated on these adversarially transformed examples, we demonstrate that adding regularization on top of standard or adversarial training reduces the relative error by 20% for CIFAR10 without increasing the computational cost.
  • Augmented Neural ODEs. [paper] [code]

    • Emilien Dupont, Arnaud Doucet, Yee Whye Teh. NeurIPS 2019
    • Key Word: Ordinary Differential Equations.
    • Digest We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.
  • On the Power and Limitations of Random Features for Understanding Neural Networks. [paper]

    • Gilad Yehudai, Ohad Shamir.
    • Key Word: Random Features.
    • Digest Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks.
  • Mean Field Analysis of Deep Neural Networks. [paper]

    • Justin Sirignano, Konstantinos Spiliopoulos.
    • Key Word: Mean Field Theory.
    • Digest We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limit procedure is valid for any number of hidden layers and it naturally also describes the limiting behavior of the training loss.
  • Machine learning meets quantum physics. [paper] [book]

    • Sankar Das Sarma, Dong-Ling Deng, Lu-Ming Duan.
    • Key Word: Physics-based Machine Learning; Quantum Physics; Quantum Chemistry.
    • Digest The marriage of machine learning and quantum physics may give birth to a new research frontier that could transform both.
  • A Mean Field Theory of Batch Normalization. [paper]

    • Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz. ICLR 2019
    • Key Word: Mean Field Theory.
    • Digest We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function.
  • Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. [paper] [code]

    • Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington. NeurIPS 2019
    • Key Word: Mean Field Theory.
    • Digest We show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel.
  • Superposition of many models into one. [paper] [code]

    • Brian Cheung, Alex Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen. NeurIPS 2019
    • Key Word: Parameter Superposition; Catastrophic Forgetting.
    • Digest We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.
  • On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points. [paper]

    • Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan. Journal of the ACM
    • Key Word: Gradient Descent; Saddle Points.
    • Digest Traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient: their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.
  • Escaping Saddle Points with Adaptive Gradient Methods. [paper]

    • Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra. ICML 2019
    • Key Word: Gradient Descent; Saddle Points.
    • Digest We seek a crisp, clean and precise characterization of the behavior of adaptive gradient methods in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points.
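
The last entry above describes adaptive methods as preconditioned SGD whose preconditioner makes gradient noise roughly isotropic near stationary points. The block below is a minimal sketch of that idea, not the paper's method: it assumes a diagonal RMSProp-style preconditioner with Adam-style bias correction, and zero-mean Gaussian gradient noise with a made-up per-coordinate scale (`noise_scale`), as one might see near a stationary point. The function name `rmsprop_preconditioned` is hypothetical.

```python
import numpy as np

def rmsprop_preconditioned(grads, beta=0.99, eps=1e-8):
    """Rescale each stochastic gradient by a running RMS estimate.

    grads: array of shape (steps, dim), one stochastic gradient per row.
    Returns the preconditioned gradients, same shape.
    """
    v = np.zeros(grads.shape[1])            # running second-moment estimate
    out = np.empty_like(grads)
    for t, g in enumerate(grads, start=1):
        v = beta * v + (1.0 - beta) * g**2  # online estimate of E[g^2]
        v_hat = v / (1.0 - beta**t)         # bias correction, as in Adam
        out[t - 1] = g / (np.sqrt(v_hat) + eps)
    return out

rng = np.random.default_rng(0)
noise_scale = np.array([10.0, 0.1])         # assumed anisotropic noise scales
grads = rng.standard_normal((5000, 2)) * noise_scale

raw_std = grads.std(axis=0)
pre_std = rmsprop_preconditioned(grads).std(axis=0)
print("raw std per coordinate:           ", raw_std)
print("preconditioned std per coordinate:", pre_std)
```

In this toy setting the raw noise differs by about two orders of magnitude between coordinates, while the preconditioned noise has a similar scale in both. A diagonal preconditioner can only equalize per-coordinate scales, so correlated noise would need a different treatment.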

Others: 2018

  • A Spline Theory of Deep Learning. [paper]

    • Randall Balestriero, Richard G. Baraniuk. ICML 2018
    • Key Word: Approximation Theory.
    • Digest We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings.
  • On Lazy Training in Differentiable Programming. [paper] [code]

    • Lenaic Chizat, Edouard Oyallon, Francis Bach. NeurIPS 2019
    • Key Word: Lazy Training.
    • Digest In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths.
  • Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. [paper] [code]

    • Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf. CVPR 2019
    • Key Word: ReLU; Adversarial Example.
    • Digest We show that ReLU-type neural networks, which yield a piecewise linear classifier function, almost always produce high-confidence predictions far away from the training data. For bounded domains like images, we propose a new robust optimization technique similar to adversarial training which enforces low-confidence predictions far away from the training data.
  • Gradient Descent Finds Global Minima of Deep Neural Networks. [paper]

    • Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai. ICML 2019
    • Key Word: Gradient Descent; Gradient Dynamics.
    • Digest Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm.
  • Memorization in Overparameterized Autoencoders. [paper]

    • Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler.
    • Key Word: Autoencoders; Memorization.
    • Digest We show that overparameterized autoencoders exhibit memorization, a form of inductive bias that constrains the functions learned through the optimization process to concentrate around the training examples, although the network could in principle represent a much larger function class. In particular, we prove that single-layer fully-connected autoencoders project data onto the (nonlinear) span of the training examples.
  • Information Geometry of Orthogonal Initializations and Training. [paper]

    • Piotr A. Sokol, Il Memming Park. ICLR 2020
    • Key Word: Mean Field Theory; Information Geometry.
    • Digest We show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness) as measured by the Fisher information matrix (FIM) and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster.
  • Gradient Descent Provably Optimizes Over-parameterized Neural Networks. [paper]

    • Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh. ICLR 2019
    • Key Word: Gradient Descent; Gradient Dynamics.
    • Digest One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with m hidden nodes, ReLU activation and n training data points, we show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function.
  • Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function. [paper]

    • Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej A. Nowak. AISTATS 2019
    • Key Word: Mean Field Theory.
    • Digest We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do so by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit.
  • Mean Field Analysis of Neural Networks: A Central Limit Theorem. [paper]

    • Justin Sirignano, Konstantinos Spiliopoulos.
    • Key Word: Mean Field Theory.
    • Digest We rigorously prove a central limit theorem for neural network models with a single hidden layer. The central limit theorem is proven in the asymptotic regime of simultaneously (A) large numbers of hidden units and (B) large numbers of stochastic gradient descent training iterations. Our result describes the neural network's fluctuations around its mean-field limit. The fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation.
  • An elementary introduction to information geometry. [paper]

    • Frank Nielsen.
    • Key Word: Survey; Information Geometry.
    • Digest In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity.
  • Deep Convolutional Networks as shallow Gaussian Processes. [paper] [code]

    • Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison. ICLR 2019
    • Key Word: Gaussian Process.
    • Digest We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN.
  • Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. [paper]

    • Yuanzhi Li, Yingyu Liang. NeurIPS 2018
    • Key Word: Stochastic Gradient Descent.
    • Digest Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels.
  • Neural Ordinary Differential Equations. [paper] [code]

    • Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. NeurIPS 2018
    • Key Word: Ordinary Differential Equations; Normalizing Flow.
    • Digest We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
  • Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. [paper] [code]

    • Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington. ICML 2018
    • Key Word: Mean Field Theory.
    • Digest We demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix.
  • Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. [paper]

    • Ryo Karakida, Shotaro Akaho, Shun-ichi Amari. AISTATS 2019
    • Key Word: Mean Field Theory; Fisher Information.
    • Digest The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value.
  • Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. [paper] [code]

    • Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro. ICLR 2019
    • Key Word: Over-Parametrization.
    • Digest We suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes (within the range reported in the experiments), and could partly explain the improvement in generalization with over-parametrization.
  • Understanding Generalization and Optimization Performance of Deep CNNs. [paper]

    • Pan Zhou, Jiashi Feng. ICML 2018
    • Key Word: Generalization of CNNs.
    • Digest We make multiple contributions to the theoretical understanding of deep CNNs. To the best of our knowledge, this work presents the first theoretical guarantees for deep CNNs on both a generalization error bound without exponential growth in network depth and optimization performance.
  • Geometric Understanding of Deep Learning. [paper]

    • Na Lei, Zhongxuan Luo, Shing-Tung Yau, David Xianfeng Gu.
    • Key Word: Manifold Representation; Learning Capability; Latent Probability Distribution Control.
    • Digest In this work, we give a geometric view to understand deep learning: we show that the fundamental principle behind its success is the manifold structure in data. Natural high-dimensional data concentrates close to a low-dimensional manifold, and deep learning learns the manifold and the probability distribution on it.
  • Tropical Geometry of Deep Neural Networks. [paper]

    • Liwen Zhang, Gregory Naitzat, Lek-Heng Lim.
    • Key Word: Tropical Geometry; Geometric Complexity.
    • Digest We establish a novel connection between feedforward neural networks with ReLU activation and tropical geometry. This equivalence allows us to characterize these neural networks using zonotopes, relate decision boundaries to tropical hypersurfaces, and establish a correspondence between linear regions and vertices of polytopes associated with tropical rational functions. Our tropical formulation reveals that deeper networks exhibit exponentially higher expressiveness compared to shallow networks. This work provides new insights into the relationship between neural networks and tropical geometry.
  • Gaussian Process Behaviour in Wide Deep Neural Networks. [paper] [code]

    • Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, Zoubin Ghahramani. ICLR 2018
    • Key Word: Gaussian Process.
    • Digest We study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
  • How to Start Training: The Effect of Initialization and Architecture. [paper]

    • Boris Hanin, David Rolnick. NeurIPS 2018
    • Key Word: Neuron Activation; Weight Initialization.
    • Digest We identify and study two common failure modes for early training in deep ReLU nets. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided.
  • The Emergence of Spectral Universality in Deep Networks. [paper]

    • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli. AISTATS 2018
    • Key Word: Mean Field Theory.
    • Digest We leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity.
  • Generalization in Machine Learning via Analytical Learning Theory. [paper] [code]

    • Kenji Kawaguchi, Yoshua Bengio, Vikas Verma, Leslie Pack Kaelbling.
    • Key Word: Regularization; Measure Theory.
    • Digest This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.
  • Stronger generalization bounds for deep nets via a compression approach. [paper]

    • Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang. ICML 2018
    • Key Word: PAC-Bayes; Compression-Based Generalization Bound.
    • Digest This paper presents a simple compression framework for proving generalization bounds, perhaps a more explicit and intuitive form of the PAC-Bayes work. It also yields elementary short proofs of recent generalization results.
  • Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients? [paper]

    • Boris Hanin. NeurIPS 2018
    • Key Word: Network Architectures.
    • Digest We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths.
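
Two entries in the 2018 list make quantitative statements about initialization: that weights with variance 2/fan-in keep the mean activation length of a ReLU net stable, and that gradient statistics are governed by the sum of reciprocals of the hidden layer widths. The sketch below checks the first numerically and computes the second. It is an illustration under my own assumptions (bias-free fully connected layers, Gaussian weights, a single random input, arbitrary widths); the function names are hypothetical.

```python
import numpy as np

def beta(hidden_widths):
    """Sum of reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in hidden_widths)

def activation_length_ratio(widths, var_scale, rng):
    """Push one random input through a bias-free ReLU net whose weights are
    drawn from N(0, var_scale / fan_in). Return the mean squared activation
    of the last layer divided by the mean squared input."""
    x = rng.standard_normal(widths[0])
    start = np.mean(x**2)
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        w = rng.standard_normal((fan_out, fan_in)) * np.sqrt(var_scale / fan_in)
        x = np.maximum(w @ x, 0.0)          # ReLU
    return np.mean(x**2) / start

rng = np.random.default_rng(0)
widths = [512] * 21                          # input plus 20 hidden layers (assumed)

ratio_he = activation_length_ratio(widths, 2.0, rng)     # variance 2 / fan_in
ratio_naive = activation_length_ratio(widths, 1.0, rng)  # variance 1 / fan_in

print("ratio with variance 2/fan_in:", ratio_he)
print("ratio with variance 1/fan_in:", ratio_naive)
print("beta for 20 layers of width 512:", beta([512] * 20))
print("beta for 20 layers of width 16: ", beta([16] * 20))
```

With variance 2/fan-in the ratio stays within an order of magnitude of 1, while with variance 1/fan-in each ReLU layer roughly halves it, so it shrinks towards zero over 20 layers. Narrower layers give a larger `beta`, which is the regime the last entry associates with larger gradient fluctuations.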

Others: 2017

  • Mean Field Residual Networks: On the Edge of Chaos. [paper]

    • Greg Yang, Samuel S. Schoenholz. NeurIPS 2017
    • Key Word: Mean Field Theory.
    • Digest In networks without skip connections, the exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial.
  • Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. [paper]

    • Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli. NeurIPS 2017
    • Key Word: Mean Field Theory.
    • Digest We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not.
  • Deep Neural Networks as Gaussian Processes. [paper]

    • Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein. ICLR 2018
    • Key Word: Gaussian Process.
    • Digest In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs.
  • Maximum Principle Based Algorithms for Deep Learning. [paper]

    • Qianxiao Li, Long Chen, Cheng Tai, Weinan E. JMLR
    • Key Word: Optimal control; Pontryagin’s Maximum Principle.
    • Digest We discuss the viewpoint that deep residual neural networks can be viewed as discretization of a continuous-time dynamical system, and hence supervised deep learning can be regarded as solving an optimal control problem in continuous time.
  • When is a Convolutional Filter Easy To Learn? [paper]

    • Simon S. Du, Jason D. Lee, Yuandong Tian. ICLR 2018
    • Key Word: Gradient Descent.
    • Digest We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for convolutional filter on non-Gaussian input distributions.
  • Implicit Regularization in Deep Learning. [paper]

    • Behnam Neyshabur. PhD Thesis
    • Key Word: Implicit Regularization.
    • Digest In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method is playing a key role in generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can implicitly regularize complexity measures.
  • Exploring Generalization in Deep Learning. [paper] [code]

    • Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro. NeurIPS 2017
    • Key Word: PAC-Bayes.
    • Digest With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.
  • Gradient Descent Can Take Exponential Time to Escape Saddle Points. [paper]

    • Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh. NeurIPS 2017
    • Key Word: Gradient Descent; Saddle Points.
    • Digest We established the failure of gradient descent to efficiently escape saddle points for general non-convex smooth functions. We showed that even under a very natural initialization scheme, gradient descent can require exponential time to converge to a local minimum whereas perturbed gradient descent converges in polynomial time. Our results demonstrate the necessity of adding perturbations for efficient non-convex optimization.
  • Stochastic Gradient Descent as Approximate Bayesian Inference. [paper]

    • Stephan Mandt, Matthew D. Hoffman, David M. Blei. JMLR
    • Key Word: Stochastic Gradient Descent; Stochastic Differential Equations; Ornstein-Uhlenbeck Process.
    • Digest The article discusses the use of Stochastic Gradient Descent (SGD) with a constant learning rate as a simulation of a Markov chain with a stationary distribution. This perspective leads to several new findings, including using constant SGD as an approximate Bayesian posterior inference algorithm by adjusting tuning parameters to match the stationary distribution to a posterior. Additionally, constant SGD can optimize hyperparameters in complex probabilistic models and be used for sampling with momentum. The article also analyzes MCMC algorithms and provides a proof of why Polyak averaging is optimal. Finally, a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler, is proposed based on this stochastic process perspective.
  • How to Escape Saddle Points Efficiently. [paper]

    • Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. ICML 2017
    • Key Word: Gradient Descent; Saddle Points.
    • Digest This paper presents the first (nearly) dimension-free result for gradient descent in a general nonconvex setting. We present a general convergence result and show how it can be further strengthened when combined with further structure such as strict saddle conditions and/or local regularity/convexity.
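
Several entries in this list contrast plain gradient descent with perturbed gradient descent near saddle points. The toy sketch below shows the qualitative difference. It is a simplified variant, not the algorithm analyzed in those papers: it adds small Gaussian noise whenever the gradient norm falls below a threshold. The test function `f(x, y) = x**2 + y**4 / 4 - y**2 / 2` (saddle at the origin, minima at `y = ±1`), the step size, and the noise level are my own choices.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    x, y = p
    return np.array([2.0 * x, y**3 - y])

def descend(p0, steps=500, lr=0.1, noise=0.0, tol=1e-3, seed=0):
    """Gradient descent; if noise > 0, perturb the iterate whenever the
    gradient norm is below tol (a simplified perturbation rule)."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        if noise > 0.0 and np.linalg.norm(grad(p)) < tol:
            p = p + noise * rng.standard_normal(2)
        p = p - lr * grad(p)
    return p

start = (1.0, 0.0)                 # lies on the line y = 0 that leads into the saddle
plain = descend(start)             # no perturbation
perturbed = descend(start, noise=1e-3)

print("plain GD ends at:    ", plain)
print("perturbed GD ends at:", perturbed)
```

Started on the line `y = 0`, plain gradient descent never leaves it and converges to the saddle at the origin. The perturbed run picks up a small `y` component once the gradient is small, which the negative curvature then amplifies until the iterate settles near one of the two minima.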

Others: 2016

  • Understanding Deep Neural Networks with Rectified Linear Units. [paper]

    • Raman Arora, Amitabh Basu, Poorya Mianjy, Anirbit Mukherjee. ICLR 2018
    • Key Word: ReLU.
    • Digest In this paper we investigate the family of functions representable by deep neural networks (DNN) with rectified linear units (ReLU). We give an algorithm to train a ReLU DNN with one hidden layer to *global optimality* with runtime polynomial in the data size albeit exponential in the input dimension. Further, we improve on the known lower bounds on size (from exponential to super exponential) for approximating a ReLU deep net function by a shallower ReLU net.
  • Deep Information Propagation. [paper]

    • Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein. ICLR 2017
    • Key Word: Mean Field Theory.
    • Digest We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters.
  • Why Deep Neural Networks for Function Approximation? [paper]

    • Shiyu Liang, R. Srikant. ICLR 2017
    • Key Word: Function Approximation.
    • Digest Recently there has been much interest in understanding why deep neural networks are preferred to shallow networks. We show that, for a large class of piecewise smooth functions, the number of neurons needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation. First, we consider univariate functions on a bounded interval and require a neural network to achieve an approximation error of ε uniformly over the interval. We show that shallow networks (i.e., networks whose depth does not depend on ε) require Ω(poly(1/ε)) neurons while deep networks (i.e., networks whose depth grows with 1/ε) require O(polylog(1/ε)) neurons.
  • Why does deep and cheap learning work so well? [paper]

    • Henry W. Lin, Max Tegmark, David Rolnick. Journal of Statistical Physics
    • Key Word: Physics.
    • Digest We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through "cheap learning" with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks.
  • Exponential expressivity in deep neural networks through transient chaos. [paper] [code]

    • Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli. NeurIPS 2016
    • Key Word: Mean Field Theory; Riemannian Geometry.
    • Digest We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in deep neural networks with random weights. Our results reveal a phase transition in the expressivity of random deep networks, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth, but not with width. We prove that this generic class of random functions cannot be efficiently computed by any shallow network, going beyond prior work that restricts their analysis to single functions.
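
The mean field entries above describe an order-to-chaos phase transition in random deep networks. A rough numerical sketch of that picture follows, assuming a tanh network with weight variance `sigma_w2 / fan_in` and bias variance `sigma_b2`: it iterates the layer-to-layer variance map to its fixed point `q*` and evaluates the slope usually written χ₁, with χ₁ < 1 read as the ordered phase and χ₁ > 1 as the chaotic phase. The hyperparameter values and function names are illustrative assumptions, and the Gaussian averages use Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes and weights for the weight function exp(-z**2 / 2);
# dividing by sqrt(2 * pi) turns the sum into an average over z ~ N(0, 1).
_nodes, _weights = hermegauss(61)
_weights = _weights / np.sqrt(2.0 * np.pi)

def gauss_mean(f):
    """Approximate E[f(z)] for z ~ N(0, 1)."""
    return float(np.sum(_weights * f(_nodes)))

def variance_map(q, sigma_w2, sigma_b2):
    """One layer of the pre-activation variance recursion for tanh."""
    return sigma_w2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2

def variance_fixed_point(sigma_w2, sigma_b2, q0=1.0, iters=500):
    """Iterate the variance map from q0 for a fixed number of layers."""
    q = q0
    for _ in range(iters):
        q = variance_map(q, sigma_w2, sigma_b2)
    return q

def chi_1(sigma_w2, q_star):
    """sigma_w2 * E[tanh'(sqrt(q*) z)**2], using tanh' = 1 - tanh**2."""
    return sigma_w2 * gauss_mean(
        lambda z: (1.0 - np.tanh(np.sqrt(q_star) * z) ** 2) ** 2
    )

q_ordered = variance_fixed_point(0.5, 0.05)
q_chaotic = variance_fixed_point(4.0, 0.05)
chi_ordered = chi_1(0.5, q_ordered)
chi_chaotic = chi_1(4.0, q_chaotic)

print("sigma_w2 = 0.5: q* =", q_ordered, "chi_1 =", chi_ordered)
print("sigma_w2 = 4.0: q* =", q_chaotic, "chi_1 =", chi_chaotic)
```

The small-weight setting gives χ₁ below 1 and the large-weight setting gives χ₁ above 1, matching the two phases described in the digests.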

Related Resources

Contributing

You are welcome to recommend papers that you find interesting and that focus on deep phenomena. You can submit an issue or contact me via [email]. If you find any errors in the paper information, please feel free to point them out.

Formatting (papers are listed in reverse chronological order of their initial submission to arXiv)

  • Paper Title [paper]
    • Authors. Published Conference or Journal
    • Key Word: XXX.
    • Digest XXXXXX