Distributional Generalization in Natural Language Processing.

Yao Fu, University of Edinburgh

**Update**: How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

**Update**: A Closer Look at Language Model Emergent Abilities

**Update**: Reasoning with large language models


Although human language seems trivial and is used effortlessly every day, our observation and knowledge of it are restricted, biased, and ultimately finite. Yet the variation in human language is at least combinatorially large, and potentially exponential or even infinite. How can we generalize to such a large space from such limited observation? Are current models capable of generalizing to unseen linguistic scenarios? If one feeds all available data on the internet to a gigantic language model, can it learn everything about human language? If not, what is (still) missing? These are the problems that we would like to study through the lens of generalization.

Table of Contents

Development of Theory

| Theory | Model Scale | #Param | Data |
| --- | --- | --- | --- |
| Classical Supervised Learning Theory | Small model (e.g., SVMs) | < 1M | i.i.d. data |
| Deep Supervised Learning Theory | Small neural networks | < 1M | i.i.d. data |
| Generalization in Transfer Setting | Mid neural networks | < 1G | non-i.i.d. data, structures within data, data distributional shift |
| Compositional Generalization in NLP | Pretrained language model | < 11G | non-i.i.d. data, linguistic structures within data, domain transfer, task transfer, cross-lingual transfer |
| Emergent Abilities in Large Language Models | Largest models so far (GPT-3, PaLM, etc.) | > 100G | i.i.d. or non-i.i.d. data, few-shot in-context learning |

  • Classical Supervised Learning Theory
    • Small model (e.g., SVMs), i.i.d. data
    • PAC, Rademacher Complexity, VC Dimension
  • Deep Supervised Learning Theory
    • Large model, neural networks, i.i.d. data
    • Over-parameterization, Regularization, Non-convex Optimization, Neural Tangent Kernel
  • Generalization in Transfer Setting
    • Large model, neural networks, non-i.i.d. data
    • Distribution Shift, Domain Adaptation, Robustness, Invariance
  • Compositional Generalization in NLP
    • Even larger models (pretrained language models), non-i.i.d. data with intrinsic structures
    • Semantic Parsing, Question Answering, Language Generation
  • Emergent Abilities in Large Language Models
    • Largest models so far (GPT-3, PaLM), few-shot in-context learning, reasoning with rationales
  • Practical Techniques
    • Datasets, Data Augmentation, Architecture Design, Distributionally Robust Optimization, etc.
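
For orientation on the first entry in the outline above, one standard form of a Rademacher-complexity generalization bound is sketched here. The notation is an assumption introduced for illustration and does not come from this list: $\mathcal{G}$ is a class of loss functions taking values in $[0,1]$, $S = (z_1, \dots, z_m)$ is an i.i.d. sample from a distribution $\mathcal{D}$, and $\mathfrak{R}_m(\mathcal{G})$ is the Rademacher complexity of $\mathcal{G}$.

```latex
% With probability at least 1 - \delta over the draw of S,
% the following holds simultaneously for every g in \mathcal{G}:
\mathbb{E}_{z \sim \mathcal{D}}\bigl[g(z)\bigr]
  \;\le\;
  \frac{1}{m}\sum_{i=1}^{m} g(z_i)
  \;+\; 2\,\mathfrak{R}_m(\mathcal{G})
  \;+\; \sqrt{\frac{\log(1/\delta)}{2m}}
```

The bound relies on $S$ being drawn i.i.d. from the same distribution used at test time, which is the assumption that the transfer and compositional settings later in the outline relax.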



Classical Theory


Deep Learning Theory

Training Dynamics


Gradient Descent


Neural Tangent Kernel

Mean-Field Analysis


Generalization Bounds


Reasoning with Large Language Models


  • Stanford CS324 - Large Language Models [link]
  • Princeton COS 597G - Understanding Large Language Models [link]
  • UNC COMP790-101 - Large Language Models [link]

Chain-of-Thought Series

Scratchpad
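
As a rough illustration of the prompting style that the chain-of-thought and scratchpad headings refer to, the sketch below formats few-shot demonstrations with intermediate rationales into a single prompt string. The `Demo` type, the `build_prompt` helper, the `Q:`/`A:` layout, and the example questions are all hypothetical choices made for this sketch. They are not taken from any paper in this list, and no model is called.

```python
from typing import List, Tuple

# Hypothetical demonstration triple: (question, rationale, answer).
Demo = Tuple[str, str, str]


def build_prompt(demos: List[Demo], query: str, with_rationale: bool = True) -> str:
    """Format few-shot demonstrations followed by an unanswered query."""
    blocks = []
    for question, rationale, answer in demos:
        lines = [f"Q: {question}"]
        if with_rationale:
            # Rationale-style demonstration: reasoning steps precede the answer.
            lines.append(f"A: {rationale} The answer is {answer}.")
        else:
            # Plain few-shot demonstration: answer only.
            lines.append(f"A: The answer is {answer}.")
        blocks.append("\n".join(lines))
    # The query is left open so that a language model would continue after "A:".
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


demos = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. 3 + 2 = 5.",
        "5",
    ),
]
prompt = build_prompt(
    demos, "A shelf holds 4 books and 6 more are added. How many books are there?"
)
print(prompt)
```

Passing `with_rationale=False` yields the plain few-shot variant of the same prompt, which makes it easy to compare the two formats on identical demonstrations.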

Transfer Setting

  • Distributional Generalization: A New Kind of Generalization. Preetum Nakkiran and Yamini Bansal

Domain Adaptation & Generalization

Generalization in Natural Language Processing


Semantic Parsing

Question Answering

Reading Comprehension


Adversarial Perturbation


NLP Architecture Learnability

Distributionally Robust Optimization

Sharpness-aware Minimization


Practical Techniques


Data Augmentation

