Just helping myself keep track of LLM papers that I'm reading, with an emphasis on inference and model compression.

Transformer Architectures

Attention Is All You Need
Fast Transformer Decoding: One Write-Head is All You Need - Multi-Query Attention
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Hyena Hierarchy: Towards Larger Convolutional Language Models

Foundation Models

LLaMA: Open and Efficient Foundation Language Models
PaLM: Scaling Language Modeling with Pathways
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Language Models are Unsupervised Multitask Learners (OpenAI) - GPT-2
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
OpenLLaMA: An Open Reproduction of LLaMA
Llama 2: Open Foundation and Fine-Tuned Chat Models
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Position Encoding

KV Cache

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Jun. 2023)
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Activation

Pruning

Optimal Brain Damage (1990)
Optimal Brain Surgeon (1993)
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (Jan. 2023) - Introduces Optimal Brain Quantization based on the Optimal Brain Surgeon
Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A Simple and Effective Pruning Approach for Large Language Models - Introduces Wanda (pruning with Weights and Activations)
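To make the Wanda idea concrete for myself: each weight is scored by its magnitude times the L2 norm of the matching input activation, and the lowest-scoring weights in each output row are dropped. This is a minimal NumPy sketch, not the authors' code; the function name wanda_prune, the toy shapes, and the random calibration data are all assumptions for illustration.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """W: (out_features, in_features); X: (n_tokens, in_features) calibration activations."""
    # L2 norm of each input feature over the calibration tokens.
    act_norm = np.linalg.norm(X, axis=0)        # shape (in_features,)
    # Score = |weight| * activation norm, broadcast across output rows.
    score = np.abs(W) * act_norm
    # Number of weights to drop in every output row.
    k = int(W.shape[1] * sparsity)
    # Column indices of the k lowest-scoring weights per row.
    drop = np.argsort(score, axis=1)[:, :k]
    mask = np.ones(W.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask

# Toy data (hypothetical): 4 output features, 8 input features, 16 tokens.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
pruned = wanda_prune(W, X, sparsity=0.5)
```

No weight update or retraining happens here; the surviving weights are left unchanged.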

Quantization

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - Quantization with outlier handling. Might be solving the wrong problem - see "Quantizable Transformers" below.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Another approach to quantization with outliers
Understanding and Overcoming the Challenges of Efficient Transformer Quantization
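A quick reminder of why outliers matter for the papers above. This is a plain absmax int8 sketch, assuming a single scale per tensor; it is not the mixed-precision scheme of LLM.int8() or the scaling trick of SmoothQuant. The helper names and the outlier value of 60.0 are made up for illustration.

```python
import numpy as np

def absmax_quantize(x):
    # One scale for the whole tensor, chosen so the largest magnitude maps to 127.
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

def mean_abs_error(x):
    # Round-trip error on every element except index 0 (where the outlier is placed).
    q, scale = absmax_quantize(x)
    return float(np.abs(dequantize(q, scale) - x)[1:].mean())

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)
x_outlier = x.copy()
x_outlier[0] = 60.0   # a single large activation stretches the quantization range

err_clean = mean_abs_error(x)
err_outlier = mean_abs_error(x_outlier)
print(err_clean, err_outlier)
```

The single outlier widens the quantization step for every other value, so the round-trip error on the ordinary values goes up.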

Normalization

Root Mean Square Layer Normalization
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - Introduces clipped softmax and gated attention, and argues that outliers arise when attention heads push softmax inputs to extremes in order to learn a no-op

Sparsity and rank compression

Compressing Pre-trained Language Models by Decomposition - Vanilla SVD decomposition to reduce matrix sizes
Language model compression with weighted low-rank factorization - Fisher information-weighted SVD
Numerical Optimizations for Weighted Low-rank Estimation on Language Model - Iterative implementation for the above
Weighted Low-Rank Approximation (2003)
Transformers learn through gradual rank increase
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
TRP: Trained Rank Pruning for Efficient Deep Neural Networks - Introduces energy-pruning ratio
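For reference, the vanilla SVD route looks roughly like this: factor a weight matrix, keep the top singular components, and store two thin matrices instead of one large one. This is a sketch under my own assumptions; the rank is picked by a cumulative squared-singular-value ("energy") threshold, which may not match TRP's exact definition, and the name low_rank_factors is hypothetical.

```python
import numpy as np

def low_rank_factors(W, energy=0.9):
    """Return A (m, r) and B (r, n) with A @ B approximating W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Fraction of total squared singular values captured by the top components.
    cum = np.cumsum(S**2) / np.sum(S**2)
    # Smallest rank whose captured energy reaches the threshold.
    r = min(int(np.searchsorted(cum, energy)) + 1, len(S))
    A = U[:, :r] * S[:r]   # fold singular values into the left factor
    B = Vt[:r, :]
    return A, B

# Toy matrix with singular values 4, 3 and 0.
W = np.zeros((4, 3))
W[0, 0] = 3.0
W[1, 1] = 4.0
A, B = low_rank_factors(W, energy=0.9)
```

The parameter count drops from m * n to r * (m + n), so this only pays off when r is well below min(m, n).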

Fine-tuning

LoRA: Low-Rank Adaptation of Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - works over a range of ranks
Full Parameter Fine-tuning for Large Language Models with Limited Resources
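A small sketch of the LoRA forward pass to keep the shapes straight: the pretrained weight stays frozen and a rank-r update B @ A, scaled by alpha / r, is added on top. This is NumPy only with no training loop; the class name LoRALinear and the default values are assumptions for illustration, not any library's API.

```python
import numpy as np

class LoRALinear:
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        out_f, in_f = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, in_f))   # small random init
        self.B = np.zeros((out_f, r))                     # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus the low-rank update, computed without forming B @ A.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # After training, the update can be folded into W for inference.
        return self.W + self.scale * self.B @ self.A

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 5))
x = rng.normal(size=(3, 5))
layer = LoRALinear(W, r=2)
y = layer.forward(x)
```

Only A and B would receive gradients during fine-tuning, which is r * (in + out) parameters per layer instead of in * out.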

Sampling

Scaling

Efficiently Scaling Transformer Inference (Google Nov. 2022) - Pipeline and tensor parallelization for inference
Megatron-LM (Nvidia Mar. 2020) - Intra-layer parallelism for training

Watermarking

More
