
Awesome-Efficient-LLM

A curated list for Efficient Large Language Models

Full List

Please check out all the papers by selecting the sub-area you're interested in. On this main page, only papers released in the past 90 days are shown.

🚀 Updates

  • May 29, 2024: We've had this awesome list for a year now 🥰!
  • Sep 6, 2023: Added a new subdirectory project/ to organize efficient LLM projects.
  • July 11, 2023: Created a new subdirectory efficient_plm/ to house papers that are applicable to pre-trained language models (PLMs).

💮 Contributing

If you'd like to include your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in generate_item.py and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.
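
For illustration, the snippet below sketches what such an entry generator might do: collect a paper's metadata and print a formatted entry. The field names and output format here are hypothetical stand-ins, not the actual interface of generate_item.py.

```python
# Hypothetical sketch only; see generate_item.py in this repo for the real fields.
paper = {
    "title": "An Example Paper on Efficient LLMs",      # placeholder metadata
    "authors": "First Author, Second Author",
    "paper_url": "https://arxiv.org/abs/0000.00000",
    "code_url": "https://github.com/example/example",   # optional; empty string if none
}

# Build the link portion, skipping entries without a URL.
links = " | ".join(
    f"[{label}]({url})"
    for label, url in (("GitHub", paper["code_url"]), ("Paper", paper["paper_url"]))
    if url
)
print(f"- **{paper['title']}** by {paper['authors']} {links}")
```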

⭐ Recommended Paper

For each topic, we have curated a list of recommended papers that have garnered many GitHub stars or citations.

Papers from Sep 30, 2024 to now (see the full list of papers since May 22, 2023 here)

Quick Link

Network Pruning / Sparsity

- **SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot** by Elias Frantar, Dan Alistarh [GitHub | Paper]
- **LLM-Pruner: On the Structural Pruning of Large Language Models** by Xinyin Ma, Gongfan Fang, Xinchao Wang [GitHub | Paper]
- **A Simple and Effective Pruning Approach for Large Language Models** by Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter [GitHub | Paper]
- **Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning** by Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen [GitHub | Paper]
- **MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models** by Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang [GitHub | Paper]
- **HashAttention: Semantic Sparsity for Faster Inference** by Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica [Paper]
- **Adaptive Pruning for Large Language Models with Structural Importance Awareness** by Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han [Paper]
- **SlimGPT: Layer-wise Structured Pruning for Large Language Models** by Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu [Paper]
- **Less is More: Towards Green Code Large Language Models via Unified Structural Pruning** by Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue Zhuo, Taolue Chen [Paper]
- **Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking** by Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough [Paper]
- **Puzzle: Distillation-Based NAS for Inference-Optimized LLMs** by Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah et al [Paper]
- **Reassessing Layer Pruning in LLMs: New Insights and Methods** by Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu [GitHub | Paper]
- **Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity** by Zichen Song, Sitan Huang, Yuxin Wu, Zhongfeng Kang [Paper]
- **AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment** by Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin [GitHub | Paper]
- **Scaling Law for Post-training after Model Pruning** by Xiaodong Chen, Yuxuan Hu, Jing Zhang, Xiaokang Zhang, Cuiping Li, Hong Chen [Paper]
- **DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization** by Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu [GitHub | Paper]
- **Sparsing Law: Towards Large Language Models with Greater Activation Sparsity** by Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun [GitHub | Paper]
- **AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis** by Zichen Song, Yuxin Wu, Sitan Huang, Zhongfeng Kang [Paper]
- **Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts** by Danyal Aftab, Steven Davy [Paper]
- **LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment** by Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu [GitHub | Paper]
- **Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs** by Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen [Paper]
- **EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search** by Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh [GitHub | Paper]
- **FedSpaLLM: Federated Pruning of Large Language Models** by Guangji Bai, Yijiang Li, Zilinghan Li, Liang Zhao, Kibaek Kim [Paper]
- **Pruning Foundation Models for High Accuracy without Retraining** by Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin [GitHub | Paper]
- **Self-calibration for Language Model Quantization and Pruning** by Miles Williams, George Chrysostomou, Nikolaos Aletras [Paper]
- **Beware of Calibration Data for Pruning Large Language Models** by Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang [Paper]
- **AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models** by Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang [GitHub | Paper]
- **Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix** by Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou [Paper]
- **DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models** by Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu [Paper]
- **Self-Data Distillation for Recovering Quality in Pruned Large Language Models** by Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie [Paper]
- **LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models** by David Hoffmann, Kailash Budhathoki, Matthaeus Kleindessner [Paper]
- **Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning** by Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu [GitHub | Paper]
- **Mitigating Copy Bias in In-Context Learning through Neuron Pruning** by Ameen Ali, Lior Wolf, Ivan Titov [Paper]
- **SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models** by Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain [GitHub | Paper]
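
Several of the one-shot methods above score each weight against a small set of calibration activations and zero out the lowest-scoring entries; the metric in "A Simple and Effective Pruning Approach for Large Language Models", for example, combines weight magnitude with input activation norms. Below is a minimal sketch of that family of ideas, not a faithful reimplementation of any listed paper.

```python
import torch

def prune_linear_oneshot(weight, calib_acts, sparsity=0.5):
    """One-shot pruning of a linear layer's weight matrix.

    weight:     (out_features, in_features)
    calib_acts: (n_samples, in_features) calibration activations
    Scores each weight by |w| times the L2 norm of its input channel,
    then zeroes the lowest-scoring fraction given by `sparsity`.
    """
    act_norm = calib_acts.norm(dim=0)               # (in_features,)
    score = weight.abs() * act_norm                 # broadcasts over rows
    k = int(weight.numel() * sparsity)
    threshold = score.flatten().kthvalue(k).values  # k-th smallest score
    return weight * (score > threshold)

W, X = torch.randn(8, 16), torch.randn(32, 16)
W_sparse = prune_linear_oneshot(W, X)
print(f"achieved sparsity: {(W_sparse == 0).float().mean():.2f}")
```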

Knowledge Distillation

- **Knowledge Distillation of Large Language Models** by Yuxian Gu, Li Dong, Furu Wei, Minlie Huang [GitHub | Paper]
- **Self-Evolution Knowledge Distillation for LLM-based Machine Translation** by Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang [Paper]
- **Large Language Models Compression via Low-Rank Feature Distillation** by Yaya Sy, Christophe Cerisara, Irina Illina [Paper]
- **Distilling Fine-grained Sentiment Understanding from Large Language Models** by Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu [GitHub | Paper]
- **Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting** by Vijay Goyal, Mustafa Khan, Aprameya Tirupati, Harveer Saini, Michael Lam, Kevin Zhu [GitHub | Paper]
- **Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation** by Xunyu Zhu, Jian Li, Can Ma, Weiping Wang [Paper]
- **Generative Context Distillation** by Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo [GitHub | Paper]
- **SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models** by Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung [Paper]
- **Beyond Autoregression: Fast LLMs via Self-Distillation Through Time** by Justin Deschenaux, Caglar Gulcehre [GitHub | Paper]
- **Pre-training Distillation for Large Language Models: A Design Space Exploration** by Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li [Paper]
- **MiniPLM: Knowledge Distillation for Pre-Training Language Models** by Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang [GitHub | Paper]
- **Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling** by Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister [Paper]
- **Evolutionary Contrastive Distillation for Language Model Alignment** by Julian Katz-Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi [Paper]
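
Most of the methods above refine the classic soft-label objective: match the student's temperature-softened distribution to the teacher's, blended with the ordinary hard-label loss. A minimal sketch of that common starting point:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label distillation: KL between temperature-softened teacher and
    student distributions, mixed with cross-entropy on the hard labels.
    Illustrative baseline only; the papers above refine this objective."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # rescale so gradients match the hard-label term's magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 100, requires_grad=True)
teacher = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
print(distillation_loss(student, teacher, labels))
```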

Quantization

- **GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers** by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh [GitHub | Paper]
- **SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models** by Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han [GitHub | Paper]
- **AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration** by Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han [GitHub | Paper]
- **OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models** by Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo [GitHub | Paper]
- **ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals** by Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang [GitHub | Paper]
- **MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design** by Zhen Zheng, Xiaonan Song, Chuanjie Liu [Paper]
- **GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference** by Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Xing Mei, Lean Fu [Paper]
- **LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment** by Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong [Paper]
- **SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization** by Runsheng Bai, Qiang Liu, Bo Liu [Paper]
- **CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models** by Amitash Nanda, Sree Bhargavi Balija, Debashis Sahoo [Paper]
- **Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format** by Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst [Paper]
- **MixPE: Quantization and Hardware Co-design for Efficient LLM Inference** by Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu [Paper]
- **BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration** by Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah [GitHub | Paper]
- **AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference** by Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, Jungwook Choi [Paper]
- **Bi-Mamba: Towards Accurate 1-Bit State Space Models** by Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen [Paper]
- **"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization** by Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh [Paper]
- **GWQ: Gradient-Aware Weight Quantization for Large Language Models** by Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu et al [Paper]
- **A Comprehensive Study on Quantization Techniques for Large Language Models** by Jiedong Lang, Zhehao Guo, Shuyu Huang [Paper]
- **BitNet a4.8: 4-bit Activations for 1-bit LLMs** by Hongyu Wang, Shuming Ma, Furu Wei [Paper]
- **TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction** by Yuhang Li, Priyadarshini Panda [GitHub | Paper]
- **BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments** by Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu [GitHub | Paper]
- **The Impact of Inference Acceleration Strategies on Bias of LLMs** by Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, Muhammad Bilal Zafar [Paper]
- **Understanding the difficulty of low-precision post-training quantization of large language models** by Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang [Paper]
- **1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs** by Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei [GitHub | Paper]
- **QuAILoRA: Quantization-Aware Initialization for LoRA** by Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan [Paper]
- **Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks** by Enkhbold Nyamsuren [Paper]
- **SqueezeLLM: Dense-and-Sparse Quantization** by Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer [GitHub | Paper]
- **Pyramid Vector Quantization for LLMs** by Tycho F. A. van der Ouderaa, Maximilian L. Croci, Agrin Hilmkil, James Hensman [Paper]
- **SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators** by Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi [Paper]
- **FlatQuant: Flatness Matters for LLM Quantization** by Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, Jun Yao [GitHub | Paper]
- **SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs** by Mohammad Mozaffari, Maryam Mehri Dehnavi [GitHub | Paper]
- **Scaling laws for post-training quantized large language models** by Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang [Paper]
- **Continuous Approximations for Improving Quantization Aware Training of LLMs** by He Li, Jianhang Hong, Yuanzhuo Wu, Snehal Adbol, Zonglin Li [Paper]
- **DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs** by Yingsong Luo, Ling Chen [GitHub | Paper]
- **Quamba: A Post-Training Quantization Recipe for Selective State Space Models** by Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu [GitHub | Paper]
- **AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations** by Qian Tao, Wenyuan Yu, Jingren Zhou [Paper]
- **Channel-Wise Mixed-Precision Quantization for Large Language Models** by Zihan Chen, Bike Xie, Jundong Li, Cong Shen [Paper]
- **Progressive Mixed-Precision Decoding for Efficient LLM Inference** by Hao Mark Chen, Fuwen Tan, Alexandros Kouris, Royson Lee, Hongxiang Fan, Stylianos I. Venieris [Paper]
- **EXAQ: Exponent Aware Quantization For LLMs Acceleration** by Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy [GitHub | Paper]
- **PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs** by Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo [GitHub | Paper]
- **Extreme Compression of Large Language Models via Additive Quantization** by Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh [GitHub | Paper]
- **Scaling Laws for Mixed quantization in Large Language Models** by Zeyu Cao, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Yiren Zhao [Paper]
- **PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms** by Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, Suman Banerjee [Paper]
- **CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression** by Wenyuan Liu, Xindian Ma, Peng Zhang, Yan Wang [Paper]
- **SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration** by Jintao Zhang, Jia wei, Pengle Zhang, Jun Zhu, Jianfei Chen [Paper]
- **Addition is All You Need for Energy-efficient Language Models** by Hongyin Luo, Wei Sun [Paper]
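
Nearly all of the post-training methods above improve on the same baseline: round-to-nearest quantization with a per-channel scale. A minimal symmetric int8 sketch of that baseline (methods such as GPTQ, AWQ, and SmoothQuant refine how the scales and rounding are chosen):

```python
import torch

def quantize_int8_per_channel(w):
    """Symmetric round-to-nearest int8 quantization with one scale per
    output channel (row). Returns int8 weights plus the scales needed
    to dequantize."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4, 16)
q, scale = quantize_int8_per_channel(w)
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
```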

Inference Acceleration

- **Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time** by Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen [GitHub | Paper]
- **SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification** by Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia [GitHub | Paper]
- **Efficient Streaming Language Models with Attention Sinks** by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis [GitHub | Paper]
- **EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation** by Yuhui Li, Chao Zhang, Hongyang Zhang [GitHub | Blog]
- **Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads** by Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao [GitHub | Paper]
- **Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration** by Zhuofan Wen, Shangtong Gui, Yang Feng [Paper]
- **PLD+: Accelerating LLM inference by leveraging Language Model Artifacts** by Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena [Paper]
- **FastDraft: How to Train Your Draft** by Ofir Zafrir, Igor Margulis, Dorin Shteyman, Guy Boudoukh [Paper]
- **SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents** by Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen [GitHub | Paper]
- **The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation** by Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto [Paper]
- **Accelerated AI Inference via Dynamic Execution Methods** by Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu [Paper]
- **SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference** by Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao [Paper]
- **Dynamic Strategy Planning for Efficient Question Answering with Large Language Models** by Tanmay Parekh, Pradyot Prakash, Alexander Radovic, Akshay Shekher, Denis Savenkov [Paper]
- **MagicPIG: LSH Sampling for Efficient LLM Generation** by Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen [GitHub | Paper]
- **Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition** by Artem Basharin, Andrei Chertkov, Ivan Oseledets [Paper]
- **Efficient Inference for Augmented Large Language Models** by Rana Shahout, Cong Liang, Shiji Xin, Qianru Lao, Yong Cui, Minlan Yu, Michael Mitzenmacher [Paper]
- **Dynamic Vocabulary Pruning in Early-Exit LLMs** by Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec [GitHub | Paper]
- **CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation** by Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen [GitHub | Paper]
- **DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads** by Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han [GitHub | Paper]
- **DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure** by Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou [Paper]
- **QSpec: Speculative Decoding with Complementary Quantization Schemes** by Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu [Paper]
- **TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention** by Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia [Paper]
- **ParallelSpec: Parallel Drafter for Efficient Speculative Decoding** by Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu [Paper]
- **SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration** by Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li [GitHub | Paper]
- **TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text** by Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang [GitHub | Paper]
- **A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts** by Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng [Paper]
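
Many of the entries above are speculative-decoding variants: a cheap draft model proposes a block of tokens, and the target model verifies the block in a single pass, so several tokens can be accepted per target forward. The toy greedy version below uses stand-in "models" to show the control flow only; real implementations batch the verification step and handle sampling:

```python
def draft_next(seq):
    """Stand-in draft model: cheap and usually, but not always, right."""
    return (seq[-1] + 1) % 50

def target_next(seq):
    """Stand-in target model, treated as ground truth."""
    return (seq[-1] + 1) % 50 if seq[-1] % 7 else (seq[-1] + 2) % 50

def speculative_decode(seq, n_new=20, k=4):
    seq = list(seq)
    produced = 0
    while produced < n_new:
        # 1. The draft proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target checks each proposal (one batched pass in practice).
        accepted = 0
        for i in range(k):
            if target_next(seq + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        # 3. Keep the agreeing prefix, then take one token from the target:
        #    the correction on a mismatch, or a free extra token otherwise.
        seq += draft[:accepted]
        seq.append(target_next(seq))
        produced += accepted + 1
    return seq

print(speculative_decode([0]))
```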

Efficient MoE

- **Fast Inference of Mixture-of-Experts Language Models with Offloading** by Artyom Eliseev, Denis Mazur [GitHub | Paper]
- **Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning** by Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin [GitHub | Paper]
- **Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference** by Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi [Paper]
- **MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization** by Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li [GitHub | Paper]
- **MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition** by Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan [GitHub | Paper]
- **HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference** by Peng Tang, Jiacheng Liu, Xiaofeng Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, Chao Li, Minyi Guo [Paper]
- **ProMoE: Fast MoE-based LLM Serving using Proactive Caching** by Xiaoniu Song, Zihang Zhong, Rong Chen [Paper]
- **ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference** by Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon [Paper]
- **EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference** by Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai [Paper]
- **MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More** by Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi [GitHub | Paper]
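
The systems above all optimize around the same routing pattern: a gate scores the experts for each token and only the top-k experts run, which is what makes expert offloading, caching, and pruning pay off. A toy dense-loop sketch of that pattern:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy top-k expert routing: the gate picks k experts per token and the
    output is their gate-weighted sum. Illustrates the routing pattern only,
    not a serving system."""
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.gate(x)                      # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1) # top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):                    # plain loops for clarity
            for e in range(len(self.experts)):
                sel = idx[:, j] == e               # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, j:j+1] * self.experts[e](x[sel])
        return out

moe = TopKMoE(dim=16)
print(moe(torch.randn(5, 16)).shape)
```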

Efficient Architecture of LLM

- **MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT** by Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan [GitHub | Paper | Model]
- **Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length** by Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou [GitHub | Paper]
- **Taipan: Efficient and Expressive State Space Language Models with Selective Attention** by Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen [Paper]
- **SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs** by Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang [GitHub | Paper]
- **Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression** by Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang [GitHub | Paper]
- **Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions** by Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin [Paper]
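
One recurring idea in this section is reusing parameters across layers, as in Basis Sharing. The toy sketch below assumes each layer's weight is a learned mix of a small shared basis; it illustrates the general idea, not the method of any listed paper.

```python
import torch
import torch.nn as nn

class SharedBasisLinear(nn.Module):
    """Each layer's weight is a learned combination of a few shared basis
    matrices, so n_layers layers cost roughly n_basis weight matrices.
    Toy illustration of cross-layer parameter sharing."""
    def __init__(self, dim, n_layers, n_basis=2):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_basis, dim, dim) / dim ** 0.5)
        self.coef = nn.Parameter(torch.randn(n_layers, n_basis))

    def forward(self, x, layer):
        # Mix the shared basis with this layer's coefficients.
        w = torch.einsum("b,bij->ij", self.coef[layer], self.basis)
        return x @ w.T

m = SharedBasisLinear(dim=16, n_layers=12)
x = torch.randn(4, 16)
for layer in range(12):
    x = torch.relu(m(x, layer))
print(x.shape)
```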

KV Cache Compression

- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs** by Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao [Paper]
- **ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression** by Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, Minyi Guo [Paper]
- **Unifying KV Cache Compression for Large Language Models with LeanKV** by Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen [Paper]
- **Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity** by Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu [Paper]
- **MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache** by Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang [Paper]
- **TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection** by Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, Hui Xiong [Paper]
- **Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning** by Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao [GitHub | Paper]
- **BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference** by Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He [GitHub | Paper]
- **A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference** by You Wu, Haoyi Wu, Kewei Tu [GitHub | Paper]
- **Lossless KV Cache Compression to 2%** by Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang [Paper]
- **MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection** by Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng [Paper]
- **Residual vector quantization for KV cache compression in large language model** by Ankur Kumar [GitHub | Paper]
- **KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing** by Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen [GitHub | Paper]
- **LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy** by Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, Yelong Shen [Paper]
- **SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation** by Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He [Paper]
- **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference** by Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti [Paper]
- **KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head** by Isaac Rehg [Paper]
- **Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference** by Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou [GitHub | Paper]
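
A common pattern above is cache eviction under a fixed token budget: keep the most recent tokens plus the older tokens that have accumulated the most attention, and drop the rest. The toy eviction step below illustrates that heavy-hitter style of policy, not any specific paper:

```python
import torch

def evict_kv(keys, values, attn_scores, budget, recent=8):
    """keys/values: (seq, d); attn_scores: (seq,) accumulated attention mass.
    Keep the `recent` newest tokens unconditionally, then fill the remaining
    budget with the highest-scoring older tokens. Toy policy for illustration."""
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values, attn_scores
    old = seq - recent                              # older tokens eligible for eviction
    keep_old = attn_scores[:old].topk(budget - recent).indices.sort().values
    keep = torch.cat([keep_old, torch.arange(old, seq)])
    return keys[keep], values[keep], attn_scores[keep]

k, v, s = torch.randn(32, 64), torch.randn(32, 64), torch.rand(32)
k2, v2, s2 = evict_kv(k, v, s, budget=16)
print(k2.shape)  # torch.Size([16, 64])
```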

Text Compression

- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models** by Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu [GitHub | Paper]
- **L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression** by Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song [GitHub | Paper]
- **PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics** by Daniil Larionov, Steffen Eger [GitHub | Paper]
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression** by Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu [GitHub | Paper]
- **A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression** by Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou [Paper]
- **JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services** by Feiran You, Hongyang Du, Kaibin Huang, Abbas Jamalipour [Paper]
- **Generative Context Distillation** by Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo [GitHub | Paper]
- **MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression** by Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard [GitHub | Paper]
- **Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability** by Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung [Paper]
- **From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression** by Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee [Paper]
- **Perception Compressor: A training-free prompt compression method in long context scenarios** by Jiwei Tang, Jin Xu, Tingwei Lu, Hai Lin, Yiming Zhao, Hai-Tao Zheng [Paper]
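
The prompt-compression line of work above (LLMLingua and its successors) typically ranks tokens by how informative a small language model finds them and drops the rest. The toy sketch below uses unigram self-information as a crude stand-in for an LM's token-level surprisal:

```python
import math
from collections import Counter

def compress_prompt(text, keep_ratio=0.6):
    """Keep the highest self-information words, preserving their order.
    Toy stand-in: -log p(word) under unigram counts from the text itself,
    instead of a real LM's per-token perplexity."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    info = [-math.log(counts[w] / total) for w in words]
    n_keep = max(1, int(len(words) * keep_ratio))
    cutoff = sorted(info, reverse=True)[n_keep - 1]
    kept = [w for w, i in zip(words, info) if i >= cutoff]
    return " ".join(kept[:n_keep])

prompt = "the model the model answers the question about the pruning of the model"
print(compress_prompt(prompt, keep_ratio=0.5))
```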

Low-Rank Decomposition

- **Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning** by Arijit Das [GitHub | Paper]
- **CompAct: Compressed Activations for Memory-Efficient LLM Training** by Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster [Paper]
- **ESPACE: Dimensionality Reduction of Activations for Model Compression** by Charbel Sakr, Brucek Khailany [Paper]
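
All three entries above build on the same primitive: replace a large matrix with a low-rank factorization so one big matmul becomes two smaller ones. The truncated-SVD baseline:

```python
import torch

def low_rank_factor(w, rank):
    """Factor one (out, in) weight matrix into two rank-r factors via
    truncated SVD; this pays off whenever r * (out + in) < out * in."""
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r), singular values folded in
    B = Vh[:rank]                # (r, in)
    return A, B

w = torch.randn(64, 64)
A, B = low_rank_factor(w, rank=8)
print("relative error:", ((w - A @ B).norm() / w.norm()).item())
```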

Hardware/System/Serving

- **KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management** by Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen [Paper]
- **FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving** by Ao Shen, Zhiyao Li, Mingyu Gao [Paper]
- **CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration** by Hongpeng Jin, Yanzhao Wu [Paper]
- **Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management** by Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren [Paper]
- **ALISE: Accelerating Large Language Model Serving with Speculative Scheduling** by Youpeng Zhao, Jun Wang [Paper]
- **EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models** by Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie [Paper]
- **SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training** by Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao [Paper]
- **FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs** by Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan et al [Paper]
- **POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference** by Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar [Paper]
- **TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices** by Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu [GitHub | Paper]
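
On the serving side, one widely used memory-management idea (popularized by paged attention) is allocating KV cache in fixed-size blocks from a shared pool rather than as contiguous per-sequence buffers. The toy allocator below shows only the bookkeeping; it is an illustration, not any listed system:

```python
class PagedKVAllocator:
    """Toy fixed-size block allocator for KV cache memory. Sequences grow by
    whole blocks from a shared free list, so memory fragments far less than
    with contiguous per-sequence buffers."""
    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))
        self.tables = {}   # seq_id -> list of block ids
        self.lengths = {}  # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # current block full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted; evict or preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):         # sequence finished: return its blocks
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(n_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token(seq_id=0)
print(alloc.tables)   # {0: [0, 1]} -> three tokens occupy two blocks
alloc.release(0)
print(alloc.free)
```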

Tuning

- **HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization** by Huaqin Zhao, Jiaxi Li, Yi Pan, Shizhe Liang, Xiaofeng Yang, Wei Liu, Xiang Li, Fei Dou, Tianming Liu, Jin Lu [Paper]
- **Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation** by Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty [GitHub | Paper]
- **MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning** by Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, Wei Zhu [Paper]
- **RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates** by Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi [GitHub | Paper]
- **Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models** by Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu [GitHub | Paper]
- **Parameter-Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning** by Nusrat Jahan Prottasha, Asif Mahmud, Md. Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay [Paper]
- **QEFT: Quantization for Efficient Fine-Tuning of LLMs** by Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park [GitHub | Paper]
- **BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models** by Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma [GitHub | Paper]
- **SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers** by Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets [GitHub | Paper]
- **SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching** by Tianyi Zhang, Junda Su, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava [Paper]
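
Most entries in this section are parameter-efficient fine-tuning variants that freeze the pretrained weights and train a small update; LoRA's low-rank form is the common reference point. A minimal sketch (the listed papers change where the update lives, how it is quantized, or how the budget is allocated):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = x @ (W + (alpha / r) * B @ A).T. Baseline LoRA shape; illustrative,
    not any single listed paper's variant."""
    def __init__(self, base, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(32, 32))
print(layer(torch.randn(4, 32)).shape)
```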

Efficient Training

- **LayerDropBack: A Universally Applicable Approach for Accelerating Training of Deep Networks** by Evgeny Hershkovitch Neiterman, Gil Ben-Artzi [GitHub | Paper]
- **AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning** by Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Zekai Liu, Shichao Weng [Paper]
- **Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention** by Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou [GitHub | Paper]
- **Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs** by Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King [Paper]
- **COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training** by Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han [GitHub | Paper]
- **BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training** by Houming Wu, Ling Chen, Wenjie Yu [GitHub | Paper]
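
Several entries above cut training cost by skipping parts of the network or compressing training state. The toy block below shows classic stochastic depth, where a residual block is skipped with some probability during training; it is a simple relative of the layer-dropping approaches listed, not the exact mechanism of any of them:

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block skipped with probability p_drop during training,
    saving its forward and backward compute; at eval time it always runs,
    scaled by its survival probability."""
    def __init__(self, dim, p_drop=0.5):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.p_drop = p_drop

    def forward(self, x):
        if self.training:
            if torch.rand(()) < self.p_drop:
                return x                      # skip the whole block this step
            return x + self.ff(x)
        return x + (1 - self.p_drop) * self.ff(x)

block = StochasticDepthBlock(16)
block.train(); _ = block(torch.randn(4, 16))
block.eval(); y = block(torch.randn(4, 16))
print(y.shape)
```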

Survey (or Benchmark)

- **Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding** by Hyun Ryu, Eric Kim [Paper]
- **LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators** by Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al [GitHub | Paper]
- **Prompt Compression for Large Language Models: A Survey** by Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier [GitHub | Paper]
- **Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective** by Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai [Paper]