Starred repositories
Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
A lightweight, standalone C++ inference engine for Google's Gemma models.
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
The road to hacking SysML and becoming a systems expert
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
PyTorch native quantization and sparsity for training and inference
Advanced Quantization Algorithm for LLMs/VLMs.
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
AutoAWQ implements the AWQ algorithm for 4-bit quantization, with a 2x speedup during inference.
Netflix-level subtitle cutting, translation, alignment, and even dubbing: a one-click, fully automated AI video subtitle team
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
VPTQ: a flexible, extreme low-bit quantization algorithm
Coding tutorials accompanying my YouTube videos on neural network quantization.
This repository accompanies the book "Grokking Deep Learning".
Introduction to Machine Learning Systems
Efficient Deep Learning Systems course materials (HSE, YSDA)
My ComfyUI workflows collection
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Protocol Buffers - Google's data interchange format
FlatBuffers: Memory Efficient Serialization Library
Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
CVNets: A library for training computer vision networks
Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.