- Support for act-order models (a bit slow for now)
- Support for v1 models without groupsize (nah)
- Test more models
- Consider support for loading GGML models (not feasible)
- Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
- Support for ROCm/AMD GPUs
- Optimize more for ROCm
- Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- Test performance on P40 (would be a good GPU to support)
- Improve performance on P40
- Tunable kernel parameters
- More tunable kernel parameters
- Test on Windows
- Easier extension loading on Windows
- Setup instructions for Windows
- Figure out an apples-to-apples way of comparing perplexity with other implementations
- Compile charts of inference speed vs context length for a variety of models, compare to other implementations
- Test a bunch of LoRAs to make sure all combinations of rank and target layers work
- Fix layer streaming so it isn't unusably slow (removed)
- Allow layer streaming to integrate with other features like device splitting (nope)
- Provide alternative backend to allow layers on CPU (nah)
- Support for de-quantizing select matrices at load time
- Better vector-matrix multiplication for de-quantized matrices (dequant was a dead end)
- Fused QKV projection (sketch after this list)
- Fused MLP
- Fused RoPE
- Build attention mask in CUDA rather than PyTorch
- Disable attention mask when it isn't needed (not possible with SDP)
- Figure out why inference appears to be CPU-bound (kernel launch overhead)
- Reduce the number of kernel launches to a minimum (tail launch, fusion, etc.)
- Measure PyTorch module overhead (negligible in eval mode)
- Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not; see the sketch after this list)
- Implement attention in CUDA
- Rewrite at least the quantized matmul kernel; there should be a bunch of special cases to consider
- Experiment with concurrent streams where possible (fused MLP and QKV proj.; sketch after this list)
- Faster low-rank matmul to speed up LoRAs
- Memory-efficient beam search implementation
- Optimized beam search
- Multi-token censoring/de-censoring
- Multi-token repetition penalties (sketch after this list)
- (Multi) LoRA support
- Allow stackable LoRAs (sketch after this list)
- Guided generation (chat with multiple bots at once, etc.)
- Multiple chat modes with prompt templates (instruct, etc.)
- Batched generation
- Simple web interface?
- API server
- Controls to enable beam search
- Rewrite/refactor all the JavaScript and CSS
- Support for prompt formats/instruct mode
- Make it a little prettier
- Test various edge cases
- Better error handling
- LoRA controls
- FP8/FP16 overlays
- Allow for backpropagation
- LoRA training features
- Soft prompt training
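
The "Fused QKV projection" item amounts to concatenating the three attention projection weights and running one matmul, then splitting the result. A minimal PyTorch sketch, assuming plain FP16 weights rather than the actual quantized kernels (the function names are hypothetical):

```python
import torch

def fuse_qkv(q_weight, k_weight, v_weight):
    # nn.Linear-style weights of shape (out_features, in_features);
    # concatenating along the output dim gives one (3 * out, in) matrix.
    return torch.cat([q_weight, k_weight, v_weight], dim=0)

def qkv_forward(x, fused_weight, out_features):
    # One large matmul instead of three smaller ones, then split back
    # into the Q, K and V activations.
    qkv = x @ fused_weight.t()
    return qkv.split(out_features, dim=-1)
```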
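
On the scaled_dot_product_attention question: for a single query token against a cached key/value sequence, attention reduces to two small matmuls and a softmax, with no mask needed. A sketch of that special case, assuming a (heads, seq_len, head_dim) cache layout (the function name is hypothetical):

```python
import torch

def single_token_attention(q, k_cache, v_cache):
    # q: (heads, 1, head_dim); k_cache, v_cache: (heads, seq_len, head_dim).
    # A single query attends to every cached position, so no causal mask
    # is required and the general SDP machinery is overkill.
    scores = (q @ k_cache.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_cache  # (heads, 1, head_dim)
```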
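
For the concurrent-streams item: independent projections (e.g. the gate and up projections of the MLP, or Q/K/V before fusing) can be issued on separate CUDA streams so they overlap when a single kernel doesn't saturate the GPU. A rough sketch with hypothetical names; in practice any gain depends on kernel size and launch overhead:

```python
import torch

def run_concurrently(x, w_a, w_b):
    # Two independent matmuls issued on separate CUDA streams.
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    torch.cuda.synchronize()          # make sure x is ready on both streams
    with torch.cuda.stream(stream_a):
        a = x @ w_a
    with torch.cuda.stream(stream_b):
        b = x @ w_b
    torch.cuda.synchronize()          # join before the results are consumed
    return a, b
```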
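
One possible reading of "multi-token repetition penalties" is penalizing tokens that would complete an n-gram already present in the output, rather than penalizing individual token IDs. A hypothetical sketch; the names, n-gram length and penalty scheme are assumptions, not the project's actual design:

```python
def ngram_repetition_penalty(logits, output_ids, n=3, penalty=1.15):
    # Find every earlier position where the last (n - 1) generated tokens
    # already occurred, and penalize the token that followed them there.
    if len(output_ids) < n - 1:
        return logits
    prefix = tuple(output_ids[-(n - 1):])
    banned = set()
    for i in range(len(output_ids) - n + 1):
        if tuple(output_ids[i:i + n - 1]) == prefix:
            banned.add(output_ids[i + n - 1])
    for token in banned:
        # Same convention as the usual single-token repetition penalty.
        logits[token] = logits[token] / penalty if logits[token] > 0 else logits[token] * penalty
    return logits
```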
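
For "Allow stackable LoRAs" (and the low-rank matmul item): each LoRA contributes a low-rank delta, and computing x @ A before multiplying by B keeps the intermediate at rank r, which is what makes the extra matmuls cheap; summing the scaled deltas is one straightforward way multiple LoRAs could compose. A sketch under those assumptions:

```python
import torch

def linear_with_loras(x, base_weight, loras):
    # base_weight: (in_features, out_features)
    # loras: list of (A, B, scale) with A: (in_features, r), B: (r, out_features)
    y = x @ base_weight
    for A, B, scale in loras:
        # (x @ A) first keeps the intermediate at the small rank r.
        y = y + scale * ((x @ A) @ B)
    return y
```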