- Support for act-order models (a bit slow for now)
- Support for v1 models without groupsize (nah)
- Test more models
- Consider support for loading GGML models (not feasible)
- Figure out if there are quantized models with irregular groupsize (there are some at least with no groupsize)
- Support for ROCm/AMD GPUs
- Optimize more for ROCm
- Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- Test performance on P40 (would be a good GPU to support)
- Improve performance on P40
- Tunable kernel parameters
- More tunable kernel parameters
- Test on Windows
- Easier extension loading on Windows
- Setup instructions for Windows
- Figure out an apples-to-apples way of comparing perplexity with other implementations
- Compile charts of inference speed vs context length for a variety of models, compare to other implementations
- Test a bunch of LoRAs to make sure all combinations of rank and target layers work
- Fix layer streaming so it isn't unusably slow (removed)
- Allow layer streaming to integrate with other features like device splitting (nope)
- Provide alternative backend to allow layers on CPU (nah)
- Support for de-quantizing select matrices at load time
- Better vector-matrix multiplication for de-quantized matrices (dequant was a dead end)
- Fused QKV projection (sketch after this list)
- Fused MLP
- Fused RoPE
- Build attention mask in CUDA rather than PyTorch
- Disable attention mask when it isn't needed (not possible with SDP)
- Figure out why inference appears to be CPU-bound (kernel launch overhead)
- Reduce the number of kernel launches to a minimum (tail launch, fusion, etc.)
- Measure PyTorch module overhead (negligible in eval mode)
- Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not; see the sketch after this list)
- Implement attention in CUDA
- Rewrite at least the quantized matmul kernel; there should be a bunch of special cases to consider
- Experiment with concurrent streams where possible (fused MLP and QKV proj.; sketch after this list)
- Faster low-rank matmul to speed up LoRAs
- Memory-efficient beam search implementation
- Optimized beam search
- Multi-token censoring/de-censoring
- Multi-token repetition penalties (sketch after this list)
- (Multi) LoRA support
- Allow stackable LoRAs (sketch after this list)
- Guided generation (chat with multiple bots at once, etc.)
- Multiple chat modes with prompt templates (instruct, etc.)
- Batched generation
- Simple web interface?
- API server
- Controls to enable beam search
- Rewrite/refactor all the JavaScript and CSS
- Support for prompt formats/instruct mode
- Make it a little prettier
- Test various edge cases
- Better error handling
- LoRA controls
- FP8/FP16 overlays
- Allow for backpropagation
- LoRA training features
- Soft prompt training
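
The "Fused QKV projection" item amounts to concatenating the three attention projection weights and running one matmul, then splitting the result. A minimal PyTorch sketch, assuming plain FP16 weights rather than the actual quantized kernels (the function names are hypothetical):

```python
import torch

def fuse_qkv(q_weight, k_weight, v_weight):
    # nn.Linear-style weights of shape (out_features, in_features);
    # concatenating along the output dim gives one (3 * out, in) matrix.
    return torch.cat([q_weight, k_weight, v_weight], dim=0)

def qkv_forward(x, fused_weight, out_features):
    # One large matmul instead of three smaller ones, then split back
    # into the Q, K and V activations.
    qkv = x @ fused_weight.t()
    return qkv.split(out_features, dim=-1)
```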
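
On the scaled_dot_product_attention question: for a single query token against a cached key/value sequence, attention reduces to two small matmuls and a softmax, with no mask needed. A sketch of that special case, assuming a (heads, seq_len, head_dim) cache layout (the function name is hypothetical):

```python
import torch

def single_token_attention(q, k_cache, v_cache):
    # q: (heads, 1, head_dim); k_cache, v_cache: (heads, seq_len, head_dim).
    # A single query attends to every cached position, so no causal mask
    # is required and the general SDP machinery is overkill.
    scores = (q @ k_cache.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v_cache  # (heads, 1, head_dim)
```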
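
For the concurrent-streams item: independent projections (e.g. the gate and up projections of the MLP, or Q/K/V before fusing) can be issued on separate CUDA streams so they overlap when a single kernel doesn't saturate the GPU. A rough sketch with hypothetical names; in practice any gain depends on kernel size and launch overhead:

```python
import torch

def run_concurrently(x, w_a, w_b):
    # Two independent matmuls issued on separate CUDA streams.
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    torch.cuda.synchronize()          # make sure x is ready on both streams
    with torch.cuda.stream(stream_a):
        a = x @ w_a
    with torch.cuda.stream(stream_b):
        b = x @ w_b
    torch.cuda.synchronize()          # join before the results are consumed
    return a, b
```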
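
One possible reading of "multi-token repetition penalties" is penalizing tokens that would complete an n-gram already present in the output, rather than penalizing individual token IDs. A hypothetical sketch; the names, n-gram length and penalty scheme are assumptions, not the project's actual design:

```python
def ngram_repetition_penalty(logits, output_ids, n=3, penalty=1.15):
    # Find every earlier position where the last (n - 1) generated tokens
    # already occurred, and penalize the token that followed them there.
    if len(output_ids) < n - 1:
        return logits
    prefix = tuple(output_ids[-(n - 1):])
    banned = set()
    for i in range(len(output_ids) - n + 1):
        if tuple(output_ids[i:i + n - 1]) == prefix:
            banned.add(output_ids[i + n - 1])
    for token in banned:
        # Same convention as the usual single-token repetition penalty.
        logits[token] = logits[token] / penalty if logits[token] > 0 else logits[token] * penalty
    return logits
```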
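
For "Allow stackable LoRAs" (and the low-rank matmul item): each LoRA contributes a low-rank delta, and computing x @ A before multiplying by B keeps the intermediate at rank r, which is what makes the extra matmuls cheap; summing the scaled deltas is one straightforward way multiple LoRAs could compose. A sketch under those assumptions:

```python
import torch

def linear_with_loras(x, base_weight, loras):
    # base_weight: (in_features, out_features)
    # loras: list of (A, B, scale) with A: (in_features, r), B: (r, out_features)
    y = x @ base_weight
    for A, B, scale in loras:
        # (x @ A) first keeps the intermediate at the small rank r.
        y = y + scale * ((x @ A) @ B)
    return y
```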