This repository contains the code for a decoder-only transformer, similar to Llama or GPT. It was trained on an English corpus built from the seven Harry Potter books and has roughly 75M trainable parameters.
- Tokenization: Byte pair encoding (SentencePiece)
- FlashAttention, Grouped Query Attention
- Rotary Position Embeddings
- Key Value Cache
- Sampling: top-p, top-k (see the sketch after this list)
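The snippet below is a minimal sketch of how top-k and top-p (nucleus) filtering are typically combined when sampling the next token. The function name, default values, and use of PyTorch are illustrative assumptions, not this repository's actual API.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50, top_p: float = 0.9) -> int:
    """Sample one token id from a (vocab_size,) logits vector (illustrative only)."""
    logits = logits / max(temperature, 1e-8)

    # Top-k: keep only the k highest-scoring tokens.
    if top_k is not None and top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= p.
    if top_p is not None and top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs_sorted = torch.softmax(sorted_logits, dim=-1)
        cum_probs = probs_sorted.cumsum(dim=-1)
        # Drop a token only if the mass of strictly better tokens already exceeds p,
        # so the token that crosses the threshold is still kept.
        cutoff = cum_probs - probs_sorted > top_p
        sorted_logits = sorted_logits.masked_fill(cutoff, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(0, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

In a generation loop this would be applied to the logits of the last position, one step at a time, with the sampled token appended to the sequence and the key/value cache reused for the already-processed prefix.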
| Parameter | Value |
|---|---|
| Layers | 4 |
| Model Dimension | 768 |
| Context Length | 1024 |
| Attention Heads | 8 |
| Key/Value Heads | 4 |
| Vocabulary Size | 32000 |
| RoPE Theta | 10000 |
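As a reading aid, the table corresponds to a configuration object roughly like the one sketched below; the field names are assumptions for illustration and may differ from the ones used in the code.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the hyperparameter table above (field names are illustrative).
    n_layers: int = 4            # transformer blocks
    dim: int = 768               # model (embedding) dimension
    context_length: int = 1024   # maximum sequence length
    n_heads: int = 8             # query heads
    n_kv_heads: int = 4          # shared key/value heads (grouped query attention)
    vocab_size: int = 32000      # SentencePiece BPE vocabulary
    rope_theta: float = 10000.0  # rotary position embedding base
```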
TODO
- Distributed training
- Finetuning with (Q)LoRA (see the sketch below)
- Add Mixture of Experts model
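For the (Q)LoRA item, the general idea is to freeze the pretrained weights and train only low-rank adapters added to the linear projections. The class below is an illustrative sketch of a plain LoRA layer, not code from this repository.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # only the adapters are trained
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

For QLoRA the frozen base weights would additionally be stored quantized (e.g. 4-bit), while the adapters stay in higher precision.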