Unsloth reduces VRAM usage substantially and also speeds up training, even on low-end consumer GPUs. I'm using an NVIDIA RTX 3060 with 12 GB of VRAM to fine-tune this model, and it used only about 8 GB of VRAM with a per-device batch size of 32. You can see all the hyperparameters below:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

STEP = 100
BATCH_SIZE = 8
MAX_LENGTH = 2048

training_args = TrainingArguments(
    output_dir="outputs",                               # placeholder output directory
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=int(16 / BATCH_SIZE),   # effective batch size of 16
    optim="adamw_bnb_8bit",                             # 8-bit AdamW from bitsandbytes saves optimizer memory
    learning_rate=2e-3,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_steps=STEP,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),                       # True on Ampere GPUs such as the RTX 3060
)
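The VRAM savings come from Unsloth loading the base model with 4-bit quantized weights and training only small LoRA adapters. Below is a minimal sketch of that loading step; the model name, sequence length, and LoRA settings are illustrative assumptions, not values taken from fine-tune.py:

from unsloth import FastLanguageModel

# Assumed base checkpoint -- replace with the model actually used in fine-tune.py.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # hypothetical placeholder
    max_seq_length=2048,
    dtype=None,          # let Unsloth pick fp16/bf16 automatically
    load_in_4bit=True,   # 4-bit quantized weights keep VRAM usage low
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)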
- Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
Note: You have to install CUDA and the other GPU dependencies to run this script on a GPU; otherwise training will be too slow to finish.
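A quick way to confirm that PyTorch can actually see the GPU before starting a run (a small sanity check, not part of the repo):

import torch

# Prints True plus the device name when CUDA is installed correctly;
# False means training would fall back to the (much slower) CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))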
- Run:
python fine-tune.py
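The script wires the model, tokenizer, and the TrainingArguments above into a trainer. A rough sketch of that structure, assuming the common Unsloth + trl SFTTrainer setup and an older trl API that accepts these arguments directly; the dataset and its "text" column are assumptions, not details from fine-tune.py:

from trl import SFTTrainer

# `model`, `tokenizer`, and `training_args` come from the snippets above;
# `dataset` is assumed to be a Hugging Face Dataset with a "text" column.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,          # older trl API; newer releases use processing_class
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_LENGTH,
    args=training_args,
)
trainer.train()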