
Training Qwen/Qwen2.5-Coder-32B-Instruct model OOM #6942

Open
1 task done
mertunsall opened this issue Feb 14, 2025 · 3 comments
Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@mertunsall

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.31
  • Python version: 3.12.7
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.0.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.15.3
  • vLLM version: 0.6.5

Reproduction

I don't understand how I should set the configs in order to fine-tune a 32B or 70B model with full SFT on a single 8xH100 node. Do I need to use a specific DeepSpeed ZeRO-3 config?

Currently, I run CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train config.yaml where config.yaml looks like:

### model
model_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: config/deepspeed/ds_z3_offload.json # copied from llama-factory repo

### dataset
dataset: mydataset
template: qwen
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 64

### output
output_dir: /mnt/models/Qwen2.5-Coder-32B-Instruct-reasoning-prover-ft-120225
logging_steps: 1000
save_strategy: epoch
save_steps: 1
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 100
per_device_eval_batch_size: 1
eval_strategy: epoch
eval_steps: 1

and I get OOM.
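
For reference, the ds_z3_offload.json I copied follows the usual ZeRO-3 + CPU-offload layout; a minimal sketch of that kind of config is below (the exact file in the llama-factory repo may differ):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}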

Others

No response

mertunsall added the bug and pending labels on Feb 14, 2025
@hiyouga (Owner) commented Feb 14, 2025

Have you tried enable_liger_kernel: true?
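
A sketch of where the flag would go, reusing the YAML from the issue body (only the extra line is new):

### model
model_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct
enable_liger_kernel: true  # fused Triton kernels from Liger-Kernel, intended to reduce activation memory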

@mertunsall (Author) commented Feb 14, 2025

I haven't - I will try. Either way, shouldn't 8xH100 memory be enough to train a 32B model?

Also - what does enable_liger_kernel do?
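
Doing some rough math on my side (model states only, full-parameter Adam in mixed precision at roughly 16 bytes per parameter: bf16 weights + bf16 grads + fp32 master weights + fp32 Adam moments):

32e9 params x 16 bytes ≈ 512 GB of model states
8 x 80 GB (H100)       = 640 GB of total HBM

so before counting activations and CUDA overhead it only fits if ZeRO-3 actually shards those states across the 8 GPUs, which is presumably why the DeepSpeed config matters so much here.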

@mertunsall (Author)

I seem to get another error this time:

AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

I will try fixing this.
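
From searching around, this error usually means DeepSpeed's CPU Adam extension failed to JIT-compile (e.g. no matching CUDA toolkit headers on the machine). One commonly suggested workaround is reinstalling DeepSpeed with the op prebuilt, something like:

DS_BUILD_CPU_ADAM=1 pip install deepspeed==0.15.3 --force-reinstall --no-cache-dir

(assuming a CUDA toolkit compatible with the cu124 PyTorch build is installed).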
