
Training Qwen/Qwen2.5-Coder-32B-Instruct model OOM #6942

Open
1 task done
mertunsall opened this issue Feb 14, 2025 · 3 comments
Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@mertunsall

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.31
  • Python version: 3.12.7
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.0.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H100 80GB HBM3
  • DeepSpeed version: 0.15.3
  • vLLM version: 0.6.5

Reproduction

I don't understand how I should set the configs in order to fine-tune a 32B or 70B model with full SFT on a single 8xH100 node. Do I need to use a specific DeepSpeed ZeRO-3 config?

Currently, I run CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train config.yaml where config.yaml looks like:

### model
model_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: config/deepspeed/ds_z3_offload.json # copied from llama-factory repo

### dataset
dataset: mydataset
template: qwen
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 64

### output
output_dir: /mnt/models/Qwen2.5-Coder-32B-Instruct-reasoning-prover-ft-120225
logging_steps: 1000
save_strategy: epoch
save_steps: 1
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 100
per_device_eval_batch_size: 1
eval_strategy: epoch
eval_steps: 1

and I get OOM.
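
For reference, the ds_z3_offload.json I copied follows the usual ZeRO-3 + CPU-offload layout; a minimal sketch of that kind of config is below (the exact file in the llama-factory repo may differ):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}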

Others

No response

mertunsall added the bug and pending labels on Feb 14, 2025
@hiyouga (Owner) commented Feb 14, 2025

Have you tried enable_liger_kernel: true?
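
A sketch of where the flag would go, reusing the YAML from the issue body (only the extra line is new):

### model
model_name_or_path: Qwen/Qwen2.5-Coder-32B-Instruct
enable_liger_kernel: true  # fused Triton kernels from Liger-Kernel, intended to reduce activation memory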

@mertunsall (Author) commented Feb 14, 2025

I haven't - I will try. Either way, shouldn't 8xH100 memory be enough to train a 32B model?

Also - what does enable_liger_kernel do?
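
Doing some rough math on my side (model states only, full-parameter Adam in mixed precision at roughly 16 bytes per parameter: bf16 weights + bf16 grads + fp32 master weights + fp32 Adam moments):

32e9 params x 16 bytes ≈ 512 GB of model states
8 x 80 GB (H100)       = 640 GB of total HBM

so before counting activations and CUDA overhead it only fits if ZeRO-3 actually shards those states across the 8 GPUs, which is presumably why the DeepSpeed config matters so much here.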

@mertunsall (Author)

I seem to get another error this time:

AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

I will try fixing this.
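
From searching around, this error usually means DeepSpeed's CPU Adam extension failed to JIT-compile (e.g. no matching CUDA toolkit headers on the machine). One commonly suggested workaround is reinstalling DeepSpeed with the op prebuilt, something like:

DS_BUILD_CPU_ADAM=1 pip install deepspeed==0.15.3 --force-reinstall --no-cache-dir

(assuming a CUDA toolkit compatible with the cu124 PyTorch build is installed).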
