
Something goes wrong when saving the trained model with the DeepSpeed stage 3 optimization config #3399

Open
ZYM66 opened this issue Feb 16, 2025 · 1 comment

ZYM66 commented Feb 16, 2025

System Info

- `Accelerate` version: 1.3.0
- Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.28
- `accelerate` bash location: /data1/zym/miniconda3/envs/openr1/bin/accelerate
- Python version: 3.12.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 2266.76 GB
- GPU type: NVIDIA H20
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The script I use:

    model, optimizer, train_dataloader, test_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, test_dataloader, lr_scheduler
    )
    
    # evaluate(model, test_dataloader, accelerator, max_length=args.max_length)
    # train(model, tokenizer, optimizer, lr_scheduler, train_dataloader, test_dataloader, accelerator, 
    #       max_length=args.max_length, num_epochs=args.num_epochs)
    
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
    if accelerator.is_main_process:
        tokenizer.save_pretrained(args.output_dir)
    accelerator.end_training()
    accelerator.print("Training finished")

I omit the training code because it doesn't affect reproducing this bug.
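
For reference, the DeepSpeed section of the Accelerate docs also shows a saving variant that passes the gathered state dict to save_pretrained explicitly via accelerator.get_state_dict(model). A minimal sketch of that variant, reusing the same accelerator, model, and args as above (not something I have verified fixes this):

    # Sketch of the docs' ZeRO-3 saving pattern: accelerator.get_state_dict(model)
    # gathers the parameters partitioned across ranks into a full state dict
    # before save_pretrained writes it out.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )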

I got two different configurations from the official Accelerate example: one for stage 2 optimization and the other for stage 3 optimization.

  • for stage 2
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  • for stage 3
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
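
As far as I understand, zero3_save_16bit_model: true in this YAML is translated by Accelerate into stage3_gather_16bit_weights_on_model_save: true in the generated DeepSpeed config. A quick way to inspect what the plugin actually produced (a sketch, assuming the script is launched with the stage 3 YAML above):

    # Print the DeepSpeed config dict that Accelerate generated for this run and
    # check zero_optimization.stage3_gather_16bit_weights_on_model_save.
    from accelerate import Accelerator

    accelerator = Accelerator()
    print(accelerator.state.deepspeed_plugin.deepspeed_config)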

When I run this code with the stage 2 config, I can load the trained model successfully (the model has been saved properly):

[screenshot: the stage 2 checkpoint loads without errors]

But when I run this code with the stage 3 config, I can't load the trained model:

[screenshot: loading the stage 3 checkpoint fails]

The loading code:

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(path, device):
    tokenizer = AutoTokenizer.from_pretrained(path, padding_side="left", use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(path, ignore_mismatched_sizes=True).to(device)
    return model, tokenizer
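
To see what actually ended up on disk after the stage 3 run, here is a small inspection sketch; it assumes a single-file model.safetensors in the output directory (a 7B checkpoint may instead be sharded into several files):

    # Diagnostic sketch: list a few tensors and their shapes from the saved checkpoint,
    # to check whether the ZeRO-3 weights were actually gathered before saving.
    from safetensors import safe_open

    with safe_open("output_dir/model.safetensors", framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_slice(name).get_shape())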

Expected behavior

I tried this with qwen2.5-7B and llama3.2-3B. Both of them have this saving problem when using the stage 3 optimization config.

@saurav935

I am also facing the same issue. Can we get some help, please? @sgugger
