
Something goes wrong when saving the trained model with the DeepSpeed stage 3 optimization config #3399

Open
ZYM66 opened this issue Feb 16, 2025 · 1 comment

ZYM66 commented Feb 16, 2025

System Info

- `Accelerate` version: 1.3.0
- Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.28
- `accelerate` bash location: /data1/zym/miniconda3/envs/openr1/bin/accelerate
- Python version: 3.12.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 2266.76 GB
- GPU type: NVIDIA H20
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

The script I use:

    model, optimizer, train_dataloader, test_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, test_dataloader, lr_scheduler
    )
    
    # evaluate(model, test_dataloader, accelerator, max_length=args.max_length)
    # train(model, tokenizer, optimizer, lr_scheduler, train_dataloader, test_dataloader, accelerator, 
    #       max_length=args.max_length, num_epochs=args.num_epochs)
    
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
    if accelerator.is_main_process:
        tokenizer.save_pretrained(args.output_dir)
    accelerator.end_training()
    accelerator.print("Training finished")

I omit the training code because it doesn't affect reproducing this bug.
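
For reference, the DeepSpeed section of the Accelerate docs also shows a saving variant that passes the gathered state dict to save_pretrained explicitly via accelerator.get_state_dict(model). A minimal sketch of that variant, reusing the same accelerator, model, and args as above (not something I have verified fixes this):

    # Sketch of the docs' ZeRO-3 saving pattern: accelerator.get_state_dict(model)
    # gathers the parameters partitioned across ranks into a full state dict
    # before save_pretrained writes it out.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )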

I got two different configurations from the official Accelerate example: one for stage 2 optimization and the other for stage 3 optimization.

  • for stage 2
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  • for stage 3
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
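
As far as I understand, zero3_save_16bit_model: true in this YAML is translated by Accelerate into stage3_gather_16bit_weights_on_model_save: true in the generated DeepSpeed config. A quick way to inspect what the plugin actually produced (a sketch, assuming the script is launched with the stage 3 YAML above):

    # Print the DeepSpeed config dict that Accelerate generated for this run and
    # check zero_optimization.stage3_gather_16bit_weights_on_model_save.
    from accelerate import Accelerator

    accelerator = Accelerator()
    print(accelerator.state.deepspeed_plugin.deepspeed_config)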

When I run this code with the stage 2 config, I can load the trained model successfully (the model has been saved properly):

[screenshot: the stage 2 checkpoint loads without errors]

But when I run this code with the stage 3 config, I can't load the trained model:

[screenshot: loading the stage 3 checkpoint fails]

The loading code:

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model_and_tokenizer(path, device):
    tokenizer = AutoTokenizer.from_pretrained(path, padding_side="left", use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(path, ignore_mismatched_sizes=True).to(device)
    return model, tokenizer
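
To see what actually ended up on disk after the stage 3 run, here is a small inspection sketch; it assumes a single-file model.safetensors in the output directory (a 7B checkpoint may instead be sharded into several files):

    # Diagnostic sketch: list a few tensors and their shapes from the saved checkpoint,
    # to check whether the ZeRO-3 weights were actually gathered before saving.
    from safetensors import safe_open

    with safe_open("output_dir/model.safetensors", framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_slice(name).get_shape())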

Expected behavior

I tried this with qwen2.5-7B and llama3.2-3B. Both of them have this saving problem when using the stage 3 optimization config.

@saurav935

I am also facing the same issue. Can we get some help, please? @sgugger
