
Transformers test_cpu_offload tests fail with KeyError: 'xpu:0' #3402

Open
dvrogozh opened this issue Feb 20, 2025 · 1 comment · May be fixed by #3403

Comments

@dvrogozh Contributor

With:

On:

  • Intel Data Center GPU Max (single device)
FAILED tests/models/blip/test_modeling_blip.py::BlipVQAModelTest::test_cpu_offload - KeyError: 'xpu:0'
FAILED tests/models/blip/test_modeling_blip.py::BlipVQAModelTest::test_disk_offload_bin - KeyError: 'xpu:0'
FAILED tests/models/blip/test_modeling_blip.py::BlipVQAModelTest::test_disk_offload_safetensors - KeyError: 'xpu:0'
FAILED tests/models/blip/test_modeling_blip.py::BlipTextImageModelTest::test_cpu_offload - KeyError: 'xpu:0'
FAILED tests/models/blip/test_modeling_blip.py::BlipTextImageModelTest::test_disk_offload_bin - KeyError: 'xpu:0'
FAILED tests/models/blip/test_modeling_blip.py::BlipTextImageModelTest::test_disk_offload_safetensors - KeyError: 'xpu:0'
FAILED tests/models/dab_detr/test_modeling_dab_detr.py::DabDetrModelTest::test_cpu_offload - KeyError: 'xpu:0'
FAILED tests/models/dab_detr/test_modeling_dab_detr.py::DabDetrModelTest::test_disk_offload_bin - KeyError: 'xpu:0'
FAILED tests/models/dab_detr/test_modeling_dab_detr.py::DabDetrModelTest::test_disk_offload_safetensors - KeyError: 'xpu:0'
FAILED tests/models/vilt/test_modeling_vilt.py::ViltModelTest::test_cpu_offload - KeyError: 'xpu:0'
FAILED tests/models/vilt/test_modeling_vilt.py::ViltModelTest::test_disk_offload_bin - KeyError: 'xpu:0'
FAILED tests/models/vilt/test_modeling_vilt.py::ViltModelTest::test_disk_offload_safetensors - KeyError: 'xpu:0'

Transformers test_cpu_offload and a few other tests fail for the blip, dab_detr, roberta, and vilt models when running with the XPU backend. Note: I can't reproduce this issue with CUDA on an A10.

# ZE_AFFINITY_MASK=0 TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/blip/test_modeling_blip.py::BlipVQAModelTest::test_cpu_offload
...
                    elif is_xpu_available():
                        device = f"xpu:{device}"
>               del self.tied_params_map[value_pointer][device]
E               KeyError: 'xpu:0'

../accelerate/src/accelerate/hooks.py:399: KeyError

Issue happens here:

elif is_xpu_available():
    device = f"xpu:{device}"
del self.tied_params_map[value_pointer][device]

When the issue happens, the dict self.tied_params_map[value_pointer] is empty. A trivial if condition checking that it's non-empty avoids the issue and the tests pass. As noted, I don't see this issue happening with CUDA. I also see that on XPU self.tied_pointers_to_remove is populated twice with the same values, and post_forward() is then called twice in a row, with the failure occurring on the second pass.
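The double-pass failure and the guard can be sketched in isolation (the {data_pointer: {device_string: tensor}} shape of tied_params_map is assumed from the traceback above; the pointer value is a placeholder):

```python
# Standalone sketch, assuming tied_params_map maps a tensor's data pointer
# to a {device_string: tensor} dict, as the traceback suggests.
tied_params_map = {140000: {}}  # per-pointer dict already emptied by the first pass

value_pointer = 140000
device = "xpu:0"

# Unguarded deletion raises on the second post_forward() pass:
#   del tied_params_map[value_pointer][device]  # KeyError: 'xpu:0'

# Guarded variant: delete only if the entry is still present.
if device in tied_params_map[value_pointer]:
    del tied_params_map[value_pointer][device]

print(tied_params_map)  # {140000: {}} — no KeyError
```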

CC: @SunMarc @faaany @zucchini-nlp

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Feb 20, 2025
Fixes: huggingface#3402

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
@dvrogozh dvrogozh linked a pull request Feb 20, 2025 that will close this issue
@dvrogozh Contributor Author

A trivial if condition checking that it's non-empty avoids the issue and the tests pass.

See #3403 for such a fix; the if condition avoids the KeyError. However, I am not sure why this situation occurs in the first place. I'm afraid I may have fixed the symptom rather than the actual issue. Can someone suggest a better fix, or explain why this fix would be the correct one?
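For comparison, an equivalent idempotent formulation would use dict.pop with a default, which never raises even if post_forward() runs twice. This is only an illustration, not the code in #3403, and remove_tied_entry is a hypothetical helper:

```python
# dict.pop with a default never raises, so a second post_forward() pass
# (as apparently happens on XPU) becomes a harmless no-op.
tied_params_map = {140000: {"xpu:0": "tensor placeholder"}}

def remove_tied_entry(value_pointer, device):
    # Drop the device entry if present; do nothing if already removed.
    tied_params_map[value_pointer].pop(device, None)

remove_tied_entry(140000, "xpu:0")  # first pass removes the entry
remove_tied_entry(140000, "xpu:0")  # second pass: no KeyError
```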
