Support run trainer locally #111

zpcore · 2025-02-13T21:56:27Z

Add support for running train.py locally:

tp dbrun --use-hf torchprime/hf_models/train.py  # Hugging Face run

or

tp dbrun torchprime/torch_xla_models/train.py

This makes it easier to debug model code.

For now, it still uses the torch_xla from the base Docker image. In the future, we may want to add an option to run with a locally built torch_xla.

tengyifei · 2025-02-13T22:06:23Z

I usually run a model locally like this:

python3 torchprime/torch_xla_models/train.py model=llama-3-8b mesh.fsdp=8 profile_step=3 max_steps=50

Is this insufficient for your use case?

zpcore · 2025-02-13T22:15:36Z

> I usually run a model locally like this:
>
> python3 torchprime/torch_xla_models/train.py model=llama-3-8b mesh.fsdp=8 profile_step=3 max_steps=50
>
> Is this insufficient for your use case?

Yes, this should also work. However, I ran into many permission issues when running train.py directly. This change just makes it easy to run it in a container instead.

tengyifei · 2025-02-15T00:31:26Z

torchprime/launcher/buildpush.py

@@ -61,7 +61,8 @@ def buildpush(
    _run(
      f"{sudo_cmd} docker tag {docker_tag} {docker_url}",
    )
-    _run(f"{sudo_cmd} docker push {docker_url}")
+    if torchprime_docker_tag != "local_run":


This introduces a magic constant. I think it's simpler if we add a "push=True" function argument and have the other file call this with push=False.
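A minimal sketch of that suggestion, not the actual torchprime code: the signature is trimmed down to the names visible in the diff (`docker_tag`, `docker_url`, `sudo_cmd`), and the injectable `run` parameter is a hypothetical addition so the behavior can be checked without Docker installed.

```python
import subprocess
from typing import Callable


def _run(command: str) -> None:
  """Run a shell command and raise if it exits with a non-zero status."""
  subprocess.run(command, shell=True, check=True)


def buildpush(
  docker_tag: str,
  docker_url: str,
  sudo_cmd: str = "",
  push: bool = True,
  run: Callable[[str], None] = _run,  # hypothetical hook, for illustration only
) -> None:
  """Tag the image, then push it unless the caller passes push=False.

  Assumed, simplified signature: the real function takes other arguments
  and also builds the image before tagging it.
  """
  run(f"{sudo_cmd} docker tag {docker_tag} {docker_url}".strip())
  if push:
    # Skipped for local runs, so no special tag value needs to be checked.
    run(f"{sudo_cmd} docker push {docker_url}".strip())
```

With this shape, the local-run path would call `buildpush(..., push=False)`, while existing callers keep the default and still push.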

Support run trainer locally

2a81ec2

zpcore requested a review from tengyifei February 13, 2025 21:58

zpcore marked this pull request as ready for review February 13, 2025 21:58

nit

9a56545

tengyifei reviewed Feb 13, 2025

tengyifei requested changes Feb 13, 2025

update docker command

4a24ba6

tengyifei approved these changes Feb 15, 2025
