Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix last step is not executed #236

Merged
merged 4 commits into from
Oct 17, 2024

Conversation

rayandrew
Copy link
Contributor

If we have 730 steps, DLIO benchmark only executes until 729

The bug also persists when user specified total_training_steps

Fix: #235

If we have 730 steps, DLIO benchmark only executes until 729

The bug also persists when user specified `total_training_steps`
Copy link
Collaborator

@hariharan-devarajan hariharan-devarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

If `total_training_steps` is not specified, the default will be -1.
Thus checking whether it is > 0 is needed
@rayandrew
Copy link
Contributor Author

There is one bug. if total_training_steps is not specified, the default will be -1.
I added check for that as well

Copy link
Collaborator

@hariharan-devarajan hariharan-devarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is failing. Can u run pytest and check.

@rayandrew
Copy link
Contributor Author

I think the last commit fixed the CI @hariharan-devarajan

Copy link
Collaborator

@hariharan-devarajan hariharan-devarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good.

  1. We make sure we complete the step correctlyly.
  2. For DALI, we need to handle the stop iteration gracefully.

@hariharan-devarajan
Copy link
Collaborator

@zhenghh04 This is ready for merge as well.

@zhenghh04 zhenghh04 merged commit cda4df3 into argonne-lcf:main Oct 17, 2024
6 checks passed
@rayandrew rayandrew deleted the fix/wrong-step-count branch October 17, 2024 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix last step is not executed with/without user specifying total_training_steps
3 participants