[Bug] Tests timeouts 6h+, maybe r.futures.calib hangs #1312

Open · echoix opened this issue Feb 14, 2025 · 9 comments

echoix (Member) commented Feb 14, 2025

Some tests are hanging and taking 6+ hours before getting killed when running on CI.

Name of the addon
The last line of the log shown is r.futures.calib, but maybe the problem is elsewhere.

Describe the bug

A couple of mornings this week, plus this weekend, I manually stopped some jobs in grass-addons because they were on their way to taking 6h.

To Reproduce

See the logs for build and test; most often (if not always), the hanging jobs are for main + Python 3.11.

Expected behavior

Tests are fast, or at least, not hanging.


System description:

On CI runners.


Additional context

I don't think my work last weekend is the cause, as some of the PRs weren't timing out, though some did. Maybe I just allowed more tests to pass (with more dependencies or better translatable strings) and we end up with hanging tests. There were some timeouts in the weeks before my last work here, but not as consistently as this last week (even before the caching; one day I tried deleting all the cache here, and the next daily run timed out too).

It may be a while loop (though I didn't find one in the first two r.futures tests and their code), but there are a lot of nested for loops.

It could be a case where some input is expected and the process is waiting for it, though that is more common with cmd.exe on Windows.

Could stdin/stdout/stderr be closed in a way that the gunittest test runner doesn't handle well?
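
If the stdin hypothesis is worth checking outside the test runner, one rough way is to run the suspect module with stdin explicitly redirected and a hard timeout. This is only a sketch: the real r.futures.calib options are omitted, and the `grass --tmp-location ... --exec` wrapper is just one assumed way to invoke it.

```python
# Rough check of the "waiting for input" hypothesis: run the module with
# stdin closed off and a hard timeout instead of letting CI wait 6 hours.
import subprocess

cmd = [
    "grass", "--tmp-location", "EPSG:3358", "--exec",
    "r.futures.calib",  # ...the module's real options would go here...
]

try:
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # the module can never block on input
        capture_output=True,
        text=True,
        timeout=300,                # 5 minutes instead of 6 hours
    )
    print(result.returncode)
    print(result.stderr[-500:])
except subprocess.TimeoutExpired:
    print("still hangs with stdin redirected, so waiting on input is not the cause")
```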

We could try 8.4.0 with Python 3.11, or 8.5.0dev with the Python version used in the 8.4.0 job (if it is 3.9+), to see whether it is a Python problem or a regression in development GRASS over the last month or so.

echoix (Member Author) commented Feb 14, 2025

@petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?

echoix (Member Author) commented Feb 14, 2025

On my fork, I saw some timeouts 3 months ago and 2 weeks ago, but for 8.4.0. So maybe it's not a regression but something in the addons specifically.

I also see that r.random.walk can take really long too: https://github.com/echoix/grass-addons/actions/runs/13048290473/job/36402778863

Are we hitting a memory/swap limit? The job would get killed in that case, no?

petrasovaa (Contributor) commented Feb 14, 2025

> @petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?

Well, it is a non-trivial model, so it's possible something is wrong, but where do you even see it, and how often does it happen?

echoix (Member Author) commented Feb 14, 2025

A harder-to-debug situation would be if this and/or r.random.walk are stochastic/gradient-descent methods that don't converge, i.e., convergence depends on the initial state or seed.

echoix (Member Author) commented Feb 14, 2025

We could always limit the test step to 45 minutes and save ourselves 5 hours of CI time in the European mornings.

wenzeslaus (Member) commented:

A limit sounds good, but a per-test limit seems more informative for such failures.

echoix (Member Author) commented Feb 15, 2025

> A limit sounds good, but a per-test limit seems more informative for such failures.

We might as well use both, since if we're effectively hung, the per-test timeout might not work.
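
As a sketch of what a per-test limit could look like for unittest/gunittest-style tests (not an existing gunittest feature; SIGALRM is Unix-only, which is fine for the Linux runners):

```python
# Sketch of a per-test watchdog for unittest-style tests; assumes Linux CI
# runners (SIGALRM) and tests running in the main thread.
import functools
import signal
import unittest


def per_test_timeout(seconds):
    """Raise TimeoutError if the wrapped test runs longer than `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")

            old_handler = signal.signal(signal.SIGALRM, on_alarm)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator


class TestCalib(unittest.TestCase):
    @per_test_timeout(600)  # 10 minutes instead of 6 hours
    def test_runs_in_reasonable_time(self):
        # ...the actual r.futures.calib test body would go here...
        pass
```

If the suites end up running under pytest instead, the pytest-timeout plugin's `@pytest.mark.timeout(...)` marker gives the same per-test behavior without a custom decorator.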

echoix (Member Author) commented Feb 16, 2025

I've gotten to the point where I can run r.futures.calib and r.futures.parallelpga successfully with pytest, but not with gunittest: r.futures.calib still hangs with gunittest.

I overcame one of our previous limitations with pytest: I managed to write an autouse fixture that is enabled when the test directory is "testsuite" and contains a data directory. If so, it copies the data directory to a pytest tmp directory and changes the current working directory with monkeypatch (so it gets reverted afterwards), and it really works.
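
For context, a minimal sketch of what such a conftest.py fixture can look like; the real implementation may differ in details.

```python
# conftest.py -- minimal sketch of the autouse fixture described above.
import shutil
from pathlib import Path

import pytest


@pytest.fixture(autouse=True)
def copy_testsuite_data(request, tmp_path, monkeypatch):
    """Copy a testsuite's data/ directory to a tmp dir and chdir into it."""
    test_dir = Path(str(request.fspath)).parent
    data_dir = test_dir / "data"
    # Only act for tests living in a "testsuite" directory that has data.
    if test_dir.name == "testsuite" and data_dir.is_dir():
        shutil.copytree(data_dir, tmp_path / "data")
        # monkeypatch.chdir is undone automatically after the test.
        monkeypatch.chdir(tmp_path)
    yield
```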

A problem is that since these aren't pytest-style tests, the setUpClass() methods get executed before the autouse fixture can change the directory, so I switched to setUp() instead (per test); it is still way faster than waiting for timeouts.

If we find a way to have the data directory copied for setUpClass(), we would likely have everything solved for running everything with pytest (even in the main repo).
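
One untested idea for the setUpClass() case is a class-scoped autouse fixture based on tmp_path_factory; whether pytest orders it before the injected setUpClass() for unittest-based classes would still need to be verified.

```python
# conftest.py -- untested idea: class-scoped variant using tmp_path_factory.
# Whether this runs before setUpClass() for unittest-style classes needs to
# be verified; the fixture itself is standard pytest.
import os
import shutil
from pathlib import Path

import pytest


@pytest.fixture(autouse=True, scope="class")
def copy_testsuite_data_per_class(request, tmp_path_factory):
    test_dir = Path(str(request.fspath)).parent
    data_dir = test_dir / "data"
    if test_dir.name != "testsuite" or not data_dir.is_dir():
        yield
        return
    work_dir = tmp_path_factory.mktemp("testsuite_data")
    shutil.copytree(data_dir, work_dir / "data")
    old_cwd = os.getcwd()
    os.chdir(work_dir)  # monkeypatch is function-scoped, so chdir manually
    try:
        yield
    finally:
        os.chdir(old_cwd)
```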
