[Bug] Tests timeouts 6h+, maybe r.futures.calib hangs #1312

Open · echoix opened this issue Feb 14, 2025 · 9 comments

echoix (Member) commented Feb 14, 2025

Some tests are hanging and taking 6+ hours before getting killed when running on CI.

Name of the addon
The last line of the log shown is r.futures.calib, but maybe the problem is elsewhere.

Describe the bug

A couple of mornings this week, plus this weekend, I manually stopped some jobs in grass-addons because they were on their way to taking 6h.

To Reproduce

See the logs for build and test; most often (if not always), the hanging jobs are for main + Python 3.11.

Expected behavior

Tests are fast, or at least, not hanging.


System description:

On CI runners.


Additional context

I don't think my work last weekend is the cause, as some of the PRs weren't timing out, though some did. Maybe I just allowed more tests to pass (with more dependencies or better translatable strings) and we end up with hanging tests. There were some timeouts in the weeks before my last work here, but not as consistently as this last week (even before the caching; one day I tried deleting all the cache here, and the next daily run timed out too).

It may be a while loop (though I didn't find one in the first two r.futures tests and their code), but there are a lot of nested for loops.

It could be a case where some input is expected and the process is waiting for it, though that is more common with cmd.exe on Windows.

Could stdin/stdout/stderr be closed in a way that the gunittest test runner doesn't handle well?
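
If the stdin hypothesis is worth checking outside the test runner, one rough way is to run the suspect module with stdin explicitly redirected and a hard timeout. This is only a sketch: the real r.futures.calib options are omitted, and the `grass --tmp-location ... --exec` wrapper is just one assumed way to invoke it.

```python
# Rough check of the "waiting for input" hypothesis: run the module with
# stdin closed off and a hard timeout instead of letting CI wait 6 hours.
import subprocess

cmd = [
    "grass", "--tmp-location", "EPSG:3358", "--exec",
    "r.futures.calib",  # ...the module's real options would go here...
]

try:
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # the module can never block on input
        capture_output=True,
        text=True,
        timeout=300,                # 5 minutes instead of 6 hours
    )
    print(result.returncode)
    print(result.stderr[-500:])
except subprocess.TimeoutExpired:
    print("still hangs with stdin redirected, so waiting on input is not the cause")
```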

We could try 8.4.0 with Python 3.11, or 8.5.0dev with the Python version used in the 8.4.0 job (if it is 3.9+), to see whether it is a Python problem or a regression in development GRASS over the last month or so.

echoix (Member Author) commented Feb 14, 2025

@petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?

echoix (Member Author) commented Feb 14, 2025

On my fork, I saw some timeouts 3 months ago and 2 weeks ago, but for 8.4.0. So maybe it's not a regression but something in the addons specifically.

I also see that r.random.walk can take really long too: https://github.com/echoix/grass-addons/actions/runs/13048290473/job/36402778863

Are we hitting a memory/swap limit? The job would get killed in that case, no?

petrasovaa (Contributor) commented Feb 14, 2025

> @petrasovaa I saw that you were involved in some r.futures work. Do you know if there's something there to check out?

Well, it is a non-trivial model, so it's possible something is wrong, but where do you even see it, and how often does it happen?

echoix (Member Author) commented Feb 14, 2025

A harder-to-debug situation would be if this and/or r.random.walk are stochastic/gradient-descent methods that don't converge, i.e., convergence depends on the initial state or seed.

echoix (Member Author) commented Feb 14, 2025

We could always limit the test step to 45 minutes and save ourselves 5 hours of CI time in the European mornings.

wenzeslaus (Member) commented:

A limit sounds good, but a per-test limit seems more informative for such failures.

echoix (Member Author) commented Feb 15, 2025

> A limit sounds good, but a per-test limit seems more informative for such failures.

We might as well use both, since if we're effectively hung, the per-test timeout might not work.
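
As a sketch of what a per-test limit could look like for unittest/gunittest-style tests (not an existing gunittest feature; SIGALRM is Unix-only, which is fine for the Linux runners):

```python
# Sketch of a per-test watchdog for unittest-style tests; assumes Linux CI
# runners (SIGALRM) and tests running in the main thread.
import functools
import signal
import unittest


def per_test_timeout(seconds):
    """Raise TimeoutError if the wrapped test runs longer than `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")

            old_handler = signal.signal(signal.SIGALRM, on_alarm)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the pending alarm
                signal.signal(signal.SIGALRM, old_handler)
        return wrapper
    return decorator


class TestCalib(unittest.TestCase):
    @per_test_timeout(600)  # 10 minutes instead of 6 hours
    def test_runs_in_reasonable_time(self):
        # ...the actual r.futures.calib test body would go here...
        pass
```

If the suites end up running under pytest instead, the pytest-timeout plugin's `@pytest.mark.timeout(...)` marker gives the same per-test behavior without a custom decorator.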

echoix (Member Author) commented Feb 16, 2025

I've gotten to the point where I can run r.futures.calib and r.futures.parallelpga successfully with pytest, but not with gunittest: r.futures.calib still hangs with gunittest.

I overcame one of our previous limitations with pytest: I managed to write an autouse fixture that is enabled when the test directory is "testsuite" and contains a data directory. If so, it copies the data directory to a pytest tmp directory and changes the current working directory with monkeypatch (so it gets reverted afterwards), and it really works.
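
For context, a minimal sketch of what such a conftest.py fixture can look like; the real implementation may differ in details.

```python
# conftest.py -- minimal sketch of the autouse fixture described above.
import shutil
from pathlib import Path

import pytest


@pytest.fixture(autouse=True)
def copy_testsuite_data(request, tmp_path, monkeypatch):
    """Copy a testsuite's data/ directory to a tmp dir and chdir into it."""
    test_dir = Path(str(request.fspath)).parent
    data_dir = test_dir / "data"
    # Only act for tests living in a "testsuite" directory that has data.
    if test_dir.name == "testsuite" and data_dir.is_dir():
        shutil.copytree(data_dir, tmp_path / "data")
        # monkeypatch.chdir is undone automatically after the test.
        monkeypatch.chdir(tmp_path)
    yield
```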

A problem is that since these aren't pytest-style tests, the setUpClass() methods get executed before the autouse fixture can change the directory, so I switched to setUp() instead (per test); it is still way faster than waiting for timeouts.

If we find a way to have the data directory copied for setUpClass(), we would likely have everything solved for running everything with pytest (even in the main repo).
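
One untested idea for the setUpClass() case is a class-scoped autouse fixture based on tmp_path_factory; whether pytest orders it before the injected setUpClass() for unittest-based classes would still need to be verified.

```python
# conftest.py -- untested idea: class-scoped variant using tmp_path_factory.
# Whether this runs before setUpClass() for unittest-style classes needs to
# be verified; the fixture itself is standard pytest.
import os
import shutil
from pathlib import Path

import pytest


@pytest.fixture(autouse=True, scope="class")
def copy_testsuite_data_per_class(request, tmp_path_factory):
    test_dir = Path(str(request.fspath)).parent
    data_dir = test_dir / "data"
    if test_dir.name != "testsuite" or not data_dir.is_dir():
        yield
        return
    work_dir = tmp_path_factory.mktemp("testsuite_data")
    shutil.copytree(data_dir, work_dir / "data")
    old_cwd = os.getcwd()
    os.chdir(work_dir)  # monkeypatch is function-scoped, so chdir manually
    try:
        yield
    finally:
        os.chdir(old_cwd)
```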
