[Windows] Ungraceful shutdown of node corrupts the database #3608
Comments
This could definitely use a CI test that exposes the problem; that might be a good place to start.
I think we may have actually tracked this down, or at least one situation where it can surface. tl;dr: the sync MMR gets "out of sync" relative to the header MMR and ends up in a state where it continues to refer to a header that no longer exists. I'm not sure this actually resolves the issue identified in this PR. It's also not technically data corruption, as we can recover from it; it's more a case of us handling it badly on startup. To be clear, we definitely do still have situations where it is possible to legitimately corrupt the data on disk, but this missing header is (I believe) a less severe problem that can be handled more robustly during node startup (and I think we do that now with the fix linked above).
So yes, this is happening during "header sync". In this specific case we are not deleting a header for it to go missing; rather, we are shutting the node down during the period where it is writing the sync MMR files to disk but has not yet committed the batch of headers to the db. So on next startup the sync MMR is out beyond the headers in the db and the "sync head" points to a non-existent header.
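To make the failure mode concrete, here is a minimal sketch of a startup check that rewinds the sync MMR until its head refers to a header that is actually in the db. This is only an illustration under simplified assumptions, not Grin's actual code: `NodeState`, `db_headers`, `sync_mmr`, and `reconcile_on_startup` are hypothetical names, and header hashes are modeled as plain `u64` values.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified model of the on-disk state (not Grin's real types).
struct NodeState {
    /// Header hashes that were committed to the db.
    db_headers: HashSet<u64>,
    /// Header hashes recorded in the sync MMR files, in height order.
    sync_mmr: Vec<u64>,
}

impl NodeState {
    /// The "sync head" is modeled as the last entry of the sync MMR.
    fn sync_head(&self) -> Option<u64> {
        self.sync_mmr.last().copied()
    }

    /// Rewinds the sync MMR until the sync head points at a header
    /// that exists in the db. Returns the number of entries dropped.
    fn reconcile_on_startup(&mut self) -> usize {
        let mut dropped = 0;
        while let Some(head) = self.sync_head() {
            if self.db_headers.contains(&head) {
                break; // sync head is backed by a committed header
            }
            self.sync_mmr.pop();
            dropped += 1;
        }
        dropped
    }
}
```

With headers 1 to 3 committed and a sync MMR holding 1 to 5 (the shutdown hit between the MMR write and the db commit), the check drops the two dangling entries and leaves the sync head at 3.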
Describe the bug
Running the node on Windows and closing it by clicking the "X" in the top-right corner of the window has a high probability of corrupting the database.
To Reproduce
It doesn't always corrupt the data, but roughly every third try ends in a bad state from which the node can't recover.
Relevant Information
Logs when trying to run the node after a corruption occurred.
Screenshots
/
Desktop (please complete the following information):
I've noticed this on Windows.
Additional context
It works if you quit the "Grin way" by pressing 'q' and waiting for the cleanup prior to shutdown.
Here's probably some context around the problem.
https://stackoverflow.com/questions/26658707/windows-console-application-signal-for-closing-event
I believe the ctrlc package we use may not support the SIGBREAK signal. Perhaps using a different library to catch these signals, and reacting to them in this case as well, would solve the issue, but I didn't dive too deep.
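As a rough sketch of what handling the window-close case could look like without any extra crate: on Windows, closing the console window delivers `CTRL_CLOSE_EVENT` to handlers registered with `SetConsoleCtrlHandler`, and the process is terminated once the handler returns (or a short system timeout elapses). The handler therefore has to block until cleanup has finished. The flag names `STOP_REQUESTED` and `CLEANUP_DONE` and the functions `on_console_event`, `raw_handler`, and `install_handler` are made up for this example; it assumes the node's main loop polls `STOP_REQUESTED` and sets `CLEANUP_DONE` once the db has been flushed.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Console control event codes defined by the Windows API.
const CTRL_C_EVENT: u32 = 0;
const CTRL_BREAK_EVENT: u32 = 1;
const CTRL_CLOSE_EVENT: u32 = 2;

// Hypothetical flags shared with the node's main loop.
static STOP_REQUESTED: AtomicBool = AtomicBool::new(false);
static CLEANUP_DONE: AtomicBool = AtomicBool::new(false);

/// Platform-independent part of the handler. Requests a stop, then
/// blocks until the main loop reports that cleanup has finished.
/// Returns true if the event was handled.
fn on_console_event(ctrl_type: u32) -> bool {
    match ctrl_type {
        CTRL_C_EVENT | CTRL_BREAK_EVENT | CTRL_CLOSE_EVENT => {
            STOP_REQUESTED.store(true, Ordering::SeqCst);
            // Returning early on CTRL_CLOSE_EVENT would let Windows
            // terminate the process mid-write, so wait here.
            while !CLEANUP_DONE.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(10));
            }
            true
        }
        _ => false,
    }
}

// Raw binding to the Win32 function; only compiled on Windows.
#[cfg(windows)]
#[link(name = "kernel32")]
extern "system" {
    fn SetConsoleCtrlHandler(
        handler: Option<unsafe extern "system" fn(u32) -> i32>,
        add: i32,
    ) -> i32;
}

#[cfg(windows)]
unsafe extern "system" fn raw_handler(ctrl_type: u32) -> i32 {
    // TRUE (1) tells Windows the event was handled.
    on_console_event(ctrl_type) as i32
}

/// Registers the handler; returns false if registration failed.
#[cfg(windows)]
fn install_handler() -> bool {
    unsafe { SetConsoleCtrlHandler(Some(raw_handler), 1) != 0 }
}
```

The main loop would call `install_handler()` once at startup, watch `STOP_REQUESTED`, and set `CLEANUP_DONE` after the normal shutdown path (the same one used when pressing 'q') has completed. Because Windows only grants a limited time after a close event, a long cleanup may still be cut short.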