Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] Ungraceful shutdown of node corrupts the database #3608

Open
phyro opened this issue Mar 20, 2021 · 3 comments
Open

[Windows] Ungraceful shutdown of node corrupts the database #3608

phyro opened this issue Mar 20, 2021 · 3 comments

Comments

@phyro
Copy link
Member

phyro commented Mar 20, 2021

Describe the bug
Running the node on Windows and closing by clicking on "X" in the top right corner of the windows app has a high probability of corrupting the database.

To Reproduce

  1. Download the Window grin-node release
  2. Remove the .grin folder if you have it
  3. Open grin node and let it start sync
  4. Close it after 20 seconds or so by clicking 'X'

It doesn't always corrupt the data, but every 3rd try it usually ends in a bad state from which it can't recover.

Relevant Information
Logs when trying to run the node after a corruption occured.

20210109 18:59:05.601 INFO grin_util::logger - log4rs is initialized, file level: Info, stdout level: Warn, min. level: Info
20210109 18:59:05.603 INFO grin - Using configuration file at C:\Users\phyro\.grin\main\grin-server.toml
20210109 18:59:05.603 INFO grin - This is Grin version 5.0.1 (git v5.0.1), built for x86_64-pc-windows-msvc by rustc 1.48.0 (7eac88abb 2020-11-16).
20210109 18:59:05.603 INFO grin - Chain: Mainnet
20210109 18:59:05.603 INFO grin - Accept Fee Base: 500000
20210109 18:59:05.603 INFO grin - Future Time Limit: 300
20210109 18:59:05.603 INFO grin - Feature: NRD kernel enabled: false
20210109 18:59:05.603 WARN grin::cmd::server - Starting GRIN in UI mode...
20210109 18:59:05.603 INFO grin_servers::grin::server - Starting server, genesis block: 40adad0aec27
20210109 18:59:05.779 ERROR grin_util::logger - 
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Chain(Error { inner: NotFoundErr("BLOCK HEADER: 000007f722ca")

Store Error: NotFoundErr("BLOCK HEADER: 000007f722ca"), reason: DB Not Found Error: BLOCK HEADER: 000007f722ca })': src/bin\cmd\server.rs:58   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
  16: <unknown>
  17: <unknown>
  18: BaseThreadInitThunk
  19: RtlUserThreadStart

Screenshots
/

Desktop (please complete the following information):
I've noticed this on Windows.

Additional context

It works if you quit the "Grin way" by pressing 'q' and waiting for the cleanup prior to shutdown.

Here's probably some context around the problem.

https://stackoverflow.com/questions/26658707/windows-console-application-signal-for-closing-event

I believe the ctrlc package we use may not support the SIGBREAK signal. Perhaps using a different library to catch these signals and then reacting on it also in this case would solve the issue, but I didn't dive too deep.

@trevyn
Copy link
Contributor

trevyn commented Mar 20, 2021

This could definitely use a CI test that exposes the problem; that might be a good place to start.

@antiochp
Copy link
Member

antiochp commented Mar 22, 2021

I think we may have actually tracked down the NotFoundErr("BLOCK HEADER: error when we were frantically implementing the "rewind bad block" logic for #3605.

Or at least one situation where it can surface.
Its related to the "sync MMR" and I believe fixed here -

https://github.com/mimblewimble/grin/pull/3603/files#diff-29a4dbf8cadda7c8ed1b4e35a237dfebe170b3703d7af2c78c065ae65ab04100

tl;dr the sync MMR gets "out of sync" relative to the header MMR and ends up in a state where it continues to refer to a header that no longer exists.

I'm not sure this actually resolves the issue identified in this PR. Its also not technically data corruption as we can recover from it and its more a case of us just handling it badly on startup.

To be clear we definitely do still have situations where it is possible to legitimately corrupt the data on disk, but this missing header is (I believe) a less severe problem that can be handled more robustly during node startup (and I think we do that now with the fix linked above).

@antiochp
Copy link
Member

antiochp commented Mar 22, 2021

  • Remove the .grin folder if you have it
  • Open grin node and let it start sync
  • Close it after 20 seconds or so by clicking 'X'

So yes this is happening during "header sync" so in this specific case we are not deleting a header for it go go missing, but we are shutting the node down during the period where it is writing the sync MMR files to disk, but has not yet committed the batch of headers to the db. So on next startup the sync MMR is out beyond the headers in the db and the "sync head" points to a non-existent header.
We want to reset the sync MMR to an earlier state which I believe we now correctly do in this situation, we just need to get past the initial startup where it makes an assumption that the header exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants