Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where validate_json referential integrity check ignored documents in database #861

Merged

Conversation

eecavanna
Copy link
Collaborator

@eecavanna eecavanna commented Jan 16, 2025

On this branch, I fixed a bug where the real-time referential integrity checker in the validate_json function was ignoring documents that were already in the database.

Details

On this branch, I changed the real-time referential integrity checker code to no longer use the OverlayDB class. When I first wrote the real-time referential integrity checker code, I had made an incorrect assumption about how the OverlayDB class worked.

Instead of it using the OverlayDB class, I updated the checker to use the real database directly and to use the JSON payload directly. Checking now happens in two stages (one for each of those potential "residences" of each referenced document).

The two stages are (in short—as a more detailed explanation is in the code comments):

  1. Check the database
  2. If the document isn't found there, check the JSON payload

With those changes in place, the checker no longer ignores documents that already exist in the database. I added a unit test that demonstrates this.

Related issue(s)

Fixes #860

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by adding some unit tests that target them, and ensuring that those tests pass. I will leave it to GitHub Actions to tell me whether all pre-existing tests still pass.

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@eecavanna eecavanna self-assigned this Jan 16, 2025
@eecavanna eecavanna linked an issue Jan 16, 2025 that may be closed by this pull request
@eecavanna eecavanna marked this pull request as ready for review January 16, 2025 06:01
@eecavanna
Copy link
Collaborator Author

Props to @mbthornton-lbl for reporting this issue via Slack earlier today—thank you!

@eecavanna
Copy link
Collaborator Author

I'm ready for this to be merged in, even in my absence (in case the reviews take place before I'm online).

aclum
aclum previously approved these changes Jan 16, 2025
Copy link
Collaborator

@shreddd shreddd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Possible future enhancements:

  • We should also look at optimizations to queries / caching / memoization as a way to speed up multiple queries looking for the same doc.
  • Seeing if we can do this as a transaction that can be rolled back.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 17, 2025

Thanks, @shreddd!

Regarding:

Seeing if we can do this as a transaction that can be rolled back.

I want to expand on that by saying (as we discussed over Zoom earlier today) the transaction would allow us to "stage" the proposed operations on the real database (i.e. "perform" them, but without committing the transaction), do the referential integrity scan on that single database (on which the operations have been "staged"), and then roll back the transaction.

I'll merge this in now in order to fix the bug currently present in the development environment.

@eecavanna eecavanna merged commit 27e3f0c into main Jan 17, 2025
2 checks passed
@eecavanna eecavanna deleted the 860-real-time-referential-integrity-checking-is-broken branch January 17, 2025 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Real-time referential integrity checking is broken
3 participants