Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add class with helper methods to fill in missing records in database #856

Merged
merged 9 commits into from
Jan 11, 2025

Conversation

sujaypatil96
Copy link
Collaborator

@sujaypatil96 sujaypatil96 commented Dec 26, 2024

We are trying to address the general problem of "database stitching" in this PR. It will often be the case that we have certain records in the database (say biosamples), but records that are needed for the next bits of processing of these "entities" might be missing.

For example, the first and most common use case that we are trying to address is filling in missing data_generation_set records for biosamples that have been brought into the database through the SubmissionPortalTranslator mechanism.

  • Add Dagster harness to support calling of helper methods in DatabaseUpdater class (general interface which we will use to collect helper methods to facilitate all use cases of "database stitching")
  • Add helper methods to class
    • Create missing DataGeneration records given Biosamples that are already present in the database

Details

...

Related issue(s)

Fixes #756 microbiomedata/issues#951 microbiomedata/issues#813

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by...

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@sujaypatil96 sujaypatil96 requested a review from aclum January 8, 2025 00:07
aclum
aclum previously approved these changes Jan 8, 2025
Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks nice and I appreciate the documentation you've added. I have some comments and questions, but none of it is so critical that it needs to hold up merging.


I guess my main concern, design-wise, is that I'd prefer to be more explicit in terms of naming things and avoid vague (and somewhat misleading) wording like "missing" or "repair".

For example, there's a Dagster Op named missing_data_generation_repair. What it actually seems to do is produce NucleotideSequencing records based on information fetched from the GOLD API for a given Study. That's all well and good. But it would do the same thing whether or not there are existing NucleotideSequencing records associated with the given Study. So the "missing" part of the name seems misleading. Similarly, to me, the word "repair" implies that the Study input is going to be modified in some way. It also implies that the study was "broken" before 😂. All that being said, I probably would have named it something more explicit like generate_data_generation_set_from_gold_api_for_study. It's long, but, hey, characters are free.



@graph
def fill_missing_data_generation_data_object_records():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me "fill" implies that it's going to actually commit the results to Mongo. Is there any reason this Graph doesn't do that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so we want to see the JSON export that this job produces, visually QC it and then ingest it into the system. So it's not doing the "filling" automatically. Is "stitch_" a good prefix?



@op
def nmdc_study_id_filename(nmdc_study_id: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a more specific name since it produces a very specific file name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address it!

from nmdc_schema import nmdc


class DatabaseUpdater:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in this being a class? Since it's not inheriting from anything and the internal state never really gets modified, could create_missing_dg_records just be a top-level function instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, there will be other methods that I add to this "interface" (class) in the future, so we have to decide between "collecting" all of these methods under that class or have them be top level functions?

"""
return self.gold_api_client.fetch_projects_by_biosample(gold_biosample_id)

def create_missing_dg_records(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above comment, I'm a little hesitant about the "missing" part of this name. To me this is generate_data_generation_set_from_gold_api.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! "missing" is a little misleading, "generate" is better.

"get_database_updater_inputs": {
"config": {
"nmdc_study_id": "",
"gold_nmdc_instrument_mapping_file_url": "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/refs/heads/main/assets/misc/gold_seqMethod_to_nmdc_instrument_set.tsv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have already explained this to me and I've forgotten, but why do we keep this file in the nmdc-schema repository?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no, that's a good question. I actually don't know why we keep it in the nmdc-schema repo. We were debating between having those files be "close" to the source files that use them in runtime or keep them in the schema repo, and I guess we just left the conversation there. We can talk about potentially moving them into this repo and have them be "close" to the Python scripts consuming them at our internal BBPO NMDC call.

@sujaypatil96
Copy link
Collaborator Author

Awesome!! this is such a wonderful review!! thank you @pkalita-lbl. I have another PR coming out next week that adds more functionality to this module, I'll make sure to address all the review comments you've left here in a commit on that PR if that works?

@sujaypatil96 sujaypatil96 merged commit 3e7a36a into main Jan 11, 2025
2 checks passed
@sujaypatil96 sujaypatil96 deleted the issue-756 branch January 11, 2025 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modify GOLD ETL code to check existing records and only create missing records.
3 participants