-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add class with helper methods to fill in missing records in database #856
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks nice and I appreciate the documentation you've added. I have some comments and questions, but none of it is so critical that it needs to hold up merging.
I guess my main concern, design-wise, is that I'd prefer to be more explicit in terms of naming things and avoid vague (and somewhat misleading) wording like "missing" or "repair".
For example, there's a Dagster Op named missing_data_generation_repair
. What it actually seems to do is produce NucleotideSequencing
records based on information fetched from the GOLD API for a given Study
. That's all well and good. But it would do the same thing whether or not there are existing NucleotideSequencing
records associated with the given Study
. So the "missing" part of the name seems misleading. Similarly, to me, the word "repair" implies that the Study
input is going to be modified in some way. It also implies that the study was "broken" before 😂. All that being said, I probably would have named it something more explicit like generate_data_generation_set_from_gold_api_for_study
. It's long, but, hey, characters are free.
|
||
|
||
@graph | ||
def fill_missing_data_generation_data_object_records(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me "fill" implies that it's going to actually commit the results to Mongo. Is there any reason this Graph doesn't do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so we want to see the JSON export that this job produces, visually QC it and then ingest it into the system. So it's not doing the "filling" automatically. Is "stitch_" a good prefix?
|
||
|
||
@op | ||
def nmdc_study_id_filename(nmdc_study_id: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a more specific name since it produces a very specific file name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will address it!
from nmdc_schema import nmdc | ||
|
||
|
||
class DatabaseUpdater: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any value in this being a class? Since it's not inheriting from anything and the internal state never really gets modified, could create_missing_dg_records
just be a top-level function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, there will be other methods that I add to this "interface" (class) in the future, so we have to decide between "collecting" all of these methods under that class or have them be top level functions?
""" | ||
return self.gold_api_client.fetch_projects_by_biosample(gold_biosample_id) | ||
|
||
def create_missing_dg_records(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above comment, I'm a little hesitant about the "missing" part of this name. To me this is generate_data_generation_set_from_gold_api
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! "missing" is a little misleading, "generate" is better.
"get_database_updater_inputs": { | ||
"config": { | ||
"nmdc_study_id": "", | ||
"gold_nmdc_instrument_mapping_file_url": "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/refs/heads/main/assets/misc/gold_seqMethod_to_nmdc_instrument_set.tsv", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have already explained this to me and I've forgotten, but why do we keep this file in the nmdc-schema
repository?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No no, that's a good question. I actually don't know why we keep it in the nmdc-schema
repo. We were debating between having those files be "close" to the source files that use them in runtime or keep them in the schema repo, and I guess we just left the conversation there. We can talk about potentially moving them into this repo and have them be "close" to the Python scripts consuming them at our internal BBPO NMDC call.
Awesome!! this is such a wonderful review!! thank you @pkalita-lbl. I have another PR coming out next week that adds more functionality to this module, I'll make sure to address all the review comments you've left here in a commit on that PR if that works? |
We are trying to address the general problem of "database stitching" in this PR. It will often be the case that we have certain records in the database (say biosamples), but records that are needed for the next bits of processing of these "entities" might be missing.
For example, the first and most common use case that we are trying to address is filling in missing data_generation_set records for biosamples that have been brought into the database through the
SubmissionPortalTranslator
mechanism.DatabaseUpdater
class (general interface which we will use to collect helper methods to facilitate all use cases of "database stitching")Details
...
Related issue(s)
Fixes #756 microbiomedata/issues#951 microbiomedata/issues#813
Related subsystem(s)
docs
directory)Testing
I tested these changes by...
Documentation
docs
directory)Maintainability
study_id: str
)# TODO
or# FIXME
black
to format all the Python files I created/modified