Add class with helper methods to fill in missing records in database #856

sujaypatil96 · 2024-12-26T21:11:20Z

This PR addresses the general problem of "database stitching". We often have certain records in the database (say, biosamples), but the records needed for the next stages of processing those entities are missing.

For example, the first and most common use case that we are trying to address is filling in missing data_generation_set records for biosamples that have been brought into the database through the SubmissionPortalTranslator mechanism.

- Add a Dagster harness to support calling helper methods in the DatabaseUpdater class (a general interface that we will use to collect helper methods for all "database stitching" use cases)
- Add helper methods to the class
  - Create missing DataGeneration records given Biosamples that are already present in the database
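A minimal sketch of what that helper could look like, under stated assumptions. The only name taken from this PR is the client method `fetch_projects_by_biosample`; `StubGoldClient`, the dict shapes, the field names, and the identifiers below are illustrative placeholders, not the actual implementation.

```python
from typing import Dict, List


class StubGoldClient:
    """Hypothetical stand-in for the GOLD API client; returns canned data."""

    def __init__(self, projects_by_biosample: Dict[str, List[dict]]):
        self._projects_by_biosample = projects_by_biosample

    def fetch_projects_by_biosample(self, gold_biosample_id: str) -> List[dict]:
        return self._projects_by_biosample.get(gold_biosample_id, [])


def generate_data_generation_records(
    biosamples: List[dict], gold_client: StubGoldClient
) -> List[dict]:
    """Build one illustrative record per GOLD project linked to each biosample."""
    records = []
    for biosample in biosamples:
        # Field names here are assumptions made for the sketch.
        for gold_biosample_id in biosample.get("gold_biosample_identifiers", []):
            for project in gold_client.fetch_projects_by_biosample(gold_biosample_id):
                records.append(
                    {
                        "type": "nmdc:NucleotideSequencing",
                        "has_input": [biosample["id"]],
                        "gold_sequencing_project_identifiers": [
                            "gold:" + project["projectGoldId"]
                        ],
                    }
                )
    return records


# Placeholder identifiers, not real records.
client = StubGoldClient({"gold:Gb0000001": [{"projectGoldId": "Gp0000001"}]})
biosamples = [
    {"id": "nmdc:bsm-00-000001", "gold_biosample_identifiers": ["gold:Gb0000001"]}
]
records = generate_data_generation_records(biosamples, client)
```

Passing the client in as an argument keeps the sketch testable without network access.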

Details

...

Related issue(s)

Fixes #756 microbiomedata/issues#951 microbiomedata/issues#813

Related subsystem(s)

Testing

I tested these changes (explain below)
I did not test these changes

I tested these changes by...

Documentation

I have not checked for relevant documentation yet (e.g. in the docs directory)
I have updated all relevant documentation so it will remain accurate
Other (explain below)

Maintainability

Every Python function I defined includes a docstring (test functions are exempt from this)
Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
I used black to format all the Python files I created/modified
The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

nmdc_runtime/site/repair/database_updater.py

nmdc_runtime/site/resources.py

pkalita-lbl

Overall this looks nice and I appreciate the documentation you've added. I have some comments and questions, but none of it is so critical that it needs to hold up merging.

I guess my main concern, design-wise, is that I'd prefer to be more explicit in terms of naming things and avoid vague (and somewhat misleading) wording like "missing" or "repair".

For example, there's a Dagster Op named missing_data_generation_repair. What it actually seems to do is produce NucleotideSequencing records based on information fetched from the GOLD API for a given Study. That's all well and good. But it would do the same thing whether or not there are existing NucleotideSequencing records associated with the given Study. So the "missing" part of the name seems misleading. Similarly, to me, the word "repair" implies that the Study input is going to be modified in some way. It also implies that the study was "broken" before 😂. All that being said, I probably would have named it something more explicit like generate_data_generation_set_from_gold_api_for_study. It's long, but, hey, characters are free.
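To make the naming point concrete: an operation would only earn the word "missing" if it first checked which biosamples already have associated records. A hypothetical sketch of such a check follows. It is not part of this PR, and the `has_input` field name and the identifiers are assumptions.

```python
from typing import Iterable, List


def biosample_ids_lacking_data_generation(
    biosample_ids: Iterable[str], data_generation_records: Iterable[dict]
) -> List[str]:
    """Return the biosample ids that no existing record lists as an input."""
    covered = {
        biosample_id
        for record in data_generation_records
        for biosample_id in record.get("has_input", [])
    }
    return [b for b in biosample_ids if b not in covered]


# Placeholder identifiers, not real records.
existing = [{"id": "nmdc:dgns-00-000001", "has_input": ["nmdc:bsm-00-000001"]}]
lacking = biosample_ids_lacking_data_generation(
    ["nmdc:bsm-00-000001", "nmdc:bsm-00-000002"], existing
)
```

Without a filter like this, the operation generates records for every biosample, which is why a "generate" name describes it more accurately.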

pkalita-lbl · 2025-01-11T00:09:40Z

nmdc_runtime/site/graphs.py

+
+
+@graph
+def fill_missing_data_generation_data_object_records():


To me "fill" implies that it's going to actually commit the results to Mongo. Is there any reason this Graph doesn't do that?

Hm, so we want to see the JSON export that this job produces, visually QC it and then ingest it into the system. So it's not doing the "filling" automatically. Is "stitch_" a good prefix?
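A rough sketch of the export-then-review flow described in this reply, assuming the generated records are held in a plain dict. The function names, the file name, and the record contents are hypothetical; nothing here writes to Mongo.

```python
import json
import tempfile
from pathlib import Path


def export_records_for_review(database: dict, out_dir: Path, filename: str) -> Path:
    """Write generated records to a JSON file so they can be checked by eye."""
    out_path = out_dir / filename
    out_path.write_text(
        json.dumps(database, indent=2, sort_keys=True), encoding="utf-8"
    )
    return out_path


def load_reviewed_export(path: Path) -> dict:
    """Read the export back once QC is done; ingestion is a separate step."""
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder record, not real data.
    generated = {"data_generation_set": [{"id": "nmdc:dgns-00-000001"}]}
    exported = export_records_for_review(generated, Path(tmp), "export.json")
    reloaded = load_reviewed_export(exported)
```

Keeping generation and ingestion as separate steps is what makes "fill" a questionable prefix for the graph.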

pkalita-lbl · 2025-01-11T00:11:38Z

nmdc_runtime/site/ops.py

+
+
+@op
+def nmdc_study_id_filename(nmdc_study_id: str) -> str:


I think this needs a more specific name since it produces a very specific file name.

Will address it!
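One possible shape for a more specific name, shown as a plain function rather than a Dagster op. Both the function name and the file name pattern are invented for illustration, since the actual file name produced by the op is not shown in this thread.

```python
def data_generation_export_filename_for_study(nmdc_study_id: str) -> str:
    """Build a file name for a study's export (hypothetical naming pattern)."""
    # Replace characters that are awkward in file names.
    safe_id = nmdc_study_id.replace(":", "_").replace("-", "_")
    return f"data_generation_set_for_{safe_id}.json"


filename = data_generation_export_filename_for_study("nmdc:sty-11-e4yb9z58")
```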

pkalita-lbl · 2025-01-11T00:14:52Z

nmdc_runtime/site/repair/database_updater.py

+from nmdc_schema import nmdc
+
+
+class DatabaseUpdater:


Is there any value in this being a class? Since it's not inheriting from anything and the internal state never really gets modified, could create_missing_dg_records just be a top-level function instead?

Hm, I will be adding other methods to this "interface" (class) in the future, so we have to decide between collecting all of these methods under that class and making them top-level functions.
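The two options can be compared side by side in a small sketch. All names, fields, and identifiers below are hypothetical, and the "records" are simplified dicts rather than nmdc-schema objects.

```python
from dataclasses import dataclass
from typing import Callable, List


def generate_data_generation_set(
    fetch_biosample_ids: Callable[[str], List[str]], study_id: str
) -> List[dict]:
    """Option 1: a top-level function that receives its dependencies."""
    return [
        {"has_input": [biosample_id], "associated_studies": [study_id]}
        for biosample_id in fetch_biosample_ids(study_id)
    ]


@dataclass(frozen=True)
class DatabaseUpdaterSketch:
    """Option 2: a class that holds shared dependencies for several helpers."""

    fetch_biosample_ids: Callable[[str], List[str]]
    study_id: str

    def generate_data_generation_set(self) -> List[dict]:
        # Delegates to the function, so both options behave identically.
        return generate_data_generation_set(self.fetch_biosample_ids, self.study_id)


def fake_fetch(study_id: str) -> List[str]:
    """Canned lookup standing in for a real API or database call."""
    return ["nmdc:bsm-00-000001", "nmdc:bsm-00-000002"]


from_function = generate_data_generation_set(fake_fetch, "nmdc:sty-00-000001")
from_class = DatabaseUpdaterSketch(
    fake_fetch, "nmdc:sty-00-000001"
).generate_data_generation_set()
```

The class mostly pays off once several helpers share the same clients and study id. With a single method, the function form is shorter.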

pkalita-lbl · 2025-01-11T00:16:35Z

nmdc_runtime/site/repair/database_updater.py

+        """
+        return self.gold_api_client.fetch_projects_by_biosample(gold_biosample_id)
+
+    def create_missing_dg_records(self):


Similar to the above comment, I'm a little hesitant about the "missing" part of this name. To me this is generate_data_generation_set_from_gold_api.

Makes sense! "missing" is a little misleading, "generate" is better.

pkalita-lbl · 2025-01-11T00:17:20Z

nmdc_runtime/site/repository.py

+                    "get_database_updater_inputs": {
+                        "config": {
+                            "nmdc_study_id": "",
+                            "gold_nmdc_instrument_mapping_file_url": "https://raw.githubusercontent.com/microbiomedata/nmdc-schema/refs/heads/main/assets/misc/gold_seqMethod_to_nmdc_instrument_set.tsv",


You might have already explained this to me and I've forgotten, but why do we keep this file in the nmdc-schema repository?

No no, that's a good question. I actually don't know why we keep it in the nmdc-schema repo. We were debating between keeping those files "close" to the source files that use them in the runtime and keeping them in the schema repo, and I guess we just left the conversation there. At our internal BBPO NMDC call, we can talk about potentially moving them into this repo so that they are "close" to the Python scripts consuming them.
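For context, consuming a mapping file like this is simple wherever it lives. The sketch below parses an inline two-column TSV with the standard library. The column headers, the sample row, and the identifier are assumptions, since the real file's layout is not shown in this thread.

```python
import csv
import io
from typing import Dict

# Inline sample standing in for the downloaded file (hypothetical contents).
SAMPLE_TSV = (
    "GOLD SeqMethod\tNMDC instrument_set id\n"
    "Illumina HiSeq 2500\tnmdc:inst-00-000001\n"
)


def load_instrument_mapping(tsv_text: str) -> Dict[str, str]:
    """Map each GOLD sequencing method to an instrument id (first two columns)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


mapping = load_instrument_mapping(SAMPLE_TSV)
```

Because the loader only takes text, the same code works whether the file is fetched from a URL or read from a path next to the Python module.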

sujaypatil96 · 2025-01-11T00:47:28Z

Awesome!! This is such a wonderful review!! Thank you @pkalita-lbl. I have another PR coming out next week that adds more functionality to this module. I'll make sure to address all the review comments you've left here in a commit on that PR, if that works?

sujaypatil96 added 5 commits December 26, 2024 12:56

dagster harness for missing records updater

3fe47a0

stub of DatabaseUpdater class

8af1493

harness updates to accommodate DatabaseUpdater

ae220be

logic to make DataGeneration records based on GOLD ids

47320db

added tests for DatabaseUpdater

4434739

sujaypatil96 mentioned this pull request Jan 2, 2025

Ingest and process data for nmdc:sty-11-e4yb9z58 (GLBRC) microbiomedata/issues#813

Open

6 tasks

sujaypatil96 added 2 commits January 2, 2025 12:30

add documentation for logic in the DatabaseUpdater class

faf4bf7

improve caching in DatabaseUpdater

b028c8f

sujaypatil96 marked this pull request as ready for review January 3, 2025 01:09

sujaypatil96 requested a review from pkalita-lbl January 3, 2025 01:09

sujaypatil96 mentioned this pull request Jan 3, 2025

Create data_generation_set records for nmdc:sty-11-8ws97026 (PI Blanchard) microbiomedata/issues#951

Closed

2 tasks

aclum reviewed Jan 6, 2025

nmdc_runtime/site/repair/database_updater.py (resolved)

This was referenced Jan 6, 2025

n=2 Microbes Persist samples without omic data microbiomedata/issues#684

Open

missing data_generation_set record for nmdc:bsm-11-7v0s5h20 microbiomedata/issues#1006

Open

aclum reviewed Jan 7, 2025

nmdc_runtime/site/resources.py (outdated, resolved)

modify method that gets biosamples based on study

128906f

sujaypatil96 requested a review from aclum January 8, 2025 00:07

aclum previously approved these changes Jan 8, 2025

sujaypatil96 linked an issue Jan 8, 2025 that may be closed by this pull request

Update GOLD ETL logic for populating PI information #855

Open

sujaypatil96 removed a link to an issue Jan 8, 2025

Update GOLD ETL logic for populating PI information #855

Open

sujaypatil96 mentioned this pull request Jan 8, 2025

Ingest remaining 458 biosamples for nmdc:sty-11-547rwq94 (EMP500) microbiomedata/issues#940

Open

1 task

remove cache decorator on create_missing_dg_records()

dfbb1fe

sujaypatil96 dismissed aclum’s stale review via dfbb1fe January 10, 2025 02:04

sujaypatil96 requested a review from aclum January 10, 2025 19:52

aclum approved these changes Jan 11, 2025

pkalita-lbl approved these changes Jan 11, 2025

sujaypatil96 merged commit 3e7a36a into main Jan 11, 2025
2 checks passed

sujaypatil96 deleted the issue-756 branch January 11, 2025 00:47
