Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial draft of FDA-approvals in va-spec~ish (in progress) format #36

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

brendanreardon
Copy link
Collaborator

@brendanreardon brendanreardon commented Jan 27, 2025

We aim to have a next version of the underlying database comply with GA4GH's Genomic Knowledge Standards' Variation Annotation (va-spec) format. This pull request does not comply with the current draft or trial version of va-spec. Notably:

  • referenced/biomarkers.json more closely follows our current format for biomarker representation, this will be eventually refactored to align with categorical variant (cat-vrs) representation.
  • referenced/therapies.json also follows our our current format for therapy representation.
  • referenced biomarkers and therapies are represented in referenced/propositions.json as lists, whereas va-spec only allows a cardinality of 0...1 for the analogous fields of subjectVariant and objectTherapeutic.
  • referenced/documents.json has several keys that should be refactored as extensions.
  • We store underlying or "raw" statements in referenced/indications.json. At the moment va-spec doesn't have a place for something like this this, I think.
  • referenced/statements.json contains a key "evidence" that is not in their data model for statements, and we've generalized the value from "FDA-Approval" to "Regulatory approval for use".

The schema and content within this format are not stable and are in active development.

To dos before merging this branch

  • Add installation instructions for repository
  • Add documentation about this version being a draft and in progress
  • Add documentation for utils/dereference.py
  • Add additional utils and tests
  • Add documentation for the roadmap of building out this version (maybe make a Discussion?)

Copy link

@sabrinacamp2 sabrinacamp2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vibe check = this is really impressive work. Geez, seems like a lot of effort to create those referenced/*.json files from the original database. I can see how this will be easier to maintain and expand upon, though.

This script appears to replace all 'references' (e.g. pointers to other files) with their full record from the file that the reference referred to. You have two main functions to do this, Dereference.integer and Dereference.list, which vary depending on whether the key's value references one record (integer) or can reference multiple (list). You dereference in multiple steps, using prior dereferenced lists of dictionaries in future steps to end up with a fully dereferenced database. I can't say I intuitively understand the sequence yet and I may need to draw out a diagram of all the connections, haha.

My comments are minor. Some aimed at improving interpretability (which may not be necessary, but was top of mind since this is the first time I've looked at this), others noting unused functions, and other misc things.

Thanks for bringing me into the loop on this work! Happy to be the one to incorporate any changes that you agree with.

utils/dereference.py Outdated Show resolved Hide resolved
utils/dereference.py Outdated Show resolved Hide resolved
referenced_key (str): name of the key in records to dereference.
referenced_records (str): list of dictionaries that the referenced_key refers to.
new_key_name (str): name of the key in records to replace referenced_key after dereference.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend adding a Returns: section for this function and the list function

utils/dereference.py Outdated Show resolved Hide resolved
utils/dereference.py Outdated Show resolved Hide resolved
utils/dereference.py Outdated Show resolved Hide resolved
"""
Dereferences a key for each record in records, presuming that the corresponding value is a list.

Args:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's only the option for a new key name in the integer function and not the list function. Likely intentional, perhaps you only wanted a new key name for keys that store one record

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was working trying to get an older version of this into a SQL database, I was able to dereference without replacing the key name when the value was a list, but not an integer. So I ended up having to add _id as a suffix to some keys. I'm still not entirely sure why this was the case though 😵‍💫



@staticmethod
def rename_key(dictionary: dict, old_key: str, new_key: str) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this function used anywhere, so I wanted to note that.

raise IOError(f"Failed to write to file {file}: {e}")


def write_json_records(data: list[dict], file:str) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used currently, so I wanted to note that. I see how it could be useful for writing out outputs of the Dereference.integer and list functions for debugging and what not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is precisely what ended up happening. I was debating leaving it in or not, and am tempted to leave it in to make it easier to reuse further down the line 🫠

brendanreardon and others added 12 commits January 28, 2025 17:34
Update description of Dereference.integer function per Sabrina's recommendation

Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
Argument description update in Dereference.integer function, recommended by Sabrina

Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
Updating description of Dereference.list function, as recommended by Sabrina

Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
@brendanreardon
Copy link
Collaborator Author

I created a diagram of the schema and added it to docs/assets, and created a README for the directory.

moalmanac-db-schema-diagram

"id": 0,
"conceptType": "Disease",
"label": "Acute Lymphoid Leukemia",
"extensions": [
Copy link

@korikuzma korikuzma Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend storing oncotree_term and oncotree_code as mappings, similar to how you're representing genes

{
                "coding": {
                    "id": "oncotree:ALL",
                    "code": "ALL",
                    "label": "Acute Lymphoid Leukemia",
                    "system": "https://oncotree.mskcc.org/?version=oncotree_latest_stable&field=CODE&search=",
                },
                "relation": "exactMatch",
                "extensions": [
                    {
                        "name": "solid_tumor",
                        "value": false
                    }
                ]
            }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo thank you, Kori!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants