-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial draft of FDA-approvals in va-spec~ish (in progress) format #36
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vibe check = this is really impressive work. Geez, seems like a lot of effort to create those referenced/*.json
files from the original database. I can see how this will be easier to maintain and expand upon, though.
This script appears to replace all 'references' (e.g. pointers to other files) with their full record from the file that the reference referred to. You have two main functions to do this, Dereference.integer
and Dereference.list
, which vary depending on whether the key's value references one record (integer) or can reference multiple (list). You dereference in multiple steps, using prior dereferenced lists of dictionaries in future steps to end up with a fully dereferenced database. I can't say I intuitively understand the sequence yet and I may need to draw out a diagram of all the connections, haha.
My comments are minor. Some aimed at improving interpretability (which may not be necessary, but was top of mind since this is the first time I've looked at this), others noting unused functions, and other misc things.
Thanks for bringing me into the loop on this work! Happy to be the one to incorporate any changes that you agree with.
referenced_key (str): name of the key in records to dereference. | ||
referenced_records (str): list of dictionaries that the referenced_key refers to. | ||
new_key_name (str): name of the key in records to replace referenced_key after dereference. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend adding a Returns:
section for this function and the list
function
""" | ||
Dereferences a key for each record in records, presuming that the corresponding value is a list. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's only the option for a new key name in the integer
function and not the list
function. Likely intentional, perhaps you only wanted a new key name for keys that store one record
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was working trying to get an older version of this into a SQL database, I was able to dereference without replacing the key name when the value was a list, but not an integer. So I ended up having to add _id
as a suffix to some keys. I'm still not entirely sure why this was the case though 😵💫
utils/dereference.py
Outdated
|
||
|
||
@staticmethod | ||
def rename_key(dictionary: dict, old_key: str, new_key: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this function used anywhere, so I wanted to note that.
utils/dereference.py
Outdated
raise IOError(f"Failed to write to file {file}: {e}") | ||
|
||
|
||
def write_json_records(data: list[dict], file:str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used currently, so I wanted to note that. I see how it could be useful for writing out outputs of the Dereference.integer
and list
functions for debugging and what not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that is precisely what ended up happening. I was debating leaving it in or not, and am tempted to leave it in to make it easier to reuse further down the line 🫠
Update description of Dereference.integer function per Sabrina's recommendation Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
Argument description update in Dereference.integer function, recommended by Sabrina Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
Updating description of Dereference.list function, as recommended by Sabrina Co-authored-by: sabrinacamp2 <52207849+sabrinacamp2@users.noreply.github.com>
I created a diagram of the schema and added it to docs/assets, and created a README for the directory. |
"id": 0, | ||
"conceptType": "Disease", | ||
"label": "Acute Lymphoid Leukemia", | ||
"extensions": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend storing oncotree_term
and oncotree_code
as mappings
, similar to how you're representing genes
{
"coding": {
"id": "oncotree:ALL",
"code": "ALL",
"label": "Acute Lymphoid Leukemia",
"system": "https://oncotree.mskcc.org/?version=oncotree_latest_stable&field=CODE&search=",
},
"relation": "exactMatch",
"extensions": [
{
"name": "solid_tumor",
"value": false
}
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooo thank you, Kori!
We aim to have a next version of the underlying database comply with GA4GH's Genomic Knowledge Standards' Variation Annotation (va-spec) format. This pull request does not comply with the current draft or trial version of va-spec. Notably:
referenced/biomarkers.json
more closely follows our current format for biomarker representation, this will be eventually refactored to align with categorical variant (cat-vrs) representation.referenced/therapies.json
also follows our our current format for therapy representation.referenced/propositions.json
as lists, whereas va-spec only allows a cardinality of0...1
for the analogous fields ofsubjectVariant
andobjectTherapeutic
.referenced/documents.json
has several keys that should be refactored as extensions.referenced/indications.json
. At the moment va-spec doesn't have a place for something like this this, I think.referenced/statements.json
contains a key "evidence" that is not in their data model for statements, and we've generalized the value from "FDA-Approval" to "Regulatory approval for use".The schema and content within this format are not stable and are in active development.
To dos before merging this branch
utils/dereference.py