-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling of bundled data through designated separate repo #150
Conversation
R/load_data.R
Outdated
latest_version <- max(versions) | ||
|
||
# Which version did the user requested in config? | ||
requested_version <- config::get("bundled_data_versions")[[mode]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be bundled_data_version
(not plural) since only one is ever specified by the user, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! I'll fix this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I started adding other objects and looking at this, I realized here that we would have this in plural - because different data types develop more than the others and iterate through versions differently. For example, for aSHM sites we already have versions 0, 1, 2, and 3 - but for lymphoma genes only 0 and 1. So config here will have different keys depending on the data type.
@@ -160,3 +160,7 @@ default: | |||
hg38-nci: "04-24937N-Schmitz" | |||
hg38: "BLGSP-71-06-00286-99A-01D" | |||
hg19-clc: "PA011-G" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also use this to specify a default metadata for example data sets bundled with the package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was thinking of it the same way as well - and if the concept in this draft is something we will move forward with I will switch the metadata (+ ssm, SV, CNV) to the helper package as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also omitted from this PR bundling the example metadata and datasets - I think this can be it;s own dedicated PR since we would also want to generate examples for each function using that example data and for metadata we also need to work out minimal required columns
This is now tested and is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for a great update Kosta. I have a few comments related to how you handle the function documentation. Let me know if these make sense to you. Thanks!
R/load_data.R
Outdated
@@ -0,0 +1,106 @@ | |||
# Helper functions not for export | |||
|
|||
#' Check for a version of data to load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a descriptive title for this function. Currently, this function has the title "Check for a version of data to load". The first line of the function doc, is where the title is extracted from.
R/load_data.R
Outdated
|
||
#' Check for a version of data to load | ||
#' | ||
#' This function determines if a user is requesting the latest version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where I would specify that this is a helper function (to have it consistent with other helper functions).
#' | ||
#' @return data frame | ||
#' @import config dplyr readr GAMBLR.data | ||
#' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have gone over all the GAMBLR helper functions on my branch. This was done as a step in preparing the documentation for building a site from the source code with build_site
from pkgdown
. This function takes all .Rd
files that live in the man folder and generates function-specific HTML files. This is not something we want for helper functions (since they are not exported in NAMESPACE).
The fix for this is to add @noRd
in the documentation for such functions. This prevents the .Rd
file from being generated.
I think you should also specify this for this function to keep things consistent. Let me know if this doesn't make sense, or if you need me to further clarify.
Thanks,
Thanks for the updates Kostia, for some reason Git didn't notify me on email that you've updated this branch. However, I cloned this branch to test the newly added
I am not sure if this error is related to Thanks, |
Thanks Adam for catching this! Yes it was because of df was undefined but I have also specified the
|
Thanks Kostia! I will clone this branch and try it out! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update Kostia, it sure feels nice to finally have capture support for the CN functions. I have some comments before I approve this PR. Let me know if there is anything you want me to clarify. Thanks again,
-
There are also a few
fancy_x_plot
functions (fancy_v_chrcount
,fancy_snv_chrdistplot
,fancy_v_count
,fancy_v_sizedis
,fancy_ideogram
,fancy_circos_plot
) that are internally callingassign_cn_to_ssm
. These functions should also have the recently addedthis_seq_type
parameter added. -
Other functions that also internally call
assign_cn_to_ssm
are;estimate_purity
,copy_number_vaf_plot
, (these functions should also have the new parameter added). -
The snakefile used for retrieving data (to run GAMBLR remote) should probably also be updated to fetch the capture version of the merged CNV data file. Otherwise, the remote functionality will be broken whenever seq type capture is requested.
-
As a side note, I ran my recently developed
get_missing_from_merge
withmerge = "cnv"
andthis_seq_type = "capture"
and 1 sample appears to be missing (1669-06-05PD). Is this expected? -
The first example in
get_cn_states
does not seem to be working for me. I am interested if this works for you. I've tried this example on my own branch and on the recently cloned Kostia branch, but errors out in both cases. This is the example I am referring tocn_matrix = get_cn_states(regions_bed = grch37_lymphoma_genes_bed)
-
I tested out all the affected functions with different parameters to get a good sense of how they can handle this update. The results look great, with one exception.
assign_cn_to_ssm
errors out when capture is specified as the seq type. What about adding a working example for this function with the seq type set to capture? Or detail the limitations of the function in the docs. What do you think? -
collate_pga
now works as intended, nice! I am wondering, do you want to update the cached files (forcollate_results
) as well? I checked, and PGA seems to missing from these results. If not, this can probably be done in the near future.
Thanks!
@@ -1887,11 +1894,6 @@ get_cn_segments = function(region, | |||
#checks | |||
remote_session = check_remote_configuration(auto_connect = TRUE) | |||
|
|||
#check seq type and return a message if anything besides "genome" is called. To be updated once capture samples have been processed through CNV protocols. | |||
if(this_seq_type!="genome"){ | |||
stop("Currently, only genome samples are available for this function. Please select a valid seq type (i.e genome). Compatibility for capture samples will be added soon...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must've felt nice to remove 😎
@@ -538,7 +538,6 @@ region_to_gene = function(region, | |||
#' @noRd | |||
#' | |||
#' @rawNamespace import(data.table, except = c("last", "first", "between", "transpose")) | |||
#' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, there it is! I've been looking for this. Thanks for fixing it!
Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
Thanks Adam! Here is what I think:
It of course is not producing a desired output (column in the matrix) and therefore there is mismatch in the length of names and columns - so the function errors. I created an issue related to this issue #193 and this can be addressed in a separate PR. 🏳️
|
Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
Thanks for answering my questions Kostia. I can indeed test and address the remote functionality related to capture CNVs. I'll add support for this in my next PR. I can also look into the other outstanding items that you created issues for. I think this PR looks good, I'll approve the changes. |
Regarding the Question 5 above, I have resolved it and closed the issue #193 . I will wait till the GitHub Actions pass on the latest commit for this branch and will then merge it to master. Thanks Adam! |
This is related to issues #147 and #149
Currently this set up supports use of ashm coordinates of specific version as GAMBLR is first loaded. The version is stored in config and can be modified on session-by-session, project-by-project basis.
If this set up proves to be efficient, it will be extended to handle also lymphoma genes and other bundled GAMBLR data.