Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of bundled data through designated separate repo #150

Merged
merged 34 commits into from
May 4, 2023

Conversation

Kdreval
Copy link
Collaborator

@Kdreval Kdreval commented Dec 29, 2022

This is related to issues #147 and #149
Currently this set up supports use of ashm coordinates of specific version as GAMBLR is first loaded. The version is stored in config and can be modified on session-by-session, project-by-project basis.
If this set up proves to be efficient, it will be extended to handle also lymphoma genes and other bundled GAMBLR data.

@Kdreval Kdreval marked this pull request as draft January 30, 2023 19:53
R/load_data.R Outdated
latest_version <- max(versions)

# Which version did the user requested in config?
requested_version <- config::get("bundled_data_versions")[[mode]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be bundled_data_version (not plural) since only one is ever specified by the user, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! I'll fix this!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I started adding other objects and looking at this, I realized here that we would have this in plural - because different data types develop more than the others and iterate through versions differently. For example, for aSHM sites we already have versions 0, 1, 2, and 3 - but for lymphoma genes only 0 and 1. So config here will have different keys depending on the data type.

@@ -160,3 +160,7 @@ default:
hg38-nci: "04-24937N-Schmitz"
hg38: "BLGSP-71-06-00286-99A-01D"
hg19-clc: "PA011-G"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also use this to specify a default metadata for example data sets bundled with the package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking of it the same way as well - and if the concept in this draft is something we will move forward with I will switch the metadata (+ ssm, SV, CNV) to the helper package as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also omitted from this PR bundling the example metadata and datasets - I think this can be it;s own dedicated PR since we would also want to generate examples for each function using that example data and for metadata we also need to work out minimal required columns

@Kdreval Kdreval marked this pull request as ready for review March 24, 2023 15:43
@Kdreval
Copy link
Collaborator Author

Kdreval commented Mar 24, 2023

This is now tested and is ready for review

Copy link
Contributor

@mattssca mattssca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a great update Kosta. I have a few comments related to how you handle the function documentation. Let me know if these make sense to you. Thanks!

R/load_data.R Outdated
@@ -0,0 +1,106 @@
# Helper functions not for export

#' Check for a version of data to load
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a descriptive title for this function. Currently, this function has the title "Check for a version of data to load". The first line of the function doc, is where the title is extracted from.

R/load_data.R Outdated

#' Check for a version of data to load
#'
#' This function determines if a user is requesting the latest version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I would specify that this is a helper function (to have it consistent with other helper functions).

#'
#' @return data frame
#' @import config dplyr readr GAMBLR.data
#'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone over all the GAMBLR helper functions on my branch. This was done as a step in preparing the documentation for building a site from the source code with build_site from pkgdown. This function takes all .Rd files that live in the man folder and generates function-specific HTML files. This is not something we want for helper functions (since they are not exported in NAMESPACE).

The fix for this is to add @noRd in the documentation for such functions. This prevents the .Rd file from being generated.

I think you should also specify this for this function to keep things consistent. Let me know if this doesn't make sense, or if you need me to further clarify.

Thanks,

@mattssca
Copy link
Contributor

Thanks for the updates Kostia, for some reason Git didn't notify me on email that you've updated this branch.

However, I cloned this branch to test the newly added collate_pga and was given an error in return when running the example in the docs. This is the error message that was returned:

> pga_metrics <- collate_pga(these_samples_metadata = meta)
Collating the PGA results ...

Error in UseMethod("filter") : 
no applicable method for 'filter' applied to an object of class "function" 
  
5. filter(., start <= arm_end & arm_start <= end)

4. arrange(., sample, chrom, start)

3. df %>% filter(start <= arm_end & arm_start <= end) %>% arrange(sample, 
    chrom, start) at utilities.R#3886
    
2. calculate_pga(this_seg = multi_sample_seg) at utilities.R#3726

1. collate_pga(these_samples_metadata = meta) 

I am not sure if this error is related to dplyr not being specified at the call of filter (line 3886) or if the reason is that the df(same line) is previously undefined, or if it's something completely different. But you might want to look into this. Let me know if you need any help testing.

Thanks,

@Kdreval
Copy link
Collaborator Author

Kdreval commented May 1, 2023

Thanks Adam for catching this! Yes it was because of df was undefined but I have also specified the dplyr::filter() in the next line as it is a best practice. I have pushed the working version and it is going through the GitHub Actions - when passed the PR is ready 😄

Thanks for the updates Kostia, for some reason Git didn't notify me on email that you've updated this branch.

However, I cloned this branch to test the newly added collate_pga and was given an error in return when running the example in the docs. This is the error message that was returned:

> pga_metrics <- collate_pga(these_samples_metadata = meta)
Collating the PGA results ...

Error in UseMethod("filter") : 
no applicable method for 'filter' applied to an object of class "function" 
  
5. filter(., start <= arm_end & arm_start <= end)

4. arrange(., sample, chrom, start)

3. df %>% filter(start <= arm_end & arm_start <= end) %>% arrange(sample, 
    chrom, start) at utilities.R#3886
    
2. calculate_pga(this_seg = multi_sample_seg) at utilities.R#3726

1. collate_pga(these_samples_metadata = meta) 

I am not sure if this error is related to dplyr not being specified at the call of filter (line 3886) or if the reason is that the df(same line) is previously undefined, or if it's something completely different. But you might want to look into this. Let me know if you need any help testing.

Thanks,

@mattssca
Copy link
Contributor

mattssca commented May 1, 2023

Thanks Kostia! I will clone this branch and try it out!

Copy link
Contributor

@mattssca mattssca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update Kostia, it sure feels nice to finally have capture support for the CN functions. I have some comments before I approve this PR. Let me know if there is anything you want me to clarify. Thanks again,

  1. There are also a few fancy_x_plot functions (fancy_v_chrcount, fancy_snv_chrdistplot, fancy_v_count, fancy_v_sizedis, fancy_ideogram, fancy_circos_plot) that are internally calling assign_cn_to_ssm. These functions should also have the recently added this_seq_type parameter added.

  2. Other functions that also internally call assign_cn_to_ssm are; estimate_purity, copy_number_vaf_plot, (these functions should also have the new parameter added).

  3. The snakefile used for retrieving data (to run GAMBLR remote) should probably also be updated to fetch the capture version of the merged CNV data file. Otherwise, the remote functionality will be broken whenever seq type capture is requested.

  4. As a side note, I ran my recently developed get_missing_from_merge with merge = "cnv" and this_seq_type = "capture" and 1 sample appears to be missing (1669-06-05PD). Is this expected?

  5. The first example in get_cn_states does not seem to be working for me. I am interested if this works for you. I've tried this example on my own branch and on the recently cloned Kostia branch, but errors out in both cases. This is the example I am referring to cn_matrix = get_cn_states(regions_bed = grch37_lymphoma_genes_bed)

  6. I tested out all the affected functions with different parameters to get a good sense of how they can handle this update. The results look great, with one exception. assign_cn_to_ssm errors out when capture is specified as the seq type. What about adding a working example for this function with the seq type set to capture? Or detail the limitations of the function in the docs. What do you think?

  7. collate_pga now works as intended, nice! I am wondering, do you want to update the cached files (for collate_results) as well? I checked, and PGA seems to missing from these results. If not, this can probably be done in the near future.

Thanks!

@@ -1887,11 +1894,6 @@ get_cn_segments = function(region,
#checks
remote_session = check_remote_configuration(auto_connect = TRUE)

#check seq type and return a message if anything besides "genome" is called. To be updated once capture samples have been processed through CNV protocols.
if(this_seq_type!="genome"){
stop("Currently, only genome samples are available for this function. Please select a valid seq type (i.e genome). Compatibility for capture samples will be added soon...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must've felt nice to remove 😎

@@ -538,7 +538,6 @@ region_to_gene = function(region,
#' @noRd
#'
#' @rawNamespace import(data.table, except = c("last", "first", "between", "transpose"))
#'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, there it is! I've been looking for this. Thanks for fixing it!

Kdreval added 2 commits May 3, 2023 20:30
Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
@Kdreval
Copy link
Collaborator Author

Kdreval commented May 4, 2023

Thanks Adam! Here is what I think:

  1. These have been added and documentation updated/regenerated ✅

  2. Added the new parameter to estimate_purity, copy_number_vaf_plot, and updated the documentation ✅

  3. I don't have local R/Rstudio and could not test the remote functionality. Could you help with this if you have the remote functionality setup and have testing case? 🏳️

  4. Yes, this is expected. That sample is problematic and will be eventually kicked out of GAMBL. Here are more details. ✅

  5. I noticed this example was removed too - yes. It does generates an error for me too when I test it as is - but it is not capture related or cnv related. The reason for this is that the region_bed argument supplied in this example contains a corrupted entry:

171 7                               151832010    152133090 KMT2C      
172 8                                 1993155      2113475 MYOM2      
173 8   8640864 8751155 MFHAS1             NA           NA NA         
174 8                                20054878     20084330 ATP6V1B2   
175 8                                35092975     35654068 UNC5D   

It of course is not producing a desired output (column in the matrix) and therefore there is mismatch in the length of names and columns - so the function errors. I created an issue related to this issue #193 and this can be addressed in a separate PR. 🏳️

  1. That function is broken in a sense that it only works with battenberg outputs and needs a revamp on how it works. I have also created issue Extend CNV tool support in assign_cn_to_ssm #194 and it can be addressed separately. 🏳️

  2. Sure, I can update the flatfiles as well - only thinking it should be done after PR is merged so the code how those files are generated is on master and accessible to everyone/reproducible. 🏳️

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca
@mattssca
Copy link
Contributor

mattssca commented May 4, 2023

Thanks for answering my questions Kostia. I can indeed test and address the remote functionality related to capture CNVs. I'll add support for this in my next PR. I can also look into the other outstanding items that you created issues for. I think this PR looks good, I'll approve the changes.

@Kdreval
Copy link
Collaborator Author

Kdreval commented May 4, 2023

Regarding the Question 5 above, I have resolved it and closed the issue #193 . I will wait till the GitHub Actions pass on the latest commit for this branch and will then merge it to master. Thanks Adam!

@Kdreval Kdreval merged commit 7e9f5d6 into master May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants