Handling of bundled data through designated separate repo #150

Kdreval · 2022-12-29T02:37:52Z

This is related to issues #147 and #149.
Currently this setup supports loading a specific version of the aSHM coordinates when GAMBLR is first loaded. The version is stored in the config and can be changed on a session-by-session or project-by-project basis.
If this setup proves efficient, it will be extended to handle lymphoma genes and other bundled GAMBLR data as well.

rdmorin · 2023-03-16T21:42:26Z

R/load_data.R

+    latest_version <- max(versions)
+
+    # Which version did the user requested in config?
+    requested_version <- config::get("bundled_data_versions")[[mode]]


I think this should be bundled_data_version (not plural) since only one is ever specified by the user, right?

Right! I'll fix this!

When I started adding other objects and looked at this again, I realized that the key should stay plural, because different data types evolve at different rates and go through versions independently. For example, aSHM sites already have versions 0, 1, 2, and 3, but lymphoma genes only have 0 and 1. So the config will hold a separate key for each data type.
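
To illustrate the per-data-type lookup described above, here is a minimal sketch. It is not the code from R/load_data.R: the list `bundled_data_versions` stands in for what `config::get("bundled_data_versions")` would return, and the data type names, the `available_versions` registry, the `"latest"` keyword, and the function name `resolve_version` are all assumptions made for the example.

```r
# Hypothetical stand-in for config::get("bundled_data_versions"):
# one key per data type, each with its own requested version.
bundled_data_versions <- list(
  somatic_hypermutation_locations = "latest",
  lymphoma_genes = "0"
)

# Hypothetical registry of the versions that exist for each data type.
available_versions <- list(
  somatic_hypermutation_locations = c(0, 1, 2, 3),
  lymphoma_genes = c(0, 1)
)

# Resolve the version to load for one data type (`mode`).
resolve_version <- function(mode,
                            requested = bundled_data_versions,
                            available = available_versions) {
  versions <- available[[mode]]
  if (is.null(versions)) {
    stop("Unknown data type: ", mode)
  }
  latest_version <- max(versions)
  requested_version <- requested[[mode]]
  # Assumption: a missing key or the keyword "latest" means the newest version.
  if (is.null(requested_version) || identical(requested_version, "latest")) {
    return(latest_version)
  }
  requested_version <- as.numeric(requested_version)
  if (!requested_version %in% versions) {
    stop("Version ", requested_version, " is not available for ", mode)
  }
  requested_version
}

ashm_version <- resolve_version("somatic_hypermutation_locations")
genes_version <- resolve_version("lymphoma_genes")
```

Keeping one key per data type lets aSHM sites move to version 3 while lymphoma genes stay pinned at an older version.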

rdmorin · 2023-03-16T21:43:53Z

config.yml

@@ -160,3 +160,7 @@ default:
                hg38-nci: "04-24937N-Schmitz"
                hg38: "BLGSP-71-06-00286-99A-01D"
                hg19-clc: "PA011-G"
+


Can we also use this to specify a default metadata for example data sets bundled with the package?

Yes, I was thinking of it the same way. If the concept in this draft is something we move forward with, I will switch the metadata (plus SSM, SV, CNV) to the helper package as well.

I have also left bundling the example metadata and datasets out of this PR. I think this can be its own dedicated PR, since we would also want to generate examples for each function using that example data, and for metadata we also need to work out the minimal required columns.

Kdreval · 2023-03-24T15:43:49Z

This is now tested and is ready for review

mattssca

Thanks for a great update, Kostia. I have a few comments related to how you handle the function documentation. Let me know if these make sense to you. Thanks!

mattssca · 2023-03-31T18:33:49Z

R/load_data.R

@@ -0,0 +1,106 @@
+# Helper functions not for export
+
+#' Check for a version of data to load


Consider adding a more descriptive title for this function. Currently, this function has the title "Check for a version of data to load". The title is extracted from the first line of the function doc.

mattssca · 2023-03-31T18:34:20Z

R/load_data.R

+
+#' Check for a version of data to load
+#'
+#' This function determines if a user is requesting the latest version


This is where I would specify that this is a helper function (to have it consistent with other helper functions).

mattssca · 2023-03-31T18:38:50Z

R/load_data.R

+#'
+#' @return data frame
+#' @import config dplyr readr GAMBLR.data
+#'


I have gone over all the GAMBLR helper functions on my branch. This was done as a step in preparing the documentation for building a site from the source code with build_site from pkgdown. This function takes all .Rd files that live in the man folder and generates function-specific HTML files. This is not something we want for helper functions (since they are not exported in NAMESPACE).

The fix for this is to add @noRd in the documentation for such functions. This prevents the .Rd file from being generated.

I think you should also specify this for this function to keep things consistent. Let me know if this doesn't make sense, or if you need me to further clarify.

Thanks,
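
As an illustration of the @noRd suggestion above, a helper's roxygen block could look like the sketch below. The function name, its body, and the "INTERNAL FUNCTION" wording are hypothetical and only show where the tag goes; they are not taken from R/load_data.R.

```r
#' Pick the newest bundled data version
#'
#' INTERNAL FUNCTION, not meant for out-of-package use.
#'
#' @param versions Numeric vector of available versions.
#'
#' @return A single numeric value, the highest available version.
#'
#' @noRd
#'
pick_latest_version <- function(versions) {
  # Fail early on empty or non-numeric input.
  stopifnot(is.numeric(versions), length(versions) > 0)
  max(versions)
}

latest <- pick_latest_version(c(0, 1, 2, 3))
```

With @noRd in place, roxygen2 skips writing a .Rd file into man/ for this function, so the pkgdown site build has nothing to render for it.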

mattssca · 2023-04-28T19:29:24Z

Thanks for the updates Kostia. For some reason GitHub didn't notify me by email that you had updated this branch.

However, I cloned this branch to test the newly added collate_pga and got an error when running the example in the docs. This is the error message that was returned:

> pga_metrics <- collate_pga(these_samples_metadata = meta)
Collating the PGA results ...

Error in UseMethod("filter") :
  no applicable method for 'filter' applied to an object of class "function"

5. filter(., start <= arm_end & arm_start <= end)
4. arrange(., sample, chrom, start)
3. df %>% filter(start <= arm_end & arm_start <= end) %>% arrange(sample, chrom, start) at utilities.R#3886
2. calculate_pga(this_seg = multi_sample_seg) at utilities.R#3726
1. collate_pga(these_samples_metadata = meta)

I am not sure whether this error occurs because dplyr is not specified in the call to filter (line 3886), because df (same line) was not previously defined, or because of something else entirely. You might want to look into this. Let me know if you need any help testing.

Thanks,

Kdreval · 2023-05-01T16:54:56Z

Thanks Adam for catching this! Yes, it was because df was undefined, but I have also switched to dplyr::filter() on the next line, as that is best practice. I have pushed the working version and it is going through GitHub Actions. Once the checks pass, the PR is ready 😄
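
For context on why an undefined df produces this particular message: when no object named df exists in scope, R resolves the name to stats::df(), the F-distribution density function, so the filter generic receives a function instead of a data frame. The sketch below reproduces that failure mode and shows the explicit form of the fix. It assumes dplyr is installed, and the toy segment table and arm coordinates are made up for the example.

```r
# With no data frame called `df` defined, the name resolves to stats::df().
df_is_function <- is.function(df)

# Reproduce the failure mode: dplyr::filter() dispatches on a function.
err_msg <- tryCatch(
  dplyr::filter(df, start <= 10),
  error = function(e) conditionMessage(e)
)

# Explicit form: a named data frame and a namespaced filter call.
# The segments and arm coordinates below are toy values.
seg <- data.frame(
  sample = c("A", "A"),
  chrom = c("1", "1"),
  start = c(100, 5000),
  end = c(200, 6000),
  stringsAsFactors = FALSE
)
arm_start <- 1
arm_end <- 1000

# Keep the segments that overlap the arm.
overlapping <- dplyr::filter(seg, start <= arm_end & arm_start <= end)
```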

mattssca · 2023-05-01T16:57:49Z

Thanks Kostia! I will clone this branch and try it out!

mattssca

Thanks for the update Kostia, it sure feels nice to finally have capture support for the CN functions. I have some comments before I approve this PR. Let me know if there is anything you want me to clarify. Thanks again,

1. There are also a few fancy_x_plot functions (fancy_v_chrcount, fancy_snv_chrdistplot, fancy_v_count, fancy_v_sizedis, fancy_ideogram, fancy_circos_plot) that call assign_cn_to_ssm internally. These functions should also get the recently added this_seq_type parameter.
2. Other functions that call assign_cn_to_ssm internally are estimate_purity and copy_number_vaf_plot. These should also get the new parameter.
3. The snakefile used for retrieving data (to run GAMBLR remotely) should probably also be updated to fetch the capture version of the merged CNV data file. Otherwise, the remote functionality will break whenever seq type capture is requested.
4. As a side note, I ran my recently developed get_missing_from_merge with merge = "cnv" and this_seq_type = "capture", and one sample appears to be missing (1669-06-05PD). Is this expected?
5. The first example in get_cn_states does not work for me. Does it work for you? I tried this example on my own branch and on a fresh clone of your branch, and it errors out in both cases. This is the example I am referring to: cn_matrix = get_cn_states(regions_bed = grch37_lymphoma_genes_bed)
6. I tested all the affected functions with different parameters to get a good sense of how they handle this update. The results look great, with one exception: assign_cn_to_ssm errors out when capture is specified as the seq type. What about adding a working example for this function with the seq type set to capture, or detailing the limitations of the function in the docs? What do you think?
7. collate_pga now works as intended, nice! Do you want to update the cached files (for collate_results) as well? I checked, and PGA seems to be missing from these results. If not, this can probably be done in the near future.

Thanks!
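
The request above to add this_seq_type to functions that call assign_cn_to_ssm internally amounts to forwarding the argument instead of relying on the inner default. A minimal sketch of that pattern follows. Both functions are hypothetical stubs, not the GAMBLR implementations, and the returned string only makes the forwarded value visible.

```r
# Hypothetical stand-in for assign_cn_to_ssm: validates the seq type and
# returns a label so the forwarded value can be inspected.
assign_cn_stub <- function(this_sample_id,
                           this_seq_type = c("genome", "capture")) {
  this_seq_type <- match.arg(this_seq_type)
  paste0(this_sample_id, ":", this_seq_type)
}

# Hypothetical stand-in for a plotting wrapper: it exposes this_seq_type
# and passes it through to the inner call.
fancy_plot_stub <- function(this_sample_id, this_seq_type = "genome") {
  assign_cn_stub(this_sample_id, this_seq_type = this_seq_type)
}

genome_call <- fancy_plot_stub("S1")
capture_call <- fancy_plot_stub("S1", this_seq_type = "capture")
```

Without the pass-through, the wrapper would silently use genome data even when the caller works with capture samples.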

mattssca · 2023-05-03T22:55:41Z

R/database.R

@@ -1887,11 +1894,6 @@ get_cn_segments = function(region,
  #checks
  remote_session = check_remote_configuration(auto_connect = TRUE)

-  #check seq type and return a message if anything besides "genome" is called. To be updated once capture samples have been processed through CNV protocols.
-  if(this_seq_type!="genome"){
-    stop("Currently, only genome samples are available for this function. Please select a valid seq type (i.e genome). Compatibility for capture samples will be added soon...")


This must've felt nice to remove 😎

mattssca · 2023-05-03T22:58:23Z

R/utilities.R

@@ -538,7 +538,6 @@ region_to_gene = function(region,
 #' @noRd
 #'
 #' @rawNamespace import(data.table, except = c("last", "first", "between", "transpose"))
-#'


Aha, there it is! I've been looking for this. Thanks for fixing it!

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca

Kdreval · 2023-05-04T04:43:01Z

Thanks Adam! Here is what I think:

1. These have been added and the documentation updated/regenerated ✅
2. Added the new parameter to estimate_purity and copy_number_vaf_plot, and updated the documentation ✅
3. I don't have a local R/RStudio and could not test the remote functionality. Could you help with this if you have the remote functionality set up and have a test case? 🏳️
4. Yes, this is expected. That sample is problematic and will eventually be kicked out of GAMBL. Here are more details. ✅
5. Yes, I noticed this example was removed too. It generates an error for me as well when I test it as is, but the error is not capture related or CNV related. The reason is that the regions_bed argument supplied in this example contains a corrupted entry:

171 7                               151832010    152133090 KMT2C      
172 8                                 1993155      2113475 MYOM2      
173 8   8640864 8751155 MFHAS1             NA           NA NA         
174 8                                20054878     20084330 ATP6V1B2   
175 8                                35092975     35654068 UNC5D

This entry of course does not produce the desired output (a column in the matrix), so the number of names and the number of columns no longer match, and the function errors. I created issue #193 for this, and it can be addressed in a separate PR. 🏳️

6. That function is broken in the sense that it only works with battenberg outputs, and it needs a revamp. I have also created issue #194 (Extend CNV tool support in assign_cn_to_ssm), and it can be addressed separately. 🏳️
7. Sure, I can update the flatfiles as well. I only think it should be done after the PR is merged, so that the code that generates those files is on master and is accessible to everyone and reproducible. 🏳️
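
The corrupted row shown above (all fields collapsed into the first column, with NA in the rest) can be detected and repaired along the lines of the sketch below. This is only an illustration, not the fix that was committed: the column names, the toy table, and the function name `repair_bed` are assumptions for the example.

```r
# Toy BED-like table that mimics the corrupted entry: row 3 has every
# field collapsed into the chrom column.
bed <- data.frame(
  chrom = c("7", "8", "8   8640864 8751155 MFHAS1", "8"),
  start = c(151832010, 1993155, NA, 20054878),
  end = c(152133090, 2113475, NA, 20084330),
  name = c("KMT2C", "MYOM2", NA, "ATP6V1B2"),
  stringsAsFactors = FALSE
)

# Split collapsed rows on whitespace and write the fields back.
repair_bed <- function(bed) {
  bad_rows <- which(is.na(bed$start) | is.na(bed$end))
  for (i in bad_rows) {
    fields <- strsplit(trimws(bed$chrom[i]), "\\s+")[[1]]
    if (length(fields) != 4) {
      stop("Cannot repair row ", i, ": expected 4 fields, found ", length(fields))
    }
    bed$chrom[i] <- fields[1]
    bed$start[i] <- as.numeric(fields[2])
    bed$end[i] <- as.numeric(fields[3])
    bed$name[i] <- fields[4]
  }
  bed
}

fixed_bed <- repair_bed(bed)
```

A check such as `anyNA(bed$start)` before building the matrix would also surface this kind of entry early, instead of failing later on mismatched name and column counts.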

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca

mattssca · 2023-05-04T20:19:31Z

Thanks for answering my questions Kostia. I can indeed test and address the remote functionality related to capture CNVs. I'll add support for this in my next PR. I can also look into the other outstanding items that you created issues for. I think this PR looks good, I'll approve the changes.

Kdreval · 2023-05-04T21:22:14Z

Regarding Question 5 above, I have resolved it and closed issue #193. I will wait until GitHub Actions pass on the latest commit for this branch and will then merge it to master. Thanks Adam!

Kdreval added 7 commits December 28, 2022 18:16

cleanup: drop bundled ashm regions

a55b8a9

new feature: introduce config key for data version tracking

d086a1b

new feature: auto load bundled data of particular version

7315f5c

bug fix: refer to the most recent version

1c3c141

cleanup: add documentation

72ca772

cleanup: update documentation for the new function

2e23d10

cleanup: add examples for new function

7fc02da

Kdreval mentioned this pull request Dec 29, 2022

Allow version-centric use of bundled data #149

Closed

Kdreval requested a review from rdmorin December 29, 2022 03:35

merge master and resolve conflicts

1b28d15

Kdreval marked this pull request as draft January 30, 2023 19:53

Kdreval added 3 commits February 8, 2023 13:52

cleanup: remove dependance on internet to check for latest version of bundled data

2769b34

cleanup: updating NAMESPACE

62e6d10

Merge branch 'master' into kdreval-data_helper

62cb770

rdmorin reviewed Mar 16, 2023

Kdreval added 6 commits March 23, 2023 11:20

cleanup: merge master and resolve conflicts

342ebd6

new functionality: switching lymphoma genes to GAMBLR.data helper

ff52877

add fix from Ryan for prettyOncopot

173fbc4

cleanup: update documentation

acd8df5

cleanup: add GAMBLR.data as dependency

4a90f28

cleanup: standardize the config key check

57e28c6

Kdreval marked this pull request as ready for review March 24, 2023 15:43

mattssca suggested changes Mar 31, 2023

Kdreval added 4 commits April 21, 2023 11:06

review comments

81f8a06

get rid of data.table dependency when calculating PGA

ed74f3c

init collating PGA results

f21503d

add PGA collating to main function

c976122

bug fix: mistake in calculate_pga

50866a8

Kdreval added 8 commits May 3, 2023 11:31

new functionality: add capture support to

46a6aec

new functionality: add capture support to get_cn_segments

76a48fa

new functionality: add capture support to get_cn_states

16fb2e1

new functionality: add capture support throughout CNV functions

9586403

new functionality: remove capture stop in collate_pga

7672e8d

cleanup: update documentation

f566779

cleanup: merge master and resolve conflicts

e4b60b4

bugfix: remove newline in export statement messing up with NAMESPACE

76ec542

mattssca suggested changes May 3, 2023

Kdreval added 2 commits May 3, 2023 20:30

cleanup: add capture support to plotting functions

9ac4959

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca

cleanup: add more capture support

3f5eb75

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca

cleanup: add more documentation

02aea8b

Co-authored-by: Carl-Adam Mattsson cmattsson@bcgsc.ca

mattssca approved these changes May 4, 2023

bug fix: correct deliminators in corrupted bed file

db2e632

Kdreval merged commit 7e9f5d6 into master May 4, 2023
