-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathderfinderHelper.Rmd
243 lines (180 loc) · 11 KB
/
derfinderHelper.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "Introduction to derfinderHelper"
author:
- name: Leonardo Collado-Torres
affiliation:
- &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
- &ccb Center for Computational Biology, Johns Hopkins University
email: lcolladotor@gmail.com
output:
BiocStyle::html_document:
self_contained: yes
toc: true
toc_float: true
toc_depth: 2
code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('derfinderHelper')`"
vignette: >
%\VignetteIndexEntry{Introduction to derfinderHelper}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Basics
## Install `r Biocpkg('derfinderHelper')`
`R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinderHelper')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinderHelper) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinderHelper')` by using the following commands in your `R` session:
```{r 'installDer', eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("derfinderHelper")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
```
## Required knowledge
`r Biocpkg('derfinderHelper')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A `r Biocpkg('derfinderHelper')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('derfinder')`.
If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU).
## Asking for help
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` or `derfinderHelper` tags and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
## Citing `r Biocpkg('derfinderHelper')`
We hope that `r Biocpkg('derfinderHelper')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
```{r 'citation'}
## Citation info
citation("derfinderHelper")
```
# Introduction to `r Biocpkg('derfinderHelper')`
```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()
## Bib setup
library("RefManageR")
## Write bibliography information
bib <- c(
derfinderHelper = citation("derfinderHelper")[1],
BiocStyle = citation("BiocStyle")[1],
knitr = citation("knitr")[3],
RefManageR = citation("RefManageR")[1],
rmarkdown = citation("rmarkdown")[1],
R = citation(),
IRanges = citation("IRanges")[1],
Matrix = citation("Matrix")[1],
S4Vectors = RefManageR::BibEntry(
bibtype = "manual", key = "S4Vectors",
author = "Hervé Pagès and Michael Lawrence and Patrick Aboyoun",
title = "S4Vectors: S4 implementation of vector-like and list-like objects",
year = 2017, doi = "10.18129/B9.bioc.S4Vectors"
),
sessioninfo = citation("sessioninfo")[1],
testthat = citation("testthat")[1]
)
```
`r Biocpkg('derfinderHelper')` `r Citep(bib[['derfinderHelper']])` is a small package that was created to speed up the F-statistics approach implemented in the parent package `r Biocpkg('derfinder')`. It contains a single function, `fstats.apply()`, which is used to calculate the F-statistics for a given data matrix, null and an alternative models.
The data is generally arranged in an matrix where the rows ($n$) are the genomic features of interest (gene-level summaries, exon-level summaries, or base-level data) and the columns ($m$) represent the samples. The other two main arguments for `fstats.apply()` are the null and alternative model matrices which are $m \times p_0$ and $m \times p$ where $p_0$ is the number of covariates in the null model and $p$ is the number of covariates in the alternative model. The models have to be nested and thus by definition $p > p_0$. The end result is a vector of F-statistics with length $n$, which is run length encoded for memory saving purposes.
Other arguments of `fstats.apply()` are related to flow in `r Biocpkg('derfinder')` such as the scaling factor (`scalefac`) used, whether to subset the data (`index`), and if the data was separated into chunks and saved to disk to lower the memory load (`lowMemDir`).
Implementation-wise, `adjustF` is useful when the denominator of the F-statistic calculation is too small. Finally, `method` controls how will the F-statistics be calculated.
* `Matrix` is the recommended option because it uses around half the memory load of `regular` and can be faster. Specially if the data was saved in this format previously by `r Biocpkg('derfinder')`.
* `Rle` uses the least amount of memory but gets very slow as the number of samples increases. Thus making it less than ideal in several cases.
* `regular` uses base `R` to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run `fstats.apply()` on different portions of the data.
The F-statistics for each feature $i$ are calculated using the following formula:
$$ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (p - p_0 - \text{df_1}))} $$
# Example
The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via `r Biocpkg('derfinder')`.
## Data
First lets create an example data set where we have information for 1000 features and 16 samples where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D.
```{r 'createData'}
## Create some toy data
suppressPackageStartupMessages(library("IRanges"))
set.seed(20140923)
toyData <- DataFrame(
"sample1" = Rle(sample(0:10, 1000, TRUE)),
"sample2" = Rle(sample(0:10, 1000, TRUE)),
"sample3" = Rle(sample(0:10, 1000, TRUE)),
"sample4" = Rle(sample(0:10, 1000, TRUE)),
"sample5" = Rle(sample(0:15, 1000, TRUE)),
"sample6" = Rle(sample(0:15, 1000, TRUE)),
"sample7" = Rle(sample(0:15, 1000, TRUE)),
"sample8" = Rle(sample(0:15, 1000, TRUE)),
"sample9" = Rle(sample(0:20, 1000, TRUE)),
"sample10" = Rle(sample(0:20, 1000, TRUE)),
"sample11" = Rle(sample(0:20, 1000, TRUE)),
"sample12" = Rle(sample(0:20, 1000, TRUE)),
"sample13" = Rle(sample(0:100, 1000, TRUE)),
"sample14" = Rle(sample(0:100, 1000, TRUE)),
"sample15" = Rle(sample(0:100, 1000, TRUE)),
"sample16" = Rle(sample(0:100, 1000, TRUE))
)
## Lets say that we have 4 groups
group <- factor(rep(toupper(letters[1:4]), each = 4))
## Note that some groups have higher coverage, we can adjust for this in the model
sampleDepth <- sapply(toyData, sum)
sampleDepth
```
## Models
Next we create the model matrices for our example data set. Lets say that we want to calculate F-statistics comparing the alternative hypothesis that the group coefficients are not 0 versus the null hypothesis that they are equal to 0, when adjusting for the sample depth.
To do so, we create the nested models.
```{r 'createModels'}
## Build the model matrices
mod <- model.matrix(~ sampleDepth + group)
mod0 <- model.matrix(~sampleDepth)
## Explore them
mod
mod0
```
## Get F-statistics
Finally, we can calculate the F-statistics using `fstats.apply()`.
```{r 'calculateFstats'}
library("derfinderHelper")
fstats <- fstats.apply(data = toyData, mod = mod, mod0 = mod0, scalefac = 1)
fstats
```
We can then proceed to use this information in `r Biocpkg('derfinder')` or in any way you like.
# Details
We created `r Biocpkg('derfinderHelper')` for calculating F-statistics using `SnowParam()` from `r Biocpkg('BiocParallel')`. Using this form of parallelization requires loading the necessary packages in the child processes. Because `r Biocpkg('derfinder')` takes a long time to load, we shipped off `fstats.apply()` to its own package to improve the speed of the calculations while retaining the memory advantages of `SnowParam()` over `MulticoreParam()`.
Note that transforming the data from a `DataFrame` to a `dgCMatrix` takes some time, so the most efficient performance is achieved when the data is converted at the beginning instead of at every permutation calculation. This is done in `derfinder::preprocessCoverage()` when `lowMemDir` is specified.
# Reproducibility
This package was made possible thanks to:
* R `r Citep(bib[['R']])`
* `r Biocpkg('IRanges')` `r Citep(bib[['IRanges']])`
* `r CRANpkg('Matrix')` `r Citep(bib[['Matrix']])`
* `r Biocpkg('S4Vectors')` `r Citep(bib[['S4Vectors']])`
* `r CRANpkg('sessioninfo')` `r Citep(bib[['sessioninfo']])`
* `r CRANpkg('knitr')` `r Citep(bib[['knitr']])`
* `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])`
* `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
* `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])`
* `r CRANpkg('testthat')` `r Citep(bib[['testthat']])`
Code for creating the vignette
```{r createVignette, eval=FALSE}
## Create the vignette
library("rmarkdown")
system.time(render("derfinderHelper.Rmd", "BiocStyle::html_document"))
## Extract the R code
library("knitr")
knit("derfinderHelper.Rmd", tangle = TRUE)
```
Date the vignette was generated.
```{r reproducibility1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```
Wallclock time spent generating the vignette.
```{r reproducibility2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```
`R` session information.
```{r reproducibility3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```
# Bibliography
This vignette was generated using `r Biocpkg('BiocStyle')` `r Citep(bib[['BiocStyle']])`
with `r CRANpkg('knitr')` `r Citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r Citep(bib[['rmarkdown']])` running behind the scenes.
Citations made with `r CRANpkg('RefManageR')` `r Citep(bib[['RefManageR']])`.
```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```