Add TCGA data
Added TCGA data from primary tumors
twbattaglia committed May 30, 2023
1 parent ccea3da commit ca69da7
Showing 20 changed files with 36,091 additions and 16 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -3,3 +3,4 @@
^data-raw$
^cran-comments\.md$
^\.travis\.yml$
testing.R
1 change: 1 addition & 0 deletions .gitignore
@@ -32,3 +32,4 @@ vignettes/*.pdf
*.utf8.md
*.knit.md
.Rproj.user
testing.R
10 changes: 7 additions & 3 deletions DESCRIPTION
@@ -1,13 +1,17 @@
 Package: MicrobeDS
 Title: Microbiome Datasets
-Version: 0.1.0
+Version: 0.1.1
 Authors@R: person("Tom", "Battaglia", email = "tb1280@nyu.edu",
     role = c("aut", "cre"))
 Description: A repository for large-scale microbiome datasets formatted for phyloseq.
 Depends:
-    R (>= 2.10)
+    R (>= 3.3.0)
+biocViews:
+Imports:
+    phyloseq
 License: CC0
 LazyData: true
 URL: http://github.com/twbattaglia/MicrobeDS
 BugReports: http://github.com/twbattaglia/MicrobeDS/issues
-RoxygenNote: 5.0.1
+RoxygenNote: 7.2.3
+Encoding: UTF-8
11 changes: 11 additions & 0 deletions R/TCGA.R
@@ -0,0 +1,11 @@
#' @title The Cancer Genome Atlas (TCGA)
#' @details WGS unmapped reads with Kraken2 profiling from primary tumors
#' @description Microbial reads from The Cancer Genome Atlas (TCGA), covering 33 types of cancer from treatment-naive patients (a total of 18,116 samples in the original study)
#' @usage data('TCGA')
#' @docType data
#' @source ftp://ftp.microbio.me/pub/cancer_microbiome_analysis
#' @format An object of class \code{"phyloseq"}.
#' @keywords datasets
#' @references Poore et al. (2020) Nature Mar;579(7800):567-574
#' (\href{https://pubmed.ncbi.nlm.nih.gov/32214244/}{PubMed})
"TCGA"
11 changes: 11 additions & 0 deletions R/TCGA.contaminants.R
@@ -0,0 +1,11 @@
#' @title The Cancer Genome Atlas (TCGA) contaminant list
#' @details WGS unmapped reads with Kraken2 profiling from primary tumors
#' @description The Cancer Genome Atlas (TCGA) of 33 types of cancer from treatment-naive patients (a total of 18,116 samples) for microbial reads
#' @usage data('TCGA.contaminants')
#' @docType data
#' @source ftp://ftp.microbio.me/pub/cancer_microbiome_analysis
#' @format An object of class \code{"phyloseq"}.
#' @keywords datasets
#' @references Poore et al. (2020) Nature Mar;579(7800):567-574
#' (\href{https://pubmed.ncbi.nlm.nih.gov/32214244/}{PubMed})
"TCGA.contaminants"
82 changes: 81 additions & 1 deletion README.md
@@ -5,7 +5,9 @@

### Install
```R
-devtools::install_github("twbattaglia/MicrobeDS")
+install.packages("remotes")
+
+remotes::install_github("twbattaglia/MicrobeDS")
```

### Usage
@@ -26,6 +28,84 @@ sample_data(HMPv35)
### Datasets
This package contains datasets provided by large-scale microbiome studies. Each dataset is formatted for use with phyloseq (https://joey711.github.io/phyloseq/).

### `TCGA`
**Description:** The Cancer Genome Atlas (TCGA)
**Number of samples:** 17625
**Data source:** ftp://ftp.microbio.me/pub/cancer_microbiome_analysis
**Study:** https://pubmed.ncbi.nlm.nih.gov/32214244/
**Processing:** Unmapped reads profiled using Kraken2 (un-normalized)
**Type:** OTU-table, Sample metadata
**Abstract:** Systematic characterization of the cancer microbiome provides a unique opportunity to develop cancer diagnostics that exploit non-human, microbial-derived molecules in a major human disease. Based on recent studies showing significant microbial contributions in select cancer types, we re-examined treatment-naïve whole genome and whole transcriptome sequencing studies (n=18,116 samples) from 33 cancer types in The Cancer Genome Atlas (TCGA) for microbial reads, and found unique microbial signatures in tissue and blood within and between most major cancer types.

#### Taxonomy
```R
library(phyloseq)
library(magrittr)  # provides the %>% pipe

# Taxonomy contains only Kingdom & Genus level
# due to missing intermediate taxonomy annotations
tax_table(TCGA) %>%
  as.data.frame() %>%
  head()
```

#### List of contaminants
```R
data("TCGA.contaminants")
head(TCGA.contaminants)

# Drop every genus that appears in the contaminant list
TCGA.filtered <- TCGA %>%
  subset_taxa(!(Genus %in% TCGA.contaminants$Genus))
```
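
The filter above relies on a negated `%in%` test against the contaminant list. As a minimal sketch of that logic, here is the same test on a toy data frame in base R. The genus names and read counts are made up for illustration and are not taken from the TCGA data, and `taxa`, `contaminants`, and `taxa_filtered` are hypothetical names.

```R
# Toy stand-ins (hypothetical values, not from the TCGA dataset)
taxa <- data.frame(
  Genus = c("Fusobacterium", "Ralstonia", "Bacteroides", "Bradyrhizobium"),
  reads = c(120, 45, 300, 12),
  stringsAsFactors = FALSE
)
contaminants <- data.frame(
  Genus = c("Ralstonia", "Bradyrhizobium"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose genus appears in the contaminant list
is_contaminant <- taxa$Genus %in% contaminants$Genus

# Keep only the rows that are NOT flagged
taxa_filtered <- taxa[!is_contaminant, ]

# Number of taxa that were dropped
n_removed <- sum(is_contaminant)
print(taxa_filtered)
```

Counting the flagged rows before filtering, as `n_removed` does here, is a cheap way to check that the contaminant list actually matched something.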

#### Normalize & remove batch effects
```R
library(microbiome)
library(edgeR)
library(limma)
library(snm)

# Normalize with TMM + voom (will take a long time)
covDesignNorm <- model.matrix(~0 + sample_type +
                                data_submitting_center_label +
                                platform +
                                experimental_strategy +
                                tissue_source_site_label +
                                portion_is_ffpe,
                              data = meta(TCGA))
# Strip punctuation and whitespace from the design-matrix column names
colnames(covDesignNorm) <- gsub('([[:punct:]])|\\s+', '', colnames(covDesignNorm))
dge <- DGEList(counts = abundances(TCGA))
# Drop taxa with too few counts to model
keep <- filterByExpr(dge, covDesignNorm)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge, method = "TMM")
vdge <- voom(dge, design = covDesignNorm, plot = TRUE, save.plot = TRUE, normalize.method = "none")

# Remove batch effects, keeping sample type as the biological variable (runs long)
bio.var.sample.type <- model.matrix(~sample_type, data = meta(TCGA))
adj.var <- model.matrix(~data_submitting_center_label +
                          platform +
                          experimental_strategy +
                          tissue_source_site_label +
                          portion_is_ffpe,
                        data = meta(TCGA))
colnames(bio.var.sample.type) <- gsub('([[:punct:]])|\\s+', '', colnames(bio.var.sample.type))
colnames(adj.var) <- gsub('([[:punct:]])|\\s+', '', colnames(adj.var))
voom.snm <- snm(raw.dat = vdge$E,
                bio.var = bio.var.sample.type,
                adj.var = adj.var,
                rm.adj = TRUE,
                verbose = TRUE,
                diagnose = TRUE)
voom.snm.data <- voom.snm$norm.dat
# Restore taxon (row) and sample (column) names on the normalized matrix
dimnames(voom.snm.data) <- dimnames(vdge$E)

# Create phyloseq with updated abundance
TCGA.snm <- TCGA
otu_table(TCGA.snm) <- otu_table(voom.snm.data, taxa_are_rows = TRUE)
```
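
The `gsub()` calls in the block above clean up the column names that `model.matrix()` generates. As a small sketch of what that pattern does, the example below builds a design matrix from toy metadata in base R. The `toy_meta` data frame and its values are hypothetical and only stand in for `meta(TCGA)`.

```R
# Toy metadata (hypothetical values, for illustration only)
toy_meta <- data.frame(
  sample_type = c("Primary Tumor", "Solid Tissue Normal", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# One indicator column per level; names have the form "<variable><level>"
design <- model.matrix(~0 + sample_type, data = toy_meta)
raw_names <- colnames(design)

# Same pattern as in the README: drop punctuation and whitespace
clean_names <- gsub('([[:punct:]])|\\s+', '', raw_names)

print(raw_names)
print(clean_names)
```

Note that `[[:punct:]]` also matches underscores, so `sample_type` becomes `sampletype` in the cleaned names.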


----

### `HMPv13`
**Description:** Human Microbiome Project (HMP) V1-V3 amplicon
**Number of samples:** 3285
17,626 changes: 17,626 additions & 0 deletions data-raw/TCGA/Kraken-TCGA-Raw-Data-17625-Samples.csv

Large diffs are not rendered by default.
