-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05_SV_summaries.Rmd
139 lines (119 loc) · 6.01 KB
/
05_SV_summaries.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
title: 'Short-read SV discovery in *Coregonus clupeaformis*'
author: "Jana R. Wold"
date: "May 2023"
output:
html_document:
self_contained: false
---
Congradulations! You've used short-read data to call structural variants (SVs) aligned to a linear reference genome. We had an initial look at how well our three data sets agree (or not) with one another. Now we're ready to learn a little about the unique characteristics of the SVs themselves.
Below we assess variation in the number of SVs discovered by type and size for Delly, Manta and Smoove. We also take a peek at how our filtering thresholds impacted our SV diversity. To start, lets load our relevant packages and set up our environment.
```{r setup, include=FALSE}
library(ggplot2)
library(magrittr)
library(dplyr)
knitr::opts_chunk$set(dev = c("svg", "png"),
dpi = 300,
echo = FALSE,
cache = TRUE)
```
Data loaded as per:
```{r Load Data, include=FALSE}
# Data from each discovery/genotyping tool
delly <-read.table("SV_summaries/delly_summary.tsv", sep = "\t", header = TRUE)
manta <-read.table("SV_summaries/manta_summary.tsv", sep = "\t", header = TRUE)
smoove <-read.table("SV_summaries/smoove_summary.tsv", sep = "\t", header = TRUE)
svs <- rbind(delly, manta, smoove)
# Sorting into filtered and unfiltered data sets
unfiltered <- svs %>%
filter(if_any(Data_set, ~grepl('unfiltered' , .)))
filtered <- svs %>%
filter(if_any(Data_set, ~grepl('_filtered' , .)))
# Data for testing differences in SV size and type among SV discovery tools
filtered_del <- filtered [ filtered$SVTYPE == "DEL" ,]
filtered_dup <- filtered [ filtered$SVTYPE == "DUP" ,]
filtered_ins <- filtered [ filtered$SVTYPE == "INS" ,]
filtered_inv <- filtered [ filtered$SVTYPE == "INV" ,]
counts <- filtered %>%
group_by(Data_set) %>%
count(vars = SVTYPE)
counts <- spread(counts, vars, n)
```
Now that we have our data loaded, let's explore the number of structural variants per data set before and after filtering.
```{r type counts, echo=FALSE, fig.align='center'}
png("SV_summaries/unfiltered_SV_counts.png")
ggplot(unfiltered, aes(x=SVTYPE, fill=Data_set)) +
geom_histogram(stat = "count", position = position_dodge()) +
labs(x = "Structural Variant Type", y = "Count", title = "Count of all Structural Variants") +
scale_fill_manual(values = c("Delly_unfiltered" = "#091A26", "Manta_unfiltered" = "#2F638F", "Smoove_unfiltered" = "#798C8B")) +
theme_light()
dev.off()
png("SV_summaries/filtered_SV_counts.png")
ggplot(filtered, aes(x=SVTYPE, fill=Data_set)) +
geom_histogram(stat = "count", position = position_dodge()) +
labs(x = "Structural Variant Type", y = "Count", title = "Count of Filtered Structural Variants") +
scale_fill_manual(values = c("Delly_filtered" = "#091A26", "Manta_filtered" = "#2F638F", "Smoove_filtered" = "#798C8B")) +
theme_light()
dev.off()
```
Here, we compared the size distribution of SVs called by all three tools. Here we use the natural log of SV size was to graph the size distribution and estimate a mean (geometric mean).
```{r size_distribution, echo=FALSE, fig.align = 'center'}
filtered$logsize <- log(filtered$SVLEN)
head(filtered)
# Not transformed
png("SV_summaries/Size_distributions_density.png")
ggplot(filtered, aes(x=SVLEN, fill=Data_set)) +
geom_density(alpha = 0.7) +
xlim(0,10000) +
labs(x = "Structural Variant Size", y = "Proportion", title = "Size Distribution of Filtered Structural Variants") +
theme_light() +
facet_wrap(~SVTYPE, scales = "free_y") +
scale_fill_manual(values = c("Delly_filtered" = "#091A26", "Manta_filtered" = "#2F638F", "Smoove_filtered" = "#798C8B"))
dev.off()
png("SV_summaries/Size_distributions_histogram.png")
ggplot(filtered, aes(x=SVLEN, fill=Data_set)) +
geom_histogram(alpha = 0.7, position = "dodge") +
xlim(0,10000) +
labs(x = "Structural Variant Size", y = "Proportion", title = "Size Distribution of Filtered Structural Variants") +
theme_light() +
facet_wrap(~SVTYPE, scales = "free_y") +
scale_fill_manual(values = c("Delly_filtered" = "#091A26", "Manta_filtered" = "#2F638F", "Smoove_filtered" = "#798C8B"))
dev.off()
# Size transformed
png("SV_summaries/Log_size_distribution.png")
ggplot(filtered, aes(x=logsize, fill=Data_set)) +
geom_density(alpha = 0.7) +
labs(x = "Structural Variant Size", y = "Proportion", title = "Size Distribution of Filtered Structural Variants") +
theme_light() +
facet_wrap(~SVTYPE, scales = "free_y") +
scale_fill_manual(labels = c("Delly_filtered" = "Delly", "Manta_filtered" = "Manta", "Smoove_filtered" = "Smoove"), values = c("#091A26", "#2F638F","#798C8B"))
dev.off()
```
So now we have a sense of the number and size of the SVs, let's see what proportion of chromosome 1 is impacted by SVs of each type.
```{r proportion, echo=FALSE, fig.align='center'}
filtered$chrom_size <- 121031747
filtered$proportion <- filtered$SVLEN / filtered$chrom_size
# The proportion of chromosome 1 impacted by each filtered data set
png("SV_summaries/chr1_proportion.png")
filtered %>%
ggplot(aes(x = CHROM, y = proportion, fill = SVTYPE)) +
geom_bar(stat = "identity", position = "stack") +
labs(x = "Chromosome", y = "Proportion of impacted base pairs", title = "Proportion of chromosome impacted by structural variants") +
theme_light() +
facet_wrap(~Data_set, scales = "free_y") +
scale_fill_manual(values = c("DEL" = "#091A26", "DUP" = "#2F638F", "INS" = "#798C8B", "INV" = "#4F651D"))
dev.off()
```
And finally, let's try to glean the position of SVs fall along chromosome 1.
```{r SV_location, echo=FALSE, fig.align='center'}
png("SV_summaries/chr1_locations.png")
filtered %>%
ggplot(aes(x = POS, fill = SVTYPE)) +
geom_histogram(stat = "count") +
labs(x = "Chromosome Position", y = "Number of SV", title = "Locations of SVs on chromosome 1") +
theme_light() +
facet_wrap(~Data_set, scales = "free_y") +
xlim(166000, 171000) +
scale_fill_manual(values = c("DEL" = "#091A26", "DUP" = "#2F638F", "INS" = "#798C8B", "INV" = "#4F651D"))
dev.off()
```