-
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathredlat_qc.rmd
1619 lines (1327 loc) · 56.2 KB
/
redlat_qc.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output:
html_document:
keep_md: yes
editor_options:
markdown:
wrap: 72
---
# Quality control pipeline for the ReD-Lat genomic data
#### Developed by Juliana Acosta-Uribe for the ReD-Lat Consortium 2023-2024
This pipeline is designed to be run as an [R markdown](https://rmarkdown.rstudio.com/lesson-1.html) file in R Studio. This way you can run it in a step-by-step mode. However, you could also run it directly from the r command line if you already have the `sample_data.txt` and the `problematic_relatedness.txt` files in your workspace.
```
library(rmarkdown)
render("path/to/your/file.Rmd")
```
For this pipeline you will need the following:
### Data:
A plink formatted .*bed*, .*fam*, .*bim* set, or a bgzipped *.vcf.gz*
file.
**Always Take a look at your files before beginning:**
If you start with a plink dataset: - Do the Individual IDs (column 2
in *.fam*) match what you were expecting? - Have the families been
given a Family ID (column 1 in *.fam*)? - Do the Individuals have
their sex assigned (column 5 in *.fam*? - Do the variants in the
*.bim* have an identifier (column 2 in *.bim*)? Some analyses will
require this information, and we may have to incorporate it to the
*.fam*/*.bim* if its not already there.\
Make sure your files are properly aligned and the alleles are being
called from the correct strand. INDELs should be [left aligned and
normalized](https://samtools.github.io/bcftools/bcftools.html#norm).
**Sample Data File**
If you are starting with a *.vcf*, or your *.fam* does not have the sex
/family of the samples already specified please provide an additional
file with a header as follows:
**IID** ID of the sample as it is in the plink IID or in the VCF\
**SEX** should be specified (1=male, 2=female, 0=no data)\
**FID** Family ID of the sample\
**PID** Paternal ID of the sample. If this individual is also in the
dataset, the PID should be identical to the father's IID\
**MID** Maternal ID of the sample. If this individual is also in the
dataset, the MID should be identical to the mother's IID\
**PHENO** Optional, if you are interested in any downstream analyses
involving the phenotype. (1=control, 2=case, -9=missing)
⚠️ The **FID** given to the parents needs to match the same **FID**
given to that individual for their genome.
A trio would look like this (order of the columns does not matter)
| FID | IID | PID | MID | SEX | PHENO |
|------|-----|-----|-----|-----|-------|
| FID1 | SON | DAD | MOM | 1 | 2 |
| FID1 | MOM | 0 | 0 | 2 | 1 |
| FID1 | DAD | 0 | 0 | 1 | 1 |
R is expecting a tab delimited file. If your file is delimited by spaces
you can fix it with the following bash command
`sed -i 's/ /\t/g' sample_data.txt`
### Software:
\-[R](https://www.r-project.org/)\
-[RStudio](https://posit.co/download/rstudio-desktop/)\
-[plink](https://www.cog-genomics.org/plink2/)\
-[king](https://www.kingrelatedness.com/)
If you are working with Exome or Genome data, you will also need:
\-[vcftools](https://vcftools.github.io/man_latest.html)\
-[bcftools](https://samtools.github.io/bcftools/bcftools.html)
### Contents Table:
[1. Set up your environment](#section-1)\
[2. Customize the quality control process](#section-2)\
[3. Start Quality Control process](#section-3)\
[4. Genotype Quality control](#section-4)\
[5. Filter for missingness](#section-5)\
[6. Calculate heterozygosity](#section-6)\
[7. Check sex](#section-7)\
[8. Identify duplicates](#section-8)\
[9. Calculate relatedness](#section-9)\
[10. Remove variants & Individuals and with missingness
≥5%](#section-10)
## 1. Set up your environment {#section-1}
In this first step you will install the packages that R needs to run the
quality control process and you will define the paths to your working
directory and software.
```{r environment-setup}
# Ensure "knitr" is installed and loaded
if (!requireNamespace("knitr", quietly = TRUE)) {
install.packages("knitr")
}
library(knitr)
# Set your working directory:
knitr::opts_chunk$set(root.dir = "~/path/to/directory",
dev = "png",
dpi = 300,
echo = FALSE,
cache = TRUE)
setwd("~/path/to/directory")
# Set up path to software:
Sys.setenv(plink='/path/to/plink')
Sys.setenv(king='/path/to/ing')
Sys.setenv(vcftools='/path/to/vcftools')
Sys.setenv(bcftools='/path/to/bcftools')
# Give the name of your starting file without the .bed or .vcf.gz extension
prefix="data"
Sys.setenv(prefix=prefix)
# Define the type of data you will be working with.
# Answers can be "GENOME", "EXOME", "ARRAY"
data_type="ARRAY"
Sys.setenv(data_type=data_type)
# Specify your reference genome
# Answers can be 'hg19', 'hg38'
genome_alignment="hg38"
Sys.setenv(genome_alignment=genome_alignment)
# Import the text file that contains the information for your samples.
# Look at Sample Data File specifications in the introduction. File needs to be tab delimited, with a header
sample_data_file="exome_sample_data.txt"
if (file.exists(sample_data_file)) {
sample_data = read.delim(sample_data_file, header = TRUE, sep = '\t')
print(paste(".fam file will be annotated with:", sample_data_file))
} else {
print("No sample data was provided, assuming .fam file is already annotated")
}
Sys.setenv(sample_data_file=sample_data_file)
```
## 2. Customize the quality control process {#section-2}
Choose what you want to do in the analysis:
```{r customize-pipeline}
# Do you want to do filter variant calls for genotype depth (DP) and Genotype Quality (GQ)?
## This can only be done if you start with a VCF file and requires that the "DP and 'GQ' FORMAT tag is included for each genotype.
## TRUE Includes only sites with DP/GQ (over all included individuals) greater than or equal to the given value
genotype_filter=FALSE
# Define the values you want to use as threshold
DP=10
GQ=20
# Do you want to keep only the variants with PASS in the VQSR filtering?
## TRUE Removes all sites with a FILTER flag other than PASS.
vqsr_PASS=FALSE
# Do you want to check for known and cryptic relatedness among samples?
## For this analyses you will need to provide information of known families as described before.
## TRUE uses KING to perform relatedness check.
check_relatedness=FALSE
# Do you want to create directories and to organize your data as you go?
## TRUE generates directories and organizes your files as you run the pipeline
tidy_up=FALSE
# Make these values into system environment variables
Sys.setenv(genotype_filter=genotype_filter)
Sys.setenv(DP=DP)
Sys.setenv(GQ=GQ)
Sys.setenv(vqsr_PASS=vqsr_PASS)
Sys.setenv(check_relatedness=check_relatedness)
Sys.setenv(tidy_up=tidy_up)
```
## 3. Start Quality Control process {#section-3}
Before starting, take some time to look at your data and calculate some
basic quality metrics. If you are starting with a vcf file from Exome or
Genome data, after running the following chunk you should have at least
the 6 following files:
**I. General statistics**
[.vchk](https://samtools.github.io/bcftools/bcftools.html#stats)\
**II. Sample based metrics**
[metrics](https://vcftools.sourceforge.net/man_latest.html#OUTPUT%20OPTIONS)\
1. missingness per sample (.imiss)\
2. depth per sample (.idepth)\
**III. Site based metrics**\
1. Missingness per site (.lmiss)\
2. Mean depth per site (.ldepth.mean)\
3. VQSR quality
[vqsr](https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR-)
```{bash get-preQC-stats-2}
if [[ "$data_type" == "EXOME" || "$data_type" == "GENOME" ]]; then
# Data description
bcftools stats ${prefix}.vcf.gz > ${prefix}.preqc.vchk.txt
# Individual missingness
vcftools --gzvcf ${prefix}.vcf.gz --missing-indv --out ${prefix}.preqc
# Generates a file reporting the missingness on a per-individual basis.
# The output file has the suffix ".imiss".
# Individuals whose missingness is >10% should be added to 'flagged_samples' file
# Individual depth
vcftools --gzvcf ${prefix}.vcf.gz --depth --out ${prefix}.preqc
# Generates a file containing the mean depth per individual.
# This file has the suffix ".idepth".
# Individuals whose mean depth is <20 should be added to 'flagged_samples' file
# Site missingness
vcftools --gzvcf ${prefix}.vcf.gz --missing-site --out ${prefix}.preqc
# Generates a file reporting the missingness on a per-site basis.
# The file has the suffix ".lmiss".
# Remove the 'chr' prefix if chromosomes are labeled 'chr1'
sed -i 's/chr//g' ${prefix}.preqc.lmiss
# Site depth
vcftools --gzvcf ${prefix}.vcf.gz --site-mean-depth --out ${prefix}.preqc
# Generates a file containing the mean depth per site across all individuals.
# This output file has the suffix ".ldepth.mean"
# Remove the 'chr' prefix if chromosomes are labeled 'chr1'
sed -i 's/chr//g' ${prefix}.preqc.ldepth.mean
# VQRS filtering
vcftools --gzvcf ${prefix}.vcf.gz --FILTER-summary --out ${prefix}.preqc
# Gives the information on the number of variants that passed VQSR filtering, or are in specific tranches
# The output file has the suffix ".FILTER.summary"
fi
```
If your starting dataset is a plink file from a snp array, you will
first have to edit the .fam with the samples sex and the only statistics
you get are missingness per sample and per site.\
If you don't add sex to the fam before calculating missingness, plink
wont calculate missingness values for Y chromosome.
```{r edit-fam-1}
if(data_type == "ARRAY") {
fam = read.delim(paste0(prefix, ".fam"),
sep = ' ',
header = FALSE)
# If your .fam doesn't have the sex already specified, it will add the SEX from sample_data. This a heuristic approach, but it can detect if samples are missing sex data assuming you have males and females in your dataset.
if(sum(fam$V5) < length(fam$V5) * 0.25) {
# Check if 'sample_data' exists in the environment
if (exists("sample_data")) {
fam$V5 = sample_data$SEX[match(fam$V2, sample_data$IID, nomatch = 0)]
write.table(fam,
paste0(prefix, ".fam"),
col.names = FALSE,
row.names = FALSE,
quote = FALSE)
} else {
print("No sample data was provided, cannot annotate .fam")
}
} else {
print(".fam file has sex values")
}
}
```
Then, use plink for Site and individual missingness in SNP Array.\
It will produce .lmiss and .imiss files
```{bash get-preQC-stats-1}
if [[ "$data_type" == "ARRAY" ]]; then
$plink --bfile ${prefix} --missing --out ${prefix}.preqc
fi
```
You can then proceed to get some plots and statistics from your initial
files.
**Sample based metrics**
```{r plot-preQC-stats-ind}
# Ensure "psych" is installed and loaded
if (!requireNamespace("psych", quietly = TRUE)) {
install.packages("psych")
}
library(psych)
# Ensure "dplyr" is installed and loaded
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
library(dplyr)
# Individual Missingness
imiss = read.delim(file = paste0(prefix,".preqc.imiss"),
header = TRUE,
sep = '')
# Generate basic summary statistics
imiss_F = describe(imiss$F_MISS)
rownames(imiss_F) = c("sample_missingness")
print(imiss_F)
# Plot Missingness rate per sample
imiss_hist = hist(imiss$F_MISS,
xlab="Missingness rate",
ylab="Samples",
main="Missingness rate per sample preQC",
col="cyan3",
breaks=20)
# Depth per Individual (Only for Exome or Genome data)
if(data_type=="EXOME" || data_type=="GENOME"){
# Read depth calculations
idepth = read.delim((paste0(prefix,".preqc.idepth")), header = T, sep = "")
# Generate basic summary statistics
idepth_mean = describe(idepth$MEAN_DEPTH)
rownames(idepth_mean) = c("sample_depth")
print(idepth_mean)
# Individuals whose mean depth is <20 should be identified
low_mean_depth = filter(idepth, MEAN_DEPTH<20)
if (nrow(low_mean_depth) > 0) {
write.table(low_mean_depth,
"samples_low_mean_depth_raw.txt",
col.names = FALSE,
row.names = FALSE,
quote = FALSE,
sep = 't')
} else {
print("All samples have a mean depth higher than 20. 'samples_low_mean_depth_raw.txt' wont be generated")
}
# Plot depth per sample
idepth_hist = hist(idepth$MEAN_DEPTH,
xlab="Mean Depth ",
ylab="Samples",
main="Mean Depth per sample preQC",
col="cyan3",
breaks=50)
}
```
Summarize your sample statistics
```{r generate-sample-statistics}
# Take the information of individuals from the .idepth file.
# We are changing the names of the columns to specify these values come from raw data
if (data_type == "ARRAY"){
sample_statistics = subset(imiss, select = c(IID, F_MISS, N_GENO))
sample_statistics = rename(sample_statistics, Initial_missingness = F_MISS, Initial_n_sites = N_GENO)
} else if (exists("idepth")){
sample_statistics = rename(idepth, IID = INDV, Initial_n_sites = N_SITES, Initial_mean_depth = MEAN_DEPTH)
# Using the "match" function, we will create a new column in 'sample_statistics' with the missingness per sample.
# imiss and idepth should have the same samples in the same order, but using the "match" function will be useful when we start dropping samples
sample_statistics$Initial_missingness = imiss$F_MISS[match(sample_statistics$IID, imiss$INDV)]
} else {
sample_statistics = subset(imiss, select = c(INDV, F_MISS, N_DATA))
sample_statistics = rename(imiss, IID = INDV, Initial_missingness = F_MISS, Initial_n_sites = N_DATA)
}
# Save as a file
write.table(sample_statistics,
"sample_statistics.txt",
col.names = TRUE,
row.names = FALSE,
quote = FALSE)
```
**Site based metrics**
```{r plot-preQC-stats-site}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
library(ggplot2)
# Site Missingness
lmiss = read.delim(file = paste0(prefix,".preqc.lmiss"),
header = TRUE,
sep = '')
# Generate basic summary statistics
lmiss_F = describe(lmiss$F_MISS)
rownames(lmiss_F) = c("site_missingness")
print(lmiss_F)
# Plot Missingness rate site per chromosome
sorted_chr = unique(lmiss$CHR) #organize the chromosomes in ascending order
lmiss$CHR = factor(lmiss$CHR , levels=sorted_chr)
lmiss_box = boxplot(F_MISS ~ CHR,
data=lmiss,
xlab="Chromosome",
ylab="Missingness rate",
cex.axis = 0.5 ,
main="Missingness rate per site preQC",
col="cyan3")
# Depth per site - chromosome (Only for Exome or Genome data)
if(data_type=="EXOME" || data_type=="GENOME"){
# Read depth calculations
ldepth = read.delim((paste0(prefix,".preqc.ldepth.mean")), header = T, sep = "")
# Generate basic summary statistics
ldepth_mean = describe(ldepth$MEAN_DEPTH)
rownames(ldepth_mean) = c("site_depth")
print(ldepth_mean)
### Plot depth per site
sorted_chr = unique(ldepth$CHR) #organize the chromosomes in ascending order
ldepth$CHR = factor(ldepth$CHR , levels=sorted_chr)
ldepth_box = boxplot(MEAN_DEPTH ~ CHR,
data=ldepth,
xlab="Chromosome",
ylab="Mean Depth per site",
cex.axis = 0.5 ,
main="Mean Depth per site preQC",
col="cyan3")
}
# Variant Quality Score Recalibration ranks (Only for Exome or Genome data)
if(data_type=="EXOME" || data_type=="GENOME"){
vqsr = read.delim((paste0(prefix,".preqc.FILTER.summary")), header = T, sep = "")
print(vqsr)
#Plot it
vqsr %>%
filter(!is.na(N_VARIANTS)) %>%
arrange(N_VARIANTS) %>%
mutate(FILTER=factor(FILTER, FILTER)) %>%
ggplot( aes(x=FILTER, y=N_VARIANTS, label = N_VARIANTS) ) +
geom_segment( aes(x=FILTER ,xend=FILTER, y=0, yend=N_VARIANTS), color="grey") +
geom_point(size=3, color="#69b3a2") +
geom_text(vjust=-1, size = 3) +
coord_flip() +
theme_minimal() +
theme(
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
legend.position="none") +
scale_y_continuous(name = "Number of variants") +
labs(title = "Variant Quality Score Recalibration ranks preQC")
#ggsave("VQSR-preQC.png")
}
```
Summarize statistical measures
```{r generate-statistical-measures}
if(data_type=="EXOME" || data_type=="GENOME"){
dataset_statistics = bind_rows(imiss_F,
idepth_mean,
lmiss_F,
ldepth_mean)
} else {
dataset_statistics = bind_rows(imiss_F,
lmiss_F)
}
write.table(dataset_statistics,
"dataset_statistics.txt",
col.names = TRUE,
row.names = TRUE,
quote = FALSE)
print(dataset_statistics)
```
Organize all the generated files
```{bash organize-files-1}
if [ "$tidy_up" == "TRUE" ]; then
mkdir -p Pre-QC_stats
mv ${prefix}.preqc.* ./Pre-QC_stats
fi
```
## 4. Genotype Quality control {#section-4}
This step is only possible to do when data is in a VCF file that has
been properly annotated with genotype depth (DP) and genotype quality
(GQ). Notice that no sites will be eliminated, as this only affects
individual calls. For example if an individuals genotype for a given
variant is below desired thresholds, it's turned into a missing value in
that individual.
```{bash genotype-qc}
if [[ "$data_type" == "EXOME" || "$data_type" == "GENOME" ]]; then
if [[ "$genotype_filter" == "TRUE" ]]; then
if [[ "$vqsr_PASS" == "TRUE" ]]; then
$vcftools --gzvcf ${prefix}.vcf.gz --minDP $DP --minGQ $GQ --remove-filtered-all --recode --recode-INFO-all --out ${prefix}.DP$DP.GQ$GQ
else
$vcftools --gzvcf ${prefix}.vcf.gz --minDP $DP --minGQ $GQ --recode --recode-INFO-all --out ${prefix}.DP$DP.GQ$GQ
fi
mv ${prefix}.DP$DP.GQ$GQ.recode.vcf ${prefix}.DP$DP.GQ$GQ.vcf
bgzip ${prefix}.DP$DP.GQ$GQ.vcf
elif [[ "$genotype_filter" == "FALSE" ]]; then
echo "genotype_filter was set to FALSE, this step is skipped"
fi
elif [[ "$data_type" == "ARRAY" ]]; then
echo "Plink data cannot be checked for depth or genotype quality"
fi
```
**Note**: VCFtools output can be compressed directly into a .gz file by
adding `–stdout | gzip -c > newname.gz`, but it won't generate a .log
file, which are useful for debugging.
Organize files
```{bash organize-files-2}
if [ "$tidy_up" == "TRUE" ]; then
mkdir -p File_evolution
if [[ "$data_type" == "EXOME" || "$data_type" == "GENOME" ]] && [[ "$genotype_filter" == "TRUE" ]]; then
mv ${prefix}.vcf.gz ./File_evolution
fi
fi
```
Update prefix variable
```{r update-prefix-1}
if((data_type=="EXOME" || data_type=="GENOME") && (genotype_filter=="TRUE")){
prefix=(paste0(prefix,".DP",DP,".GQ",GQ))
Sys.setenv(prefix=prefix)
}
```
## 5. Filter for missingness {#section-5}
Identify samples and variants missing more than 10% of data and remove
them. The 10% or 0.1 threshold can be modified as desired.
```{bash filter-missingness}
if [[ "$data_type" == "EXOME" || "$data_type" == "GENOME" ]]; then
# Identify individuals with high missingness
$vcftools --gzvcf ${prefix}.vcf.gz --missing-indv --out ${prefix}.mind
awk '{if ($5 > 0.1) {print $1} }' ${prefix}.mind.imiss > ${prefix}.mind.irem
awk '{if ($5 <= 0.1) {print $1} }' ${prefix}.mind.imiss > ${prefix}.mind.ikeep
# Remove variants and sites missing more than 10% of data
$vcftools --gzvcf ${prefix}.vcf.gz --keep ${prefix}.mind.ikeep --max-missing 0.9 --recode --recode-INFO-all --out ${prefix}.miss
mv ${prefix}.miss.recode.vcf ${prefix}.miss.vcf
bgzip ${prefix}.miss.vcf
else
# Remove samples missing more than 10% of data
$plink --bfile ${prefix} --mind 0.1 --make-bed --out ${prefix}.miss
fi
```
> Individuals missing more than 10% of the data get removed in this
> step. Removed individuals will be written to *prefix.mind.irem*
Update your sample_statistics file
```{r update-sample-metrics-1}
if(data_type=="EXOME" || data_type=="GENOME"){
imiss_qc = read.delim(paste0(prefix,".mind.imiss"))
sample_statistics$GQDP_missingness = imiss_qc$F_MISS[match(sample_statistics$IID, imiss_qc$INDV)]
}
```
Tidy up
```{bash organize-files-3}
if [ "$tidy_up" == "TRUE" ]; then
mkdir -p genotype_qc Pre-QC_stats File_evolution
if [[ "$data_type" == "GENOME" || "$data_type" == "EXOME" ]]; then
if [[ "$genotype_filter" == "TRUE" ]]; then
mv ${prefix}.mind.* ./genotype_qc
else
mv ${prefix}.mind.* ./Pre-QC_stats
fi
mv ${prefix}.vcf.gz ./File_evolution
else
mv ${prefix}.bed ./File_evolution
mv ${prefix}.bim ./File_evolution
mv ${prefix}.fam ./File_evolution
mv ${prefix}.log ./File_evolution
# Check if ${prefix}.miss.irem exists before moving
if [ -e "${prefix}.miss.irem" ]; then
mv ${prefix}.miss.irem ./File_evolution
fi
fi
fi
```
Update prefix variable
```{r update-prefix-2}
prefix=(paste0(prefix,".miss"))
Sys.setenv(prefix=prefix)
```
## 6. Calculate heterozygosity {#section-6}
This is a recommended step when you are analyzing cohorts with either
SNP array or Whole genome sequencing.
We prefer plink to calculate sample heterozygosity. Plink uses a
*sliding window* approach to identify variants in linkage
disequilibrium. There are many options to modify the behavior or this
approach in [plink's
documentation](https://www.cog-genomics.org/plink/1.9/ld#indep). The LD
pruning requires that the *.bim* file has variant IDs in the second
column. If no variants have been assigned, you could do a preliminary
step using
[--set-missing-var-ids](https://www.cog-genomics.org/plink/1.9/data#set_missing_var_ids).
```{bash import-vcf, echo=FALSE}
if [ "$data_type" == "GENOME" ] || [ "$data_type" == "EXOME" ]; then
# Import VCF and assign IDs to SNV
$plink --vcf ${prefix}.vcf.gz --keep-allele-order --double-id --vcf-half-call m --set-missing-var-ids @:#\$1,\$2 --new-id-max-allele-len 1 --make-bed --out ${prefix}
# Assign IDs to indels
mv ${prefix}.bim ${prefix}.bim-original
awk 'BEGIN {count=1}; {if ($2 ~ /\./) {sub(/\./,"INDEL"(count++));print} else {print} }' ${prefix}.bim-original > ${prefix}.bim
fi
```
```{bash calculate-heterozygosity, echo=FALSE}
# I have this as an if statement because im still deciding if this is a good measure for exomes.
if [ "$data_type" == "GENOME" ] || [ "$data_type" == "ARRAY" ] || [ "$data_type" == "EXOME" ]; then
# Retain variants with MAF > 10% and individuals with low missing %
$plink --bfile ${prefix} --maf 0.1 --make-bed --out ${prefix}.maf
# Calculate LD
#--indep-pairwise <window size>['kb'] <step size (variant ct)> <r^2 threshold>
$plink --bfile ${prefix}.maf --indep-pairwise 50 10 0.2
# Retain variants not in LD (independent markers)
$plink --bfile ${prefix}.maf --extract plink.prune.in --make-bed --out ${prefix}.maf.ld
# Check heterozygosity
$plink --bfile ${prefix}.maf.ld --het --out ${prefix}.het
# Calculate missingness rates
$plink --bfile ${prefix} --missing --out ${prefix}
rm ${prefix}.maf.*
rm plink.*
fi
```
**Identify heterozygosity outliers**
```{r plot-heterozygosity}
# Load data into R
het = read.delim(file = paste0(prefix,".het.het"),
header = TRUE,
sep = '')
# Generate basic summary statistics
heterozygosity_sample = describe(het$F)
rownames(heterozygosity_sample) = c("sample_heterozygosity")
print(heterozygosity_sample)
# Calculate limits for excluding samples [3 standard deviations]
## Low threshold
heterozygosity_low_limit = mean(het$F)-(3*(sd(het$F)))
print(paste("Mean heterozygosity F value -3 Standard Deviations =", heterozygosity_low_limit))
## High threshold
heterozygosity_high_limit = mean(het$F)+(3*(sd(het$F)))
print(paste("Mean heterozygosity F value +3 Standard Deviations =", heterozygosity_high_limit))
# Individuals whose heterozygosity deviated more than 3 SD from the mean should be identified
het_outlier_low = filter(het, F<heterozygosity_low_limit)
het_outlier_high = filter(het, F>heterozygosity_high_limit)
het_outlier_both = bind_rows(het_outlier_low,
het_outlier_high)
het_outlier_id = select(het_outlier_both, FID, IID)
# Save your list. Useful for debugging
write.table(het_outlier_id,
"heterozygosity_outliers.txt",
col.names = FALSE,
row.names = FALSE,
quote = FALSE)
print(paste(nrow(het_outlier_id), "samples are heterozygosity outliers. IDs are written in heterozygosity_outliers.txt"))
# Plot heterozygosity per sample
heterozygosity_hist = hist(het$F,
freq=TRUE,
xlab="Heterozygosity F coefficient",
ylab="Samples",
main="Heterozygosity rate per sample",
col="cyan3",
breaks=100)
abline(v = (heterozygosity_low_limit), col="red")
abline(v = (heterozygosity_high_limit), col="red")
abline(v = (mean(het$F)), col="blue")
legend("topright",
c("+/-3 SD","mean"),
col=c("red","blue"),
pch=16)
```
> Heterozygosity outliers will be removed along with those that fail
> sex-check on next step.
It is useful to visualize heterozygosity F statistic vs. missingness per
sample
```{r heterozygosity-vs-missing}
# Load your data:
miss_imiss = read.delim(file = paste0(prefix,".imiss"),
header = TRUE,
sep = '')
plot(miss_imiss$F_MISS, het$F,
xlab="Missingness",
ylab="Heterozygosity",
main="Heterozygosity rate per sample - Data before QC")
abline(h = (heterozygosity_low_limit), col="red")
abline(h = (heterozygosity_high_limit), col="red")
abline(h = (mean(het$F)), col="blue")
legend("bottomright",
legend = c("+/-3 SD", "mean"),
col = c("red", "blue"),
pch = 1)
```
Update your Sample Metrics dataframe
```{r update-sample-metrics-2}
sample_statistics$heterozygosity = het$F[match(sample_statistics$IID, het$IID)]
write.table(sample_statistics,
"sample_statistics.txt",
col.names = TRUE,
row.names = FALSE,
quote = FALSE)
# Update your dataset_statistics dataframe
dataset_statistics = bind_rows(dataset_statistics,
heterozygosity_sample)
# Save as a file
write.table(dataset_statistics,
"dataset_statistics.txt",
col.names = TRUE,
row.names = TRUE,
quote = FALSE)
print(dataset_statistics)
```
## 7. Check sex {#section-7}
**Preprocess sex chromosomes**
A. Verify that *prefix.fam* has a 'sex' value on the 5 column.
If the values are 0 or -9, you can assign 'sex value' in plink with the
--update-sex command. You can also use R to edit the .fam by running the
following chunk
```{r edit-fam-2}
# Load your dataframes
fam = read.delim(paste0(prefix,".fam"),
sep=' ',
header=FALSE)
# If your .fam doesn't have the sex already specified, it will add the SEX from sample_data
if(sum(fam$V5) < length(fam$V5) * 0.25) {
# Check if 'sample_data' exists in the environment
if (exists("sample_data")) {
fam$V5 = sample_data$SEX[match(fam$V2, sample_data$IID, nomatch = 0)]
write.table(fam,
paste0(prefix, ".fam"),
col.names = FALSE,
row.names = FALSE,
quote = FALSE)
} else {
print("No sample data was provided, cannot annotate .fam")
}
} else {
print("fam file has sex data")
}
```
B. Split Pseudo-Autosomic region (PAR) of X
It is necessary to remove the PAR region of X for plink's sex check
pipeline. *awk* will check if there there are variants in the PAR region
of your X chomosome (assumes hg38).
If there are nor variants in the PAR region, then it will append
'.split-x' to ensure compatibility further on. Make sure you are using
the correct genome alignment for this. e.g. ReDLat files are aligned to
hg38. Set up `genome_alignment` variable accordingly You can find more
information on this step in [plink
documentation](https://www.cog-genomics.org/plink/1.9/data#split_x)
```{bash split-PAR}
if [[ $genome_alignment == 'hg19' ]]; then
if [[ $(awk '$1 == "23" && ($4 <= 2699520 || $4 >= 154931044) { count++ } END { print count }' "${prefix}.bim") -gt 0 ]]; then
$plink --bfile ${prefix} --split-x $genome_alignment --make-bed --out ${prefix}.split-x
else
mv ${prefix}.bed ${prefix}.split-x.bed
mv ${prefix}.bim ${prefix}.split-x.bim
mv ${prefix}.fam ${prefix}.split-x.fam
echo "plink files have been renamed to ${prefix}.split-x as there is no PAR in chromosome 23"
fi
elif [[ $genome_alignment == 'hg38' ]]; then
if [[ $(awk '$1 == "23" && ($4 <= 2781479 || $4 >= 155701383) { count++ } END { print count }' "${prefix}.bim") -gt 0 ]]; then
$plink --bfile ${prefix} --split-x $genome_alignment --make-bed --out ${prefix}.split-x
else
mv ${prefix}.bed ${prefix}.split-x.bed
mv ${prefix}.bim ${prefix}.split-x.bim
mv ${prefix}.fam ${prefix}.split-x.fam
echo "plink files have been renamed to ${prefix}.split-x as there is no PAR in chromosome 23"
fi
else
echo "Genome alignment must be either 'hg19' or 'hg38'."
fi
```
**Check sex in X chromosome**
```{bash check-sex-X, echo=FALSE}
if [ $(awk -F' ' '{sum+=$5;} END {print sum;}' ${prefix}.split-x.fam) != '0' ]; then
# Retain X chromosome and calculate r2 between the variants in a given window
# This step requires that the .bim file has variant IDs in the second column
# it will generate two files: plink.prune.in and plink.prune.out
$plink --bfile ${prefix}.split-x --chr 23 --indep-pairwise 50 5 0.2
# Extract unlinked variants in x
$plink --bfile ${prefix}.split-x --extract plink.prune.in --make-bed --out ${prefix}.split-x.LD
# Calculate heterozygosity F statistic of X chromosome
$plink --bfile ${prefix}.split-x.LD --check-sex 0.25 0.8 --out ${prefix}.split-x.chrX
rm plink.*
rm ${prefix}.split-x.LD.*
else
echo "Please edit the .fam with the sex of the samples"
fi
```
Plot sex in X chromosome
```{r plot-sex-X}
# Ensure "ggbeeswarm" is installed and loaded
if (!requireNamespace("ggbeeswarm", quietly = TRUE)) {
install.packages("ggbeeswarm")
}
library(ggbeeswarm)
sex_X = read.delim(file = paste0(prefix,".split-x.chrX.sexcheck"),
header = TRUE,
sep = '')
# Plot these coefficients comparing males vs. females
# Roughly you expect female to have an F coefficient < 0.2-0.3 and males have an F coefficient > 0.7-0.8
# If there are individuals for which the fam has sex = 0 they will be plotted in an additional category called "no data"
if (any(fam$V5 %in% "0")) {
ggplot(sex_X,
aes(x = factor(PEDSEX,
labels = c("Not disclosed", "Male", "Female")),
y = F,
color = PEDSEX)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
labs(title = "Chromosomal sex assignement in samples based in X chromosome",
x = "Disclosed sex",
y = "F coefficient X chromosome") +
theme_minimal() +
theme(legend.position = "none")
} else {
ggplot(sex_X,
aes(x = factor(PEDSEX,
labels = c("Male", "Female")),
y = F,
color = PEDSEX)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
labs(title = "Chromosomal sex assignement in samples based in X chromosome",
x = "Disclosed sex",
y = "F coefficient X chromosome") +
theme_minimal() +
theme(legend.position = "none")
}
```
**Check sex according to Y chromosome**
I recommend doing an initial check on the distribution of male vs female
Y counts and then you can modify the detection thresholds after.
Detection count thresholds can be modified by changing the following
flag: `--check-sex y-only [female max Y obs] [male min Y obs]`
```{bash check-sex-Y}
if [[ $(awk '$1 == "24" { count++ } END { print count }' ${prefix}.split-x.bim) != "" ]]; then
if [ $(awk -F' ' '{sum+=$5;}END{print sum;}' ${prefix}.split-x.fam) != "0" ]; then
$plink --bfile ${prefix}.split-x --check-sex y-only 2500 4000 --out ${prefix}.split-x.chrY
else
echo "Please edit the .fam with the sex of the samples"
fi
else
echo "Dataset does not contain Y chromosome. Y chromosome sex check was skipped"
fi
```
Plot sex in Y chromosome
```{r plot-sex-Y}
# Ensure "ggbeeswarm" is installed and loaded
if (!requireNamespace("ggbeeswarm", quietly = TRUE)) {
install.packages("ggbeeswarm")
}
library(ggbeeswarm)
# File path to search
Y_check_file = paste0(prefix,".split-x.chrY.sexcheck")
# Check if the file exists
if (file.exists(Y_check_file)) {
sex_Y = read.delim(file = paste0(prefix,".split-x.chrY.sexcheck"),
header = TRUE,
sep = '')
} else {
print("chrY.sexcheck file does not exist. Y chromosome sex check was skipped")
}
if (exists("sex_Y")) {
# Plot the variant call count in chromosome Y
# If there are individuals for which the fam has sex = 0 they will be plotted in an additional category called "no data"
if (any(fam$V5 %in% "0")) {
ggplot(sex_Y,
aes(x = factor(PEDSEX,
labels = c("No Data", "Male", "Female")),
y = YCOUNT,
color = PEDSEX)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
labs(title = "Chromosomal sex assignement in samples based in Y chromosome",
x = "Disclosed sex",
y = "Y chromosome variant count") +
theme_minimal() +
theme(legend.position = "none")
} else {
ggplot(sex_Y,
aes(x = factor(PEDSEX,
labels = c("Male", "Female")),
y = YCOUNT,
color = PEDSEX)) +
geom_quasirandom(alpha = 0.7,
size = 1.5) +
labs(title = "Chromosomal sex assignement in samples based in Y chromosome",
x = "Disclosed sex",
y = "Y chromosome variant count") +
theme_minimal() +
theme(legend.position = "none")
}
}
```
**Identify individuals that fail sex-check**
Per Plink's software recomendations, samples will only be cluded as 'sex-fails' if they have x misstch. Y chromosome data may be scarce for older males: see e.g. Dumanski JP et al. (2014) Smoking is associated with mosaic loss of chromosome Y.
Y chromosomes check will be added to the sample_statistics.txt as supporting data.
```{r identify-sex-check-fails}
# Identify individuals that fail in each chromosome test
fail_x = sex_X[sex_X$STATUS == 'PROBLEM',]
write.table(fail_x, file = "sex_fail_samples.txt",
quote = FALSE,
row.names = FALSE,
col.names = FALSE)
print(paste(nrow(fail_x), "samples have missmatched X chromosome sex. IDs are written in sex_fail_samples.txt"))
if (exists("sex_Y")) {
fail_y = sex_Y[sex_Y$STATUS == 'PROBLEM',]
print(paste(nrow(fail_y), "samples have unexpected count value in Y Chromosome. Take a look at the", Y_check_file, "These samples won't be excluded automatically"))
}
```
These samples will be removed after identifying duplicates.