Skip to content

Latest commit

 

History

History
84 lines (57 loc) · 3.82 KB

2.call_variants_and_fasta_consensus.md

File metadata and controls

84 lines (57 loc) · 3.82 KB

Variant call and filter with bcftools

Use bcftools mpileup, call and filter to obtain genotypes per individual.

bcftools mpileup -Q 20 -q 20 -r $chromosome -O b -o $sample_raw_$chromosome.bcf
bcftools call -cO z --format-fields GQ -o $sample_$chromosome_call.vcf.gz $sample_raw_$chromosome.bcf

I decided to concatenate the bcf files here, but you don't have to. To concatenate them in a specific order, like the chromosome order on the bam files, extract the order contained in the bam header.

samtools view -H $sample.realigned.bam | grep '@SQ' | cut -f2 | cut -d':' -f2 > file.bcf.header

Using this order file with chromosome names, create a list that has the names of the each chromosome files for your specific sample. Something like this:

$sample_1_call.vcf.gz
$sample_2_call.vcf.gz
$sample_3_call.vcf.gz
$sample_4_call.vcf.gz
...
bcftools concat -O z -f $sample_order_file.txt -o $sample_call.vcf.gz

Finally, apply a filter to your called genotypes. The file max_depth.txt has sample IDs in the first column and the average depth X 3 for each sample (to use as DP). We obtained depth information from part 1, point 6.

parallel -j 24 --colsep '\t' '~/bcftools/bcftools filter --include "type!=\"INDEL\" && STRLEN(REF) == 1 && QUAL >= 20 && MQ >= 20 && INFO/DP >= 6 && INFO/DP < {2} && FMT/GQ >=20 || type!=\"INDEL\" && STRLEN(REF) == 1 && QUAL >= 20 && MQ >= 20 && INFO/DP >= 6 && INFO/DP < {2} && FMT/GT==\"0/0\"" --SnpGap 10 -O z -o {1}_filter_f3.vcf.gz {1}_call.vcf.gz' :::: max_depth.txt

Generating a consensus fasta for each individual

Disclaimer: There are probably better ways to do this, but these were the commands I used.

  1. Index your filtered vcf files.
find *filter*.vcf.gz | parallel -j30 '~/bcftools/bcftools index {}'
  1. Generate an intermediate file that containts three columns with chromosome, position and genotype that passed each filter. list.txt contains a list of sample IDs.
parallel --colsep '\t' -j 12 "bcftools view --apply-filters 'PASS' {1}_filter_f3.vcf.gz | vcf-query -f '%CHROM\t%POS\t[%GT]\n' | sed 's/\///g' > {1}_filter_PASS_GT.txt" :::: list.txt

The _PASS_GT.txt will have the following format (note: the file should not have a header). It should have whole-genome genotypes:

1 2345 GG
1 3456 TA
1 4567 AA
...
  1. Split this file into chromosomes, for speed. This command will create a folder per chromosome, and output each new chromosome file to the respective folder.
for i in $(ls *_PASS_GT.txt); do o=${i/_filter_PASS_GT.txt/}; mkdir $o; awk -F '\t' '{print > var"/"var"_"$1"_PASS_GT.txt"}' var="$o" $i; done
  1. To generate the fasta alignments for each chromosome, we will use a costume script generate_consensus.pl. This script will create consensus chromosome sequences with the genotypes provided in _PASS_GT.txt. All non-called positions in the chromosome will be hard masked. I specified the name (1st column) and size of each chromosome (2nd column) with the file chromosomes_lengths.txt.

Because not all individuals have genotypes called for all chromosomes in the capture, I created extra _PASS_GT.txt files using create_extra_chromosome_files.py. target_chromosomes.txt is a li st of all chromosomes in the capture.

for d in $(ls -d */); do cd $d; create_extra_chromosome_files.py ../target_chromosomes.txt; cd ../; done

Use this to run generate_consensus.pl:

parallel -j 16 --colsep '\t' 'generate_consensus.pl {1}/{1}_{2}_PASS_GT.txt {3} > {1}/{1}_{2}.fasta' :::: samples.txt :::: chromosome_lengths.txt

Go back to main page