Set of scripts to summarise, analyse and visualise Mutation Annotation Format (MAF) file(s) using maftools R package. The maftools manuscript is on bioRxiv and scripts are available on GitHub. The downstream analyses include cancer driver discovery.
- Installation
- MAF field requirements
- Scripts summary
- Converting VCF files to MAF files
- Converting ICGC mutation format to MAF
- Extracting variants within exonic regions
- Extracting variants of specific classification
- Changing sample names
- Subsetting MAF
- Merging MAFs
- Summarising and visualising MAF file(s)
- Summarising and visualising MAF file(s) for selected genes
Run the environment.yaml file to create conda environment and install required packages. The -p
flag should point to the miniconda installation path. For instance, to create maf-summary
environment using miniconda installed in /miniconda
directory run the following command:
conda env create -p /miniconda/envs/maf-summary --file envm/environment.yaml
Activate created maf-summary
conda environment before running the pipeline
conda activate maf-summary
Additionally, vcf2maf tool available on GitHub is required for Converting VCF files to MAF files.
While MAF files contain many fields ranging from chromosome names to cosmic annotations, the mandatory fields used by maftools are the following:
Field | Description | Allowed values |
---|---|---|
Hugo_Symbol | HUGO gene symbol | - |
Chromosome | Chromosome no. | 1-22, X, Y |
Start_Position | Event start position | Numeric |
End_Position | Event end position | Numeric |
Reference_Allele | Positive strand reference allele | A, T, C, G |
Tumor_Seq_Allele2 | Primary data genotype | A, T, C, G |
Variant_Classification | Translational effect of variant allele | Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame, Splice_Region, Unknown |
Variant_Type | Variant Type | SNP, DNP, INS, DEL, TNP and ONP |
Tumor_Sample_Barcode | Sample ID | Either a TCGA barcode, or for non-TCGA data, a literal SAMPLE_ID as listed in the clinical data file |
Script | Description |
---|---|
multi_vcf2maf.pl | Converts multiple VCF (Variant Call Format) file to MAF file |
icgcMutationToMAF.R | Converts ICGC Simple Somatic Mutation Format file to MAF file |
exons_maf.pl | Extracts variants detected within exonic regions |
var_class_maf.pl | Extracts variants of specific classification/consequence |
MAFsamplesRename.R | Changes sample names in MAF's Tumor_Sample_Barcode field |
subsetMAF.R | Subsets MAF based on defined list of samples and/or genes |
mergeMAFs.R | Merges multiple MAFs |
summariseMAFs.R | Summarises and visualises MAF file(s) |
summariseMAFsGenes.R | Summarises and visualises MAF file(s) for selected genes |
To convert multiple VCF files into one collective MAF file use multi_vcf2maf.pl script. It requires a file with listed VCF files to be converted into corresponding MAF files (see an example here). The individual MAF files are then merged into one collective MAF file, which is saved in the same directory as the files listing VCF files.
The following are required
- vcf2maf tool available on GitHub (see Installation section).
- Reference FASTA file (e.g. GRCh37 available on ncbi.nih.gov FTP site)
Script: multi_vcf2maf.pl
Argument | Description | Required |
---|---|---|
--vcf_list | Full path with name of a file listing VCF files to be converted | Yes |
--exons | Include exonic regions only: TRUE/T or FALSE/F (defualt) | No |
--v2m | Full path to vcf2maf.pl script | Yes |
--ref | Reference FASTA file (e.g. GRCh37) | Yes |
--maf_file | Name of the merged MAF file to be created | Yes |
NOTE: The definition of exonic regions is based on the MAF's Variant_Classification field and is described in Extracting variants within exonic regions section. If one requires to subset generated MAF file based on different Variant_Classification categories, then one can run the var_class_maf.pl script, which allows define Variant_Classification categories of interest using --var_class
parameter.
Command line use example:
perl scripts/multi_vcf2maf.pl --vcf_list examples/example_vcf_list.txt --exons FALSE --v2m /path/to/vcf2maf.pl --ref /path/to/reference/GRCh37-lite.fa --maf_file example.maf
This will convert the VCF files listed in example_vcf_list.txt into corresponding MAF files, which are then will be merged into one collective MAF file example.maf saved in examples directory.
NOTE: Sometimes the VCF files miss the chr
prefix for chromosome names, while these were present in the reference FASTA file (indicated by --ref
parameter). This may trigger error similar to the one below:
[W::fai_get_val] Reference 10:69103935-69103937 not found in file, returning empty sequence
[faidx] Failed to fetch sequence in 10:69103935-69103937
ERROR: Make sure that ref-fasta is the same genome build as your MAF: ../GCF_000001405.38_GRCh38.p12_genomic.fna
To solve that issue one needs to add chr
prefix to the VCF file of interest, e.g. no_chr.vcf
:
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf
The publicly available ICGC mutation data is stored in Simple Somatic Mutation Format file, which is similar to MAF format in its structure, but the field names and classification of variants are different. The icgcMutationToMAF.R script implements icgcSimpleMutationToMAF function within maftools R package to convert ICGC Simple Somatic Mutation Format to MAF.
Script: icgcMutationToMAF.R
Argument | Description | Required |
---|---|---|
--icgc_file | ICGC Simple Somatic Mutation Format file to be converted | Yes |
--remove_duplicated_variants | Remove repeated variants in a particuar sample, mapped to multiple transcripts of same gene? (OPTIONAL; defulat is TRUE ). NOTE, option TRUE removes all repeated variants as duplicated entries. FALSE results in keeping all of them) |
No |
--output | Output file name | No |
Command line use example:
gunzip examples/simple_somatic_mutation.open.PACA-AU.tsv.gz
Rscript scripts/icgcMutationToMAF.R --icgc_file examples/simple_somatic_mutation.open.PACA-AU.tsv --output simple_somatic_mutation.open.PACA-AU.maf
This will convert the /data/simple_somatic_mutation.open.PACA-AU.tsv Simple Somatic Mutation Format file into MAF as output it as /data/simple_somatic_mutation.open.PACA-AU.maf.
To extract variants detected within exonic regions run the exons_maf.pl script. It will create a MAF file with .exonic.maf extension, which will have the following variants included/excluded based on the input MAF's Variant_Classification field:
Variants included | Variants excluded |
---|---|
Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Silent, Translation_Start_Site | 3'Flank, 3'UTR, 5'Flank, 5'UTR, IGR, Intron, RNA, Splice_Site, Splice_Region |
NOTE: If one requires to subset MAF file based on different Variant_Classification categories, then one can run the var_class_maf.pl script, which allows define Variant_Classification categories of interest using --var_class
parameter.
Script: exons_maf.pl
Argument | Description | Required |
---|---|---|
--maf | Full path with name of a MAF file to be converted | Yes |
Command line use example:
perl scripts/exons_maf.pl --maf examples/simple_somatic_mutation.open.PACA-AU.maf
This will extract variants within exonic regions reported in /data/simple_somatic_mutation.open.PACA-AU.maf file and will save them in /data/simple_somatic_mutation.open.PACA-AU.exonic.maf.
To extract variants with specific consequences run the var_class_maf.pl script. It will create a MAF file with .var_class.maf extension, which will have the user-defined variants included based on the input MAF's Variant_Classification field. Available options are listed in MAF field requirements section.:
Script: var_class_maf.pl
Argument | Description | Required |
---|---|---|
--maf | Full path with name of a MAF file to be converted | Yes |
--var_class | List of variants classifications to incude in the subet MAF. The default list includes exonic regions, i.e. marked as Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Silent and Translation_Start_Site in MAF's Variant_Classification field | No |
Command line use example:
perl scripts/var_class_maf.pl --maf examples/simple_somatic_mutation.open.PACA-AU.maf --var_class Missense_Mutation,Nonsense_Mutation
This will extract variants within exonic regions reported in /data/simple_somatic_mutation.open.PACA-AU.maf file and will save them in /data/simple_somatic_mutation.open.PACA-AU.var_class.maf.
To change sample names (as shown in MAF's Tumor_Sample_Barcode field) run the MAFsamplesRename.R script. It expects a file listing samples to be renamed. The first column is expected to contain sample names (as shown MAF's Tumor_Sample_Barcode field) to be changed and the second columns is expected to contain the corresponding name to be used instead (see example file example_samples_to_rename.txt).
Script: MAFsamplesRename.R
Argument | Description | Required |
---|---|---|
--maf_file | MAF file to be processed | Yes |
--names_file | Name and path to a file listing samples to be renamed | Yes |
--output | Name for the output MAF file | No |
Packages: maftools, optparse, tibble
Command line use example:
Rscript scripts/MAFsamplesRename.R --maf_file examples/simple_somatic_mutation.open.PACA-AU.maf --names_file examples/example_samples_to_rename.txt --output examples/simple_somatic_mutation.open.PACA-AU_samples_renamed.maf
This will create a /data/simple_somatic_mutation.open.PACA-AU_samples_renamed.maf with sample names changed according to /examples/example_samples_to_rename.txt file.
NOTE: If no output file name is specified the output will have the same name as the input maf_file with suffix _samples_renamed.maf.
To subset MAF based on a list of samples and/or genes run the subsetMAF.R script. It also allows to subset MAF based on variants classifiaction as defined in MAF's Variant_Classification field. The script expects file(s) listing samples (as shown in MAF's Tumor_Sample_Barcode field) and/or genes to include in the new MAF file (see example files example_samples_to_subset.txt and example_genes_to_subset.txt).
Script: subsetMAF.R
Argument | Description | Required |
---|---|---|
--maf_file | MAF file to be subsetted | Yes |
--samples | Name and path to a file listing samples to be kept in the subsetted MAF. Sample names are expected to be separated by comma. Use all to keep all samples (OPTIONAL) | No |
--genes | Name and path to a file listing genes to be kept in the subsetted MAF. Gene symbols are expected to be separated by comma. Use all to keep all genes (OPTIONAL) | No |
--var_class | Classification of variants to be kept in the subsetted MAF (OPTIONAL). Available options are listed in MAF field requirements section | No |
--output | Name for the subsetted MAF | No |
NOTE: The available variants types for var_class parameter are listed in Variant_Classification row in the MAF field requirements section.
Command line use example:
Rscript scripts/subsetMAF.R --maf_file examples/simple_somatic_mutation.open.PACA-AU.maf --samples examples/example_samples_to_subset.txt --genes examples/example_ genes_to_subset.txt --var_class Missense_Mutation --output examples/simple_somatic_mutation.open.PACA-AU_subset.maf
This will create a /data/simple_somatic_mutation.open.PACA-AU_subset.maf restricted to samples and genes listed in /examples/example_samples_to_subset.txt and /examples/example_ genes_to_subset.txt, respectively, and containing missense mutations.
NOTE: If no output file name is specified the output will have the same name as the input maf_file with suffix _subset.maf.
To merge multiple MAFs run the mergeMAFs.R script. It uses merge_mafs function in maftools R package, which merges MAFs based on MAF's Tumor_Sample_Barcode field in each MAF file.
Script: mergeMAFs.R
Argument | Description | Required |
---|---|---|
--maf_dir | Directory with MAF files to be merged | Yes |
--maf_files | List of MAF files to be merged. Each file name is expected to be separated by comma | Yes |
--maf_fields | Fields to be kept in merged MAF. Options available: All (default), Nonredundant (i.e. those which are present in more than one dataset) and Basic (see MAF field requirements section) | No |
--output | Location and name for the merged MAF file | No |
Command line use example:
Rscript scripts/mergeMAFs.R --maf_dir examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --maf_fields All --output examples/icgc.simple_somatic_mutation.merged.maf
This will create merged /data/icgc.simple_somatic_mutation.merged.maf file with mutation data from the individual /data/simple_somatic_mutation.open.PACA-AU.maf and /data/simple_somatic_mutation.open.PACA-CA.maf files.
NOTE: If no output file name is specified the output will be saved as merged.maf in the directory with MAF files to be merged.
To summarise MAF file(s) run the summariseMAFs.R script. This script catches the arguments from the command line and passes them to the summariseMAFs.Rmd script to produce the html report, generate set of plots and excel spreadsheets summarising each MAF file.
NOTE: Only non-synonymous variants with high/moderate variant consequences, including frame shift deletions, frame shift insertions, splice site mutations, translation start site mutations ,nonsense mutation, nonstop mutations, in-frame deletion, in-frame insertions and missense mutation, are reported (silent variants are ignored). One can manually define variant classifications to be considered as non-synonymous using --nonSyn_list
parameter.
Script: summariseMAFs.R
Argument | Description | Required |
---|---|---|
--maf_dir | Directory with MAF file(s) | Yes |
--maf_files | List of MAF file(s) to be processed. Each file name is expected to be separated by comma | Yes |
--datasets | Desired names of each dataset. The names are expected to be in the same order as provided MAF files | Yes |
--samples_id_cols | The name(s) of MAF file(s) column containing samples' IDs. One column name is expected for a single file, and each separated by comma. The defualt samples' ID column is Tumor_Sample_Barcode |
No |
--genes_min | Minimal percentage of patients carrying mutations in individual genes to be included in the report (default is 4 ) |
No |
--genes_list | Location and name of a file listing genes of interest to be considered in the report (OPTIONAL) | No |
--genes_keep_order | Keep order of genes as provided in file specified by genes_list parameter (OPTIONAL; default is FALSE ) |
No |
--genes_blacklist | Location and name of a file listing genes to be excluded (OPTIONAL). Header is not expected and the genes should be listed in separate lines | No |
--samples_list | Location and name of a file listing specific samples to be included (OPTIONAL). All other samples will be ignored. The ID of samples to be included are expected to be listed in column named Tumor_Sample_Barcode . Additional columns are also allowed |
No |
--samples_keep_order | Keep order of samples as provided in the MAF file (OPTIONAL; default is FALSE ). |
No |
--sort_by_annotation | Sort samples by provided clinical_features . Sorts based on first clinical_features (OPTIONAL; default is FALSE ). |
No |
--samples_blacklist | Location and name of a file listing samples to be excluded (OPTIONAL). The ID of samples to be excluded are expected to be listed in column named Tumor_Sample_Barcode . Additional columns are allowed |
No |
--nonSyn_list | List of variant classifications to be considered as non-synonymous. Rest will be considered as silent variants. Default uses Variant Classifications with High/Moderate variant consequences |
No |
--remove_duplicated_variants | Remove repeated variants in a particuar sample, mapped to multiple transcripts of same gene? (defulat is TRUE ). NOTE, option TRUE removes all repeated variants as duplicated entries. FALSE results in keeping all of them |
No |
--pathways | Location of a file with custom pathway list in the form of a two column tsv file containing gene names and their corresponding pathway (OPTIONAL) | No |
--purple | Location of the corresponding PURPLE output files (OPTIONAL) | No |
--purple_hd | Copy-number (CN) upper threshold to call homozygous deletion (HD) in PURPLE output files (OPTIONAL; defulat is 0.5 ) |
No |
--purple_loh | CN upper threshold to call loss of heterozygosity (LOH) in PURPLE output files (OPTIONAL; defulat is 1.5 ) |
No |
--purple_amp | CN lower threshold to call amplification in PURPLE output files (OPTIONAL; defulat is 6 ) |
No |
--cnvkit | Location of the corresponding CNVkit output files (OPTIONAL) | No |
--cnvkit_hd | Copy-number (CN) upper threshold to call homozygous deletion (HD) in CNVkit output files (OPTIONAL; defulat is 0.5 ) |
No |
--cnvkit_loh | CN upper threshold to call loss of heterozygosity (LOH) in CNVkit output files (OPTIONAL; defulat is 1.5 ) |
No |
--cnvkit_amp | CN lower threshold to call amplification in CNVkit output files (OPTIONAL; defulat is 6 ) |
No |
--gistic | Location of the corresponding GISTIC output files (including gisticAllLesionsFile, gisticAmpGenesFile, gisticDelGenesFile and gisticScoresFile) (OPTIONAL) | No |
--clinical_info | Location of clinical data associated with each sample in individual MAF file. Each file name (for each dataset) is expected to be separated by comma (OPTIONAL) | No |
--clinical_features | Columns names (separated by comma) from clinical data (specified by --clinical_info argument) to be drawn in oncoplot(s) (OPTIONAL) | No |
--clinical_enrichment_p | P-value threshold for clinical enrichment analysis (OPTIONAL; defulat is 0.05 ) |
No |
--signature_enrichment_p | P-value threshold for reporting significant enrichment of genes in detected mutational signatures (OPTIONAL; defulat is 0.05 ) |
No |
--maf_comp_p | P-value threshold for reporting significant differences between cohorts (OPTIONAL; defulat is 0.05 ) |
No |
--maf_comp_fdr | FDR threshold for reporting significant differences between cohorts (OPTIONAL; defulat is 1 ) |
No |
--out_folder | Output folder | No |
--hide_code_btn | Hide the Code button allowing to show/hide code chunks in the final HTML report. Available options are: TRUE (default) and FALSE |
No |
--ucsc_genome_assembly | Human reference genome version used for signature analysis. Available options are: 19 (default) and 38 |
No |
Packages: required packages are listed in environment.yaml file.
Purple output files can be combined and converted into GISTIC compatible seg file using purple2gistic.R script.
Command line use example:
cd scripts
Rscript summariseMAFs.R --maf_dir ../examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --datasets ICGC-PACA-AU,ICGC-PACA-CA --genes_min 4 --out_folder ../examples/MAF_summary_report
This will generate summariseMAFs.html report with interactive summary tables and heatmaps within examples / MAF_summary_report folder. It will also create results
folder with user-defined name containing output tables and plots described here.
To summarise MAF file(s) for specific set of genes run the summariseMAFsGenes.R script. This script catches the arguments from the command line and passes them to the summariseMAFsGenes.Rmd script to produce the html report, generate set of plots and excel spreadsheets summarising user-defined genes for individual MAF files.
NOTE: Only non-synonymous variants with high/moderate variant consequences, including frame shift deletions, frame shift deletions, splice site mutations, translation start site mutations ,nonsense mutation, nonstop mutations, in-frame deletion, in-frame insertions and missense mutation, are reported (silent variants are ignored).
Make sure that X11 is installed, as this is required to generate the interactive plots.
Script: summariseMAFsGenes.R
Argument | Description | Required |
---|---|---|
--maf_dir | Directory with MAF file(s) | Yes |
--maf_files | List of MAF file(s) to be processed. Each file name is expected to be separated by comma | Yes |
--datasets | Desired names of each dataset. The names are expected to be in the same order as provided MAF files | Yes |
--genes | Genes to query in each MAF file | Yes |
--out_folder | Output folder | No |
Packages: maftools, openxlsx, optparse, knitr, DT, plotly, heatmaply, ggplot2
Command line use example:
cd scripts
Rscript summariseMAFsGenes.R --maf_dir ../examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --datasets ICGC-PACA-AU,ICGC-PACA-CA --genes KRAS,SMAD4,TP53,CDKN2A,ARID1A,BRCA1,BRCA2 --out_folder ../examples/MAF_summary_report_genes
This will generate summariseMAFsGenes.html report with interactive summary tables and heatmaps within MAF_summary folder. It will also create a folder with user-defined name containing output tables and plots described here.
To subset full MAF file to a subset of interest, the samples IDs for the samples to be included or excluded are needed. For KRAS-wt cohort, the sample IDs for KRAS mutant samples can be extracted from atlas MAF using following commands:
Below mafInfo[[i]] is full cohort MAF
> om = maftools:::createOncoMatrix(m = mafInfo[[i]], g = c("KRAS"))
> names(om)
[1] "oncoMatrix" "numericMatrix" "vc"
> write.table(om$numericMatrix, path = "oncomatrix.csv", sep = "\t")
It will return a list of matrices which includes a numericMatrix element which is what we're interested in. Values are integer coded for the variant classification, and those codes are explained in vc slot. The IDs in this file are for samples that have a KRAS non-synonymous mutation. These can be saved in a new file (kras_mut_ids.txt
) and converted to a proper format using $ awk '{ for (i=1; i<=NF; i++) print substr($i, 2, length($i)-2) }' kras_mut_ids.txt > kras_mut.txt