MAF-summary

Set of scripts to summarise, analyse and visualise Mutation Annotation Format (MAF) file(s) using maftools R package. The maftools manuscript is on bioRxiv and scripts are available on GitHub. The downstream analyses include cancer driver discovery.

Installation
MAF field requirements
Scripts summary
Converting VCF files to MAF files
Converting ICGC mutation format to MAF
Extracting variants within exonic regions
Extracting variants of specific classification
Changing sample names
Subsetting MAF
Merging MAFs
Summarising and visualising MAF file(s)
Summarising and visualising MAF file(s) for selected genes

Installation

Run the environment.yaml file to create conda environment and install required packages. The -p flag should point to the miniconda installation path. For instance, to create maf-summary environment using miniconda installed in /miniconda directory run the following command:

conda env create -p /miniconda/envs/maf-summary --file envm/environment.yaml

Activate created maf-summary conda environment before running the pipeline

conda activate maf-summary

Additionally, vcf2maf tool available on GitHub is required for Converting VCF files to MAF files.

MAF field requirements

While MAF files contain many fields ranging from chromosome names to cosmic annotations, the mandatory fields used by maftools are the following:

Field	Description	Allowed values
Hugo_Symbol	HUGO gene symbol	-
Chromosome	Chromosome no.	1-22, X, Y
Start_Position	Event start position	Numeric
End_Position	Event end position	Numeric
Reference_Allele	Positive strand reference allele	A, T, C, G
Tumor_Seq_Allele2	Primary data genotype	A, T, C, G
Variant_Classification	Translational effect of variant allele	Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, Intron, RNA, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame, Splice_Region, Unknown
Variant_Type	Variant Type	SNP, DNP, INS, DEL, TNP and ONP
Tumor_Sample_Barcode	Sample ID	Either a TCGA barcode, or for non-TCGA data, a literal SAMPLE_ID as listed in the clinical data file

Scripts summary

Script	Description
multi_vcf2maf.pl	Converts multiple VCF (Variant Call Format) file to MAF file
icgcMutationToMAF.R	Converts ICGC Simple Somatic Mutation Format file to MAF file
exons_maf.pl	Extracts variants detected within exonic regions
var_class_maf.pl	Extracts variants of specific classification/consequence
MAFsamplesRename.R	Changes sample names in MAF's Tumor_Sample_Barcode field
subsetMAF.R	Subsets MAF based on defined list of samples and/or genes
mergeMAFs.R	Merges multiple MAFs
summariseMAFs.R	Summarises and visualises MAF file(s)
summariseMAFsGenes.R	Summarises and visualises MAF file(s) for selected genes

Converting VCF files to MAF files

To convert multiple VCF files into one collective MAF file use multi_vcf2maf.pl script. It requires a file with listed VCF files to be converted into corresponding MAF files (see an example here). The individual MAF files are then merged into one collective MAF file, which is saved in the same directory as the files listing VCF files.

Note

The following are required

vcf2maf tool available on GitHub (see Installation section).
Reference FASTA file (e.g. GRCh37 available on ncbi.nih.gov FTP site)

Script: multi_vcf2maf.pl

Argument	Description	Required
--vcf_list	Full path with name of a file listing VCF files to be converted	Yes
--exons	Include exonic regions only: TRUE/T or FALSE/F (defualt)	No
--v2m	Full path to vcf2maf.pl script	Yes
--ref	Reference FASTA file (e.g. GRCh37)	Yes
--maf_file	Name of the merged MAF file to be created	Yes

NOTE: The definition of exonic regions is based on the MAF's Variant_Classification field and is described in Extracting variants within exonic regions section. If one requires to subset generated MAF file based on different Variant_Classification categories, then one can run the var_class_maf.pl script, which allows define Variant_Classification categories of interest using --var_class parameter.

Command line use example:

perl scripts/multi_vcf2maf.pl  --vcf_list examples/example_vcf_list.txt  --exons FALSE  --v2m /path/to/vcf2maf.pl  --ref /path/to/reference/GRCh37-lite.fa  --maf_file example.maf

This will convert the VCF files listed in example_vcf_list.txt into corresponding MAF files, which are then will be merged into one collective MAF file example.maf saved in examples directory.

NOTE: Sometimes the VCF files miss the chr prefix for chromosome names, while these were present in the reference FASTA file (indicated by --ref parameter). This may trigger error similar to the one below:

[W::fai_get_val] Reference 10:69103935-69103937 not found in file, returning empty sequence
[faidx] Failed to fetch sequence in 10:69103935-69103937
ERROR: Make sure that ref-fasta is the same genome build as your MAF: ../GCF_000001405.38_GRCh38.p12_genomic.fna

To solve that issue one needs to add chr prefix to the VCF file of interest, e.g. no_chr.vcf:

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Converting ICGC mutation format to MAF

The publicly available ICGC mutation data is stored in Simple Somatic Mutation Format file, which is similar to MAF format in its structure, but the field names and classification of variants are different. The icgcMutationToMAF.R script implements icgcSimpleMutationToMAF function within maftools R package to convert ICGC Simple Somatic Mutation Format to MAF.

Script: icgcMutationToMAF.R

Argument	Description	Required
--icgc_file	ICGC Simple Somatic Mutation Format file to be converted	Yes
--remove_duplicated_variants	Remove repeated variants in a particuar sample, mapped to multiple transcripts of same gene? (OPTIONAL; defulat is `TRUE`). NOTE, option `TRUE` removes all repeated variants as duplicated entries. `FALSE` results in keeping all of them)	No
--output	Output file name	No

Packages: maftools, optparse

Command line use example:

gunzip examples/simple_somatic_mutation.open.PACA-AU.tsv.gz

Rscript scripts/icgcMutationToMAF.R --icgc_file examples/simple_somatic_mutation.open.PACA-AU.tsv --output simple_somatic_mutation.open.PACA-AU.maf

This will convert the /data/simple_somatic_mutation.open.PACA-AU.tsv Simple Somatic Mutation Format file into MAF as output it as /data/simple_somatic_mutation.open.PACA-AU.maf.

Extracting variants within exonic regions

To extract variants detected within exonic regions run the exons_maf.pl script. It will create a MAF file with .exonic.maf extension, which will have the following variants included/excluded based on the input MAF's Variant_Classification field:

Variants included	Variants excluded
Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Silent, Translation_Start_Site	3'Flank, 3'UTR, 5'Flank, 5'UTR, IGR, Intron, RNA, Splice_Site, Splice_Region

NOTE: If one requires to subset MAF file based on different Variant_Classification categories, then one can run the var_class_maf.pl script, which allows define Variant_Classification categories of interest using --var_class parameter.

Script: exons_maf.pl

Argument	Description	Required
--maf	Full path with name of a MAF file to be converted	Yes

Command line use example:

perl scripts/exons_maf.pl --maf examples/simple_somatic_mutation.open.PACA-AU.maf

This will extract variants within exonic regions reported in /data/simple_somatic_mutation.open.PACA-AU.maf file and will save them in /data/simple_somatic_mutation.open.PACA-AU.exonic.maf.

Extracting variants of specific classification

To extract variants with specific consequences run the var_class_maf.pl script. It will create a MAF file with .var_class.maf extension, which will have the user-defined variants included based on the input MAF's Variant_Classification field. Available options are listed in MAF field requirements section.:

Script: var_class_maf.pl

Argument	Description	Required
--maf	Full path with name of a MAF file to be converted	Yes
--var_class	List of variants classifications to incude in the subet MAF. The default list includes exonic regions, i.e. marked as Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Silent and Translation_Start_Site in MAF's Variant_Classification field	No

Command line use example:

perl scripts/var_class_maf.pl --maf examples/simple_somatic_mutation.open.PACA-AU.maf --var_class Missense_Mutation,Nonsense_Mutation

This will extract variants within exonic regions reported in /data/simple_somatic_mutation.open.PACA-AU.maf file and will save them in /data/simple_somatic_mutation.open.PACA-AU.var_class.maf.

Changing sample names

To change sample names (as shown in MAF's Tumor_Sample_Barcode field) run the MAFsamplesRename.R script. It expects a file listing samples to be renamed. The first column is expected to contain sample names (as shown MAF's Tumor_Sample_Barcode field) to be changed and the second columns is expected to contain the corresponding name to be used instead (see example file example_samples_to_rename.txt).

Script: MAFsamplesRename.R

Argument	Description	Required
--maf_file	MAF file to be processed	Yes
--names_file	Name and path to a file listing samples to be renamed	Yes
--output	Name for the output MAF file	No

Packages: maftools, optparse, tibble

Command line use example:

Rscript scripts/MAFsamplesRename.R --maf_file examples/simple_somatic_mutation.open.PACA-AU.maf --names_file examples/example_samples_to_rename.txt --output examples/simple_somatic_mutation.open.PACA-AU_samples_renamed.maf

This will create a /data/simple_somatic_mutation.open.PACA-AU_samples_renamed.maf with sample names changed according to /examples/example_samples_to_rename.txt file.

NOTE: If no output file name is specified the output will have the same name as the input maf_file with suffix _samples_renamed.maf.

Subsetting MAF

To subset MAF based on a list of samples and/or genes run the subsetMAF.R script. It also allows to subset MAF based on variants classifiaction as defined in MAF's Variant_Classification field. The script expects file(s) listing samples (as shown in MAF's Tumor_Sample_Barcode field) and/or genes to include in the new MAF file (see example files example_samples_to_subset.txt and example_genes_to_subset.txt).

Script: subsetMAF.R

Argument	Description	Required
--maf_file	MAF file to be subsetted	Yes
--samples	Name and path to a file listing samples to be kept in the subsetted MAF. Sample names are expected to be separated by comma. Use all to keep all samples (OPTIONAL)	No
--genes	Name and path to a file listing genes to be kept in the subsetted MAF. Gene symbols are expected to be separated by comma. Use all to keep all genes (OPTIONAL)	No
--var_class	Classification of variants to be kept in the subsetted MAF (OPTIONAL). Available options are listed in MAF field requirements section	No
--output	Name for the subsetted MAF	No

NOTE: The available variants types for var_class parameter are listed in Variant_Classification row in the MAF field requirements section.

Packages: maftools, optparse

Command line use example:

Rscript scripts/subsetMAF.R --maf_file examples/simple_somatic_mutation.open.PACA-AU.maf --samples examples/example_samples_to_subset.txt --genes examples/example_ genes_to_subset.txt --var_class Missense_Mutation --output examples/simple_somatic_mutation.open.PACA-AU_subset.maf

This will create a /data/simple_somatic_mutation.open.PACA-AU_subset.maf restricted to samples and genes listed in /examples/example_samples_to_subset.txt and /examples/example_ genes_to_subset.txt, respectively, and containing missense mutations.

NOTE: If no output file name is specified the output will have the same name as the input maf_file with suffix _subset.maf.

Merging MAFs

To merge multiple MAFs run the mergeMAFs.R script. It uses merge_mafs function in maftools R package, which merges MAFs based on MAF's Tumor_Sample_Barcode field in each MAF file.

Script: mergeMAFs.R

Argument	Description	Required
--maf_dir	Directory with MAF files to be merged	Yes
--maf_files	List of MAF files to be merged. Each file name is expected to be separated by comma	Yes
--maf_fields	Fields to be kept in merged MAF. Options available: All (default), Nonredundant (i.e. those which are present in more than one dataset) and Basic (see MAF field requirements section)	No
--output	Location and name for the merged MAF file	No

Packages: maftools, optparse

Command line use example:

Rscript scripts/mergeMAFs.R --maf_dir examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --maf_fields All --output examples/icgc.simple_somatic_mutation.merged.maf

This will create merged /data/icgc.simple_somatic_mutation.merged.maf file with mutation data from the individual /data/simple_somatic_mutation.open.PACA-AU.maf and /data/simple_somatic_mutation.open.PACA-CA.maf files.

NOTE: If no output file name is specified the output will be saved as merged.maf in the directory with MAF files to be merged.

Summarising and visualising MAF file(s)

To summarise MAF file(s) run the summariseMAFs.R script. This script catches the arguments from the command line and passes them to the summariseMAFs.Rmd script to produce the html report, generate set of plots and excel spreadsheets summarising each MAF file.

NOTE: Only non-synonymous variants with high/moderate variant consequences, including frame shift deletions, frame shift insertions, splice site mutations, translation start site mutations ,nonsense mutation, nonstop mutations, in-frame deletion, in-frame insertions and missense mutation, are reported (silent variants are ignored). One can manually define variant classifications to be considered as non-synonymous using --nonSyn_list parameter.

Script: summariseMAFs.R

Argument	Description	Required
--maf_dir	Directory with MAF file(s)	Yes
--maf_files	List of MAF file(s) to be processed. Each file name is expected to be separated by comma	Yes
--datasets	Desired names of each dataset. The names are expected to be in the same order as provided MAF files	Yes
--samples_id_cols	The name(s) of MAF file(s) column containing samples' IDs. One column name is expected for a single file, and each separated by comma. The defualt samples' ID column is `Tumor_Sample_Barcode`	No
--genes_min	Minimal percentage of patients carrying mutations in individual genes to be included in the report (default is `4`)	No
--genes_list	Location and name of a file listing genes of interest to be considered in the report (OPTIONAL)	No
--genes_keep_order	Keep order of genes as provided in file specified by `genes_list` parameter (OPTIONAL; default is `FALSE`)	No
--genes_blacklist	Location and name of a file listing genes to be excluded (OPTIONAL). Header is not expected and the genes should be listed in separate lines	No
--samples_list	Location and name of a file listing specific samples to be included (OPTIONAL). All other samples will be ignored. The ID of samples to be included are expected to be listed in column named `Tumor_Sample_Barcode`. Additional columns are also allowed	No
--samples_keep_order	Keep order of samples as provided in the MAF file (OPTIONAL; default is `FALSE`).	No
--sort_by_annotation	Sort samples by provided `clinical_features`. Sorts based on first `clinical_features`(OPTIONAL; default is `FALSE`).	No
--samples_blacklist	Location and name of a file listing samples to be excluded (OPTIONAL). The ID of samples to be excluded are expected to be listed in column named `Tumor_Sample_Barcode`. Additional columns are allowed	No
--nonSyn_list	List of variant classifications to be considered as non-synonymous. Rest will be considered as silent variants. Default uses Variant Classifications with `High/Moderate variant consequences`	No
--remove_duplicated_variants	Remove repeated variants in a particuar sample, mapped to multiple transcripts of same gene? (defulat is `TRUE`). NOTE, option `TRUE` removes all repeated variants as duplicated entries. `FALSE` results in keeping all of them	No
--pathways	Location of a file with custom pathway list in the form of a two column tsv file containing gene names and their corresponding pathway (OPTIONAL)	No
--purple	Location of the corresponding PURPLE output files (OPTIONAL)	No
--purple_hd	Copy-number (CN) upper threshold to call homozygous deletion (HD) in PURPLE output files (OPTIONAL; defulat is `0.5`)	No
--purple_loh	CN upper threshold to call loss of heterozygosity (LOH) in PURPLE output files (OPTIONAL; defulat is `1.5`)	No
--purple_amp	CN lower threshold to call amplification in PURPLE output files (OPTIONAL; defulat is `6`)	No
--cnvkit	Location of the corresponding CNVkit output files (OPTIONAL)	No
--cnvkit_hd	Copy-number (CN) upper threshold to call homozygous deletion (HD) in CNVkit output files (OPTIONAL; defulat is `0.5`)	No
--cnvkit_loh	CN upper threshold to call loss of heterozygosity (LOH) in CNVkit output files (OPTIONAL; defulat is `1.5`)	No
--cnvkit_amp	CN lower threshold to call amplification in CNVkit output files (OPTIONAL; defulat is `6`)	No
--gistic	Location of the corresponding GISTIC output files (including gisticAllLesionsFile, gisticAmpGenesFile, gisticDelGenesFile and gisticScoresFile) (OPTIONAL)	No
--clinical_info	Location of clinical data associated with each sample in individual MAF file. Each file name (for each dataset) is expected to be separated by comma (OPTIONAL)	No
--clinical_features	Columns names (separated by comma) from clinical data (specified by --clinical_info argument) to be drawn in oncoplot(s) (OPTIONAL)	No
--clinical_enrichment_p	P-value threshold for clinical enrichment analysis (OPTIONAL; defulat is `0.05`)	No
--signature_enrichment_p	P-value threshold for reporting significant enrichment of genes in detected mutational signatures (OPTIONAL; defulat is `0.05`)	No
--maf_comp_p	P-value threshold for reporting significant differences between cohorts (OPTIONAL; defulat is `0.05`)	No
--maf_comp_fdr	FDR threshold for reporting significant differences between cohorts (OPTIONAL; defulat is `1`)	No
--out_folder	Output folder	No
--hide_code_btn	Hide the Code button allowing to show/hide code chunks in the final HTML report. Available options are: `TRUE` (default) and `FALSE`	No
--ucsc_genome_assembly	Human reference genome version used for signature analysis. Available options are: `19` (default) and `38`	No

Packages: required packages are listed in environment.yaml file.

Note

Purple output files can be combined and converted into GISTIC compatible seg file using purple2gistic.R script.

Command line use example:

cd scripts

Rscript summariseMAFs.R --maf_dir ../examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --datasets ICGC-PACA-AU,ICGC-PACA-CA --genes_min 4 --out_folder ../examples/MAF_summary_report

This will generate summariseMAFs.html report with interactive summary tables and heatmaps within examples / MAF_summary_report folder. It will also create results folder with user-defined name containing output tables and plots described here.

Summarising and visualising MAF file(s) for selected genes

To summarise MAF file(s) for specific set of genes run the summariseMAFsGenes.R script. This script catches the arguments from the command line and passes them to the summariseMAFsGenes.Rmd script to produce the html report, generate set of plots and excel spreadsheets summarising user-defined genes for individual MAF files.

NOTE: Only non-synonymous variants with high/moderate variant consequences, including frame shift deletions, frame shift deletions, splice site mutations, translation start site mutations ,nonsense mutation, nonstop mutations, in-frame deletion, in-frame insertions and missense mutation, are reported (silent variants are ignored).

Make sure that X11 is installed, as this is required to generate the interactive plots.

Script: summariseMAFsGenes.R

Argument	Description	Required
--maf_dir	Directory with MAF file(s)	Yes
--maf_files	List of MAF file(s) to be processed. Each file name is expected to be separated by comma	Yes
--datasets	Desired names of each dataset. The names are expected to be in the same order as provided MAF files	Yes
--genes	Genes to query in each MAF file	Yes
--out_folder	Output folder	No

Packages: maftools, openxlsx, optparse, knitr, DT, plotly, heatmaply, ggplot2

Command line use example:

cd scripts

Rscript summariseMAFsGenes.R --maf_dir ../examples --maf_files simple_somatic_mutation.open.PACA-AU.maf,simple_somatic_mutation.open.PACA-CA.maf --datasets ICGC-PACA-AU,ICGC-PACA-CA --genes KRAS,SMAD4,TP53,CDKN2A,ARID1A,BRCA1,BRCA2 --out_folder ../examples/MAF_summary_report_genes

This will generate summariseMAFsGenes.html report with interactive summary tables and heatmaps within MAF_summary folder. It will also create a folder with user-defined name containing output tables and plots described here.

Analysing subset of cohort - e.g. KRAS-wt

To subset full MAF file to a subset of interest, the samples IDs for the samples to be included or excluded are needed. For KRAS-wt cohort, the sample IDs for KRAS mutant samples can be extracted from atlas MAF using following commands:

Below mafInfo[[i]] is full cohort MAF
> om = maftools:::createOncoMatrix(m = mafInfo[[i]], g = c("KRAS"))
> names(om)
[1] "oncoMatrix"    "numericMatrix" "vc"
> write.table(om$numericMatrix, path = "oncomatrix.csv", sep = "\t")

It will return a list of matrices which includes a numericMatrix element which is what we're interested in. Values are integer coded for the variant classification, and those codes are explained in vc slot. The IDs in this file are for samples that have a KRAS non-synonymous mutation. These can be saved in a new file (kras_mut_ids.txt) and converted to a proper format using $ awk '{ for (i=1; i<=NF; i++) print substr($i, 2, length($i)-2) }' kras_mut_ids.txt > kras_mut.txt

Name		Name	Last commit message	Last commit date
Latest commit History 599 Commits
data		data
envm		envm
examples		examples
img		img
scripts		scripts
.gitignore		.gitignore
README.md		README.md
README_output_files.md		README_output_files.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MAF-summary

Table of contents

Installation

MAF field requirements

Scripts summary

Converting VCF files to MAF files

Note

Converting ICGC mutation format to MAF

Extracting variants within exonic regions

Extracting variants of specific classification

Changing sample names

Subsetting MAF

Merging MAFs

Summarising and visualising MAF file(s)

Note

Summarising and visualising MAF file(s) for selected genes

Analysing subset of cohort - e.g. KRAS-wt

About

Releases

Packages

Contributors 2

Languages

umccr/MAF-summary

Folders and files

Latest commit

History

Repository files navigation

MAF-summary

Table of contents

Installation

MAF field requirements

Scripts summary

Converting VCF files to MAF files

Note

Converting ICGC mutation format to MAF

Extracting variants within exonic regions

Extracting variants of specific classification

Changing sample names

Subsetting MAF

Merging MAFs

Summarising and visualising MAF file(s)

Note

Summarising and visualising MAF file(s) for selected genes

Analysing subset of cohort - e.g. KRAS-wt

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages