-
Notifications
You must be signed in to change notification settings - Fork 21
Generating Mutational Spectra
Yo Akiyama edited this page Jun 27, 2023
·
4 revisions
Generating mutational signatures using signatureanalyzer
. For a comprehensive description of mutational signatures, their relevance, and references, please see the Catalogue of Somatic Mutations in Cancer, or COSMIC, here, as well as The repertoire of mutational signatures in human cancer. The following document is a reference for easy creation of spectra using .mafs
.
For more information about the .maf
format, see here. The following examples show generating mutational spectra from an hg19
.maf
using signatureanalyzer
. First, load your maf
as a pandas dataframe.
The following columns are required:
Tumor_Sample_Barcode
Chromosome
Start_Position
Reference_Allele
Tumor_Seq_Allele2
Variant_Type
The following columns are optional:
-
ref_context
: will map context if not provided
import signatureanalyzer as sa
import pandas as pd
maf_df = pd.read_csv(<PATH_TO_MAF>, sep='\t').loc[:,[
'Hugo_Symbol',
'Tumor_Sample_Barcode',
'Chromosome',
'Start_Position',
'Reference_Allele',
'Tumor_Seq_Allele2',
'Variant_Type'
]]
print(maf_df.head())
Hugo_Symbol | Tumor_Sample_Barcode | Chromosome | Start_Position | Reference_Allele | Tumor_Seq_Allele2 | Variant_Type | |
---|---|---|---|---|---|---|---|
0 | URGCP | sample_192 | 7 | 43916856 | C | A | SNP |
1 | CLCN1 | sample_127 | 7 | 143048832 | C | A | SNP |
2 | NAV2 | sample_354 | 11 | 20113762 | A | G | SNP |
3 | TUBB8P7 | sample_32 | 16 | 90162224 | G | T | SNP |
... | CHFR | sample_35 | 12 | 133438086 | C | A | SNP |
- This encodes the 96-base context
-
note: two forms of this exist - either input should work
- word: ACAG --> (REF)(MUT)(LEFT NT)(RIGHT NT)
- arrow: A[A>C]G --> (LEFT NT)(REF)>(MUT)(RIGHT NT)
- REQUIRES a 2-bit human genome build
_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_exome', hgfile='hg19.2bit')
print(spectra_sbs.head().iloc[:,:5])
Arrow
sample_551 | sample_135 | sample_118 | sample_191 | sample_124 | |
---|---|---|---|---|---|
A[A>C]A | 1 | 2 | 0 | 1 | 2 |
A[A>C]C | 0 | 2 | 0 | 0 | 0 |
A[A>C]G | 0 | 6 | 2 | 2 | 0 |
A[A>C]T | 0 | 4 | 0 | 0 | 2 |
... | 1 | 2 | 0 | 1 | 0 |
or
Word
context96.word | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
---|---|---|---|---|---|
ACAA | 1 | 2 | 0 | 1 | 2 |
ACAC | 0 | 2 | 0 | 0 | 0 |
ACAG | 0 | 6 | 2 | 2 | 0 |
ACAT | 0 | 4 | 0 | 0 | 2 |
... | 1 | 2 | 0 | 1 | 0 |
- This encodes the 78-base context
_,spectra_dbs = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_DBS')
print(spectra_dbs.head())
context78.word | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
AC>CA | 0 | 0 | 0 | 0 | 0 |
AC>CG | 0 | 0 | 0 | 0 | 0 |
AC>CT | 0 | 0 | 0 | 0 | 0 |
AC>GA | 0 | 0 | 0 | 0 | 0 |
... | 0 | 0 | 0 | 0 | 0 |
- This encodes the 83-base context
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='cosmic3_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context83.word | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
Cdel1 | 0 | 1 | 3 | 1 | 0 |
Cdel2 | 0 | 4 | 3 | 6 | 0 |
Cdel3 | 2 | 2 | 2 | 9 | 0 |
Cdel4 | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 1536-base context
-
note: two forms of this exist - either input should work
- word: ACAGAT --> (REF)(MUT)(L-2)(L-1)(R+1)(R+2)
- arrow: AA[A>C]GT --> (L-2)(L-1)(REF)>(ALT)(R+2)
- REQUIRES a 2-bit human genome build
_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS', hgfile='hg19.2bit')
print(spectra_sbs.head().iloc[:,:5])
Arrow
sample_551 | sample_135 | sample_118 | sample_191 | sample_124 | |
---|---|---|---|---|---|
AA[T>A]AA | 1 | 2 | 0 | 1 | 2 |
AA[T>A]AC | 0 | 2 | 0 | 0 | 0 |
AA[T>A]AG | 0 | 6 | 2 | 2 | 0 |
AA[T>A]AT | 0 | 4 | 0 | 0 | 2 |
... | 1 | 2 | 0 | 1 | 0 |
or
Word
context1536.word | sample_551 | sample_135 | sample_118 | sample_191 | sample_124 |
---|---|---|---|---|---|
TAAAAA | 1 | 2 | 0 | 1 | 2 |
TAAAAC | 0 | 2 | 0 | 0 | 0 |
TAAAAG | 0 | 6 | 2 | 2 | 0 |
TAAAAT | 0 | 4 | 0 | 0 | 2 |
... | 1 | 2 | 0 | 1 | 0 |
- This encodes the 1536-base + 78-base + 83-base context
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_COMPOSITE', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 0 | 0 | 0 |
AC>CA | 0 | 0 | 0 | 3 | 0 |
AC>CG | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
Cdel1 | 0 | 1 | 3 | 1 | 0 |
Cdel2 | 0 | 4 | 3 | 6 | 0 |
Cdel3 | 2 | 2 | 2 | 9 | 0 |
Cdel4 | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 96-base + 78-base + 83-base context
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_COMPOSITE96', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
A[C>A]A | 0 | 1 | 3 | 1 | 0 |
A[C>A]C | 0 | 4 | 3 | 6 | 0 |
A[C>A]G | 2 | 2 | 2 | 9 | 0 |
A[C>A]T | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 0 | 0 | 0 |
AC>CA | 0 | 0 | 0 | 3 | 0 |
AC>CG | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
Cdel1 | 0 | 1 | 3 | 1 | 0 |
Cdel2 | 0 | 4 | 3 | 6 | 0 |
Cdel3 | 2 | 2 | 2 | 9 | 0 |
Cdel4 | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 1536-base + 83-base context
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
Cdel1 | 0 | 1 | 3 | 1 | 0 |
Cdel2 | 0 | 4 | 3 | 6 | 0 |
Cdel3 | 2 | 2 | 2 | 9 | 0 |
Cdel4 | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 96-base + 83-base context
- REQUIRES a 2-bit human genome build
- Either SBS form should work (Arrow vs Word)
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='pcawg_SBS96_ID', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
A[C>A]A | 0 | 1 | 3 | 1 | 0 |
A[C>A]C | 0 | 4 | 3 | 6 | 0 |
A[C>A]G | 2 | 2 | 2 | 9 | 0 |
A[C>A]T | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 0 | 0 | 0 |
Cdel1 | 0 | 1 | 3 | 1 | 0 |
Cdel2 | 0 | 4 | 3 | 6 | 0 |
Cdel3 | 2 | 2 | 2 | 9 | 0 |
Cdel4 | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
- This encodes the 1536-base + 8-base context
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='polymerase_msi', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
AA[T>A]AA | 0 | 1 | 3 | 1 | 0 |
AA[T>A]AC | 0 | 4 | 3 | 6 | 0 |
AA[T>A]AG | 2 | 2 | 2 | 9 | 0 |
AA[T>A]AT | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
INS4 | 0 | 1 | 3 | 1 | 0 |
DEL1 | 0 | 4 | 3 | 6 | 0 |
DEL3 | 2 | 2 | 2 | 9 | 0 |
DEL4 | 0 | 0 | 0 | 3 | 0 |
- This encodes the 1536-base + 8-base context
- REQUIRES a 2-bit human genome build
_,spectra_id = sa.spectra.get_spectra_from_maf(maf_df, reference='polymerase_msi96', hgfile='hg19.2bit')
print(spectra_id.head().iloc[:,:5])
context.pcawg | sample_0 | sample_1 | sample_2 | sample_3 | sample_4 |
---|---|---|---|---|---|
A[C>A]A | 0 | 1 | 3 | 1 | 0 |
A[C>A]C | 0 | 4 | 3 | 6 | 0 |
A[C>A]G | 2 | 2 | 2 | 9 | 0 |
A[C>A]T | 0 | 0 | 0 | 3 | 0 |
... | 0 | 0 | 1 | 3 | 0 |
INS4 | 0 | 1 | 3 | 1 | 0 |
DEL1 | 0 | 4 | 3 | 6 | 0 |
DEL3 | 2 | 2 | 2 | 9 | 0 |
DEL4 | 0 | 0 | 0 | 3 | 0 |