in silico mutagenesis

4-step in silico mutagenesis

This is an alternative way to 1-step in silico mutagenesis.

If you do not require computational speed, I recommend using the simple way.

If you want to change MENTR parameters – or customize scripts appropriate for your environment –, please utilize the below four shell scripts instead of the 1-step script: bin/quick.mutgen.sh.

All scripts have a help file (please check by bin/aaa.sh -h)

Step1: Preprocessing: get variant-promoter/enhancer pairs

This script only uses a CPU core

If you can use many CPU cores (e.g., GNU parallel or some job scheduler), please modify the last for-loop of src/insilico_mutgen_make_closest_gene_prospective.sh that is referred from the following shell script.

bash bin/01.get.pairs.sh -i ${in_f} -o ${cmn_dir}

Step2: Preprocessing: collect information of input files

Maybe you do not have to customize this script

bash bin/02.collect.inputs.sh -o ${cmn_dir}

Step3: Run in silico mutagenesis

If you want to parallelize GPU computation, please modify the last for-loop of src/sh/insilico_mutgen_run.sh that is referred from the following shell script.

bash bin/03.run.mutgen.sh -o ${cmn_dir} -m ${deepsea} -f ${hg19_fa}

Step4: Postprocessing: Check output files and collect all results in a file.

Maybe you do not have to customize this script

bash bin/04.post.mutgen.sh -o ${cmn_dir}

Input file of MENTR

TSV file with: col1: Chromosome number col2: Position (hg19) col3: Reference allele (hg19) col4: Alternate allele (hg19)

Default parameter setting assumes no header. See example file (./resources/example/input.txt.gz)

If you want to analyze multiallelic sites, please split them into multiple rows. For col3 and col4, SNVs and short InDels are acceptable.

in_f=./resources/example/input.txt.gz

Output files from MENTR

Preprocessing: get variant-promoter/enhancer pairs

The following files are made for each CAGE ID.

${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.txt.gz
- col1: CHR
- col2: POS
- col3: REF
- col4: ALT
${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.closestgene.gz
- col1: Variant_CHR
- col2: Variant_POS - 1
- col3: Variant_POS
- col4: TSS_CHR
- col5: TSS_POS - 1
- col6: TSS_POS
- col7; strand (MENTR do not require this for training and running in silico mutagenesis)
- col8: CAGE_ID
- col9: distance (TSS - Variant)

The pairs (${CAGE_ID}.txt.gz and ${CAGE_ID}.closestgene.gz) are required for running MENTR.

NOTE:

This step requires much time, but if you can use many CPUs and change the shell script src/sh/insilico_mutgen_make_closest_gene_prospective.sh a little (by GNU parallel, apply JOB scheduler, etc.), you can save your time very much.
This step's speed strongly depends on your I/O.

Results of in silico mutagenesis

Output files are:

[CAGE cluster ID]_{diff/refs/alts}.txt.gz

diff: mutation effect (Δprobability; abs(Δprobability)≧0.1 is robust and abs(Δprobability)≧0.05 is permissive)
refs: expression probability (REF allele, hg19)
alts: expression probability (ALT allele)

Each file has ...

index: row index
CHR: chromosome number
POS: position (hg19)
REF: reference allele
ALT: alternative allele (effect allele of insilico mutagenesis)
dist: distance between POS and CAGE peak position (from closest-features software)
gene (Unique name of CAGE cluster)
strand (strand of CAGE cluster; enhancer -> N)
ref_matched (if REF is truely reference allele in hg19, return True; if it is False, please refrain from using the record, because it is not accurate estimates due to complex analysis conditions.)
CL:0000034 ... LCL154 (ML models, 347 F5 ontology + LCL) -> values of [diff/refs/alts]

Results from postprocessing

You can get all_qcd_res.txt.gz, whose header is the same with each file.

Additional resources for interpretation

Correspondence tables for sample ID and CAGE ID are available from FANTOM5 website:

Especially,

Sample ID information (supp_table_10.sample_ontology_information.tsv) is available in https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/supp_table/ (Hon et al., Nature 2017)
CAGE ID information (F5.cage_cluster.hg19.info.tsv, made by Chung Chau HON) is available in the resources directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly