Skip to content

in silico mutagenesis

Masaru Koido edited this page Feb 7, 2023 · 2 revisions

TOC



4-step in silico mutagenesis

This is an alternative way to 1-step in silico mutagenesis.

If you do not require computational speed, I recommend using the simple way.

If you want to change MENTR parameters – or customize scripts appropriate for your environment –, please utilize the below four shell scripts instead of the 1-step script: bin/quick.mutgen.sh.

All scripts have a help file (please check by bin/aaa.sh -h)

Step1: Preprocessing: get variant-promoter/enhancer pairs

This script only uses a CPU core

If you can use many CPU cores (e.g., GNU parallel or some job scheduler), please modify the last for-loop of src/insilico_mutgen_make_closest_gene_prospective.sh that is referred from the following shell script.

bash bin/01.get.pairs.sh -i ${in_f} -o ${cmn_dir}

Step2: Preprocessing: collect information of input files

Maybe you do not have to customize this script

bash bin/02.collect.inputs.sh -o ${cmn_dir}

Step3: Run in silico mutagenesis

If you want to parallelize GPU computation, please modify the last for-loop of src/sh/insilico_mutgen_run.sh that is referred from the following shell script.

bash bin/03.run.mutgen.sh -o ${cmn_dir} -m ${deepsea} -f ${hg19_fa}

Step4: Postprocessing: Check output files and collect all results in a file.

Maybe you do not have to customize this script

bash bin/04.post.mutgen.sh -o ${cmn_dir}

Input file of MENTR

TSV file with: col1: Chromosome number col2: Position (hg19) col3: Reference allele (hg19) col4: Alternate allele (hg19)

Default parameter setting assumes no header. See example file (./resources/example/input.txt.gz)

If you want to analyze multiallelic sites, please split them into multiple rows. For col3 and col4, SNVs and short InDels are acceptable.

in_f=./resources/example/input.txt.gz

Output files from MENTR

Preprocessing: get variant-promoter/enhancer pairs

The following files are made for each CAGE ID.

  • ${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.txt.gz
    • col1: CHR
    • col2: POS
    • col3: REF
    • col4: ALT
  • ${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.closestgene.gz
    • col1: Variant_CHR
    • col2: Variant_POS - 1
    • col3: Variant_POS
    • col4: TSS_CHR
    • col5: TSS_POS - 1
    • col6: TSS_POS
    • col7; strand (MENTR do not require this for training and running in silico mutagenesis)
    • col8: CAGE_ID
    • col9: distance (TSS - Variant)

The pairs (${CAGE_ID}.txt.gz and ${CAGE_ID}.closestgene.gz) are required for running MENTR.

NOTE:

  • This step requires much time, but if you can use many CPUs and change the shell script src/sh/insilico_mutgen_make_closest_gene_prospective.sh a little (by GNU parallel, apply JOB scheduler, etc.), you can save your time very much.
  • This step's speed strongly depends on your I/O.

Results of in silico mutagenesis

Output files are:

[CAGE cluster ID]_{diff/refs/alts}.txt.gz

  • diff: mutation effect (Δprobability; abs(Δprobability)≧0.1 is robust and abs(Δprobability)≧0.05 is permissive)
  • refs: expression probability (REF allele, hg19)
  • alts: expression probability (ALT allele)

Each file has ...

  • index: row index
  • CHR: chromosome number
  • POS: position (hg19)
  • REF: reference allele
  • ALT: alternative allele (effect allele of insilico mutagenesis)
  • dist: distance between POS and CAGE peak position (from closest-features software)
  • gene (Unique name of CAGE cluster)
  • strand (strand of CAGE cluster; enhancer -> N)
  • ref_matched (if REF is truely reference allele in hg19, return True; if it is False, please refrain from using the record, because it is not accurate estimates due to complex analysis conditions.)
  • CL:0000034 ... LCL154 (ML models, 347 F5 ontology + LCL) -> values of [diff/refs/alts]

Results from postprocessing

You can get all_qcd_res.txt.gz, whose header is the same with each file.


Additional resources for interpretation

Correspondence tables for sample ID and CAGE ID are available from FANTOM5 website:

Especially,