-
Notifications
You must be signed in to change notification settings - Fork 1
in silico mutagenesis
- 4-step in silico mutagenesis
- Input file of MENTR
- Output files from MENTR
- Additional resources for interpretation
This is an alternative way to 1-step in silico mutagenesis.
If you do not require computational speed, I recommend using the simple way.
If you want to change MENTR parameters – or customize scripts appropriate for your environment –, please utilize the below four shell scripts instead of the 1-step script: bin/quick.mutgen.sh
.
All scripts have a help file (please check by bin/aaa.sh -h
)
This script only uses a CPU core
If you can use many CPU cores (e.g., GNU parallel or some job scheduler),
please modify the last for-loop of src/insilico_mutgen_make_closest_gene_prospective.sh
that is referred from the following shell script.
bash bin/01.get.pairs.sh -i ${in_f} -o ${cmn_dir}
Maybe you do not have to customize this script
bash bin/02.collect.inputs.sh -o ${cmn_dir}
If you want to parallelize GPU computation, please modify the last for-loop of src/sh/insilico_mutgen_run.sh
that is referred from the following shell script.
bash bin/03.run.mutgen.sh -o ${cmn_dir} -m ${deepsea} -f ${hg19_fa}
Maybe you do not have to customize this script
bash bin/04.post.mutgen.sh -o ${cmn_dir}
TSV file with: col1: Chromosome number col2: Position (hg19) col3: Reference allele (hg19) col4: Alternate allele (hg19)
Default parameter setting assumes no header. See example file (./resources/example/input.txt.gz)
If you want to analyze multiallelic sites, please split them into multiple rows. For col3 and col4, SNVs and short InDels are acceptable.
in_f=./resources/example/input.txt.gz
The following files are made for each CAGE ID.
-
${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.txt.gz
- col1: CHR
- col2: POS
- col3: REF
- col4: ALT
-
${cmn_dir}/input/closest_gene/chr${CHR}/${CAGE_ID}.closestgene.gz
- col1: Variant_CHR
- col2: Variant_POS - 1
- col3: Variant_POS
- col4: TSS_CHR
- col5: TSS_POS - 1
- col6: TSS_POS
- col7; strand (MENTR do not require this for training and running in silico mutagenesis)
- col8: CAGE_ID
- col9: distance (TSS - Variant)
The pairs (${CAGE_ID}.txt.gz
and ${CAGE_ID}.closestgene.gz)
are required for running MENTR.
NOTE:
- This step requires much time, but if you can use many CPUs and change the shell script
src/sh/insilico_mutgen_make_closest_gene_prospective.sh
a little (by GNU parallel, apply JOB scheduler, etc.), you can save your time very much. - This step's speed strongly depends on your I/O.
Output files are:
[CAGE cluster ID]_{diff/refs/alts}.txt.gz
- diff: mutation effect (Δprobability; abs(Δprobability)≧0.1 is robust and abs(Δprobability)≧0.05 is permissive)
- refs: expression probability (REF allele, hg19)
- alts: expression probability (ALT allele)
Each file has ...
- index: row index
- CHR: chromosome number
- POS: position (hg19)
- REF: reference allele
- ALT: alternative allele (effect allele of insilico mutagenesis)
- dist: distance between POS and CAGE peak position (from closest-features software)
- gene (Unique name of CAGE cluster)
- strand (strand of CAGE cluster; enhancer -> N)
- ref_matched (if REF is truely reference allele in hg19, return True; if it is False, please refrain from using the record, because it is not accurate estimates due to complex analysis conditions.)
- CL:0000034 ... LCL154 (ML models, 347 F5 ontology + LCL) -> values of [diff/refs/alts]
You can get all_qcd_res.txt.gz
, whose header is the same with each file.
Correspondence tables for sample ID and CAGE ID are available from FANTOM5 website:
Especially,
- Sample ID information (supp_table_10.sample_ontology_information.tsv) is available in https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/supp_table/ (Hon et al., Nature 2017)
- CAGE ID information (F5.cage_cluster.hg19.info.tsv, made by Chung Chau HON) is available in the
resources
directory.