Usage info
Graham Larue edited this page May 25, 2020
3 revisions
To get full usage information for intronIC, do intronIC --help, which will output the following:
usage: intronIC [-h] [-g GENOME] [-a ANNOTATION] -n SPECIES_NAME
[-q SEQUENCE_FILE] [-f {cds,exon}] [-s] [--no_nc] [-i] [-v]
[-m {matrix file} [{matrix file} ...]]
[--r12 {reference U12 intron sequences}]
[--r2 {reference U2 intron sequences}] [--no_plot]
[--format_info] [-d] [-u] [--na] [-t 0-100] [--ns]
[--5c start stop] [--3c start stop] [--bpc start stop]
[-r {five,bp,three} [{five,bp,three} ...]] [--afn]
[--recursive] [--n_subsample N_SUBSAMPLE]
[--cv_processes CV_PROCESSES] [-p PROCESSES]
[--matrix_score_info] [-C HYPERPARAMETER_C]
[--min_intron_len MIN_INTRON_LEN] [--pseudocount PSEUDOCOUNT]
[--exons_as_flanks] [-b BED_FILE]
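As a rough illustration of the synopsis above, the following Python sketch assembles a basic command line from the -g, -a and -n flags (with -p as an optional extra). The helper name build_intronic_command and the input file names are hypothetical and used only for illustration; the sketch builds and prints the command rather than running it.

```python
import shlex


def build_intronic_command(genome, annotation, species_name, processes=None):
    """Assemble an intronIC argument list from flags shown in the synopsis.

    This helper is an illustrative assumption, not part of intronIC itself.
    """
    cmd = ["intronIC", "-g", genome, "-a", annotation, "-n", species_name]
    if processes is not None:
        # -p PROCESSES appears in the synopsis as an optional flag
        cmd += ["-p", str(processes)]
    return cmd


# Hypothetical input paths, for illustration only
cmd = build_intronic_command(
    "genome.fa.gz", "annotation.gff3.gz", "Homo_sapiens", processes=4
)
print(shlex.join(cmd))
```

If intronIC is installed and on the PATH, the resulting list could be passed to subprocess.run(cmd, check=True) to launch the run.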
intronIC (intron Interrogator and Classifier) is a script which collects all
of the annotated introns found in a genome/annotation file pair, and produces
a variety of output files (*.iic) which describe the annotated introns and
(optionally) their similarity to known U12 sequences. Without the '-m' flag,
there MUST exist a matrix file in the 'intronIC_data' subdirectory in the same
parent directory as intronIC.py, with filename 'scoring_matrices.fasta.iic'.
In the same data directory, there must also be a pair of sequence files (see
--format_info) with reference intron sequences named '[u2,
optional arguments:
-h, --help show this help message and exit
-f {cds,exon}, --feature {cds,exon}
Specify feature to use to define introns. By default,
intronIC will identify all introns uniquely defined by
both CDS and exon features. Under the default mode,
introns defined by exon features only will be
demarcated by an '[e]' tag (default: None)
-s, --sequences_only Bypass the scoring system and simply report the intron
sequences present in the annotations (default: False)
--no_nc Omit introns with non-canonical terminal dinucleotides
from scoring (default: False)
-i, --allow_multiple_isoforms
Include non-duplicate introns from isoforms other than
the longest in the scored intron set (default: False)
-v, --allow_intron_overlap
Allow introns with boundaries that overlap other
introns from higher-priority transcripts (longer
coding length, etc.) to be included. This will
include, for instance, introns with alternative 5′/3′
boundaries (default: False)
-m {matrix file} [{matrix file} ...], --matrices {matrix file} [{matrix file} ...]
One or more matrices to use in place of the defaults.
Must follow the formatting described by the
--format_info option (default: None)
--r12 {reference U12 intron sequences}, --reference_u12s {reference U12 intron sequences}
introns.iic file with custom reference introns to be
used for setting U12 scoring expectation, including
flanking regions (default: None)
--r2 {reference U2 intron sequences}, --reference_u2s {reference U2 intron sequences}
introns.iic file with custom reference introns to be
used for setting U12 scoring expectation, including
flanking regions (default: None)
--no_plot Do not output illustrations of intron
scores/distributions (plotting requires matplotlib)
(default: False)
--format_info Print information about the system files required by
this script (default: False)
-d, --include_duplicates
Include introns with duplicate coordinates in the
intron seqs file (default: False)
-u, --uninformative_naming
Use a simple naming scheme for introns instead of the
verbose, metadata-laden default format (default:
--na, --no_abbreviate
Use the provided species name in full within the
output files (default: False)
-t 0-100, --threshold 0-100
Threshold value of the SVM-calculated probability of
being a U12 to determine output statistics (default:
--ns, --no_sequence_output
Do not create a file with the full intron sequences of
all annotated introns (default: False)
--5c start stop, --five_score_coords start stop
Coordinates describing the 5' sequence to be scored,
relative to the 5' splice site (e.g. position 0 is the
first base of the intron); half-closed interval
[start, stop) (default: (-3, 9))
--3c start stop, --three_score_coords start stop
Coordinates describing the 3' sequence to be scored,
relative to the 3' splice site (e.g. position -1 is
the last base of the intron); half-closed interval
(start, stop] (default: (-10, 4))
--bpc start stop, --branch_point_coords start stop
Coordinates describing the region to search for branch
point sequences, relative to the 3' splice site (e.g.
position -1 is the last base of the intron); half-
closed interval [start, stop). (default: (-55, -5))
-r {five,bp,three} [{five,bp,three} ...], --scoring_regions {five,bp,three} [{five,bp,three} ...]
Intron sequence regions to include in intron score
calculations. (default: ('five', 'bp'))
--afn, --abbreviate_filenames
Use abbreviated species name when creating output
filenames. (default: False)
--recursive Generate new scoring matrices and training data using
confident U12s from the first scoring pass. This
option may produce better results in species distantly
related to the species upon which the training
data/matrices are based, though beware accidental
training on false positives. Recommended only in cases
where clear separation between types is seen with
default data. (default: False)
--n_subsample N_SUBSAMPLE
Number of sub-samples to use to generate SVM
classifiers; 0 uses the entire training set and should
provide the best results; otherwise, higher values
will better approximate the entire set at the expense
of speed. (default: 0)
--cv_processes CV_PROCESSES
Number of parallel processes to use during cross-
validation (default: None)
-p PROCESSES
Number of parallel processes to use for scoring (and
cross-validation, unless --cv_processes is also set)
(default: 1)
--matrix_score_info Produce additional per-matrix raw score information
for each intron (default: False)
-C HYPERPARAMETER_C
Provide the value for hyperparameter C directly
(bypasses optimized parameter search) (default: None)
--min_intron_len MIN_INTRON_LEN
Minimum intron length to consider for scoring
(default: 30)
--pseudocount PSEUDOCOUNT
Pseudocount value to add to each matrix value to avoid
0-div errors (default: 0.0001)
--exons_as_flanks Use entire up/downstream exonic sequence as flank
sequence in output (default: False)
-b BED_FILE
Supply intron coordinates in BED format (default:
required arguments (-g, -a | -q):
-g GENOME, --genome GENOME
Genome file in FASTA format (gzip compatible)
(default: None)
-a ANNOTATION
Annotation file in gff/gff3/gtf format (gzip
compatible) (default: None)
-n SPECIES_NAME, --species_name SPECIES_NAME
Binomial species name, used in output file and intron
label formatting. It is recommended to include at
least the first letter of the species, and the full
genus name since intronIC (by default) abbreviates the
provided name in its output (e.g. Homo_sapiens -->
HomSap) (default: None)
-q SEQUENCE_FILE
Provide intron sequences directly, rather than using a
genome/annotation combination. Must follow the
introns.iic format (see README for description)
(default: None)
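The half-closed coordinate conventions used by --5c and --bpc in the help text above can be easier to follow with a concrete example. The Python sketch below is one possible reading of those descriptions, not intronIC's own code: the function names and the toy sequences are made up for illustration, and the defaults are copied from the help text.

```python
def five_prime_region(upstream_flank, intron, start=-3, stop=9):
    """Return the [start, stop) window around the 5' splice site.

    Position 0 is the first base of the intron, so negative positions
    fall in the upstream (exonic) flank.
    """
    offset = len(upstream_flank)
    if offset + start < 0:
        raise ValueError("upstream flank is too short for the requested start")
    joined = upstream_flank + intron
    return joined[offset + start:offset + stop]


def branch_point_window(intron, start=-55, stop=-5):
    """Return the [start, stop) window counted back from the 3' splice site.

    Position -1 is the last base of the intron, which matches Python's
    negative indexing, so a plain slice is enough.
    """
    if not start < stop < 0:
        raise ValueError("expected start < stop < 0")
    return intron[start:stop]


# Toy sequences, not real introns
upstream = "AAACAG"
intron = "GTAAGTATCC" + "T" * 50 + "CACAG"

print(five_prime_region(upstream, intron))   # 3 exonic + 9 intronic bases
print(len(branch_point_window(intron)))      # 50-base search window
```

With the defaults, the 5' window covers the last three exonic bases plus the first nine intronic bases, and the branch point search window spans 50 bases ending five bases short of the 3' splice site. The --3c option is left out of the sketch because its interval is closed on the opposite side, (start, stop].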