SB Group by prefix
Group sequences together based on a prefix (e.g., by taxa). The prefix can be a set number of characters on the front of each sequence ID, or can be identified by splitting the entire ID on certain characters and/or regular expressions. The groups are then written to files in the current working directory or some other pre-existing directory.
Optional. One or more characters, strings, or regular expressions can be used to specify how to split IDs. The substring that precedes the first split point is used as the prefix for grouping purposes. If the ID is split multiple times by the split pattern(s), only that first substring is considered. Any records whose ID does not contain a split pattern are sent to a separate file called 'Unknown'.
If you wish to split the ID on a digit, use a regular expression pattern (e.g., "[1]") instead of a raw integer. Otherwise the number will be interpreted as the 'Number of leading characters' argument (see the example on using a number as the split character).
By default, the sequence IDs will be split on the '-' character. If '-' is not present in any IDs, then the first 5 characters will be used.
To separate every record into its own file, pass in the empty split pattern "" (see the example that passes in the empty string).
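The splitting rules above can be illustrated with a short Python sketch. This is not SeqBuddy's actual implementation: the function names `split_prefix` and `default_prefixes` are hypothetical, and the sketch assumes that every pattern is treated as a regular expression and that the earliest match in the ID decides the split point.

```python
import re


def split_prefix(seq_id, patterns):
    """Hypothetical helper: text before the earliest pattern match, or None."""
    if "" in patterns:
        # Empty pattern: the whole ID becomes its own group
        return seq_id
    # Assumption: patterns are regexes, combined so the earliest match wins
    regex = re.compile("|".join("(?:%s)" % p for p in patterns))
    match = regex.search(seq_id)
    return None if match is None else seq_id[:match.start()]


def default_prefixes(seq_ids):
    """Default rule from the text: split on '-', else use the first 5 characters."""
    if any("-" in seq_id for seq_id in seq_ids):
        prefixes = {}
        for seq_id in seq_ids:
            prefix = split_prefix(seq_id, ["-"])
            # IDs without the split character go to the 'Unknown' group
            prefixes[seq_id] = "Unknown" if prefix is None else prefix
        return prefixes
    return {seq_id: seq_id[:5] for seq_id in seq_ids}
```

Under these assumptions, `split_prefix("Dme~Panxδ1", ["~", "-"])` gives `"Dme"`, and `split_prefix("Dme~Panxδ1", ["[1]"])` gives `"Dme~Panxδ"`.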
Optional. Use only a fixed number of leading characters as the grouping criterion. This can be combined with a split pattern to prevent 'overflow' into the unique identifier part of the ID if the prefixes are of variable length (see the example that combines split characters with a maximum number of characters).
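As a rough illustration of how a split pattern and a character limit might interact, the sketch below splits first and then truncates the result. The function `group_key` and its parameters are hypothetical, and the order of operations (split, then truncate) is an assumption inferred from the examples on this page rather than taken from the SeqBuddy source.

```python
import re


def group_key(seq_id, patterns=(), num_chars=None):
    """Hypothetical helper: derive a group name from a sequence ID."""
    key = seq_id
    if patterns:
        # Assumption: patterns are regexes and the earliest match wins
        regex = re.compile("|".join("(?:%s)" % p for p in patterns))
        match = regex.search(seq_id)
        if match is None:
            return "Unknown"
        key = seq_id[:match.start()]
    if num_chars is not None:
        # Cap the prefix so it cannot run into the unique part of the ID
        key = key[:num_chars]
    return key
```

With the sample IDs, `group_key("Mle-Panxα1", ["α", "m"], 4)` gives `"Mle-"` and `group_key("Dme~Panxδ1", ["α", "m"], 4)` gives `"D"`.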
Optional. By default, all new files will be written to the current working directory. If you wish to send the output elsewhere, provide a path to an existing directory (new directories will not be created for you).
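The "existing directory only" rule can be pictured with the following sketch. The function `output_path` and the `.fa` extension default are assumptions for illustration; the real tool may resolve paths and choose file extensions differently.

```python
import os


def output_path(out_dir, prefix, ext="fa"):
    """Hypothetical helper: build the path of a group file in an existing directory."""
    out_dir = os.path.abspath(os.path.expanduser(out_dir))
    if not os.path.isdir(out_dir):
        # Mirror the documented behavior: never create missing directories
        raise NotADirectoryError("Output directory does not exist: %s" % out_dir)
    return os.path.join(out_dir, "%s.%s" % (prefix, ext))
```

Calling `output_path(os.getcwd(), "Mle")` returns the path of `Mle.fa` inside the current working directory, while a path to a missing directory raises an error instead of creating it.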
>Dme~Panxδ1
YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPI
>Dme~Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPID
>Dme~Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFC
>Dme~Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPI
>Mle-Panxα1 cDNA - ML078817.
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFM
>Mle-Panxα5 cDNA - ML223536a.
MIYWVWAVFKRMAPFKVVTLDDRWDQMNRSFMMPLTMSFAYLIDYGIIAG
>Mle-Panxα6 cDNA - ML25993a.
MLLEILANFKGATPFKEIVLDDKWDQINRCYMFLLCVIFGTVVTFRQYTG
>Mle-Panxα9 cDNA - ML47742a.
MLDILSKFKGVTPFKGITIDDGWDQLNRSFMFVLLVVMGTTVTVRQYTGS
Passing in no arguments uses "-" as the default split character (or the first 5 characters of each ID if "-" is not present in any ID) and writes the files to the current working directory.
$: sb C-terms.fa -gbp
New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Mle.fa
Specify a single split character
$: sb C-terms.fa -gbp "~"
New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Dme.fa
Specify multiple split characters
$: sb C-terms.fa -gbp "~" "-"
New file: /path/to/cwd/Mle.fa
New file: /path/to/cwd/Dme.fa
Specify the number of characters to use as a prefix
$: sb C-terms.fa -gbp 9
New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Mle-Panxα.fa
To use a number as the split character, you will need to use regular expression syntax.
$: sb C-terms.fa -gbp "[1]"
New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Mle-Panxα.fa
Combine split characters and maximum number of characters to further customize the groups
$: sb C-terms.fa -gbp "α" "m" 4
New file: /path/to/cwd/D.fa
New file: /path/to/cwd/Mle-.fa
Write every single record out to its own file by passing in the empty string ""
$: sb C-terms.fa -gbp ""
New file: /path/to/cwd/Dme~Panxδ1.fa
New file: /path/to/cwd/Dme~Panxδ2.fa
New file: /path/to/cwd/Dme~Panxδ3.fa
New file: /path/to/cwd/Dme~Panxδ4.fa
New file: /path/to/cwd/Mle-Panxα1.fa
New file: /path/to/cwd/Mle-Panxα5.fa
New file: /path/to/cwd/Mle-Panxα6.fa
New file: /path/to/cwd/Mle-Panxα9.fa
Specify a pre-existing folder to change where the files are written
$: sb C-terms.fa -gbp "~/foo/bar/"
New file: /home/foo/bar/Unknown.fa
New file: /home/foo/bar/Mle.fa