Steve Bond edited this page Oct 19, 2015 · 4 revisions

--group_by_prefix, -gbp

Description

Group sequences together based on a prefix (e.g., by taxa). The prefix can be a fixed number of characters at the start of each sequence ID, or it can be identified by splitting the ID on one or more characters and/or regular expressions. The groups are then written to files in the current working directory or in another pre-existing directory.

Arguments

Split pattern(s) ( regex )

Optional. One or more characters, strings, or regular expressions can be used to specify how to split IDs. The substring that precedes the first split point is used as the prefix for grouping purposes. If the ID is split multiple times by the split pattern(s), only that first substring is considered. Any records whose ID does not contain a split pattern are written to a separate file called 'Unknown'.
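The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not SeqBuddy's actual code; the function name `group_by_split` is hypothetical, and joining the patterns with `|` is an assumption about how multiple patterns are combined.

```python
import re
from collections import defaultdict


def group_by_split(seq_ids, patterns):
    """Group IDs by the text that precedes the first match of any pattern."""
    # Assumption: several patterns behave like one alternation (p1|p2|...).
    combined = re.compile("|".join(patterns))
    groups = defaultdict(list)
    for seq_id in seq_ids:
        parts = combined.split(seq_id, maxsplit=1)
        # A single-element result means no pattern matched this ID.
        key = parts[0] if len(parts) > 1 else "Unknown"
        groups[key].append(seq_id)
    return dict(groups)


ids = ["Dme~Panxδ1", "Dme~Panxδ2", "Mle-Panxα1"]
print(group_by_split(ids, ["~"]))
# {'Dme': ['Dme~Panxδ1', 'Dme~Panxδ2'], 'Unknown': ['Mle-Panxα1']}
print(group_by_split(ids, ["~", "-"]))
# {'Dme': ['Dme~Panxδ1', 'Dme~Panxδ2'], 'Mle': ['Mle-Panxα1']}
```

Each key of the returned dictionary corresponds to one output file name in the usage examples.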

If you wish to split the ID on a digit, use a regular expression pattern (e.g., "[1]") instead of a raw integer. Otherwise, the number will be interpreted as the 'Number of leading characters' argument (see example 5).
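To see why the bracketed form is needed, consider this rough sketch. The helper `classify_argument` is hypothetical and only mimics the rule stated above (a bare integer is read as a character count); how SeqBuddy actually inspects its arguments is not shown on this page.

```python
import re


def classify_argument(arg):
    """Guess how a raw command-line string would be interpreted."""
    # Assumption: any string made up solely of digits counts as an integer.
    if arg.isdigit():
        return "leading characters"
    return "split pattern"


print(classify_argument("1"))    # leading characters
print(classify_argument("[1]"))  # split pattern

# As a regular expression, "[1]" still matches the literal digit 1.
print(re.split("[1]", "Dme~Panxδ1", maxsplit=1)[0])  # Dme~Panxδ
```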

By default, the sequence IDs will be split on the '-' character. If '-' is not present in any IDs, then the first 5 characters will be used.
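The default behavior can be pictured with the following sketch. It is written from the description above rather than from SeqBuddy's source, and `default_prefixes` is a hypothetical name.

```python
def default_prefixes(seq_ids):
    """Apply the documented defaults: split on '-', else take 5 characters."""
    if any("-" in seq_id for seq_id in seq_ids):
        # At least one ID contains '-': split on it, and label the rest 'Unknown'.
        return [
            seq_id.split("-", 1)[0] if "-" in seq_id else "Unknown"
            for seq_id in seq_ids
        ]
    # No ID contains '-': fall back to the first 5 characters of each ID.
    return [seq_id[:5] for seq_id in seq_ids]


print(default_prefixes(["Dme~Panxδ1", "Mle-Panxα1"]))  # ['Unknown', 'Mle']
print(default_prefixes(["Dme~Panxδ1", "Dme~Panxδ2"]))  # ['Dme~P', 'Dme~P']
```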

To separate every record into its own file, pass in an empty split pattern, "" (see example 7).

Number of leading characters ( int )

Optional. Use only a fixed number of leading characters as the grouping criterion. This can be combined with a split pattern to prevent 'overflow' into the unique identifier part of the ID when the prefixes are of variable length (see example 6).
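Combining a split pattern with a character limit amounts to splitting first and truncating second. The sketch below assumes that order and assumes the 'Unknown' label is never truncated; `group_prefix` is a hypothetical helper, not part of SeqBuddy.

```python
import re


def group_prefix(seq_id, patterns, max_chars=None):
    """Split on the first pattern match, then cap the prefix length."""
    parts = re.split("|".join(patterns), seq_id, maxsplit=1)
    if len(parts) == 1:
        # No pattern matched; assumed to stay 'Unknown' regardless of max_chars.
        return "Unknown"
    prefix = parts[0]
    return prefix[:max_chars] if max_chars else prefix


print(group_prefix("Dme~Panxδ1", ["α", "m"], 4))  # D
print(group_prefix("Mle-Panxα1", ["α", "m"], 4))  # Mle-
```

Without the limit, the second ID would be grouped under the longer prefix "Mle-Panx".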

Output directory ( path )

Optional. By default, all new files will be written to the current working directory. If you wish to send the output elsewhere, provide a path to an existing directory (new directories will not be created for you).
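A minimal sketch of the directory rule follows. The name `resolve_out_dir` is hypothetical, and raising an error for a missing directory is an assumption made for illustration; this page only states that new directories are not created.

```python
import os


def resolve_out_dir(path=None):
    """Return the directory that output files would be written to."""
    if path is None:
        return os.getcwd()
    out_dir = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(out_dir):
        # Assumption: the directory is never created on the user's behalf.
        raise NotADirectoryError(f"Output directory does not exist: {out_dir}")
    return out_dir


print(resolve_out_dir())  # the current working directory
```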

Examples

Input file: C-terms.fa

>Dme~Panxδ1
YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPI
>Dme~Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPID
>Dme~Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFC
>Dme~Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPI
>Mle-Panxα1 cDNA - ML078817.
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFM
>Mle-Panxα5 cDNA - ML223536a.
MIYWVWAVFKRMAPFKVVTLDDRWDQMNRSFMMPLTMSFAYLIDYGIIAG
>Mle-Panxα6 cDNA - ML25993a.
MLLEILANFKGATPFKEIVLDDKWDQINRCYMFLLCVIFGTVVTFRQYTG
>Mle-Panxα9 cDNA - ML47742a.
MLDILSKFKGVTPFKGITIDDGWDQLNRSFMFVLLVVMGTTVTVRQYTGS

Usage example 1

Passing in no arguments splits on "-" by default (or uses the first 5 characters of each ID if no ID contains "-") and writes the files to the current working directory.

$: sb C-terms.fa -gbp

Output

New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Mle.fa

Usage example 2

Specify a single split character

$: sb C-terms.fa -gbp "~"

Output

New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Dme.fa

Usage example 3

Specify multiple split characters

$: sb C-terms.fa -gbp "~" "-"

Output

New file: /path/to/cwd/Mle.fa
New file: /path/to/cwd/Dme.fa

Usage example 4

Specify the number of characters to use as a prefix

$: sb C-terms.fa -gbp 9

Output

New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Mle-Panxα.fa

Usage example 5

To use a number as the split character, you will need to use regular expression syntax.

$: sb C-terms.fa -gbp "[1]"

Output

New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Mle-Panxα.fa

Usage example 6

Combine split characters with a maximum number of characters to further customize the groups.

$: sb C-terms.fa -gbp "α" "m" 4

Output

New file: /path/to/cwd/D.fa
New file: /path/to/cwd/Mle-.fa

Usage example 7

Write every single record out to its own file by passing in the empty string ""

$: sb C-terms.fa -gbp ""

Output

New file: /path/to/cwd/Dme~Panxδ1.fa
New file: /path/to/cwd/Dme~Panxδ2.fa
New file: /path/to/cwd/Dme~Panxδ3.fa
New file: /path/to/cwd/Dme~Panxδ4.fa
New file: /path/to/cwd/Mle-Panxα1.fa
New file: /path/to/cwd/Mle-Panxα5.fa
New file: /path/to/cwd/Mle-Panxα6.fa
New file: /path/to/cwd/Mle-Panxα9.fa

Usage example 8

Specify a pre-existing directory to change where the files are written.

$: sb C-terms.fa -gbp "~/foo/bar/"

Output

New file: /home/foo/bar/Unknown.fa
New file: /home/foo/bar/Mle.fa
