SYNOPSIS
metacache build <database> <sequence file/directory>... [OPTION]...
metacache build <database> [OPTION]... <sequence file/directory>...
DESCRIPTION
Create a new database of reference sequences (usually genomic sequences).
REQUIRED PARAMETERS
<database> database file name;
A MetaCache database contains taxonomic information and
min-hash signatures of reference sequences (complete
genomes, scaffolds, contigs, ...).
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences
(complete genomes, scaffolds, contigs, ...) that shall
be used as representatives of an organism/taxon.
If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
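The depth-limited directory search described above can be pictured with the following minimal Python sketch. It is not MetaCache's implementation: the function name find_sequence_files is hypothetical, the way levels are counted (the given directory is level 0) is an assumption, and no filtering by file extension is attempted because the help text does not say whether one is applied.

```python
import os

MAX_DEPTH = 10  # value from the help text; how levels are counted is an assumption


def find_sequence_files(root, max_depth=MAX_DEPTH):
    """Return all regular files under root, descending at most max_depth levels.

    Level 0 is root itself. No extension filter is applied (assumption).
    """
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # prune: do not descend below this directory
        for name in sorted(filenames):
            found.append(os.path.join(dirpath, name))
    return found
```

Clearing dirnames in place is what stops os.walk from descending further, so files deeper than the limit are never visited.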
BASIC OPTIONS
-taxonomy <path> directory with taxonomic hierarchy data (see NCBI's
taxonomic data files)
-taxpostmap <file>
Files with sequence-to-taxon-id mappings that are used as
an alternative source in a post-processing step.
default: 'nucl_(gb|wgs|est|gss).accession2taxid'
-sequence-id-format (smart|ncbi|gi|filename|leadingword)
Method used for extracting sequence IDs from filenames and
sequence headers. Sequence IDs are also used to assign taxa
to reference sequences.
Available types are:
smart : try NCBI > GenBank > filename
ncbi : NCBI-style accession/accession.version
gi : GenBank identifier
filename : filename without extension
leadingword : first stretch of non-whitespace characters
default: smart
-silent|-verbose information level during build:
silent => none / verbose => most detailed
default: neither => only errors/important info
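To make the '-sequence-id-format' types more concrete, here is a rough Python sketch of what each one extracts. All function names are hypothetical, and the two regular expressions are simplified assumptions about what an accession or a 'gi|<number>' field looks like; MetaCache's actual matching rules may differ.

```python
import os
import re

# Assumed pattern for accession or accession.version, e.g. "NC_000913.3".
ACCESSION = re.compile(r"\b([A-Z]{1,4}_?[0-9]{5,}(?:\.[0-9]+)?)\b")
# Assumed pattern for a GenBank identifier field, e.g. "gi|556503834|".
GI = re.compile(r"gi\|([0-9]+)")


def id_leadingword(header):
    """First stretch of non-whitespace characters of a header line."""
    text = header.lstrip(">@").strip()
    return text.split()[0] if text else ""


def id_filename(path):
    """File name without directory and without its last extension."""
    return os.path.splitext(os.path.basename(path))[0]


def id_ncbi(header):
    """First accession-like token in the header, or '' if none is found."""
    match = ACCESSION.search(header)
    return match.group(1) if match else ""


def id_gi(header):
    """Number following 'gi|' in the header, or '' if none is found."""
    match = GI.search(header)
    return match.group(1) if match else ""


def id_smart(header, path):
    """Fallback chain modelled on 'smart': accession, then gi, then file name."""
    return id_ncbi(header) or id_gi(header) or id_filename(path)
```

For a header such as '>NC_000913.3 Escherichia coli' the accession is picked up directly; for a header without any recognisable ID the chain falls back to the file name.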
SKETCHING (SUBSAMPLING)
-kmerlen <k> number of nucleotides/characters in a k-mer
default: 16
-sketchlen <s> number of features (k-mer hashes) per sampling window
default: 16
-winlen <w> number of letters in each sampling window
default: 127
-winstride <l> distance between window starting positions
default: 112 (w-k+1)
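The relationship between these four parameters can be worked through in a few lines of Python. With the default stride w-k+1 = 127-16+1 = 112, neighbouring windows overlap by k-1 letters, so every k-mer of a sequence falls into exactly one window. The sketch below is an illustration under assumptions, not MetaCache's code: the hash is a stand-in (BLAKE2b truncated to 64 bits), reverse complements are ignored, keeping a shortened last window as long as it holds one full k-mer is assumed, and the names window_starts, kmer_hash and sketch are hypothetical.

```python
import hashlib


def window_starts(seq_len, k=16, winlen=127, winstride=None):
    """Start positions of sampling windows over a sequence of seq_len letters."""
    if winstride is None:
        winstride = winlen - k + 1  # default stride from the help text: w-k+1
    # Assumption: a shortened last window is kept if it holds one full k-mer.
    return [s for s in range(0, seq_len, winstride) if seq_len - s >= k]


def kmer_hash(kmer):
    """Stand-in 64-bit hash; NOT the hash function MetaCache uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(window, k=16, s=16):
    """Min-hash sketch: the s smallest distinct k-mer hashes of one window."""
    hashes = {kmer_hash(window[i:i + k]) for i in range(len(window) - k + 1)}
    return sorted(hashes)[:s]
```

For a sequence of 300 letters and default settings, window_starts(300) yields the starts 0, 112 and 224, and each window contributes at most 16 features.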
ADVANCED OPTIONS
-reset-taxa Attempts to re-rank all sequences after the main build
phase using '.accession2taxid' files. This will reset the
taxon id of a reference sequence even if a taxon id could
be obtained from other sources during the build phase.
default: off
-max-locations-per-feature <#>
maximum number of reference sequence locations to be
stored per feature;
If the value is too high, it will significantly slow down
querying. Note that an upper hard limit is always
imposed by the data type used for the hash table bucket
size (set with compilation macro
'-DMC_LOCATION_LIST_SIZE_TYPE').
default: 254
-remove-overpopulated-features
Removes all features that have reached the maximum allowed
number of locations per feature. This can improve querying
speed and can be used to remove non-discriminative
features.
default: off
Not available in the GPU version.
-remove-ambig-features <rank>
Removes all features that have more distinct reference
sequence taxa on the given taxonomic rank than set by
'-max-ambig-per-feature'. This can decrease the database
size significantly at the expense of sensitivity. Note
that the lower the given taxonomic rank is, the more
pronounced the effect will be.
Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain
default: off
Not available in the GPU version.
-max-ambig-per-feature <#>
Maximum number of allowed different reference sequence
taxa per feature if option '-remove-ambig-features' is
used.
Not available in the GPU version.
-max-load-fac <factor>
maximum hash table load factor;
This can be used to trade off larger memory consumption
for speed and vice versa. A lower load factor will improve
speed; a higher one will improve memory efficiency.
default: 0.800000
Not available in the GPU version.
-parts <#> Splits the database into multiple parts. Each part
contains a separate hash table.
default: 1
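The effect of the feature filtering options above ('-max-locations-per-feature', '-remove-overpopulated-features', '-remove-ambig-features' and '-max-ambig-per-feature') can be illustrated on plain Python dictionaries. This is a conceptual sketch only: the function names are hypothetical, features are represented by arbitrary keys, and keeping the first locations seen when a list is capped is an assumption.

```python
def cap_locations(feature_locations, max_locations=254, remove_overpopulated=False):
    """Keep at most max_locations per feature.

    With remove_overpopulated=True, features that reached the limit are
    dropped entirely. Keeping the first locations seen is an assumption.
    """
    result = {}
    for feature, locations in feature_locations.items():
        if remove_overpopulated and len(locations) >= max_locations:
            continue
        result[feature] = list(locations[:max_locations])
    return result


def remove_ambiguous_features(feature_taxa, max_ambig):
    """Drop features that occur in more than max_ambig distinct taxa.

    feature_taxa maps a feature to the taxon ids (already resolved to the
    chosen rank) of the reference sequences that contain it.
    """
    return {
        feature: taxa
        for feature, taxa in feature_taxa.items()
        if len(set(taxa)) <= max_ambig
    }
```

A feature shared by many taxa says little about where a read comes from, which is why dropping such features shrinks the database at the cost of some sensitivity.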
EXAMPLES
Build database 'mydb' from sequence file 'genomes.fna':
metacache build mydb genomes.fna
Build database with latest complete genomes from the NCBI RefSeq:
download-ncbi-genomes refseq/bacteria myfolder
download-ncbi-genomes refseq/viruses myfolder
download-ncbi-taxonomy myfolder
metacache build myRefseq myfolder -taxonomy myfolder
Build database 'mydb' from two sequence files:
metacache build mydb mrsa.fna ecoli.fna
Build database 'myBacteria' from folder containing sequence files:
metacache build myBacteria all_bacteria
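When builds are launched from a script, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of MetaCache; it only emits options that are documented in this help text and follows the first synopsis form (inputs before options).

```python
import shlex


def build_command(database, inputs, taxonomy=None, kmerlen=None,
                  sketchlen=None, winlen=None, winstride=None, parts=None):
    """Assemble an argument list for 'metacache build' (hypothetical helper)."""
    argv = ["metacache", "build", database, *inputs]
    options = [
        ("-taxonomy", taxonomy),
        ("-kmerlen", kmerlen),
        ("-sketchlen", sketchlen),
        ("-winlen", winlen),
        ("-winstride", winstride),
        ("-parts", parts),
    ]
    for flag, value in options:
        if value is not None:  # omitted options keep the tool's defaults
            argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(build_command("myRefseq", ["myfolder"], taxonomy="myfolder")))
```

The returned list can be handed to subprocess.run(argv, check=True) once the metacache binary is on the PATH.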