trtok - a fast and trainable tokenizer for natural languages
------------------------------------------------------------
Trtok is a universal, performance-oriented tokenizer for processing
natural languages. It reads text and tries to correctly detect sentence
boundaries and divide the text into tokens.
Trtok does not implement any specific heuristic to perform these tasks;
instead it lets the user define rules for the potential joining and splitting
of words into tokens and sentences. The final decision whether to split or
join words and whether to break sentences is left to a conditional
probabilistic model which is trained on user-supplied annotated data. The way
the trainer understands the data can be extensively customized by the user,
who can define their own features and specify which features are significant
for which tokens.
1) Tokenization schemes
-----------------------
The user might want to use trtok to process more than one language, or to
process one language in several different ways. These different ways of
tokenizing text are described by "tokenization schemes". Their definitions
reside in the "schemes" subdirectory of the installation directory. Every
folder inside "schemes" defines a single tokenization scheme by way of
various configuration files.
Tokenization schemes may be nested to represent a sort of scheme inheritance
where a scheme inherits all the configuration files of its ancestors unless
it redefines them by having a configuration file of the same name.
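For illustration, a hypothetical scheme layout might look like this (all
names are made up):
    schemes/
      en/
        abbreviation.listp
        numbers.split
        features
        simple/
          features
Here the nested scheme "en/simple" inherits abbreviation.listp and
numbers.split from "en" but redefines the "features" file.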
a) Rough tokenization rules
The tokenizer identifies all potential token and sentence boundaries within
the text and uses them, together with whitespace, to split the text into short
segments called rough tokens. The ambiguous boundaries are placed according
to the tokenization scheme. Files with the .split extension define positions
where a word may be broken into two tokens (called a MAY_SPLIT). Files with
the .join extension define positions where two words may be joined into a
single token (MAY_JOIN). Finally, files with the .break extension define
positions at which there might be a sentence break (MAY_BREAK_SENTENCE).
All of the above-named files must contain lines of pairs of
whitespace-delimited regular expressions. If the text leading to a position
and the text following it match the two paired regular expressions
respectively, the ambiguous boundary (MAY_SPLIT for .split files, MAY_JOIN
for .join files or MAY_BREAK_SENTENCE for .break files) is placed at that
position.
The grammar of the regular expressions in these files is the one used by
Quex and described in detail at
http://quex.sourceforge.net/doc/html/usage/patterns/context-free.html. Take
particular care here: Quex does not handle Unicode characters directly in its
regular expression syntax, so be sure to use the \UXXXXXX escape notation
if you need to make use of them.
The files may contain comments which are lines that begin with the # symbol.
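For illustration, a hypothetical .split file might look like this (the
patterns are made up and merely follow the whitespace-delimited pair format
described above):
    # allow splitting a trailing comma or full stop off a word
    [a-zA-Z]+      [.,]
    # allow splitting between a closing parenthesis and a following word
    [)]            [a-zA-Z]+
.join and .break files use the same format; only the type of ambiguous
boundary they introduce differs.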
b) User-defined properties
Files with a .rep extension contain a single regular expression from the
family of expressions allowed in PCRE (see pcre.org). A rough token is
marked as having this property if its text matches the regular expression.
Files with a .listp extension define properties using lists of token types.
If a rough token's text is exactly the same as a line from a .listp file,
then that rough token is marked as having the property defined by that
.listp file.
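For illustration, two hypothetical property definitions (the property names
reappear in the feature example below; their contents are made up):
    uppercase.rep might contain a single PCRE expression such as
        ^[A-Z]
    abbreviation.listp might list one token type per line, e.g.
        Dr.
        Mr.
        etc.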
c) Feature selection
Every tokenization scheme must have a file named "features". For each rough
token in the vicinity of the potential split/join/sentence break, it
specifies which features are important for the decision.
A typical line starts by declaring a set of interesting offsets (0 is the
rough token preceding the decision point, -1 the one before that, +1 the one
following the decision point, etc.). These offsets are separated by commas
and intervals can
be used for convenience (e.g. -4,-2..+2,5 selects -4,-2,-1,0,1,2,5).
After the offsets comes a colon and a comma separated list of properties.
The property names are the filenames of their definitions without the
extensions and they are limited to the common identifier character set
[a-zA-Z0-9_]. The line is closed with a terminating semicolon.
Apart from these simple features, it is possible to ask for combined
features which bundle the value of different properties of tokens at
different offsets into a single feature value. These are defined on their
own line and are enclosed in parentheses. Inside the parentheses is a "^"
separated list of offset:property pairs. If a combined feature takes
properties from a single token only, the parenthesized expression can
appear on the right-hand side of a typical line instead of a simple
property name and the offsets within its definition are omitted.
Apart from the user-defined properties from the .rep and .listp files, the
tokenizer defines the non-binary property "%length", whose value is the
length of the rough token, and the meta-property "%Word" which generates
a property for each rough token type.
Example:
-2..+2: %Word;
-5..5: uppercase, abbreviation, (starts_with_number ^ ends_with_period);
(0:fullstop ^ 1:initial)
d) Maxent training parameters
More control over the process of training the probabilistic model can be
had by manipulating the "maxent.params" file. This file is an INI-style
configuration file in which the user can set the following parameters; they
are passed directly to the training toolkit.
event_cutoff=<int>
    All training events which occur fewer times than event_cutoff are
    ignored. Default: 1.
n_iterations=<int>
    The maximum number of iterations the iterative method will perform.
    Default: 15.
method_name=lbfgs|gis
    Which of the two methods, L-BFGS or GIS, is to be used. L-BFGS is
    recommended. Default: lbfgs.
smoothing_coefficient=<double>
    Sigma, the coefficient used in Gaussian smoothing. Default: 0
    (no smoothing).
convergence_tolerance=<double>
    The model is regarded as converged when the relative difference
    between the log-likelihoods of successive models is less than
    convergence_tolerance. Default: 1e-05.
save_as_binary=false|true
    Whether to save the model file in a binary format, which is faster
    to load and smaller if Maxent was compiled with zlib support.
    Default: false.
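For illustration, a hypothetical maxent.params might look like this (the
values are arbitrary):
    n_iterations=30
    method_name=lbfgs
    smoothing_coefficient=1.0
    save_as_binary=true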
e) File lists and filename replacement regular expressions
Files [prepare|train|heldout|tokenize|evaluate].[fl|fnre] are for
convenience only and are described later.
2) Running the tokenizer
------------------------
a) Different ways of selecting input
The first argument passed to the tokenizer selects its mode, which can be
either "prepare", "train", "tokenize" or "evaluate". The second argument is
a path relative to the directory "schemes" which selects the tokenization
scheme to be used. The rest of the arguments are input files and options.
Input files can be specified explicitly on the command line. More files can
be given using the -l (--file-list) option which takes a path to a file and
adds every line of it as another input file.
When running in prepare or tokenize mode, an output file has to be specified
for each input file; when running in train or evaluate mode, a file with the
annotated version has to be specified instead. These secondary files are
selected by taking the input file's path and transforming it using a regular
expression/replacement string, which is specified with the -r
(--filename-regexp) option. These strings look like replacement commands in
sed: the first character can be any ASCII character, and that character both
separates the regular expression from the replacement string and terminates
the entire string. Unlike in sed, this special character cannot appear
anywhere else in the string (there is no escaping). The regular expressions
used here are PCRE regular expressions, and the replacement strings may
contain the placeholders \0, \1, ... standing for the entire matched string,
the first captured group, and so on.
Example:
trtok train en/simple/brown -l data/brown/train.fl -r "|raw|txt|"
In the annotated/tokenized files, sentences are split by newlines and
tokens are split by spaces.
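For illustration, a hypothetical fragment of an annotated file (whether
punctuation forms separate tokens depends on the scheme):
    This is the first sentence .
    And here is the second one .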
If no input files or file lists are given, a default file list named
<mode_name>.fl, which is part of the tokenization scheme, is used. If no
filename regular expression/replacement string is given, the one in the
file named <mode_name>.fnre from the tokenization scheme is used. In both
cases <mode_name> is expanded to either "prepare", "train", "tokenize" or
"evaluate" depending on the current mode.
If no input files or file lists are given and there are no default file
lists defined by the tokenization scheme, then the tokenizer processes the
standard input and writes to the standard output. This is, however, only
possible for the "prepare" and "tokenize" modes. The standard input/output
combo can also be explicitly selected by specifying the input file "-" on
the command line.
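For illustration, two hypothetical invocations (the scheme name is made up):
    # use the scheme's default tokenize.fl and tokenize.fnre
    trtok tokenize en/simple
    # read from standard input and write to standard output
    trtok tokenize en/simple -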
b) Different modes of execution
In "prepare" mode, the tokenizer reads the input, splits it into rough
tokens and then outputs it with all possible splits and sentence breaks
performed. This format might be handy for manual annotators who then only
have to join together parts of tokens and sentences.
In "train" mode, the tokenizer reads both the input and its annotated
version. It uses the annotated data to get pairs of questions (values of
features in a given context surrounding a decision point) and answers
(whether the decision point is to become a joining of tokens, a splitting
of tokens or a sentence break). These pairs are then used to train the
probabilistic model, which is stored in a file under the "build" directory.
In "tokenize" mode, the tokenizer relies on the presence of an already
trained model and uses it to classify every decision point in the input
file and output the tokenized and segmented text.
In "evaluate" mode, the tokenizer reads both the input and its annotation
as in "train" mode, but now it also queries the trained model for an
opinion and compares it with the one found in the annotated data. The
tokenizer outputs a log of every context and both the predicted and correct
outcomes for later analysis. The "analyze" script provided with trtok will
let you read this output and determine the accuracy of your system.
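For illustration, hypothetical tokenize and evaluate invocations mirroring
the earlier train example (all file names are made up):
    # tokenize raw files from a file list, writing .tok output files
    trtok tokenize en/simple/brown -l data/brown/test.fl -r "|raw|tok|"
    # compare the trained model's decisions with annotated heldout data
    trtok evaluate en/simple/brown -l data/brown/heldout.fl -r "|raw|txt|"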
c) Different options
If you launch trtok with no command line arguments, you will get a summary
of all the supported command line options and their meanings. These include
options for setting the encoding of the input and output files, options for
controlling the output (preserving the original tokenization, segmentation
or paragraph division), options for controlling the preprocessing of the
input (whether entities are to be expanded for the duration of tokenization
and whether they are to be kept expanded in the output; whether XML should
be hidden from tokenization), options for logging the contexts and outcomes
to a third file, and others.