-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathCHANGES.txt
276 lines (231 loc) · 14.8 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
--- VERSION 0.99 ---
Multiple speed and memory optimizations, for more details, please refer to
javadoc of the following:
- in the class FastRandomTree, please see function buildTree(), and the new
functions splitDataNew() and distributionSequentialAtt()
- in the FastRfBagging, see function computeOOBError( DataCache data, ... ) and
its helper class VotesCollectorDataCache, and also the buildClassifier() where
the feature importances are computed
Nominal attributes with more than 2 levels are treated differently than before;
in 0.98 and previous, as well as Weka's Random Forest, for such attributes,
splits with many branches are created. In 0.99, all splits are binary, and the
best value of the nominal attribute (in a one-vs-rest fashion) is chosen for the
split. Thus, the accuracy of 0.99 in datasets with multilevel nominal attributes
may differ somewhat (in either direction) from Weka's standard implementation.
The feature to compute RF attribute importances is accessible via the Explorer
GUI and is off by default, as it incurs a significant computational cost.
Results of the feature importance calculation are now displayed in the GUI; they
are output by the FastRandomForest toString() function.
Small changes in the output format of the Benchmark class.
--- VERSION 0.98 ---
Change in handling empty nodes.
� Empty nodes frequently appear when splitting on nominal attributes with
more than two categories. What happens if an instance ends up in an empty
node during testing?
� In Weka RF, the class distribution of its parent node would be looked up
and offered as an answer.
� In FastRF prior to version 0.98, the tree would vote with all zeroes,
effectively excluding its vote from the forest when testing the given
instance.
� To increase compatibility, FastRF 0.98 onwards reverts to a behavior similar
to Weka RF. However, the class distributions of the non-leaf nodes are not
saved, and therefore we have to check whether a node will be empty *before*
performing the split and set its class probabilities in the parent's
buildTree() call.
Memory use optimizations in the DataCache.
� The creation of the whatGoesWhere[] for each bag is delayed; this array is
now created in the FastRandomTree.run().
� The creation of the sortedIndices[][] for each bag is delayed; it was created
in DataCache.resample(), but now it is created in the FastRandomTree.run().
Changes to the Benchmark class.
� Switched to a generalization of AUC score to multiclass problems as
described in Provost and Domingos (CeDER Working Paper #IS-00-04, Stern
School of Business, New York University, 2001), computed by taking a
**weighted** average over all one-vs-all binary classification problems
that can be derived from the multiclass problem, where weights
orrespond to class prior probabilities.
� System.gc() is invoked after each classifier.
--- VERSION 0.97 ---
Changes in normalization of class probabilities in leaves (m_ClassProbs)
� In Weka RF, the final m_ClassProbs was obtained by normalizing the sums of
per-class instance weights so that their total equals to 1.0; now, they
are normalized by dividing sum of instance weights in each class with the
total number of instances in node. This subtle change has two implications:
(1) As a consequence of bootstrap sampling, instances get varying weights
if they are sampled multiple times. Now we allow this to influence how
trees vote, which decorrelates individual trees and may have a mild
positive effect on predictive ability of the forest.
(2) More importantly, if user changes instance weights before training
(for example, to put more emphasis on a rare class), the current
normalization scheme will cause this change to shift the per-class
probabilities (prediction) supplied by the forest. Before this wasn't
the case.
� In parallel with the above, the manner in which the out-of-bag (OOB) error
estimate is computed has been changed. OOB estimation in Weka's Bagging is
one tree - one vote. In FastRF 0.97 onwards, some trees will have a heavier
weight in the overall vote depending on the averaged weights of instances that
ended in the specific leaf. Also, if impure nodes are present, trees may vote
for more than one class simultaneously.
--- VERSION 0.95 ---
Multithreading is used in classifier training, and in out-of-bag error
estimation.
� The classifiers in FastRfBagging are trained in separate "tasks" which are
handled by an ExecutorService (the FixedThreadPool) which runs the tasks in
more threads in parallel. If the number of threads is not specified,
it will be set automatically to the available number of cores.
� Estimating the out-of-bag (OOB) error is also multithreaded, using
the newly introduced VotesCollector class
� To support multithreading, the classes FastRandomTree and DataCache also had
to be reworked. FastRandomTree now holds a field with a reference to its
DataCache (set to null right after training), and the DataCache has its own
reusableRandomGenerator; before, both of these objects were passed down the
buildTree() recursion. Also, the FastRandomTree.run() method now replaces
the initial call of buildTree().
Two rarely occuring bugs were fixed in finding split points.
� Sometimes an attribute with all missing values could be judged as carrying
some information after split, and selected as the split criterion. while
the splitpoint would be set to -Double.MAX_VALUE. This would result in
unnecessary (but otherwise harmless) additional branches in the tree.
� Due to imprecision of the fastLog2 function, some splits which would normally
have reduction of entropy of 0 (ie. be useless) actually had a very small
positive value; this was fixed by checking reduction of entropy against a
small constant (0.01) instead of against 0.
Serialization is no longer used to make copies of the initial FastRandomTree.
� It's much faster to create the trees using the normal constructor and fill
their fields with necessary data. The whole FastRandomForest may normally be
serialized, as before.
--- VERSION 0.9 ---
The SplitCriteria class was introduced.
� EntropyConditionedOnRows() and entropyOverColumns() functions are made a
bit faster by avoiding normalization for total number of instances.
� The fastLog2() function was introduced as a rough approximation of log base 2
logarithms by bitwise operations on the internal representation of floating
point numbers. It is approx. 4 times faster than Math.log(), relieving
a performance bottleneck of FastRandomTree.
Changes in the FastRandomTree class:
� The value of splitting criterion before split (entropyOverColumns) is not
re-calculated for each attribute (because it doesn't change between
attributes) in the distribution(), but only once in buildTree()
� Also, when searching for best split points, the entropyOverColumns is not
subtracted from entropyConditionedOnRows() in every possible split point,
only in the best one in the buildTree().
� In the distribution() function:
- before it was possible to create a split 'in the middle' of the first
instance 0, which would result in empty nodes after the split and cause
errors after the (otherwise unnecessary) m_classProbs was removed in 0.71.
This 'empty node problem' is now fixed.
- dist[][] is now computed only after the split point has been found, and not
updated continually by copying from currDist.
- instance 0 is now skipped when looking for split points, as the split
point 'before instance 0' is not sensible and entropy does not need to be
computed.
--- VERSION 0.8 ---
The FastRfInstances class was removed.
The DataCache class was introduced:
� The DataCache stores a dataset that FastRandomTrees use for training. The
data points are stored in a single-precision float array indexed by attribute
first, and by instance second, allowing faster access by FastRandomTrees.
� The DataCache can seed a random number generator from the stored data.
� The DataCache creates bootstrap samples of data. This functionality was
removed from FastRfBagging class.
-> There is a subtle change in the way sampling is performed; before, if an
instance had higher weight, it was more likely to get sampled multiple
times, and its weight in the bagged dataset would increase in multiples of
1 (i.e. 1.0, 2.0, 3.0...) Now, the sampling frequency is independant of
instance original weight, and its weight in the sampled data increases in
multiples of the original weight, for example 1.5, 3.0, 4.5... This ensures
the bagging procedure conserves class proportions in the bagged data.
� The DataCache also computes and stores the sorted order of the instances by
any attribute. This functionality was removed from FastRfBagging class. Due
to the way that the DataCache is organized, the sortedInstances arrays don't
have to be readjusted in making bootstrap samples.
The FastRfUtils class was introduced:
� Contains functions for sorting single-precision float arrays used in
DataCache, and for normalization of double arrays.
The FastRfTree class was renamed to FastRandomTree.
--- VERSION 0.72 ---
The FastRfTree class:
� m_Distribution field, and all references to it are removed; it was used
only localy in buildTree() as a reference to another array. Also, in leaves
m_Distribution held (quite useless) copies of the m_ClassProbs.
� in buildTree(), local array classProbs was unnecesarily deep copied to
m_classProbs in every split; now it's only copied by reference, and only
when a leaf is made. It's also normalized only in leaves (before, it was
normalized in every node). To do:
-> last change sometimes causes an error, as the split function can produce
an empty leaf (which it shouldn't!) and then there is no m_classProbs of
the parent leaf to look up - it wasn't saved!!
--- VERSION 0.71 ---
The FastRfTree class:
� During the normal training of the forest, a lot of instances of FastRfTree
are being created - one per split per tree. Since the trees in a Random
Forest normally contain n-1 splits each (for n instances), it could pay off
to make 'lean' versions of the FastRfTree that has less fields, and therefore
a smaller memory footprint.
- m_MotherForest (FastRandomForest) was introduced
- m_KValue (int) was removed; a call to getKValue looks the value up by
asking the m_MotherForest its KValue; setKValue was removed
- ditto for m_MaxDepth and getMaxDepth
- ditto for m_Info (no getter/setter)
- m_MinNum is now a static final field always = 1; it was never set to
another value before anyway
- m_RandomSeed and its getter and setter have been removed; instead,
the FastRfBagging passes a random seed it has generated as a parameter
of the BuildClassifier function - later, a Random object is generated from
the seed and the seed is not used anymore. As a consequence of this,
FastRfTree can't implement Randomizable anymore.
- Some methods which are likely to go unused have been omitted: toGraph(),
toString(), and leafString() - this doesn't help with size of trees in
memory, I've done it just to clean up clutter.
- The check whether the default classifier (ZeroR) should be built was moved
to FastRandomForest's buildClassifier(); and check if it should be used
in testing to FastRandomForest's distributionForInstance(). So there is
no more need for m_ZeroR field in FastRfTrees.
The Benchmark class was introduced, useful for quick comparisons (speed-wise
and accuracy-wise) to the weka.trees.RandomForest.
--- VERSION 0.7 (first public release) ---
The following changes were made over Weka's RandomForest implementation (in Weka
developer branch nightly snapshot of 31-Oct-2007).
The FastRfBagging class (replaces weka.classifiers.meta.Bagging):
� The new method resampleFastRf has functionality similar to
resampleWithWeights, with the following difference: When an instance is to be
sampled multiple times, resampleWithWeights would add multiple copies of the
instance to the newData; resampleFastRf increases the weight (by 1) for the
copy already present. This reduces the number of instances by ~30%, speeding
things up. The trees' splitting criteria should work equally on this kind of
dataset because they explicitely take weights into account, however the
forests produced are a bit different when this sampling procedure is used.
I'm not sure what exactly causes the difference, however the crossvalidation
performance of the forest did not seem to deteriorate on my tests.
� The FastRfBagging now sorts all instances on every attribute and stores this
data into each bag's sortedIndices field, adjusting for instances absent from
the bag; this means the sort is now performed once per forest, and not once
per RandomTree (as in ordinary RandomForest).
The FastRfInstances class (inherits from weka.core.Instances):
� The resampleFastRf now generates 'bags' of type FastRfInstances, which
differs from ordinary Instances only in having 2 additional fields:
int[][] sortedIndices and int[] whatGoesWhere.
The FastRandomForest class (replaces weka.classifiers.trees.RandomForest):
� This class differs from ordinary RandomForest basically only in using
FastRfBagger with FastRfTree, instead of Bagger with RandomTree.
The FastRfTree class (replaces weka.classifiers.trees.RandomTree):
� The FastRfTree has a different way of handling instances with missing values
in attribute chosen for the split. The RandomTree used to 'split up' such an
instance in two instances, with weights that are proportional to size of each
branch at splitting point. Instead, FastRfTree assignes the instance to one
of the branches at random, with bigger branches having a higher probability
of getting the instance. This allows us to skip creating new arrays of
instance weights and passing them down the tree, as the weights do not change
anymore. My reasoning was that such an approximation was acceptable when a
forest with lots (>100) of trees was created, although it may not be on a
single tree.
� The splitData procedure splits int[][] sortedIndices into two (or more for
categorical split attributes) int[][][] subsetIndices. It was quite slow
before because it checked whether an instance's split attribute value was
above or below splitting threshold for each attribute in sortedIndices. Now,
this checking is done only once, and results are stored into the dataset's
whatGoesWhere array (dataset must be of type FastRfInstances).
� As a consequence of the above, the exact branch sizes (even with instances
having unknowns in the split attribute) are known in advance so subsetIndices
arrays don't have to be 'resized' (i.e. a new shorter copy of each one
created and the old one GCed � quite a lot of data being shuffled around).