*** Flowage Optimization
The flowage.pl Perl script processes netflow data one record at a time
and writes results to RRD database files. The key variables that affect
performance include:
* The number of datapoints (roughly the number of categories multiplied
by the number of interfaces)
* The number of exporters sending data
* The speed of the CPU and the number of cores
* Disk I/O performance and the optional use of the rrdcached daemon
Performance health is logged for every flow file processed, and can be
viewed in the web health check screen or from the log itself. E.g.,
tail -1000 /var/log/webview/flows/data/flowage.log | grep processed
Two real-world examples from 2012:
A bare-metal server with two Intel E5540 (4-core) CPUs and a very complex
configuration keeps up handily with over 70,000 flows/second from over
600 exporters.
A virtualized server with four vCPUs is able to keep up with over 25,000
flows/second and a very demanding configuration with over 200 traffic
categories from over 500 exporters.
Apart from normal flow export volume, networks occasionally see "flow
storms" -- periods of unusual flow activity due to port scanning, worm
propagations, etc. These cause the flow files to balloon to many times
their normal size and may cause flowage to temporarily fall behind. The
monFlows.pl script can detect and alert on such storms.
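A minimal sketch of running that check from cron; the hourly schedule and
the install path are assumptions, and monFlows.pl's own options control
how storms are detected and alerted on:
    # hypothetical schedule -- path and options are assumptions
    0 * * * * /usr/local/webview/flowage/monFlows.pl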
However, if flowage is falling behind your normal flow volume, or if the
system or web interface is chronically unresponsive, it's time to
optimize flowage.
There are several approaches to take when flowage is not keeping up with
incoming flows:
1. Evaluate the server hardware.
(a) Get a faster CPU. For a small number of exporters, a higher
clock rate does far more good than additional cores.
(b) Enable symmetric multiprocessing (SMP), either via flowage's
native "fork <processes>" configuration, which works well when
there are many exporters, or with a manual SMP configuration (see
#5 below).
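A minimal sketch of the native SMP setup, assuming the directive
simply takes the number of worker processes to fork (a value near
the number of CPU cores is a reasonable starting point):
    # flowage.cfg -- assumes "fork <processes>" takes a plain process count
    fork 4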
(c) Check disk I/O. If the server becomes unresponsive for many
seconds at a time from command-line or web GUI, then the
server may have disk I/O issues.
In a large environment, flowage may need to update thousands
or tens of thousands of RRD files each time it runs. Having
so many files opened, written, and closed in rapid succession
is a problem for many operating systems and disk subsystems.
The number of files in a given directory can also be a problem
(be sure you use the "directory hierarchical" flowage.cfg command).
The best fix is to install the rrdcached daemon (see
http://oss.oetiker.ch/rrdtool/doc/rrdcached.en.html), which
may need to be installed separately from rrdtool. rrdcached
allows the system to buffer and batch RRD updates, smoothing
out the I/O spikes. See its documentation for details on its
tuning knobs. To enable webview to use rrdcached, invoke
flowage.pl with the "--daemon" option and set the
RRDCACHED_ADDRESS environment variable for CGI scripts run by
Apache.
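A hedged sketch of one way to wire this together; the socket path,
journal directory, RRD base directory, and timer values below are
assumptions to be adapted to your layout:
    # start rrdcached on a unix socket with a journal, restricted to the RRD tree
    rrdcached -l unix:/var/run/rrdcached.sock \
              -j /var/lib/rrdcached/journal \
              -b /var/log/webview/flows/data -B \
              -w 1800 -z 900 -f 3600
    # run flowage.pl with its rrdcached support enabled
    /usr/local/webview/flowage/flowage.pl --daemon
    # Apache configuration: pass the same address to the CGI scripts
    SetEnv RRDCACHED_ADDRESS unix:/var/run/rrdcached.sock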
A less-preferred fix is to use flowage.pl's packed-rrd datafile
format, which reduces the number of rrd files at the expense
of flexibility.
Regardless of the approach, the use of 10k or 15k RPM SCSI
drives with RAID0, RAID6, or RAID10 can greatly improve
performance and responsiveness.
2. Reduce the number of flows.
(a) The vast majority of flows are small and of little value to
capacity planning or understanding application volumes. For
example, each broadcast packet may be reported as a 1-packet
flow that flowage must process.
If you are interested only in bandwidth utilization of "real"
traffic, one trick that works well is to filter out the
small flows. This can be done in flowage.cfg:
datafile foo Interfaces Categories
Quickly
ip access-list extended Quickly
permit ip any any packets gt 1
The downside, of course, is that filtered traffic won't be
reported on, though it may still be seen in the ad hoc query tool.
3. Increase capture efficiency
If flows are being lost by the collector, use a ramdisk (/dev/shm)
as a capture directory in flow-capture, and use the flow-shuffle
script from the optional-accessories subdirectory to move the
captured files to their real location. Using multiple collectors
on unique port numbers can also help.
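A hedged sketch under these assumptions: flow-capture is started with
its usual -w workdir option, /dev/shm has room for several capture
intervals, and flow-shuffle takes a source and destination directory
(check the script itself for its real arguments and install path):
    # capture into a ramdisk so brief I/O stalls don't drop flows
    flow-capture -w /dev/shm/capture -p /var/run/flow-capture-2055.pid -n 287 0/0/2055
    # cron: move completed capture files to the real capture directory
    */5 * * * * /usr/local/webview/flowage/optional-accessories/flow-shuffle /dev/shm/capture /opt/netflow/capture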
4. Optimize flowage.cfg
(a) Group ordering
The items listed in a group are processed in the order you
specify, so place the most commonly matched categories near
the top. Use the ad hoc query tool to run a report sorted by
'flows' to identify the most burdensome categories.
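For example, a sketch with hypothetical category names, ordered so
the heaviest hitters are evaluated first:
    group Categories
       Web_Browsing        # highest flow count -- matched first
       Email
       Backups             # rarely matched -- keep near the bottom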
(b) ACL rewriting
Often the ACLs themselves can be reworked for efficiency. For
example, if a server handles nothing but mail, then this:
ip access-list extended Email
permit tcp any host <mail server> eq 25 reverse
permit tcp any host <mail server> eq 110 reverse
permit tcp any host <mail server> eq 143 reverse
can probably be replaced with this:
ip access-list extended Email
permit ip any host <mail server> reverse
Also, be sure to use host-lists when dealing with many
equivalent IPs. For example, this:
ip access-list extended Email
permit ip any host <mail server 1> reverse
permit ip any host <mail server 2> reverse
permit ip any host <mail server 3> reverse
permit ip any host <mail server 4> reverse
could be replaced by this:
ip access-list extended Email
permit ip any host @Mail_Servers reverse
ip host-list Mail_Servers
<mail server 1>
<mail server 2>
<mail server 3>
<mail server 4>
(c) Use access-list maps. If you have many categories that are
made of simple host-lists, these can be combined into a very
efficient "map" structure. For example, given this:
ip host-list Mail_Servers
<mail server 1>
<mail server 2>
ip host-list FTP_Servers
<ftp server 1>
<ftp server 2>
instead of this:
ip access-list extended Email
permit ip any host @Mail_Servers reverse
ip access-list extended FTP
permit ip any host @FTP_Servers reverse
group Categories
Email
FTP
you can do this:
ip access-list map CategoryMap
Email @Mail_Servers
FTP @FTP_Servers
group Categories
CategoryMap
Each group can combine ACL maps and traditional ACLs to
achieve a good balance between efficiency and preservation of
order. For example:
group Categories
CategoryMap1
SimpleACL1
CategoryMap2
SimpleACL2
The above group would first check the src/dst IPs against
CategoryMap1. If no match is found, SimpleACL1 would be checked,
then CategoryMap2, then SimpleACL2.
(d) Per-datafile ACLs
If you have multiple datafile definitions, and some of those
definitions only apply to a subset of the flow data, add a
master ACL to the datafile that reduces the flows they must
process.
For example, say you are collecting from a number of internal
WAN routers and a single internet router. You would likely want
Internet-appropriate categories (web surfing, web hosting, SMTP,
etc.) for just the one internet router, while the enterprise
categories (file_sharing, SQL, etc.) would apply to the others.
Here is how that can be done efficiently:
if-matrix Interfaces auto
datafile rrd Internet
Interfaces # interface-matrix
Internet_Categories # my group of internet categories
Internet_Router_ACL # my ACL to restrict this datafile
datafile rrd Enterprise
Interfaces # interface-matrix
Enterprise_Categories # my group of categories
Enterprise_Router_ACL # my ACL to restrict this datafile
ip access-list extended Internet_Router_ACL
permit ip any any exporter <internet-router>
ip access-list extended Enterprise_Router_ACL
deny ip any any exporter <internet-router>
permit ip any any
5. Harness multiple CPUs by running multiple instances of flowage
flowage.pl's internal SMP capabilities work best with a large number
of flow sources. An alternative is to manually configure multiple
instances of flowage to divide the workload across multiple CPUs.
(a) You can place each datafile definition in a separate config
file and have concurrent instances of flowage.pl running off
the same capture data. Each flowage instance needs its own
tmp directory and log file, but they can share capture and
data directories.
To have these instances share the same GUI, one more config file
with all datafile definitions is needed so that a proper web
index can be created. This last config file is never invoked
for actual processing; instead it is run with the "-check"
flag, which simply creates the index file and exits.
Here are four files that demonstrate the scheme applied to
the Internet/enterprise example from the previous section:
-- flowage.cfg --
# never invoked for processing -- only used to create the web index
directory temp /var/log/webview/flows/tmp
include flowage-1.cfg
include flowage-2.cfg
-- flowage-1.cfg --
# processes Internet_Categories
include flowage-global.cfg
directory temp /var/log/webview/flows/tmp-1
logging file flowage-1.log
datafile rrd Internet
Interfaces # interface-matrix
Internet_Categories # my group of internet categories
Internet_Router_ACL # my ACL to restrict this datafile
-- flowage-2.cfg --
# processes Enterprise_Categories
include flowage-global.cfg
directory temp /var/log/webview/flows/tmp-2
logging file flowage-2.log
datafile rrd Enterprise
Interfaces # interface-matrix
Enterprise_Categories # my group of categories
Enterprise_Router_ACL # my ACL to restrict this datafile
-- flowage-global.cfg --
# all global parameters (exporters, if-matrix, access-lists, groups, host-lists, etc)
Cron would be set up like this:
*/5 * * * * /usr/local/webview/flowage/flowage.pl -f flowage-1.cfg
*/5 * * * * /usr/local/webview/flowage/flowage.pl -f flowage-2.cfg
0 * * * * /usr/local/webview/flowage/flowage.pl -check
(b) Use multiple instances of flowd running on different
port numbers to save flow data into separate capture
directories. Then you can run multiple flowage instances.
This is common when dealing with very different exporter types
-- e.g., server farm switches and WAN routers.
-- Contents of /usr/local/etc/flowd-2055.conf --
logfile "/dev/shm/2055/flowd"
pidfile "/var/run/flowd-2055.pid"
listen on 0.0.0.0:2055
store ALL
-- Contents of /usr/local/etc/flowd-2056.conf --
logfile "/dev/shm/2056/flowd"
pidfile "/var/run/flowd-2056.pid"
listen on 0.0.0.0:2056
store ALL
-- /etc/init.d scripts --
ln -s /etc/init.d/flowd /etc/init.d/flowd-2055
ln -s /etc/init.d/flowd /etc/init.d/flowd-2056
update-rc.d flowd-2055 defaults
update-rc.d flowd-2056 defaults
service flowd-2055 start
service flowd-2056 start
-- Cron jobs --
*/5 * * * * /usr/local/webview/utils/flowd2ft 2055 >> /var/log/flowd2ft-2055.log 2>&1
*/5 * * * * /usr/local/webview/utils/flowd2ft 2056 >> /var/log/flowd2ft-2056.log 2>&1
The above causes flow files to be written into a subdirectory of the
capture directory based on the port number of the collector. E.g.,
/opt/netflow/capture/2055 and /opt/netflow/capture/2056
Then, build a separate flowage.cfg config file for each
capture directory, similar to what was shown in (5a) above.
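As in (5a) above, each capture directory then gets its own config
and cron entry; the config file names here are hypothetical:
    */5 * * * * /usr/local/webview/flowage/flowage.pl -f flowage-2055.cfg
    */5 * * * * /usr/local/webview/flowage/flowage.pl -f flowage-2056.cfg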