-
Notifications
You must be signed in to change notification settings - Fork 2
Metrics
The throughput and trap charts below present benchmarking metrics with (ribose@9a1a112) after some recent (Fall 2023) performance tuning. This work was done on Windows 10 with Java 17 on a Lenovo ThinkBook with 12 CPUs, 384kb L1 cache, 16gb of RAM.
Benchmarking focuses on two data sets. One is a Linux kernel log file with mixed entries, including iptables
entries which are the features of interest. The other is an IBM Java GC log with mixed XML stanzas including generational collection start stanzas. The former is dense with features of interest, and each feature is dense with data (38% of input is extracted). The latter is more sparsely populated with features of interest, and each feature contains only one field of interest (<1% of input is extracted). Both input files are ~20MB in size and were loaded in RAM for the benchmarking runs. For the regex runs, the inputs were decoded to a 40MB char
array as required. No decoding was performed for the ribose runs, which operated directly on the 20 MB byte
arrays. The time to load and decode the input arrays is excluded from the throughput measurement and both byte
and char
array were persistent in the heap over the course of the benchmarking runs.
For the Linux kernel log there are 4 ribose transducers and one Java regex, each producing identical output. The sum and product thresholds (see below) were set to 128 and 10, respectively. For each data set an equivalent Java regex expression was constructed to extract identical data from the input files. Mean throughput the for regex equivalents was 85 mb/s for the LinuxKernel and 1385 mb/s for tye GC log transductions.
- LinuxKernel: single-pass transduction extracting several fields from each iptables entry.
- LinuxKernelLoose: partial lookahead to identify iptables entry followed by single-pass transduction.
- LinuxKernelStrict: full lookahead to identify iptables entry followed by single-pass transduction.
- LinuxKernelNil: single-pass transduction locating fields of interest without extraction.
- Regex.LinuxKernel: Java regex equivalent to LinuxKernel*.
For the GC log there are 2 ribose transducers and one Java regex, each producing identical output.
- Sintervals: coarse-grained identification of GC cycle start stanzas and extraction of one attribute value.
- Tintervals: fine-grained identification of GC cycle start stanzas and extraction of one attribute value.
- Regex.VerboseGC: Java regex equivalent of *intervals.
Throughput benchmarking involves running each transducer 20 times in series with input data in RAM and output muted. The first 10 runs are warmup runs, the last 10 run times are used to calculate the average throughput, proportions of input caught in sum, product or scan traps, and number of nul
events per kilobyte of input. A nul
signal is injected when no transition is defined for current state and input token, giving the current transducer an opportunity to enter a synchronizing pattern that consumes input with nil effect up to the end of the innermost enclosing loop. The ribose transducers involved here are conditioned to accept nul
anywhere and synchronize at the head of the next potential feature, so any input from byte*
is acceptable but only recognizable features will produce any effect.
Input bytes and signals are filtered through the transductor in a tight loop that ranges over the cells of the active transducer's transition matrix in RAM. In the process they may stall frequently in the von Neumann bottleneck as they pull substantial portions of the matrix into the L1 data cache (192kb or 3072x64 cache line blocks for these runs). The ribose model compiler recognizes 3 kinds of input traps, which bypass the transition matrix and operate in tight loops with small working sets for certain subsequences of input.
- sum trap: emulates a state with many idempotent transitions ([input, state] -> [state, nil]) until an input byte forces state change or non-nil effect.
- product trap: emulates an acyclic series of >3 states each with only one nil exit transition (a non-branching input vector with nil effect).
- scan trap: specializes the sum trap in cases where there is only one input that escapes the state.
The sum, product and scan traps are controlled by private effectors injected into the transducer matrix and effect vectors by the model compiler. The sum and scan traps are particularly useful in error recovery contexts. The relatively high throughput seen with the *intervals transducers is obtained in scan traps searching for the next opening '<' and, in the case of Rintervals which presents many product traps when the product threshold set to 3. The product trap eliminates intermediate states and reduces the granularity of the input equivalence relation and so compresses the transition matrix in both dimensions (reduced number of states and input equivalence classes).
The automata that ginr produces from ribose transducer patterns are state-minimized deterministic finite automata with 3 tapes (input, effector, parameter) and >256 input symbols (byte|signal
). Each transition is associated with an effect, which may be a scalar simple or parameterized effector or a nul
-terminated vector of effectors and associated parameters. Effectors and parameters are represented in ginr automata as non-branching chains of state transitions that the ribose model compiler extracts as a single action referring to an effect vector. This is the first stage of matrix compression. Some groups of byte and signal inputs may behave identically in all states; for some transducers a digit is just a digit, everywhere. So the transition matrix induces an equivalence relation on byte|signal
and equivalence class ordinals can stand in for inputs in the transducer matrix. This is the second stage of compression.
The ribose model compiler enumerates all maximal non-branching chains of input state transitions in the transducer matrix, similarly as for effect vectors. These chains are diverted to product traps, which eliminate the intermediate states and erase the unique character of the involved bytes, leading to a more coarse-grained input equivalence relation. This final stage of compression reduces the transducer matrix in both dimensions and improves transition density (transitions / (states * input equivalents)), thereby minimizing the representation of the transducer matrix in RAM and reducing the likelihood of L1 cache misses while running transductions. The net result is lossless decomposition of the transducer matrix to an integer array mapping byte|signal → input equivalent
, a compressed state transition matrix mapping input equivalent → effect
and an integer array mapping effect → effector vector
(an indexed concatenation of nul
-terminated effect vectors).
For all of the instrumented benchmarks the sum trap threshold was set to a high value (128) to avoid setting up short-running sum traps. Short runs in any trap tend to degrade performance as illustrated by the reduced throughput for LinuxKernel when the product threshold in set to 3 (1 product vector, 9 bytes) vs 10 (0 profduct vector). In the former case frequent switching between looping through the transition matrix and looping in the product trap disrupts the instruction pipeline to an extent that outweighs any throughput improvement. The Rintervals benchmark involves 23 product vector with the product threshold set to 3 and throughput degrades to an even greater extent, even though the matrix size is substantially reduced.
Garbage collection metrics are logged for output equivalence tests. Output is written through a buffered output stream to System.out and tested for binary equivalence. Java regex requires that the entire input corpus be presented as a single unit so there is an up front 60MB heap charge to read the raw UTF8 input into a 20MB byte
array and decode it into a 40MB char
array to serve to the regex API. The ribose output equivalence transductions incur the 60MB charge identically as for regex but do not use these arrays; ribose input is presented in 64kb segments as read from the raw input file and transduced as a UTF8 byte stream without decoding to UNICODE. In all cases this 60MB charge persists for the duration of the benchmarking run, so the retained heap volumes reflect turnover of transient objects instantiated during benchmark runs.
Heap usage is a strong differentiator for the ribose runtime. Data are extracted to a finite set of byte accumulators. These are analogous to registers in a von Neumann CPU - they are persistent while their values are transient. Extracted data must be output or assimilated into the target model immediately as it becomes available. As a consequence, ribose transductions typically operate with very low heap resources although ribose byte accumulators are allowed to grow without limit when required. For the kernel log runs the ribose heap grows from an initial heap size of 2MB to a maximal extent of 64MB while the regex heap grows to 76MB. In both cases 7.4MB of identical data are extracted to the output stream. The verbose GC log runs (regex and ribose) extract only 61kb to output and run in a 62MB heap.
Copyright 2022,2023 Kim Briggs (except 'Ginr', copyright J Howard Johnson)