
Metrics

jrte edited this page Mar 24, 2023 · 23 revisions

Winding down

The data presented below represent the end of this winter's cold code camp. These are from a build for commit 6fa9a23 (Refactor (simplify) Transductor.run()).

Transduction throughput

[figure: transduction throughput]

Benchmarking focuses on two data sets. One is a Linux kernel log file with mixed entries, including iptables entries, which are the features of interest. The other is an IBM Java GC log with mixed XML stanzas, including generational collection start stanzas. The former is dense with features of interest, and each feature is dense with data (38% of input is extracted). The latter is more sparsely populated with features of interest, and each feature contains only one field of interest (<1% of input is extracted). Both input files are ~20MB in size.

For the Linux kernel log there are 4 ribose transducers and one Java regex, each producing identical output.

  • LinuxKernel: single-pass transduction extracting several fields from each iptables entry.
  • LinuxKernelLoose: partial lookahead to identify iptables entry followed by single-pass transduction.
  • LinuxKernelStrict: full lookahead to identify iptables entry followed by single-pass transduction.
  • LinuxKernelNil: single-pass transduction locating fields of interest without extraction.
  • Regex.LinuxKernel: Java regex equivalent to LinuxKernel*.

For the GC log there are 2 ribose transducers and one Java regex, each producing identical output.

  • Sintervals: coarse-grained identification of GC cycle start stanzas and extraction of one attribute value.
  • Tintervals: fine-grained identification of GC cycle start stanzas and extraction of one attribute value.
  • Regex.VerboseGC: Java regex equivalent of *intervals.

Throughput benchmarking involves running each transducer 20 times in series with the input data in RAM and output muted. The first 10 runs are warmup runs; the last 10 run times are used to calculate the average throughput, the number of nul events per kilobyte of input, and the proportions of input caught in sum, product, or scan traps. Output is muted for throughput benchmarking and enabled for the output equivalence tests described below.
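The warmup-then-measure loop described above can be sketched as follows. This is a hypothetical stand-in, not the actual benchmark harness: `transduce` is a placeholder that merely touches every input byte where the real code would run a ribose transduction.

```java
// Sketch of the benchmarking loop: 10 warmup runs, then 10 timed runs
// averaged into a throughput figure. `transduce` is a placeholder for the
// real transduction call (hypothetical; not the ribose API).
public class ThroughputBench {
    static long transduce(byte[] input) {
        long sum = 0;
        for (byte b : input) sum += b; // stand-in for real work on each byte
        return sum;
    }

    public static void main(String[] args) {
        byte[] input = new byte[20 << 20]; // ~20MB, as in the benchmarks
        for (int i = 0; i < 10; i++) transduce(input); // warmup runs
        long nanos = 0;
        for (int i = 0; i < 10; i++) {           // measured runs
            long t0 = System.nanoTime();
            transduce(input);
            nanos += System.nanoTime() - t0;
        }
        double mbPerSec = (10.0 * input.length / (1 << 20)) / (nanos / 1e9);
        System.out.printf("throughput: %.1f MB/s%n", mbPerSec);
    }
}
```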

Input traps

[figure: input traps]

There are 3 input traps, which bypass the transition matrix for certain subsequences of input.

  • sum trap: emulates a state with many idempotent transitions ([input, state] -> [state, nil]) until an input byte forces state change or non-nil effect.
  • product trap: emulates a series of states each with only one exit transition (a non-branching input vector) until a branching state is reached.
  • scan trap: specializes the sum trap in cases where there is only one input that escapes the state.
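As an illustration, the inner loop of a scan trap amounts to a tight byte search that bypasses the transition matrix entirely. The following is a minimal standalone sketch, not the actual ribose implementation:

```java
// Sketch of a scan trap: when only one input byte escapes the current
// state (here '<'), the runtime can scan raw bytes directly instead of
// looking up idempotent self-transitions in the transition matrix.
// Hypothetical standalone version, not the ribose runtime's code.
public class ScanTrap {
    // Returns the index of the escaping byte, or input.length if absent.
    static int scan(byte[] input, int position, byte escape) {
        int i = position;
        while (i < input.length && input[i] != escape) {
            i++; // each skipped byte is an emulated [input, state] -> [state, nil]
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] log = "...noise...<gc-start>".getBytes();
        int next = scan(log, 0, (byte) '<');
        System.out.println(next); // index of the next opening '<'
    }
}
```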

The sum, product and scan traps are controlled by private effectors inserted into the transducer matrix and effect vectors by the model compiler. The sum and scan traps are particularly useful in error recovery contexts. The relatively high throughput seen with the *intervals transducers is obtained in scan traps searching for the next opening '<'. The product trap eliminates intermediate states and reduces the granularity of the input equivalence relation, and so compresses the transition matrix in both dimensions (fewer states and fewer input equivalence classes).

Transducer matrix compression

[figure: transducer matrix compression]

The automata that ginr produces from ribose transducer patterns are state-minimized deterministic finite automata with 3 tapes (input, effector, parameter) representing subsequential transducers. Each transition is associated with an effect, which may be a scalar effect (a single simple or parameterized effector) or a nul-terminated vector of effectors and associated parameters. The latter are represented in ginr automata as chains of non-branching state transitions, which the ribose model compiler extracts as a single action referring to an effect vector. This is the first stage of matrix compression. Byte and signal inputs are treated differently in each state but often behave similarly across all states, so the transition matrix induces an equivalence relation on these basic input symbols, and equivalence class ordinals can stand in for symbols in the transducer matrix. This is the second stage of compression.
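The second compression stage can be illustrated by grouping input symbols whose transition matrix columns are identical in every state. This is a toy sketch of the idea, not the model compiler's code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of input-equivalence compression: two input symbols
// are equivalent when their columns in the transition matrix agree in
// every state, so one class ordinal can stand in for all of them.
// Hypothetical sketch, not the ribose model compiler's implementation.
public class InputEquivalence {
    // matrix[state][symbol] = next state; returns symbol -> class ordinal.
    static int[] classify(int[][] matrix) {
        int symbols = matrix[0].length;
        int[] classes = new int[symbols];
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (int s = 0; s < symbols; s++) {
            int[] column = new int[matrix.length];
            for (int q = 0; q < matrix.length; q++) column[q] = matrix[q][s];
            // identical columns map to the same ordinal
            classes[s] = seen.computeIfAbsent(Arrays.toString(column),
                    k -> seen.size());
        }
        return classes;
    }

    public static void main(String[] args) {
        // 2 states, 4 symbols; symbols 0 and 2 behave identically.
        int[][] matrix = { {1, 0, 1, 1}, {0, 1, 0, 1} };
        System.out.println(Arrays.toString(classify(matrix))); // → [0, 1, 0, 2]
    }
}
```

With the 4 symbols collapsed to 3 equivalence classes, the matrix needs one column per class rather than one per symbol.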

The ribose model compiler enumerates all maximal non-branching chains of input state transitions in the transducer matrix, as it does for effect vectors. These chains are diverted to product traps, which eliminate the intermediate states and erase the unique character of the involved bytes, leading to a more coarse-grained input equivalence relation. This final stage of compression reduces the transducer matrix in both dimensions, improves transition density (transitions/(states * input equivalents)), and constructs the product traps.
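Finding such chains reduces to walking states that have exactly one exit transition until a branching state or dead end is reached. A toy sketch under that assumption (hypothetical, not the model compiler's code):

```java
// Toy sketch of chain enumeration: a state is non-branching when exactly
// one input symbol leads out of it, and a maximal run of such states can
// be collapsed into a single product-trap transition. Hypothetical code,
// not the ribose model compiler's implementation.
public class ChainFinder {
    static final int NONE = -1;

    // Count consecutive single-exit states reachable from state q.
    static int chainLength(int[][] matrix, int q) {
        int length = 0;
        while (true) {
            int exit = NONE;
            for (int next : matrix[q]) {
                if (next == NONE) continue;
                if (exit != NONE) return length; // branching state ends chain
                exit = next;
            }
            if (exit == NONE) return length; // dead end
            length++;
            q = exit;
        }
    }

    public static void main(String[] args) {
        // States 0 and 1 each have one exit; state 2 branches.
        int[][] matrix = { {1, NONE}, {NONE, 2}, {3, 0}, {NONE, NONE} };
        System.out.println(chainLength(matrix, 0)); // → 2
    }
}
```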

Heap usage

[figure: heap usage]

Garbage collection metrics are logged for the output equivalence tests. Output is written through a buffered output stream to System.out and tested for binary equivalence. Java regex requires that the entire input corpus be presented as a single unit, so there is an up-front 60MB charge to prepare the input (20MB byte array decoded to a 40MB char array). The ribose output equivalence transductions incur the same 60MB charge as the regex tests but do not use these arrays; ribose input is presented in 64KB segments as read from the raw input file and transduced without decoding to Unicode. In all cases these input arrays persist in the heap for the duration of the benchmarking run. From an initial heap size of 2MB the heaps grow to a maximal extent of 62MB for the verbose GC log runs (regex and ribose) and 64MB and 76MB for the ribose and regex kernel log runs, respectively.

Heap usage is a strong differentiator for the ribose runtime. Data are extracted to a finite set of byte accumulators, analogous to registers in a von Neumann CPU, and their values are transient. Extracted data must be output or assimilated into the target model as soon as it becomes available. As a consequence, ribose transductions typically operate with a very small heap footprint, although ribose byte accumulators are allowed to grow without limit when required.
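The register-like accumulator behaviour described above could look something like the following sketch: a single reusable, growable buffer that a field is extracted into and cleared once its value has been emitted or assimilated. This is a hypothetical stand-in, not the ribose class:

```java
import java.util.Arrays;

// Sketch of a transient byte accumulator: reused across fields so the
// heap footprint stays small, but allowed to grow without limit when a
// field requires it. Hypothetical, not the ribose implementation.
public class ByteAccumulator {
    private byte[] data = new byte[64];
    private int length = 0;

    void append(byte b) {
        if (length == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow when required
        }
        data[length++] = b;
    }

    // Emit the accumulated value and reset for the next field; the value
    // is transient and must be consumed immediately.
    byte[] take() {
        byte[] value = Arrays.copyOf(data, length);
        length = 0;
        return value;
    }

    public static void main(String[] args) {
        ByteAccumulator acc = new ByteAccumulator();
        for (byte b : "eth0".getBytes()) acc.append(b);
        System.out.println(new String(acc.take())); // → eth0
    }
}
```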
