This repository contains C++ bindings between Apache Datasketches library and Vertica Database. Datasketch functions are installed as UDX in Vertica so they can be run as SQL.
Details on the library and underlying algorithm can be found at https://datasketches.apache.org/
This extensions uses the open-source C++ implementation from https://github.com/apache/datasketches-cpp
Currently the theta sketch, Hll (HyperLogLog) sketch, and frequency sketch are implemented for Vertica, see examples below.
LogLogBeta is built into Vertica as approximate count distinct and synopsis aggregate functions. It's possible to run COUNT DISTINCT, save the result as a synopsis, and perform counts and merges on the synopsis. Please see the documentation at https://www.vertica.com/docs/latest/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingCOUNTDISTINCTByCalculatingApproximateCounts.htm
This library requires cmake 3.14+ "yum install cmake3" package should install the correct version, though you may have to build from source following this Gist: https://gist.github.com/1duo/38af1abd68a2c7fe5087532ab968574e. Then run:
mkdir build
cd build
cmake ../SOURCES
make
Additional build options can be enabled by running ccmake.
To install, copy the library and SOURCES/install.sql to a Vertica node. Edit install.sql and copy the correct library path and file name at the top, then run with vsql -f install.sql
Theta sketch and Hll sketch used to estimate cardinality. Frequency sketch is used to estimate most common items. Consider the following table and sketches:
dbadmin=> select * from freq order by v1;
v1
----
a
a
a
b
b
c
(6 rows)
dbadmin=> select theta_sketch_get_estimate(theta_sketch_create(v1)) from freq;
theta_sketch_get_estimate
---------------------------
3
(1 row)
dbadmin=> select hll_sketch_create(v1) from freq;
hll_sketch_create
-------------------
3
(1 row)
dbadmin=> select frequency_sketch_create(v1) from freq;
frequency_sketch_create
-------------------------
[[a,3],[b,2],[c,1]]
(1 row)
Theta sketches also support set operations: intersection, union, difference (as a_not_b). Consider the following tables and examples:
Table setA, varchar field v1: a,b,c,d,e
Table setB, varchar field v1: c,d,e,f,g
dbadmin=> select theta_sketch_get_estimate(
theta_sketch_union(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
theta_sketch_get_estimate
---------------------------
7
dbadmin=> select theta_sketch_get_estimate(
theta_sketch_intersection(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
theta_sketch_get_estimate
---------------------------
3
dbadmin=> select theta_sketch_get_estimate(
theta_sketch_a_not_b(theta_sketch_create(setA.v1),theta_sketch_create(setB.v1))
) from setA, setB;
theta_sketch_get_estimate
---------------------------
2
Vertica sends data to UDX in blocks, and depending on the type and size of the field(s) you are using in sketches, the block size may be too small for efficient batching, resulting in long UDX calls.
You can increase the block size to send more records per batch as follows:
ALTER SESSION SET MaxDesiredEEBlockSize = 25165824;
It's recommended to set exact MB sizes (this example is 24 MB), and realize that more is not always better, I've found that 16-48 MB blocks are usually best.