Provides a customized lexical tokenizer as a Spark UDF and a bloom filter aggregator. These are used to tokenize Spark string columns and to aggregate the tokens into a bloom filter with configurable filter size selection.
Spark UDF that tokenizes an incoming string value and returns it as a list of byte arrays. The tokenization rules are defined in blf_01.
Spark UDF that converts the results of TokenizerUDF into a list of strings.
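The two UDFs are intended to be chained: tokenize a string column into byte-array tokens, then convert those tokens back into strings when readable output is needed. As a rough conceptual sketch of that behavior (the whitespace split is a stand-in assumption, not the actual blf_01 tokenization rules, and this plain Python is not the Spark UDF API):

```python
# Illustrative sketch only: real tokenization rules live in blf_01;
# splitting on whitespace here is an assumption for demonstration.
def tokenize(value: str) -> list[bytes]:
    """Mimics TokenizerUDF: a string in, a list of byte-array tokens out."""
    return [token.encode("utf-8") for token in value.split()]

def tokens_to_strings(tokens: list[bytes]) -> list[str]:
    """Mimics the string-converting UDF: byte arrays back into strings."""
    return [token.decode("utf-8") for token in tokens]

tokens = tokenize("GET /index.html 200")
print(tokens_to_strings(tokens))  # ['GET', '/index.html', '200']
```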
Custom Spark aggregator that aggregates a string column's tokens into a single bloom filter and returns the bytes of the resulting filter for further processing.
The filter size is selected by giving the aggregator the name of a Spark column that holds an estimated token count and by configuring a map of bloom filter presets (expected number of items, false positive probability).
See the official documentation on docs.teragrep.com.
You can get involved with our project by opening an issue or submitting a pull request.
Contribution requirements:

- All changes must be accompanied by a new or changed test. If you think testing is not required in your pull request, include a sufficient explanation of why you think so.
- Security checks must pass.
- Pull requests must align with the principles and values of extreme programming.
- Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).
Read more in our Contributing Guideline.
Contributors must sign the Teragrep Contributor License Agreement before a pull request is accepted into the organization's repositories.
You need to submit the CLA only once; after that, you can contribute to all of Teragrep's repositories.