Provides a customized lexical tokenizer as a Spark UDF and a bloom filter aggregator. These are used to tokenize Spark string columns and to aggregate the tokens into a bloom filter with configurable filter size selection.
Spark UDF that tokenizes an incoming string value and returns it as a list of byte arrays. The tokenization rules are defined in blf_01.
Spark UDF that converts the results of TokenizerUDF into a list of strings.
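The two UDFs are intended to be chained: tokenize a string column into byte-array tokens, then convert those tokens back into strings when readable output is needed. As a rough conceptual sketch of that behavior (the whitespace split is a stand-in assumption, not the actual blf_01 tokenization rules, and this plain Python is not the Spark UDF API):

```python
# Illustrative sketch only: real tokenization rules live in blf_01;
# splitting on whitespace here is an assumption for demonstration.
def tokenize(value: str) -> list[bytes]:
    """Mimics TokenizerUDF: a string in, a list of byte-array tokens out."""
    return [token.encode("utf-8") for token in value.split()]

def tokens_to_strings(tokens: list[bytes]) -> list[str]:
    """Mimics the string-converting UDF: byte arrays back into strings."""
    return [token.decode("utf-8") for token in tokens]

tokens = tokenize("GET /index.html 200")
print(tokens_to_strings(tokens))  # ['GET', '/index.html', '200']
```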
Custom Spark aggregator that aggregates a string column's tokens into a single bloom filter and returns the bytes of the resulting filter for further processing.
The filter size is selected by giving the aggregator the name of a Spark column that holds an estimated token count and by configuring a map of bloom filter presets (expected number of items, false positive probability).
See the official documentation on docs.teragrep.com.
You can get involved with our project by opening an issue or submitting a pull request.
Contribution requirements:

- All changes must be accompanied by a new or changed test. If you think testing is not required in your pull request, include a sufficient explanation of why you think so.
- Security checks must pass.
- Pull requests must align with the principles and values of extreme programming.
- Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).
Read more in our Contributing Guideline.
Contributors must sign the Teragrep Contributor License Agreement before a pull request is accepted into the organization's repositories.
You need to submit the CLA only once; after that, you can contribute to all of Teragrep's repositories.