
c++ wrapper class for binary_fuse(8|16)_t, plus a "sharded" filter with mmap'ed file format #60

oschonrock opened this issue Dec 9, 2024 · 7 comments

@oschonrock
Contributor

oschonrock commented Dec 9, 2024

Having built up this wrapper class for my hibp project, it turned out non-trivial and I thought it might be useful for others, so I extracted it into a separate library project:

https://github.com/oschonrock/binfuse

Example use case of the sharded filter, from one of the tests:

    // build side: stream sorted hex keys into the sharded filter and persist to disk
    binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::write> sharded_sink(
        "sharded_filter8.bin");

    std::ifstream sample("sample.txt");
    sharded_sink.stream_prepare();
    for (std::string line; std::getline(sample, line);) {
      std::uint64_t key{};
      std::from_chars(line.data(), line.data() + line.size(), key, 16);
      sharded_sink.stream_add(key);
    }
    sharded_sink.stream_finalize();

    // query side: mmap the file just written, read-only
    const binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::read>
        sharded_source("sharded_filter8.bin");

    // full verify across all shards
    sample.clear(); // clear eof/fail state from the first pass before rewinding
    sample.seekg(0);
    for (std::string line; std::getline(sample, line);) {
      std::uint64_t needle{};
      std::from_chars(line.data(), line.data() + line.size(), needle, 16);
      EXPECT_TRUE(sharded_source.contains(needle));
    }

    EXPECT_LE(sharded_source.estimate_false_positive_rate(), 0.00005);

By default the binfuse::sharded_filter breaks the input stream (which MUST be sorted) into 256 shards (adjustable on construction) and saves the resulting filter into a custom, tagged binary format. From sharded_filter.hpp:

  /* file structure is as follows:
   *
   * header [0 -> 16) : small number of bytes identifying the type of file, the
   * type of filters contained and how many shards are contained.
   *
   * index [16 -> 16 + 8 * capacity() ): table of offsets to each
   * filter in the body. The offsets in the table are relative to the
   * start of the file.
   *
   * body [16 + 8 * capacity() -> end ): the filters: each one has
   * the filter_struct_fields (i.e. the "header") followed by the large
   * array of (8- or 16-bit) fingerprints. The offsets in the index
   * point to the start of the filter_header so that deserialize can be
   * called directly on that.
   *
   */
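For illustration, locating a shard's filter in this layout is a single index lookup. The following is a hypothetical sketch based purely on the comment above (16-byte header, 8-byte offsets relative to the start of the file); it is not part of the binfuse API:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical sketch: given the mmap'ed file base and a shard index, return
    // the offset (relative to the start of the file) of that shard's filter
    // "header", which is what deserialize would be called on.
    inline std::uint64_t shard_offset(const char* file_base, std::size_t shard_index) {
      constexpr std::size_t header_size = 16; // file header occupies [0 -> 16)
      std::uint64_t offset = 0;               // index entries are 8 bytes each
      std::memcpy(&offset, file_base + header_size + 8 * shard_index, sizeof(offset));
      return offset;
    }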

It uses the binfuse::filter class internally to wrap each 8- or 16-bit filter.

The singular binfuse::filter cannot currently be saved to disk on its own, as I didn't need that, but it may be useful?

What do you think about this API, and what is missing for it to be useful?

@lemire
Member

lemire commented Dec 9, 2024

See oschonrock/binfuse#1

It looks good!

@oschonrock
Contributor Author

oschonrock commented Dec 10, 2024

> See oschonrock/binfuse#1
>
> It looks good!

Thank you

I have listed out some specific API questions which I have doubts about. If you have a view on these, that would be helpful.

oschonrock/binfuse#2

@oschonrock
Contributor Author

UPDATES..

I have done some more work and thinking around this. I am comfortable that I can find solutions for 1-5 from oschonrock/binfuse#2.

@lemire, it would be helpful if you could comment on 6.:

  6. Is there any sense in, or need for, supporting xor_filters as well as binary_fuse_filters? The API looks basically similar and could likely be accommodated. Due to the template dispatch structure we have, it likely wouldn't add much code. Or are xor_filters really superseded?

since I know little about that, and possibly also on 7.:

  7. What are reasonable requirements for such a library? Right now this is a C++20 library, because it uses concepts and now std::span. I have pruned out std::format because support is still a little patchy, and I didn't want fmtlib as a dependency. The concept usage is limited (see the sketch after this list), so that could be removed. There is use of C++17 std::filesystem and std::from_chars, but that could also be relatively easily removed. mio is C++11 and xor_singleheader is C99, so are we inflating requirements unnecessarily to C++20 with std::span? My sense is that it should be either a C++20 or a C++17 library.
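For illustration only, the kind of constraint in question might look like the following. This is a hypothetical sketch, not binfuse's actual definition, and the commented-out alternative shows a rough C++17 equivalent if concepts were dropped:

    #include <concepts>
    #include <type_traits>

    #include "binaryfusefilter.h" // from xor_singleheader

    // Hypothetical sketch: a C++20 concept restricting a template parameter to
    // the two supported C filter structs.
    template <typename F>
    concept fuse_filter =
        std::same_as<F, binary_fuse8_t> || std::same_as<F, binary_fuse16_t>;

    template <fuse_filter F>
    class filter; // constrained template

    // Roughly equivalent C++17 formulation without concepts:
    // template <typename F,
    //           typename = std::enable_if_t<std::is_same_v<F, binary_fuse8_t> ||
    //                                       std::is_same_v<F, binary_fuse16_t>>>
    // class filter;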

@lemire
Member

lemire commented Dec 10, 2024

@oschonrock

> Is there any sense in, or need for, supporting xor_filters as well as binary_fuse_filters?

For small instances with few elements, XOR filters might be slightly smaller. However, this advantage is less significant since these filters are primarily useful for handling large datasets. If you're dealing with just 32 elements, for example, an XOR filter would still be my choice.

Nonetheless, there's a crossover point at which XOR filters become the more compact option. Obviously, you can make this transparent for the user, and just switch filter type based on the size, but we never implemented that.

I am not sure it matters.
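For illustration, such a transparent size-based switch might be sketched as below. This is a hypothetical example against the xor_singleheader C API; the crossover threshold is made up and would need to be measured, and error handling and freeing are omitted:

    #include <cstdint>
    #include <type_traits>
    #include <variant>
    #include <vector>

    #include "binaryfusefilter.h" // from xor_singleheader
    #include "xorfilter.h"        // from xor_singleheader

    // Illustrative only: below an (unmeasured) crossover size, build an xor8
    // filter; above it, build a binary_fuse8 filter.
    constexpr std::uint32_t kCrossover = 100'000; // made-up threshold

    using any_filter8 = std::variant<xor8_t, binary_fuse8_t>;

    inline any_filter8 build_filter(std::vector<std::uint64_t>& keys) {
      const auto n = static_cast<std::uint32_t>(keys.size());
      if (n < kCrossover) {
        xor8_t f;
        xor8_allocate(n, &f);
        xor8_populate(keys.data(), n, &f);
        return f;
      }
      binary_fuse8_t f;
      binary_fuse8_allocate(n, &f);
      binary_fuse8_populate(keys.data(), n, &f);
      return f;
    }

    inline bool contains(const any_filter8& f, std::uint64_t key) {
      return std::visit([key](const auto& impl) {
        if constexpr (std::is_same_v<std::decay_t<decltype(impl)>, xor8_t>)
          return xor8_contain(key, &impl);
        else
          return binary_fuse8_contain(key, &impl);
      }, f);
    }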

> What are reasonable requirements for such a library? Right now this is a C++20 library

Node.js, Chrome and many important systems have switched to C++20.

I recommend C++20.

@oschonrock
Contributor Author

oschonrock commented Dec 11, 2024

OK, I am fairly happy with the library as a first public cut now.

I have written some content for the README.

Are you OK with how that links back to here?

https://github.com/oschonrock/binfuse

I am currently working on some benchmarks.

@lemire
Member

lemire commented Dec 11, 2024

> Are you OK with how that links back to here?

That's fine. There are no requirements regarding links. It is good to refer users to the paper!

@oschonrock
Contributor Author

Thanks for the endorsement in the README.

I have some benchmark results which you may find interesting:

  1. For 100 million keys with 8 shards, i.e. 12.5 million per shard, I am getting ~70 ns populate time per key, which is very similar to your own benchmark. This was expected, but a good cross check.

  2. For that same filter with 8 × 12.5 million keys, I am getting 30 ns/key to "verify" (within one shard, while the filter is fully in memory), and 50 ns/key for "query" (across the whole sharded filter, and through the mmap).

  3. Populate time drops as the number of shards increases, to 50 ns/key at 256 shards. This is probably due to less memory paging and sorting?

  4. Peak memory consumption during build - not shown in the benchmark and not that easy to measure - drops more than linearly as the number of shards increases. Also expected, but pleasing, as this was the main motivation.

Also refer to the notes on memory consumption during querying via mmap. These may be obvious?
