
c++ wrapper class for binary_fuse(8|16)_t, plus a "sharded" filter with mmap'ed file format #60

oschonrock opened this issue Dec 9, 2024 · 7 comments

@oschonrock
Contributor

oschonrock commented Dec 9, 2024

Having built up this wrapper class for my hibp project, it turned out non-trivial and I thought it might be useful for others, so I extracted it into a separate library project:

https://github.com/oschonrock/binfuse

Example use case of the sharded filter, from one of the tests:

    // build side: stream sorted hex keys into the sharded filter and persist to disk
    binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::write> sharded_sink(
        "sharded_filter8.bin");

    std::ifstream sample("sample.txt");
    sharded_sink.stream_prepare();
    for (std::string line; std::getline(sample, line);) {
      std::uint64_t key{};
      std::from_chars(line.data(), line.data() + line.size(), key, 16);
      sharded_sink.stream_add(key);
    }
    sharded_sink.stream_finalize();

    // query side: mmap the file just written, read-only
    const binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::read>
        sharded_source("sharded_filter8.bin");

    // full verify across all shards
    sample.clear(); // clear eof/fail state from the first pass before rewinding
    sample.seekg(0);
    for (std::string line; std::getline(sample, line);) {
      std::uint64_t needle{};
      std::from_chars(line.data(), line.data() + line.size(), needle, 16);
      EXPECT_TRUE(sharded_source.contains(needle));
    }

    EXPECT_LE(sharded_source.estimate_false_positive_rate(), 0.00005);

By default the binfuse::sharded_filter breaks the input stream (which MUST be sorted) into 256 shards (adjustable on construction) and saves the resulting filter into a custom, tagged binary format. From sharded_filter.hpp:

  /* file structure is as follows:
   *
   * header [0 -> 16) : small number of bytes identifying the type of file, the
   * type of filters contained and how many shards are contained.
   *
   * index [16 -> 16 + 8 * capacity() ): table of offsets to each
   * filter in the body. The offsets in the table are relative to the
   * start of the file.
   *
   * body [16 + 8 * capacity() -> end ): the filters: each one has
   * the filter_struct_fields (i.e. the "header") followed by the large
   * array of (8- or 16-bit) fingerprints. The offsets in the index
   * point to the start of the filter_header so that deserialize can be
   * called directly on that.
   *
   */
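For illustration, locating a shard's filter in this layout is a single index lookup. The following is a hypothetical sketch based purely on the comment above (16-byte header, 8-byte offsets relative to the start of the file); it is not part of the binfuse API:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical sketch: given the mmap'ed file base and a shard index, return
    // the offset (relative to the start of the file) of that shard's filter
    // "header", which is what deserialize would be called on.
    inline std::uint64_t shard_offset(const char* file_base, std::size_t shard_index) {
      constexpr std::size_t header_size = 16; // file header occupies [0 -> 16)
      std::uint64_t offset = 0;               // index entries are 8 bytes each
      std::memcpy(&offset, file_base + header_size + 8 * shard_index, sizeof(offset));
      return offset;
    }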

It uses the binfuse::filter class internally to wrap each 8- or 16-bit filter.

The singular binfuse::filter cannot currently be saved to disk on its own, as I didn't need that, but it may be useful?

What do you think about this API, and what is missing for it to be useful?

@lemire
Member

lemire commented Dec 9, 2024

See oschonrock/binfuse#1

It looks good!

@oschonrock
Contributor Author

oschonrock commented Dec 10, 2024

> See oschonrock/binfuse#1
>
> It looks good!

Thank you

I have listed out some specific API questions which I have doubts about. If you have a view on these, that would be helpful.

oschonrock/binfuse#2

@oschonrock
Contributor Author

UPDATES..

I have done some more work and thinking around this. I am comfortable that I can find solutions for 1-5 from oschonrock/binfuse#2.

@lemire, it would be helpful if you could comment on 6.:

  6. Is there any sense in, or need for, supporting xor_filters as well as binary_fuse_filters? The API looks basically similar and could likely be accommodated. Due to the template dispatch structure we have, it likely wouldn't add much code. Or are xor_filters really superseded?

since I know little about that, and possibly also on 7.:

  7. What are reasonable requirements for such a library? Right now this is a C++20 library, because it uses concepts and now std::span. I have pruned out std::format because support is still a little patchy, and I didn't want fmtlib as a dependency. The concept usage is limited (see the sketch after this list), so that could be removed. There is use of C++17 std::filesystem and std::from_chars, but that could also be relatively easily removed. mio is C++11 and xor_singleheader is C99, so are we inflating requirements unnecessarily to C++20 with std::span? My sense is that it should be either a C++20 or a C++17 library.
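For illustration only, the kind of constraint in question might look like the following. This is a hypothetical sketch, not binfuse's actual definition, and the commented-out alternative shows a rough C++17 equivalent if concepts were dropped:

    #include <concepts>
    #include <type_traits>

    #include "binaryfusefilter.h" // from xor_singleheader

    // Hypothetical sketch: a C++20 concept restricting a template parameter to
    // the two supported C filter structs.
    template <typename F>
    concept fuse_filter =
        std::same_as<F, binary_fuse8_t> || std::same_as<F, binary_fuse16_t>;

    template <fuse_filter F>
    class filter; // constrained template

    // Roughly equivalent C++17 formulation without concepts:
    // template <typename F,
    //           typename = std::enable_if_t<std::is_same_v<F, binary_fuse8_t> ||
    //                                       std::is_same_v<F, binary_fuse16_t>>>
    // class filter;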

@lemire
Member

lemire commented Dec 10, 2024

@oschonrock

> Is there any sense in, or need for, supporting xor_filters as well as binary_fuse_filters?

For small instances with few elements, XOR filters might be slightly smaller. However, this advantage is less significant since these filters are primarily useful for handling large datasets. If you're dealing with just 32 elements, for example, an XOR filter would still be my choice.

Nonetheless, there's a crossover point at which XOR filters become the more compact option. Obviously, you can make this transparent for the user, and just switch filter type based on the size, but we never implemented that.

I am not sure it matters.
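For illustration, such a transparent size-based switch might be sketched as below. This is a hypothetical example against the xor_singleheader C API; the crossover threshold is made up and would need to be measured, and error handling and freeing are omitted:

    #include <cstdint>
    #include <type_traits>
    #include <variant>
    #include <vector>

    #include "binaryfusefilter.h" // from xor_singleheader
    #include "xorfilter.h"        // from xor_singleheader

    // Illustrative only: below an (unmeasured) crossover size, build an xor8
    // filter; above it, build a binary_fuse8 filter.
    constexpr std::uint32_t kCrossover = 100'000; // made-up threshold

    using any_filter8 = std::variant<xor8_t, binary_fuse8_t>;

    inline any_filter8 build_filter(std::vector<std::uint64_t>& keys) {
      const auto n = static_cast<std::uint32_t>(keys.size());
      if (n < kCrossover) {
        xor8_t f;
        xor8_allocate(n, &f);
        xor8_populate(keys.data(), n, &f);
        return f;
      }
      binary_fuse8_t f;
      binary_fuse8_allocate(n, &f);
      binary_fuse8_populate(keys.data(), n, &f);
      return f;
    }

    inline bool contains(const any_filter8& f, std::uint64_t key) {
      return std::visit([key](const auto& impl) {
        if constexpr (std::is_same_v<std::decay_t<decltype(impl)>, xor8_t>)
          return xor8_contain(key, &impl);
        else
          return binary_fuse8_contain(key, &impl);
      }, f);
    }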

> What are reasonable requirements for such a library? Right now this is a C++20 library

Node.js, Chrome and many important systems have switched to C++20.

I recommend C++20.

@oschonrock
Contributor Author

oschonrock commented Dec 11, 2024

OK, I am fairly happy with the library as a first public cut now.

I have written some content for the README.

Are you OK with how that links back to here?

https://github.com/oschonrock/binfuse

I am currently working on some benchmarks.

@lemire
Member

lemire commented Dec 11, 2024

> Are you OK with how that links back to here?

That's fine. There are no requirements regarding links. It is good to refer users to the paper!

@oschonrock
Contributor Author

Thanks for the endorsement in the README.

I have some benchmark results which you may find interesting:

  1. For 100 million keys with 8 shards, i.e. 12.5 million per shard, I am getting ~70 ns populate time per key, which is very similar to your own benchmark. This was expected, but a good cross check.

  2. For that same filter with 8 × 12.5 million keys, I am getting 30 ns/key to "verify" (within one shard, while the filter is fully in memory), and 50 ns/key for "query" (across the whole sharded filter, and through the mmap).

  3. Populate time drops as the number of shards increases, to 50 ns/key at 256 shards. This is probably due to less memory paging and sorting?

  4. Peak memory consumption during build - not shown in the benchmark and not that easy to measure - drops more than linearly as the number of shards increases. Also expected, but pleasing, as this was the main motivation.

Also refer to the notes on memory consumption during querying via mmap. These may be obvious?
