Lookup table v1 implementation
Changelog-added: New lookup table implementation available
Signed-off-by: Michal Siedlaczek <michal@siedlaczek.me>
elshize committed Jan 12, 2025
1 parent c42c1e7 commit fc2f633
Showing 14 changed files with 1,597 additions and 4 deletions.
4 changes: 4 additions & 0 deletions docs/src/SUMMARY.md
@@ -46,3 +46,7 @@
- [`taily-stats`](cli/taily-stats.md)
- [`taily-thresholds`](cli/taily-thresholds.md)
- [`thresholds`](cli/thresholds.md)

# Specifications

- [Lookup Table](specs/lookup-table.md)
112 changes: 112 additions & 0 deletions docs/src/specs/lookup-table.md
@@ -0,0 +1,112 @@
# Lookup Table Format Specification

A lookup table is a bidirectional mapping from an index, representing an
internal ID, to a binary payload, such as a string. E.g., an `N`-element
lookup table maps values `0...N-1` to their payloads. These tables are
used for things like mapping terms to term IDs and document IDs to
titles or URLs.

The format of a lookup table is designed to operate without having to
parse the entire structure. Once the header is parsed, it is possible to
operate directly on the binary format to access the data. In fact, a
lookup table will typically be memory mapped. Therefore, it is possible
to perform a lookup (or reverse lookup) without loading the entire
structure into memory.

The header always begins as follows:

```
+--------+--------+--------   -+
|  0x87  |  Ver.  |    ...     |
+--------+--------+--------   -+
```

The first byte is a constant identifier. When reading, we can verify
whether this byte is correct to make sure we are using the correct type
of data structure.

The second byte is equal to the version of the format.

The remainder of the format is defined separately for each version. The
version is introduced in order to be able to update the format in the
future but still be able to read old formats for backwards
compatibility.

## v1

```
+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x87  |  0x01  | Flags  |                    0x00                    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                Length                                 |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                                                       |
|                                Offsets                                |
|                                                                       |
+-----------------------------------------------------------------------+
|                                                                       |
|                               Payloads                                |
|                                                                       |
+-----------------------------------------------------------------------+
```

Immediately after the version byte, we have the flags byte.

```
 MSB                         LSB
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | W | S |
+---+---+---+---+---+---+---+---+
```

The least significant bit (`S`) indicates whether the payloads are
sorted (1) or not (0). The next bit (`W`) defines the width of offsets
(see below): 32-bit (0) or 64-bit (1). In most use cases, the cumulative
size of the payloads will be small enough to address with 32-bit
offsets. For example, if we store words that are 16 bytes long on
average, we can address over 200 million of them. For this many
elements, reducing the width of the offsets would save us over 700 MB.
Still, we want to support 64-bit addressing because some payloads may be
much longer (e.g., URLs).

The rest of the bits in the flags byte are currently not used, but
should be set to 0 to make sure that if more flags are introduced, we
know what values to expect in the older iterations, and thus we can make
sure to keep it backwards-compatible.
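
The flags byte can be decoded with two bit masks. The sketch below is only an illustration: `DecodedFlags` and `decode_flags` are hypothetical names, and rejecting a byte with any unused bit set is an assumption about how a strict reader might behave, since the text above only says that writers should set those bits to 0. The `static_assert` restates the 32-bit capacity estimate for 16-byte payloads.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Bit masks mirroring the layout of the flags byte.
inline constexpr std::uint8_t kSorted = 0b01;       // S: payloads are sorted
inline constexpr std::uint8_t kWideOffsets = 0b10;  // W: offsets are 64-bit

// Decoded view of the flags byte (hypothetical type for illustration).
struct DecodedFlags {
    bool sorted;
    bool wide_offsets;

    // Width of a single offset in bytes: 4 for 32-bit, 8 for 64-bit.
    [[nodiscard]] constexpr auto offset_width() const noexcept -> std::size_t {
        return wide_offsets ? 8 : 4;
    }
};

// Decodes the flags byte. Assumption: a strict reader treats any set bit
// outside of S and W as an error and returns std::nullopt.
constexpr auto decode_flags(std::uint8_t byte) -> std::optional<DecodedFlags> {
    constexpr std::uint8_t known = kSorted | kWideOffsets;
    if ((byte & ~known) != 0) {
        return std::nullopt;
    }
    return DecodedFlags{(byte & kSorted) != 0, (byte & kWideOffsets) != 0};
}

// With 32-bit offsets we can address 2^32 payload bytes, which is room for
// 268,435,456 payloads of 16 bytes each ("over 200 million").
static_assert((std::uint64_t{1} << 32) / 16 == 268'435'456ULL);
```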

The following 5 bytes are padding with values of 0. This is to help with
byte alignment. When loaded to memory, it should be loaded with 8-byte
alignment. When memory mapped, it should be already correctly aligned by
the operating system (at least on Linux).

Following the padding, there is a 64-bit unsigned integer encoding the
number of elements in the table (`N`).

Given `N` and `W`, we can now calculate the byte range of all offsets,
and thus the address offset for the start of the payloads. The offsets
are `N+1` little-endian unsigned integers of size determined by `W`
(either 4 or 8 bytes). The offsets are associated with consecutive IDs
from 0 to `N-1`; the last of the `N+1` offsets points at the first byte
after the last payload. The offsets are relative to the beginning of the
first payload, therefore the first offset will always be 0.
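
The calculation described above can be sketched as follows. `V1Layout` and `v1_layout` are hypothetical names used only for illustration. The sketch assumes the 16-byte fixed header shown in the diagram (8 bytes for identifier, version, flags, and padding, plus 8 bytes for the length) and that the payloads start immediately after the last offset.

```cpp
#include <cstdint>

// Byte positions of the two variable-length regions, measured from the
// start of the encoded table. Hypothetical helper type for illustration.
struct V1Layout {
    std::uint64_t offsets_begin;   // first byte of the offsets array
    std::uint64_t payloads_begin;  // first byte of the payloads region
};

// Computes where the offsets and the payloads start, given the number of
// elements `n` and the value of the `W` flag.
constexpr auto v1_layout(std::uint64_t n, bool wide_offsets) -> V1Layout {
    // 8 bytes (identifier, version, flags, 5 padding bytes) + 8 bytes for N.
    constexpr std::uint64_t header_size = 16;
    std::uint64_t const width = wide_offsets ? 8 : 4;
    // There are N + 1 offsets: one per payload plus the final end offset.
    return V1Layout{header_size, header_size + (n + 1) * width};
}
```

For example, a table with 3 elements and 32-bit offsets has 4 offsets occupying bytes 16 to 31, so its payloads begin at byte 32.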

Payloads are arbitrary bytes, and must be interpreted by the software.
Although the typical use case is strings, a payload can be any binary
data. Note that strings will not be 0-terminated unless they were
specifically stored as such. Although this should be clear from the
fact that a payload is simply a sequence of bytes, it is prudent to
point it out. One must be extremely careful when using C-style strings:
their use is contingent on the correct values having been inserted and
encoded in the first place, and assuming 0-terminated strings may easily
lead to undefined behavior. It is therefore recommended to store strings
without terminating them, and then interpret them as string views (such
as `std::string_view`) instead of C-style strings.

The boundaries of the k-th payload are defined by the values of k-th and
(k+1)-th offsets. Note that because of the additional offset that points
to immediately after the last payload, we can read offsets `k` and `k+1`
for any index `k < N` (recall that `N` is the number of elements).

If the payloads are sorted (S), we can find an ID of a certain payload
with a binary search. This is crucial for any application that requires
mapping from payloads to their position in the table.
2 changes: 1 addition & 1 deletion include/pisa/io.hpp
@@ -36,7 +36,7 @@ template <typename Function>
 void for_each_line(std::istream& is, Function fn) {
     std::string line;
     while (std::getline(is, line)) {
-        fn(line);
+        fn(std::move(line));
     }
 }

229 changes: 229 additions & 0 deletions include/pisa/lookup_table.hpp
@@ -0,0 +1,229 @@
// Copyright 2024 PISA developers
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <concepts>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <ostream>
#include <span>

namespace pisa::lt {

namespace detail {

class BaseLookupTable {
  public:
    virtual ~BaseLookupTable() = default;
    [[nodiscard]] virtual auto size() const noexcept -> std::size_t = 0;
    [[nodiscard]] virtual auto operator[](std::size_t idx) const
        -> std::span<std::byte const> = 0;
    [[nodiscard]] virtual auto find(std::span<std::byte const> value) const noexcept
        -> std::optional<std::size_t> = 0;

    [[nodiscard]] virtual auto clone() -> std::unique_ptr<BaseLookupTable> = 0;
};

class BaseLookupTableEncoder {
  public:
    virtual ~BaseLookupTableEncoder() = default;
    virtual void insert(std::span<std::byte const> payload) = 0;
    virtual void encode(std::ostream& out) = 0;
};

} // namespace detail

namespace v1 {

class Flags {
  private:
    std::uint8_t flags = 0;

  public:
    constexpr Flags() = default;
    explicit constexpr Flags(std::uint8_t bitset) : flags(bitset) {}

    [[nodiscard]] auto sorted() const noexcept -> bool;
    [[nodiscard]] auto wide_offsets() const noexcept -> bool;
    [[nodiscard]] auto bits() const noexcept -> std::uint8_t;
};

namespace flags {
    inline constexpr std::uint8_t SORTED = 0b001;
    inline constexpr std::uint8_t WIDE_OFFSETS = 0b010;
} // namespace flags

} // namespace v1

} // namespace pisa::lt

namespace pisa {

/**
 * Lookup table mapping integers from a range [0, N) to binary payloads.
 *
 * This table assigns each _unique_ value (duplicates are not allowed) to an index in [0, N), where
 * N is the size of the table. Thus, this structure is equivalent to a sequence of binary values.
 * The difference between `LookupTable` and, say, `std::vector` is that its encoding supports
 * reading the values without fully parsing the entire binary representation of the table. As such,
 * it supports quickly initializing the structure from an external device (with random access),
 * e.g., via mmap, and performing a lookup without loading the entire structure to main memory.
 * This is especially useful for short-lived programs that must perform a lookup without the
 * unnecessary overhead of loading it to memory.
 *
 * If the values are sorted, and the appropriate flag is toggled in the header, a quick binary
 * search lookup can be performed to find an index of a value. If the values are not sorted, then a
 * linear scan will be used; therefore, one should consider having values sorted if such lookups are
 * important. Getting the value at a given index is a constant-time operation, though if using
 * memory mapping, each such operation may need to load multiple pages to memory.
 */
class LookupTable {
  private:
    std::unique_ptr<::pisa::lt::detail::BaseLookupTable> m_impl;

    explicit LookupTable(std::unique_ptr<::pisa::lt::detail::BaseLookupTable> impl);

    [[nodiscard]] static auto v1(std::span<const std::byte> bytes) -> LookupTable;

  public:
    LookupTable(LookupTable const&);
    LookupTable(LookupTable&&);
    LookupTable& operator=(LookupTable const&);
    LookupTable& operator=(LookupTable&&);
    ~LookupTable();

    /**
     * The number of elements in the table.
     */
    [[nodiscard]] auto size() const noexcept -> std::size_t;

    /**
     * Retrieves the value at index `idx`.
     *
     * If `idx >= size()`, then `std::out_of_range` exception is thrown. See `at()` if you want to
     * conveniently cast the memory span to another type.
     */
    [[nodiscard]] auto operator[](std::size_t idx) const -> std::span<std::byte const>;

    /**
     * Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
     *
     * See the templated version of this function if you want to automatically cast from another
     * type to byte span.
     */
    [[nodiscard]] auto find(std::span<std::byte const> value) const noexcept
        -> std::optional<std::size_t>;

    /**
     * Returns the value at index `idx` cast to type `T`.
     *
     * The type `T` must define `T::value_type` that resolves to a byte-wide type, as well as a
     * constructor that takes `T::value_type const*` (pointer to the first byte) and `std::size_t`
     * (the total number of bytes). If `T::value_type` is longer than 1 byte, this operation results
     * in **undefined behavior**.
     *
     * Examples of types that can be used are: `std::string_view` or `std::span<const char>`.
     */
    template <typename T>
    [[nodiscard]] auto at(std::size_t idx) const -> T {
        auto bytes = this->operator[](idx);
        return T(reinterpret_cast<typename T::value_type const*>(bytes.data()), bytes.size());
    }

    /**
     * Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
     *
     * The type `T` of the value must be such that `std::span<typename T::value_type const>` is
     * constructible from `T`.
     */
    template <typename T>
        requires(std::constructible_from<std::span<typename T::value_type const>, T>)
    [[nodiscard]] auto find(T value) const noexcept -> std::optional<std::size_t> {
        return find(std::as_bytes(std::span<typename T::value_type const>(value)));
    }

    /**
     * Constructs a lookup table from the encoded sequence of bytes.
     */
    [[nodiscard]] static auto from_bytes(std::span<std::byte const> bytes) -> LookupTable;
};

/**
 * Lookup table encoder.
 *
 * This class builds and encodes a sequence of values to the binary format of lookup table.
 * See `LookupTable` for more details.
 *
 * Note that all encoded data is accumulated in memory and only flushed to the output stream when
 * `encode()` member function is called.
 */
class LookupTableEncoder {
    std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> m_impl;

    explicit LookupTableEncoder(std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> impl);

  public:
    /**
     * Constructs an encoder for a lookup table in v1 format, with the given flag options.
     *
     * If sorted flag is _not_ set, then an additional hash set will be produced to keep track of
     * duplicates. This will increase the memory footprint at build time.
     */
    static LookupTableEncoder v1(::pisa::lt::v1::Flags flags);

    /**
     * Inserts payload.
     *
     * If sorted flag was set at construction time, it will throw if the given payload is not
     * lexicographically greater than the previously inserted payload. If sorted flag was _not_ set
     * and the given payload has already been inserted, it will throw as well.
     */
    auto insert(std::span<std::byte const> payload) -> LookupTableEncoder&;

    /**
     * Writes the encoded table to the output stream.
     */
    auto encode(std::ostream& out) -> LookupTableEncoder&;

    /**
     * Inserts a payload of type `Payload`.
     *
     * `std::span<typename Payload::value_type const>` must be constructible from `Payload`, which
     * in turn will be cast as byte span before calling the non-templated version of `insert()`.
     */
    template <typename Payload>
        requires(std::constructible_from<std::span<typename Payload::value_type const>, Payload>)
    auto insert(Payload const& payload) -> LookupTableEncoder& {
        insert(std::as_bytes(std::span(payload)));
        return *this;
    }

    /**
     * Inserts all payloads in the given span.
     *
     * It calls `insert()` for each element in the span. See `insert()` for more details.
     */
    template <typename Payload>
    auto insert_span(std::span<Payload const> payloads) -> LookupTableEncoder& {
        for (auto const& payload: payloads) {
            insert(payload);
        }
        return *this;
    }
};

} // namespace pisa