Lookup table v1 implementation

Changelog-added: New lookup table implementation available
Signed-off-by: Michal Siedlaczek <michal@siedlaczek.me>
pisa-engine · Jan 12, 2025 · fc2f633
1 parent c42c1e7
Showing 14 changed files with 1,597 additions and 4 deletions.
diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
@@ -46,3 +46,7 @@
 - [`taily-stats`](cli/taily-stats.md)
 - [`taily-thresholds`](cli/taily-thresholds.md)
 - [`thresholds`](cli/thresholds.md)
+
+# Specifications
+
+- [Lookup Table](specs/lookup-table.md)
diff --git a/docs/src/specs/lookup-table.md b/docs/src/specs/lookup-table.md
@@ -0,0 +1,112 @@
+# Lookup Table Format Specification
+
+A lookup table is a bidirectional mapping from an index, representing an
+internal ID, to a binary payload, such as a string. E.g., an `N`-element
+lookup table maps values `0...N-1` to their payloads. These tables are
+used for things like mapping terms to term IDs and document IDs to
+titles or URLs.
+
+The format of a lookup table is designed to operate without having to
+parse the entire structure. Once the header is parsed, it is possible to
+operate directly on the binary format to access the data. In fact, a
+lookup table will typically be memory mapped. Therefore, it is possible
+to perform a lookup (or reverse lookup) without loading the entire
+structure into memory.
+
+The header always begins as follows:
+
+```
++--------+--------+--------   -+
+|  0x87  |  Ver.  |        ... |
++--------+--------+--------   -+
+```
+
+The first byte is a constant identifier. When reading, we can verify
+whether this byte is correct to make sure we are using the correct type
+of data structure.
+
+The second byte is equal to the version of the format.
+
+The remainder of the format is defined separately for each version. The
+version is introduced in order to be able to update the format in the
+future but still be able to read old formats for backwards
+compatibility.
+
+## v1
+
+```
++--------+--------+--------+--------+--------+--------+--------+--------+
+|  0x87  |  0x01  | Flags  |                    0x00                    |
++--------+--------+--------+--------+--------+--------+--------+--------+
+|                                 Length                                |
++--------+--------+--------+--------+--------+--------+--------+--------+
+|                                                                       |
+|                                Offsets                                |
+|                                                                       |
++-----------------------------------------------------------------------+
+|                                                                       |
+|                                Payloads                               |
+|                                                                       |
++-----------------------------------------------------------------------+
+```
+
+Immediately after the version byte, we have the flags byte.
+
+```
+ MSB                         LSB
++---+---+---+---+---+---+---+---+
+| 0 | 0 | 0 | 0 | 0 | 0 | W | S |
++---+---+---+---+---+---+---+---+
+```
+
+The least significant bit (`S`) indicates whether the payloads are sorted
+(1) or not (0). The next bit (`W`) defines the width of offsets (see below):
+32-bit (0) or 64-bit (1). In most use cases, the cumulative size of the
+payloads will be small enough to address it by 32-bit offsets. For
+example, if we store words that are 16-bytes long on average, we can
+address over 200 million of them. For this many elements, reducing the
+width of the offsets would save us over 700 MB. Still, we want to
+support 64-bit addressing because some payloads may be much longer
+(e.g., URLs).
+
+The rest of the bits in the flags byte are currently not used, but
+should be set to 0 to make sure that if more flags are introduced, we
+know what values to expect in the older iterations, and thus we can make
+sure to keep it backwards-compatible.
+
+The following 5 bytes are padding with values of 0. This is to help with
+byte alignment. When loaded to memory, it should be loaded with 8-byte
+alignment. When memory mapped, it should be already correctly aligned by
+the operating system (at least on Linux).
+
+Following the padding, there is a 64-bit unsigned integer encoding the
+number of elements in the lexicon (`N`).
+
+Given `N` and `W`, we can now calculate the byte range of all offsets,
+and thus the address offset for the start of the payloads. The offsets
+are `N+1` little-endian unsigned integers of size determined by `W`
+(either 4 or 8 bytes). The offsets are associated with consecutive IDs
+from 0 to `N-1`; the last of the `N+1` offsets points at the first byte
+after the last payload. The offsets are relative to the beginning of the
+first payload, therefore the first offset will always be 0.
+
+Payloads are arbitrary bytes, and must be interpreted by the software.
+Although the typical use case is strings, this can be any binary
+payload. Note that in case of strings, they will not be 0-terminated
+unless they were specifically stored as such. Although this should be
+clear by the fact a payload is simply a sequence of bytes, it is only
+prudent to point it out. Thus, one must be extremely careful when using
+C-style strings, as their use is contingent on correct values being inserted
+and encoded in the first place, and assuming 0-terminated strings may
+easily lead to undefined behavior. Thus, it is recommended to store
+strings without terminating them, and then interpret them as string
+views (such as `std::string_view`) instead of a C-style string.
+
+The boundaries of the k-th payload are defined by the values of k-th and
+(k+1)-th offsets. Note that because of the additional offset that points
+to immediately after the last payload, we can read offsets `k` and `k+1`
+for any index `k < N` (recall that `N` is the number of elements).
+
+If the payloads are sorted (S), we can find an ID of a certain payload
+with a binary search. This is crucial for any application that requires
+mapping from payloads to their position in the table.
diff --git a/include/pisa/io.hpp b/include/pisa/io.hpp
@@ -36,7 +36,7 @@ template <typename Function>
 void for_each_line(std::istream& is, Function fn) {
     std::string line;
     while (std::getline(is, line)) {
-        fn(line);
+        fn(std::move(line));
     }
 }
 

diff --git a/include/pisa/lookup_table.hpp b/include/pisa/lookup_table.hpp
@@ -0,0 +1,229 @@
+// Copyright 2024 PISA developers
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <concepts>
+#include <cstddef>
+#include <cstdint>
+#include <memory>
+#include <optional>
+#include <ostream>
+#include <span>
+
+namespace pisa::lt {
+
+namespace detail {
+
+    class BaseLookupTable {
+      public:
+        virtual ~BaseLookupTable() = default;
+        [[nodiscard]] virtual auto size() const noexcept -> std::size_t = 0;
+        [[nodiscard]] virtual auto operator[](std::size_t idx) const
+            -> std::span<std::byte const> = 0;
+        [[nodiscard]] virtual auto find(std::span<std::byte const> value) const noexcept
+            -> std::optional<std::size_t> = 0;
+
+        [[nodiscard]] virtual auto clone() -> std::unique_ptr<BaseLookupTable> = 0;
+    };
+
+    class BaseLookupTableEncoder {
+      public:
+        virtual ~BaseLookupTableEncoder() = default;
+        virtual void insert(std::span<std::byte const> payload) = 0;
+        virtual void encode(std::ostream& out) = 0;
+    };
+
+}  // namespace detail
+
+namespace v1 {
+
+    class Flags {
+      private:
+        std::uint8_t flags = 0;
+
+      public:
+        constexpr Flags() = default;
+        explicit constexpr Flags(std::uint8_t bitset) : flags(bitset) {}
+
+        [[nodiscard]] auto sorted() const noexcept -> bool;
+        [[nodiscard]] auto wide_offsets() const noexcept -> bool;
+        [[nodiscard]] auto bits() const noexcept -> std::uint8_t;
+    };
+
+    namespace flags {
+        inline constexpr std::uint8_t SORTED = 0b001;
+        inline constexpr std::uint8_t WIDE_OFFSETS = 0b010;
+    }  // namespace flags
+
+}  // namespace v1
+
+}  // namespace pisa::lt
+
+namespace pisa {
+
+/**
+ * Lookup table mapping integers from a range [0, N) to binary payloads.
+ *
+ * This table assigns each _unique_ value (duplicates are not allowed) to an index in [0, N), where
+ * N is the size of the table. Thus, this structure is equivalent to a sequence of binary values.
+ * The difference between `LookupTable` and, say, `std::vector` is that its encoding supports
+ * reading the values without fully parsing the entire binary representation of the table. As such,
+ * it supports quickly initializing the structure from an external device (with random access),
+ * e.g., via mmap, and performing a lookup without loading the entire structure to main memory.
+ * This is especially useful for short-lived programs that must perform a lookup without the
+ * unnecessary overhead of loading it to memory.
+ *
+ * If the values are sorted, and the appropriate flag is toggled in the header, a quick binary
+ * search lookup can be performed to find an index of a value. If the values are not sorted, then a
+ * linear scan will be used; therefore, one should consider having values sorted if such lookups are
+ * important. Getting the value at a given index is a constant-time operation, though if using
+ * memory mapping, each such operation may need to load multiple pages to memory.
+ */
+class LookupTable {
+  private:
+    std::unique_ptr<::pisa::lt::detail::BaseLookupTable> m_impl;
+
+    explicit LookupTable(std::unique_ptr<::pisa::lt::detail::BaseLookupTable> impl);
+
+    [[nodiscard]] static auto v1(std::span<const std::byte> bytes) -> LookupTable;
+
+  public:
+    LookupTable(LookupTable const&);
+    LookupTable(LookupTable&&);
+    LookupTable& operator=(LookupTable const&);
+    LookupTable& operator=(LookupTable&&);
+    ~LookupTable();
+
+    /**
+     * The number of elements in the table.
+     */
+    [[nodiscard]] auto size() const noexcept -> std::size_t;
+
+    /**
+     * Retrieves the value at index `idx`.
+     *
+     * If `idx >= size()`, then `std::out_of_range` is thrown. See `at()` if you want to
+     * conveniently cast the memory span to another type.
+     */
+    [[nodiscard]] auto operator[](std::size_t idx) const -> std::span<std::byte const>;
+
+    /**
+     * Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
+     *
+     * See the templated version of this function if you want to automatically cast from another
+     * type to byte span.
+     */
+    [[nodiscard]] auto find(std::span<std::byte const> value) const noexcept
+        -> std::optional<std::size_t>;
+
+    /**
+     * Returns the value at index `idx` cast to type `T`.
+     *
+     * The type `T` must define `T::value_type` that resolves to a byte-wide type, as well as a
+     * constructor that takes `T::value_type const*` (pointer to the first byte) and `std::size_t`
+     * (the total number of bytes). If `T::value_type` is longer than 1 byte, this operation results
+     * in **undefined behavior**.
+     *
+     * Examples of types that can be used are: `std::string_view` or `std::span<const char>`.
+     */
+    template <typename T>
+    [[nodiscard]] auto at(std::size_t idx) const -> T {
+        auto bytes = this->operator[](idx);
+        return T(reinterpret_cast<typename T::value_type const*>(bytes.data()), bytes.size());
+    }
+
+    /**
+     * Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
+     *
+     * The type `T` of the value must be such that `std::span<typename T::value_type const>` is
+     * constructible from `T`.
+     */
+    template <typename T>
+        requires(std::constructible_from<std::span<typename T::value_type const>, T>)
+    [[nodiscard]] auto find(T value) const noexcept -> std::optional<std::size_t> {
+        return find(std::as_bytes(std::span<typename T::value_type const>(value)));
+    }
+
+    /**
+     * Constructs a lookup table from the encoded sequence of bytes.
+     */
+    [[nodiscard]] static auto from_bytes(std::span<std::byte const> bytes) -> LookupTable;
+};
+
+/**
+ * Lookup table encoder.
+ *
+ * This class builds and encodes a sequence of values to the binary format of lookup table.
+ * See `LookupTable` for more details.
+ *
+ * Note that all encoded data is accumulated in memory and only flushed to the output stream when
+ * `encode()` member function is called.
+ */
+class LookupTableEncoder {
+    std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> m_impl;
+
+    explicit LookupTableEncoder(std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> impl);
+
+  public:
+    /**
+     * Constructs an encoder for a lookup table in v1 format, with the given flag options.
+     *
+     * If sorted flag is _not_ set, then an additional hash set will be produced to keep track of
+     * duplicates. This will increase the memory footprint at build time.
+     */
+    static LookupTableEncoder v1(::pisa::lt::v1::Flags flags);
+
+    /**
+     * Inserts payload.
+     *
+     * If sorted flag was set at construction time, it will throw if the given payload is not
+     * lexicographically greater than the previously inserted payload. If sorted flag was _not_ set
+     * and the given payload has already been inserted, it will throw as well.
+     */
+    auto insert(std::span<std::byte const> payload) -> LookupTableEncoder&;
+
+    /**
+     * Writes the encoded table to the output stream.
+     */
+    auto encode(std::ostream& out) -> LookupTableEncoder&;
+
+    /**
+     * Inserts a payload of type `Payload`.
+     *
+     * `std::span<typename Payload::value_type const>` must be constructible from `Payload`, which
+     * in turn will be cast as byte span before calling the non-templated version of `insert()`.
+     */
+    template <typename Payload>
+        requires(std::constructible_from<std::span<typename Payload::value_type const>, Payload>)
+    auto insert(Payload const& payload) -> LookupTableEncoder& {
+        insert(std::as_bytes(std::span(payload)));
+        return *this;
+    }
+
+    /**
+     * Inserts all payloads in the given span.
+     *
+     * It calls `insert()` for each element in the span. See `insert()` for more details.
+     */
+    template <typename Payload>
+    auto insert_span(std::span<Payload const> payloads) -> LookupTableEncoder& {
+        for (auto const& payload: payloads) {
+            insert(payload);
+        }
+        return *this;
+    }
+};
+
+}  // namespace pisa
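
Reviewer sketch (not part of the commit): the header above only declares the encoder, so the following shows one way the v1 byte layout could be produced in memory. `write_le` and `encode_v1_sketch` are hypothetical names, not PISA functions. The sketch assumes the `Length` field is little-endian, checks strict ordering only when the sorted flag is requested, and omits the duplicate detection that the documented encoder performs for unsorted input. The exception types are likewise assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string_view>
#include <vector>

// Appends `value` as `width` little-endian bytes.
inline void write_le(std::vector<std::byte>& out, std::uint64_t value, std::size_t width) {
    for (std::size_t i = 0; i < width; ++i) {
        out.push_back(static_cast<std::byte>((value >> (8 * i)) & 0xFF));
    }
}

// Hypothetical helper: serializes `payloads` into the v1 layout
// (identifier, version, flags, padding, length, N+1 offsets, payloads).
inline auto encode_v1_sketch(
    std::vector<std::string_view> const& payloads, bool sorted, bool wide_offsets)
    -> std::vector<std::byte> {
    // With the S flag, each payload must be strictly greater than the previous one.
    for (std::size_t i = 1; sorted && i < payloads.size(); ++i) {
        if (!(payloads[i - 1] < payloads[i])) {
            throw std::invalid_argument("payloads must be strictly increasing");
        }
    }
    std::uint64_t total = 0;
    for (auto const payload: payloads) {
        total += payload.size();
    }
    if (!wide_offsets && total > std::numeric_limits<std::uint32_t>::max()) {
        throw std::length_error("payloads too large for 32-bit offsets");
    }

    std::size_t const width = wide_offsets ? 8 : 4;
    auto const flags =
        static_cast<std::uint8_t>((sorted ? 0b01 : 0) | (wide_offsets ? 0b10 : 0));

    std::vector<std::byte> out;
    out.push_back(std::byte{0x87});              // constant identifier
    out.push_back(std::byte{0x01});              // format version
    out.push_back(static_cast<std::byte>(flags));
    out.insert(out.end(), 5, std::byte{0});      // padding up to 8 bytes
    write_le(out, payloads.size(), 8);           // N (assumed little-endian)

    // N+1 offsets relative to the first payload byte; the first is always 0.
    std::uint64_t offset = 0;
    write_le(out, offset, width);
    for (auto const payload: payloads) {
        offset += payload.size();
        write_le(out, offset, width);
    }
    // Payload bytes, back to back, with no terminators.
    for (auto const payload: payloads) {
        for (char const c: payload) {
            out.push_back(static_cast<std::byte>(static_cast<unsigned char>(c)));
        }
    }
    return out;
}
```

For two payloads `"a"` and `"bc"` with 32-bit offsets, this yields 8 header bytes, 8 length bytes, 3 × 4 offset bytes, and 3 payload bytes, 31 bytes in total.
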