Add ItemList abstraction #458

Merged Jul 31, 2024 (16 commits)
37 changes: 25 additions & 12 deletions docs/data.rst
@@ -73,9 +73,9 @@ abstract class with implementations covering various scenarios.
Creating Datasets
~~~~~~~~~~~~~~~~~

Several functions create :class:`Dataset`s from different input data sources.
Several functions can create a :class:`Dataset` from different input data sources.

.. autofunction:: from_interaction_df
.. autofunction:: from_interactions_df

Loading Common Datasets
~~~~~~~~~~~~~~~~~~~~~~~
@@ -89,20 +89,40 @@ LensKit uses *vocabularies* to record user/item IDs, tags, terms, etc. in a way
that facilitates easy mapping to 0-based contiguous indexes for use in matrix
and tensor data structures.
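
As a rough illustration of the idea, the sketch below maps arbitrary IDs to 0-based contiguous indexes and back. The class ``SimpleVocabulary`` and its methods ``number`` and ``term`` are hypothetical names chosen for this example; they are not taken from LensKit's :class:`Vocabulary`, whose actual interface is not shown in this diff.

```python
from __future__ import annotations

from typing import Hashable, Iterable


class SimpleVocabulary:
    """Hypothetical stand-in for illustration; not LensKit's Vocabulary."""

    def __init__(self, terms: Iterable[Hashable]) -> None:
        # Deduplicate while keeping first-seen order, so indexes are stable.
        self._terms = list(dict.fromkeys(terms))
        self._index = {t: i for i, t in enumerate(self._terms)}

    def number(self, term: Hashable, missing: int = -1) -> int:
        # Look up the 0-based index of a term; unknown terms get `missing`.
        return self._index.get(term, missing)

    def term(self, num: int) -> Hashable:
        # Reverse lookup: index back to the original ID.
        return self._terms[num]

    def __len__(self) -> int:
        return len(self._terms)


vocab = SimpleVocabulary(["i42", "i7", "i42", "i13"])
codes = [vocab.number(t) for t in ["i7", "i42", "i999"]]
print(codes)          # indexes usable as matrix row/column positions
print(vocab.term(2))  # recover the ID stored at index 2
```

Because the indexes are contiguous and start at zero, they can be used directly as row or column positions in a matrix or tensor of size ``len(vocab)``.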

.. module:: lenskit.data.vocab
.. module:: lenskit.data

.. autoclass:: Vocabulary

Dataset implementations

Item Lists
~~~~~~~~~~

LensKit uses *item lists* to represent collections of items that may be scored,
ranked, etc.

.. autoclass:: ItemList

User-Item Data Tables
~~~~~~~~~~~~~~~~~~~~~

.. module:: lenskit.data.tables

.. autoclass:: NumpyUserItemTable
.. autoclass:: TorchUserItemTable

Dataset Implementations
~~~~~~~~~~~~~~~~~~~~~~~

.. module:: lenskit.data.dataset

Matrix Dataset
--------------

The :class:`MatrixDataset` provides an in-memory dataset implementation backed
by a ratings matrix or implicit-feedback matrix.

.. autoclass:: MatrixDataset
    :no-members:

Lazy Dataset
------------
@@ -111,11 +131,4 @@ The lazy data set takes a function that loads a data set (of any type), and
lazily uses that function to load an underlying data set when needed.

.. autoclass:: LazyDataset

User-Item Data Tables
~~~~~~~~~~~~~~~~~~~~~

.. module:: lenskit.data.tables

.. autoclass:: NumpyUserItemTable
.. autoclass:: TorchUserItemTable
:members: delegate
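
The lazy-loading behavior described for :class:`LazyDataset` can be sketched roughly as follows. This is a minimal sketch under stated assumptions: ``LazyHolder`` and ``load_ratings`` are hypothetical names, and although the documentation above lists a ``delegate`` member, its real signature is not shown in this diff, so the method here is only a guess at the general shape.

```python
from __future__ import annotations

from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LazyHolder(Generic[T]):
    """Hypothetical illustration of lazy loading; not LensKit's LazyDataset."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: T | None = None
        self._loaded = False

    def delegate(self) -> T:
        # Run the loader on first access only, then reuse the cached result.
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value  # type: ignore[return-value]


calls: list[str] = []


def load_ratings() -> list[tuple[str, str, float]]:
    # Stand-in for an expensive load (reading files, parsing, etc.).
    calls.append("load")
    return [("u1", "i7", 4.0), ("u2", "i42", 3.5)]


lazy = LazyHolder(load_ratings)
n_before = len(calls)     # constructing the holder does not run the loader
first = lazy.delegate()   # first access triggers the load
second = lazy.delegate()  # later accesses reuse the same object
```

The loader runs at most once, so code that never touches the data pays no loading cost.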
23 changes: 16 additions & 7 deletions docs/releases/2024.rst
@@ -24,13 +24,6 @@ Significant Changes

2024.1 brings substantial changes to LensKit.

* **PyTorch**. LensKit now uses PyTorch to implement most of its algorithms,
instead of Numba-accelerated NumPy code. Algorithms using PyTorch are:

* :py:class:`~lenskit.algorithms.knn.ItemItem`
* :py:class:`~lenskit.algorithms.als.ImplicitMF`
* :py:class:`~lenskit.algorithms.als.BiasedMF`

* :class:`~lenskit.data.Dataset`. LensKit now provides an abstraction for
training data instead of working with Pandas data frames directly, that
allows components to reduce code duplication and recomputation, access data
@@ -39,6 +32,22 @@ Significant Changes
supersedes the old bespoke dataset loading support, with functions like
:func:`~lenskit.data.load_movielens` to load standard datasets.

* New classes like :class:`~lenskit.data.ItemList` for routing item data
instead of using Pandas data frames and series. This makes component return
types more self-documenting (rather than requiring developers to remember
what is on the index, what the column names are, etc.), and facilitates more
efficient data transfer between components that do not use Pandas (e.g. data
passed between components using PyTorch can leave the data in tensors
without round-tripping through Pandas and NumPy, and keep this transparent
to client code).

* **PyTorch**. LensKit now uses PyTorch to implement most of its algorithms,
instead of Numba-accelerated NumPy code. Algorithms using PyTorch are:

* :py:class:`~lenskit.algorithms.knn.ItemItem`
* :py:class:`~lenskit.algorithms.als.ImplicitMF`
* :py:class:`~lenskit.algorithms.als.BiasedMF`

* Many LensKit components (batch running, model training, etc.) now report progress with
:py:mod:`progress_api`, and can be connected to TQDM or Enlighten.

2 changes: 2 additions & 0 deletions lenskit/lenskit/data/__init__.py
@@ -12,4 +12,6 @@
"Types of feedback supported."

from .dataset import Dataset, from_interactions_df # noqa: F401, E402
from .items import ItemList # noqa: F401, E402
from .movielens import load_movielens # noqa: F401, E402
from .mtarray import MTArray, MTFloatArray, MTGenericArray, MTIntArray # noqa: F401, E402
150 changes: 150 additions & 0 deletions lenskit/lenskit/data/checks.py
@@ -0,0 +1,150 @@
# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University
# Copyright (C) 2023-2024 Drexel University
# Licensed under the MIT license, see LICENSE.md for details.
# SPDX-License-Identifier: MIT

"Data check functions for LensKit."

# pyright: strict
from __future__ import annotations

from typing import Any, Literal, Protocol, TypeVar, overload

import numpy as np
from numpy.typing import NDArray


class HasShape(Protocol):
    @property
    def shape(self) -> tuple[int, ...]: ...


A = TypeVar("A", bound=HasShape)
NPT = TypeVar("NPT", bound=np.generic)


@overload
def check_1d(
    arr: A,
    size: int | None = None,
    *,
    label: str = "array",
    error: Literal["raise"] = "raise",
) -> A: ...
@overload
def check_1d(
    arr: HasShape,
    size: int | None = None,
    *,
    error: Literal["return"],
) -> bool: ...
def check_1d(
    arr: A,
    size: int | None = None,
    *,
    label: str = "array",
    error: Literal["raise", "return"] = "raise",
) -> bool | A:
    """
    Check that an array is one-dimensional, optionally checking that it has the
    expected length.

    This check function has 2 modes:

    * If ``error="raise"`` (the default), it will raise a :class:`TypeError`
      if the array shape is incorrect, and return the array otherwise.
    * If ``error="return"``, it will return ``True`` or ``False`` depending on
      whether the shape is correct.

    Args:
        arr:
            The array to check.
        size:
            The expected size of the array. If unspecified, this function simply
            checks that the array is 1-dimensional, but does not check the size
            of that dimension.
        label:
            A label to use in the exception message.
        error:
            The behavior when an array fails the test.

    Returns:
        The array, if ``error="raise"`` and the array passes the check, or a
        boolean indicating whether it passes the check.

    Raises:
        TypeError: if ``error="raise"`` and the array fails the check.
    """
    # Anything other than exactly one dimension fails, including 0-d arrays.
    if size is None and len(arr.shape) != 1:
        if error == "raise":
            raise TypeError(f"{label} must be 1D (has shape {arr.shape})")
        else:
            return False
    elif size is not None and arr.shape != (size,):
        if error == "raise":
            raise TypeError(f"{label} has incorrect shape (found {arr.shape}, expected {size})")
        else:
            return False

    if error == "raise":
        return arr
    else:
        return True


@overload
def check_type(
    arr: NDArray[Any],
    *types: type[NPT],
    label: str = "array",
    error: Literal["raise"] = "raise",
) -> NDArray[NPT]: ...
@overload
def check_type(
    arr: NDArray[Any],
    *types: type[NPT],
    error: Literal["return"],
) -> bool: ...
def check_type(
    arr: NDArray[Any],
    *types: type[NPT],
    label: str = "array",
    error: Literal["raise", "return"] = "raise",
) -> bool | NDArray[Any]:
    """
    Check that an array is of an acceptable type.

    This check function has 2 modes:

    * If ``error="raise"`` (the default), it will raise a :class:`TypeError`
      if the array type is not acceptable, and return the array otherwise.
    * If ``error="return"``, it will return ``True`` or ``False`` depending on
      whether the type is acceptable.

    Args:
        arr:
            The array to check.
        types:
            The acceptable types for the array.
        label:
            A label to use in the exception message.
        error:
            The behavior when an array fails the test.

    Returns:
        The array, if ``error="raise"`` and the array passes the check, or a
        boolean indicating whether it passes the check.

    Raises:
        TypeError: if ``error="raise"`` and the array fails the check.
    """
    # The dtype's scalar type must be a subclass of one of the allowed types.
    if issubclass(arr.dtype.type, types):
        if error == "raise":
            return arr
        else:
            return True
    elif error == "raise":
        raise TypeError(f"{label} has incorrect type {arr.dtype} (allowed: {types})")
    else:
        return False
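
A short usage sketch may help show how the two error modes differ in practice. So that it runs without LensKit installed, the snippet carries condensed, untyped re-implementations in the spirit of the functions above; the names ``check_1d_sketch`` and ``check_type_sketch`` are hypothetical, and the shortened shape-error message is a simplification. With LensKit available, the real functions would presumably be imported from ``lenskit.data.checks``, going by the file path in this diff.

```python
import numpy as np


def check_1d_sketch(arr, size=None, *, label="array", error="raise"):
    # Condensed re-implementation for illustration only (no overloads).
    ok = len(arr.shape) == 1 if size is None else arr.shape == (size,)
    if error == "return":
        return ok
    if not ok:
        raise TypeError(f"{label} has incorrect shape {arr.shape}")
    return arr


def check_type_sketch(arr, *types, label="array", error="raise"):
    # Condensed re-implementation for illustration only (no overloads).
    ok = issubclass(arr.dtype.type, types)
    if error == "return":
        return ok
    if not ok:
        raise TypeError(f"{label} has incorrect type {arr.dtype} (allowed: {types})")
    return arr


scores = np.array([0.5, 1.5, 2.5])

# "raise" mode returns the array itself, so checks can be chained inline.
checked = check_type_sketch(check_1d_sketch(scores, 3, label="scores"), np.floating)

# "return" mode yields a boolean instead of raising.
is_vector = check_1d_sketch(np.zeros((2, 3)), error="return")
is_int = check_type_sketch(scores, np.integer, error="return")

# A failed check in "raise" mode surfaces as a TypeError carrying the label.
try:
    check_1d_sketch(scores, 5, label="scores")
    message = None
except TypeError as e:
    message = str(e)
```

Returning the array in ``"raise"`` mode is what lets the overloads give type checkers a narrowed return type, while ``"return"`` mode suits branching code that wants to handle a mismatch itself.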