Implement data set builders and the new entity/relationship model #610

Merged on Jan 21, 2025 (79 commits)
597fc52
create builder and manifest files
mdekstrand Jan 19, 2025
8d4b63d
Write up the data model specification
mdekstrand Jan 19, 2025
2babcbe
revise model docs and document schema
mdekstrand Jan 19, 2025
16e8e84
document the data schemas
mdekstrand Jan 19, 2025
c2c618f
More thorough schema docs
mdekstrand Jan 19, 2025
e39371c
add scipy stubs
mdekstrand Jan 19, 2025
c17f30a
first draft of dataset builder API
mdekstrand Jan 19, 2025
f418d71
all attributes are optional
mdekstrand Jan 20, 2025
76d001d
simplify schema model
mdekstrand Jan 20, 2025
8f846d6
some more builder/schema API cleanup
mdekstrand Jan 20, 2025
c3d0bce
rough draft simpler builder API + tests
mdekstrand Jan 20, 2025
03af211
add a lot more failing tests
mdekstrand Jan 20, 2025
d54439e
initial pass at building entity ID lists
mdekstrand Jan 20, 2025
c40d61c
make builder basic entity ID tests pass
mdekstrand Jan 20, 2025
9d449bc
fix forbidden test name
mdekstrand Jan 20, 2025
941554c
test ID type upcasting
mdekstrand Jan 20, 2025
e992194
test more builder errors
mdekstrand Jan 20, 2025
4976a77
working non-repeated interactions
mdekstrand Jan 20, 2025
965302a
clean up ldf tests
mdekstrand Jan 20, 2025
8fb35ad
fix bad fixture name
mdekstrand Jan 20, 2025
7df46b1
check for repeated interactions
mdekstrand Jan 20, 2025
b232fef
attach columns to relationships
mdekstrand Jan 20, 2025
243997a
update from_interactions_df to use DatasetBuilder
mdekstrand Jan 20, 2025
abd5b67
add save/load tests
mdekstrand Jan 20, 2025
5b16792
move schema adaptation into data.adapt
mdekstrand Jan 20, 2025
0ef6193
more dataset updates + make format a keyword-only argument for intera…
mdekstrand Jan 20, 2025
0cbc127
rename interaction_log to interaction_table
mdekstrand Jan 20, 2025
2bd7d22
write a lot of code to implement dataset
mdekstrand Jan 21, 2025
854bebb
fix some bugs and missing features
mdekstrand Jan 21, 2025
46e2614
implement unified statistics
mdekstrand Jan 21, 2025
2f62be7
fix stats and many more tests
mdekstrand Jan 21, 2025
ebc5987
fix row pointers for matrix
mdekstrand Jan 21, 2025
258a9d9
fix empty ItemList construction from arrow
mdekstrand Jan 21, 2025
8c10375
hunt down and fix stray dataset errors
mdekstrand Jan 21, 2025
d507ffe
add missing count column
mdekstrand Jan 21, 2025
368023d
fix stats index
mdekstrand Jan 21, 2025
8464c87
fix lazy datasets
mdekstrand Jan 21, 2025
025da43
remove old docs
mdekstrand Jan 21, 2025
d1a6026
doctree updates
mdekstrand Jan 21, 2025
37c8718
implement dataset save/load
mdekstrand Jan 21, 2025
362c88d
fix lingering ItemList arrow conver bug
mdekstrand Jan 21, 2025
b03e704
add copy-builder capabilities
mdekstrand Jan 21, 2025
27ad7ff
update record-based splitting
mdekstrand Jan 21, 2025
f4ceb19
update stray count methods
mdekstrand Jan 21, 2025
e9583bf
fix remaining data splitters
mdekstrand Jan 21, 2025
431dc57
remove last uses of MatrixDataset
mdekstrand Jan 21, 2025
9842d54
remove MatrixDataset
mdekstrand Jan 21, 2025
f7822ba
remove LazyDataset
mdekstrand Jan 21, 2025
3f8ce7f
stray lazy usage
mdekstrand Jan 21, 2025
f73d811
mark repeated interactions as xfail
mdekstrand Jan 21, 2025
1bfaadb
implement entity set querying
mdekstrand Jan 21, 2025
ef36701
stray query test errr
mdekstrand Jan 21, 2025
5d9fec3
skip attribute tests
mdekstrand Jan 21, 2025
76687b7
remove bogus imports
mdekstrand Jan 21, 2025
81927f6
fix popularity
mdekstrand Jan 21, 2025
7eb39da
tweak item knn towards passing
mdekstrand Jan 21, 2025
354f8fa
simplify pandas invocation
mdekstrand Jan 21, 2025
157b69f
save sorted array to fix tests
mdekstrand Jan 21, 2025
9c49413
use concat_tables to remove unsupported cast
mdekstrand Jan 21, 2025
56504c6
fix time-bounded popularity
mdekstrand Jan 21, 2025
4a8d383
fix stray test failures
mdekstrand Jan 21, 2025
a7c4aab
fix HPF
mdekstrand Jan 21, 2025
53c63fd
export CSR structure
mdekstrand Jan 21, 2025
6602e46
update doctest
mdekstrand Jan 21, 2025
3d85b67
fix doctest
mdekstrand Jan 21, 2025
0661acd
improve dataset test coverage
mdekstrand Jan 21, 2025
321f6c1
test pandas & arrow entity queries
mdekstrand Jan 21, 2025
dae3572
add interaction filter tests
mdekstrand Jan 21, 2025
166f231
remove deprecated method call
mdekstrand Jan 21, 2025
1a00841
rerun getting-started notebook
mdekstrand Jan 21, 2025
f8e2eae
better DSB warning report
mdekstrand Jan 21, 2025
9d8c183
clean up warning in adapt
mdekstrand Jan 21, 2025
ff7aeb1
reduce read-only tensor warnings
mdekstrand Jan 21, 2025
f47d6a7
rerun getting started with fewer warnings
mdekstrand Jan 21, 2025
5ce1d1a
update component index
mdekstrand Jan 21, 2025
9c81dba
remove unused from_src_and_test method
mdekstrand Jan 21, 2025
f93d87b
support ID-based interaction filtering
mdekstrand Jan 21, 2025
72ad783
update docs
mdekstrand Jan 21, 2025
24046eb
use ID-based filtering to simplify list holdout
mdekstrand Jan 21, 2025
19 changes: 17 additions & 2 deletions docs/api/data.rst
@@ -20,6 +20,20 @@ Data Sets
:caption: Data Sets

~lenskit.data.Dataset
~lenskit.data.EntitySet
~lenskit.data.RelationshipSet
~lenskit.data.MatrixRelationshipSet
~lenskit.data.CSRStructure

Building Data Sets
------------------

.. autosummary::
:toctree: .
:nosignatures:
:caption: Data Build and Import

~lenskit.data.DatasetBuilder
~lenskit.data.from_interactions_df
~lenskit.data.load_movielens
~lenskit.data.load_movielens_df
@@ -49,14 +63,15 @@ Recommendation Queries
~lenskit.data.RecQuery
~lenskit.data.QueryInput

Terms and Identifiers
---------------------
Schemas and Identifiers
-----------------------

.. autosummary::
:toctree: .
:nosignatures:
:caption: Terms and Identifiers

lenskit.data.schema
~lenskit.data.Vocabulary

See also:
19 changes: 9 additions & 10 deletions docs/api/pipeline.rst
@@ -14,11 +14,11 @@ Pipeline Classes
:nosignatures:
:caption: Data Sets

~lenskit.pipeline.Pipeline
~lenskit.pipeline.PipelineBuilder
~lenskit.pipeline.PipelineState
~lenskit.pipeline.Node
~lenskit.pipeline.Lazy
Pipeline
PipelineBuilder
PipelineState
Node
Lazy

Component Interface
-------------------
@@ -30,8 +30,7 @@ LensKit components.
:toctree: .
:nosignatures:

~lenskit.pipeline.Component
~lenskit.pipeline.Trainable
Component

Standard Pipelines
------------------
@@ -40,8 +39,8 @@ Standard Pipelines
:toctree: .
:nosignatures:

~lenskit.pipeline.RecPipelineBuilder
~lenskit.pipeline.topn_pipeline
RecPipelineBuilder
topn_pipeline

Serialized Configurations
-------------------------
@@ -50,4 +49,4 @@ Serialized Configurations
:toctree: .
:nosignatures:

~lenskit.pipeline.PipelineConfig
PipelineConfig
22 changes: 11 additions & 11 deletions docs/guide/GettingStarted.ipynb

Large diffs are not rendered by default.

253 changes: 253 additions & 0 deletions docs/guide/data-model.rst
@@ -0,0 +1,253 @@
.. _data-model:

Data Model
==========

LensKit defines a holistic data model for recommender training (and evaluation)
data. The model is graph-structured, but the interfaces and definitions center
on tabular (data frame) views of that data for ease of training across a variety
of statistical modeling packages.

Apache Arrow is used as the common format for data, and data type definitions
are drawn from there. Data is transparently converted to NumPy arrays, Pandas
series or data frames, Torch tensors, etc. as requested.

Most code will either use one of the predefined dataset loading functions (such
as :func:`~lenskit.data.load_movielens`) or the
:class:`~lenskit.data.DatasetBuilder` to create data sets (see :ref:`data-api`).

.. note::

Working with the data directly as a heterogeneous graph for integration with
packages like PyTorch-Geometric is not difficult, and will be directly
supported in an upcoming backwards-compatible revision.

Core Concepts
~~~~~~~~~~~~~

The LensKit data model has several core concepts, derived from the
entity-relationship database model:

.. glossary::

Entity
The items, users, sessions, etc. about which the data set records data.
In a graph view of the data, these are the nodes in the graph.

Entity Class
Each entity has a particular class, such as ``item`` or ``user``, based
on its role in the dataset. All data sets have at least the ``item``
entity class. Entities do not have subtypes in the raw data model; if
components want to conceptually treat entities as having subtypes, such
as different types of items, they can use attributes to distinguish the
different subtypes.

Entity Identifier
Each entity has a unique (within its class) *identifier*. Entity
identifiers can be either integers or strings.

Attribute
Entities can have one or more *attributes*. Attributes are consistent
within an entity type, and are nullable (any individual entity may be
missing a value for an attribute).

Relationship
A relationship connects two (or more) entities and may have additional
attributes attached to the relationship itself. Relationships may also
be repeated (more than one relationship record may exist for the same
combination of entities).

Relationship Class
Relationship classes are like entity classes, but describe the type of a
particular relationship. This allows for models or client code to query
for records of a particular relationship, such as “follows” or
“purchased”.

Interaction
An interaction is a specific type of relationship record that records an
interaction between two or more entities, such as a user rating a book,
or a user purchasing a product in a particular session. Interactions
usually, but not always, have timestamps.

.. _data-entities:

Entities
~~~~~~~~

*Entities* in the LensKit data model represent individual objects in the data,
such as users or items. An entity is defined by its class and identifier, and
nothing else is directly recorded about the entity itself — the interesting data
resides in its attributes and relationships.

Entity identifiers can be integers or strings.

Every data set has the entity class ``item`` for the items that may be
recommended. Most datasets also have the class ``user``. Session-aware
recommendation data sets may have an entity class ``session``.

When representing entities or entity data in tabular form, identifiers are
stored in a column named ``<class>_id`` (e.g. ``item_id``). Dataset functions
that map identifiers to 0-based contiguous array indexes will use the
``<class>_num`` column for this index, referred to as the *entity number*.
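
As a rough illustration of the ID-to-number mapping described above, the following sketch assigns 0-based contiguous numbers to a list of item identifiers. It is plain Python rather than LensKit code; the helper ``assign_entity_numbers`` is hypothetical, and numbering the identifiers in sorted order is an assumption made for the example, not something this document specifies.

```python
# Illustrative sketch only: a hypothetical helper, not LensKit's implementation.
def assign_entity_numbers(ids):
    """Map each distinct identifier to a 0-based contiguous entity number."""
    # Assumption: numbers follow the sorted order of the distinct identifiers.
    unique_ids = sorted(set(ids))
    return {ident: num for num, ident in enumerate(unique_ids)}


item_ids = ["b", "a", "c", "a"]
item_nums = assign_entity_numbers(item_ids)

# Tabular view using the <class>_id / <class>_num column naming convention.
rows = [{"item_id": i, "item_num": item_nums[i]} for i in item_ids]
```

Each distinct identifier receives exactly one number, so repeated identifiers in ``rows`` share the same ``item_num``.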

.. _data-attributes:

Attributes
~~~~~~~~~~

Entities (and relationships) can have associated *attributes* providing data
about that entity, relationship, or interaction. This can be anything from a
timestamp to review text to complex item metadata. Attributes are associated
with entity or relationship *classes*, and have types that must be consistent
across the class (each entity or relationship class has a schema defining its
attributes and their types).

Attributes come in several forms (called *layouts*):

- **Scalar** attributes store a single value for each entity or relationship
instance. The value can be any type supported by NumPy or Apache Arrow.
Attribute values may be missing.

- **List** attributes store zero or more values for each entity or
relationship instance. List elements must have the same type.

- **Vector** attributes store a fixed-length vector of integer or
floating-point values for each entity or relationship instance. The vector
length is defined by the entity or relationship class, and must be the same
for all instances of that class for which the vector attribute is defined.
The vector dimensions may have associated labels or names, or they may just
be numbered (e.g., for representing embeddings from a language model).

- **Sparse** attributes are vector attributes that are stored in compressed
sparse format, with missing values understood to be 0.
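
To make the four layouts concrete, the sketch below shows what the values for a single entity might look like under each one. The attribute names and values are invented for illustration, and a plain dictionary stands in for the compressed sparse storage; none of this reflects how LensKit stores the data internally.

```python
# Hypothetical per-entity values for each layout (plain Python, not LensKit objects).
scalar_title = "Toy Story"               # scalar: one value per entity, may be missing
list_genres = ["Animation", "Comedy"]    # list: zero or more values of one type
vector_embedding = [0.12, -0.40, 0.88]   # vector: same fixed length for the whole class

# sparse: keep only the non-zero entries, keyed by dimension index.
# A dict is a simplified stand-in for a compressed sparse format.
sparse_dim = 6
sparse_entries = {1: 2.0, 4: 0.5}


def densify(entries, dim):
    """Expand a sparse row to a dense vector, treating missing entries as 0."""
    return [entries.get(i, 0.0) for i in range(dim)]


dense_row = densify(sparse_entries, sparse_dim)
```

The ``densify`` helper shows the one semantic rule stated for sparse attributes: a dimension with no stored value reads as 0.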

Attribute Name Restrictions
---------------------------

Attribute names can be freely chosen, subject to a few lightweight restrictions:

- Within an entity or relationship class, names must be unique.
- For each entity class ``$FOO``, the names ``$FOO_id`` and ``$FOO_num`` are
reserved by LensKit and cannot be used by user-defined attributes (on any
entity or relationship). We recommend avoiding all attribute names of the
form ``$FOO_<ident>``.
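
The two restrictions can be expressed as a small check. This is a sketch under the assumption that the caller knows the set of entity class names and the names already defined on the class; ``check_attribute_name`` is a hypothetical helper, not part of the LensKit API.

```python
# Hypothetical validation helper illustrating the naming rules above.
def check_attribute_name(name, entity_classes, existing_names):
    """Return name if it is allowed, otherwise raise ValueError."""
    # Rule 1: names must be unique within an entity or relationship class.
    if name in existing_names:
        raise ValueError(f"attribute {name!r} is already defined for this class")
    # Rule 2: <class>_id and <class>_num are reserved for every entity class.
    for cls in entity_classes:
        if name in (f"{cls}_id", f"{cls}_num"):
            raise ValueError(f"attribute name {name!r} is reserved for class {cls!r}")
    return name


classes = ["item", "user"]
ok_name = check_attribute_name("title", classes, existing_names={"genres"})
```

Note that the reserved-name check applies across all entity classes, so ``user_num`` is rejected even when defining an attribute on ``item``.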

Unsupported Features
--------------------

In the initial release of the new LensKit data model (in :ref:`2025.1`), not all
possible attribute and entity or relationship class combinations are supported.
In particular, relationships can only have scalar attributes. We intend to
relax this restriction in the future, with more time to determine an ergonomic
API for accessing such data. All attribute formats are supported for entities.

Repeated relationships are also not yet fully supported. Support is planned for
LensKit 2025.2.

.. _data-relationships:

Relationships
~~~~~~~~~~~~~

Relationships are links between two (or more) entities, optionally with
associated attributes. They are further divided into classes, with each class
defining its own set of relationship attributes.

Most relationships are between entities of different classes, in which case the
entity identifiers are stored in ``<class>_id`` (or ``<class>_num``) columns.
For self-relationships, however, this is not possible; such relationships must
define *aliases* for one or more of their appearances, and LensKit uses these
aliases to derive the appropriate column names. For example, a relationship
class that encodes citation relationships in a research paper recommender system
would be a self-relationship between items. It can alias ``item`` to ``citing``
and ``cited``, in which case the item identifiers are taken from ``citing_id``
and ``cited_id`` columns (or ``citing_num`` and ``cited_num``).
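
The column-naming rule for aliased and unaliased relationships can be sketched as follows. The mapping from column prefix to entity class and the helper ``relationship_columns`` are assumptions made for this example; they do not describe how LensKit represents relationship definitions.

```python
# Hypothetical sketch: derive relationship column names from class names or aliases.
def relationship_columns(entities, numbers=False):
    """Return the entity columns for a relationship.

    ``entities`` maps each column prefix to an entity class. The prefix is the
    class name itself unless the class appears under an alias.
    """
    suffix = "num" if numbers else "id"
    return [f"{prefix}_{suffix}" for prefix in entities]


# Ordinary relationship between two different classes: no aliases needed.
ratings = {"user": "user", "item": "item"}
# Self-relationship between items: both appearances are aliased.
citations = {"citing": "item", "cited": "item"}

rating_cols = relationship_columns(ratings)
citation_cols = relationship_columns(citations)
citation_num_cols = relationship_columns(citations, numbers=True)
```

Without the aliases, the citation relationship would need two columns both named ``item_id``, which is why self-relationships require them.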

.. note::

Entity and relationship class names must be unique (you cannot use the same
name for an entity class and a relationship class).

.. _data-interactions:

Interactions
~~~~~~~~~~~~

An interaction is a relationship that indicates some kind of interaction between
entities for the purposes of learning and evaluating recommendations, such as
purchasing, shelving, clicking, or rating. There is no logical difference
between relationships and interactions; an interaction class is just a
relationship class that has been declared to represent interactions, so that
client and model code knows to treat it as interaction data. Most data sets
define a single interaction class, but can define more than one.

- Interactions should always involve the ``item`` entity class, without an
alias, preferably as the last entity class in the relationship definition.

- Interactions usually have timestamps (although this is not strictly
required). Timestamps can be either integers (treated as UNIX timestamps)
or Arrow timestamp types.

- The dataset can designate a *default interaction class* so that model code
can request the “interactions” without needing to know the different classes
involved. If no default class is specified, and more than one class is
defined, it is an error to request the interactions without specifying an
interaction class.
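
The default-class rule in the last point can be sketched as a small resolution function. ``resolve_interaction_class`` is hypothetical, and two of its behaviors are assumptions rather than documented facts: falling back to the only class when exactly one is defined, and raising an error when none are defined.

```python
# Hypothetical sketch of choosing which interaction class a request refers to.
def resolve_interaction_class(classes, default=None, requested=None):
    """Pick the interaction class for a request."""
    if requested is not None:
        if requested not in classes:
            raise KeyError(f"unknown interaction class {requested!r}")
        return requested
    if default is not None:
        return default
    # Assumption: a lone interaction class is used when no default is set.
    if len(classes) == 1:
        return classes[0]
    # More than one class and no default: requesting "the interactions" is an error.
    raise ValueError("no default interaction class; specify one explicitly")


chosen = resolve_interaction_class(["rating", "click"], default="rating")
```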

Certain attribute names, if defined, have particular meaning for interaction
records:

``timestamp``
The date and time of the interaction, as a UNIX or Arrow timestamp.

``rating``
A user-supplied rating for the user-item pair.

``count``
A count of the interactions between this pair. If client code requests a
matrix of interaction counts, and this attribute is defined, then its sum is
used as the total count of interactions between the entities. If no
``count`` attribute is defined, then a matrix of interaction counts is
computed by counting the interaction records.
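
The two ways of computing interaction counts can be sketched with plain dictionaries. A dictionary keyed by ``(user_id, item_id)`` stands in for the count matrix, ``interaction_counts`` is a hypothetical helper, and every ``count`` value in the example is present, since the behavior for missing counts is not defined here.

```python
from collections import defaultdict


# Hypothetical sketch: total interactions per (user, item) pair.
def interaction_counts(records, has_count):
    """Sum the ``count`` attribute if the class defines it, else count records."""
    totals = defaultdict(int)
    for rec in records:
        key = (rec["user_id"], rec["item_id"])
        totals[key] += rec["count"] if has_count else 1
    return dict(totals)


with_count = [
    {"user_id": 1, "item_id": "a", "count": 3},
    {"user_id": 1, "item_id": "a", "count": 2},
    {"user_id": 2, "item_id": "a", "count": 1},
]
without_count = [
    {"user_id": 1, "item_id": "a"},
    {"user_id": 1, "item_id": "a"},
    {"user_id": 2, "item_id": "b"},
]

summed = interaction_counts(with_count, has_count=True)
counted = interaction_counts(without_count, has_count=False)
```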

.. todo::

Define what happens when ``count`` is NULL.

The order of entity classes in an interaction type is mildly meaningful: by
convention, the last entity class is the item, and the “interactor” (e.g.,
user or session) comes first.

.. _data-schema:

Schemas
~~~~~~~

A data *schema* (:class:`~lenskit.data.DataSchema`) defines the layout of the
tables, entity types, and relationship types. Client code will rarely need to
create or work with the schema directly; it is created and maintained by the
:class:`~lenskit.data.DatasetBuilder`.

.. _data-internal:

Internal Representation
~~~~~~~~~~~~~~~~~~~~~~~

Data should only be accessed through the :class:`~lenskit.data.Dataset` API, as
the internal storage is subject to change. Logically, each entity or
relationship type is represented as a table, consisting of:

- One or more entity identifier or number columns
- Zero or more attribute columns

Data may be internally broken into sub-tables for efficiency (e.g., for very
sparse attributes), but this is the logical view. Internally, relationships use
entity numbers instead of entity IDs to record the entities involved in a
relationship record.

As of LensKit 2025.1, the native format for storing a dataset on disk (used by
:meth:`~lenskit.data.Dataset.save` and :meth:`~lenskit.data.Dataset.load`) is a
directory with a ``schema.json`` file containing the serialized logical schema
and a Parquet file ``<class>.parquet`` for each entity or relationship class
containing the identifiers and attribute values. For entity classes,
``<class>.parquet`` contains both the entity IDs and entity numbers.
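
As a quick illustration of that directory layout, the sketch below lists the files one would expect for a given set of class names. The helper ``expected_files``, the directory name, and the class names are all hypothetical; the contents of ``schema.json`` are not modeled.

```python
from pathlib import PurePosixPath


# Hypothetical sketch: file names implied by the layout described above.
def expected_files(directory, class_names):
    """List schema.json plus one <class>.parquet per entity or relationship class."""
    root = PurePosixPath(directory)
    return [root / "schema.json"] + [root / f"{name}.parquet" for name in class_names]


# Example with two entity classes and one relationship class (names are made up).
files = [str(p) for p in expected_files("my-dataset", ["item", "user", "rating"])]
```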