Merge pull request #644 from mdekstrand/tweak/convention-docs

Improve pipeline modification and documentation
lenskit · Feb 22, 2025 · 7cb56f1
2 parents 6294f48 + edec6b7
commit 7cb56f1

Showing 9 changed files with 318 additions and 110 deletions.
diff --git a/docs/guide/conventions.rst b/docs/guide/conventions.rst
@@ -7,6 +7,8 @@ The components shipped with LensKit follow certain conventions to make their
 configuration and operation consistent and predictable. We encourage you to
 follow these conventions in your own code as well.
 
+.. _list-length:
+
 List Length
 ~~~~~~~~~~~
 
@@ -17,6 +19,32 @@ allows list length to be baked into a pipeline configuration, and also allows
 that length to be specified or overridden at runtime.  If both lengths are
 specified, the runtime length takes precedence.
 
+See :class:`lenskit.basic.TopNRanker` or :class:`lenskit.basic.SoftmaxRanker`
+for examples.
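
The precedence rule above can be sketched without any LensKit code. This is a minimal, hypothetical illustration: ``ToyRanker``, ``ToyRankerConfig``, and the parameter name ``n`` are assumptions made for the example, not the actual ``TopNRanker`` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRankerConfig:
    # Hypothetical configured list length; None means "no limit".
    n: Optional[int] = None


class ToyRanker:
    """Hypothetical ranker illustrating configured vs. runtime list length."""

    def __init__(self, config: Optional[ToyRankerConfig] = None):
        self.config = config or ToyRankerConfig()

    def __call__(self, scored, n: Optional[int] = None):
        # The runtime length, when given, takes precedence over the configured one.
        limit = n if n is not None else self.config.n
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return ranked if limit is None else ranked[:limit]


ranker = ToyRanker(ToyRankerConfig(n=2))
scored = [("a", 0.1), ("b", 0.9), ("c", 0.5)]
configured = ranker(scored)        # uses the configured length (2)
overridden = ranker(scored, n=1)   # runtime length overrides the configuration
```
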
+
+
+.. _config-conventions:
+
+Configuration Conventions
+-------------------------
+
+We strive for consistency in configuration across LensKit components.  To that end,
+we use a few common names for configuration options and hyperparameters, and we
+encourage you to use them in your own components unless you have a compelling reason
+not to.
+
+``embedding_size``
+    The dimensionality of embeddings or a latent feature space (e.g., the dimension
+    in matrix factorization or dimensionality reduction).
+``epochs``
+    The number of training epochs for an iterative method (this option name is
+    required by :ref:`iterative-training`).
+``learning_rate``
+    The learning rate for an iterative method.
+``reg``
+    The regularization weight for regularized models.
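
As a rough illustration of these names in use, a configuration class for an iterative matrix-factorization scorer might look like the sketch below. ``ExampleMFConfig`` and its default values are hypothetical and not taken from any LensKit component; only the four option names come from the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleMFConfig:
    """Hypothetical configuration using the conventional option names."""

    embedding_size: int = 50     # dimensionality of the latent feature space
    epochs: int = 10             # number of training epochs
    learning_rate: float = 0.01  # learning rate for the iterative method
    reg: float = 0.1             # regularization weight


config = ExampleMFConfig(embedding_size=32, reg=0.05)
# A dataclass of plain values converts directly to a JSON document.
config_json = json.dumps(asdict(config), indent=2)
print(config_json)
```
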
+
+
 .. _rng:
 
 Random Seeds

diff --git a/docs/guide/implementing.rst b/docs/guide/implementing.rst
@@ -0,0 +1,160 @@
+.. _component-impl:
+
+Implementing Components
+=======================
+
+LensKit is designed to excel in research and educational applications, and for
+those you will often need to write your own components that implement new
+scoring models, rankers, or other logic. The
+:ref:`pipeline design <pipeline>` and :ref:`standard pipelines
+<standard-pipelines>` are intended to make this as easy as possible, letting
+you focus on your own logic without writing boilerplate such as looking up
+user histories or ranking by score: you implement your training and scoring
+logic, and LensKit does the rest.
+
+Basics
+~~~~~~
+
+Implementing a component consists of a few steps:
+
+1.  Defining the configuration class.
+2.  Defining the component class, with its ``config`` attribute declaration.
+3.  Defining a ``__call__`` method for the component class that performs the
+    component's actual computation.
+4.  If the component supports training, implementing the
+    :class:`~lenskit.training.Trainable` protocol by defining a
+    :meth:`~lenskit.training.Trainable.train` method, or implementing
+    :ref:`iterative-training`.
+
+A simple example component that computes a linear weighted blend of the scores
+from two other components could look like this:
+
+.. literalinclude:: examples/blendcomp.py
+
+This component can be instantiated with its defaults:
+
+.. testsetup::
+
+    from blendcomp import LinearBlendScorer, LinearBlendConfig
+
+
+.. doctest::
+
+    >>> LinearBlendScorer()
+    <LinearBlendScorer {
+        "mix_weight": 0.5
+    }>
+
+You can instantiate it with its configuration class:
+
+.. doctest::
+
+    >>> LinearBlendScorer(LinearBlendConfig(mix_weight=0.2))
+    <LinearBlendScorer {
+        "mix_weight": 0.2
+    }>
+
+Finally, you can directly pass configuration parameters to the component constructor:
+
+.. doctest::
+
+    >>> LinearBlendScorer(mix_weight=0.7)
+    <LinearBlendScorer {
+        "mix_weight": 0.7
+    }>
+
+
+Component Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+As noted in the :ref:`pipeline documentation <component-config>`, components are
+configured with *configuration objects*.  These are JSON-serializable objects
+defined as Python dataclasses or Pydantic models, and define the different
+settings or hyperparameters that control the model's behavior.
+
+The choice of parameters is up to the component author, and each component will
+have different configuration needs.  Some needs are common across many
+components, though; see :ref:`config-conventions` for common LensKit
+configuration conventions.
+
+Component Operation
+~~~~~~~~~~~~~~~~~~~
+
+The heart of the component interface is the ``__call__`` method (components are
+just callable objects).  This method takes the component inputs as parameters,
+and returns the component's result.
+
+Most components return an :class:`~lenskit.data.ItemList`.  Scoring components usually
+have the following signature:
+
+.. code:: python
+
+    def __call__(self, query: QueryInput, items: ItemList) -> ItemList:
+        ...
+
+The ``query`` input receives the user ID, history, context, or other query
+input; the ``items`` input receives the list of items to be scored (e.g., the
+candidate items for recommendation).  The scorer then returns a list of scored
+items.
+
+Most components begin by converting the query to a
+:class:`~lenskit.data.RecQuery`::
+
+    def __call__(self, query: QueryInput, items: ItemList) -> ItemList:
+        query = RecQuery.create(query)
+        ...
+
+It is conventional for scorers to return a copy of the input item list with the scores
+attached, filling in ``NaN`` for items that cannot be scored.  After assembling a NumPy
+array of scores, you can do this with::
+
+    return ItemList(items, scores=scores)
+
+Scalars can also be supplied, so if the scorer cannot score any of the items, it
+can simply return a list whose scores are all ``NaN``::
+
+    return ItemList(items, scores=np.nan)
+
+Components must be able to handle items in ``items`` that were not seen
+at training time.  If the component has saved the training item vocabulary, the
+easiest way to do this is to use :meth:`~lenskit.data.ItemList.numbers` with
+``missing="negative"``::
+
+    i_nums = items.numbers(vocabulary=self.items, missing="negative")
+    scorable_mask = i_nums >= 0
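
The idea behind the negative-number mask can be sketched in plain Python, with no LensKit or NumPy dependency. Everything here is a hypothetical stand-in: ``vocabulary`` plays the role of the saved item vocabulary, ``item_scores`` the learned per-item values, and ``score_items`` the body of ``__call__``.

```python
import math

# Hypothetical trained state: item ID -> item number, and one score per item number.
vocabulary = {"i1": 0, "i2": 1, "i3": 2}
item_scores = [0.8, 0.3, 0.5]


def score_items(item_ids):
    # Map IDs to item numbers, using -1 for items not seen at training time.
    i_nums = [vocabulary.get(item, -1) for item in item_ids]
    # Start from NaN everywhere, then fill in only the scorable positions.
    scores = [math.nan] * len(item_ids)
    for pos, num in enumerate(i_nums):
        if num >= 0:
            scores[pos] = item_scores[num]
    return scores


scores = score_items(["i3", "unknown", "i1"])
```

The unseen item keeps its ``NaN`` score, matching the convention that unscorable items are reported as ``NaN`` rather than dropped from the list.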
+
+Component Training
+~~~~~~~~~~~~~~~~~~
+
+Components that need to train models on training data must implement the
+:class:`~lenskit.training.Trainable` protocol, either directly or through a
+helper implementation like :class:`~lenskit.training.IterativeTraining`.  The
+core of the ``Trainable`` protocol is the
+:meth:`~lenskit.training.Trainable.train` method, which takes a
+:class:`~lenskit.data.Dataset` and :class:`~lenskit.training.TrainingOptions`
+and trains the model.
+
+The details of training will vary significantly from model to model.  Typically,
+though, it proceeds through these steps:
+
+1.  Extract, prepare, and preprocess training data as needed for the model.
+2.  Compute the model's parameters, either directly (e.g.
+    :class:`~lenskit.basic.BiasScorer`) or through an optimization method (e.g.
+    :class:`~lenskit.basic.ImplicitMFScorer`).
+3.  Finalize the model parameters and clean up any temporary data.
+
+Learned model parameters are then stored as attributes on the component class,
+either directly or in a container object (such as a PyTorch
+:class:`~torch.nn.Module`).
+
+.. note::
+
+    If the model is already trained and
+    :attr:`~lenskit.training.TrainingOptions.retrain` is ``False``, then the
+    ``train`` method should return without any training.
+    :class:`~lenskit.training.IterativeTraining` handles this automatically.
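
A minimal sketch of that early-return check, using hypothetical stand-ins rather than the real LensKit classes: ``ToyTrainingOptions`` mimics only the ``retrain`` flag (its default of ``True`` is an assumption), and ``ToyMeanScorer`` "trains" by computing a mean over a plain list instead of a ``Dataset``.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTrainingOptions:
    # Hypothetical stand-in for the training options; only ``retrain`` is modeled.
    retrain: bool = True


class ToyMeanScorer:
    """Hypothetical component whose only learned parameter is a global mean."""

    def __init__(self):
        self.mean_: Optional[float] = None  # learned parameter, stored on the component

    def is_trained(self) -> bool:
        return self.mean_ is not None

    def train(self, ratings, options: Optional[ToyTrainingOptions] = None):
        options = options or ToyTrainingOptions()
        # Skip training when a model already exists and retraining is not requested.
        if self.is_trained() and not options.retrain:
            return
        self.mean_ = sum(ratings) / len(ratings)


scorer = ToyMeanScorer()
scorer.train([1.0, 2.0, 3.0])
first_mean = scorer.mean_
scorer.train([10.0], ToyTrainingOptions(retrain=False))  # no-op: already trained
kept_mean = scorer.mean_
scorer.train([10.0], ToyTrainingOptions(retrain=True))   # forces retraining
new_mean = scorer.mean_
```
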
+
+Further Reading
+~~~~~~~~~~~~~~~
+
+See :ref:`conventions` for additional component design and configuration conventions.
diff --git a/docs/guide/index.rst b/docs/guide/index.rst
@@ -37,6 +37,7 @@ guide to how to use LensKit for research, education, and other purposes.
     scorers
     rankers
     other-components
+    implementing
 
 .. toctree::
     :caption: Experiments

diff --git a/docs/guide/pipeline.rst b/docs/guide/pipeline.rst
@@ -37,7 +37,6 @@ like user history and candidate set lookup.
     as well as by Haystack_.
 
 .. _Haystack: https://docs.haystack.deepset.ai/docs/pipelines
-.. _POPROX: https://ccri-poprox.github.io/poprox-researcher-manual/reference/recommender/poprox_recommender.pipeline.html
 
 .. _pipeline-construct:
 
@@ -138,7 +137,7 @@ These input connections are specified via keyword arguments to the
 should be wired.
 
 
-You can also use :meth:`PipelineBuilder.default_conection` to specify default
+You can also use :meth:`PipelineBuilder.default_connection` to specify default
 connections. For example, you can specify a default for inputs named ``user``::
 
     pipe.default_connection('user', user_history)
@@ -192,7 +191,7 @@ The :meth:`~Pipeline.run` method takes two types of inputs:
     altered scores).
 
 *   Keyword arguments specifying the values for the pipeline's inputs, as defined by
-    calls to :meth:`Pipeline.create_input`.
+    calls to :meth:`PipelineBuilder.create_input`.
 
 Pipeline execution logically proceeds in the following steps:
 
@@ -222,7 +221,7 @@ itself, e.g.:
 * ``item-embedder``
 
 Component nodes can also have *aliases*, allowing them to be accessed by more
-than one name. Use :meth:`Pipeline.alias` to define these aliases.
+than one name. Use :meth:`PipelineBuilder.alias` to define these aliases.
 
 Various LensKit facilities recognize several standard component names used by
 the standard pipeline builders, and we recommend you use them in your own
@@ -255,7 +254,7 @@ Pipelines are defined by the following:
 * The components and inputs (nodes)
 * The component input connections (edges)
 * The component configurations (see :class:`Component`)
-* The components' learned parameters (see :class:`Trainable`)
+* The components' learned parameters (see :class:`~lenskit.training.Trainable`)
 
 LensKit supports serializing both pipeline descriptions (components,
 connections, and configurations) and pipeline parameters.  There are
@@ -265,10 +264,10 @@ two ways to save a pipeline or part thereof:
     pipeline; it has the usual downsides of pickling (arbitrary code execution,
     etc.). LensKit uses pickling to share pipelines with worker processes for
     parallel batch operations.
-2.  Save the pipeline configuration with :meth:`Pipeline.get_config`.  This saves
-    the components, their configurations, and their connections, but **not** any
-    learned parameter data.  A new pipeline can be constructed from such a
-    configuration can be reloaded with :meth:`Pipeline.from_config`.
+2.  Save the pipeline configuration (:attr:`Pipeline.config`, serialized with
+    :meth:`~pydantic.BaseModel.model_dump_json`).  This saves the components,
+    their configurations, and their connections, but **not** any learned parameter
+    data.  You can construct a new pipeline from it with :meth:`Pipeline.from_config`.
 
 ..
     3.  Save the pipeline parameters with :meth:`Pipeline.save_params`.  This saves
@@ -307,8 +306,8 @@ two ways to save a pipeline or part thereof:
 
 .. _standard-pipelines:
 
-Standard Layouts
-~~~~~~~~~~~~~~~~
+Standard Pipelines
+~~~~~~~~~~~~~~~~~~
 
 The standard recommendation pipeline, produced by either of the approaches
 described above in :ref:`pipeline-construct`, looks like this:
@@ -370,6 +369,9 @@ to be trained.
 Components also must be pickleable, as LensKit uses pickling for shared memory
 parallelism in its batch-inference code.
 
+See :ref:`component-impl` for more information on implementing your own
+components.
+
 .. _component-config:
 
 Configuring Components
@@ -389,6 +391,8 @@ configuration object if one is provided, or instantiating the configuration
 class with defaults or from keyword arguments.  In most cases, you don't need
 to define a constructor for a component.
 
+See :ref:`config-conventions` for standard configuration option names.
+
 .. admonition:: Motivation
     :class: note
 
@@ -411,59 +415,6 @@ to define a constructor for a component.
     -   The base class can provide well-defined and complete string
         representations for free to all component implementations.
 
-.. _component-impl:
-
-Implementing Components
------------------------
-
-Implementing a component therefore consists of a few steps:
-
-1.  Defining the configuration class.
-2.  Defining the component class, with its `config` attribute declaration.
-3.  Defining a `__call__` method for the component class that performs the
-    component's actual computation.
-4.  If the component supports training, implementing the :class:`Trainable`
-    protocol by defining a :meth:`Trainable.train` method.
-
-A simple example component that computes a linear weighted blend of the scores
-from two other components could look like this:
-
-.. literalinclude:: examples/blendcomp.py
-
-This component can be instantiated with its defaults:
-
-.. testsetup::
-
-    from blendcomp import LinearBlendScorer, LinearBlendConfig
-
-
-.. doctest::
-
-    >>> LinearBlendScorer()
-    <LinearBlendScorer {
-        "mix_weight": 0.5
-    }>
-
-You an instantiate it with its configuration class:
-
-.. doctest::
-
-    >>> LinearBlendScorer(LinearBlendConfig(mix_weight=0.2))
-    <LinearBlendScorer {
-        "mix_weight": 0.2
-    }>
-
-Finally, you can directly pass configuration parameters to the component constructor:
-
-.. doctest::
-
-    >>> LinearBlendScorer(mix_weight=0.7)
-    <LinearBlendScorer {
-        "mix_weight": 0.7
-    }>
-
-See :ref:`conventions` for more conventions for component design.
-
 Adding Components to the Pipeline
 ---------------------------------
 
@@ -490,6 +441,20 @@ You can add components to the pipeline in two ways:
 When you use the second approach, :meth:`PipelineBuilder.build` instantiates the
 component from the provided configuration.
 
+Modifying Pipelines
+~~~~~~~~~~~~~~~~~~~
+
+Pipelines, once constructed, are immutable (and modifying the pipeline, its
+configuration, or its internal data structures is undefined behavior).  However,
+you can create a new pipeline from an existing one with added or changed
+components.  To do this:
+
+1.  Create a builder from the pipeline with :meth:`Pipeline.modify`, which
+    returns a :class:`PipelineBuilder`.
+2.  Add new components, or replace existing ones with
+    :meth:`PipelineBuilder.replace_component`.
+3.  Build the modified pipeline with :meth:`PipelineBuilder.build`.
+
 POPROX and Other Integrators
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~