release: v2.5.0

eonu · Dec 27, 2024 · a54dcdb · a54dcdb
2 parents b5a4b0f + cf52c29
commit a54dcdb

Showing 108 changed files with 4,113 additions and 195 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -46,7 +46,7 @@ jobs:
       name: python/default
     steps:
       - coveralls/upload:
-          carryforward: 3.11, 3.12
+          carryforward: 3.11, 3.12, 3.13
           parallel_finished: true
 
 workflows:
@@ -56,7 +56,7 @@ workflows:
       - tests:
           matrix:
             parameters:
-              version: ["3.11", "3.12"]
+              version: ["3.11", "3.12", "3.13"]
       - coverage:
           requires:
             - tests
diff --git a/.coveragerc b/.coveragerc
@@ -0,0 +1,2 @@
+[run]
+omit = "sequentia/model_selection/_validation.py"
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1 @@
+*.ipynb linguist-documentation
diff --git a/.gitignore b/.gitignore
@@ -94,3 +94,6 @@ venv.bak/
 
 # Changelog entry
 ENTRY.md
+
+# Jupyter Notebook checkpoints
+*.ipynb_checkpoints/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -388,6 +388,21 @@ Nothing, initial release!
 
 </details>
 
+## [v2.5.0](https://github.com/eonu/sequentia/releases/tag/v2.5.0) - 2024-12-27
+
+### Documentation
+
+- update copyright notice ([#255](https://github.com/eonu/sequentia/issues/255))
+
+### Features
+
+- add `mise.toml` and support `numpy>=2` ([#254](https://github.com/eonu/sequentia/issues/254))
+- add python v3.13 support ([#253](https://github.com/eonu/sequentia/issues/253))
+- add library benchmarks ([#256](https://github.com/eonu/sequentia/issues/256))
+- add `model_selection` sub-package for hyper-parameters ([#257](https://github.com/eonu/sequentia/issues/257))
+- add model spec support to `HMMClassifier.__init__` ([#258](https://github.com/eonu/sequentia/issues/258))
+- add `HMMClassifier.fit` multiprocessing ([#259](https://github.com/eonu/sequentia/issues/259))
+
 ## [v2.0.2](https://github.com/eonu/sequentia/releases/tag/v2.0.2) - 2024-04-13
 
 ### Bug Fixes

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -50,6 +50,6 @@ We are thankful for their work and all the communities who have paved the way wi
 ---
 
 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -105,6 +105,6 @@ By contributing, you agree that your contributions will be licensed under the re
 ---
 
 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2019-2025 Edwin Onuonga (eonu) <ed@eonu.net>
+Copyright (c) 2019 Edwin Onuonga (eonu) <ed@eonu.net>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -34,6 +34,7 @@
     <a href="#about">About</a> ·
     <a href="#build-status">Build Status</a> ·
     <a href="#features">Features</a> ·
+    <a href="#installation">Installation</a> ·
     <a href="#documentation">Documentation</a> ·
     <a href="#examples">Examples</a> ·
     <a href="#acknowledgments">Acknowledgments</a> ·
@@ -68,12 +69,15 @@ Some examples of how Sequentia can be used on sequence data include:
 
 ### Models
 
-The following models provided by Sequentia all support variable length sequences.
-
 #### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))
 
+Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different length.
+When used as a distance measure for the k-Nearest Neighbors (kNN) algorithm this results in a simple yet
+effective inference algorithm.
+
 - [x] Classification
 - [x] Regression
+- [x] Variable length sequences
 - [x] Multivariate real-valued observations
 - [x] Sakoe–Chiba band global warping constraint
 - [x] Dependent and independent feature warping (DTWD/DTWI)
@@ -82,19 +86,82 @@ The following models provided by Sequentia all support variable length sequences
 
 #### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))
 
-Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm [[1]](#references)
+A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as 
+a series of observations that are emitted from a collection of latent hidden states which form
+an underlying Markov chain. Each hidden state has an emission distribution that models its observations.
+
+Expectation-maximization via the Baum-Welch algorithm (or forward-backward algorithm) [[1]](#references) is used to 
+derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters 
+based on the provided training sequence data.
 
 - [x] Classification
-- [x] Multivariate real-valued observations (Gaussian mixture model emissions)
-- [x] Univariate categorical observations (discrete emissions)
+- [x] Variable length sequences
+- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
+- [x] Univariate categorical observations (modeled with discrete emissions)
 - [x] Linear, left-right and ergodic topologies
 - [x] Multi-processed predictions
 
 ### Scikit-Learn compatibility
 
-**Sequentia (≥2.0) is fully compatible with the Scikit-Learn API (≥1.4), enabling for rapid development and prototyping of sequential models.**
+**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**
+
+The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html), 
+which means that in most cases, the only necessary change is to add a `lengths` key-word argument to provide 
+sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
+
+### Similar libraries
+
+As DTW k-nearest neighbors is the core algorithm offered by Sequentia, below is a comparison of the DTW k-nearest neighbors algorithm features supported by Sequentia and similar libraries.
+
+||**`sequentia`**|[`aeon`](https://github.com/aeon-toolkit/aeon)|[`tslearn`](https://github.com/tslearn-team/tslearn)|[`sktime`](https://github.com/sktime/sktime)|[`pyts`](https://github.com/johannfaouzi/pyts)|
+|-|:-:|:-:|:-:|:-:|:-:|
+|Scikit-Learn compatible|✅|✅|✅|✅|✅|
+|Multivariate sequences|✅|✅|✅|✅|❌|
+|Variable length sequences|✅|✅|➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
+|No padding required|✅|❌|➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
+|Classification|✅|✅|✅|✅|✅|
+|Regression|✅|✅|✅|✅|❌|
+|Preprocessing|✅|✅|✅|✅|✅|
+|Multiprocessing|✅|✅|✅|✅|✅|
+|Custom weighting|✅|✅|✅|✅|✅|
+|Sakoe-Chiba band constraint|✅|✅|✅|✅|✅|
+|Itakura parallelogram constraint|❌|✅|✅|✅|✅|
+|Dependent DTW (DTWD)|✅|✅|✅|✅|❌|
+|Independent DTW (DTWI)|✅|❌|❌|❌|✅|
+|Custom DTW measures|❌<sup>4</sup>|✅|❌|✅|✅|
+
+- <sup>1</sup>`tslearn` supports variable length sequences with padding, but doesn't seem to mask the padding.
+- <sup>2</sup>`sktime` does not support variable length sequences, so they are padded (and padding is not masked).
+- <sup>3</sup>`pyts` does not support variable length sequences, so they are padded (and padding is not masked).
+- <sup>4</sup>`sequentia` only supports [`dtaidistance`](https://github.com/wannesm/dtaidistance), which is one of the fastest DTW libraries as it is written in C.
+
+### Benchmarks
+
+To compare the above libraries in runtime performance on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.
+
+The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/sections/datasets/digits.html) was used for benchmarking and consists of:
+
+- 3000 recordings of 10 spoken digits (0-9)
+  - 50 recordings of each digit for each of 6 speakers
+  - 1500 used for training, 1500 used for testing (split via label stratification)
+- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
+  - Only the first feature was used as not all of the above libraries support multivariate sequences
+- Sequence length statistics: (min 6, median 17, max 92)
+
+Each result measures the total time taken to complete training and prediction repeated 10 times.
+
+All of the above libraries support multiprocessing, and prediction was performed using 16 workers.
 
-In most cases, the only necessary change is to add a `lengths` key-word argument to provide sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
+<sup>*</sup>: `sktime`, `tslearn` and `pyts` seem to not mask padding, which may result in incorrect predictions.
+
+<img src="benchmarks/benchmark.svg" width="100%"/>
+
+> **Device information**:
+> - Product: ThinkPad T14s (Gen 6)
+> - Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2-5GHz)
+> - Memory: 64 GB LPDDR5X-7500MHz
+> - Solid State Drive: 1 TB SSD M.2 2280 PCIe Gen4 Performance TLC Opal 
+> - Operating system: Fedora Linux 41 (Workstation Edition)
 
 ## Installation
 
@@ -104,13 +171,13 @@ The latest stable version of Sequentia can be installed with the following comma
 pip install sequentia
 ```
 
-### C library compilation
+### C libraries
 
-For optimal performance when using any of the k-NN based models, it is important that `dtaidistance` C libraries are compiled correctly.
+For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.
 
 Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.
 
-You can use the following to check if the appropriate C libraries have been installed.
+You can use the following to check if the appropriate C libraries are available.
 
 ```python
 from dtaidistance import dtw
@@ -127,26 +194,25 @@ Documentation for the package is available on [Read The Docs](https://sequentia.
 
 ## Examples
 
-Demonstration of classifying multivariate sequences with two features into two classes using the `KNNClassifier`.
+Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.
 
-This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn.
+This example also shows a typical preprocessing workflow, as well as compatibility with 
+Scikit-Learn for pipelining and hyper-parameter optimization.
 
-```python
-import numpy as np
+---
 
-from sklearn.preprocessing import scale
-from sklearn.decomposition import PCA
-from sklearn.pipeline import Pipeline
+First, we create some sample multivariate input data consisting of three sequences with two features.
 
-from sequentia.models import KNNClassifier
-from sequentia.preprocessing import IndependentFunctionTransformer, median_filter
+- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
+- Sequence lengths are provided separately and used to decode the sequences when needed.
 
-# Create input data
-# - Sequentia expects sequences to be concatenated into a single array
-# - Sequence lengths are provided separately and used to decode the sequences when needed
-# - This avoids the need for complex structures such as lists of arrays with different lengths
+This avoids the need for complex structures such as lists of nested arrays with different lengths, 
+or a 3D array with wasteful and annoying padding.
+
+```python
+import numpy as np
 
-# Sequences
+# Sequence data
 X = np.array([
     # Sequence 1 - Length 3
     [1.2 , 7.91],
@@ -168,27 +234,99 @@ lengths = np.array([3, 5, 2])
 
 # Sequence classes
 y = np.array([0, 1, 1])
+```
+
+With this data, we can train a `KNNClassifier` and use it for prediction and scoring.
+
+**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths 
+to be provided in addition to the sequence data `X` and labels `y`.
+
+```python
+from sequentia.models import KNNClassifier
+
+# Initialize and fit the classifier
+clf = KNNClassifier(k=1)
+clf.fit(X, y, lengths=lengths)
+
+# Make predictions based on the provided sequences
+y_pred = clf.predict(X, lengths=lengths)
+
+# Make predictions based on the provided sequences and calculate accuracy
+acc = clf.score(X, y, lengths=lengths)
+```
+
+Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:
+
+1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter) to each sequence.
+2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the standard deviation for each feature.
+3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
+4. Pass the resulting transformed data into a `KNNClassifier`.
+
+**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to 
+apply the specified transformation to each sequence in `X` individually, rather than using 
+[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn which would transform the entire `X`
+array once, treating it as a single sequence.
 
-# Create a transformation pipeline that feeds into a KNNClassifier
-# 1. Individually denoise each sequence by applying a median filter for each feature
-# 2. Individually standardize each sequence by subtracting the mean and dividing the s.d. for each feature
-# 3. Reduce the dimensionality of the data to a single feature by using PCA
-# 4. Pass the resulting transformed data into a KNNClassifier
+```python
+from sklearn.preprocessing import scale
+from sklearn.decomposition import PCA
+from sklearn.pipeline import Pipeline
+
+from sequentia.preprocessing import IndependentFunctionTransformer, median_filter
+
+# Create a preprocessing pipeline that feeds into a KNNClassifier
 pipeline = Pipeline([
     ('denoise', IndependentFunctionTransformer(median_filter)),
     ('scale', IndependentFunctionTransformer(scale)),
     ('pca', PCA(n_components=1)),
     ('knn', KNNClassifier(k=1))
 ])
 
-# Fit the pipeline to the data - lengths must be provided
+# Fit the pipeline to the data
 pipeline.fit(X, y, lengths=lengths)
 
-# Predict classes for the sequences and calculate accuracy - lengths must be provided
+# Predict classes for the sequences
 y_pred = pipeline.predict(X, lengths=lengths)
+
+# Make predictions based on the provided sequences and calculate accuracy
 acc = pipeline.score(X, y, lengths=lengths)
 ```
 
+For hyper-parameter optimization, Sequentia provides a `sequentia.model_selection` sub-package
+that includes most of the hyper-parameter search and cross-validation methods provided by 
+[`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html), 
+but adapted to work with sequences.
+
+For instance, we can perform a grid search with k-fold cross-validation stratifying over labels
+in order to find an optimal value for the number of neighbors in `KNNClassifier` for the 
+above pipeline.
+
+```python
+from sequentia.model_selection import StratifiedKFold, GridSearchCV
+
+# Define hyper-parameter search and specify cross-validation method
+search = GridSearchCV(
+    # Re-use the above pipeline
+    estimator=Pipeline([
+        ('denoise', IndependentFunctionTransformer(median_filter)),
+        ('scale', IndependentFunctionTransformer(scale)),
+        ('pca', PCA(n_components=1)),
+        ('knn', KNNClassifier(k=1))
+    ]),
+    # Try a range of values of k
+    param_grid={"knn__k": [1, 2, 3, 4, 5]},
+    # Specify k-fold cross-validation with label stratification using 4 splits
+    cv=StratifiedKFold(n_splits=4),
+)
+
+# Perform cross-validation over accuracy and retrieve the best model
+search.fit(X, y, lengths=lengths)
+clf = search.best_estimator_
+
+# Make predictions using the best model and calculate accuracy
+acc = clf.score(X, y, lengths=lengths)
+```
+
 ## Acknowledgments
 
 In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
@@ -262,12 +400,12 @@ All contributions to this repository are greatly appreciated. Contribution guide
 
 Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.
 
-Certain parts of the source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
+Certain parts of source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
 Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).
 
 ---
 
 <p align="center">
-  <b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
+  <b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
   <em>Authored and maintained by Edwin Onuonga.</em>
 </p>
diff --git a/benchmarks/__init__.py b/benchmarks/__init__.py
@@ -0,0 +1,8 @@
+# Copyright (c) 2019 Sequentia Developers.
+# Distributed under the terms of the MIT License (see the LICENSE file).
+# SPDX-License-Identifier: MIT
+# This source code is part of the Sequentia project (https://github.com/eonu/sequentia).
+
+"""Collection of runtime benchmarks for Python packages
+providing dynamic time warping k-nearest neighbors algorithms.
+"""
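
Note on the README change above: it states that sequences are concatenated into a single NumPy array and that the separately supplied lengths are used to decode them when needed. The following is a minimal sketch of what such decoding could look like, using only NumPy. The helper name `split_sequences` and the sample data are hypothetical and are not taken from the Sequentia source, which may implement this differently.

```python
import numpy as np

# Hypothetical data: three sequences (lengths 3, 5 and 2) with two features,
# concatenated row-wise into a single array of shape (10, 2).
X = np.arange(20, dtype=float).reshape(10, 2)
lengths = np.array([3, 5, 2])


def split_sequences(X, lengths):
    """Split a concatenated array back into a list of per-sequence arrays.

    Illustrative helper only, not part of the Sequentia API.
    """
    if lengths.sum() != len(X):
        raise ValueError("lengths must sum to the number of rows in X")
    # Cumulative sums give the row index where each sequence ends. The last
    # one is dropped because np.split only takes the interior boundaries.
    boundaries = np.cumsum(lengths)[:-1]
    return np.split(X, boundaries)


sequences = split_sequences(X, lengths)
shapes = [seq.shape for seq in sequences]
```

Keeping the data as one flat array plus a lengths vector means no padding is stored, and the per-sequence views are only materialized when an operation needs them.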