release: v2.5.0
eonu authored Dec 27, 2024
2 parents b5a4b0f + cf52c29 commit a54dcdb
Showing 108 changed files with 4,113 additions and 195 deletions.
4 changes: 2 additions & 2 deletions .circleci/config.yml
@@ -46,7 +46,7 @@ jobs:
name: python/default
steps:
- coveralls/upload:
carryforward: 3.11, 3.12
carryforward: 3.11, 3.12, 3.13
parallel_finished: true

workflows:
@@ -56,7 +56,7 @@ workflows:
- tests:
matrix:
parameters:
version: ["3.11", "3.12"]
version: ["3.11", "3.12", "3.13"]
- coverage:
requires:
- tests
2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
omit = "sequentia/model_selection/_validation.py"
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.ipynb linguist-documentation
3 changes: 3 additions & 0 deletions .gitignore
@@ -94,3 +94,6 @@ venv.bak/

# Changelog entry
ENTRY.md

# Jupyter Notebook checkpoints
*.ipynb_checkpoints/
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -388,6 +388,21 @@ Nothing, initial release!

</details>

## [v2.5.0](https://github.com/eonu/sequentia/releases/tag/v2.5.0) - 2024-12-27

### Documentation

- update copyright notice ([#255](https://github.com/eonu/sequentia/issues/255))

### Features

- add `mise.toml` and support `numpy>=2` ([#254](https://github.com/eonu/sequentia/issues/254))
- add python v3.13 support ([#253](https://github.com/eonu/sequentia/issues/253))
- add library benchmarks ([#256](https://github.com/eonu/sequentia/issues/256))
- add `model_selection` sub-package for hyper-parameters ([#257](https://github.com/eonu/sequentia/issues/257))
- add model spec support to `HMMClassifier.__init__` ([#258](https://github.com/eonu/sequentia/issues/258))
- add `HMMClassifier.fit` multiprocessing ([#259](https://github.com/eonu/sequentia/issues/259))

## [v2.0.2](https://github.com/eonu/sequentia/releases/tag/v2.0.2) - 2024-04-13

### Bug Fixes
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -50,6 +50,6 @@ We are thankful for their work and all the communities who have paved the way wi
---

<p align="center">
<b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<em>Authored and maintained by Edwin Onuonga.</em>
</p>
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -105,6 +105,6 @@ By contributing, you agree that your contributions will be licensed under the re
---

<p align="center">
<b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<em>Authored and maintained by Edwin Onuonga.</em>
</p>
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2019-2025 Edwin Onuonga (eonu) <ed@eonu.net>
Copyright (c) 2019 Edwin Onuonga (eonu) <ed@eonu.net>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
204 changes: 171 additions & 33 deletions README.md
@@ -34,6 +34,7 @@
<a href="#about">About</a> ·
<a href="#build-status">Build Status</a> ·
<a href="#features">Features</a> ·
<a href="#installation">Installation</a> ·
<a href="#documentation">Documentation</a> ·
<a href="#examples">Examples</a> ·
<a href="#acknowledgments">Acknowledgments</a> ·
@@ -68,12 +69,15 @@ Some examples of how Sequentia can be used on sequence data include:

### Models

The following models provided by Sequentia all support variable length sequences.

#### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))

Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different lengths.
When used as a distance measure for the k-Nearest Neighbors (kNN) algorithm, this results in a simple yet
effective inference algorithm.
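
To make the idea concrete, the following is a minimal sketch of the classic DTW recurrence in NumPy. It is for illustration only: Sequentia delegates DTW computation to `dtaidistance`, and the function name `dtw_distance`, as well as the choice of a squared Euclidean local cost with a final square root, are assumptions made here rather than part of either library's API.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Unconstrained DTW distance between two sequences of shape (length, n_features)."""
    n, m = len(a), len(b)
    # cost[i, j] holds the cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.sum((a[i - 1] - b[j - 1]) ** 2)
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


# Two sequences of different length that trace the same shape at different speeds
short_seq = np.array([[0.0], [1.0], [2.0]])
long_seq = np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])
print(dtw_distance(short_seq, long_seq))  # 0.0, as the warping absorbs the repeats
```

A k-NN classifier built on this measure would label a query sequence by the classes of the `k` training sequences with the smallest DTW distance to it.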

- [x] Classification
- [x] Regression
- [x] Variable length sequences
- [x] Multivariate real-valued observations
- [x] Sakoe–Chiba band global warping constraint
- [x] Dependent and independent feature warping (DTWD/DTWI)
@@ -82,19 +86,82 @@ The following models provided by Sequentia all support variable length sequences

#### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))

Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm [[1]](#references)
A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as
a series of observations that are emitted from a collection of latent hidden states which form
an underlying Markov chain. Each hidden state has an emission distribution that models its observations.

Expectation-maximization via the Baum-Welch algorithm (or forward-backward algorithm) [[1]](#references) is used to
derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters
based on the provided training sequence data.
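
Once parameters have been estimated, a sequence can be scored by its likelihood under the model using the forward algorithm. The following is a minimal sketch for a discrete-emission HMM in NumPy. The parameter values are made up and the function name `forward_log_likelihood` is hypothetical; Sequentia relies on `hmmlearn` for the actual computation.

```python
import numpy as np


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM.

    start: (n_states,) initial state probabilities
    trans: (n_states, n_states) row-stochastic transition matrix
    emit:  (n_states, n_symbols) row-stochastic emission matrix
    """
    # alpha[s] is proportional to P(observations so far, current state = s)
    alpha = start * emit[:, obs[0]]
    log_lik = 0.0
    for symbol in obs[1:]:
        # Rescale at every step to avoid numerical underflow on long sequences
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = ((alpha / scale) @ trans) * emit[:, symbol]
    return log_lik + np.log(alpha.sum())


# Made-up two-state model over a binary alphabet
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])

print(forward_log_likelihood([0, 0, 1], start, trans, emit))
```

A classifier that holds one HMM per class can then assign a sequence to the class whose model gives it the highest likelihood, optionally weighted by a class prior.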

- [x] Classification
- [x] Multivariate real-valued observations (Gaussian mixture model emissions)
- [x] Univariate categorical observations (discrete emissions)
- [x] Variable length sequences
- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
- [x] Univariate categorical observations (modeled with discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions

### Scikit-Learn compatibility

**Sequentia (≥2.0) is fully compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**
**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**

The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html),
which means that in most cases, the only necessary change is to add a `lengths` key-word argument to provide
sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.

### Similar libraries

As DTW k-nearest neighbors is the core algorithm offered by Sequentia, below is a comparison of the DTW k-nearest neighbors algorithm features supported by Sequentia and similar libraries.

||**`sequentia`**|[`aeon`](https://github.com/aeon-toolkit/aeon)|[`tslearn`](https://github.com/tslearn-team/tslearn)|[`sktime`](https://github.com/sktime/sktime)|[`pyts`](https://github.com/johannfaouzi/pyts)|
|-|:-:|:-:|:-:|:-:|:-:|
|Scikit-Learn compatible||||||
|Multivariate sequences||||||
|Variable length sequences|||➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
|No padding required|||➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
|Classification||||||
|Regression||||||
|Preprocessing||||||
|Multiprocessing||||||
|Custom weighting||||||
|Sakoe-Chiba band constraint||||||
|Itakura parallelogram constraint||||||
|Dependent DTW (DTWD)||||||
|Independent DTW (DTWI)||||||
|Custom DTW measures|❌<sup>4</sup>|||||

- <sup>1</sup>`tslearn` supports variable length sequences with padding, but doesn't seem to mask the padding.
- <sup>2</sup>`sktime` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>3</sup>`pyts` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>4</sup>`sequentia` only supports [`dtaidistance`](https://github.com/wannesm/dtaidistance), which is one of the fastest DTW libraries as it is written in C.

### Benchmarks

To compare the runtime performance of the above libraries on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.

The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/sections/datasets/digits.html) was used for benchmarking and consists of:

- 3000 recordings of 10 spoken digits (0-9)
- 50 recordings of each digit for each of 6 speakers
- 1500 used for training, 1500 used for testing (split via label stratification)
- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
- Only the first feature was used as not all of the above libraries support multivariate sequences
- Sequence length statistics: (min 6, median 17, max 92)

Each result measures the total time taken to complete training and prediction repeated 10 times.

All of the above libraries support multiprocessing, and prediction was performed using 16 workers.

In most cases, the only necessary change is to add a `lengths` key-word argument to provide sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
<sup>*</sup>: `sktime`, `tslearn` and `pyts` seem to not mask padding, which may result in incorrect predictions.

<img src="benchmarks/benchmark.svg" width="100%"/>

> **Device information**:
> - Product: ThinkPad T14s (Gen 6)
> - Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2-5GHz)
> - Memory: 64 GB LPDDR5X-7500MHz
> - Solid State Drive: 1 TB SSD M.2 2280 PCIe Gen4 Performance TLC Opal
> - Operating system: Fedora Linux 41 (Workstation Edition)
## Installation

@@ -104,13 +171,13 @@ The latest stable version of Sequentia can be installed with the following comma
pip install sequentia
```

### C library compilation
### C libraries

For optimal performance when using any of the k-NN based models, it is important that `dtaidistance` C libraries are compiled correctly.
For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.

Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.

You can use the following to check if the appropriate C libraries have been installed.
You can use the following to check if the appropriate C libraries are available.

```python
from dtaidistance import dtw
@@ -127,26 +194,25 @@ Documentation for the package is available on [Read The Docs](https://sequentia.

## Examples

Demonstration of classifying multivariate sequences with two features into two classes using the `KNNClassifier`.
Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.

This example also shows a typical preprocessing workflow, as well as compatibility with Scikit-Learn.
This example also shows a typical preprocessing workflow, as well as compatibility with
Scikit-Learn for pipelining and hyper-parameter optimization.

```python
import numpy as np
---

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
First, we create some sample multivariate input data consisting of three sequences with two features.

from sequentia.models import KNNClassifier
from sequentia.preprocessing import IndependentFunctionTransformer, median_filter
- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
- Sequence lengths are provided separately and used to decode the sequences when needed.

# Create input data
# - Sequentia expects sequences to be concatenated into a single array
# - Sequence lengths are provided separately and used to decode the sequences when needed
# - This avoids the need for complex structures such as lists of arrays with different lengths
This avoids the need for complex structures such as lists of nested arrays with different lengths,
or a 3D array with wasteful and annoying padding.

```python
import numpy as np

# Sequences
# Sequence data
X = np.array([
# Sequence 1 - Length 3
[1.2 , 7.91],
@@ -168,27 +234,99 @@ lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])
```
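
As a rough illustration of what "decoding" means here, a concatenated array can be split back into individual sequences using the cumulative sum of the lengths. This is a plain NumPy sketch with placeholder data (`X_demo`, `lengths_demo`), not Sequentia's internal implementation.

```python
import numpy as np

# Placeholder data: 10 observations with 2 features, forming 3 sequences
X_demo = np.arange(20, dtype=float).reshape(10, 2)
lengths_demo = np.array([3, 5, 2])

# Split points are the cumulative lengths, excluding the final total
boundaries = np.cumsum(lengths_demo)[:-1]
sequences = np.split(X_demo, boundaries)

for i, seq in enumerate(sequences, start=1):
    print(f"Sequence {i}: shape {seq.shape}")
```

Concatenating the pieces again with `np.concatenate(sequences)` recovers the original array, so no information is lost by storing the sequences this way.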

With this data, we can train a `KNNClassifier` and use it for prediction and scoring.

**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths
to be provided in addition to the sequence data `X` (and the labels `y`, where applicable).

```python
from sequentia.models import KNNClassifier

# Initialize and fit the classifier
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)

# Make predictions based on the provided sequences
y_pred = clf.predict(X, lengths=lengths)

# Make predictions based on the provided sequences and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```

Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:

1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter) to it.
2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the standard deviation of each feature.
3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
4. Pass the resulting transformed data into a `KNNClassifier`.

**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to
apply the specified transformation to each sequence in `X` individually, rather than using
[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn which would transform the entire `X`
array once, treating it as a single sequence.
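
The difference between the two behaviours can be sketched in plain NumPy. The helpers `standardize` and `apply_per_sequence` below are hypothetical and only mimic the idea; they are not how `IndependentFunctionTransformer` is implemented.

```python
import numpy as np


def standardize(x: np.ndarray) -> np.ndarray:
    """Scale each feature (column) to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def apply_per_sequence(func, X, lengths):
    """Apply func to each sequence separately, then re-concatenate the results."""
    boundaries = np.cumsum(lengths)[:-1]
    return np.vstack([func(seq) for seq in np.split(X, boundaries)])


# Two sequences on very different scales, concatenated into one array
X_demo = np.array([[1.0], [2.0], [3.0], [100.0], [200.0]])
lengths_demo = np.array([3, 2])

per_sequence = apply_per_sequence(standardize, X_demo, lengths_demo)
whole_array = standardize(X_demo)

# Per-sequence scaling centres each sequence on its own mean
print(per_sequence[:3].mean())  # approximately 0
# Whole-array scaling lets the second sequence drag the first below zero
print(whole_array[:3].mean())   # clearly negative
```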

# Create a transformation pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a median filter for each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing the s.d. for each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
# 4. Pass the resulting transformed data into a KNNClassifier
```python
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sequentia.preprocessing import IndependentFunctionTransformer, median_filter

# Create a preprocessing pipeline that feeds into a KNNClassifier
pipeline = Pipeline([
('denoise', IndependentFunctionTransformer(median_filter)),
('scale', IndependentFunctionTransformer(scale)),
('pca', PCA(n_components=1)),
('knn', KNNClassifier(k=1))
])

# Fit the pipeline to the data - lengths must be provided
# Fit the pipeline to the data
pipeline.fit(X, y, lengths=lengths)

# Predict classes for the sequences and calculate accuracy - lengths must be provided
# Predict classes for the sequences and calculate accuracy
y_pred = pipeline.predict(X, lengths=lengths)

# Make predictions based on the provided sequences and calculate accuracy
acc = pipeline.score(X, y, lengths=lengths)
```

For hyper-parameter optimization, Sequentia provides a `sequentia.model_selection` sub-package
that includes most of the hyper-parameter search and cross-validation methods provided by
[`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html),
but adapted to work with sequences.

For instance, we can perform a grid search with k-fold cross-validation stratifying over labels
in order to find an optimal value for the number of neighbors in `KNNClassifier` for the
above pipeline.

```python
from sequentia.model_selection import StratifiedKFold, GridSearchCV

# Define hyper-parameter search and specify cross-validation method
search = GridSearchCV(
# Re-use the above pipeline
estimator=Pipeline([
('denoise', IndependentFunctionTransformer(median_filter)),
('scale', IndependentFunctionTransformer(scale)),
('pca', PCA(n_components=1)),
('knn', KNNClassifier(k=1))
]),
# Try a range of values of k
param_grid={"knn__k": [1, 2, 3, 4, 5]},
# Specify k-fold cross-validation with label stratification using 4 splits
cv=StratifiedKFold(n_splits=4),
)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(X, y, lengths=lengths)
clf = search.best_estimator_

# Make predictions using the best model and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```

## Acknowledgments

In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
@@ -262,12 +400,12 @@ All contributions to this repository are greatly appreciated. Contribution guide

Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.

Certain parts of the source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
Certain parts of source code are heavily adapted from [Scikit-Learn](scikit-learn.org/).
Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).

---

<p align="center">
<b>Sequentia</b> &copy; 2019-2025, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<b>Sequentia</b> &copy; 2019, Edwin Onuonga - Released under the <a href="https://opensource.org/licenses/MIT">MIT</a> license.<br/>
<em>Authored and maintained by Edwin Onuonga.</em>
</p>
8 changes: 8 additions & 0 deletions benchmarks/__init__.py
@@ -0,0 +1,8 @@
# Copyright (c) 2019 Sequentia Developers.
# Distributed under the terms of the MIT License (see the LICENSE file).
# SPDX-License-Identifier: MIT
# This source code is part of the Sequentia project (https://github.com/eonu/sequentia).

"""Collection of runtime benchmarks for Python packages
providing dynamic time warping k-nearest neighbors algorithms.
"""