`sklearn-mrmr`: MRMR feature selection for `scikit-learn`

Release date: August 30, 2024, v.0.1

This repo provides a Python library that implements scikit-learn-compatible feature selection via Minimum Redundancy - Maximum Relevance (MRMR). It aims to work seamlessly with scikit-learn's pipelines, hyperparameter optimization, and models. Both regression and classification tasks are supported. The number of features selected by MRMR is itself a hyperparameter, and can be tuned using scikit-learn's pipeline and grid search functionality.
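Tuning the number of selected features can be sketched with plain scikit-learn. This README does not name the library's selector class, so the sketch below uses scikit-learn's `SelectKBest` as a stand-in (an assumption, for illustration only). Any scikit-learn-compatible selector should slot into the same pipeline step, with its own parameter name in place of `k`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, of which 5 are informative and 5 are redundant.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=5, random_state=0
)

# SelectKBest is only a stand-in for this library's MRMR selector (assumption).
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# The number of selected features is searched like any other hyperparameter.
grid = GridSearchCV(pipe, param_grid={"select__k": [3, 5, 10, 20]}, cv=3)
grid.fit(X, y)

best_k = grid.best_params_["select__k"]
print("best number of features:", best_k)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the feature choice never sees the held-out data.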

Other repos on GitHub implement MRMR in Python, but they often lack compatibility with scikit-learn, which limits their utility.

MRMR scores a feature based on its relevance to the target variable and its redundancy with other features. The goal is to select features that have strong relationships with the target variable and are minimally redundant with one another.

Four variants of MRMR are implemented. The canonical variant uses mutual information (MI) to calculate both relevance and redundancy. Because MI can be resource-heavy to compute, other formulations have been proposed (although this library uses scikit-learn's implementation of mutual information, which is quite optimized and offers parallel processing). A second variant uses the F-test to calculate relevance and Pearson correlation to calculate redundancy. This proves to be much faster, without a clear loss in performance. Each of these two variants can combine relevance and redundancy by either subtraction or division, giving four variants in total.
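To make the two families concrete, here is a minimal sketch of the underlying quantities computed directly with scikit-learn and NumPy on synthetic classification data. It is not this library's internal code. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

# Relevance of each feature to the target, measured two ways.
mi_relevance = mutual_info_classif(X, y, random_state=0)  # MI-based
f_relevance, _ = f_classif(X, y)                          # F-test-based

# Pairwise Pearson correlation between features, the redundancy measure
# of the F-test variant.
corr = np.corrcoef(X, rowvar=False)

# Features ranked by relevance alone, before any redundancy penalty.
print("MI ranking:    ", np.argsort(mi_relevance)[::-1])
print("F-test ranking:", np.argsort(f_relevance)[::-1])
```

The two rankings need not agree, since MI and the F-statistic measure different kinds of dependence and live on different scales.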

Variants using subtraction:

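MI: $$f^{canonical}(X_i) = MI(Y, X_i) - \frac{1}{|S|} \sum_{X_s \in S} MI(X_s, X_i)$$

F-test: $$f^{Ftest}(X_i) = F(Y, X_i) - \frac{1}{|S|} \sum_{X_s \in S} \rho(X_s, X_i)$$

Here $S$ is the set of features already selected, $X_i$ is a candidate feature, and $\rho$ is the Pearson correlation.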
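The F-test subtraction score can be turned into a greedy selection loop. The following is a minimal sketch for a regression task, not this library's implementation: the function name `mrmr_ftest_difference` is hypothetical, the first feature is chosen by relevance alone (the redundancy term is undefined while the selected set is empty), and the absolute value of the correlation is used so that negative correlations also count as redundancy. That last choice is an assumption that goes beyond the formula as written.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression


def mrmr_ftest_difference(X, y, n_features):
    """Greedy MRMR sketch: F-test relevance minus mean |Pearson| redundancy."""
    relevance, _ = f_regression(X, y)
    # Assumption: absolute correlation, so strong negative correlation
    # is penalized as redundancy as well.
    corr = np.abs(np.corrcoef(X, rowvar=False))

    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(X.shape[1]) if i not in selected]

    while len(selected) < n_features and remaining:
        # Mean correlation of each candidate with the already selected features.
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] - redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)
selected = mrmr_ftest_difference(X, y, n_features=4)
print("selected feature indices:", selected)
```

Note that the F-statistic is unbounded while a correlation is at most 1 in magnitude, so in the subtraction form the relevance term can dominate the score.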

Note that MRMR is not guaranteed to improve your model's performance. As with anything ML, its effectiveness depends on your data and modeling strategy. My (anecdotal) experience seems to suggest that MRMR is particularly beneficial in scenarios involving high model complexity / many correlated features. The benefit can come as either improved performance or decreased variance.

Installation

To install from this GitHub repo, clone it and run:

python setup.py install

Example

See demo.ipynb for an example of how to use this library with scikit-learn's functionality.

Contact

Feel free to e-mail me any problems or thoughts: benhorvath@gmail.com

References

Original MRMR paper: "Minimum redundancy feature selection from microarray gene expression data"
Uber's more recent paper: "Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform"
Samuele Mazzanti's "MRMR Explained: Exactly How You Wished Someone Explained to You"
