kengz/feature_transform

Feature Transform


Build Scikit ColumnTransformers by specifying configs.

See also TorchArc to build PyTorch models by specifying architectures.

Installation

pip install feature_transform

Usage

  1. specify column transformers in a YAML spec file, e.g. at spec_filepath = "./example/spec/basic.yaml"
  2. import feature_transform as ft
    1. (optional) if you have a custom sklearn estimator/preprocessor, e.g. Dummy, register it with ft.register_class(Dummy)
  3. build with col_tfm = ft.build(spec_filepath)

The returned object is a sklearn ColumnTransformer ready for normal use.

See more examples below, then see how it works at the end.
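Step 2.1 above assumes a custom class that is compatible with the sklearn estimator API. A minimal sketch of such a Dummy pass-through transformer (illustrative only; the library does not ship this class) could look like:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class Dummy(BaseEstimator, TransformerMixin):
    """Pass-through transformer: returns its input unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# then register it so specs can reference it by class name:
# import feature_transform as ft
# ft.register_class(Dummy)
```

Any class following the fit/transform convention (e.g. subclassing BaseEstimator and TransformerMixin) should work the same way.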


Example: build ColumnTransformer from spec file

from pathlib import Path

import joblib
import yaml
from sklearn import datasets

import feature_transform as ft

filepath = Path(".") / "feature_transform" / "example" / "spec" / "basic.yaml"

# The following are equivalent:

# 1. build from YAML spec file
col_tfm = ft.build(filepath)

# 2. build from dictionary
with filepath.open("r") as f:
    spec_dict = yaml.safe_load(f)
col_tfm = ft.build(spec_dict)

# 3. use the underlying Pydantic validator to build the col_tfm
spec = ft.Spec(**spec_dict)
col_tfm = spec.build()

Next, load demo data for examples below:

# ================================================
# Load demo data

x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)

x_df.columns
# Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
#        'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
#        'proanthocyanins', 'color_intensity', 'hue',
#        'od280/od315_of_diluted_wines', 'proline'],
#       dtype='object')

Example: basic

Spec file: feature_transform/example/spec/basic.yaml

transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]

col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,

# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")

# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_xs = loaded_col_tfm.transform(x_df)



Example: basic with pandas/polars dataframe

Spec file: feature_transform/example/spec/basic.yaml

transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]

col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")
# to use with dataframe, set output to "pandas" or "polars"
col_tfm.set_output(transform="pandas")

feat_x_df = col_tfm.fit_transform(x_df)
feat_x_df
# 	standardscaler__alcohol	standardscaler__total_phenols	robustscaler__ash
# 0	1.518613	0.808997	0.201439
# 1	0.246290	0.568648	-0.633094
# ...

feat_x_df.describe()
# 	standardscaler__alcohol	standardscaler__total_phenols	robustscaler__ash
# count	1.780000e+02	178.000000	178.000000
# mean	-8.382808e-16	0.000000	0.018754
# std	1.002821e+00	1.002821	0.789479
# ...

# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")

# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_x_df = loaded_col_tfm.transform(x_df)



Example: specify name; use int columns

Spec file: feature_transform/example/spec/name-intcol.yaml

transformers:
  - name: std
    transformer:
      preprocessing.StandardScaler:
    columns: [0, 5]
  - name: robust
    transformer:
      preprocessing.RobustScaler:
    columns: [2]

col_tfm = ft.build(ft.SPEC_DIR / "name-intcol.yaml")

feat_xs = col_tfm.fit_transform(x_df)
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
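When name is given, the spec maps to sklearn's (name, transformer, columns) triples, and int columns select positionally. An illustrative sklearn-only equivalent of the spec above:

```python
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

x_df, _ = datasets.load_wine(return_X_y=True, as_frame=True)

# named (name, transformer, columns) triples with positional int columns:
# 0 = alcohol, 5 = total_phenols, 2 = ash
col_tfm = ColumnTransformer([
    ("std", StandardScaler(), [0, 5]),
    ("robust", RobustScaler(), [2]),
])
feat_xs = col_tfm.fit_transform(x_df)
```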



Example: pipeline

Spec file: feature_transform/example/spec/pipeline.yaml

transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      Pipeline:
        - impute.SimpleImputer:
            strategy: constant
        - preprocessing.RobustScaler:
    columns: [ash]

col_tfm = ft.build(ft.SPEC_DIR / "pipeline.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
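The Pipeline entry above chains an imputer into a scaler. A sklearn-only sketch of that sub-pipeline, for comparison:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# equivalent of the Pipeline entry in pipeline.yaml:
# fill missing values with a constant, then robust-scale
pipe = make_pipeline(
    SimpleImputer(strategy="constant"),
    RobustScaler(),
)
```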



Example: ColumnTransformer settings

Spec file: feature_transform/example/spec/settings.yaml

transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
# use all processors
n_jobs: -1
# for more kwargs see https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html

col_tfm = ft.build(ft.SPEC_DIR / "settings.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
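Top-level keys such as n_jobs pass through to the underlying ColumnTransformer. A hand-built sklearn equivalent, for comparison:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler, StandardScaler

# n_jobs=-1 uses all processors, as in the settings.yaml spec
col_tfm = make_column_transformer(
    (StandardScaler(), ["alcohol", "total_phenols"]),
    (RobustScaler(), ["ash"]),
    n_jobs=-1,
)
```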



Example: full X, y feature transform with save/load

Spec file (x): feature_transform/example/spec/wine/x.yaml

transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols, flavanoids, nonflavanoid_phenols, od280/od315_of_diluted_wines]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash, alcalinity_of_ash, proanthocyanins, hue]
  - transformer:
      preprocessing.PowerTransformer:
    columns: [malic_acid, magnesium, color_intensity, proline]
n_jobs: -1

Spec file (y): feature_transform/example/spec/wine/y.yaml

transformers:
  - transformer:
      preprocessing.OneHotEncoder:
        sparse_output: False
    columns: [target]

import joblib
from sklearn import datasets

import feature_transform as ft

x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)
y_df = y_sr.to_frame()  # ColumnTransformer takes only dataframe/matrix as input

x_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "x.yaml")
y_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "y.yaml")

# fit-transform
feat_xs = x_col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  1.03481896, ...,  1.69074868,
#          0.45145022,  1.06254129],
#        ...,

feat_ys = y_col_tfm.fit_transform(y_df)
feat_ys
# array([[1., 0., 0.],
#        ...,

# save for later use
joblib.dump(x_col_tfm, "x_col_tfm.joblib")
joblib.dump(y_col_tfm, "y_col_tfm.joblib")


# ... later, e.g. during batch inference
loaded_x_col_tfm = joblib.load("x_col_tfm.joblib")
feat_xs = loaded_x_col_tfm.transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  1.03481896, ...,  1.69074868,
#          0.45145022,  1.06254129],
#        ...,



Example: use helper to suggest spec

Most of the time, data preprocessing steps can be determined with rules of thumb; ft.suggest does exactly that (see feature_transform/helper.py for details). It produces a spec_dict that can be used directly with ft.build or edited further.

x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)

# suggest spec_dict - use directly or save to yaml for further editing
spec_dict = ft.suggest(x_df)
col_tfm = ft.build(spec_dict)

# fit-transform
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 0.8973384 ,  0.20143885, -0.90697674, ...,  0.80804954,
#         -0.43546273,  1.69074868],
#         ...,
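For illustration, such rules of thumb might be sketched as follows. suggest_spec here is a hypothetical stand-in for ft.suggest; the actual rules live in feature_transform/helper.py and may differ:

```python
import pandas as pd


def suggest_spec(df: pd.DataFrame) -> dict:
    """Hypothetical rule-of-thumb spec suggestion (stand-in for ft.suggest)."""
    num_cols = df.select_dtypes(include="number").columns.tolist()
    cat_cols = df.select_dtypes(exclude="number").columns.tolist()
    transformers = []
    if num_cols:
        # None mirrors YAML's empty mapping, i.e. default kwargs
        transformers.append(
            {"transformer": {"preprocessing.StandardScaler": None}, "columns": num_cols}
        )
    if cat_cols:
        transformers.append(
            {
                "transformer": {"preprocessing.OneHotEncoder": {"sparse_output": False}},
                "columns": cat_cols,
            }
        )
    return {"transformers": transformers}
```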



Example: more

See more example spec files in feature_transform/example/spec/.

How does it work

Feature Transform simply builds a sklearn ColumnTransformer and its estimators/pipelines with a 1-1 mapping from the spec file:

  1. The spec is defined via Pydantic in feature_transform/validator/. This defines:
    • spec: the Estimator, Pipeline, and ColumnTransformer
  2. If the spec specifies:
    1. transformers=list[(name, transformer, columns)], then ColumnTransformer is used
    2. transformers=list[(transformer, columns)], then make_column_transformer is used with auto-generated names
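The two cases above mirror sklearn's own naming behavior, which can be seen directly:

```python
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler

# case 1: explicit (name, transformer, columns) triples
named = ColumnTransformer([("std", StandardScaler(), ["a"])])

# case 2: (transformer, columns) pairs with auto-generated names
auto = make_column_transformer((StandardScaler(), ["a"]))
```

make_column_transformer derives names by lowercasing the class name, which is why the earlier examples show columns like standardscaler__alcohol.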

See more in the Pydantic spec definition in feature_transform/validator/.

Guiding principles

The design of Feature Transform is guided as follows:

  1. simple: the module spec is straightforward:
    1. it is simply a sklearn class name with kwargs.
    2. it supports official sklearn estimators, Pipeline, and custom modules registered via ft.register_class
  2. expressive: it can build both simple and advanced ColumnTransformers easily
  3. portable: it returns a ColumnTransformer that can be used anywhere; it is not a framework.
  4. parametrizable: spec-based feature transformation unlocks fast experimentation, e.g. by building logic for hyperparameter / data feature search

Development

Setup

Install uv for dependency management if you haven't already. Then run:

# setup virtualenv
uv sync

Unit Tests

uv run pytest