# Feature Transform

Build scikit-learn `ColumnTransformer`s by specifying configs.

See also TorchArc to build PyTorch models by specifying architectures.

## Installation

```shell
pip install feature_transform
```

## Usage

1. Specify column transformers in a YAML spec file, e.g. at `spec_filepath = "./example/spec/basic.yaml"`.
2. Import the library: `import feature_transform as ft`.
   - (optional) If you have a custom sklearn estimator/preprocessor, e.g. `Dummy`, register it with `ft.register_class(Dummy)`.
3. Build with: `col_tfm = ft.build(spec_filepath)`.

The returned object is a scikit-learn `ColumnTransformer` ready for normal use.
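For the optional custom-class step, here is a minimal sketch of the kind of transformer one might register. The `Dummy` name mirrors the usage steps above; the class itself is hypothetical, and the `ft.register_class` call is shown commented out since it assumes the library is importable:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Dummy(BaseEstimator, TransformerMixin):
    """Pass-through transformer: returns the input unchanged."""

    def fit(self, X, y=None):
        # nothing to learn for a pass-through
        return self

    def transform(self, X):
        return np.asarray(X)


# ft.register_class(Dummy)  # makes `Dummy:` resolvable from a spec file
```

Once registered, the class can be referenced in a spec by name, just like the built-in `preprocessing.*` estimators.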

See more examples below, then see how it works at the end.


## Example: build ColumnTransformer from spec file

```python
from pathlib import Path

import joblib
import yaml
from sklearn import datasets

import feature_transform as ft

filepath = Path(".") / "feature_transform" / "example" / "spec" / "basic.yaml"

# The following are equivalent:

# 1. build from a YAML spec file
col_tfm = ft.build(filepath)

# 2. build from a dictionary
with filepath.open("r") as f:
    spec_dict = yaml.safe_load(f)
col_tfm = ft.build(spec_dict)

# 3. use the underlying Pydantic validator to build the col_tfm
spec = ft.Spec(**spec_dict)
col_tfm = spec.build()
```

Next, load demo data for the examples below:

```python
# ================================================
# Load demo data

x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)

x_df.columns
# Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
#        'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
#        'proanthocyanins', 'color_intensity', 'hue',
#        'od280/od315_of_diluted_wines', 'proline'],
#       dtype='object')
```

## Example: basic

Spec file: `feature_transform/example/spec/basic.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
```

```python
col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,

# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")

# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_xs = loaded_col_tfm.transform(x_df)
```

ColumnTransformer col_tfm:


## Example: basic with pandas/polars dataframe

Spec file: `feature_transform/example/spec/basic.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
```

```python
col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")
# to use with a dataframe, set output to "pandas" or "polars"
col_tfm.set_output(transform="pandas")

feat_x_df = col_tfm.fit_transform(x_df)
feat_x_df
#    standardscaler__alcohol  standardscaler__total_phenols  robustscaler__ash
# 0  1.518613                 0.808997                        0.201439
# 1  0.246290                 0.568648                       -0.633094
# ...

feat_x_df.describe()
#        standardscaler__alcohol  standardscaler__total_phenols  robustscaler__ash
# count  1.780000e+02             178.000000                     178.000000
# mean   -8.382808e-16            0.000000                       0.018754
# std    1.002821e+00             1.002821                       0.789479
# ...

# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")

# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_x_df = loaded_col_tfm.transform(x_df)
```

ColumnTransformer col_tfm:


## Example: specify name; use int columns

Spec file: `feature_transform/example/spec/name-intcol.yaml`

```yaml
transformers:
  - name: std
    transformer:
      preprocessing.StandardScaler:
    columns: [0, 5]
  - name: robust
    transformer:
      preprocessing.RobustScaler:
    columns: [2]
```

```python
col_tfm = ft.build(ft.SPEC_DIR / "name-intcol.yaml")

feat_xs = col_tfm.fit_transform(x_df)
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
```

ColumnTransformer col_tfm:


## Example: pipeline

Spec file: `feature_transform/example/spec/pipeline.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      Pipeline:
        - impute.SimpleImputer:
            strategy: constant
        - preprocessing.RobustScaler:
    columns: [ash]
```

```python
col_tfm = ft.build(ft.SPEC_DIR / "pipeline.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
```

ColumnTransformer col_tfm:


## Example: ColumnTransformer settings

Spec file: `feature_transform/example/spec/settings.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
# use all processors
n_jobs: -1
# for more kwargs see https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
```

```python
col_tfm = ft.build(ft.SPEC_DIR / "settings.yaml")

feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  0.20143885],
#        ...,
```

ColumnTransformer col_tfm:


## Example: full X, y feature transform with save/load

Spec file (x): `feature_transform/example/spec/wine/x.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols, flavanoids, nonflavanoid_phenols, od280/od315_of_diluted_wines]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash, alcalinity_of_ash, proanthocyanins, hue]
  - transformer:
      preprocessing.PowerTransformer:
    columns: [malic_acid, magnesium, color_intensity, proline]
n_jobs: -1
```

Spec file (y): `feature_transform/example/spec/wine/y.yaml`

```yaml
transformers:
  - transformer:
      preprocessing.OneHotEncoder:
        sparse_output: False
    columns: [target]
```

```python
import joblib
from sklearn import datasets

import feature_transform as ft

x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)
y_df = y_sr.to_frame()  # ColumnTransformer takes only a dataframe/matrix as input

x_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "x.yaml")
y_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "y.yaml")

# fit-transform
feat_xs = x_col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  1.03481896, ...,  1.69074868,
#          0.45145022,  1.06254129],
#        ...,

feat_ys = y_col_tfm.fit_transform(y_df)
feat_ys
# array([[1., 0., 0.],
#        ...,

# save for later use
joblib.dump(x_col_tfm, "x_col_tfm.joblib")
joblib.dump(y_col_tfm, "y_col_tfm.joblib")

# ... later, e.g. during batch inference
loaded_x_col_tfm = joblib.load("x_col_tfm.joblib")
feat_xs = loaded_x_col_tfm.transform(x_df)
feat_xs
# array([[ 1.51861254,  0.80899739,  1.03481896, ...,  1.69074868,
#          0.45145022,  1.06254129],
#        ...,
```

ColumnTransformer x_col_tfm:

ColumnTransformer y_col_tfm:


## Example: use helper to suggest spec

Most of the time, data preprocessing steps can be determined with rules of thumb; `ft.suggest` does exactly that (see `feature_transform/helper.py` for details). It produces a `spec_dict` that can be used directly with `ft.build`, or saved to YAML for further editing.

```python
x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)

# suggest spec_dict - use directly or save to YAML for further editing
spec_dict = ft.suggest(x_df)
col_tfm = ft.build(spec_dict)

# fit-transform
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 0.8973384 ,  0.20143885, -0.90697674, ...,  0.80804954,
#         -0.43546273,  1.69074868],
#         ...,
```

ColumnTransformer col_tfm:


## Example: more

See more example spec files under `feature_transform/example/spec/`.

## How does it work

Feature Transform simply builds a scikit-learn `ColumnTransformer` and its estimators/pipelines with a 1-1 mapping from a spec file:

1. The spec is defined via Pydantic in `feature_transform/validator/`. This defines:
   - spec: the Estimator, Pipeline, ColumnTransformer
2. If the spec specifies:
   1. `transformers=list[(name, transformer, columns)]`, then `ColumnTransformer` is used
   2. `transformers=list[(transformer, columns)]`, then `make_column_transformer` is used with auto-generated names

See more in the Pydantic spec definition under `feature_transform/validator/`.
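The two spec forms above correspond to plain scikit-learn constructors. A sketch using only sklearn, with the wine column names from the demo data (the names and columns here are illustrative, not part of the spec definition):

```python
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import RobustScaler, StandardScaler

# 1. named transformers -> ColumnTransformer takes (name, transformer, columns)
named = ColumnTransformer([
    ("std", StandardScaler(), ["alcohol", "total_phenols"]),
    ("robust", RobustScaler(), ["ash"]),
])

# 2. unnamed -> make_column_transformer auto-generates lowercased class names
unnamed = make_column_transformer(
    (StandardScaler(), ["alcohol", "total_phenols"]),
    (RobustScaler(), ["ash"]),
)
```

Both yield a `ColumnTransformer`; the only difference is whether the transformer names are explicit or derived from the class names.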

## Guiding principles

The design of Feature Transform is guided by the following:

1. Simple: the module spec is straightforward:
   - it is simply a sklearn class name with kwargs
   - it supports official sklearn estimators, `Pipeline`, and custom-defined modules registered via `ft.register_class`
2. Expressive: it can be used to build both simple and advanced ColumnTransformers easily.
3. Portable: it returns a `ColumnTransformer` that can be used anywhere; it is not a framework.
4. Parametrizable: spec-based feature transformation unlocks fast experimentation, e.g. by building logic for hyperparameter / data-feature search.
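To illustrate the parametrizable point, here is a hypothetical sketch that generates spec dicts in a loop to sweep scaler choices. It assumes an empty kwargs mapping is represented as `None`, mirroring the empty value under `preprocessing.StandardScaler:` in the YAML examples above:

```python
def make_spec(scaler: str) -> dict:
    """Build a spec dict for one candidate scaler (illustrative helper)."""
    return {
        "transformers": [
            # None mirrors the empty kwargs mapping in the YAML spec files
            {"transformer": {f"preprocessing.{scaler}": None}, "columns": ["alcohol"]},
        ],
        "n_jobs": -1,
    }


specs = [make_spec(s) for s in ["StandardScaler", "RobustScaler", "MinMaxScaler"]]
# for spec_dict in specs:
#     col_tfm = ft.build(spec_dict)  # then fit and evaluate each candidate
```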

## Development

### Setup

Install uv for dependency management if you haven't already, then run:

```shell
# set up the virtualenv
uv sync
```

### Unit tests

```shell
uv run pytest
```