This is an open-source, Python-based pipeline for Kaggle tabular data competitions. Although it is customized for the Kaggle Tabular Playground Series (TPS) August 2022 competition, it can serve as a pipeline for any tabular data competition with limited code changes. The project includes APIs for most tasks related to ML competitions:
- data processing
- visualization
- feature engineering
- training
- ensembling
- feature selection
- hyperparameter optimization
- experiment tracking
- submission of predictions to Kaggle
The project is organized into the following directories:
- data
  - features - location for parquet files containing engineered features
  - processed - location for parquet files containing raw data after initial processing
  - raw - location for parquet files containing raw data (train, test, sample submission)
- fi - location for feature importances stored as CSV files
- fi_fig - location for plots capturing feature importances
- hpo - location for hyperparameter optimization artifacts
- logs - location for logs generated by the Python modules
- notebooks - location for Jupyter notebooks
- oof - location for out-of-fold predictions
- src
  - common - package containing common utility functions
  - config - package containing configuration-related modules
  - cv - package containing cross-validation functions
  - fe - package containing feature engineering functions
  - fs - package containing feature selection functions
  - hpo - package containing hyperparameter optimization functions
  - modeling - package containing training and prediction functions
  - munging - package containing data processing and exploration functions
  - pre_process - package containing data pre-processing functions
  - scripts - location for feature engineering and training scripts
  - ts - package containing time series functions
  - viz - package containing data visualization functions
- submissions - location for predictions and submission scripts
- tracking - location for the CSV file used to track experiments
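Git does not track empty directories, so some of the output folders listed above may be missing from a fresh clone. The sketch below recreates them if needed. It assumes that `features`, `processed` and `raw` sit under `data`; the function name `ensure_output_dirs` is hypothetical and not part of the project.

```python
from pathlib import Path

# Data and output folders from the list above, relative to the project root.
# Assumption: features, processed and raw are sub-folders of data.
OUTPUT_DIRS = [
    "data/raw",
    "data/processed",
    "data/features",
    "fi",
    "fi_fig",
    "hpo",
    "logs",
    "notebooks",
    "oof",
    "submissions",
    "tracking",
]


def ensure_output_dirs(project_home):
    """Create any missing output folder under project_home and return all paths."""
    root = Path(project_home)
    paths = []
    for relative in OUTPUT_DIRS:
        path = root / relative
        # exist_ok=True makes the call safe to repeat on an existing checkout.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `ensure_output_dirs` with the absolute path of the cloned project is enough; existing folders are left untouched.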
- I borrowed the initial project structure and framework code from the open-source code of Kaggle Grandmaster Rob Mulla
- Many of the utility functions are from "Approaching (Almost) Any Machine Learning Problem" by Abhishek Thakur
- I used some feature selection code from SRK's GitHub repository
Clone the source code from GitHub into the <PROJECT_HOME> directory:
> git clone
This will create the following directory structure:
> <PROJECT_HOME>/kaggle_pipeline_tps_aug_22
Create the conda environment:
> conda env create --file environment.yml
Go to
and activate the conda environment:
> conda activate py_k
Go to the raw data directory at
. Download the dataset from Kaggle (the Kaggle API should be configured first by following the link):
> kaggle competitions download -c tabular-playground-series-aug-2022
Unzip the data:
> unzip
Set the value of variable
to the absolute path of <PROJECT_HOME>/kaggle_pipeline_tps_aug_22
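As an illustration of why one absolute project path is useful, the sketch below derives the data and output locations from it. This is an assumption about how such a setting could be used, not the project's actual configuration code; `build_paths` and its dictionary keys are hypothetical names.

```python
from pathlib import Path


def build_paths(home_dir):
    """Derive the pipeline's data and output locations from one project root."""
    home = Path(home_dir)
    if not home.is_absolute():
        # Relative paths would depend on the directory a script is started from.
        raise ValueError(f"expected an absolute path, got: {home}")
    return {
        "raw": home / "data" / "raw",
        "processed": home / "data" / "processed",
        "features": home / "data" / "features",
        "oof": home / "oof",
        "fi": home / "fi",
        "submissions": home / "submissions",
        "tracking": home / "tracking" / "tracking.csv",
    }
```

Rejecting relative paths early keeps the scripts independent of the current working directory.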
To process raw data into parquet format, go to
. Execute the following:
> python -m src.scripts.data_processing.process_raw_data
This will create 3 parquet files under
representing the train, test and sample_submission CSVs.
To trigger feature engineering, go to
. Execute the following:
> python -m src.scripts.data_processing.create_features
This will create a parquet file containing all the engineered features under
To train the baseline model with LGBM,
. Execute the following:
> python -m
This will create the submission file under
. Out-of-fold predictions are saved under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/oof
and CSVs capturing feature importances under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/fi
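For readers new to the term, out-of-fold (OOF) predictions are predictions for each training row made by a model that did not see that row during fitting. The sketch below shows the idea with the standard library only. It is not the project's training code: `mean_baseline` is a stand-in for LGBM, and all function names are hypothetical.

```python
import random
from statistics import mean


def kfold_indices(n_rows, n_splits, seed=42):
    """Shuffle the row indices and split them into n_splits folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def mean_baseline(X_train, y_train, X_valid):
    """Stand-in model: predict the training mean for every validation row."""
    return [mean(y_train)] * len(X_valid)


def out_of_fold_predictions(X, y, fit_predict, n_splits=5, seed=42):
    """Return one prediction per row, each from a model fitted without that row."""
    oof = [None] * len(y)
    for valid_idx in kfold_indices(len(y), n_splits, seed):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in valid_idx],
        )
        # Write each prediction back to the position of its original row.
        for i, prediction in zip(valid_idx, predictions):
            oof[i] = prediction
    return oof
```

Because every row is predicted exactly once by a model that never saw it, scoring the OOF vector against the targets gives a cross-validated estimate, and the vector can later feed an ensemble.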
The result of the experiment will be tracked in <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/tracking/tracking.csv
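CSV-based experiment tracking can be as simple as appending one row per run. The sketch below illustrates the approach; the column names in `FIELDS` and the function `log_experiment` are hypothetical and may not match the columns the project actually writes to tracking.csv.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical columns; the real tracking file may use different ones.
FIELDS = ["timestamp", "experiment", "model", "cv_score", "notes"]


def log_experiment(tracking_csv, experiment, model, cv_score, notes=""):
    """Append one experiment row, writing the header if the file is new or empty."""
    path = Path(tracking_csv)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "experiment": experiment,
                "model": model,
                "cv_score": cv_score,
                "notes": notes,
            }
        )
```

Appending rather than rewriting keeps earlier runs intact, so the file doubles as a history of what was tried.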
To submit the submission file to Kaggle, go to
:
> python -m
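A submission script of this kind typically wraps the Kaggle CLI. The sketch below assumes the `kaggle` command is installed and configured, and uses its `competitions submit` subcommand with the `-c`, `-f` and `-m` options. The helper names are hypothetical and this is not the project's own submission module.

```python
import subprocess

COMPETITION = "tabular-playground-series-aug-2022"


def build_submit_command(submission_file, message, competition=COMPETITION):
    """Build the Kaggle CLI argument list for submitting one file."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,
        "-f", str(submission_file),
        "-m", message,
    ]


def submit(submission_file, message, competition=COMPETITION):
    """Run the Kaggle CLI; raises CalledProcessError if the CLI reports failure."""
    subprocess.run(
        build_submit_command(submission_file, message, competition),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems when the submission message contains spaces.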
The following is needed to visualize Optuna plots with Plotly (i.e. the plotly dependency):
> jupyter labextension install jupyterlab-plotly@4.14.3