Merge pull request #70 from artefactory/feat/nested
 ADD: Nested Logit Model
 ADD: Nested Logit Example Notebook
 ADD: Biogeme Nested pre-processing of SwissMetro Data
 ADD: HC dataset
VincentAuriau authored Apr 26, 2024
2 parents d384fb4 + 194357d commit 6e06c2b
Showing 20 changed files with 2,890 additions and 165 deletions.
30 changes: 17 additions & 13 deletions README.md
@@ -45,19 +45,19 @@ If you are new to choice modelling, you can check this [resource](https://www.pu
- [SwissMetro](./choice_learn/datasets/data/swissmetro.csv.gz) [[2]](#citation)
- [ModeCanada](./choice_learn/datasets/data/ModeCanada.csv.gz) [[3]](#citation)
- The [Train](./choice_learn/datasets/data/train_data.csv.gz) dataset [[5]](#citation)
- - The [Heating](./choice_learn/datasets/data/heating_data.csv.gz) & [Electricity](./choice_learn/datasets/data/electricity.csv.gz) datasets from Kenneth Train described [here](https://rdrr.io/cran/mlogit/man/Electricity.html) and [here](https://rdrr.io/cran/mlogit/man/Heating.html)
+ - The [Heating](./choice_learn/datasets/data/heating_data.csv.gz), [HC](./choice_learn/datasets/data/HC.csv.gz) & [Electricity](./choice_learn/datasets/data/electricity.csv.gz) datasets from Kenneth Train described [here](https://rdrr.io/cran/mlogit/man/Heating.html), [here](https://cran.r-project.org/web/packages/mlogit/vignettes/e2nlogit.html) and [here](https://rdrr.io/cran/mlogit/man/Electricity.html)
- [Stated car preferences](./choice_learn/datasets/data/car.csv.gz) [[9]](#citation)
- The [TaFeng](./choice_learn/datasets/data/ta_feng.csv.zip) dataset from [Kaggle](https://www.kaggle.com/datasets/chiranjivdas09/ta-feng-grocery-dataset)
- The ICDM-2013 [Expedia](./choice_learn/datasets/expedia.py) dataset from [Kaggle](https://www.kaggle.com/c/expedia-personalized-sort) [[6]](#citation)

### Model estimation
- Ready-to-use models:
- Conditional MultiNomialLogit [[4]](#citation)[[Example]](notebooks/introduction/3_model_clogit.ipynb)
+ - Nested Logit [[10]](#citation) [[Example]](notebooks/models/nested_logit.ipynb)
- Latent Class MultiNomialLogit [[Example]](notebooks/models/latent_class_model.ipynb)
- RUMnet [[1]](#citation)[[Example]](notebooks/models/rumnet.ipynb)
- TasteNet [[7]](#citation)[[Example]](notebooks/models/tastenet.ipynb)
- (WIP) - Ready-to-use models to be implemented:
- - Nested Logit
- [SHOPPER](https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/SHOPPER--A-probabilistic-model-of-consumer-choice-with-substitutes/10.1214/19-AOAS1265.full)
- Others ...
- Custom modelling is made easy by subclassing the ChoiceModel class [[Example]](notebooks/introduction/4_model_customization.ipynb)
@@ -69,11 +69,11 @@ If you are new to choice modelling, you can check this [resource](https://www.pu

## Getting Started

- You can find the following [notebooks](notebooks/introduction/) to help you getting started with the package:
- - [Generic and simple introduction](notebooks/introduction/1_introductive_example.ipynb)
- - [Detailed explanations of data handling depending on the data format](notebooks/introduction/2_data_handling.ipynb)
- - [A detailed example of conditional logit estimation](notebooks/introduction/3_model_clogit.ipynb)
- - [Introduction to custom modelling and more complex parametrization](notebooks/introduction/4_model_customization.ipynb)
+ You can find the following tutorials to help you get started with the package:
+ - Generic and simple introduction [[notebook]](notebooks/introduction/1_introductive_example.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/1_introductive_example/)
+ - Detailed explanations of data handling depending on the data format [[notebook]](notebooks/introduction/2_data_handling.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/2_data_handling/)
+ - A detailed example of conditional logit estimation [[notebook]](notebooks/introduction/3_model_clogit.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/3_model_clogit/)
+ - Introduction to custom modelling and more complex parametrization [[notebook]](notebooks/introduction/4_model_customization.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/4_model_customization/)

## Installation

@@ -184,11 +184,15 @@ The use of this software is under the MIT license, with no limitation of usage,
[6] [Personalize Expedia Hotel Searches - ICDM 2013](https://www.kaggle.com/c/expedia-personalized-sort), Ben Hamner, A.; Friedman, D.; SSA_Expedia. (2013)\
[7] [A Neural-embedded Discrete Choice Model: Learning Taste Representation with Strengthened Interpretability](https://arxiv.org/abs/2002.00922), Han, Y.; Calara Oereuran F.; Ben-Akiva, M.; Zegras, C. (2020)\
[8] [A branch-and-cut algorithm for the latent-class logit assortment problem](https://www.sciencedirect.com/science/article/pii/S0166218X12001072), Méndez-Díaz, I.; Miranda-Bront, J. J.; Vulcano, G.; Zabala, P. (2014)\
- [9] [Stated Preferences for Car Choice in Mixed MNL models for discrete response.](https://www.jstor.org/stable/2678603), McFadden, D. and Kenneth Train (2000)
+ [9] [Stated Preferences for Car Choice in Mixed MNL models for discrete response.](https://www.jstor.org/stable/2678603), McFadden, D.; Train, K. (2000)\
+ [10] [Modeling the Choice of Residential Location](https://onlinepubs.trb.org/Onlinepubs/trr/1978/673/673-012.pdf), McFadden, D. (1978)

### Code and Repositories
- - [1][RUMnet](https://github.com/antoinedesir/rumnet)
- - [PyLogit](https://github.com/timothyb0912/pylogit)
- - [Torch Choice](https://gsbdbi.github.io/torch-choice)
- - [BioGeme](https://github.com/michelbierlaire/biogeme)
- - [mlogit](https://github.com/cran/mlogit)

+ [1] [RUMnet](https://github.com/antoinedesir/rumnet)\
+ [7] TasteNet [[Repo1](https://github.com/YafeiHan-MIT/TasteNet-MNL)] [[Repo2](https://github.com/deborahmit/TasteNet-MNL)]

+ [PyLogit](https://github.com/timothyb0912/pylogit)\
+ [Torch Choice](https://gsbdbi.github.io/torch-choice)\
+ [BioGeme](https://github.com/michelbierlaire/biogeme)\
+ [mlogit](https://github.com/cran/mlogit)
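A minimal end-to-end sketch of the newly advertised model, combining the new `biogeme_nested` preprocessing (added below in `base.py`) with `NestedLogit`. The nest specification and constructor arguments shown here are assumptions; `notebooks/models/nested_logit.ipynb` from this PR is the authoritative reference for the exact API.

```python
# Minimal sketch under assumed constructor arguments; see
# notebooks/models/nested_logit.ipynb for the exact specification.
from choice_learn.datasets import load_swissmetro
from choice_learn.models import NestedLogit

# SwissMetro preprocessed as in Biogeme's nested-logit example:
# item features are [cost, travel_time], items ordered TRAIN, SM, CAR.
dataset = load_swissmetro(preprocessing="biogeme_nested")

# Classic nesting: existing modes (TRAIN=0, CAR=2) vs. the new SwissMetro (SM=1).
model = NestedLogit(items_nests=[[0, 2], [1]], optimizer="lbfgs")
model.fit(dataset)
print(model.evaluate(dataset))  # average negative log-likelihood on the dataset
```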
5 changes: 4 additions & 1 deletion choice_learn/data/choice_dataset.py
@@ -811,9 +811,11 @@ def from_single_wide_df(
        logging.warning("choice_format not understood, defaulting to 'items_index'")

        if shared_features_columns is not None:
-             shared_features_by_choice = df[shared_features_columns]
+             shared_features_by_choice = df[shared_features_columns].to_numpy()
+             shared_features_by_choice_names = shared_features_columns
        else:
            shared_features_by_choice = None
+             shared_features_by_choice_names = None

if items_features_suffixes is not None:
items_features_names = items_features_suffixes
@@ -887,6 +889,7 @@ def from_single_wide_df(

        return ChoiceDataset(
            shared_features_by_choice=shared_features_by_choice,
+             shared_features_by_choice_names=shared_features_by_choice_names,
            items_features_by_choice=items_features_by_choice,
            items_features_by_choice_names=items_features_names,
            available_items_by_choice=available_items_by_choice,
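The change above makes `from_single_wide_df` hand both the values and the column names of shared features to the `ChoiceDataset` constructor, instead of passing a raw DataFrame and dropping the names. A small self-contained sketch of the repaired behaviour; the DataFrame, column and item names are made up for illustration:

```python
import pandas as pd

from choice_learn.data import ChoiceDataset

# Illustrative wide-format data: one row per choice situation.
df = pd.DataFrame(
    {
        "income": [40.0, 65.0],   # shared feature, identical across items
        "price.A": [10.0, 12.0],  # item-specific feature for item "A"
        "price.B": [11.0, 9.0],   # item-specific feature for item "B"
        "choice": ["A", "B"],
    }
)

dataset = ChoiceDataset.from_single_wide_df(
    df=df,
    shared_features_columns=["income"],  # now stored as a NumPy array,
    items_features_prefixes=["price"],   # and "income" survives as its name
    delimiter=".",
    items_id=["A", "B"],
    choices_column="choice",
    choice_format="items_id",
)
```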
2 changes: 2 additions & 0 deletions choice_learn/datasets/__init__.py
@@ -3,6 +3,7 @@
from .base import (
    load_car_preferences,
    load_electricity,
+     load_hc,
    load_heating,
    load_modecanada,
    load_swissmetro,
@@ -20,4 +21,5 @@
"load_tafeng",
"load_expedia",
"load_car_preferences",
"load_hc",
]
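With the two lines above, the new loader is reachable from the public datasets namespace:

```python
from choice_learn.datasets import load_hc

print(load_hc(return_desc=True))  # prints the dataset description string
```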
146 changes: 112 additions & 34 deletions choice_learn/datasets/base.py
@@ -152,8 +152,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
    full_path = get_path(data_file_name, module=DATA_MODULE)
    swiss_df = pd.read_csv(full_path)
    swiss_df["CAR_HE"] = 0.0
-     # names, data = load_gzip(data_file_name)
-     # data = data.astype(int)

items = ["TRAIN", "SM", "CAR"]
shared_features_by_choice_names = [
@@ -182,29 +180,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
swiss_df[f"{item}_oh_{item}"] = 1
else:
swiss_df[f"{item2}_oh_{item}"] = 0
"""
# Adding dummy CAR_HE feature as 0 for consistency
names.append("CAR_HE")
data = np.hstack([data, np.zeros((data.shape[0], 1))])
session_features = slice_from_names(data, session_features_names, names)
sessions_items_features = np.stack(
[slice_from_names(data, features, names) for features in sessions_items_features_names],
axis=-1,
)
sessions_items_availabilities = slice_from_names(data, sessions_items_availabilities, names)
choices = data[:, names.index(choice_column)]
# Remove no choice
choice_done = np.where(choices > 0)[0]
session_features = session_features[choice_done]
sessions_items_features = sessions_items_features[choice_done]
sessions_items_availabilities = sessions_items_availabilities[choice_done]
choices = choices[choice_done]
# choices renormalization
choices = choices - 1
"""

if return_desc:
return description
@@ -330,7 +305,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
if preprocessing == "tutorial":
# swiss_df = pd.DataFrame(data, columns=names)
# Removing unknown choices
# swiss_df = swiss_df.loc[swiss_df.CHOICE != 0]
# Keep only commute an dbusiness trips
swiss_df = swiss_df.loc[swiss_df.PURPOSE.isin([1, 3])]

@@ -394,12 +368,52 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
items_features_by_choice_names=["cost", "travel_time", "headway", "seats"],
choices=choices,
)
if preprocessing == "biogeme_nested":
# Keep only commute an dbusiness trips
swiss_df = swiss_df.loc[swiss_df.PURPOSE.isin([1, 3])]

# Normalizing values by 100
swiss_df[["TRAIN_TT", "SM_TT", "CAR_TT"]] = (
swiss_df[["TRAIN_TT", "SM_TT", "CAR_TT"]] / 100.0
)

swiss_df["train_free_ticket"] = swiss_df.apply(
lambda row: (row["GA"] == 1).astype(int), axis=1
)
swiss_df["sm_free_ticket"] = swiss_df.apply(
lambda row: (row["GA"] == 1).astype(int), axis=1
)

swiss_df["train_travel_cost"] = swiss_df.apply(
lambda row: (row["TRAIN_CO"] * (1 - row["train_free_ticket"])) / 100, axis=1
)
swiss_df["sm_travel_cost"] = swiss_df.apply(
lambda row: (row["SM_CO"] * (1 - row["sm_free_ticket"])) / 100, axis=1
)
swiss_df["car_travel_cost"] = swiss_df.apply(lambda row: row["CAR_CO"] / 100, axis=1)

train_features = swiss_df[["train_travel_cost", "TRAIN_TT"]].to_numpy()
sm_features = swiss_df[["sm_travel_cost", "SM_TT"]].to_numpy()
car_features = swiss_df[["car_travel_cost", "CAR_TT"]].to_numpy()

items_features_by_choice = np.stack([train_features, sm_features, car_features], axis=1)

available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
+         # Re-Indexing choices from 1 to 3 to 0 to 2
+         choices = swiss_df.CHOICE.to_numpy() - 1
+
+         return ChoiceDataset(
+             shared_features_by_choice=None,
+             items_features_by_choice=items_features_by_choice,
+             available_items_by_choice=available_items_by_choice,
+             shared_features_by_choice_names=None,
+             items_features_by_choice_names=["cost", "travel_time"],
+             choices=choices,
+         )
if preprocessing == "rumnet":
-         # swiss_df = pd.DataFrame(data, columns=names)
-         # swiss_df = swiss_df.loc[swiss_df.CHOICE != 0]
        swiss_df["One"] = 1.0
        swiss_df["Zero"] = 0.0
-         # choices = swiss_df.CHOICE.to_numpy() - 1

available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
items_features_by_choice = np.stack(
[
@@ -409,8 +423,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
],
axis=1,
)
-         # shared_features_by_choice = df[["GROUP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE",
-         #     "AGE", "MALE", "INCOME", "GA", "ORIGIN", "DEST"]].to_numpy()

items_features_by_choice[:, :, 0] = items_features_by_choice[:, :, 0] / 1000
items_features_by_choice[:, :, 1] = items_features_by_choice[:, :, 1] / 5000
@@ -776,7 +788,6 @@ def load_electricity(
"""
_ = to_wide
data_file_name = "electricity.csv.gz"
-     # names, data = load_gzip(data_file_name)

description = """A sample of 2308 households in the United States.
- choice: the choice of the individual, one of 1, 2, 3, 4,
@@ -851,7 +862,6 @@ def load_train(
”Papers 9303, Laval-Recherche en Energie. https://ideas.repec.org/p/fth/lavaen/9303.html."""
_ = to_wide
data_file_name = "train_data.csv.gz"
-     # names, data = load_gzip(data_file_name)

full_path = get_path(data_file_name, module=DATA_MODULE)
train_df = pd.read_csv(full_path)
@@ -901,7 +911,6 @@ def load_car_preferences(
“Mixed MNL models for discrete response”, Journal of Applied Econometrics, 15(5), 447–470."""

data_file_name = "car.csv.gz"
-     # names, data = load_gzip(data_file_name)

full_path = get_path(data_file_name, module=DATA_MODULE)
cars_df = pd.read_csv(full_path)
@@ -938,3 +947,72 @@ def load_car_preferences(
choices_column="choice",
choice_format="items_id",
)


+ def load_hc(
+     as_frame=False,
+     return_desc=False,
+ ):
+     """Load and return the HC dataset from Kenneth Train.
+
+     Parameters
+     ----------
+     as_frame : bool, optional
+         Whether to return the dataset as pd.DataFrame. If not, returned as ChoiceDataset,
+         by default False.
+     return_desc : bool, optional
+         Whether to return the description, by default False.
+
+     Returns
+     -------
+     ChoiceDataset
+         Loaded HC dataset
+     """
+     desc = """HC contains data on the choice of heating and central cooling system for 250
+     single-family, newly built houses in California.
+
+     The alternatives are:
+         - Gas central heat with cooling (gcc),
+         - Electric central resistance heat with cooling (ecc),
+         - Electric room resistance heat with cooling (erc),
+         - Electric heat pump, which also provides cooling (hpc),
+         - Gas central heat without cooling (gc),
+         - Electric central resistance heat without cooling (ec),
+         - Electric room resistance heat without cooling (er).
+     Heat pumps necessarily provide both heating and cooling, such that a heat pump without
+     cooling is not an alternative.
+
+     The variables are:
+         - depvar: the name of the chosen alternative,
+         - ich.alt: the installation cost for the heating portion of the system,
+         - icca: the installation cost for cooling,
+         - och.alt: the operating cost for the heating portion of the system,
+         - occa: the operating cost for cooling,
+         - income: the annual income of the household.
+
+     Note that the full installation cost of alternative gcc is ich.gcc + icca, and similarly
+     for the operating cost and for the other alternatives with cooling.
+     """
+
+     data_file_name = "HC.csv.gz"
+
+     full_path = get_path(data_file_name, module=DATA_MODULE)
+     hc_df = pd.read_csv(full_path)
+
+     if return_desc:
+         return desc
+
+     if as_frame:
+         return hc_df
+
+     items_id = ["gcc", "ecc", "erc", "hpc", "gc", "ec", "er"]
+     return ChoiceDataset.from_single_wide_df(
+         df=hc_df,
+         shared_features_columns=["income"],
+         items_features_prefixes=["ich", "och", "occa", "icca"],
+         delimiter=".",
+         items_id=items_id,
+         choices_column="depvar",
+         choice_format="items_id",
+     )
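HC is the textbook nested-logit dataset: the seven systems split naturally into alternatives with and without cooling, so a natural follow-up is to feed the loader's output to the new model. The nest indices below follow the `items_id` order in the code above; the `NestedLogit` arguments are assumptions, with `notebooks/models/nested_logit.ipynb` as the reference.

```python
from choice_learn.datasets import load_hc
from choice_learn.models import NestedLogit

dataset = load_hc()  # ChoiceDataset with "income" as shared feature

# items_id order: ["gcc", "ecc", "erc", "hpc", "gc", "ec", "er"]
# -> indices 0-3 include cooling, indices 4-6 do not.
model = NestedLogit(items_nests=[[0, 1, 2, 3], [4, 5, 6]], optimizer="lbfgs")
model.fit(dataset)
```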
Binary file added choice_learn/datasets/data/HC.csv.gz
Binary file not shown.
9 changes: 6 additions & 3 deletions choice_learn/models/__init__.py
@@ -1,16 +1,19 @@
"""Models classes and functions."""
import logging

import tensorflow as tf

from .conditional_logit import ConditionalLogit
from .nested_logit import NestedLogit
from .simple_mnl import SimpleMNL
from .tastenet import TasteNet

if len(tf.config.list_physical_devices("GPU")) > 0:
print("GPU detected, importing GPU version of RUMnet.")
logging.info("GPU detected, importing GPU version of RUMnet.")
from .rumnet import GPURUMnet as RUMnet
else:
from .rumnet import CPURUMnet as RUMnet

print("No GPU detected, importing CPU version of RUMnet.")
logging.info("No GPU detected, importing CPU version of RUMnet.")

__all__ = ["ConditionalLogit", "RUMnet", "SimpleMNL", "TasteNet"]
__all__ = ["ConditionalLogit", "RUMnet", "SimpleMNL", "TasteNet", "NestedLogit"]
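Since `logging.info` is silent under Python's default WARNING-level root logger, users who relied on the old `print` output will no longer see the device message unless they configure logging first, e.g.:

```python
import logging

logging.basicConfig(level=logging.INFO)  # surface INFO-level messages

from choice_learn.models import RUMnet  # now logs which RUMnet variant was imported
```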
22 changes: 17 additions & 5 deletions choice_learn/models/base_model.py
@@ -69,6 +69,15 @@ def __init__(
        self.loss = tf_ops.CustomCategoricalCrossEntropy(
            from_logits=False, label_smoothing=self.label_smoothing
        )
+         self.exact_nll = tf_ops.CustomCategoricalCrossEntropy(
+             from_logits=False,
+             label_smoothing=0.0,
+             sparse=False,
+             axis=-1,
+             epsilon=1e-35,
+             name="exact_categorical_crossentropy",
+             reduction=tf.keras.losses.Reduction.AUTO,
+         )
self.callbacks = tf.keras.callbacks.CallbackList(callbacks, add_history=True, model=None)
self.callbacks.set_model(self)

@@ -462,7 +471,12 @@ def batch_predict(
y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
sample_weight=sample_weight,
),
"NegativeLogLikelihood": tf.keras.losses.CategoricalCrossentropy()(
# "NegativeLogLikelihood": tf.keras.losses.CategoricalCrossentropy()(
# y_pred=probabilities,
# y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
# sample_weight=sample_weight,
# ),
"Exact-NegativeLogLikelihood": self.exact_nll(
y_pred=probabilities,
y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
sample_weight=sample_weight,
@@ -584,7 +598,7 @@ def evaluate(self, choice_dataset, sample_weight=None, batch_size=-1, mode="eval
sample_weight=sample_weight,
)
if mode == "eval":
batch_losses.append(loss["NegativeLogLikelihood"])
batch_losses.append(loss["Exact-NegativeLogLikelihood"])
elif mode == "optim":
batch_losses.append(loss["optimized_loss"])
if batch_size != -1:
@@ -670,7 +684,7 @@ def f(params_1d):
assign_new_model_parameters(params_1d)
# calculate the loss
loss_value = self.evaluate(
dataset, sample_weight=sample_weight, batch_size=-1, mode="optim"
dataset, sample_weight=sample_weight, batch_size=-1, mode="eval"
)
if self.regularization is not None:
regularization = tf.reduce_sum(
@@ -681,7 +695,6 @@ def f(params_1d):
# calculate gradients and convert to 1D tf.Tensor
grads = tape.gradient(loss_value, self.trainable_weights)
grads = tf.dynamic_stitch(idx, grads)

# print out iteration & loss
f.iter.assign_add(1)

@@ -729,7 +742,6 @@ def _fit_with_lbfgs(self, dataset, sample_weight=None, verbose=0):

# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, self.trainable_weights)

# train the model with L-BFGS solver
results = tfp.optimizer.lbfgs_minimize(
value_and_gradients_function=func,
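The motivation for the dedicated `exact_nll` objective: `tf.keras.losses.CategoricalCrossentropy` clips predicted probabilities at Keras' backend epsilon (about 1e-7), so reported negative log-likelihoods saturate for near-zero probabilities, while `epsilon=1e-35` makes the clipping negligible. A standalone TensorFlow illustration of the difference (not the library's own `tf_ops` class):

```python
import tensorflow as tf

y_true = tf.constant([[0.0, 1.0]])
y_pred = tf.constant([[1.0 - 1e-12, 1e-12]])  # chosen item has near-zero probability

keras_nll = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)
exact_nll = -tf.reduce_sum(
    y_true * tf.math.log(tf.clip_by_value(y_pred, 1e-35, 1.0)), axis=-1
)

print(float(keras_nll))     # ~16.1: saturated at -log(1e-7)
print(float(exact_nll[0]))  # ~27.6: the true -log(1e-12)
```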