Merge pull request #70 from artefactory/feat/nested
 ADD: Nested Logit Model
 ADD: Nested Logit Example Notebook
 ADD: Biogeme Nested pre-processing of SwissMetro Data
 ADD: HC dataset
VincentAuriau authored Apr 26, 2024
2 parents d384fb4 + 194357d commit 6e06c2b
Showing 20 changed files with 2,890 additions and 165 deletions.
30 changes: 17 additions & 13 deletions README.md
@@ -45,19 +45,19 @@ If you are new to choice modelling, you can check this [resource](https://www.pu
- [SwissMetro](./choice_learn/datasets/data/swissmetro.csv.gz) [[2]](#citation)
- [ModeCanada](./choice_learn/datasets/data/ModeCanada.csv.gz) [[3]](#citation)
- The [Train](./choice_learn/datasets/data/train_data.csv.gz) dataset [[5]](#citation)
- - The [Heating](./choice_learn/datasets/data/heating_data.csv.gz) & [Electricity](./choice_learn/datasets/data/electricity.csv.gz) datasets from Kenneth Train described [here](https://rdrr.io/cran/mlogit/man/Electricity.html) and [here](https://rdrr.io/cran/mlogit/man/Heating.html)
+ - The [Heating](./choice_learn/datasets/data/heating_data.csv.gz), [HC](./choice_learn/datasets/data/HC.csv.gz) & [Electricity](./choice_learn/datasets/data/electricity.csv.gz) datasets from Kenneth Train described [here](https://rdrr.io/cran/mlogit/man/Heating.html), [here](https://cran.r-project.org/web/packages/mlogit/vignettes/e2nlogit.html) and [here](https://rdrr.io/cran/mlogit/man/Electricity.html)
- [Stated car preferences](./choice_learn/datasets/data/car.csv.gz) [[9]](#citation)
- The [TaFeng](./choice_learn/datasets/data/ta_feng.csv.zip) dataset from [Kaggle](https://www.kaggle.com/datasets/chiranjivdas09/ta-feng-grocery-dataset)
- The ICDM-2013 [Expedia](./choice_learn/datasets/expedia.py) dataset from [Kaggle](https://www.kaggle.com/c/expedia-personalized-sort) [[6]](#citation)

### Model estimation
- Ready-to-use models:
- Conditional MultiNomialLogit [[4]](#citation)[[Example]](notebooks/introduction/3_model_clogit.ipynb)
+ - Nested Logit [[10]](#citation) [[Example]](notebooks/models/nested_logit.ipynb)
- Latent Class MultiNomialLogit [[Example]](notebooks/models/latent_class_model.ipynb)
- RUMnet [[1]](#citation)[[Example]](notebooks/models/rumnet.ipynb)
- TasteNet [[7]](#citation)[[Example]](notebooks/models/tastenet.ipynb)
- (WIP) - Ready-to-use models to be implemented:
- - Nested Logit
- [SHOPPER](https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/SHOPPER--A-probabilistic-model-of-consumer-choice-with-substitutes/10.1214/19-AOAS1265.full)
- Others ...
- Custom modelling is made easy by subclassing the ChoiceModel class [[Example]](notebooks/introduction/4_model_customization.ipynb)
@@ -69,11 +69,11 @@ If you are new to choice modelling, you can check this [resource](https://www.pu

## Getting Started

- You can find the following [notebooks](notebooks/introduction/) to help you getting started with the package:
- - [Generic and simple introduction](notebooks/introduction/1_introductive_example.ipynb)
- - [Detailed explanations of data handling depending on the data format](notebooks/introduction/2_data_handling.ipynb)
- - [A detailed example of conditional logit estimation](notebooks/introduction/3_model_clogit.ipynb)
- - [Introduction to custom modelling and more complex parametrization](notebooks/introduction/4_model_customization.ipynb)
+ You can find the following tutorials to help you get started with the package:
+ - Generic and simple introduction [[notebook]](notebooks/introduction/1_introductive_example.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/1_introductive_example/)
+ - Detailed explanations of data handling depending on the data format [[notebook]](notebooks/introduction/2_data_handling.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/2_data_handling/)
+ - A detailed example of conditional logit estimation [[notebook]](notebooks/introduction/3_model_clogit.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/3_model_clogit/)
+ - Introduction to custom modelling and more complex parametrization [[notebook]](notebooks/introduction/4_model_customization.ipynb)[[doc]](https://expert-dollop-1wemk8l.pages.github.io/notebooks/introduction/4_model_customization/)

## Installation

@@ -184,11 +184,15 @@ The use of this software is under the MIT license, with no limitation of usage,
[6] [Personalize Expedia Hotel Searches - ICDM 2013](https://www.kaggle.com/c/expedia-personalized-sort), Ben Hamner, A.; Friedman, D.; SSA_Expedia. (2013)\
[7] [A Neural-embedded Discrete Choice Model: Learning Taste Representation with Strengthened Interpretability](https://arxiv.org/abs/2002.00922), Han, Y.; Calara Oereuran F.; Ben-Akiva, M.; Zegras, C. (2020)\
[8] [A branch-and-cut algorithm for the latent-class logit assortment problem](https://www.sciencedirect.com/science/article/pii/S0166218X12001072), Méndez-Díaz, I.; Miranda-Bront, J. J.; Vulcano, G.; Zabala, P. (2014)\
- [9] [Stated Preferences for Car Choice in Mixed MNL models for discrete response.](https://www.jstor.org/stable/2678603), McFadden, D. and Kenneth Train (2000)
+ [9] [Stated Preferences for Car Choice in Mixed MNL models for discrete response.](https://www.jstor.org/stable/2678603), McFadden, D.; Train, K. (2000)\
+ [10] [Modeling the Choice of Residential Location](https://onlinepubs.trb.org/Onlinepubs/trr/1978/673/673-012.pdf), McFadden, D. (1978)

### Code and Repositories
- - [1][RUMnet](https://github.com/antoinedesir/rumnet)
- - [PyLogit](https://github.com/timothyb0912/pylogit)
- - [Torch Choice](https://gsbdbi.github.io/torch-choice)
- - [BioGeme](https://github.com/michelbierlaire/biogeme)
- - [mlogit](https://github.com/cran/mlogit)

+ [1] [RUMnet](https://github.com/antoinedesir/rumnet)\
+ [7] TasteNet [[Repo1](https://github.com/YafeiHan-MIT/TasteNet-MNL)] [[Repo2](https://github.com/deborahmit/TasteNet-MNL)]

+ [PyLogit](https://github.com/timothyb0912/pylogit)\
+ [Torch Choice](https://gsbdbi.github.io/torch-choice)\
+ [BioGeme](https://github.com/michelbierlaire/biogeme)\
+ [mlogit](https://github.com/cran/mlogit)
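A minimal end-to-end sketch of the newly advertised model, combining the new `biogeme_nested` preprocessing (added below in `base.py`) with `NestedLogit`. The nest specification and constructor arguments shown here are assumptions; `notebooks/models/nested_logit.ipynb` from this PR is the authoritative reference for the exact API.

```python
# Minimal sketch under assumed constructor arguments; see
# notebooks/models/nested_logit.ipynb for the exact specification.
from choice_learn.datasets import load_swissmetro
from choice_learn.models import NestedLogit

# SwissMetro preprocessed as in Biogeme's nested-logit example:
# item features are [cost, travel_time], items ordered TRAIN, SM, CAR.
dataset = load_swissmetro(preprocessing="biogeme_nested")

# Classic nesting: existing modes (TRAIN=0, CAR=2) vs. the new SwissMetro (SM=1).
model = NestedLogit(items_nests=[[0, 2], [1]], optimizer="lbfgs")
model.fit(dataset)
print(model.evaluate(dataset))  # average negative log-likelihood on the dataset
```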
5 changes: 4 additions & 1 deletion choice_learn/data/choice_dataset.py
@@ -811,9 +811,11 @@ def from_single_wide_df(
        logging.warning("choice_format not understood, defaulting to 'items_index'")

        if shared_features_columns is not None:
-             shared_features_by_choice = df[shared_features_columns]
+             shared_features_by_choice = df[shared_features_columns].to_numpy()
+             shared_features_by_choice_names = shared_features_columns
        else:
            shared_features_by_choice = None
+             shared_features_by_choice_names = None

if items_features_suffixes is not None:
items_features_names = items_features_suffixes
@@ -887,6 +889,7 @@ def from_single_wide_df(

        return ChoiceDataset(
            shared_features_by_choice=shared_features_by_choice,
+             shared_features_by_choice_names=shared_features_by_choice_names,
            items_features_by_choice=items_features_by_choice,
            items_features_by_choice_names=items_features_names,
            available_items_by_choice=available_items_by_choice,
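The change above makes `from_single_wide_df` hand both the values and the column names of shared features to the `ChoiceDataset` constructor, instead of passing a raw DataFrame and dropping the names. A small self-contained sketch of the repaired behaviour; the DataFrame, column and item names are made up for illustration:

```python
import pandas as pd

from choice_learn.data import ChoiceDataset

# Illustrative wide-format data: one row per choice situation.
df = pd.DataFrame(
    {
        "income": [40.0, 65.0],   # shared feature, identical across items
        "price.A": [10.0, 12.0],  # item-specific feature for item "A"
        "price.B": [11.0, 9.0],   # item-specific feature for item "B"
        "choice": ["A", "B"],
    }
)

dataset = ChoiceDataset.from_single_wide_df(
    df=df,
    shared_features_columns=["income"],  # now stored as a NumPy array,
    items_features_prefixes=["price"],   # and "income" survives as its name
    delimiter=".",
    items_id=["A", "B"],
    choices_column="choice",
    choice_format="items_id",
)
```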
2 changes: 2 additions & 0 deletions choice_learn/datasets/__init__.py
@@ -3,6 +3,7 @@
from .base import (
    load_car_preferences,
    load_electricity,
+     load_hc,
    load_heating,
    load_modecanada,
    load_swissmetro,
@@ -20,4 +21,5 @@
"load_tafeng",
"load_expedia",
"load_car_preferences",
"load_hc",
]
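With the two lines above, the new loader is reachable from the public datasets namespace:

```python
from choice_learn.datasets import load_hc

print(load_hc(return_desc=True))  # prints the dataset description string
```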
146 changes: 112 additions & 34 deletions choice_learn/datasets/base.py
@@ -152,8 +152,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
    full_path = get_path(data_file_name, module=DATA_MODULE)
    swiss_df = pd.read_csv(full_path)
    swiss_df["CAR_HE"] = 0.0
-     # names, data = load_gzip(data_file_name)
-     # data = data.astype(int)

items = ["TRAIN", "SM", "CAR"]
shared_features_by_choice_names = [
@@ -182,29 +180,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
swiss_df[f"{item}_oh_{item}"] = 1
else:
swiss_df[f"{item2}_oh_{item}"] = 0
"""
# Adding dummy CAR_HE feature as 0 for consistency
names.append("CAR_HE")
data = np.hstack([data, np.zeros((data.shape[0], 1))])
session_features = slice_from_names(data, session_features_names, names)
sessions_items_features = np.stack(
[slice_from_names(data, features, names) for features in sessions_items_features_names],
axis=-1,
)
sessions_items_availabilities = slice_from_names(data, sessions_items_availabilities, names)
choices = data[:, names.index(choice_column)]
# Remove no choice
choice_done = np.where(choices > 0)[0]
session_features = session_features[choice_done]
sessions_items_features = sessions_items_features[choice_done]
sessions_items_availabilities = sessions_items_availabilities[choice_done]
choices = choices[choice_done]
# choices renormalization
choices = choices - 1
"""

if return_desc:
return description
@@ -330,7 +305,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
if preprocessing == "tutorial":
# swiss_df = pd.DataFrame(data, columns=names)
# Removing unknown choices
# swiss_df = swiss_df.loc[swiss_df.CHOICE != 0]
# Keep only commute an dbusiness trips
swiss_df = swiss_df.loc[swiss_df.PURPOSE.isin([1, 3])]

@@ -394,12 +368,52 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
items_features_by_choice_names=["cost", "travel_time", "headway", "seats"],
choices=choices,
)
if preprocessing == "biogeme_nested":
# Keep only commute an dbusiness trips
swiss_df = swiss_df.loc[swiss_df.PURPOSE.isin([1, 3])]

# Normalizing values by 100
swiss_df[["TRAIN_TT", "SM_TT", "CAR_TT"]] = (
swiss_df[["TRAIN_TT", "SM_TT", "CAR_TT"]] / 100.0
)

swiss_df["train_free_ticket"] = swiss_df.apply(
lambda row: (row["GA"] == 1).astype(int), axis=1
)
swiss_df["sm_free_ticket"] = swiss_df.apply(
lambda row: (row["GA"] == 1).astype(int), axis=1
)

swiss_df["train_travel_cost"] = swiss_df.apply(
lambda row: (row["TRAIN_CO"] * (1 - row["train_free_ticket"])) / 100, axis=1
)
swiss_df["sm_travel_cost"] = swiss_df.apply(
lambda row: (row["SM_CO"] * (1 - row["sm_free_ticket"])) / 100, axis=1
)
swiss_df["car_travel_cost"] = swiss_df.apply(lambda row: row["CAR_CO"] / 100, axis=1)

train_features = swiss_df[["train_travel_cost", "TRAIN_TT"]].to_numpy()
sm_features = swiss_df[["sm_travel_cost", "SM_TT"]].to_numpy()
car_features = swiss_df[["car_travel_cost", "CAR_TT"]].to_numpy()

items_features_by_choice = np.stack([train_features, sm_features, car_features], axis=1)

available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
+         # Re-Indexing choices from 1 to 3 to 0 to 2
+         choices = swiss_df.CHOICE.to_numpy() - 1
+
+         return ChoiceDataset(
+             shared_features_by_choice=None,
+             items_features_by_choice=items_features_by_choice,
+             available_items_by_choice=available_items_by_choice,
+             shared_features_by_choice_names=None,
+             items_features_by_choice_names=["cost", "travel_time"],
+             choices=choices,
+         )
if preprocessing == "rumnet":
-         # swiss_df = pd.DataFrame(data, columns=names)
-         # swiss_df = swiss_df.loc[swiss_df.CHOICE != 0]
        swiss_df["One"] = 1.0
        swiss_df["Zero"] = 0.0
-         # choices = swiss_df.CHOICE.to_numpy() - 1

available_items_by_choice = swiss_df[["TRAIN_AV", "SM_AV", "CAR_AV"]].to_numpy()
items_features_by_choice = np.stack(
[
@@ -409,8 +423,6 @@ def load_swissmetro(add_items_one_hot=False, as_frame=False, return_desc=False,
],
axis=1,
)
-         # shared_features_by_choice = df[["GROUP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE",
-         #     "AGE", "MALE", "INCOME", "GA", "ORIGIN", "DEST"]].to_numpy()

items_features_by_choice[:, :, 0] = items_features_by_choice[:, :, 0] / 1000
items_features_by_choice[:, :, 1] = items_features_by_choice[:, :, 1] / 5000
@@ -776,7 +788,6 @@ def load_electricity(
"""
_ = to_wide
data_file_name = "electricity.csv.gz"
-     # names, data = load_gzip(data_file_name)

description = """A sample of 2308 households in the United States.
- choice: the choice of the individual, one of 1, 2, 3, 4,
@@ -851,7 +862,6 @@ def load_train(
”Papers 9303, Laval-Recherche en Energie. https://ideas.repec.org/p/fth/lavaen/9303.html."""
_ = to_wide
data_file_name = "train_data.csv.gz"
-     # names, data = load_gzip(data_file_name)

full_path = get_path(data_file_name, module=DATA_MODULE)
train_df = pd.read_csv(full_path)
@@ -901,7 +911,6 @@ def load_car_preferences(
“Mixed MNL models for discrete response”, Journal of Applied Econometrics, 15(5), 447–470."""

data_file_name = "car.csv.gz"
-     # names, data = load_gzip(data_file_name)

full_path = get_path(data_file_name, module=DATA_MODULE)
cars_df = pd.read_csv(full_path)
@@ -938,3 +947,72 @@ def load_car_preferences(
choices_column="choice",
choice_format="items_id",
)


+ def load_hc(
+     as_frame=False,
+     return_desc=False,
+ ):
+     """Load and return the HC dataset from Kenneth Train.
+
+     Parameters
+     ----------
+     as_frame : bool, optional
+         Whether to return the dataset as pd.DataFrame. If not, returned as ChoiceDataset,
+         by default False.
+     return_desc : bool, optional
+         Whether to return the description, by default False.
+
+     Returns
+     -------
+     ChoiceDataset
+         Loaded HC dataset
+     """
+     desc = """HC contains data on the choice of heating and central cooling system for 250
+     single-family, newly built houses in California.
+
+     The alternatives are:
+         - Gas central heat with cooling (gcc),
+         - Electric central resistance heat with cooling (ecc),
+         - Electric room resistance heat with cooling (erc),
+         - Electric heat pump, which also provides cooling (hpc),
+         - Gas central heat without cooling (gc),
+         - Electric central resistance heat without cooling (ec),
+         - Electric room resistance heat without cooling (er).
+     Heat pumps necessarily provide both heating and cooling, such that a heat pump without
+     cooling is not an alternative.
+
+     The variables are:
+         - depvar: the name of the chosen alternative,
+         - ich.alt: the installation cost for the heating portion of the system,
+         - icca: the installation cost for cooling,
+         - och.alt: the operating cost for the heating portion of the system,
+         - occa: the operating cost for cooling,
+         - income: the annual income of the household.
+
+     Note that the full installation cost of alternative gcc is ich.gcc + icca, and similarly
+     for the operating cost and for the other alternatives with cooling.
+     """
+
+     data_file_name = "HC.csv.gz"
+
+     full_path = get_path(data_file_name, module=DATA_MODULE)
+     hc_df = pd.read_csv(full_path)
+
+     if return_desc:
+         return desc
+
+     if as_frame:
+         return hc_df
+
+     items_id = ["gcc", "ecc", "erc", "hpc", "gc", "ec", "er"]
+     return ChoiceDataset.from_single_wide_df(
+         df=hc_df,
+         shared_features_columns=["income"],
+         items_features_prefixes=["ich", "och", "occa", "icca"],
+         delimiter=".",
+         items_id=items_id,
+         choices_column="depvar",
+         choice_format="items_id",
+     )
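HC is the textbook nested-logit dataset: the seven systems split naturally into alternatives with and without cooling, so a natural follow-up is to feed the loader's output to the new model. The nest indices below follow the `items_id` order in the code above; the `NestedLogit` arguments are assumptions, with `notebooks/models/nested_logit.ipynb` as the reference.

```python
from choice_learn.datasets import load_hc
from choice_learn.models import NestedLogit

dataset = load_hc()  # ChoiceDataset with "income" as shared feature

# items_id order: ["gcc", "ecc", "erc", "hpc", "gc", "ec", "er"]
# -> indices 0-3 include cooling, indices 4-6 do not.
model = NestedLogit(items_nests=[[0, 1, 2, 3], [4, 5, 6]], optimizer="lbfgs")
model.fit(dataset)
```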
Binary file added choice_learn/datasets/data/HC.csv.gz
Binary file not shown.
9 changes: 6 additions & 3 deletions choice_learn/models/__init__.py
@@ -1,16 +1,19 @@
"""Models classes and functions."""
import logging

import tensorflow as tf

from .conditional_logit import ConditionalLogit
from .nested_logit import NestedLogit
from .simple_mnl import SimpleMNL
from .tastenet import TasteNet

if len(tf.config.list_physical_devices("GPU")) > 0:
print("GPU detected, importing GPU version of RUMnet.")
logging.info("GPU detected, importing GPU version of RUMnet.")
from .rumnet import GPURUMnet as RUMnet
else:
from .rumnet import CPURUMnet as RUMnet

print("No GPU detected, importing CPU version of RUMnet.")
logging.info("No GPU detected, importing CPU version of RUMnet.")

__all__ = ["ConditionalLogit", "RUMnet", "SimpleMNL", "TasteNet"]
__all__ = ["ConditionalLogit", "RUMnet", "SimpleMNL", "TasteNet", "NestedLogit"]
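Since `logging.info` is silent under Python's default WARNING-level root logger, users who relied on the old `print` output will no longer see the device message unless they configure logging first, e.g.:

```python
import logging

logging.basicConfig(level=logging.INFO)  # surface INFO-level messages

from choice_learn.models import RUMnet  # now logs which RUMnet variant was imported
```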
22 changes: 17 additions & 5 deletions choice_learn/models/base_model.py
@@ -69,6 +69,15 @@ def __init__(
        self.loss = tf_ops.CustomCategoricalCrossEntropy(
            from_logits=False, label_smoothing=self.label_smoothing
        )
+         self.exact_nll = tf_ops.CustomCategoricalCrossEntropy(
+             from_logits=False,
+             label_smoothing=0.0,
+             sparse=False,
+             axis=-1,
+             epsilon=1e-35,
+             name="exact_categorical_crossentropy",
+             reduction=tf.keras.losses.Reduction.AUTO,
+         )
self.callbacks = tf.keras.callbacks.CallbackList(callbacks, add_history=True, model=None)
self.callbacks.set_model(self)

@@ -462,7 +471,12 @@ def batch_predict(
y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
sample_weight=sample_weight,
),
"NegativeLogLikelihood": tf.keras.losses.CategoricalCrossentropy()(
# "NegativeLogLikelihood": tf.keras.losses.CategoricalCrossentropy()(
# y_pred=probabilities,
# y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
# sample_weight=sample_weight,
# ),
"Exact-NegativeLogLikelihood": self.exact_nll(
y_pred=probabilities,
y_true=tf.one_hot(choices, depth=probabilities.shape[1]),
sample_weight=sample_weight,
@@ -584,7 +598,7 @@ def evaluate(self, choice_dataset, sample_weight=None, batch_size=-1, mode="eval
sample_weight=sample_weight,
)
if mode == "eval":
batch_losses.append(loss["NegativeLogLikelihood"])
batch_losses.append(loss["Exact-NegativeLogLikelihood"])
elif mode == "optim":
batch_losses.append(loss["optimized_loss"])
if batch_size != -1:
@@ -670,7 +684,7 @@ def f(params_1d):
assign_new_model_parameters(params_1d)
# calculate the loss
loss_value = self.evaluate(
dataset, sample_weight=sample_weight, batch_size=-1, mode="optim"
dataset, sample_weight=sample_weight, batch_size=-1, mode="eval"
)
if self.regularization is not None:
regularization = tf.reduce_sum(
@@ -681,7 +695,6 @@ def f(params_1d):
# calculate gradients and convert to 1D tf.Tensor
grads = tape.gradient(loss_value, self.trainable_weights)
grads = tf.dynamic_stitch(idx, grads)

# print out iteration & loss
f.iter.assign_add(1)

@@ -729,7 +742,6 @@ def _fit_with_lbfgs(self, dataset, sample_weight=None, verbose=0):

# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, self.trainable_weights)

# train the model with L-BFGS solver
results = tfp.optimizer.lbfgs_minimize(
value_and_gradients_function=func,
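The motivation for the dedicated `exact_nll` objective: `tf.keras.losses.CategoricalCrossentropy` clips predicted probabilities at Keras' backend epsilon (about 1e-7), so reported negative log-likelihoods saturate for near-zero probabilities, while `epsilon=1e-35` makes the clipping negligible. A standalone TensorFlow illustration of the difference (not the library's own `tf_ops` class):

```python
import tensorflow as tf

y_true = tf.constant([[0.0, 1.0]])
y_pred = tf.constant([[1.0 - 1e-12, 1e-12]])  # chosen item has near-zero probability

keras_nll = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)
exact_nll = -tf.reduce_sum(
    y_true * tf.math.log(tf.clip_by_value(y_pred, 1e-35, 1.0)), axis=-1
)

print(float(keras_nll))     # ~16.1: saturated at -log(1e-7)
print(float(exact_nll[0]))  # ~27.6: the true -log(1e-12)
```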