From 7194e941d1d6f29b0b066ebc50428e7956b89fe2 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 18:26:38 -0400
Subject: [PATCH 01/14] first pass at vignette on forecast data

---
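Note (placed after the three-dash line, so it is not part of the commit): a minimal base-R sketch of how the rate, intensity category, and point-mass CDF encodings described in the vignette relate to one another. The admission count of 700 is a made-up illustrative value, not data from the hub. The population, category thresholds, and CDF grid are the ones quoted in the vignette text. The variable names are hypothetical.

```r
# Population of Massachusetts (FIPS "25"), as quoted in the vignette.
population <- 6978662
# Hypothetical weekly admission count, chosen only for illustration.
admissions <- 700

# Rate of admissions per 100,000 population.
rate <- admissions / (population / 100000)

# Intensity category: intervals are closed on the right, matching
# "low" (rate <= 2.5), "moderate" (2.5 < rate <= 5), and so on.
category <- cut(
  rate,
  breaks = c(-Inf, 2.5, 5, 7.5, Inf),
  labels = c("low", "moderate", "high", "very high"),
  right = TRUE
)

# Point-mass CDF on the vignette's evaluation grid: 0 below the
# observed rate, 1 at and above it.
cdf_grid <- seq(from = 0.25, to = 25, by = 0.25)
observed_cdf <- as.numeric(cdf_grid >= rate)

print(round(rate, 2))
print(as.character(category))
# First grid point at which the observed CDF equals 1.
print(cdf_grid[min(which(observed_cdf == 1))])
```

The same pattern, with the observed category in place of the observed rate, gives the 0/1 PMF values stored in `forecast_target_observations` for the category target.
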
 .gitignore                  |   1 +
 DESCRIPTION                 |   8 ++
 vignettes/.gitignore        |   2 +
 vignettes/forecast_data.Rmd | 256 ++++++++++++++++++++++++++++++++++++
 4 files changed, 267 insertions(+)
 create mode 100644 vignettes/.gitignore
 create mode 100644 vignettes/forecast_data.Rmd

diff --git a/.gitignore b/.gitignore
index 75fff09..7b3f3ad 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,3 +6,4 @@
 .secrets
 .quarto
 docs
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
index 9565f50..79b8dad 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -22,7 +22,14 @@ Encoding: UTF-8
 Roxygen: list(markdown = TRUE)
 RoxygenNote: 7.3.1
 Suggests: 
+    dplyr,
+    ggplot2,
+    hubVis,
+    knitr,
+    rmarkdown,
     testthat (>= 3.0.0)
+Remotes:
+    Infectious-Disease-Modeling-Hubs/hubVis
 Config/testthat/edition: 3
 URL: https://github.com/Infectious-Disease-Modeling-Hubs/hubExamples
 BugReports: https://github.com/Infectious-Disease-Modeling-Hubs/hubExamples/issues
@@ -30,3 +37,4 @@ Config/Needs/website: Infectious-Disease-Modeling-Hubs/hubStyle
 Depends: 
     R (>= 2.10)
 LazyData: true
+VignetteBuilder: knitr
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
new file mode 100644
index 0000000..097b241
--- /dev/null
+++ b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
new file mode 100644
index 0000000..258900d
--- /dev/null
+++ b/vignettes/forecast_data.Rmd
@@ -0,0 +1,256 @@
+---
+title: "Example forecast hub data"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Example forecast hub data}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.height = 5,
+  fig.width = 8,
+  fig.align = "center"
+)
+```
+
+```{r setup}
+library(hubExamples)
+library(hubVis)
+library(dplyr)
+library(ggplot2)
+```
+
+# Overview
+
+The `hubExamples` package provides three data sets that contain example model output and
+target data for an example forecast hub: `forecast_outputs`, `forecast_target_ts`, and
+`forecast_target_observations`. These forecasts and target data are a subset of the model outputs and target data that are provided in the [example-complex-forecast-hub](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub). These data were in turn derived from forecast submissions and target data for the [FluSight Forecast Hub](https://github.com/cdcepi/Flusight-forecast-data) for the 2022/23 season.
+
+We will begin with a high level overview of these data objects, and then we will describe the different forecast targets in more detail in \@ref(forecast-targets).
+
+## Example forecast output data
+
+The example forecasts provided in `forecast_outputs` are derived from forecasts that were submitted to the FluSight hub from three models: `Flusight-baseline`, `MOBS-GLEAM_FLUH`, and `PSI-DICE`. The original forecasts submitted to the hub were in quantile format, but we have modified those submissions to provide examples of additional model output types and targets. The predictions for these other output types should be viewed only as illustrations of the data formats, not as real examples of forecasts. We will describe the methods used for creating other forecast output types below.
+
+The snippet below shows the format of the `forecast_outputs`.
+
+```{r}
+head(forecast_outputs)
+```
+
+This is a data frame with four groups of columns (see the [hubverse documentation](https://hubverse.io/en/latest/user-guide/model-output.html) for more information about these data formats):
+
+1. The `model_id` identifies the model that produced the predictions.
+2. Together, the `location`, `reference_date`, `horizon`, `target_end_date`, and `target` columns serve to identify a prediction task:
+    - The `location` column contains a FIPS code specifying the location being predicted.
+    - The `reference_date` is a date in ISO format that gives the Saturday ending the week the predictions were generated.
+    - The `horizon` gives the difference between the `reference_date` and the target date of the forecasts (`target_end_date`, see next item) in units of weeks. Informally, this describes "how far ahead" the predictions are targeting.
+    - The `target_end_date` is a date in ISO format that gives the Saturday ending the week being predicted. For example, if the `target_end_date` is `"2022-12-17"`, predictions are for a quantity relating to influenza activity in the week from Sunday, December 10, 2022 through Saturday, December 17, 2022.
+    - The `target` describes the target quantity for the prediction. In the above example, the `target` of `"wk flu hosp rate"` is the weekly rate of hospital admissions per 100,000 population. The targets included in this example will be described in other sections below.
+3. The `output_type` and `output_type_id` columns provide metadata about the model predictions.
+    - The `output_type` specifies the representation of the predictive distribution.
+    - The `output_type_id` gives additional identifying information about the predictions; the information in this column is specific to the `output_type`.
+4. The `value` contains the value of the model's prediction.
+
+The original hub submissions contained predictions for many locations and dates, and quantile forecasts were provided at 23 different quantile levels ranging from 0.01 to 0.99.  To make the example data more manageable, the `forecast_outputs` object contains a subset of these outputs for two locations (Massachusetts, FIPS code `"25"`, and Texas, FIPS code `"48"`) and two reference dates (2022-11-19 and 2022-12-17).  Additionally, for the quantile forecasts we have subset to seven quantile levels: 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.95.
+
+## Example forecast target data
+
+All predictions are for targets that are based on influenza hospital admissions as reported in the US National Healthcare Safety Network (NHSN). The `forecast_target_ts` object contains the observed values of these hospital admissions in a "time series format":
+
+```{r}
+head(forecast_target_ts)
+tail(forecast_target_ts)
+```
+
+The `forecast_target_observations` object contains the observed values for the prediction targets:
+
+```{r}
+head(forecast_target_observations)
+```
+
+This data frame has a subset of the columns in `forecast_outputs` that is sufficient to identify the observed value corresponding to each prediction, including the `location`, `target_end_date`, `target`, `output_type`, and `output_type_id`, along with the observed target values, recorded in the `observation` column. Note that the `reference_date` and `horizon` columns are not needed in this data frame, since the `target_end_date` is sufficient to align observations with predicted values.
+
+# Further detail on the forecast targets
+
+The example forecast data contains the following combinations of `target` and `output_type`:
+
+```{r}
+forecast_outputs |> distinct(target, output_type)
+```
+
+We will describe each of these targets in the following sections.
+
+## The `wk inc flu hosp` target
+
+The `wk inc flu hosp` target represents weekly new hospital admissions with a confirmed influenza diagnosis.  We have predictions of this target with three output types: `quantile`, `mean`, and `median`.  The following plot shows the quantile and median predictions along with the observed hospital admission counts for Massachusetts and Texas.  Note that the quantile predictions were contributed directly by modelers to the FluSight hub, and median predictions correspond exactly to the quantile predictions at probability level 0.5.  We have obtained mean predictions from these using the [distfromq package](https://github.com/reichlab/distfromq) by estimating the full quantile function from the submitted quantile predictions, drawing a sample using the probability integral transform method, and computing the mean of those samples.
+
+```{r}
+plot_step_ahead_model_output(
+    model_output_data = forecast_outputs |>
+        filter(output_type %in% c("quantile", "median")),
+    target_data = forecast_target_ts |>
+        filter(location %in% c("25", "48"),
+               date >= "2022-10-01", date <= "2023-04-01"),
+    use_median_as_point = TRUE,
+    x_col_name = "target_end_date",
+    intervals = c(0.5, 0.8, 0.9),
+    facet = "location",
+    group = "reference_date",
+   interactive = FALSE
+)
+```
+
+For purposes of evaluating predictions, it can be helpful to join the observed target values, contained in `forecast_target_observations`, into the data frame of forecast outputs. This enables direct comparison of predictions and observations:
+
+```{r}
+forecast_outputs |>
+    left_join(forecast_target_observations)
+```
+
+## The `wk flu hosp rate` target
+
+The `wk flu hosp rate` target represents the rate of weekly confirmed influenza hospital admissions per 100,000 population. Note that this target was not included in the FluSight hub; we have introduced it here for illustrative purposes. We have used population values of 6,978,662 for Massachusetts and 29,914,599 for Texas. These population values are sourced from the [auxiliary data file](https://github.com/cdcepi/FluSight-forecast-hub/blob/main/auxiliary-data/locations.csv) provided by the FluSight hub, which is also reproduced in the [example-complex-forecast-hub](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub) repository.
+
+For this target, we created cumulative distribution function (CDF) predictions with evenly spaced CDF evaluation points ranging from 0.25 to 25 in increments of 0.25 hospitalizations per 100,000 population:
+
+```{r}
+forecast_outputs |>
+    filter(target == "wk flu hosp rate") |>
+    head()
+```
+
+For the CDF `output_type`, the `output_type_id` contains the value at which the predictive CDF was evaluated, and the `value` contains the predicted probability that the target is less than or equal to that evaluation point. In the above example, the `value` in the row with `output_type_id` equal to 1.5 contains the model's estimated probability that the rate of hospital admissions in Texas the week of December 17, 2022 would be less than or equal to 1.5 admissions per 100,000 population. Again, these CDF values were estimated from the original quantile forecasts using the methods in the `distfromq` package.
+
+The following plot illustrates the predictive CDFs produced by the `MOBS-GLEAM_FLUH` model for Massachusetts on the reference date 2022-12-17, with each `target_end_date` shown in a separate facet.  Also shown in orange is a CDF representing the observation for this target, which was between 9.75 and 10 hospitalizations per 100,000 population.  This CDF corresponds to a point mass at the observed value, with a value of 0 below the observation and a value of 1 above the observation.
+
+```{r}
+# Subset the forecasts and observations to those that we will plot
+forecasts_to_plot <- forecast_outputs |>
+    filter(
+        model_id == "MOBS-GLEAM_FLUH",
+        target == "wk flu hosp rate",
+        location == "25",
+        reference_date == "2022-12-17"
+    ) |>
+    mutate(output_type_id = as.numeric(output_type_id))
+head(forecasts_to_plot)
+
+target_observations_to_plot <- forecast_target_observations |>
+    filter(
+        target == "wk flu hosp rate",
+        location == "25",
+        target_end_date %in% forecasts_to_plot$target_end_date
+    ) |>
+    mutate(output_type_id = as.numeric(output_type_id))
+head(target_observations_to_plot)
+
+# We illustrate that the cdf values recorded in forecast_target_observations
+# correspond to a point mass at the observed hospitalization rate.
+first_one_ind <- min(which(target_observations_to_plot$observation == 1))
+target_observations_to_plot[(first_one_ind - 2):(first_one_ind + 2), ]
+
+# Make the plot
+ggplot() +
+    geom_line(
+        mapping = aes(x = output_type_id, y = value,
+                      color = "forecast", linetype = "forecast"),
+        data = forecasts_to_plot) +
+    geom_line(
+        mapping = aes(x = output_type_id, y = observation,
+                      color = "observation", linetype = "observation"),
+        data = target_observations_to_plot,
+    ) +
+    scale_color_manual(
+        "CDF",
+        values = c("black", "orange")) +
+    scale_linetype_manual(
+        "CDF",
+        values = c(1, 2)) +
+    facet_wrap(vars(target_end_date)) + 
+    xlab("output_type_id (units are hospital admissions per 100,000 population)") +
+    ylab("CDF value (units are probability)")
+```
+
+## The `wk flu hosp rate category` target
+
+The "wk flu hosp rate category" target represents a categorical intensity level of influenza activity, defined as "low" (hospital admissions rate per 100,000 $\leq$ 2.5), "moderate" (2.5 < admissions rate $\leq$ 5), "high" (5 < admissions rate $\leq$ 7.5), or "very high" (7.5 < admissions rate).  The `forecast_outputs` object has example forecasts for this target in a PMF format, with a probability assigned to each intensity category.  Again, forecasts of this target were not collected by the FluSight hub; we have derived predictions from the submitted quantile forecasts using the `distfromq` package.  For context, the following plot displays the observed data for the 2022/23 season on the scale of hospital admissions, with the boundaries of the intensity categories denoted with horizontal lines:
+
+```{r}
+# a data frame containing location FIPS codes and population values in units of 100,000 people
+population_values <- data.frame(
+    location = c("25", "48"),
+    population_100k = c(6978662, 29914599) / 100000
+)
+
+# compute observed hospital admission rates for the 2022/23 season
+observed_rates <- forecast_target_ts |>
+    filter(location %in% c("25", "48"),
+           date >= "2022-10-01", date <= "2023-04-01") |>
+    left_join(population_values) |>
+    mutate(rate = observation / population_100k)
+
+# plot along with intensity thresholds
+ggplot() +
+    geom_line(
+        mapping = aes(x = date, y = rate),
+        data = observed_rates
+    ) +
+    geom_hline(
+        mapping = aes(yintercept = y),
+        linetype = 2,
+        data = data.frame(y = c(2.5, 5, 7.5))
+    ) +
+    facet_wrap(vars(location))
+```
+
+Here is a plot showing the predictive distributions for these targets from the three included models.  Color indicates the predicted probability for each intensity category.  The observed category is indicated with a `+` in the plot, while unobserved categories are indicated with an `o`.  Here, the PMF value recorded in `forecast_target_observations` corresponds to a point mass at the observed category, with a value of 1 for the observed category and a value of 0 for the other categories.
+
+```{r}
+forecasts_to_plot <- forecast_outputs |>
+    filter(
+        target == "wk flu hosp rate category",
+        reference_date == "2022-12-17"
+    ) |>
+    mutate(
+        output_type_id = factor(output_type_id,
+                                levels = c("low", "moderate", "high", "very high"),
+                                ordered = TRUE)
+    )
+head(forecasts_to_plot)
+
+observations_to_plot <- forecast_target_observations |>
+    filter(
+        location %in% c("25", "48"),
+        target == "wk flu hosp rate category",
+        target_end_date %in% forecasts_to_plot$target_end_date
+    ) |>
+    mutate(
+        output_type_id = factor(output_type_id,
+                                levels = c("low", "moderate", "high", "very high"),
+                                ordered = TRUE)
+    )
+head(observations_to_plot)
+
+ggplot() +
+    geom_raster(
+        mapping = aes(x = target_end_date, y = output_type_id, fill = value),
+        data = forecasts_to_plot
+    ) +
+    scale_fill_viridis_c(breaks = seq(from = 0, to = 1, by = 0.2),
+                         limits = c(0, 1)) +
+    geom_point(
+        mapping = aes(x = target_end_date, y = output_type_id, shape = factor(observation)),
+        data = observations_to_plot
+    ) +
+    scale_shape_manual(
+        values = c(1, 3),
+        breaks = c(0, 1)
+    ) +
+    facet_grid(rows = vars(model_id), cols = vars(location)) +
+    ylab("output_type_id (intensity level category)")
+```

From 428bb8c65da33d1163404da7c3f09c45cb857906 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 18:37:32 -0400
Subject: [PATCH 02/14] attempt to fix linting and section reference

---
 vignettes/forecast_data.Rmd | 194 ++++++++++++++++++------------------
 1 file changed, 97 insertions(+), 97 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 258900d..cc5e7b9 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -30,7 +30,7 @@ The `hubExamples` package provides three data sets that contain example model ou
 target data for an example forecast hub: `forecast_outputs`, `forecast_target_ts`, and
 `forecast_target_observations`. These forecasts and target data are a subset of the model outputs and target data that are provided in the [example-complex-forecast-hub](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub). These data were in turn derived from forecast submissions and target data for the [FluSight Forecast Hub](https://github.com/cdcepi/Flusight-forecast-data) for the 2022/23 season.
 
-We will begin with a high level overview of these data objects, and then we will describe the different forecast targets in more detail in \@ref(forecast-targets).
+We will begin with a high level overview of these data objects, and then we will describe the different forecast targets in more detail in \@ref(further-detail-on-the-forecast-targets).
 
 ## Example forecast output data
 
@@ -91,17 +91,17 @@ The `wk inc flu hosp` target represents weekly new hospital admissions with a co
 
 ```{r}
 plot_step_ahead_model_output(
-    model_output_data = forecast_outputs |>
-        filter(output_type %in% c("quantile", "median")),
-    target_data = forecast_target_ts |>
-        filter(location %in% c("25", "48"),
-               date >= "2022-10-01", date <= "2023-04-01"),
-    use_median_as_point = TRUE,
-    x_col_name = "target_end_date",
-    intervals = c(0.5, 0.8, 0.9),
-    facet = "location",
-    group = "reference_date",
-   interactive = FALSE
+  model_output_data = forecast_outputs |>
+    filter(output_type %in% c("quantile", "median")),
+  target_data = forecast_target_ts |>
+    filter(location %in% c("25", "48"),
+           date >= "2022-10-01", date <= "2023-04-01"),
+  use_median_as_point = TRUE,
+  x_col_name = "target_end_date",
+  intervals = c(0.5, 0.8, 0.9),
+  facet = "location",
+  group = "reference_date",
+  interactive = FALSE
 )
 ```
 
@@ -109,7 +109,7 @@ For purposes of evaluating predictions, it can be helpful to join the observed t
 
 ```{r}
 forecast_outputs |>
-    left_join(forecast_target_observations)
+  left_join(forecast_target_observations)
 ```
 
 ## The `wk flu hosp rate` target
@@ -120,8 +120,8 @@ For this target, we created cumulative distribution function (CDF) predictions w
 
 ```{r}
 forecast_outputs |>
-    filter(target == "wk flu hosp rate") |>
-    head()
+  filter(target == "wk flu hosp rate") |>
+  head()
 ```
 
 For the CDF `output_type`, the `output_type_id` contains the value at which the predictive CDF was evaluated, and the `value` contains the predicted probability that the target is less than or equal to that evaluation point. In the above example, the `value` in the row with `output_type_id` equal to 1.5 contains the model's estimated probability that the rate of hospital admissions in Texas the week of December 17, 2022 would be less than or equal to 1.5 admissions per 100,000 population. Again, these CDF values were estimated from the original quantile forecasts using the methods in the `distfromq` package.
@@ -131,22 +131,22 @@ The following plot illustrates the predictive CDFs produced by the `MOBS-GLEAM_F
 ```{r}
 # Subset the forecasts and observations to those that we will plot
 forecasts_to_plot <- forecast_outputs |>
-    filter(
-        model_id == "MOBS-GLEAM_FLUH",
-        target == "wk flu hosp rate",
-        location == "25",
-        reference_date == "2022-12-17"
-    ) |>
-    mutate(output_type_id = as.numeric(output_type_id))
+  filter(
+    model_id == "MOBS-GLEAM_FLUH",
+    target == "wk flu hosp rate",
+    location == "25",
+    reference_date == "2022-12-17"
+  ) |>
+  mutate(output_type_id = as.numeric(output_type_id))
 head(forecasts_to_plot)
 
 target_observations_to_plot <- forecast_target_observations |>
-    filter(
-        target == "wk flu hosp rate",
-        location == "25",
-        target_end_date %in% forecasts_to_plot$target_end_date
-    ) |>
-    mutate(output_type_id = as.numeric(output_type_id))
+  filter(
+    target == "wk flu hosp rate",
+    location == "25",
+    target_end_date %in% forecasts_to_plot$target_end_date
+  ) |>
+  mutate(output_type_id = as.numeric(output_type_id))
 head(target_observations_to_plot)
 
 # We illustrate that the cdf values recorded in forecast_target_observations
@@ -156,24 +156,24 @@ target_observations_to_plot[(first_one_ind - 2):(first_one_ind + 2), ]
 
 # Make the plot
 ggplot() +
-    geom_line(
-        mapping = aes(x = output_type_id, y = value,
-                      color = "forecast", linetype = "forecast"),
-        data = forecasts_to_plot) +
-    geom_line(
-        mapping = aes(x = output_type_id, y = observation,
-                      color = "observation", linetype = "observation"),
-        data = target_observations_to_plot,
-    ) +
-    scale_color_manual(
-        "CDF",
-        values = c("black", "orange")) +
-    scale_linetype_manual(
-        "CDF",
-        values = c(1, 2)) +
-    facet_wrap(vars(target_end_date)) + 
-    xlab("output_type_id (units are hospital admissions per 100,000 population)") +
-    ylab("CDF value (units are probability)")
+  geom_line(
+    mapping = aes(x = output_type_id, y = value,
+                  color = "forecast", linetype = "forecast"),
+    data = forecasts_to_plot) +
+  geom_line(
+    mapping = aes(x = output_type_id, y = observation,
+                  color = "observation", linetype = "observation"),
+    data = target_observations_to_plot,
+  ) +
+  scale_color_manual(
+    "CDF",
+    values = c("black", "orange")) +
+  scale_linetype_manual(
+    "CDF",
+    values = c(1, 2)) +
+  facet_wrap(vars(target_end_date)) + 
+  xlab("output_type_id (units are hospital admissions per 100,000 population)") +
+  ylab("CDF value (units are probability)")
 ```
 
 ## The `wk flu hosp rate category` target
@@ -183,74 +183,74 @@ The "wk flu hosp rate category" target represents a categorical intensity level
 ```{r}
 # a data frame containing location FIPS codes and population values in units of 100,000 people
 population_values <- data.frame(
-    location = c("25", "48"),
-    population_100k = c(6978662, 29914599) / 100000
+  location = c("25", "48"),
+  population_100k = c(6978662, 29914599) / 100000
 )
 
 # compute observed hospital admission rates for the 2022/23 season
 observed_rates <- forecast_target_ts |>
-    filter(location %in% c("25", "48"),
-           date >= "2022-10-01", date <= "2023-04-01") |>
-    left_join(population_values) |>
-    mutate(rate = observation / population_100k)
+  filter(location %in% c("25", "48"),
+         date >= "2022-10-01", date <= "2023-04-01") |>
+  left_join(population_values) |>
+  mutate(rate = observation / population_100k)
 
 # plot along with intensity thresholds
 ggplot() +
-    geom_line(
-        mapping = aes(x = date, y = rate),
-        data = observed_rates
-    ) +
-    geom_hline(
-        mapping = aes(yintercept = y),
-        linetype = 2,
-        data = data.frame(y = c(2.5, 5, 7.5))
-    ) +
-    facet_wrap(vars(location))
+  geom_line(
+    mapping = aes(x = date, y = rate),
+    data = observed_rates
+  ) +
+  geom_hline(
+    mapping = aes(yintercept = y),
+    linetype = 2,
+    data = data.frame(y = c(2.5, 5, 7.5))
+  ) +
+  facet_wrap(vars(location))
 ```
 
 Here is a plot showing the predictive distributions for these targets from the three included models.  Color indicates the predicted probability for each intensity category.  The observed category is indicated with a `+` in the plot, while unobserved categories are indicated with an `o`.  Here, the PMF value recorded in `forecast_target_observations` corresponds to a point mass at the observed category, with a value of 1 for the observed category and a value of 0 for the other categories.
 
 ```{r}
 forecasts_to_plot <- forecast_outputs |>
-    filter(
-        target == "wk flu hosp rate category",
-        reference_date == "2022-12-17"
-    ) |>
-    mutate(
-        output_type_id = factor(output_type_id,
-                                levels = c("low", "moderate", "high", "very high"),
-                                ordered = TRUE)
-    )
+  filter(
+    target == "wk flu hosp rate category",
+    reference_date == "2022-12-17"
+  ) |>
+  mutate(
+    output_type_id = factor(output_type_id,
+                            levels = c("low", "moderate", "high", "very high"),
+                            ordered = TRUE)
+  )
 head(forecasts_to_plot)
 
 observations_to_plot <- forecast_target_observations |>
-    filter(
-        location %in% c("25", "48"),
-        target == "wk flu hosp rate category",
-        target_end_date %in% forecasts_to_plot$target_end_date
-    ) |>
-    mutate(
-        output_type_id = factor(output_type_id,
-                                levels = c("low", "moderate", "high", "very high"),
-                                ordered = TRUE)
-    )
+  filter(
+    location %in% c("25", "48"),
+    target == "wk flu hosp rate category",
+    target_end_date %in% forecasts_to_plot$target_end_date
+  ) |>
+  mutate(
+    output_type_id = factor(output_type_id,
+                            levels = c("low", "moderate", "high", "very high"),
+                            ordered = TRUE)
+  )
 head(observations_to_plot)
 
 ggplot() +
-    geom_raster(
-        mapping = aes(x = target_end_date, y = output_type_id, fill = value),
-        data = forecasts_to_plot
-    ) +
-    scale_fill_viridis_c(breaks = seq(from = 0, to = 1, by = 0.2),
-                         limits = c(0, 1)) +
-    geom_point(
-        mapping = aes(x = target_end_date, y = output_type_id, shape = factor(observation)),
-        data = observations_to_plot
-    ) +
-    scale_shape_manual(
-        values = c(1, 3),
-        breaks = c(0, 1)
-    ) +
-    facet_grid(rows = vars(model_id), cols = vars(location)) +
-    ylab("output_type_id (intensity level category)")
+  geom_raster(
+    mapping = aes(x = target_end_date, y = output_type_id, fill = value),
+    data = forecasts_to_plot
+  ) +
+  scale_fill_viridis_c(breaks = seq(from = 0, to = 1, by = 0.2),
+                       limits = c(0, 1)) +
+  geom_point(
+    mapping = aes(x = target_end_date, y = output_type_id, shape = factor(observation)),
+    data = observations_to_plot
+  ) +
+  scale_shape_manual(
+    values = c(1, 3),
+    breaks = c(0, 1)
+  ) +
+  facet_grid(rows = vars(model_id), cols = vars(location)) +
+  ylab("output_type_id (intensity level category)")
 ```

From c2a2280cf852d23ea7719d9ecbdeaac96f58924f Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 18:40:24 -0400
Subject: [PATCH 03/14] more linter appeasement

---
 vignettes/forecast_data.Rmd | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index cc5e7b9..89ebcd6 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -159,7 +159,8 @@ ggplot() +
   geom_line(
     mapping = aes(x = output_type_id, y = value,
                   color = "forecast", linetype = "forecast"),
-    data = forecasts_to_plot) +
+    data = forecasts_to_plot
+  ) +
   geom_line(
     mapping = aes(x = output_type_id, y = observation,
                   color = "observation", linetype = "observation"),
@@ -167,10 +168,12 @@ ggplot() +
   ) +
   scale_color_manual(
     "CDF",
-    values = c("black", "orange")) +
+    values = c("black", "orange")
+  ) +
   scale_linetype_manual(
     "CDF",
-    values = c(1, 2)) +
+    values = c(1, 2)
+  ) +
   facet_wrap(vars(target_end_date)) + 
   xlab("output_type_id (units are hospital admissions per 100,000 population)") +
   ylab("CDF value (units are probability)")

From 5eac357e9ca19ceec31fdcee15ea309f8821136b Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 18:48:15 -0400
Subject: [PATCH 04/14] give up on section reference

---
 vignettes/forecast_data.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 89ebcd6..7101270 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -30,7 +30,7 @@ The `hubExamples` package provides three data sets that contain example model ou
 target data for an example forecast hub: `forecast_outputs`, `forecast_target_ts`, and
 `forecast_target_observations`. These forecasts and target data are a subset of the model outputs and target data that are provided in the [example-complex-forecast-hub](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub). These data were in turn derived from forecast submissions and target data for the [FluSight Forecast Hub](https://github.com/cdcepi/Flusight-forecast-data) for the 2022/23 season.
 
-We will begin with a high level overview of these data objects, and then we will describe the different forecast targets in more detail in \@ref(further-detail-on-the-forecast-targets).
+We begin with a high-level overview of these data objects and then describe the different forecast targets in more detail.
 
 ## Example forecast output data
 

From 04726febdb40c5627d297b92d453440af16785d5 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 18:50:56 -0400
Subject: [PATCH 05/14] delete a space

---
 vignettes/forecast_data.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 7101270..de5e833 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -174,7 +174,7 @@ ggplot() +
     "CDF",
     values = c(1, 2)
   ) +
-  facet_wrap(vars(target_end_date)) + 
+  facet_wrap(vars(target_end_date)) +
   xlab("output_type_id (units are hospital admissions per 100,000 population)") +
   ylab("CDF value (units are probability)")
 ```

From dec509bd6af98f241190a8331f88b6fb851f1942 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 19:07:32 -0400
Subject: [PATCH 06/14] fix typo

---
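Note (placed after the three-dash line, so it is not part of the commit): a quick base-R check of the corrected date, assuming the week runs Sunday through Saturday as the vignette states. The variable names are hypothetical.

```r
# The target_end_date used as the example in the vignette text.
target_end_date <- as.Date("2022-12-17")

# A Sunday-to-Saturday week starts six days before its ending Saturday.
week_start <- target_end_date - 6

# as.POSIXlt()$wday is locale-independent: 0 = Sunday, 6 = Saturday.
wday_start <- as.POSIXlt(week_start)$wday
wday_end <- as.POSIXlt(target_end_date)$wday

print(format(week_start))
print(c(start = wday_start, end = wday_end))
```

The week start comes out as December 11, which is the value this patch substitutes for December 10.
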
 vignettes/forecast_data.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index de5e833..9eecfda 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -49,7 +49,7 @@ This is a data frame with four groups of columns (see the [hubverse documentatio
     - The `location` column contains a FIPS code specifying the location being predicted.
     - The `reference_date` is a date in ISO format that gives the Saturday ending the week the predictions were generated.
     - The `horizon` gives the difference between the `reference_date` and the target date of the forecasts (`target_end_date`, see next item) in units of weeks. Informally, this describes "how far ahead" the predictions are targeting.
-    - The `target_end_date` is a date in ISO format that gives the Saturday ending the week being predicted. For example, if the `target_end_date` is `"2022-12-17"`, predictions are for a quantity relating to influenza activity in the week from Sunday, December 10, 2022 through Saturday, December 17, 2022.
+    - The `target_end_date` is a date in ISO format that gives the Saturday ending the week being predicted. For example, if the `target_end_date` is `"2022-12-17"`, predictions are for a quantity relating to influenza activity in the week from Sunday, December 11, 2022 through Saturday, December 17, 2022.
     - The `target` describes the target quantity for the prediction. In the above example, the `target` of `"wk flu hosp rate"` is the weekly rate of hospital admissions per 100,000 population. The targets included in this example will be described in other sections below.
 3. The `output_type` and `output_type_id` columns provide metadata about the model predictions.
     - The `output_type` specifies the representation of the predictive distribution.

From 531bf1cc90d93b558328298e7d9748bcaca90f86 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 19:19:01 -0400
Subject: [PATCH 07/14] a few minor updates to text

---
 vignettes/forecast_data.Rmd | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 9eecfda..a0acb7b 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -40,6 +40,7 @@ The snippet below shows the format of the `forecast_outputs`.
 
 ```{r}
 head(forecast_outputs)
+str(forecast_outputs)
 ```
 
 This is a data frame with four groups of columns (see the [hubverse documentation](https://hubverse.io/en/latest/user-guide/model-output.html) for more information about these data formats):
@@ -54,13 +55,13 @@ This is a data frame with four groups of columns (see the [hubverse documentatio
 3. The `output_type` and `output_type_id` columns provide metadata about the model predictions.
     - The `output_type` specifies the representation of the predictive distribution.
     - The `output_type_id` gives additional identifying information about the predictions; the information in this column is specific to the `output_type`.
-4. The `value` contains the value of the model's prediction.
+4. The `value` column contains the value of the model's prediction.
 
 The original hub submissions contained predictions for many locations and dates, and quantile forecasts were provided at 23 different quantile levels ranging from 0.01 to 0.99.  To make the example data more manageable, the `forecast_outputs` object contains a subset of these outputs for two locations (Massachusetts, FIPS code `"25"`, and Texas, FIPS code `"48"`) and two reference dates (2022-11-19 and 2022-12-17).  Additionally, for the quantile forecasts we have subset to seven quantile levels: 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.95.
 
 ## Example forecast target data
 
-All predictions are for targets that are based on influenza hospital admissions as reported in the US National Healthcare Safety Network (NHSN). The `forecast_target_ts` object contains the observed values of these hospital admissions in a "time series format":
+All predictions are for targets that are based on influenza hospital admissions as reported in the US National Healthcare Safety Network (NHSN). The `forecast_target_ts` object contains the observed values of these hospital admissions in a time series format:
 
 ```{r}
 head(forecast_target_ts)
@@ -73,7 +74,7 @@ The `forecast_target_observations` object contains the observed values for the p
 head(forecast_target_observations)
 ```
 
-This data frame has a subset of the columns in the `forecast_outputs` that is sufficient to identify the observed value corresponding to each prediction, including the `location`, `target_end_date`, `target`, `output_type`, and `output_type_id`, along with the observed target values, recorded in the `observation` columns. Note that the `reference_date`, and `horizon` columns are not needed in this data frame, since the `target_end_date` is sufficient to align observations with predicted values.
+This data frame has a subset of the columns in the `forecast_outputs` that is sufficient to identify the observed value corresponding to each prediction, including the `location`, `target_end_date`, `target`, `output_type`, and `output_type_id`, along with the observed target values, recorded in the `observation` column. Note that the `reference_date` and `horizon` columns are not needed in this data frame, since the `target_end_date` is sufficient to align observations with predictions.
 
 # Further detail on the forecast targets
 
@@ -87,7 +88,7 @@ We will describe each of these targets in the following sections.
 
 ## The `wk inc flu hosp` target
 
-The `wk inc flu hosp` target represents weekly new hospital admissions with a confirmed influenza diagnosis.  We have predictions of this target with three output types: `quantile`, `mean`, and `median`.  The following plot shows the quantile and median predictions along with the observed hospital admission counts for Massachusetts and Texas.  Note that the quantile predictions were contributed directly by modelers to the FluSight hub, and median predictions correspond exactly to the quantile predictions at probability level 0.5.  We have obtained mean predictions from these using the [distfromq package](https://github.com/reichlab/distfromq) by estimating the full quantile function from the submitted quantile predictions, drawing a sample using the probability integral transform method, and computing the mean of those samples.
+The `wk inc flu hosp` target represents weekly new hospital admissions with a confirmed influenza diagnosis.  We have predictions of this target with three output types: `quantile`, `mean`, and `median`.  The following plot shows the quantile and median predictions along with the observed hospital admission counts for Massachusetts and Texas.  Note that the quantile predictions were contributed directly by modelers to the FluSight hub, and median predictions correspond exactly to the quantile predictions at probability level 0.5.  We have obtained mean predictions from the quantile forecasts using the [distfromq package](https://github.com/reichlab/distfromq) by estimating the full quantile function from the submitted quantile predictions, drawing a sample using the probability integral transform method, and computing the mean of those samples.
 
 ```{r}
 plot_step_ahead_model_output(
@@ -105,11 +106,12 @@ plot_step_ahead_model_output(
 )
 ```
 
-For purposes of evaluating predictions, it can be helpful to join the observed target values, contained in `forecast_target_observations`, into the data frame of forecast outputs. This enables direct comparison of predictions and observations:
+For purposes of evaluating predictions, it can be helpful to join the observed target values, contained in `forecast_target_observations`, into the data frame of forecast outputs. This enables direct comparison of predictions and observations. We illustrate this here, omitting some columns from the display for convenience:
 
 ```{r}
 forecast_outputs |>
-  left_join(forecast_target_observations)
+  left_join(forecast_target_observations) |>
+  select(-model_id, -reference_date, -horizon)
 ```
 
 ## The `wk flu hosp rate` target
@@ -121,6 +123,7 @@ For this target, we created cumulative distribution function (CDF) predictions w
 ```{r}
 forecast_outputs |>
   filter(target == "wk flu hosp rate") |>
+  select(-model_id, -reference_date, -horizon) |>
   head()
 ```
 
@@ -147,7 +150,9 @@ target_observations_to_plot <- forecast_target_observations |>
     target_end_date %in% forecasts_to_plot$target_end_date
   ) |>
   mutate(output_type_id = as.numeric(output_type_id))
-head(target_observations_to_plot)
+target_observations_to_plot |>
+  select(-model_id, -reference_date, -horizon) |>
+  head()
 
 # We illustrate that the cdf values recorded in forecast_target_observations
 # correspond to a point mass at the observed hospitalization rate.
@@ -224,7 +229,9 @@ forecasts_to_plot <- forecast_outputs |>
                             levels = c("low", "moderate", "high", "very high"),
                             ordered = TRUE)
   )
-head(forecasts_to_plot)
+forecasts_to_plot |>
+  select(-model_id, -reference_date, -horizon) |>
+  head()
 
 observations_to_plot <- forecast_target_observations |>
   filter(
@@ -237,7 +244,8 @@ observations_to_plot <- forecast_target_observations |>
                             levels = c("low", "moderate", "high", "very high"),
                             ordered = TRUE)
   )
-head(observations_to_plot)
+observations_to_plot |>
+  head()
 
 ggplot() +
   geom_raster(

From fe936789c354715e4660164cd59cfcca7a284e4a Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 19:23:25 -0400
Subject: [PATCH 08/14] fix bug i introduced

---
 vignettes/forecast_data.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index a0acb7b..986990f 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -151,7 +151,7 @@ target_observations_to_plot <- forecast_target_observations |>
   ) |>
   mutate(output_type_id = as.numeric(output_type_id))
 target_observations_to_plot |>
-  select(-model_id, -reference_date, -horizon) |>
+  select(-reference_date, -horizon) |>
   head()
 
 # We illustrate that the cdf values recorded in forecast_target_observations

From 13bd458c3f6758843e8662b9c8b884fe12416425 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Thu, 25 Apr 2024 19:30:22 -0400
Subject: [PATCH 09/14] fix more bugs i introduced

---
 vignettes/forecast_data.Rmd | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
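
Reviewer note (not part of the patch): the code comment in this chunk says the CDF values in `forecast_target_observations` correspond to a point mass at the observed hospitalization rate. A minimal sketch of what that means, using a made-up observed rate and a made-up grid of `output_type_id` values (neither is taken from the package data):

```r
# Hypothetical observed rate, in admissions per 100,000 population.
observed_rate <- 3.2

# Hypothetical grid of points at which the CDF is evaluated.
cdf_points <- c(1, 2.5, 3.2, 5, 7.5)

# The CDF of a point mass at observed_rate is 0 below it and 1 at or above it.
observation <- as.numeric(observed_rate <= cdf_points)
observation  # 0 0 1 1 1
```

Plotted against the grid, this is a step function that jumps from 0 to 1 at the observed rate.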

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 986990f..7e74381 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -150,9 +150,7 @@ target_observations_to_plot <- forecast_target_observations |>
     target_end_date %in% forecasts_to_plot$target_end_date
   ) |>
   mutate(output_type_id = as.numeric(output_type_id))
-target_observations_to_plot |>
-  select(-reference_date, -horizon) |>
-  head()
+head(target_observations_to_plot)
 
 # We illustrate that the cdf values recorded in forecast_target_observations
 # correspond to a point mass at the observed hospitalization rate.

From 01bd33c25053f7daa2421a7cc375cbde41ea3084 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Sat, 27 Apr 2024 13:31:29 -0400
Subject: [PATCH 10/14] updates to text around introduction of model outputs

---
 vignettes/forecast_data.Rmd | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 7e74381..539373a 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -46,8 +46,8 @@ str(forecast_outputs)
 This is a data frame with four groups of columns (see the [hubverse documentation](https://hubverse.io/en/latest/user-guide/model-output.html) for more information about these data formats):
 
 1. The `model_id` identifies the model that produced the predictions.
-2. Together, the `location`, `reference_date`, `horizon`, `target_end_date`, and `target` columns serve to identify a prediction task:
-    - The `location` column contains a FIPS code specifying the location being predicted.
+2. Together, the `location`, `reference_date`, `horizon`, `target_end_date`, and `target` columns are referred to as "task id variables" because they serve to identify a prediction task:
+    - The `location` column contains a FIPS code identifying the location being predicted.
     - The `reference_date` is a date in ISO format that gives the Saturday ending the week the predictions were generated.
     - The `horizon` gives the difference between the `reference_date` and the target date of the forecasts (`target_end_date`, see next item) in units of weeks. Informally, this describes "how far ahead" the predictions are targeting.
     - The `target_end_date` is a date in ISO format that gives the Saturday ending the week being predicted. For example, if the `target_end_date` is `"2022-12-17"`, predictions are for a quantity relating to influenza activity in the week from Sunday, December 11, 2022 through Saturday, December 17, 2022.
@@ -59,6 +59,8 @@ This is a data frame with four groups of columns (see the [hubverse documentatio
 
 The original hub submissions contained predictions for many locations and dates, and quantile forecasts were provided at 23 different quantile levels ranging from 0.01 to 0.99.  To make the example data more manageable, the `forecast_outputs` object contains a subset of these outputs for two locations (Massachusetts, FIPS code `"25"`, and Texas, FIPS code `"48"`) and two reference dates (2022-11-19 and 2022-12-17).  Additionally, for the quantile forecasts we have subset to seven quantile levels: 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.95.
 
+The task id variables used and values of those variables are specific to each forecast hub. For example, a hub collecting predictions for locations other than US states would use a different location identifier than FIPS codes, and a hub might introduce additional task id variables such as an identifier of age group or disease variant depending on the goals of the hub. See the hubverse documentation for further information about [task id variables](https://hubverse.io/en/latest/user-guide/tasks.html#task-id-variables).
+
 ## Example forecast target data
 
 All predictions are for targets that are based on influenza hospital admissions as reported in the US National Healthcare Safety Network (NHSN). The `forecast_target_ts` object contains the observed values of these hospital admissions in a time series format:

From cdd5af09cc35de7cbba340a3097e2a16d84c113c Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Sat, 27 Apr 2024 13:39:44 -0400
Subject: [PATCH 11/14] try width 100

---
 vignettes/forecast_data.Rmd | 1 +
 1 file changed, 1 insertion(+)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 539373a..d9d2c9c 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -15,6 +15,7 @@ knitr::opts_chunk$set(
   fig.width = 8,
   fig.align = "center"
 )
+options(width = 100)
 ```
 
 ```{r setup}

From ca152c71fdd05a86a9aa4e118e36c3e51e74d361 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Sat, 27 Apr 2024 13:52:35 -0400
Subject: [PATCH 12/14] updates to code style, commenting, and width

---
 vignettes/forecast_data.Rmd | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index d9d2c9c..6ba2537 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -15,7 +15,7 @@ knitr::opts_chunk$set(
   fig.width = 8,
   fig.align = "center"
 )
-options(width = 100)
+options(width = 110)
 ```
 
 ```{r setup}
@@ -41,7 +41,6 @@ The snippet below shows the format of the `forecast_outputs`.
 
 ```{r}
 head(forecast_outputs)
-str(forecast_outputs)
 ```
 
 This is a data frame with four groups of columns (see the [hubverse documentation](https://hubverse.io/en/latest/user-guide/model-output.html) for more information about these data formats):
@@ -190,7 +189,8 @@ ggplot() +
 The "wk flu hosp rate category" target represents a categorical intensity level of influenza activity, defined as "low" (hospital admissions rate per 100,000 $\leq$ 2.5), "moderate" (2.5 < admissions rate $\leq$ 5), "high" (5 < admissions rate $\leq$ 7.5), or "very high" (7.5 < admissions rate).  The `forecast_outputs` object has example forecasts for this target in a PMF format, with a probability assigned to each intensity category.  Again, forecasts of this target were not collected by the FluSight hub; we have derived predictions from the submitted quantile forecasts using the `distfromq` package.  For context, the following plot displays the observed data for the 2022/23 season on the scale of hospital admissions, with the boundaries of the intensity categories denoted with horizontal lines:
 
 ```{r}
-# a data frame containing location FIPS codes and population values in units of 100,000 people
+# a data frame containing location FIPS codes and population values
+# in units of 100,000 people
 population_values <- data.frame(
   location = c("25", "48"),
   population_100k = c(6978662, 29914599) / 100000
@@ -220,6 +220,8 @@ ggplot() +
 Here is a plot showing the predictive distributions for these targets from the three included models.  Color indicates the predicted probability for each intensity category.  The observed category is indicated with a `+` in the plot, while unobserved categories are indicated with an `o`.  Here, the PMF value recorded in `forecast_target_observations` corresponds to a point mass at the observed category, with a value of 1 for the observed category and a value of 0 for the other categories.
 
 ```{r}
+# extract a subset of forecasts to plot and
+# set the output_type_id to be an ordered factor
 forecasts_to_plot <- forecast_outputs |>
   filter(
     target == "wk flu hosp rate category",
@@ -234,6 +236,7 @@ forecasts_to_plot |>
   select(-model_id, -reference_date, -horizon) |>
   head()
 
+# extract the corresponding observations
 observations_to_plot <- forecast_target_observations |>
   filter(
     location %in% c("25", "48"),
@@ -248,15 +251,19 @@ observations_to_plot <- forecast_target_observations |>
 observations_to_plot |>
   head()
 
+# plot the predictions and observations
 ggplot() +
   geom_raster(
     mapping = aes(x = target_end_date, y = output_type_id, fill = value),
     data = forecasts_to_plot
   ) +
-  scale_fill_viridis_c(breaks = seq(from = 0, to = 1, by = 0.2),
-                       limits = c(0, 1)) +
+  scale_fill_viridis_c(
+    breaks = seq(from = 0, to = 1, by = 0.2),
+    limits = c(0, 1)
+  ) +
   geom_point(
-    mapping = aes(x = target_end_date, y = output_type_id, shape = factor(observation)),
+    mapping = aes(x = target_end_date, y = output_type_id,
+                  shape = factor(observation)),
     data = observations_to_plot
   ) +
   scale_shape_manual(

From dc8cf40731a999ee8245d1f8868d4c075b1ce5c1 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Sat, 27 Apr 2024 14:00:30 -0400
Subject: [PATCH 13/14] another update to width, note about scrolling

---
 vignettes/forecast_data.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index 6ba2537..b5337d9 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -15,7 +15,7 @@ knitr::opts_chunk$set(
   fig.width = 8,
   fig.align = "center"
 )
-options(width = 110)
+options(width = 120)
 ```
 
 ```{r setup}
@@ -37,7 +37,7 @@ We begin with a high level overview of these data objects and then we describe t
 
 The example forecasts provided in `forecast_outputs` are derived from forecasts that were submitted to the FluSight hub from three models: `Flusight-baseline`, `MOBS-GLEAM_FLUH`, and `PSI-DICE`. The original forecasts submitted to the hub were in quantile format, but we have modified those submissions to provide examples of additional model output types and targets. The predictions for these other output types should be viewed only as illustrations of the data formats, not as real examples of forecasts. We will describe the methods used for creating other forecast output types below.
 
-The snippet below shows the format of the `forecast_outputs`.
+The snippet below shows the format of the `forecast_outputs` (note: here and throughout the document, you may need to scroll to the right within displays of code output to see all data frame columns).
 
 ```{r}
 head(forecast_outputs)

From 9055205a4ffca2fcf06422ab0b21d4a6d66f1ce4 Mon Sep 17 00:00:00 2001
From: Evan Ray <elray@umass.edu>
Date: Sat, 27 Apr 2024 15:05:35 -0400
Subject: [PATCH 14/14] misc updates

---
 vignettes/forecast_data.Rmd | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
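
Reviewer note (not part of the patch): the intensity thresholds described for the `wk flu hosp rate category` target can be expressed with base R's `cut()`. This is a sketch under the stated thresholds only. The rates are made up, and this is not necessarily how the example data were derived.

```r
# Hypothetical rates (admissions per 100,000), placed on and around the thresholds.
rates <- c(1.0, 2.5, 2.6, 5.0, 7.5, 9.2)

# right = TRUE gives intervals of the form (a, b], so a rate equal to a
# threshold falls in the lower category, matching the definitions in the text.
categories <- cut(
  rates,
  breaks = c(-Inf, 2.5, 5, 7.5, Inf),
  labels = c("low", "moderate", "high", "very high"),
  right = TRUE,
  ordered_result = TRUE
)
as.character(categories)
# "low" "low" "moderate" "moderate" "high" "very high"

# A point-mass PMF for one observation (here the third rate, 2.6):
# 1 for the observed category and 0 for the others.
observed_pmf <- as.numeric(levels(categories) == as.character(categories[3]))
observed_pmf  # 0 1 0 0
```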

diff --git a/vignettes/forecast_data.Rmd b/vignettes/forecast_data.Rmd
index b5337d9..823c706 100644
--- a/vignettes/forecast_data.Rmd
+++ b/vignettes/forecast_data.Rmd
@@ -59,7 +59,7 @@ This is a data frame with four groups of columns (see the [hubverse documentatio
 
 The original hub submissions contained predictions for many locations and dates, and quantile forecasts were provided at 23 different quantile levels ranging from 0.01 to 0.99.  To make the example data more manageable, the `forecast_outputs` object contains a subset of these outputs for two locations (Massachusetts, FIPS code `"25"`, and Texas, FIPS code `"48"`) and two reference dates (2022-11-19 and 2022-12-17).  Additionally, for the quantile forecasts we have subset to seven quantile levels: 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, and 0.95.
 
-The task id variables used and values of those variables are specific to each forecast hub. For example, a hub collecting predictions for locations other than US states would use a different location identifier than FIPS codes, and a hub might introduce additional task id variables such as an identifier of age group or disease variant depending on the goals of the hub. See the hubverse documentation for further information about [task id variables](https://hubverse.io/en/latest/user-guide/tasks.html#task-id-variables).
+The task id variables used and values of those variables are specific to each modeling hub. For example, a hub collecting predictions for locations other than US states would use a different location identifier than FIPS codes, and a hub might introduce additional task id variables such as an identifier of age group or disease variant depending on the goals of the hub. See the hubverse documentation for further information about [task id variables](https://hubverse.io/en/latest/user-guide/tasks.html#task-id-variables).
 
 ## Example forecast target data
 
@@ -90,7 +90,7 @@ We will describe each of these targets in the following sections.
 
 ## The `wk inc flu hosp` target
 
-The `wk inc flu hosp` target represents weekly new hospital admissions with a confirmed influenza diagnosis.  We have predictions of this target with three output types: `quantile`, `mean`, and `median`.  The following plot shows the quantile and median predictions along with the observed hospital admission counts for Massachusetts and Texas.  Note that the quantile predictions were contributed directly by modelers to the FluSight hub, and median predictions correspond exactly to the quantile predictions at probability level 0.5.  We have obtained mean predictions from the quantile forecasts using the [distfromq package](https://github.com/reichlab/distfromq) by estimating the full quantile function from the submitted quantile predictions, drawing a sample using the probability integral transform method, and computing the mean of those samples.
+The `wk inc flu hosp` target represents weekly new hospital admissions with a confirmed influenza diagnosis.  We have predictions of this target with three output types: `quantile`, `mean`, and `median`.  The following plot shows the quantile and median predictions along with the observed hospital admission counts for Massachusetts and Texas.  Note that the quantile predictions were contributed directly by modelers to the FluSight hub, and median predictions correspond exactly to the quantile predictions at probability level 0.5.  We have obtained mean predictions from the quantile forecasts using the [distfromq package](https://github.com/reichlab/distfromq) by estimating the full quantile function from the submitted quantile predictions, drawing samples from that distribution using the probability integral transform method, and computing the mean of those samples.
 
 ```{r}
 plot_step_ahead_model_output(
@@ -112,6 +112,7 @@ For purposes of evaluating predictions, it can be helpful to join the observed t
 
 ```{r}
 forecast_outputs |>
+  filter(target == "wk inc flu hosp") |>
   left_join(forecast_target_observations) |>
   select(-model_id, -reference_date, -horizon)
 ```
@@ -186,7 +187,7 @@ ggplot() +
 
 ## The `wk flu hosp rate category` target
 
-The "wk flu hosp rate category" target represents a categorical intensity level of influenza activity, defined as "low" (hospital admissions rate per 100,000 $\leq$ 2.5), "moderate" (2.5 < admissions rate $\leq$ 5), "high" (5 < admissions rate $\leq$ 7.5), or "very high" (7.5 < admissions rate).  The `forecast_outputs` object has example forecasts for this target in a PMF format, with a probability assigned to each intensity category.  Again, forecasts of this target were not collected by the FluSight hub; we have derived predictions from the submitted quantile forecasts using the `distfromq` package.  For context, the following plot displays the observed data for the 2022/23 season on the scale of hospital admissions, with the boundaries of the intensity categories denoted with horizontal lines:
+The "wk flu hosp rate category" target represents a categorical intensity level of influenza activity, defined as "low" (hospital admissions rate per 100,000 $\leq$ 2.5), "moderate" (2.5 < admissions rate $\leq$ 5), "high" (5 < admissions rate $\leq$ 7.5), or "very high" (7.5 < admissions rate).  The `forecast_outputs` object has example forecasts for this target in a PMF format, with a probability assigned to each intensity category.  Again, forecasts of this target were not collected by the FluSight hub; we have derived predictions from the submitted quantile forecasts using the `distfromq` package.  For context, the following plot displays the observed data for the 2022/23 season on the scale of hospital admissions per 100,000 population, with the boundaries of the intensity categories denoted with horizontal lines:
 
 ```{r}
 # a data frame containing location FIPS codes and population values
@@ -264,7 +265,9 @@ ggplot() +
   geom_point(
     mapping = aes(x = target_end_date, y = output_type_id,
                   shape = factor(observation)),
-    data = observations_to_plot
+    color = "#888888",
+    size = 3, stroke = 2,
+    data = observations_to_plot
   ) +
   scale_shape_manual(
     values = c(1, 3),