This folder contains a set of subdirectories, one for each model, each containing the submitted model output files for that model. The structure of these directories and their contents follows general hubverse model output guidelines. Specific documentation for the COVID-19 Forecast Hub follows.
All forecasts should be submitted directly to the model-output/ folder. Data in this directory should be added to the repository through a pull request so that automatic data validation checks are run.
These instructions provide detail about the data format as well as validation that you can do prior to this pull request. In addition, we describe metadata that each model should provide in the model-metadata folder.
Table of Contents
- What is a forecast
- Target data
- Data formatting
- Forecast file format
- Forecast data validation
- Weekly ensemble build
- Policy on late submissions
Models are asked to make specific quantitative forecasts about data that will be observed in the future. These forecasts are interpreted as "unconditional" predictions about the future. That is, they are not predictions only for a limited set of possible future scenarios in which a certain set of conditions (e.g. vaccination uptake is strong, or new social-distancing mandates are put in place) hold about the future -- rather, they should characterize uncertainty across all reasonable future scenarios. In practice, all forecasting models make some assumptions about how current trends in data may change and impact the forecasted outcome; some teams select a "most likely" scenario or combine predictions across multiple scenarios that may occur. Forecasts submitted to this repository will be evaluated against observed data.
We note that other modeling efforts, such as the Influenza Scenario Modeling Hub, have been launched to collect and aggregate model outputs from "scenario projection" models. These models create longer-term projections under a specific set of assumptions about how the main drivers of the pandemic (such as non-pharmaceutical intervention compliance, or vaccination uptake) may change over time.
This project treats laboratory-confirmed COVID-19 hospital admissions data reported through CDC's NHSN (National Healthcare Safety Network) system as the target ("gold standard") data for forecasting. The specific forecasting target is epiweekly total incident hospital admissions.
Details on data schemas and endpoints will be updated as information becomes available.
NHSN's Hospital Respiratory Reporting page contains a useful overview of the dataset.
The automatic checks in place for forecast files submitted to this repository validate both the filename and the file contents to ensure the file can be used in the visualization and ensemble forecasting.
Each model that submits forecasts for this project will have a unique subdirectory within the `model-output/` directory in this GitHub repository where forecasts will be submitted. Each subdirectory must be named
team-model
where `team` is the team name and `model` is the name of your model.
Both `team` and `model` should be less than 15 characters and not include hyphens or other special characters, with the exception of "_".
The combination of `team` and `model` should be unique from any other model in the project.
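The naming rules above can be checked before opening a pull request. The following is a minimal sketch, not the hub's own validation: the function name `split_model_id` is hypothetical, and the regular expression encodes one reading of the rules (letters, digits, and underscores only, at most 14 characters per part).

```python
import re

# Assumed reading of the rules: alphanumeric characters and "_" only,
# and fewer than 15 characters for each of team and model.
_NAME_PART = re.compile(r"[A-Za-z0-9_]{1,14}")


def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'team-model' directory name and check both parts."""
    # Split on the first hyphen; any further hyphen ends up in `model`
    # and is rejected by the pattern below.
    team, hyphen, model = model_id.partition("-")
    if not hyphen:
        raise ValueError(f"{model_id!r} is not of the form team-model")
    for label, part in (("team", team), ("model", model)):
        if not _NAME_PART.fullmatch(part):
            raise ValueError(f"invalid {label} name: {part!r}")
    return team, model


print(split_model_id("exampleteam-examplemodel"))
```

Uniqueness of the `team` and `model` combination across the project is not covered by this sketch, since it depends on the other directories already in the repository.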
The metadata file will be saved within the `model-metadata` directory in the Hub's GitHub repository. It should be a YAML file with base name `{team}-{model}` and extension `.yml` or `.yaml`, e.g.
exampleteam-examplemodel.yml
otherteam-othermodel.yaml
Details on the content and formatting of metadata files are provided in the model-metadata README.
Each forecast file should have the following format
{YYYY-MM-DD}-{team}-{model}.csv
or
{YYYY-MM-DD}-{team}-{model}.parquet
depending on whether the team is submitting forecasts as `.csv` files or as `.parquet` files,
where
- `YYYY` is the 4-digit year,
- `MM` is the 2-digit month,
- `DD` is the 2-digit day,
- `team` is the abbreviated team name, and
- `model` is the abbreviated name of your model.
The date `YYYY-MM-DD` is the `reference_date`. This should be the Saturday following the submission date. For example, a submission from the team above for a reference date of November 2, 2024 will be named:
2024-11-02-exampleteam-examplemodel.csv
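As an illustration of the date rule, the sketch below derives the reference date and file name from a submission date. The function names are hypothetical, and it assumes that a submission made on a Saturday maps to that same Saturday (the last day of the submission epiweek).

```python
from datetime import date, timedelta

SATURDAY = 5  # date.weekday() counts Monday as 0, so Saturday is 5


def reference_date_for(submission_date: date) -> date:
    """Return the Saturday that ends the epiweek containing submission_date."""
    days_ahead = (SATURDAY - submission_date.weekday()) % 7
    return submission_date + timedelta(days=days_ahead)


def forecast_file_name(submission_date: date, team: str, model: str,
                       extension: str = "csv") -> str:
    """Build '{YYYY-MM-DD}-{team}-{model}.{extension}' for a submission."""
    reference = reference_date_for(submission_date)
    return f"{reference.isoformat()}-{team}-{model}.{extension}"


# A submission on Wednesday, October 30, 2024:
print(forecast_file_name(date(2024, 10, 30), "exampleteam", "examplemodel"))
```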
The `team` and `model` in this file name must match the `team` and `model` in the name of the directory this file is in. Both `team` and `model` should be less than 15 characters, alphanumeric and underscores only, with no spaces or hyphens. Quantile and sample forecasts for a given week must be submitted in the same weekly csv or parquet submission file.
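A quick local check of the file name against its directory might look like the sketch below. It is only an approximation of what the automated checks do; the function name `check_forecast_path` and the example path are assumptions made for illustration.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# {YYYY-MM-DD}-{team}-{model}.csv or .parquet
_FILE_NAME = re.compile(
    r"(\d{4}-\d{2}-\d{2})-([A-Za-z0-9_]+)-([A-Za-z0-9_]+)\.(csv|parquet)"
)


def check_forecast_path(path: str) -> date:
    """Check a 'team-model/file' path and return the reference date."""
    file_path = PurePosixPath(path)
    match = _FILE_NAME.fullmatch(file_path.name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_path.name!r}")
    reference = date.fromisoformat(match.group(1))
    team, model = match.group(2), match.group(3)
    # The directory name must repeat the team and model from the file name.
    if file_path.parent.name != f"{team}-{model}":
        raise ValueError("team/model in file name do not match the directory")
    if reference.weekday() != 5:  # Monday is 0, so Saturday is 5
        raise ValueError("the date in the file name is not a Saturday")
    return reference


print(check_forecast_path(
    "model-output/exampleteam-examplemodel/"
    "2024-11-02-exampleteam-examplemodel.csv"
))
```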
The file must be a comma-separated value (csv) or parquet file with the following columns (in any order):
reference_date
target
horizon
target_end_date
location
output_type
output_type_id
value
No additional columns are allowed.
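The column requirement can be checked on the header row alone. The sketch below handles the csv case with the standard library; `column_problems` is a hypothetical helper, not part of the hub's tooling.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "reference_date", "target", "horizon", "target_end_date",
    "location", "output_type", "output_type_id", "value",
}


def column_problems(csv_text: str) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) column names for a csv submission."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    found = set(header)
    return REQUIRED_COLUMNS - found, found - REQUIRED_COLUMNS


# Column order does not matter, so a shuffled header passes.
header_line = ("value,location,target,horizon,reference_date,"
               "target_end_date,output_type_id,output_type\n")
missing, unexpected = column_problems(header_line)
print(missing, unexpected)
```

Both returned sets are empty for a valid file.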
The value in each row of the file is either a quantile or sample for a particular combination of location, date, and horizon.
Values in the `reference_date` column must be a date in the ISO format `YYYY-MM-DD`.
This is the date from which all forecasts should be considered. This date is the Saturday following the submission due date, corresponding to the last day of the epiweek when submissions are made. The `reference_date` should be the same as the date in the filename but is included here to facilitate validation and analysis.
Values in the `target` column must be a character (string) and be the following specific target:
wk inc covid hosp
Values in the `horizon` column indicate the number of weeks between the `reference_date` and the `target_end_date`. For submissions to the COVID-19 Forecast Hub, this should be an integer between -1 and 3. It indicates the epidemiological week ("epiweek") being forecast/nowcast relative to the epiweek containing the forecast submission date ("the submission epiweek").
A `horizon` of -1 indicates that the prediction is a nowcast for ultimately reported data from the epiweek prior to the submission epiweek. A `horizon` of 1 indicates that the prediction is a forecast for the epiweek following the submission epiweek.
Note that the COVID-19 Forecast Hub uses US CDC / MMWR epiweeks, which begin on Sunday and end on Saturday, not ISO epiweeks.
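To make the Sunday-to-Saturday convention concrete, the sketch below computes the first and last day of the epiweek containing a given date. The helper name `epiweek_bounds` is hypothetical; it only finds the week boundaries and does not compute MMWR week numbers.

```python
from datetime import date, timedelta


def epiweek_bounds(day: date) -> tuple[date, date]:
    """Return the (Sunday, Saturday) bounding the US CDC epiweek of `day`."""
    # date.weekday() counts Monday as 0 and Sunday as 6.
    days_since_sunday = (day.weekday() + 1) % 7
    start = day - timedelta(days=days_since_sunday)
    return start, start + timedelta(days=6)


# Wednesday, October 30, 2024 falls in the epiweek ending on the
# reference date used in the file name example (November 2, 2024).
print(epiweek_bounds(date(2024, 10, 30)))
```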
Values in the `target_end_date` column must be a date in the format `YYYY-MM-DD`.
This should be a Saturday, the last date of the forecast target's US CDC epiweek. Within each row of the submission file, the `target_end_date` should be equal to the `reference_date` + `horizon` * (7 days).
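The relationship between the three columns is simple date arithmetic. A minimal sketch, with a hypothetical function name and the horizon range taken from the description above:

```python
from datetime import date, timedelta

ALLOWED_HORIZONS = range(-1, 4)  # -1, 0, 1, 2, 3


def expected_target_end_date(reference_date: date, horizon: int) -> date:
    """Return reference_date + horizon * (7 days) for an allowed horizon."""
    if horizon not in ALLOWED_HORIZONS:
        raise ValueError(f"horizon must be between -1 and 3, got {horizon}")
    return reference_date + timedelta(weeks=horizon)


reference = date(2024, 11, 2)
for horizon in ALLOWED_HORIZONS:
    print(horizon, expected_target_end_date(reference, horizon))
```

Because the reference date is a Saturday and each step is exactly seven days, every resulting `target_end_date` is also a Saturday.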
Values in the `location` column must be one of the "locations" in this file which includes 2-digit numeric FIPS codes for U.S. states, territories, and districts, as well as "US" as a two-character code for national forecasts.
Values in the `output_type` column should be one of
quantile
sample
This value indicates whether that row corresponds to a quantile forecast or a sample trajectory for weekly incident hospital admissions. Samples can either encode both temporal and spatial dependency across forecast `horizon`s and `location`s, or encode only temporal dependency across `horizon`s while treating each `location` independently.
Values in the `output_type_id` column specify identifying information for the output type.
When the predictions are quantiles, values in the `output_type_id` column are a quantile probability level in the format
0.###
This value indicates the quantile probability level for the `value` in this row.
Teams must provide the following 23 quantiles:
[
0.01,
0.025,
0.05,
0.10,
0.15,
0.20,
0.25,
0.30,
0.35,
0.40,
0.45,
0.50,
0.55,
0.60,
0.65,
0.70,
0.75,
0.80,
0.85,
0.90,
0.95,
0.975,
0.99
]
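The required levels follow a regular pattern: every 0.05 from 0.05 to 0.95, plus two extra levels in each tail. The sketch below rebuilds the list and reports any level missing from a submission; the helper name `missing_levels` and the rounding to three decimals are choices made for this example.

```python
# 0.01 and 0.025, then 0.05 to 0.95 in steps of 0.05, then 0.975 and 0.99.
REQUIRED_LEVELS = (
    [0.01, 0.025]
    + [round(0.05 * step, 2) for step in range(1, 20)]
    + [0.975, 0.99]
)


def missing_levels(submitted: list[float]) -> list[float]:
    """Return the required quantile levels absent from `submitted`."""
    # Round to three decimals so that values parsed from text such as
    # "0.150" compare equal to 0.15.
    present = {round(level, 3) for level in submitted}
    return [level for level in REQUIRED_LEVELS if level not in present]


print(len(REQUIRED_LEVELS))
print(missing_levels([0.025, 0.5, 0.975]))
```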
When the predictions are samples, values in the `output_type_id` column are indexes for the samples. The `output_type_id` is used to indicate the dependence across multiple task id variables when samples come from a joint predictive distribution. For example, samples from a joint predictive distribution across `horizon`s for a given `location` will share an `output_type_id` for predictions for different `horizon`s within the same `location`, as shown in the table below:
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-10-15 | -1 | MA | sample | s0 | - |
2024-10-15 | 0 | MA | sample | s0 | - |
2024-10-15 | 1 | MA | sample | s0 | - |
2024-10-15 | -1 | NH | sample | s1 | - |
2024-10-15 | 0 | NH | sample | s1 | - |
2024-10-15 | 1 | NH | sample | s1 | - |
2024-10-15 | -1 | MA | sample | s2 | - |
2024-10-15 | 0 | MA | sample | s2 | - |
2024-10-15 | 1 | MA | sample | s2 | - |
2024-10-15 | -1 | NH | sample | s3 | - |
2024-10-15 | 0 | NH | sample | s3 | - |
2024-10-15 | 1 | NH | sample | s3 | - |
Here, each `output_type_id` (for example `s0` or `s1`) identifies one sample whose predictions for horizons -1, 0, and 1 are part of the same joint distribution for a single `location`. Samples from a joint distribution across `horizon`s and `location`s can be specified by a shared `output_type_id` across `location`s and `horizon`s, as shown in the example below:
origin_date | horizon | location | output_type | output_type_id | value |
---|---|---|---|---|---|
2024-10-15 | -1 | MA | sample | S0 | - |
2024-10-15 | 0 | MA | sample | S0 | - |
2024-10-15 | 1 | MA | sample | S0 | - |
2024-10-15 | -1 | NH | sample | S0 | - |
2024-10-15 | 0 | NH | sample | S0 | - |
2024-10-15 | 1 | NH | sample | S0 | - |
2024-10-15 | -1 | MA | sample | S1 | - |
2024-10-15 | 0 | MA | sample | S1 | - |
2024-10-15 | 1 | MA | sample | S1 | - |
2024-10-15 | -1 | NH | sample | S1 | - |
2024-10-15 | 0 | NH | sample | S1 | - |
2024-10-15 | 1 | NH | sample | S1 | - |
The above table shows two samples, indexed by `output_type_id` values `S0` and `S1`, from a joint predictive distribution across `location`s and `horizon`s.
More details on sample output can be found in the hubverse documentation of sample output type.
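The two indexing schemes in the tables above can be generated programmatically. The sketch below is illustrative only: the function name `sample_rows`, the `s` prefix, and the reduced set of columns are assumptions, and the `value` column is left out because it would come from the model.

```python
def sample_rows(locations: list[str], horizons: list[int], n_samples: int,
                joint_across_locations: bool) -> list[dict]:
    """Assign output_type_id values for sample forecasts."""
    rows = []
    for draw in range(n_samples):
        for loc_index, location in enumerate(locations):
            if joint_across_locations:
                # One id per draw, shared by all locations and horizons.
                sample_id = f"s{draw}"
            else:
                # One id per (draw, location), shared across horizons only.
                sample_id = f"s{draw * len(locations) + loc_index}"
            for horizon in horizons:
                rows.append({
                    "location": location,
                    "horizon": horizon,
                    "output_type": "sample",
                    "output_type_id": sample_id,
                })
    return rows


for row in sample_rows(["MA", "NH"], [-1, 0, 1], 2, joint_across_locations=False):
    print(row)
```

With `joint_across_locations=False` the ids follow the first table (`s0` for MA, `s1` for NH, `s2` for MA, `s3` for NH); with `True` all rows of a draw share one id, as in the second table.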
Values in the `value` column are non-negative numbers indicating the "quantile" or "sample" prediction for this row. For a "quantile" prediction, `value` is the inverse of the cumulative distribution function (CDF) for the target, location, and quantile associated with that row. For example, the 0.025 and 0.975 quantiles for a given target and location should capture 95% of the predicted values and correspond to the central 95% prediction interval.
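Two consequences of this definition are easy to check in code: quantile values must not decrease as the probability level increases (a property of any inverse CDF), and a central interval is read off from a symmetric pair of levels. The sketch below uses hypothetical helper names and made-up example values.

```python
def is_valid_quantile_function(quantiles: dict[float, float]) -> bool:
    """Check that values are non-negative and non-decreasing in the level."""
    values = [quantiles[level] for level in sorted(quantiles)]
    non_negative = all(value >= 0 for value in values)
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    return non_negative and non_decreasing


def central_interval(quantiles: dict[float, float],
                     coverage: float) -> tuple[float, float]:
    """Return the (lower, upper) bounds of a central prediction interval."""
    alpha = 1 - coverage
    lower_level = round(alpha / 2, 3)      # 0.025 for 95% coverage
    upper_level = round(1 - alpha / 2, 3)  # 0.975 for 95% coverage
    return quantiles[lower_level], quantiles[upper_level]


# Made-up values for a few of the required levels.
example = {0.025: 120.0, 0.25: 180.0, 0.5: 210.0, 0.75: 250.0, 0.975: 340.0}
print(is_valid_quantile_function(example))
print(central_interval(example, 0.95))
```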
To ensure proper data formatting, pull requests for new data in `model-output/` will be validated automatically. Optionally, you may also run these validations locally.
When a pull request is submitted, the data are validated through GitHub Actions, which runs the tests present in the hubValidations package. The intent of these tests is to validate the requirements above. Please let us know if you are facing issues while running the tests.
Optionally, you may validate a forecast file locally before submitting it to the hub in a pull request. Note that this is not required, since the validations will also run on the pull request. To run the validations locally, follow the steps described here.
Every Thursday morning, we will generate a CovidHub ensemble hospital admission forecast using valid forecasts submitted in the current week by the Wednesday 11 PM ET deadline. Some or all participant forecasts may be combined into an ensemble forecast to be published in real-time along with the participant forecasts. In addition, some or all forecasts may be displayed alongside the output of a baseline model for comparison.
In order to ensure that forecasting is done in real-time, all forecasts are required to be submitted to this repository by 11 PM ET on Wednesdays each week. We do not accept late forecasts.
Forecasts will be evaluated using a variety of metrics, including the weighted interval score (WIS).
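For orientation, the sketch below computes the WIS for a single quantile forecast in a commonly used formulation. It relies on the fact that, for quantile levels placed symmetrically around a median, the WIS equals twice the average quantile (pinball) loss. The function names are hypothetical and this is not the hub's evaluation code, whose exact settings are not described here.

```python
def pinball_loss(level: float, predicted: float, observed: float) -> float:
    """Quantile loss of one predicted quantile against the observation."""
    if observed >= predicted:
        return level * (observed - predicted)
    return (1 - level) * (predicted - observed)


def weighted_interval_score(quantiles: dict[float, float],
                            observed: float) -> float:
    """WIS for quantile levels that are symmetric around 0.5.

    Assumes `quantiles` maps each level to its predicted value and
    includes the median (level 0.5).
    """
    total = sum(pinball_loss(level, value, observed)
                for level, value in quantiles.items())
    return 2 * total / len(quantiles)


# Made-up forecast with a median and one central 50% interval.
forecast = {0.25: 8.0, 0.5: 10.0, 0.75: 12.0}
print(weighted_interval_score(forecast, observed=10.0))
```

Lower scores are better: the score grows with the width of the intervals and with the distance of the observation from them.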