generalized target data generation to both flu and covid #35

matthewcornell · 2024-12-03T15:49:38Z

This PR generalizes predtimechart target data generation to work with both flu and covid. The main points:

I renamed generate_target_json_files_FluSight.py to generate_target_json_files.py.
Its main() function takes a new ptc_config_file parameter, making the signature (hub_dir, ptc_config_file, target_out_dir). This is similar to generate_json_files.py's main() signature: (hub_dir, ptc_config_file, options_file_out, forecasts_out_dir, regenerate).
I had to specify a schema in get_target_data_df()'s read_csv() call to force parsing the location column as String for the covid case. (Inferred schema worked for flu due to a "US" row coming sooner in the file.)
Various function renames and consolidations into one file.
I added an optional target_data_file_name field to predtimechart-config.yml files. See the tests for examples. Updated ptc_schema.py and HubConfig to support it.
I did not add column name fields to it (they remain hard-coded as 'date', 'value', and 'location') since they're the same for flu and covid.
Added files and tests for covid target generation.
Updated pyproject.toml to rename the generate_target_json_files_FluSight entry point to generate_target_json_files.
Updated README.md.
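
For readers unfamiliar with the schema issue mentioned above, here is a minimal stdlib-only sketch of why an inferred schema can depend on where the "US" row falls. It is a toy model of sample-based type inference, not the CSV library's actual algorithm or the code in this PR; the function name, the sample CSV text, and the `infer_rows` parameter are all assumptions made for illustration.

```python
import csv
import io


def infer_location_type(csv_text, infer_rows):
    """Toy inference: 'Int64' if every sampled location is all digits, else 'String'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Only the first `infer_rows` data rows are inspected, as a sampling reader would do.
    sample = [row["location"] for _, row in zip(range(infer_rows), reader)]
    return "Int64" if all(value.isdigit() for value in sample) else "String"


# Hypothetical data: in the flu-like file the "US" row comes early, in the covid-like file late.
FLU_LIKE = "date,location,value\n2024-11-23,US,100\n2024-11-23,01,5\n2024-11-23,02,7\n"
COVID_LIKE = "date,location,value\n2024-11-23,01,5\n2024-11-23,02,7\n2024-11-23,US,100\n"

flu_type = infer_location_type(FLU_LIKE, infer_rows=2)      # sample includes "US"
covid_type = infer_location_type(COVID_LIKE, infer_rows=2)  # sample is only "01", "02"

# Forcing the column to be read as text sidesteps inference and keeps leading zeros.
covid_locations = [row["location"] for row in csv.DictReader(io.StringIO(COVID_LIKE))]
```

In this toy model the flu-like file is typed as a string column while the covid-like file is typed as integers, which would then fail (or lose the leading zeros) once "US" is reached. Specifying the column type up front removes the dependence on row order.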

elray1

Looks good to me! I had a few minor suggestions and questions.

src/hub_predtimechart/app/generate_target_json_files.py

src/hub_predtimechart/hub_config.py

elray1 · 2024-12-03T16:18:33Z

src/hub_predtimechart/app/generate_target_json_files.py

    target_out_dir = Path(target_out_dir)
-    target_data_df = get_target_data_df(hub_dir, 'target-hospital-admissions.csv')
+    target_data_df = get_target_data_df(hub_dir, hub_config.target_data_file_name)


There was discussion on Slack about making the target_data_file_name optional so that a hub without target data could still use the tool. My understanding is that right now, this get_target_data_df call will end up raising an error if target_data_file_name is None though? Is that right?

Handling this case seems like a low priority to me (we don't currently have users asking for this feature). Should we file an issue to handle this later, or handle it now? And if we defer handling it to later, perhaps we should just make the target_data_file_name field required for now?

Yes, target_data_file_name is optional. If it's missing or incorrect, it should be caught by get_target_data_df(), which catches FileNotFoundError.

Thanks. What I'm seeing in get_target_data_df() is that it catches and then re-raises the FileNotFoundError. Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out? I guess I was expecting to see some kind of early exit from target file creation if hub_config.target_data_file_name is None, but maybe I'm not seeing how the pieces of the full workflow fit together.

> Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out?

I was originally thinking that the FileNotFoundError is reasonable, but thinking about it from the perspective of running a pipeline, I would actually want it to emit a message saying that there is no target data to generate instead of an error because when I'm running the pipeline, I want errors to show up for hub dashboards that do specify target data. If they don't have target data to generate and an error shows up, it's just noise.

So do you two want me to log an error and exit if the file is missing, but not actually raise an error?

My vote would be:

If a file path for target data was specified but doesn't exist, we should throw an error. Something was misconfigured and we're not able to complete the task we were asked to do. We should raise an error to tell people about the problem.

If a file path for target data was not specified (we get None here), we gracefully exit without an error. We were unable to produce target data, but we were not asked to produce target data, so there's not actually an error condition here.

I agree with Evan's breakdown.

OK, I implemented this. I do logger.error() for the first bullet (a file was specified but doesn't exist), and logger.info() for the second (no file was specified). In both cases I simply return, not sys.exit(). How's that?

> In both cases I simply return, not sys.exit(). How's that?

Almost! If I understand correctly, both of these will run through without causing a non-zero exit code on the CLI? Instead, if you emit a logger.error(), then that should be paired with a sys.exit(1).

Makes sense. Fixed.
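
For reference, the behavior agreed on in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the helper name `find_target_data_file` and the `target-data` subdirectory are hypothetical, and only the two outcomes (quiet return when no file is configured, logged error plus non-zero exit when a configured file is missing) come from the discussion.

```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def find_target_data_file(hub_dir, target_data_file_name):
    """Return the target data path, or None if the config does not name one.

    Exits with status 1 if a file name was configured but the file is missing.
    """
    if target_data_file_name is None:
        # Not asked to produce target data, so this is not an error condition.
        logger.info("No target_data_file_name in config; skipping target data generation.")
        return None

    # Assumption: target data lives in a 'target-data' directory under the hub root.
    target_file = Path(hub_dir) / "target-data" / target_data_file_name
    if not target_file.is_file():
        # Misconfiguration: report it and make the CLI return a non-zero exit code.
        logger.error("Target data file not found: %s", target_file)
        sys.exit(1)

    return target_file
```

A caller would return early when the result is None, so pipelines running against hubs without target data see an info message and a zero exit code, while a misconfigured file name still fails loudly.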

tests/hub_predtimechart/test_generate_target_data.py

matthewcornell · 2024-12-03T16:46:02Z

Thanks, Evan! I incorporated your suggestions.

zkamvar

Thank you, @matthewcornell! This looks good and is a big step toward generalization of target data generation 🎉

I have one big suggestion to not have ptc_generate_target_json_files fail if there is no target data specified in the config file.

The only other thing I would suggest (but this may be for the future): instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings. For example:

target_data_cols:
  date: date_onset
  value: observed
  location: province
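
To make the suggestion concrete, here is a small sketch of how such a mapping might be applied once the config is loaded. It is hypothetical: `target_data_cols` is not a supported config field in this PR, and the function name, default mapping constant, and sample row are assumptions made for illustration.

```python
# Default mapping: canonical name -> column name in the target data file.
DEFAULT_TARGET_DATA_COLS = {"date": "date", "value": "value", "location": "location"}


def to_canonical_rows(rows, target_data_cols=None):
    """Rename hub-specific columns to the canonical 'date', 'value', 'location' names.

    `rows` is a list of dicts (one per CSV row). `target_data_cols` is the optional
    mapping from the config; any key it omits falls back to the default name.
    """
    cols = {**DEFAULT_TARGET_DATA_COLS, **(target_data_cols or {})}
    return [{canonical: row[source] for canonical, source in cols.items()} for row in rows]


# Usage with the mapping from the example above and one made-up row.
example_mapping = {"date": "date_onset", "value": "observed", "location": "province"}
example_rows = [{"date_onset": "2024-11-30", "observed": "12", "province": "ON"}]
canonical = to_canonical_rows(example_rows, example_mapping)
```

With no mapping supplied, the function keeps the current hard-coded names, so existing flu and covid configs would behave the same as they do now.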

pyproject.toml

src/hub_predtimechart/app/generate_target_json_files.py

zkamvar · 2024-12-03T17:13:23Z

src/hub_predtimechart/app/generate_target_json_files.py

matthewcornell · 2024-12-03T17:39:31Z

> instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings

Yes, I thought of that too, but didn't document my decision. I suggest we leave it hard coded for now. I like how you used a dictionary in your example.

zkamvar · 2024-12-03T18:07:06Z

> > instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings
>
> Yes, I thought of that too, but didn't document my decision. I suggest we leave it hard coded for now. I like how you used a dictionary in your example.

Would you like me to open an issue to track this feature?

generalized target data generation to both flu and covid

6492920

matthewcornell requested review from bsweger, elray1 and zkamvar and removed request for bsweger December 3, 2024 15:50

elray1 requested changes Dec 3, 2024

generalized target data generation to both flu and covid

f3157ec

zkamvar requested changes Dec 3, 2024

generalized target data generation to both flu and covid

c099ad1

zkamvar mentioned this pull request Dec 3, 2024

update build data to account for generalized target data generation hubverse-org/hub-dashboard-control-room#6

Closed

generalized target data generation to both flu and covid

758c92b

zkamvar mentioned this pull request Dec 3, 2024

define target data column definitions in config file #36

Open

matthewcornell requested review from elray1 and zkamvar December 3, 2024 19:00

zkamvar approved these changes Dec 3, 2024

zkamvar mentioned this pull request Dec 3, 2024

update to comply with new hub-dashboard-predtimechart tool hubverse-org/hub-dashboard-control-room#7

Merged

matthewcornell merged commit 1dc5f3e into main Dec 3, 2024
1 check passed

matthewcornell deleted the target_data branch December 3, 2024 19:55

zkamvar added a commit to reichlab/flusight-dashboard that referenced this pull request Dec 3, 2024

specify target data file name

1c63506

this complies with hubverse-org/hub-dashboard-predtimechart#35

zkamvar added a commit to hubverse-org/hub-dashboard-template that referenced this pull request Dec 3, 2024

Update config for new version of ptc dashboard tool

1ee1301

This complies with hubverse-org/hub-dashboard-predtimechart#35 and also adds the task ID text.
