Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generalized target data generation to both flu and covid #35

Merged
merged 4 commits into from
Dec 3, 2024

Conversation

matthewcornell
Copy link
Collaborator

@matthewcornell matthewcornell commented Dec 3, 2024

This PR generalizes predtimechart target data generation to work with both flu and covid. The main points:

  • I renamed generate_target_json_files_FluSight.py to generate_target_json_files.py.
  • Its main() function takes a new ptc_config_file parameter, making the signature (hub_dir, ptc_config_file, target_out_dir). This is similar to generate_json_files.py's main() signature: (hub_dir, ptc_config_file, options_file_out, forecasts_out_dir, regenerate).
  • I had to specify a schema in get_target_data_df()'s read_csv() call to force parsing the location column as String for the covid case. (Inferred schema worked for flu due to a "US" row coming sooner in the file.)
  • Various function renames and consolidations into one file.
  • I added an optional target_data_file_name field to predtimechart-config.yml files. See the tests for examples. Updated ptc_schema.py and HubConfig to support it.
  • I did not add column name fields to it (they remain hard coded - 'date', 'value', and 'location') since they're the same for flu and covid.
  • Added files and tests for covid target generation.
  • Updated pyproject.toml to rename the generate_target_json_files_FluSight entry point to generate_target_json_files.
  • Updated README.md.

@matthewcornell matthewcornell requested review from bsweger, elray1 and zkamvar and removed request for bsweger December 3, 2024 15:50
Copy link
Contributor

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I had a few minor suggestions and questions.

src/hub_predtimechart/app/generate_target_json_files.py Outdated Show resolved Hide resolved
src/hub_predtimechart/app/generate_target_json_files.py Outdated Show resolved Hide resolved
src/hub_predtimechart/hub_config.py Show resolved Hide resolved
src/hub_predtimechart/hub_config.py Show resolved Hide resolved
target_out_dir = Path(target_out_dir)
target_data_df = get_target_data_df(hub_dir, 'target-hospital-admissions.csv')
target_data_df = get_target_data_df(hub_dir, hub_config.target_data_file_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was discussion on Slack about making the target_data_file_name optional so that a hub without target data could still use the tool. My understanding is that right now, this get_target_data_df call will end up raising an error if target_data_file_name is None though? Is that right?

Handling this case seems like a low priority to me (we don't currently have users asking for this feature). Should we file an issue to handle this later, or handle it now? And if we defer handling it to later, perhaps we should just make the target_data_file_name field required for now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, target_data_file_name is optional. If it's missing or incorrect, it should be caught by get_target_data_df(), which catches FileNotFoundError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. What I'm seeing in get_target_data_df() is that it catches and then re-raises the FileNotFoundError. Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out? I guess I was expecting to see some kind of early exit from target file creation if hub_config.target_data_file_name is None, but maybe I'm not seeing how the pieces of the full workflow fit together.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out?

I was originally thinking that the FileNotFoundError is reasonable, but thinking about it from the perspective of running a pipeline, I would actually want it to emit a message saying that there is no target data to generate instead of an error because when I'm running the pipeline, I want errors to show up for hub dashboards that do specify target data. If they don't have target data to generate and an error shows up, it's just noise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you two want me to log an error and exit if the file is missing, but not actually raise an error?

Copy link
Contributor

@elray1 elray1 Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote would be:

  • If a file path for target data was specified but doesn't exist, we should throw an error. Something was misconfigured and we're not able to complete the task we were asked to do. We should raise an error to tell people about the problem.
  • If a file path for target data was not specified (we get None here), we gracefully exit without an error. We were unable to produce target data, but we were not asked to produce target data, so there's not actually an error condition here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Evan's breakdown.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I implemented this. I do logger.info() for the first bullet, and logger.error() for the second. In both cases I simply return, not sys.exit(). How's that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In both cases I simply return, not sys.exit(). How's that?

Almost! If I understand correctly, both of these will run through without causing a non-zero exit code on the CLI? Instead, if you emit a logger.error(), then that should be paired with a sys.exit(1).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Fixed.

@matthewcornell
Copy link
Collaborator Author

Thanks, Evan! I incorporated your suggestions.

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @matthewcornell! This looks good and is a big step toward generalization of target data generation 🎉

I have one big suggestion to not have ptc_generate_target_json_files fail if there is no target data specified in the config file.

The only other thing I would suggest (but this may be for the future): instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings. For example:

target_data_cols:
  date: date_onset
  value: observed
  location: province

pyproject.toml Show resolved Hide resolved
src/hub_predtimechart/app/generate_target_json_files.py Outdated Show resolved Hide resolved
target_out_dir = Path(target_out_dir)
target_data_df = get_target_data_df(hub_dir, 'target-hospital-admissions.csv')
target_data_df = get_target_data_df(hub_dir, hub_config.target_data_file_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out?

I was originally thinking that the FileNotFoundError is reasonable, but thinking about it from the perspective of running a pipeline, I would actually want it to emit a message saying that there is no target data to generate instead of an error because when I'm running the pipeline, I want errors to show up for hub dashboards that do specify target data. If they don't have target data to generate and an error shows up, it's just noise.

src/hub_predtimechart/app/generate_target_json_files.py Outdated Show resolved Hide resolved
@matthewcornell
Copy link
Collaborator Author

instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings

Yes, I thought of that too, but didn't document my decision. I suggest we leave it hard coded for now. I like how you used a dictionary in your example.

@zkamvar
Copy link
Member

zkamvar commented Dec 3, 2024

instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings

Yes, I thought of that too, but didn't document my decision. I suggest we leave it hard coded for now. I like how you used a dictionary in your example.

Would you like me to open an issue to track this feature?

@matthewcornell matthewcornell merged commit 1dc5f3e into main Dec 3, 2024
1 check passed
@matthewcornell matthewcornell deleted the target_data branch December 3, 2024 19:55
zkamvar added a commit to reichlab/flusight-dashboard that referenced this pull request Dec 3, 2024
zkamvar added a commit to hubverse-org/hub-dashboard-template that referenced this pull request Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants