-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generalized target data generation to both flu and covid #35
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I had a few minor suggestions and questions.
target_out_dir = Path(target_out_dir) | ||
target_data_df = get_target_data_df(hub_dir, 'target-hospital-admissions.csv') | ||
target_data_df = get_target_data_df(hub_dir, hub_config.target_data_file_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was discussion on Slack about making the target_data_file_name
optional so that a hub without target data could still use the tool. My understanding is that right now, this get_target_data_df
call will end up raising an error if target_data_file_name
is None
though? Is that right?
Handling this case seems like a low priority to me (we don't currently have users asking for this feature). Should we file an issue to handle this later, or handle it now? And if we defer handling it to later, perhaps we should just make the target_data_file_name
field required for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, target_data_file_name is optional. If it's missing or incorrect, it should be caught by get_target_data_df()
, which catches FileNotFoundError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. What I'm seeing in get_target_data_df()
is that it catches and then re-raises the FileNotFoundError
. Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out? I guess I was expecting to see some kind of early exit from target file creation if hub_config.target_data_file_name is None
, but maybe I'm not seeing how the pieces of the full workflow fit together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out?
I was originally thinking that the FileNotFoundError
is reasonable, but thinking about it from the perspective of running a pipeline, I would actually want it to emit a message saying that there is no target data to generate instead of an error because when I'm running the pipeline, I want errors to show up for hub dashboards that do specify target data. If they don't have target data to generate and an error shows up, it's just noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So do you two want me to log an error and exit if the file is missing, but not actually raise an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My vote would be:
- If a file path for target data was specified but doesn't exist, we should throw an error. Something was misconfigured and we're not able to complete the task we were asked to do. We should raise an error to tell people about the problem.
- If a file path for target data was not specified (we get
None
here), we gracefully exit without an error. We were unable to produce target data, but we were not asked to produce target data, so there's not actually an error condition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Evan's breakdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I implemented this. I do logger.info() for the first bullet, and logger.error() for the second. In both cases I simply return
, not sys.exit()
. How's that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both cases I simply return, not sys.exit(). How's that?
Almost! If I understand correctly, both of these will run through without causing a non-zero exit code on the CLI? Instead, if you emit a logger.error()
, then that should be paired with a sys.exit(1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Fixed.
Thanks, Evan! I incorporated your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @matthewcornell! This looks good and is a big step toward generalization of target data generation 🎉
I have one big suggestion to not have ptc_generate_target_json_files
fail if there is no target data specified in the config file.
The only other thing I would suggest (but this may be for the future): instead of having hard-coded names for date, value, and location, we could implement it as a dictionary in the config file like we do the location mappings. For example:
target_data_cols:
date: date_onset
value: observed
location: province
target_out_dir = Path(target_out_dir) | ||
target_data_df = get_target_data_df(hub_dir, 'target-hospital-admissions.csv') | ||
target_data_df = get_target_data_df(hub_dir, hub_config.target_data_file_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this ultimately mean that if the ptc config file doesn't contain that field, the generation of target data files will error out?
I was originally thinking that the FileNotFoundError
is reasonable, but thinking about it from the perspective of running a pipeline, I would actually want it to emit a message saying that there is no target data to generate instead of an error because when I'm running the pipeline, I want errors to show up for hub dashboards that do specify target data. If they don't have target data to generate and an error shows up, it's just noise.
Yes, I thought of that too, but didn't document my decision. I suggest we leave it hard coded for now. I like how you used a dictionary in your example. |
Would you like me to open an issue to track this feature? |
this complies with hubverse-org/hub-dashboard-predtimechart#35
This complies with hubverse-org/hub-dashboard-predtimechart#35 and also adds the task ID text.
This PR generalizes predtimechart target data generation to work with both flu and covid. The main points:
generate_target_json_files_FluSight.py
togenerate_target_json_files.py
.main()
function takes a newptc_config_file
parameter, making the signature(hub_dir, ptc_config_file, target_out_dir)
. This is similar togenerate_json_files.py
'smain()
signature:(hub_dir, ptc_config_file, options_file_out, forecasts_out_dir, regenerate)
.get_target_data_df()
'sread_csv()
call to force parsing thelocation
column as String for the covid case. (Inferred schema worked for flu due to a "US" row coming sooner in the file.)target_data_file_name
field topredtimechart-config.yml
files. See the tests for examples. Updatedptc_schema.py
andHubConfig
to support it.pyproject.toml
to rename thegenerate_target_json_files_FluSight
entry point togenerate_target_json_files
.README.md
.