Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using the new t_start/t_unit attributes in RandomVariable, remove padding whenever possible (hopefully everywhere). #216

Closed
Tracked by #185
gvegayon opened this issue Jun 26, 2024 · 9 comments · Fixed by #262
Labels
pyrenew related to pyrenew internals

Comments

@gvegayon
Copy link
Member

gvegayon commented Jun 26, 2024

Implementation notes

The main question right now is when to apply the time adjustments (i.e., padding, shifting, etc.). Note that it is not possible to request all calls to RandomVariable.sample() to output data of the same length as the observed data. Take the seeding process for example, which starts before the observation process, so it has a different length.

During RV instantiation

We could have the RVs be time-aware right during instantiation. The following points should be considered for this approach:

  1. The benefit: Mostly computational as we would reduce the number of calculations and may be able to pre-allocate some data (unsure about this last bit).

  2. Requirement(s): We would need to add a new argument for all constructors, namely, model_t_unit and n_obs, to adequately broadcast/aggregate the RV to match the model. So that way, all RandomVariable.sample() calls would generate the same length array.

  3. Pitfall(s): You need to have a clear idea of the full-time range of the model before running the model. Adding a seeding process or extending a time series beyond observed data becomes problematic.

    What happens if the length of the output of a call to RandomVariable.sample() is not easy to get before calling Model.run()? Is this case relevant? I can't think of a specific method/model in which this could happen, but if it does, it would be a pain.

During model setup

During the setup of the model, which is automatically called as a function of the Model.run() method, we could extract t_unit and t_start from all RVs and store that information for adjusting the RVs output as needed.

  1. The benefit: It would still be done as a init step, so it would save computations making the whole thing faster.

  2. Requirement(s): Require a way to list all RVs instances within a model. This wouldn't need to be recursive. As long as RVs within the model sample method are identified, it should be OK. An alternative would be to add a way to record RVs for the model; this could be added to the Model.validate() method.

    Using the retrieved information, the model could wrap RandomVariable.sample to broadcast/pad/summarize data as needed. This could be implemented via a decorator.

  3. Pitfall(s): Like the previous method, it would probably be complicated to anticipate the final size of all arrays involved in the process.

During model run

This would be the simplest way to approach the issue. During run time, the model is fully aware of the length of the RVs as we can directly inspect the resulting arrays by calling RandomVariable.sample():

  1. Benefit(s): Less complicated implementation. We adjust/query the array sizes when needed. This is also less automagical, which could be an issue.

  2. Requirement(s): The resulting tuples and namedtuples from the RandomVariable.sample() method are not time-aware. Therefore, if we wanted to sync these arrays, we would have to do so using both the RV member and the output from the sample function. Following the time.qmd vignette, the actual implementation would be along these lines (an awkward implementation):

    Rt_aligned, infections_aligned = align([
        [self.rt_rv, Rt],
        [self.inf_rv, infections]
    ])

    To avoid a clunky implementation, we may need to create a wrapper class of the output from RandomVariable.sample(). That way, the previous code would be reduced to:

    Rt_aligned, infections_aligned = align([Rt, infections])
  3. Pitfall(s): Even if we can live with that way of always passing both the RV and the array, we would still need to deal with potential computational bottle-necks, as the calculations adjusting size may be relatively inefficient.

@gvegayon gvegayon changed the title Using the new attributes, remove padding whenever possible (hopefully everywhere). Using the new timeseries_* attributes in RandomVariable, remove padding whenever possible (hopefully everywhere). Jun 26, 2024
@gvegayon gvegayon added this to the 🐏 Micropachycephalosaurus milestone Jun 26, 2024
@gvegayon
Copy link
Member Author

gvegayon commented Jul 3, 2024

(all content now moved to the main section of the issue)

@gvegayon
Copy link
Member Author

gvegayon commented Jul 9, 2024

@CDCgov/multisignal-epi-inference-devs, I just finished a first pass on the notes about the implementation of the time-aware component. Please take a look and let me know what you think and if I'm missing something. Once I get some feedback, I can start working on a proposal for approaching this issue. I am taking more time to think about this problem as it could possible have a big impact across the package.

@damonbayer
Copy link
Collaborator

damonbayer commented Jul 9, 2024

Thanks for writing this up, @gvegayon!

I am inferring that you think it is important or desirable to have all RV's output objects of the same length. Is that correct? If so, why? For example, why would you want the overdispersion parameter of a observation process to have the same length as the number of latent infections? Is the idea that all of these parameters could or should be time-varying?

Quotes that I read that gave me this impression:

Note that it is not possible to request all calls to RandomVariable.sample() to output data of the same length as the observed data.

So that way, all RandomVariable.sample() calls would generate the same length array.

@gvegayon
Copy link
Member Author

gvegayon commented Jul 9, 2024

Great question, @damonbayer! That would be my first first approach. As we have done in other places with the model, matching lengths and padding with 0/NaN just works. We could try to be more fancy about it, but keeping all of the same length should make things easier, I believe. Let me think more about it.

@damonbayer
Copy link
Collaborator

Happy to discuss more in detail, but, overall, I think some combination of "During model setup" and "During model run" will be the best approach.

@gvegayon
Copy link
Member Author

Thanks for today's discussion, @damonbayer and @dylanhmorris. This is what I got out of it:

  • Not everything needs to be of length $n$. Mainly, it would be good to allow scalars to stay as scalars. Everything else will be length-adjusted as needed (so not always).

  • Since the information about time is relevant to the sample output, we will bring it closer by creating a NameTuple instance called TimeArray. This will replace the Jax arrays, allowing us to pass along information about time to returns of RandomVariable.sample().

  • Again, regarding standardizing length, not all need to be $n$. As I implement this, I will keep array resizing to a minimum. We will have a better idea of how good/bad would be to have everything matching in length.

  • There will be an issue when retrieving stored sites. Internally, TimeArray will hold all the needed information, but the stored arrays won't have access to time-related information. So, for instance, if RandomVariable.sample() returns a biweekly array, outside of the model call we won't be able to much it just yet. This is a shortcoming that we will need to address in the future.

@gvegayon gvegayon linked a pull request Jul 11, 2024 that will close this issue
@damonbayer damonbayer added the pyrenew related to pyrenew internals label Jul 12, 2024
@damonbayer damonbayer modified the milestones: N Sprint, O sprint Jul 19, 2024
@gvegayon
Copy link
Member Author

As a note for our future selves: We decided to call these SampledValue instead of TimeArray. The PR associated with this issue ensures all RandomVariable.sample() calls return tuple/namedtuple of SampledValue.

@dylanhmorris dylanhmorris changed the title Using the new timeseries_* attributes in RandomVariable, remove padding whenever possible (hopefully everywhere). Using the new t_start/t_unit attributes in RandomVariable, remove padding whenever possible (hopefully everywhere). Jul 25, 2024
@damonbayer
Copy link
Collaborator

@dylanhmorris @gvegayon We did not address padding in #262

@damonbayer
Copy link
Collaborator

Closing this for now, as padding is not really used in any of the models, and I am not convinced it should be.

@damonbayer damonbayer closed this as not planned Won't fix, can't repro, duplicate, stale Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pyrenew related to pyrenew internals
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants