Skip to content

Implementing multistart version of theta_est using multiple sampling methods #3575

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

sscini
Copy link

@sscini sscini commented Apr 23, 2025

Fixes # .

Summary/Motivation:

Currently, the optimization is only done from a single initial value. This implementation adds the ability to specify multiple initial values using selected sampling techniques: from a random uniform distribution, using Latin Hypercube Sampling, or using Sobol Quasi-Monte Carlo sampling.

Changes proposed in this PR:

  • All changes made adding pseudocode in comments
  • Added inputs needed for multistart simulation
  • Added a function to generate points using the selected method
  • Added theta_est_multistart to work for the multistart process

TODO before converting from draft:

  • Receive feedback from collaborators on logical setup
  • Convert finalized pseudocode
  • Test and debug
  • Confirm function with examples

Legal Acknowledgement

By contributing to this software project, I have read the contribution guide and agree to the following terms and conditions for my contribution:

  1. I agree my contributions are submitted under the BSD license.
  2. I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

@sscini
Copy link
Author

sscini commented Apr 23, 2025

@djlaky @adowling2 Please provide early feedback.

@sscini
Copy link
Author

sscini commented Apr 30, 2025

Dynamic saving using flush, add.

Copy link
Member

@adowling2 adowling2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes from our in-person discussion/informal code review

# # If only one restart, return an empty list
# return []

# return {theta_names[i]: initial_theta[i] for i in range(len(theta_names))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed adding a "dataframe" sampling method that uses multistart points defined by the user. This is helpful if we want to try the same set of multistart points for multiple experiments.

"Multistart is not supported in the deprecated parmest interface")
)

assert isinstance(n_restarts, int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check that this is > 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at other Pyomo code fgor exampels of throwing exceptions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @adowling2 here, you need to throw an exception so you can test the exception is caught.

)


results = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make more sense to create a dataframe and then add rows as you go. Or you could preallocate the dataframe size because you know how many restarts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could even have your generate_samples function generate this empty dataframe.

@sscini
Copy link
Author

sscini commented Apr 30, 2025

Extend existing tests for parmest to include multistart, add.

@sscini
Copy link
Author

sscini commented Apr 30, 2025

Models provided need to include bounds, add exception

Copy link
Member

@adowling2 adowling2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some more comments for you to consider are you continue to refine this.

upper_bound = np.array([parmest_model.find_component(name).ub for name in theta_names])
# Check if the lower and upper bounds are defined
if np.any(np.isnan(lower_bound)) or np.any(np.isnan(upper_bound)):
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably already know this, but you will need to check all the errors are raised when expected.

)

if self.method == "random":
np.random.seed(seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to skip setting the random seed if seed=None (default)?

Copy link
Author

@sscini sscini May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is none for all the functions I use that set seed, so if it receives seed = None, it would work as expected. Would skipping it still be best practice?

elif self.method == "latin_hypercube":
# Generate theta values using Latin hypercube sampling
sampler = scipy.stats.qmc.LatinHypercube(d=len(theta_names), seed=seed)
samples = sampler.random(n=self.n_restarts+1)[1:] # Skip the first sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you skipping the first sample? Please explain in the comments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a comment in code to explain as well. The first sample generated using qmc.sobol is always the origin (zero vector). I thought logic applied to all qmc methods, but no only sobol. So to get nonzero points, you need to skip first sample

# Generate theta values using Latin hypercube sampling
sampler = scipy.stats.qmc.LatinHypercube(d=len(theta_names), seed=seed)
samples = sampler.random(n=self.n_restarts+1)[1:] # Skip the first sample
theta_vals_multistart = np.array([lower_bound + (upper_bound - lower_bound) * theta for theta in samples])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain what you are doing here. I suspect that this function returns an LHS refined on [0, 1] and you are rescaling. But it is best to be explicit.

@@ -921,6 +1020,116 @@ def theta_est(
cov_n=cov_n,
)

def theta_est_multistart(
self,
buffer=10,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to explain the buffer in the doc string.

"Multistart is not supported in the deprecated parmest interface"
)

assert isinstance(self.n_restarts, int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace all of these with more descriptive error messages. Remember that we need tests for each error message.

# This method will be used to generate the initial theta values for multistart
# optimization. It will take the theta names and the initial theta values
# and return a dictionary of theta names and their corresponding values.
def _generate_initial_theta(self, parmest_model, seed=None, n_restarts=None, multistart_sampling_method=None, user_provided=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All methods (even private) should have a description of each input, what type is expected, and what it's for in a brief yet coherent sentence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's probably a good idea to make sure each of the inputs are valid, in some sense, when called. For instance, if parmest_model is not a model in a specific form, there may be an error. We didn't require a model input for DoE, instead made it None and handled the case where this was sent.

# optimization. It will take the theta names and the initial theta values
# and return a dictionary of theta names and their corresponding values.
def _generate_initial_theta(self, parmest_model, seed=None, n_restarts=None, multistart_sampling_method=None, user_provided=None):
if n_restarts == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like just sending a warning, and not returning. For example, n_restarts might be 1 by default. You should check if n_restarts is an int as well. Then, if n_restarts is 1, you should send a warning that the tool is intended for this number to be greater than one and solve as normal.


# Get the theta names and initial theta values
theta_names = self._return_theta_names()
initial_theta = [parmest_model.find_component(name)() for name in theta_names]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to use the suffix for this? The suffix value shouldn't change, but the theta value may if the model has been solved for some reason. I don't know if this is a potential issue but I think that grabbing these values from the suffixes would be more dummy-proof.

return print("No multistart optimization needed. Please use normal theta_est()")

# Get the theta names and initial theta values
theta_names = self._return_theta_names()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this strings? We went through the effort of trying to avoid using strings to do tasks because they are fragile. Is it possible to stick with suffixes and use the actual component/component ID?

if multistart_sampling_method == "random":
np.random.seed(seed)
# Generate random theta values
theta_vals_multistart = np.random.uniform(lower_bound, upper_bound, size=len(theta_names))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any merit in allowing the user to pull from a non-uniform distribution? Unsure what extra machinery is required but maybe for a future PR/enhancement.

"The length of theta_names and initial_theta must be the same."
)

if multistart_sampling_method == "random":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily like that the first method is called "random". Aren't most of the methods inherently based on random sampling? It should probably be more explicit, like uniform sampling, or random uniform, or something like this.

samples = sampler.random(n=n_restarts+1)[1:]
theta_vals_multistart = np.array([lower_bound + (upper_bound - lower_bound) * theta for theta in samples])

elif multistart_sampling_method == "user_provided":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you probably should have a comment to start each different method that describes what you're doing. It is not entirely obvious what's going on when you haven't yet read the code.


if isinstance(user_provided, np.ndarray):
# Check if the user provided numpy array has the same number of rows as the number of restarts
if user_provided.shape[0] != n_restarts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance (from last comment), it is unclear that a 2-D numpy array is provided where the first dimension is the number of restarts and the second dimension in the number of theta names, being a N x M matrix representing the N instances of M parameters to be "multi-started".

Copy link
Contributor

@djlaky djlaky May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is especially problematic with N = M, as the users would get results back and not know that they have made a mistake.


# Add the output info values to the dataframe, starting values as nan
for i in range(len(theta_names)):
df_multistart[f'converged_{theta_names[i]}'] = np.nan
Copy link
Contributor

@djlaky djlaky May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all string characters legal for pandas names? I would think so but this line seems (maybe) dangerous? For instance, what if we start getting into block structuring with multi-index parameters? We should make sure there is a test to ensure the system is robust. I believe I have this on the back burner to make one for Pyomo.DoE as well.

theta_sampling_method: string, optional
Method used to sample theta values. Options are "random", "latin_hypercube", or "sobol".
Default is "random".
solver: string, optional
Copy link
Contributor

@djlaky djlaky May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to pass custom solver options? I'm not super familiar with this but if this is called separate of the standard parmest call, you need to make sure the user can specify their own solver options.


# Store the results in a list or DataFrame
# depending on the number of restarts
results_df.iloc[i, len(theta_names):len(theta_names) + len(theta_names)] = converged_theta_vals[i, :]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have the general format of the pandas frame described in text so lines like this can be checked by reviewers. Currently, it is very hard to check to make sure this is correct without running code ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

4 participants