Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to read in a list of points to scan from a .csv file #320

Closed
wants to merge 2 commits into from

Conversation

runtingt
Copy link

@runtingt runtingt commented Apr 8, 2024

Adds option --fromfile to EnhancedCombine that allows the reading of points from a .csv file. Some checks for number of points and the requested algo (i.e. only 'fixed' is allowed) are performed.

@runtingt runtingt marked this pull request as draft April 9, 2024 16:26
@runtingt runtingt marked this pull request as ready for review April 9, 2024 16:37
@runtingt
Copy link
Author

Documentation added in Combine docs: cms-analysis/HiggsAnalysis-CombinedLimit#937

@runtingt
Copy link
Author

runtingt commented May 1, 2024

I think this should now be ready to merge, any suggestions for improvements?

Copy link
Collaborator

@ajgilbert ajgilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @runtingt, thanks for the PR. This is useful functionality, but I've added some comments that should be addressed first before we merge. Note also that combineTool.py already has a (not well documented) --generate option, which can be used to construct multiple jobs out of a series of options, e.g.:

> combineTool.py -M MultiDimFit --algo fixed --generate n:fixedPointPOIs::0,,r1=1,r2=2:1,,r1=2,r2=3 --dry-run
[DRY-RUN]: combine --algo fixed -M MultiDimFit --fixedPointPOIs r1=1,r2=2 -n .Test.0
[DRY-RUN]: combine --algo fixed -M MultiDimFit --fixedPointPOIs r1=2,r2=3 -n .Test.1

One alternative could be to simply append an extra argument to the internal list of generate options, e.g.

self.args.generate.append(
            "n:fixedPointPOIs::" + ":".join(["{},,".format(i) + ",".join(["{}={}".format(k,v) for k,v in X.items()]) for i, X in enumerate(pd.read_csv("test.csv").to_dict(orient="records"))]))

but also fine to stick with the current implementation.

@@ -32,6 +33,8 @@ def attach_intercept_args(self, group):
'-m', '--mass', help='Supports range strings for multiple masses, e.g. "120:130:5,140 will produce three combine calls with mass values of 120, 125, 130 and 140"')
group.add_argument(
'--points', help='For use with "-M MultiDimFit --algo grid" to split scan points into separate jobs')
group.add_argument(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function describes what we call intercept args - ones that are already used by combine that we want to extract/manipulate. Since you are adding a new arg, this should go in the attach_args method below.

@@ -88,6 +91,32 @@ def run_method(self):
subbed_vars[('SEED',)] = [(sval,) for sval in seed_vals]
self.passthru.extend(['-s', '%(SEED)s'])

# Handle the --fromfile option
if '--algo' in self.passthru:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversely, it seems you want to intercept the usual --algo option used by combine. Better to add this to the attach_intercept_args method above (and remember to put it back in the final args again)

self.passthru.extend(['--points', self.args.points])
groupby = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this declaration closer to where it used for the first time (though I'm not sure its needed)

with open(self.args.fromfile) as f:
points_df = pd.read_csv(f)
if not self.args.points:
self.args.points = len(points_df)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better not to mix your new implementation with the existing one for --points and --split-points - these are specifically about controlling the options in --algo grid. If you need additional options for your method you can add new arguments. Though it's not totally clear you need this. Is your intention that the argument to --points will limit the number of lines in your csv file which is ready?

if (self.args.split_points is not None and
self.args.split_points > 0 and
self.args.points is not None):
self.args.points is not None
and self.args.fromfile is None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, better not to have to modify this code

points_dict = points_df.iloc[r[0]].to_dict()
points_str = ','.join(f'{k}={v}' for k, v in points_dict.items())
points_array.append((r[0], r[1], points_str))
subbed_vars[('P_START', 'P_END', 'fixedPointPOIs')] = points_array
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to track start and end indexes here? If I understand correctly, you just want to create one combine call for each row of your CSV file. In that case, you only need one index (to set in args.name), not two

# can only put the name option back now because we might have modified
# it from what the user specified
self.put_back_arg('name', '-n')
proto = 'combine ' + (' '.join(self.passthru))
if self.args.there:
proto = 'pushd %(DIR)s; combine ' + (' '.join(self.passthru))+'; popd'

for it in itertools.product(*list(subbed_vars.values())):
job_str = ''
for idx, it in enumerate(itertools.product(*list(subbed_vars.values()))):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid changing this part of the code, I don't think it's necessary - combineTool.py already has a mechanism to group individual combine calls into single jobs (--merge X).

@runtingt
Copy link
Author

Hi @ajgilbert, thanks for the review and apologies for the delay. Since the scripts have now moved to Combine, I've actioned your comments there: cms-analysis/HiggsAnalysis-CombinedLimit#937

@runtingt runtingt closed this Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants