Training
This page describes the training procedure for the Goldilocks top tagger.
Follow these instructions to replicate the training procedure for any number of reasons (different physics objects, definitions, working points, samples, etc.).
There are two steps needed to perform training in Goldilocks.
In the CMSSW and C++ environment, Goldilocks prepares ntuples specifically for training. The inputs are flat ntuples prepared by the SUSY group. (The flat ntuples from the C++ framework can be passed into the Python framework using uproot.)
The two languages are used because of the tools available in their respective ecosystems: C++ (speed, ROOT libraries) & Python (advanced ML tools).
Goldilocks builds the `Event` for each entry in the ROOT file.
Physics objects (AK8/AK4) are defined as structs within the framework (`interface/physicsObjects.h`).
The `Event` object is passed to other classes (`histogrammer`, `eventSelection`, etc.) that need information from the event.
DESIGN PHILOSOPHY: Classes that use information from the event to generate new information, e.g., kinematic reconstruction, should be called from the `Event` class. To achieve this, pass structs of the necessary information to the external classes, then return the new object back to the `Event` class. Thus, users can access all 'event-level' information from the `Event` class and do not need to instantiate extra tools in running macros.
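The design philosophy above can be illustrated with a minimal Python sketch (the framework itself implements this in C++). The names `Jet`, `TopCandidate`, `TtbarReco.execute`, and `Event.build` are hypothetical stand-ins, not the actual Goldilocks interfaces, and the pairing rule is chosen only for illustration. The point is the data flow: the event hands plain structs to an external tool and stores the returned object, so a running macro only needs the `Event`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Jet:
    """Plain struct holding the kinematics of one jet (hypothetical fields)."""
    pt: float
    eta: float
    phi: float


@dataclass
class TopCandidate:
    """New object returned by the external reconstruction tool."""
    ak8: Jet
    ak4: Jet


class TtbarReco:
    """External tool: receives structs, returns a new object, keeps no event state."""

    def execute(self, ak8_jets: List[Jet], ak4_jets: List[Jet]) -> Optional[TopCandidate]:
        if not ak8_jets or not ak4_jets:
            return None
        # Illustrative rule only: pair the highest-pT jet of each collection.
        lead_ak8 = max(ak8_jets, key=lambda jet: jet.pt)
        lead_ak4 = max(ak4_jets, key=lambda jet: jet.pt)
        return TopCandidate(ak8=lead_ak8, ak4=lead_ak4)


class Event:
    """Owns all event-level information, including results of external tools."""

    def __init__(self, reco: TtbarReco):
        self._reco = reco
        self.ak8_jets: List[Jet] = []
        self.ak4_jets: List[Jet] = []
        self.top_candidate: Optional[TopCandidate] = None

    def build(self, ak8_jets: List[Jet], ak4_jets: List[Jet]) -> None:
        self.ak8_jets = list(ak8_jets)
        self.ak4_jets = list(ak4_jets)
        # The Event calls the tool itself and keeps the returned object.
        self.top_candidate = self._reco.execute(self.ak8_jets, self.ak4_jets)


# A running macro only interacts with the Event.
event = Event(TtbarReco())
event.build([Jet(450.0, 0.3, 1.2)], [Jet(60.0, -1.1, 0.4), Jet(95.0, 0.8, -2.0)])
```

Because the tool keeps no reference to the event, it can be swapped or tested in isolation with hand-built structs.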
Running macros, which perform the event loop, are stored in the `bin/` directory.
These macros outline the basic setup of the configuration, file loop, TTree loop (if necessary), and event loop.
The event selection and information for output files are also declared in this script.
Before running, confirm the options in `config/training.txt` (or your custom configuration file) are appropriate!
To execute the framework:
$ source setup.csh
$ run_training config/training.txt
- The steering macro (see `bin/`) first initializes and sets the configurations
  - Declare settings/objects that are 'global' to all files being processed
- File loop
  - Prepare output that is file-specific
    - Initialize output file, cutflow histograms, efficiencies, etc.
  - TTree loop (for input files that have physics information in multiple TTrees)
    - Declare objects that are 'global' to all events in the tree:
      - `Event` object
      - output ttree
      - histograms & efficiencies
    - Event Loop
      - Build the `Event` object (jets, leptons, extras e.g., kinematic reconstruction)
      - Apply a selection(s), if desired
      - Save information to TTree & histograms
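The nesting of the workflow above (files, then TTrees, then events) can be sketched in a few lines of Python. This is not the steering macro itself, which is written in C++: the in-memory `FILES` dictionary stands in for ROOT files, and `passes_selection`, `run`, and the `min_jets` setting are hypothetical names used only to show where each kind of object is declared.

```python
from collections import Counter

# Hypothetical stand-in for ROOT input: {file name: {tree name: [entries]}}
FILES = {
    "ttbar.root": {"nominal": [{"njets": 5}, {"njets": 2}, {"njets": 4}]},
    "qcd.root": {"nominal": [{"njets": 3}], "jesUp": [{"njets": 6}, {"njets": 1}]},
}


def passes_selection(event, min_jets):
    """Toy event selection: require a minimum number of jets."""
    return event["njets"] >= min_jets


def run(files, config):
    results = {}
    for file_name, trees in files.items():        # file loop
        cutflow = Counter()                        # file-specific output
        saved_trees = {}
        for tree_name, entries in trees.items():   # TTree loop
            output_tree = []                       # 'global' to all events in this tree
            for entry in entries:                  # event loop
                event = dict(entry)                # build the event
                cutflow["all"] += 1
                if not passes_selection(event, config["min_jets"]):
                    continue                       # apply the selection
                cutflow["selected"] += 1
                output_tree.append(event)          # save information
            saved_trees[tree_name] = output_tree
        results[file_name] = {"cutflow": cutflow, "trees": saved_trees}
    return results


config = {"min_jets": 4}   # settings 'global' to all files being processed
results = run(FILES, config)
```

Declaring the cutflow inside the file loop and the output tree inside the TTree loop mirrors the scoping in the list: each object lives exactly as long as the loop level it summarizes.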
Different classes are used to achieve this workflow, and each one can be modified or extended by the user (inherit from these classes to build your own!).
Class | About |
---|---|
configuration | Class that contains all information for organization. Multiple functions that return basic information as well |
Event | Class that contains all of the information from the event -> loads information from TTree and re-organizes information into structs & functions, calculates weights, etc. |
eventSelection | Class to apply custom event selection (defined by user) |
histogrammer* | Class for generating histograms (interface between TH1/TH2 and Goldilocks) |
tools | Collection of functions for doing simple tasks common to different aspects of Goldilocks |
truthMatching | Class for determining the matching between truth and reconstructed objects |

Class | About |
---|---|
deepLearning | Class for handling the training/inference for machine learning tasks. The training aspect only prepares inputs as it is assumed training is done in a python environment. |
miniTree | Class similar to miniTree that is used exclusively for generating flat ntuples that are used in machine learning contexts |
histogrammer4ML | Class for saving histograms of features used in the machine learning (saved in flatTree4ML) |

Class | About |
---|---|
ttbarReco | General reconstruction of the AK8+AK4 system + quality criteria |
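Extending one of the classes above by inheritance might look like the following Python sketch. The actual classes are C++, so this only illustrates the pattern: `EventSelection.apply`, `BoostedTopSelection`, the `ak8_pt` key, and the 400 GeV threshold are all hypothetical and not taken from the framework.

```python
class EventSelection:
    """Hypothetical base class: accepts every event unless overridden."""

    def __init__(self, name="none"):
        self.name = name

    def apply(self, event):
        return True


class BoostedTopSelection(EventSelection):
    """User-defined selection: at least one AK8 jet above a pT threshold."""

    def __init__(self, min_ak8_pt=400.0):
        super().__init__(name="boostedTop")
        self.min_ak8_pt = min_ak8_pt

    def apply(self, event):
        # Override only the decision; the rest of the workflow is unchanged.
        return any(pt > self.min_ak8_pt for pt in event["ak8_pt"])


selection = BoostedTopSelection(min_ak8_pt=400.0)
events = [{"ak8_pt": [250.0]}, {"ak8_pt": [420.0, 310.0]}, {"ak8_pt": []}]
selected = [event for event in events if selection.apply(event)]
```

Code that holds a reference to the base class can call `apply` without knowing which selection is in use.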
If you add directories to the framework, ensure they will be compiled by checking `BuildFile.xml` and `bin/BuildFile.xml`.
If there are issues, it may be necessary to clean the directory and re-compile everything:
scram b clean
scram b -j8
It is also possible to submit batch jobs using the script `python/submitBatchJobs.py` with the text file `batch.txt`. For more information, please see the wiki page for batch jobs.
The actual training of the NN is performed in a Python environment outside of CMSSW using the packages Asimov and HEP Plotter.
The uproot package loads information from the ROOT file, prepared in the previous step,
into a Pandas dataframe that is then easily used in the framework.
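Loading a flat ntuple into a dataframe with uproot can be sketched as follows. This assumes uproot version 4 or later (where `TTree.arrays` accepts `library="pd"`) and is not the code Asimov uses; the helper name `load_dataframe` and the file, tree, and branch names in the usage comment are hypothetical.

```python
def load_dataframe(path, tree_name, branches=None):
    """Read flat branches of one TTree into a pandas DataFrame.

    `branches` is a list of branch names, or None to read every branch.
    """
    import uproot  # imported here so the helper can be defined without uproot installed

    with uproot.open(path) as root_file:
        tree = root_file[tree_name]
        # library="pd" asks uproot to return a pandas DataFrame
        return tree.arrays(branches, library="pd")


# Hypothetical usage (file, tree, and branch names are placeholders):
# df = load_dataframe("training_ntuple.root", "tree", ["ak8_pt", "ak4_deepCSV"])
# print(df.head())
```

Passing an explicit branch list keeps memory use down when the ntuple holds many more branches than the training needs.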
Scripts | Description |
---|---|
python/runAsimov.py | Steering script that determines which options are set and the order in which the functions are called. |
python/plotlabels.py | Labels (colors and binning) for samples and variables |
The relevant Asimov classes are called to perform all the training and plot making (Asimov works as an interface between HEP data and Keras) using the `python/runAsimov.py` running macro.
This script loads the configuration file; an example is `config/goldilocks.txt`.
The options are passed to Asimov to set the network architecture and access specific features from the dataframe.
The features listed in the configuration file should match branches in the root input file(s).
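A quick consistency check between configured features and available branches could look like the sketch below. The helper `missing_features` and all feature and branch names are hypothetical; in practice the branch list would come from the input file (for example from the `keys()` of an uproot tree).

```python
def missing_features(features, branches):
    """Return configured features with no matching branch, in configuration order."""
    available = set(branches)
    return [name for name in features if name not in available]


# Hypothetical feature and branch names, for illustration only.
configured = ["ak8_pt", "ak4_deepCSV", "ak4_charge"]
branches = ["ak8_pt", "ak8_eta", "ak4_pt", "ak4_deepCSV"]

missing = missing_features(configured, branches)
if missing:
    print("Features not found in the input file:", missing)
```

Running such a check before training gives a clear message instead of a failure deep inside the dataframe access.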
Additionally, a selection can be applied to the dataframe using the `slices` option, which is defined in the running script and in the Asimov class.
In the current setup, only AK4 jets with positive DeepCSV values are selected.
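The effect of such a selection can be sketched with plain Python rows. The `(column, operator, value)` tuple format, the helper `apply_slices`, and the column names are assumptions made for illustration; they are not the actual format of the `slices` option in Asimov.

```python
import operator

# Comparison operators a slice may use (assumed set, for illustration).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def apply_slices(rows, slices):
    """Keep the rows that satisfy every (column, operator, value) slice."""
    kept = []
    for row in rows:
        if all(OPERATORS[op](row[column], value) for column, op, value in slices):
            kept.append(row)
    return kept


rows = [
    {"ak4_deepCSV": 0.82, "ak4_pt": 55.0},
    {"ak4_deepCSV": -1.0, "ak4_pt": 40.0},   # fails the positive-DeepCSV requirement
    {"ak4_deepCSV": 0.05, "ak4_pt": 31.0},
]
selected = apply_slices(rows, [("ak4_deepCSV", ">", 0.0)])

# With a pandas DataFrame `df`, the same requirement can be written as
# df[df["ak4_deepCSV"] > 0.0]
```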
For more information please submit an issue or PR.