Training

Daniel Marley edited this page Oct 24, 2018 · 8 revisions

This page describes the training procedure for the Goldilocks top tagger.
Follow these instructions to repeat the training for your own needs (different physics objects, definitions, working points, samples, etc.).

Training in Goldilocks takes two steps: ntuple production, followed by the training itself.

Ntuple Production

Goldilocks runs in the CMSSW and C++ environment to prepare ntuples specifically for training. The inputs are flat ntuples prepared by the SUSY group. (The flat ntuples written by the C++ framework can be read into the Python framework using uproot.)

The two languages are used because of the tools available in their respective ecosystems: C++ (speed, ROOT libraries) & Python (advanced ML tools).

Goldilocks builds the Event for each entry in the ROOT file. Physics objects (AK8/AK4) are defined as structs within the framework (interface/physicsObjects.h). The Event object is passed to other classes (histogrammer, eventSelection, etc.) that need information from the event.

DESIGN PHILOSOPHY: Classes that use information from the event to generate new information, e.g., kinematic reconstruction, should be called from the Event class. To achieve this, pass structs containing the necessary information to the external classes, then return the new object to the Event class. Users can then access all 'event-level' information from the Event class and do not need to instantiate extra tools in the running macros.
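
The pattern can be illustrated with a minimal sketch. It is written in Python for brevity, although the framework itself implements this in C++. The `Jet` and `TopCandidate` structs, their fields, and the "pair the leading jets" logic are placeholders invented for illustration; they are not the actual contents of interface/physicsObjects.h or of ttbarReco.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Jet:
    """Minimal stand-in for a physics-object struct (hypothetical fields)."""
    pt: float
    eta: float
    phi: float


@dataclass
class TopCandidate:
    """New object produced by the reconstruction tool (hypothetical)."""
    ak8: Jet
    ak4: Jet


class TtbarReco:
    """External tool: receives plain structs and returns a new object."""

    def reconstruct(self, ak8_jets: List[Jet], ak4_jets: List[Jet]) -> Optional[TopCandidate]:
        if not ak8_jets or not ak4_jets:
            return None
        # Placeholder logic: pair the highest-pT jet of each collection.
        lead_ak8 = max(ak8_jets, key=lambda jet: jet.pt)
        lead_ak4 = max(ak4_jets, key=lambda jet: jet.pt)
        return TopCandidate(ak8=lead_ak8, ak4=lead_ak4)


class Event:
    """Owns the physics objects and the tools that derive new information."""

    def __init__(self, ak8_jets: List[Jet], ak4_jets: List[Jet]):
        self.ak8_jets = ak8_jets
        self.ak4_jets = ak4_jets
        self._reco = TtbarReco()  # created here, not in the running macro
        self.top_candidate: Optional[TopCandidate] = None

    def build(self) -> None:
        # Hand only the needed structs to the tool; keep the result on the event.
        self.top_candidate = self._reco.reconstruct(self.ak8_jets, self.ak4_jets)


event = Event(
    ak8_jets=[Jet(450.0, 0.3, 1.2)],
    ak4_jets=[Jet(60.0, -0.5, 2.0), Jet(95.0, 1.1, -0.7)],
)
event.build()
```

Code that needs the reconstructed object reads `event.top_candidate` and never touches the reconstruction tool directly.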

The running macros that perform the event loop are stored in the bin/ directory.
These macros outline the basic setup of the configuration, file loop, TTree loop (if necessary), and event loop.
The event selection and the information written to the output files are also declared in these macros.

Before running, confirm the options in config/training.txt (or your custom configuration file) are appropriate!
To execute the framework:

$ source setup.csh
$ run_training config/training.txt

Analysis Flow

  1. The steering macro (see bin/) first initializes and sets the configurations
    • Declare settings/objects that are 'global' to all files being processed
  2. File loop
    • Prepare output that is file-specific
      • Initialize output file, cutflow histograms, efficiencies, etc.
  3. TTree loop (for input files that have physics information in multiple TTrees)
    • Declare objects that are 'global' to all events in the tree:
      • Event object
      • Output TTree
      • Histograms & efficiencies
  4. Event loop
    • Build the Event object (jets, leptons, and extras, e.g., kinematic reconstruction)
    • Apply one or more selections, if desired
    • Save information to TTree & histograms
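
The nesting of the four steps above can be sketched as follows. This is a Python outline of the control flow only, not the C++ steering macro from bin/. The input layout (a dictionary of files, each holding trees of event dictionaries), the `min_jets` setting, and the helper functions are assumptions made for illustration.

```python
def build_event(entry):
    """Stand-in for building the Event object from one tree entry."""
    jets = entry["jets"]
    return {"n_jets": len(jets), "lead_pt": max(jets, default=0.0)}


def passes_selection(event, min_jets):
    """Stand-in for a user-defined event selection."""
    return event["n_jets"] >= min_jets


def run(config, files):
    """files: {file_name: {tree_name: [entry, ...]}}, a toy replacement for ROOT input."""
    min_jets = config["min_jets"]                      # 1. settings global to all files
    results = {}
    for file_name, trees in files.items():             # 2. file loop
        cutflow = {"all": 0, "selected": 0}            #    file-specific output
        rows = []
        for tree_name, entries in trees.items():       # 3. TTree loop
            for entry in entries:                      # 4. event loop
                event = build_event(entry)
                cutflow["all"] += 1
                if not passes_selection(event, min_jets):
                    continue
                cutflow["selected"] += 1
                rows.append({"tree": tree_name, **event})  # "save" the event
        results[file_name] = {"cutflow": cutflow, "rows": rows}
    return results


toy_files = {
    "sample.root": {
        "tree": [{"jets": [50.0, 30.0]}, {"jets": [20.0]}, {"jets": []}],
    }
}
results = run({"min_jets": 2}, toy_files)
```

Objects are created at the outermost level at which they are shared: settings once per job, the cutflow once per file, and the event once per entry.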

Different classes are used to achieve this workflow, and each one can be modified or extended by the user (inherit from these classes to build your own!).

Standard classes:

Class            About
configuration    Class that contains all information for organization, along with functions that return basic information
Event            Class that contains all of the information from the event: loads information from the TTree, re-organizes it into structs & functions, calculates weights, etc.
eventSelection   Class to apply a custom event selection (defined by the user)
histogrammer*    Class for generating histograms (interface between TH1/TH2 and Goldilocks)
tools            Collection of functions for simple tasks common to different aspects of Goldilocks
truthMatching    Class for determining the matching between truth and reconstructed objects

Machine learning:

Class             About
deepLearning      Class for handling the training/inference for machine learning tasks. The training aspect only prepares inputs, as it is assumed that training is done in a Python environment.
flatTree4ML       Class similar to miniTree that is used exclusively for generating flat ntuples that are used in machine learning contexts
histogrammer4ML   Class for saving histograms of features used in the machine learning (saved in flatTree4ML)

Kinematic reconstruction:

Class       About
ttbarReco   General reconstruction of the AK8+AK4 system + quality criteria

If you add directories to the framework, ensure they will be compiled by checking BuildFile.xml and bin/BuildFile.xml.
If there are issues, it may be necessary to clean the directory and re-compile everything:

$ scram b clean
$ scram b -j8

It is also possible to submit batch jobs using the script python/submitBatchJobs.py with the text file batch.txt. For more information, please see the wiki page for batch jobs.

Training

The actual training of the NN is performed in a Python environment outside of CMSSW using the Asimov and HEP Plotter packages.
The uproot package loads information from the ROOT file prepared in the previous step into a Pandas dataframe, which is then used throughout the framework.

Scripts                Description
python/runAsimov.py    Steering script that determines what options are set and what order to call the functions.
python/plotlabels.py   Labels (colors and binning) for samples and variables

The python/runAsimov.py running macro calls the relevant Asimov classes to perform all of the training and plot making (Asimov works as an interface between HEP data and Keras).
This script loads the configuration file (an example is config/goldilocks.txt). The options are passed to Asimov to set the network architecture and to access specific features from the dataframe.
The features listed in the configuration file should match branches in the ROOT input file(s).
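
A mismatch between configured features and available branches is easy to catch before training starts. The helper below is a minimal sketch and is not part of Asimov or Goldilocks. The feature and branch names are hypothetical, and in practice the branch list would be read from the input file (for instance with uproot) rather than typed by hand.

```python
def missing_features(features, branches):
    """Return the configured features that have no matching branch, in config order."""
    available = set(branches)
    return [name for name in features if name not in available]


# Hypothetical names, for illustration only.
features = ["AK4_deepCSVb", "AK8_tau21", "AK8_softDropMass"]
branches = ["AK4_deepCSVb", "AK8_softDropMass", "AK8_pt"]

missing = missing_features(features, branches)
if missing:
    print("Features not found in the input file: " + ", ".join(missing))
```

Running a check like this first means a typo in the configuration file surfaces as a clear message instead of a lookup failure partway through the training setup.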

Additionally, there is an option to apply a selection to the dataframe using the slices option, which is defined in the running script and in the Asimov class. In the current setup, only AK4 jets with positive DeepCSV values are kept.
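
The effect of such a selection can be sketched without any external packages. The format of the slices option in Asimov is not shown on this page, so the (column, operator, value) tuples below are an assumption, as is the column name `AK4_deepCSVb`. Rows are plain dictionaries standing in for dataframe rows; with a Pandas dataframe the same cut is a boolean mask such as `df[df[column] > 0]`.

```python
import operator

# Comparison operators a selection may use (illustrative subset).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def apply_slices(rows, slices):
    """Keep only the rows that satisfy every (column, operator, value) condition."""
    kept = rows
    for column, op, value in slices:
        compare = OPERATORS[op]
        kept = [row for row in kept if compare(row[column], value)]
    return kept


# Toy jets with a hypothetical DeepCSV column.
jets = [
    {"AK4_deepCSVb": 0.82, "AK4_pt": 45.0},
    {"AK4_deepCSVb": -1.0, "AK4_pt": 60.0},
    {"AK4_deepCSVb": 0.10, "AK4_pt": 33.0},
]

selected = apply_slices(jets, [("AK4_deepCSVb", ">", 0.0)])
```

Conditions are applied one after another, so listing several slices is equivalent to a logical AND of all of them.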

Questions or Comments

For more information please submit an issue or PR.