REANA set up #217

Closed

Changes from all commits
48 commits
6cb7dfc
reana.yaml file
Nov 6, 2023
6596b4d
Separate the samples in fileset for paralelisation
Nov 21, 2023
c514b3d
Merge step with histograms_merdeg.root
Nov 21, 2023
104f799
A lot of changes
Dec 8, 2023
ad3b0b5
Snakemake multi cascading
AndriiPovsten Feb 7, 2024
486faec
Snakemake multicascading and the submission.yaml for HEPData
AndriiPovsten Feb 7, 2024
7703d16
Snakemake multicascading
AndriiPovsten Feb 7, 2024
b69b21f
Without HEPData workspace
AndriiPovsten Feb 7, 2024
5df86b5
The HEPData folder with submission files for the cabinetry submission
AndriiPovsten Feb 7, 2024
e078f4b
Merge pull request #1 from AndriiPovsten/new_branch
AndriiPovsten Feb 7, 2024
be5c7a8
Better naming for the files with some suggestions for the Snakefile
AndriiPovsten Feb 29, 2024
f5bde77
Merge pull request #2 from AndriiPovsten/new_branch
AndriiPovsten Feb 29, 2024
a4911f3
The separate REANA folder
AndriiPovsten Feb 29, 2024
a8d3786
Merge pull request #3 from AndriiPovsten/new_branch
AndriiPovsten Feb 29, 2024
e2990a1
Test the file processing locally
AndriiPovsten Mar 1, 2024
2e4b13e
Merge pull request #4 from AndriiPovsten/Reproducibility_REANA
AndriiPovsten Mar 1, 2024
a0612d7
HEPData submission
AndriiPovsten Mar 11, 2024
87f510c
Merge pull request #5 from AndriiPovsten/main
AndriiPovsten Mar 11, 2024
af1906e
Change the naming
AndriiPovsten Mar 19, 2024
7ec7e4f
Cleaner Snakefile and main analysis notebook
AndriiPovsten Mar 19, 2024
fff6740
Merge pull request #7 from AndriiPovsten/new_branch
AndriiPovsten Mar 19, 2024
3edac4b
HEPData in the utils
AndriiPovsten Mar 21, 2024
3db44d4
The ultimate hepdata function for both current models
AndriiPovsten Mar 21, 2024
c51264c
HEPData submission with function in utils folder
AndriiPovsten Mar 22, 2024
0e052b7
Updated HEP_data fucntion
AndriiPovsten May 16, 2024
a37fd02
Shorter Snakefile version
AndriiPovsten May 16, 2024
dcafae7
Change the folder name
AndriiPovsten May 16, 2024
a439d95
adding the environment folder for local run
AndriiPovsten May 16, 2024
89e5585
Resolved conflicts in ttbar_analysis_pipeline.ipynb
AndriiPovsten May 16, 2024
8fe4084
Merge branch 'hepdata' into Reproducibility_REANA
AndriiPovsten May 16, 2024
49dae45
Updated READ.ME with only REANA folder
AndriiPovsten May 17, 2024
fe5aff3
Updated location and README
AndriiPovsten May 17, 2024
f44235a
Deleted cloned README.md
AndriiPovsten May 17, 2024
f6a1b22
Deleted non related files
AndriiPovsten May 17, 2024
ce05c8d
Cleaning up the files
AndriiPovsten Jun 24, 2024
b15fb7b
get rid of Store files
AndriiPovsten Jun 24, 2024
cffbac5
get rid of Store file
AndriiPovsten Jun 24, 2024
6fb63c9
Updated final_merging file with extract_samples function
AndriiPovsten Jun 26, 2024
7b839db
Getting rid of packages that are not used
AndriiPovsten Jun 26, 2024
010ad45
Getting rid of packages that are not used
AndriiPovsten Jun 26, 2024
1f1a5f1
Rename REANA folder
AndriiPovsten Jun 28, 2024
7065a85
Getting rid of environment directory
AndriiPovsten Jun 28, 2024
890745f
Return accidentaly deleted files
AndriiPovsten Jul 5, 2024
2d196b2
putting files in original place
AndriiPovsten Jul 5, 2024
51a4fd1
set to original state utils folder
AndriiPovsten Jul 5, 2024
a642b45
synchronizing the script and the notebook
AndriiPovsten Jul 5, 2024
a31152b
Using harbor image(ideally needs a papermill dependecy), changing rea…
AndriiPovsten Jul 5, 2024
c659f92
Updated Snakefile for handling the file locations
AndriiPovsten Jul 6, 2024
1 change: 1 addition & 0 deletions analyses/cms-open-data-ttbar/README.md
@@ -18,6 +18,7 @@ This directory is focused on running the CMS Open Data $t\bar{t}$ analysis throu
| models/ | Contains models used for ML inference task (when `USE_TRITON = False`) |
| utils/ | Contains code for bookkeeping and cosmetics, as well as some boilerplate. Also contains images used in notebooks. |
| utils/config.py | This is a general config file to handle different options for running the analysis. |
| REANA/ | Folder with the modifications required for running the `ttbar_analysis_pipeline.ipynb` notebook on the REANA platform. |

#### Instructions for paired notebook

161 changes: 161 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/README.md
@@ -0,0 +1,161 @@
# REANA example - AGC CMS ttbar analysis with Coffea

This demo shows the submission of the [AGC](https://arxiv.org/abs/1010.2506) (Analysis Grand Challenge)
to [REANA](http://www.reana.io/) using Snakemake as the workflow engine.

## Analysis Grand Challenge

For a full explanation, please have a look at this documentation:
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7274936.svg)](https://doi.org/10.5281/zenodo.7274936)
[![Documentation Status](https://readthedocs.org/projects/agc/badge/?version=latest)](https://agc.readthedocs.io/en/latest/?badge=latest)

The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC.
This includes:

- columnar data extraction from large datasets,
- processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
- statistical model construction and statistical inference,
- relevant visualizations for these steps,

The physics analysis task is a $t\bar{t}$ cross-section measurement with 2015 CMS Open Data (see `datasets/cms-open-data-2015`).
The current reference implementation can be found in `analyses/cms-open-data-ttbar`.

### 1. Input data

We are using [2015 CMS Open Data](https://cms.cern/news/first-cms-open-data-lhc-run-2-released) in this demonstration to showcase an analysis pipeline. The paths of the input `.root` files are listed in `nanoaod_inputs.json`.
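The Snakefile below reads this JSON and expects each sample to map to its systematic variations, each carrying a list of file entries. A minimal illustrative excerpt (the paths are shortened placeholders, not the actual file contents):

```json
{
  "ttbar": {
    "nominal": {
      "files": [
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."},
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."}
      ]
    },
    "scaleup": {
      "files": [
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."}
      ]
    }
  }
}
```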
### Analysis code
The current coffea-based AGC implementation defines a coffea processor, which encapsulates most of the physics analysis details:
- event filtering and the calculation of observables,
- event weighting,
- calculating systematic uncertainties at the event and object level,
- filling all the information into histograms that get aggregated and ultimately returned to us by coffea.
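As a minimal sketch of the shape of such a processor (illustrative only; the class name, cut and observable are stand-ins, not the actual AGC code):

```python
import awkward as ak
import hist
from coffea import processor


class TtbarSketch(processor.ProcessorABC):
    """Illustrative processor: one histogram, one event-level cut."""

    def process(self, events):
        # event filtering: keep events with at least four jets (stand-in cut)
        selected = events[ak.num(events.Jet) >= 4]
        # book a histogram with an observable axis and a growable process axis
        h = (
            hist.Hist.new.Reg(25, 50, 550, name="observable", label="observable [GeV]")
            .StrCat([], name="process", growth=True)
            .Weight()
        )
        # fill an example observable per dataset
        h.fill(observable=ak.sum(selected.Jet.pt, axis=1), process=events.metadata["dataset"])
        return {"hist": h}

    def postprocess(self, accumulator):
        return accumulator
```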

The analysis takes the following inputs:

- ``nanoaod_inputs.json`` The list of input `.root` files.
- ``Snakefile`` The Snakemake workflow definition with the processing and merging rules.
- ``ttbar_analysis_reana.ipynb`` The main notebook, in which the files are processed and analysed.
- ``file_merging.ipynb`` Notebook to merge each sample's processed `.root` files into one file with unique keys.
- ``final_merging.ipynb`` Notebook to merge the per-sample files into the final histogram file.

### 2. Compute environment

To be able to rerun the AGC after some time, we need to
"encapsulate the current compute environment", for example to freeze the versions of the
packages our analysis is using. We achieve this by preparing a [Docker](https://www.docker.com/)
container image for our analysis steps.

We are using a modified version of the ``analysis-systems-base`` [Docker image](https://github.com/iris-hep/analysis-systems-base) with additional packages; the main one is [papermill](https://papermill.readthedocs.io/en/latest/), which allows running a Jupyter notebook from the command line with additional parameters.
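For illustration, this is essentially how the Snakefile below invokes the notebooks (the parameter values shown here are examples):

```console
$ papermill Reana/ttbar_analysis_reana.ipynb out.ipynb \
    -p sample_name ttbar__nominal \
    -p filename root://eospublic.cern.ch//eos/opendata/cms/upload/agc/1.0.0/... \
    -k python3
```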

In our case, the Dockerfile creates a conda virtual environment with all necessary packages for running the AGC analysis.
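A minimal sketch of such a Dockerfile (the base image, environment name and package list here are assumptions for illustration, not the actual image definition):

```dockerfile
FROM condaforge/mambaforge:latest

# create a conda environment with the analysis dependencies
RUN mamba create -y -n agc python=3.10 coffea uproot hist papermill \
    && mamba clean -afy

# make login shells (`bash -l`, as used by the Snakefile rules) activate the environment
RUN echo "conda activate agc" >> /root/.bashrc
```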

### 3. Kerberos authentication
Some data are located on EOS public storage (`eospublic.cern.ch`), so in order to process a large number of files the user should be authenticated with Kerberos.
In our case we achieve this by setting, in `reana.yaml`:
```yaml
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
```
If you are processing a small number of files (fewer than 10), you can set this option to `false`.
Alternatively, you can enable Kerberos authentication via the Snakemake rules, as sketched below.
For a deeper understanding, please refer to the [REANA documentation](https://docs.reana.io/advanced-usage/access-control/kerberos/).
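A sketch of the rule-level variant (assuming only the file-processing rule needs authenticated EOS access):

```python
rule process_sample_one_file_in_sample:
    resources:
        kerberos=True,
        kubernetes_memory_limit="3700Mi"
    # ... input, output and shell directives as in the Snakefile below
```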

### 4. AGC workflow with Snakemake
REANA provides support for the Snakemake workflow engine. To ensure fast execution of the AGC ttbar workflow, a two-level ("multi-cascading") parallelization approach is implemented with Snakemake.
In the initial step, Snakemake distributes jobs across separate nodes, each running `ttbar_analysis_reana.ipynb` on a single `.root` file.
Subsequently, after the completion of each rule, the individual files are merged into one file per sample.

Here is a high-level view of the AGC workflow:

```console
+-----------------------------------------+
| Take the CMS open data from nanoaod.json|
+-----------------------------------------+
                    |
                    v
+-----------------------------------+
|rule: Process each file in parallel|
+-----------------------------------+
                    |
                    v
+-----------------------------------------+
|rule: Merge created files for each sample|
+-----------------------------------------+
                    |
                    v
+----------------------------------------------+
|rule: Merge sample files into single histogram|
+----------------------------------------------+
```

### 5. Running the AGC on REANA

The [reana.yaml](reana.yaml) file describes the above analysis
structure with its inputs, code, runtime environment, computational workflow steps and
expected outputs:

```yaml
version: 0.8.0
inputs:
  files:
    - ttbar_analysis_reana.ipynb
    - nanoaod_inputs.json
    - fix-env.sh
    - corrections.json
    - Snakefile
    - file_merging.ipynb
    - final_merging.ipynb
    - prepare_workspace.py
  directories:
    - histograms
    - utils
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
outputs:
  files:
    - histograms_merged.root
```
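Before submitting, the specification can be checked with the standard ``validate`` command:

```console
$ reana-client validate -f Reana/reana.yaml
```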

We can now install the REANA command-line client, run the analysis and download the
resulting files:

```console
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # navigate to `cms-open-data-ttbar`; the reana.yaml file is inside the `Reana` folder
$ # run the AGC workflow
$ reana-client run -f Reana/reana.yaml -w reana-agc-cms-ttbar-coffea
$ # ... should finish in around 6 minutes if you select all files in the Snakefile
$ reana-client status
$ # list workspace files
$ reana-client ls
```
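Once the workflow has finished, the merged histogram file can be fetched from the workspace with the standard ``download`` command:

```console
$ reana-client download histograms_merged.root -w reana-agc-cms-ttbar-coffea
```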

Please see the [REANA-Client](https://reana-client.readthedocs.io/) documentation for
a more detailed explanation of typical `reana-client` usage scenarios.

### 6. Output results

The output is created as ``histograms_merged.root``, which can be further evaluated with a variety of AGC tools.
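As a quick sanity check, the merged file can be inspected with uproot (a short sketch; the available keys depend on which samples were processed):

```python
import uproot

# list all histograms in the merged output and print their integrals
with uproot.open("histograms_merged.root") as f:
    for key in f.keys(cycle=False):
        print(key, f[key].to_hist().sum())
```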




100 changes: 100 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/Snakefile
@@ -0,0 +1,100 @@
N_FILES_MAX_PER_SAMPLE = -1  # a negative value means "process all files"
download_sleep = 0
url_prefix = "root://eospublic.cern.ch//eos/opendata"
# In order to run the analysis from Nebraska, use this prefix instead:
# url_prefix = "https://xrootd-local.unl.edu:1094//"

import json


def extract_samples_from_json(json_file):
    """Write one path-list file per sample/condition and return the sample names."""
    output_files = []

    with open(json_file, "r") as fd:
        data = json.load(fd)

    for sample, conditions in data.items():
        for condition, details in conditions.items():
            sample_name = f"{sample}__{condition}"
            output_files.append(sample_name)
            with open(f"sample_{sample_name}_paths.txt", "w") as path_file:
                paths = [file_info["path"].replace("https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD",
                         "root://eospublic.cern.ch//eos/opendata/cms/upload/agc/1.0.0/") for file_info in details["files"]]
                path_file.write("\n".join(paths))
    return output_files


def get_file_paths(wildcards, max_files=N_FILES_MAX_PER_SAMPLE):
    """Return a list of at most MAX_FILES expected histogram paths for the given sample."""
    with open(f"sample_{wildcards.sample}__{wildcards.condition}_paths.txt") as fd:
        filepaths = fd.read().splitlines()
    # filepath[38:] strips the "root://eospublic.cern.ch//eos/opendata" prefix
    outputs = [f"histograms/histograms_{wildcards.sample}__{wildcards.condition}__" + filepath[38:] for filepath in filepaths]
    # a negative limit means "use all files"; slicing with -1 would silently drop the last one
    return outputs if max_files < 0 else outputs[:max_files]


samples = extract_samples_from_json("nanoaod_inputs.json")


def get_items(json_file):
    """Return the (sample, condition) pairs found in the input JSON."""
    samples = []
    with open(json_file, "r") as fd:
        data = json.load(fd)
    for sample, conditions in data.items():
        for condition in conditions:
            samples.append((sample, condition))
    return samples


rule all:
    input:
        "histograms_merged.root"


rule process_sample_one_file_in_sample:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="3700Mi"
    input:
        "Reana/ttbar_analysis_reana.ipynb"
    output:
        "histograms/histograms_{sample}__{condition}__{filename}"
    params:
        sample_name = '{sample}__{condition}'
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && python Reana/prepare_workspace.py sample_{params.sample_name}_{wildcards.filename} && papermill Reana/ttbar_analysis_reana.ipynb sample_{params.sample_name}_{wildcards.filename}_out.ipynb -p sample_name {params.sample_name} -p filename {url_prefix}{wildcards.filename} -k python3"


rule process_sample:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="1850Mi"
    input:
        "Reana/file_merging.ipynb",
        get_file_paths
    output:
        "everything_merged_{sample}__{condition}.root"
    params:
        sample_name = '{sample}__{condition}'
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && papermill Reana/file_merging.ipynb merged_{params.sample_name}.ipynb -p sample_name {params.sample_name} -k python3"


rule merging_histograms:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="1850Mi"
    input:
        "everything_merged_ttbar__nominal.root",
        "everything_merged_ttbar__ME_var.root",
        "everything_merged_ttbar__PS_var.root",
        "everything_merged_ttbar__scaleup.root",
        "everything_merged_ttbar__scaledown.root",
        "everything_merged_single_top_s_chan__nominal.root",
        "everything_merged_single_top_t_chan__nominal.root",
        "everything_merged_single_top_tW__nominal.root",
        "everything_merged_wjets__nominal.root",
        "Reana/final_merging.ipynb"
    output:
        "histograms_merged.root"
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && papermill Reana/final_merging.ipynb result_notebook.ipynb -k python3"


3 changes: 3 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/entrypoint.sh
@@ -0,0 +1,3 @@
#!/bin/bash

exec /bin/bash -l -c "$*"
57 changes: 57 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/file_merging.ipynb
@@ -0,0 +1,57 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"import hist\n",
"import uproot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_histograms = {}\n",
"for fname in glob.glob(f\"histograms/histograms_{sample_name}_*/**/*.root\", recursive=True):\n",
" print(f\"opening file {fname}\")\n",
" with uproot.open(fname) as f:\n",
" # loop over all histograms in file\n",
" for key in f.keys(cycle=False):\n",
" if key not in all_histograms.keys():\n",
" # this kind of histogram has not been seen yet, create a new entry for it\n",
" all_histograms.update({key: hist.Hist(f[key])})\n",
" else:\n",
" # this kind of histogram is already being tracked, so add it\n",
" all_histograms[key] += hist.Hist(f[key])\n",
"# save this to a new file\n",
"with uproot.recreate(f\"everything_merged_{sample_name}.root\") as f:\n",
" for key, value in all_histograms.items():\n",
" f[key] = value\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file = uproot.open(f\"everything_merged_{sample_name}.root\")\n",
"keys = file.keys()\n",
"print(f\"Keys for the everything_merged_{sample_name}.root:\", keys)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
25 changes: 25 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/file_merging.py
@@ -0,0 +1,25 @@
import glob
import hist
import uproot

all_histograms = {}
for fname in glob.glob(f"histograms/histograms_{sample_name}_*/**/*.root", recursive=True):
    print(f"opening file {fname}")
    with uproot.open(fname) as f:
        # loop over all histograms in file
        for key in f.keys(cycle=False):
            if key not in all_histograms.keys():
                # this kind of histogram has not been seen yet, create a new entry for it
                all_histograms.update({key: hist.Hist(f[key])})
            else:
                # this kind of histogram is already being tracked, so add it
                all_histograms[key] += hist.Hist(f[key])
# save this to a new file
with uproot.recreate(f"everything_merged_{sample_name}.root") as f:
    for key, value in all_histograms.items():
        f[key] = value


file = uproot.open(f"everything_merged_{sample_name}.root")
keys = file.keys()
print(f"Keys for the everything_merged_{sample_name}.root:", keys)

GitHub Actions / linter: Ruff (F821) flags `sample_name` as undefined on lines 6, 18, 23 and 25 of this file; in the notebook version it is supplied externally as a papermill parameter.