REANA set up #217

Closed

Changes from all commits
48 commits
6cb7dfc
reana.yaml file
Nov 6, 2023
6596b4d
Separate the samples in fileset for paralelisation
Nov 21, 2023
c514b3d
Merge step with histograms_merdeg.root
Nov 21, 2023
104f799
A lot of changes
Dec 8, 2023
ad3b0b5
Snakemake multi cascading
AndriiPovsten Feb 7, 2024
486faec
Snakemake multicascading and the submission.yaml for HEPData
AndriiPovsten Feb 7, 2024
7703d16
Snakemake multicascading
AndriiPovsten Feb 7, 2024
b69b21f
Without HEPData workspace
AndriiPovsten Feb 7, 2024
5df86b5
The HEPData folder with submission files for the cabinetry submission
AndriiPovsten Feb 7, 2024
e078f4b
Merge pull request #1 from AndriiPovsten/new_branch
AndriiPovsten Feb 7, 2024
be5c7a8
Better naming for the files with some suggestions for the Snakefile
AndriiPovsten Feb 29, 2024
f5bde77
Merge pull request #2 from AndriiPovsten/new_branch
AndriiPovsten Feb 29, 2024
a4911f3
The separate REANA folder
AndriiPovsten Feb 29, 2024
a8d3786
Merge pull request #3 from AndriiPovsten/new_branch
AndriiPovsten Feb 29, 2024
e2990a1
Test the file processing locally
AndriiPovsten Mar 1, 2024
2e4b13e
Merge pull request #4 from AndriiPovsten/Reproducibility_REANA
AndriiPovsten Mar 1, 2024
a0612d7
HEPData submission
AndriiPovsten Mar 11, 2024
87f510c
Merge pull request #5 from AndriiPovsten/main
AndriiPovsten Mar 11, 2024
af1906e
Change the naming
AndriiPovsten Mar 19, 2024
7ec7e4f
Cleaner Snakefile and main analysis notebook
AndriiPovsten Mar 19, 2024
fff6740
Merge pull request #7 from AndriiPovsten/new_branch
AndriiPovsten Mar 19, 2024
3edac4b
HEPData in the utils
AndriiPovsten Mar 21, 2024
3db44d4
The ultimate hepdata function for both current models
AndriiPovsten Mar 21, 2024
c51264c
HEPData submission with function in utils folder
AndriiPovsten Mar 22, 2024
0e052b7
Updated HEP_data fucntion
AndriiPovsten May 16, 2024
a37fd02
Shorter Snakefile version
AndriiPovsten May 16, 2024
dcafae7
Change the folder name
AndriiPovsten May 16, 2024
a439d95
adding the environment folder for local run
AndriiPovsten May 16, 2024
89e5585
Resolved conflicts in ttbar_analysis_pipeline.ipynb
AndriiPovsten May 16, 2024
8fe4084
Merge branch 'hepdata' into Reproducibility_REANA
AndriiPovsten May 16, 2024
49dae45
Updated READ.ME with only REANA folder
AndriiPovsten May 17, 2024
fe5aff3
Updated location and README
AndriiPovsten May 17, 2024
f44235a
Deleted cloned README.md
AndriiPovsten May 17, 2024
f6a1b22
Deleted non related files
AndriiPovsten May 17, 2024
ce05c8d
Cleaning up the files
AndriiPovsten Jun 24, 2024
b15fb7b
get rid of Store files
AndriiPovsten Jun 24, 2024
cffbac5
get rid of Store file
AndriiPovsten Jun 24, 2024
6fb63c9
Updated final_merging file with extract_samples function
AndriiPovsten Jun 26, 2024
7b839db
Getting rid of packages that are not used
AndriiPovsten Jun 26, 2024
010ad45
Getting rid of packages that are not used
AndriiPovsten Jun 26, 2024
1f1a5f1
Rename REANA folder
AndriiPovsten Jun 28, 2024
7065a85
Getting rid of environment directory
AndriiPovsten Jun 28, 2024
890745f
Return accidentaly deleted files
AndriiPovsten Jul 5, 2024
2d196b2
putting files in original place
AndriiPovsten Jul 5, 2024
51a4fd1
set to original state utils folder
AndriiPovsten Jul 5, 2024
a642b45
synchronizing the script and the notebook
AndriiPovsten Jul 5, 2024
a31152b
Using harbor image(ideally needs a papermill dependecy), changing rea…
AndriiPovsten Jul 5, 2024
c659f92
Updated Snakefile for handling the file locations
AndriiPovsten Jul 6, 2024
1 change: 1 addition & 0 deletions analyses/cms-open-data-ttbar/README.md
@@ -18,6 +18,7 @@ This directory is focused on running the CMS Open Data $t\bar{t}$ analysis throu
| models/ | Contains models used for ML inference task (when `USE_TRITON = False`) |
| utils/ | Contains code for bookkeeping and cosmetics, as well as some boilerplate. Also contains images used in notebooks. |
| utils/config.py | This is a general config file to handle different options for running the analysis. |
| REANA/ | Folder with the modifications required for running the `ttbar_analysis_pipeline.ipynb` notebook on the REANA platform. |

#### Instructions for paired notebook

161 changes: 161 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/README.md
@@ -0,0 +1,161 @@
# REANA example - AGC CMS ttbar analysis with Coffea

This demo shows the submission of the [AGC](https://arxiv.org/abs/1010.2506) (Analysis Grand Challenge)
to [REANA](http://www.reana.io/) using Snakemake as the workflow engine.

## Analysis Grand Challenge

For a full explanation, please have a look at this documentation:
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7274936.svg)](https://doi.org/10.5281/zenodo.7274936)
[![Documentation Status](https://readthedocs.org/projects/agc/badge/?version=latest)](https://agc.readthedocs.io/en/latest/?badge=latest)

The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC.
This includes:

- columnar data extraction from large datasets,
- processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
- statistical model construction and statistical inference,
- relevant visualizations for these steps,

The physics analysis task is a $t\bar{t}$ cross-section measurement with 2015 CMS Open Data (see `datasets/cms-open-data-2015`).
The current reference implementation can be found in `analyses/cms-open-data-ttbar`.

### 1. Input data

We are using [2015 CMS Open Data](https://cms.cern/news/first-cms-open-data-lhc-run-2-released) in this demonstration to showcase an analysis pipeline. The paths of the input `.root` files are listed in `nanoaod_inputs.json`.
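The Snakefile below reads this JSON and expects each sample to map to its systematic variations, each carrying a list of file entries. A minimal illustrative excerpt (the paths are shortened placeholders, not the actual file contents):

```json
{
  "ttbar": {
    "nominal": {
      "files": [
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."},
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."}
      ]
    },
    "scaleup": {
      "files": [
        {"path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/..."}
      ]
    }
  }
}
```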
### Analysis code
The current coffea-based AGC implementation defines a coffea processor, which encapsulates most of the physics analysis details:
- event filtering and the calculation of observables,
- event weighting,
- calculating systematic uncertainties at the event and object level,
- filling all the information into histograms that get aggregated and ultimately returned to us by coffea.
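As a minimal sketch of the shape of such a processor (illustrative only; the class name, cut and observable are stand-ins, not the actual AGC code):

```python
import awkward as ak
import hist
from coffea import processor


class TtbarSketch(processor.ProcessorABC):
    """Illustrative processor: one histogram, one event-level cut."""

    def process(self, events):
        # event filtering: keep events with at least four jets (stand-in cut)
        selected = events[ak.num(events.Jet) >= 4]
        # book a histogram with an observable axis and a growable process axis
        h = (
            hist.Hist.new.Reg(25, 50, 550, name="observable", label="observable [GeV]")
            .StrCat([], name="process", growth=True)
            .Weight()
        )
        # fill an example observable per dataset
        h.fill(observable=ak.sum(selected.Jet.pt, axis=1), process=events.metadata["dataset"])
        return {"hist": h}

    def postprocess(self, accumulator):
        return accumulator
```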

The analysis takes the following inputs:

- ``nanoaod_inputs.json`` The list of input `.root` files.
- ``Snakefile`` The Snakemake workflow definition with the processing and merging rules.
- ``ttbar_analysis_reana.ipynb`` The main notebook, in which the files are processed and analysed.
- ``file_merging.ipynb`` Notebook to merge each sample's processed `.root` files into one file with unique keys.
- ``final_merging.ipynb`` Notebook to merge the per-sample files into the final histogram file.

### 2. Compute environment

To be able to rerun the AGC after some time, we need to
"encapsulate the current compute environment", for example to freeze the versions of the
packages our analysis is using. We achieve this by preparing a [Docker](https://www.docker.com/)
container image for our analysis steps.

We are using a modified version of the ``analysis-systems-base`` [Docker image](https://github.com/iris-hep/analysis-systems-base) with additional packages; the main one is [papermill](https://papermill.readthedocs.io/en/latest/), which allows running a Jupyter notebook from the command line with additional parameters.
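For illustration, this is essentially how the Snakefile below invokes the notebooks (the parameter values shown here are examples):

```console
$ papermill Reana/ttbar_analysis_reana.ipynb out.ipynb \
    -p sample_name ttbar__nominal \
    -p filename root://eospublic.cern.ch//eos/opendata/cms/upload/agc/1.0.0/... \
    -k python3
```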

In our case, the Dockerfile creates a conda virtual environment with all necessary packages for running the AGC analysis.
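A minimal sketch of such a Dockerfile (the base image, environment name and package list here are assumptions for illustration, not the actual image definition):

```dockerfile
FROM condaforge/mambaforge:latest

# create a conda environment with the analysis dependencies
RUN mamba create -y -n agc python=3.10 coffea uproot hist papermill \
    && mamba clean -afy

# make login shells (`bash -l`, as used by the Snakefile rules) activate the environment
RUN echo "conda activate agc" >> /root/.bashrc
```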

### 3. Kerberos authentication
Some data are located on EOS public storage (`eospublic.cern.ch`), so in order to process a large number of files the user should be authenticated with Kerberos.
In our case we achieve this by setting, in `reana.yaml`:
```yaml
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
```
If you are processing a small number of files (fewer than 10), you can set this option to `false`.
Alternatively, you can enable Kerberos authentication via the Snakemake rules, as sketched below.
For a deeper understanding, please refer to the [REANA documentation](https://docs.reana.io/advanced-usage/access-control/kerberos/).
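A sketch of the rule-level variant (assuming only the file-processing rule needs authenticated EOS access):

```python
rule process_sample_one_file_in_sample:
    resources:
        kerberos=True,
        kubernetes_memory_limit="3700Mi"
    # ... input, output and shell directives as in the Snakefile below
```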

### 4. AGC workflow with Snakemake
REANA provides support for the Snakemake workflow engine. To ensure fast execution of the AGC ttbar workflow, a two-level ("multi-cascading") parallelization approach is implemented with Snakemake.
In the initial step, Snakemake distributes jobs across separate nodes, each running `ttbar_analysis_reana.ipynb` on a single `.root` file.
Subsequently, after the completion of each rule, the individual files are merged into one file per sample.

Here is a high-level view of the AGC workflow:

```console
+-----------------------------------------+
| Take the CMS open data from nanoaod.json|
+-----------------------------------------+
                    |
                    v
+-----------------------------------+
|rule: Process each file in parallel|
+-----------------------------------+
                    |
                    v
+-----------------------------------------+
|rule: Merge created files for each sample|
+-----------------------------------------+
                    |
                    v
+----------------------------------------------+
|rule: Merge sample files into single histogram|
+----------------------------------------------+
```

### 5. Running the AGC on REANA

The [reana.yaml](reana.yaml) file describes the above analysis
structure with its inputs, code, runtime environment, computational workflow steps and
expected outputs:

```yaml
version: 0.8.0
inputs:
  files:
    - ttbar_analysis_reana.ipynb
    - nanoaod_inputs.json
    - fix-env.sh
    - corrections.json
    - Snakefile
    - file_merging.ipynb
    - final_merging.ipynb
    - prepare_workspace.py
  directories:
    - histograms
    - utils
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
outputs:
  files:
    - histograms_merged.root
```
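Before submitting, the specification can be checked with the standard ``validate`` command:

```console
$ reana-client validate -f Reana/reana.yaml
```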

We can now install the REANA command-line client, run the analysis and download the
resulting files:

```console
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # navigate to `cms-open-data-ttbar`; the reana.yaml file is inside the `Reana` folder
$ # run the AGC workflow
$ reana-client run -f Reana/reana.yaml -w reana-agc-cms-ttbar-coffea
$ # ... should finish in around 6 minutes if you select all files in the Snakefile
$ reana-client status
$ # list workspace files
$ reana-client ls
```
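Once the workflow has finished, the merged histogram file can be fetched from the workspace with the standard ``download`` command:

```console
$ reana-client download histograms_merged.root -w reana-agc-cms-ttbar-coffea
```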

Please see the [REANA-Client](https://reana-client.readthedocs.io/) documentation for
a more detailed explanation of typical `reana-client` usage scenarios.

### 6. Output results

The output is created as ``histograms_merged.root``, which can be further evaluated with a variety of AGC tools.
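As a quick sanity check, the merged file can be inspected with uproot (a short sketch; the available keys depend on which samples were processed):

```python
import uproot

# list all histograms in the merged output and print their integrals
with uproot.open("histograms_merged.root") as f:
    for key in f.keys(cycle=False):
        print(key, f[key].to_hist().sum())
```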




100 changes: 100 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/Snakefile
@@ -0,0 +1,100 @@
N_FILES_MAX_PER_SAMPLE = -1  # a negative value means "process all files"
download_sleep = 0
url_prefix = "root://eospublic.cern.ch//eos/opendata"
# In order to run the analysis from Nebraska, use this prefix instead:
# url_prefix = "https://xrootd-local.unl.edu:1094//"

import json


def extract_samples_from_json(json_file):
    """Write one path-list file per sample/condition and return the sample names."""
    output_files = []

    with open(json_file, "r") as fd:
        data = json.load(fd)

    for sample, conditions in data.items():
        for condition, details in conditions.items():
            sample_name = f"{sample}__{condition}"
            output_files.append(sample_name)
            with open(f"sample_{sample_name}_paths.txt", "w") as path_file:
                paths = [file_info["path"].replace("https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD",
                         "root://eospublic.cern.ch//eos/opendata/cms/upload/agc/1.0.0/") for file_info in details["files"]]
                path_file.write("\n".join(paths))
    return output_files


def get_file_paths(wildcards, max_files=N_FILES_MAX_PER_SAMPLE):
    """Return a list of at most MAX_FILES expected histogram paths for the given sample."""
    with open(f"sample_{wildcards.sample}__{wildcards.condition}_paths.txt") as fd:
        filepaths = fd.read().splitlines()
    # filepath[38:] strips the "root://eospublic.cern.ch//eos/opendata" prefix
    outputs = [f"histograms/histograms_{wildcards.sample}__{wildcards.condition}__" + filepath[38:] for filepath in filepaths]
    # a negative limit means "use all files"; slicing with -1 would silently drop the last one
    return outputs if max_files < 0 else outputs[:max_files]


samples = extract_samples_from_json("nanoaod_inputs.json")


def get_items(json_file):
    """Return the (sample, condition) pairs found in the input JSON."""
    samples = []
    with open(json_file, "r") as fd:
        data = json.load(fd)
    for sample, conditions in data.items():
        for condition in conditions:
            samples.append((sample, condition))
    return samples


rule all:
    input:
        "histograms_merged.root"


rule process_sample_one_file_in_sample:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="3700Mi"
    input:
        "Reana/ttbar_analysis_reana.ipynb"
    output:
        "histograms/histograms_{sample}__{condition}__{filename}"
    params:
        sample_name = '{sample}__{condition}'
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && python Reana/prepare_workspace.py sample_{params.sample_name}_{wildcards.filename} && papermill Reana/ttbar_analysis_reana.ipynb sample_{params.sample_name}_{wildcards.filename}_out.ipynb -p sample_name {params.sample_name} -p filename {url_prefix}{wildcards.filename} -k python3"


rule process_sample:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="1850Mi"
    input:
        "Reana/file_merging.ipynb",
        get_file_paths
    output:
        "everything_merged_{sample}__{condition}.root"
    params:
        sample_name = '{sample}__{condition}'
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && papermill Reana/file_merging.ipynb merged_{params.sample_name}.ipynb -p sample_name {params.sample_name} -k python3"


rule merging_histograms:
    container:
        "hub.opensciencegrid.org/iris-hep/analysis-systems-base:latest"
    resources:
        kubernetes_memory_limit="1850Mi"
    input:
        "everything_merged_ttbar__nominal.root",
        "everything_merged_ttbar__ME_var.root",
        "everything_merged_ttbar__PS_var.root",
        "everything_merged_ttbar__scaleup.root",
        "everything_merged_ttbar__scaledown.root",
        "everything_merged_single_top_s_chan__nominal.root",
        "everything_merged_single_top_t_chan__nominal.root",
        "everything_merged_single_top_tW__nominal.root",
        "everything_merged_wjets__nominal.root",
        "Reana/final_merging.ipynb"
    output:
        "histograms_merged.root"
    shell:
        "/bin/bash -l && source Reana/fix-env.sh && papermill Reana/final_merging.ipynb result_notebook.ipynb -k python3"


3 changes: 3 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/entrypoint.sh
@@ -0,0 +1,3 @@
#!/bin/bash

exec /bin/bash -l -c "$*"
57 changes: 57 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/file_merging.ipynb
@@ -0,0 +1,57 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"import hist\n",
"import uproot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_histograms = {}\n",
"for fname in glob.glob(f\"histograms/histograms_{sample_name}_*/**/*.root\", recursive=True):\n",
" print(f\"opening file {fname}\")\n",
" with uproot.open(fname) as f:\n",
" # loop over all histograms in file\n",
" for key in f.keys(cycle=False):\n",
" if key not in all_histograms.keys():\n",
" # this kind of histogram has not been seen yet, create a new entry for it\n",
" all_histograms.update({key: hist.Hist(f[key])})\n",
" else:\n",
" # this kind of histogram is already being tracked, so add it\n",
" all_histograms[key] += hist.Hist(f[key])\n",
"# save this to a new file\n",
"with uproot.recreate(f\"everything_merged_{sample_name}.root\") as f:\n",
" for key, value in all_histograms.items():\n",
" f[key] = value\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file = uproot.open(f\"everything_merged_{sample_name}.root\")\n",
"keys = file.keys()\n",
"print(f\"Keys for the everything_merged_{sample_name}.root:\", keys)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
25 changes: 25 additions & 0 deletions analyses/cms-open-data-ttbar/REANA/file_merging.py
@@ -0,0 +1,25 @@
import glob
import hist
import uproot

all_histograms = {}
for fname in glob.glob(f"histograms/histograms_{sample_name}_*/**/*.root", recursive=True):
    print(f"opening file {fname}")
    with uproot.open(fname) as f:
        # loop over all histograms in file
        for key in f.keys(cycle=False):
            if key not in all_histograms.keys():
                # this kind of histogram has not been seen yet, create a new entry for it
                all_histograms.update({key: hist.Hist(f[key])})
            else:
                # this kind of histogram is already being tracked, so add it
                all_histograms[key] += hist.Hist(f[key])
# save this to a new file
with uproot.recreate(f"everything_merged_{sample_name}.root") as f:
    for key, value in all_histograms.items():
        f[key] = value


file = uproot.open(f"everything_merged_{sample_name}.root")
keys = file.keys()
print(f"Keys for the everything_merged_{sample_name}.root:", keys)

GitHub Actions / linter: Ruff (F821) flags `sample_name` as undefined on lines 6, 18, 23 and 25 of this file; in the notebook version it is supplied externally as a papermill parameter.