
Notes from the CERN Analysis Preservation/DASPOS/RECAST workshop

February 2nd, 2017 at CERN

Agenda: https://github.com/cernanalysispreservation/analysispreservation.cern.ch/wiki/Joint-CERN-Analysis-Preservation-DASPOS-RECAST-workshop

Additional Materials (slides): https://indico.cern.ch/event/611389/

Attendees: Representatives from the LHC experiments, the CAP team, and members of DASPOS and RECAST

Notes

Introduction

  • Reminder of the original use case: preserving analyses for later access and reuse. As researchers submit content to CAP, it becomes an aggregator of analysis content within a collaboration. Hence the use cases also include, for example, easing the internal discoverability of analysis elements by offering comprehensive search functionality on top of the preserved content. See the user stories for more details.
  • This mini workshop aimed at getting everyone involved on the same page about what has been happening lately in the project, at showing the latest prototypes, at gathering feedback, and at deciding on the next steps.
  • The DASPOS project (presented by Mike Hildreth), consisting of a team of computer scientists, digital librarians and members of the scientific community, has been working very closely with CAP and RECAST recently. The focus has been on metadata aspects (i.e. ontologies) and on the reuse part (with Umbrella and RECAST). Another phase focuses on the integration of CAP into the research workflow, e.g. by building connectors to CAP.
  • RECAST collaborates closely with CAP to build the “reuse” part of the CAP environment so that an analysis can be rerun with a modified set of input parameters on the computing cloud. (See details below.)
  • Work on CERN Analysis Preservation is organized in three pillars:
    • Describe: In order to understand the workflow steps and the results of an analysis, it is crucial to identify the main elements of the analysis. These vary by collaboration/working group, so there is a challenge in balancing standardisation against completeness.
    • Capture: To access and reuse the preserved analysis later, it is necessary to capture the content (datasets, software, computing environment, workflows). Technical challenges arise from large files, elements reused between analyses, and versioning.
    • Reuse: Users accessing the content on CAP should be able to instantiate and rerun an analysis for the purposes of re-validation or future re-interpretation.

Describe Pillar

  • Overview chart of the describe pillar
  • The description of the analysis is done in JSON format. CAP uses a range of JSON Schemas that define which JSON fields are permitted. The CAP team aims to standardise these as much as possible, within the limitation that only very few preservation standards exist, while still allowing enough flexibility to adjust to the community practices of the LHC collaborations. Schemas are versioned so that CAP content can contain records of different schema versions (see the sketch after this list). https://github.com/cernanalysispreservation/analysispreservation.cern.ch/tree/master/cap/jsonschemas
  • The web forms for each collaboration (accessible through CAP) are one visual representation of these JSON schemas. Depending on the preferences and work environment of a collaboration, the functionality of the form can be adjusted to capture sufficient detail on the physics content of the analyses, on their typical workflows, and on their dependencies.
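To make the schema/record relationship concrete, below is a minimal sketch in Python using the jsonschema package. The schema and all field names are illustrative assumptions, not the actual CAP schemas linked above.

```python
# Minimal sketch: a versioned JSON Schema constraining an analysis record.
# The schema and field names are hypothetical, not the real CAP schemas.
from jsonschema import ValidationError, validate

ANALYSIS_SCHEMA_V1 = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "schema_version": {"type": "string"},
        "title": {"type": "string"},
        "collaboration": {"enum": ["ALICE", "ATLAS", "CMS", "LHCb"]},
        "datasets": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["schema_version", "title", "collaboration"],
}

record = {
    "schema_version": "0.0.1",
    "title": "Example dimuon analysis",
    "collaboration": "CMS",
    "datasets": ["/DoubleMu/Run2012B-22Jan2013-v1/AOD"],
}

try:
    validate(instance=record, schema=ANALYSIS_SCHEMA_V1)
    print("record conforms to schema version", record["schema_version"])
except ValidationError as err:
    print("record rejected:", err.message)
```

Because the schema version travels with each record, old records keep validating against the schema version they were written for while new submissions use the latest version.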

Capture Pillar

  • Overview chart of the capture pillar
  • The fundamental architecture to preserve analysis elements (datasets, software, computing environment) has been set up using the Invenio digital library software, with JSON Schema management, Elasticsearch for information retrieval, AngularJS for the frontend, PostgreSQL for persistence, and CERN EOS for backend storage. (A sketch of querying such a service follows this list.)
  • An updated CAP platform has been set up and is now in a prototype state. Naturally, the prototype will evolve further over the coming weeks and months as it is tested in typical end-user scenarios. Access restrictions follow the established CERN standards (SSO, e-groups).
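As an illustration of how such a stack could be consumed by clients, here is a hedged sketch of a full-text search over preserved records via an Invenio-style REST endpoint. The URL, parameters, and authentication scheme are assumptions, not the documented CAP API.

```python
# Hypothetical sketch of querying a CAP-like search endpoint backed by
# Elasticsearch; the URL, parameters, and token handling are assumptions.
import requests

CAP_SEARCH_URL = "https://analysispreservation.cern.ch/api/records/"  # assumed endpoint

def search_analyses(query, token):
    """Full-text search over preserved analysis records (illustrative)."""
    response = requests.get(
        CAP_SEARCH_URL,
        params={"q": query},
        headers={"Authorization": "Bearer " + token},  # token assumed to come via CERN SSO
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: find analyses that mention a given trigger path anywhere in
# their description, e.g. search_analyses("HLT_IsoMu24", token="...").
```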

Reuse Pillar

From the discussion: “to be done”

  • Help the CAP team build the connectors to the experiments' internal databases; this is considered crucial for the adoption of CAP as a tool.
    • A clearly designated responsible person on each side, collaboration and CAP team.
    • The CAP team can support building the connector, but needs help to know which information goes where, which fields are relevant, which ones change frequently, etc.
    • If possible, it would be great to have an API on the side of the experiments' internal tools that CAP can query (see the connector sketch after this list).
  • A draft version of a CAP “record”/“entry” should also be versioned and should be shareable within a small group. Request: enhance the draft mode so that a draft can be shared with selected people or only with the working group (rather than with the whole collaboration, which is the default now). Currently a draft record can be viewed and edited only by its creator.
  • Investigate different types of records: ordinary analysis records and “reference records” (e.g. for Rivet) that can be referenced/used by other records.
    • Need for many-to-many relationships.
    • Towards an analysis registry (suggestion by Kyle).
  • Need to underline that the usage of the CAP analysis platform can vary slightly from collaboration to collaboration, e.g. regarding when an analysis is put into CAP: from the start of the analysis process, or towards the end just before the analysis approval. This influences, for example, the cross-linking functionality. If researchers submit early, it may happen that the analysis never ends up as a publication. If the content is submitted only at the time of publication approval, there are fewer versions and always a publication associated with it. CAP can support both models; experiments are free to decide how they want to use CAP as a tool to help preserve analyses.
  • Need for export to HEPData or other services (destinations still to be decided). Lukas Heinrich will start building a trial connector to HEPData.
  • One use case of REANA within the CAP framework: rerun preserved analyses from CAP every month or so in order to detect breakage due to changed external dependencies (see the rerun sketch after this list).
  • Question of payment for storage space: a contribution by Tim Smith indicated that this needs more discussion. The current suggestion is to include it in the standard plans and allocations for the LHC experiments.
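As referenced above, a connector to an experiment-internal database could look roughly like the following Python sketch. Every URL, endpoint, and field name here is a hypothetical assumption for illustration; the real mapping depends on the information each collaboration provides.

```python
# Hypothetical connector sketch: pull analysis metadata from an experiment's
# internal database API and map it onto CAP JSON fields. All URLs and field
# names below are assumptions made for illustration.
import requests

EXPERIMENT_DB_URL = "https://internal-db.example.cern.ch/api/analyses/"  # assumed
CAP_DEPOSIT_URL = "https://analysispreservation.cern.ch/api/deposits/"   # assumed

def fetch_internal_metadata(analysis_id, token):
    """Query the experiment-internal tool for one analysis (illustrative)."""
    response = requests.get(
        EXPERIMENT_DB_URL + analysis_id,
        headers={"Authorization": "Bearer " + token},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

def to_cap_record(internal):
    """Map internal field names onto assumed CAP JSON fields."""
    return {
        "schema_version": "0.0.1",
        "title": internal["analysis_name"],
        "collaboration": internal["experiment"],
        "datasets": internal.get("input_datasets", []),
    }

def push_to_cap(record, token):
    """Create a draft CAP deposit from the mapped record (illustrative)."""
    response = requests.post(
        CAP_DEPOSIT_URL,
        json=record,
        headers={"Authorization": "Bearer " + token},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

This also shows why the collaborations' input matters: the field mapping in to_cap_record is exactly the “which information goes where” knowledge that only the experiments can supply.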
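The monthly rerun idea mentioned above could be driven by a small script like the one below, run from a cron job. The REANA endpoint and response fields are assumptions; the actual interface may differ.

```python
# Hypothetical sketch of the "rerun every month" check: trigger a preserved
# workflow and flag failures caused by changed external dependencies.
# The endpoint and response fields are assumptions, not the real REANA API.
import sys
import requests

REANA_START_URL = "https://reana.cern.ch/api/workflows/{name}/start"  # assumed

def rerun(workflow_name, token):
    """Start a preserved workflow run and return its reported status (illustrative)."""
    response = requests.post(
        REANA_START_URL.format(name=workflow_name),
        headers={"Authorization": "Bearer " + token},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("status", "unknown")

if __name__ == "__main__":
    # Intended for a monthly cron job, e.g.:
    #   0 3 1 * * rerun_check.py my-preserved-analysis
    status = rerun(sys.argv[1], token="...")
    if status not in ("created", "queued", "running", "finished"):
        sys.exit("rerun of preserved analysis failed; check external dependencies")
```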

Next steps

  1. Test prototype with first sets of analyses (for robustness) at CERN.
  2. Open the URLs outside of CERN and share them.
  3. Establish connectors to the experimental collaborations' databases (partially in parallel with the other activities).
  4. Test outside CERN with targeted users, covering submission and retrieval. Diversify testing according to the use cases presented. Deadline: mid-March (DPHEP workshop).
  5. First internal strategic note (internal to the collaborations and the CERN hierarchy).
  6. Widen the testing scope based on the internal communications (step 5), hopefully with the internal support of the collaborations.
  7. Beta-release.
  8. Strategic approval (internal to the collaborations and CERN).

Next meeting at the DPHEP workshop, March 13th to 15th; the exact time depends on the final workshop agenda.