AnxhelaDani edited this page Jan 28, 2016 · 4 revisions

CERN analysis preservation (CAP) system - use cases

This document describes potential scenarios for how CAP could be used, i.e. how it integrates with current practices. It is based on past user experience and on the pre-meetings with the collaborations over the past months and weeks. The use cases are divided into two sections: the first focuses on simple workflows that are observed today; the second presents two scenarios illustrating the potential complexity of the tool.

  • Simple use cases

History:

Within HEP, personal log-books were personal property, capturing all analysis details judged important by individual researchers for the progress of the analysis. The text was completed by cut-and-paste tables, plots, sections of software listings (describing e.g. data selection cuts), the Monte Carlo generators and samples used, etc. It also contained references to papers which provided useful information for the analysis, and “where-to-find” pointers to other ancillary documents: listings, outputs, etc.
In a well-kept log-book, the whole progress of an analysis was captured in time sequence, including all the trials and errors, speculations, dead ends, etc. (Typically, no pages were torn off and no information was deleted; wrong or useless entries were merely marked as such.)
For many of the use cases enumerated below, one had to go back to the level of personal log-books. Therefore, it seems that tools for keeping a modern, electronic log-book will be of prime importance in the future too. The main difference is that the personal version of the log-book will contain the full history (including the errors and “less relevant” information), while essentially all the use cases discussed below are based on the “latest view” of only that part of the information relevant for the “final” (or “present”) state of the analysis. The personal log-book could therefore be a “simpler” tool (like a Jupyter notebook) combined with an external version control tool (like git) taking care of “major snapshots”, while the tool itself covers the “minor” snapshots made during actual work.

Internal Notes were documents capturing the essence of those analyses which finally converged towards a physics result and were judged worth discussing within the collaboration in view of a later publication. Internal Notes relied heavily on the contents of personal log-books. They described all the details of an analysis necessary for a thorough discussion and a full understanding of the results; essentially, the Internal Note was a full-text version of the log-book with all the speculations, mistakes, dead ends, etc. eliminated, keeping only what had matured. Internal Notes were authored by the person(s) who did the analysis; however, they were the property of the collaboration. They had “versions” which captured the history of the scrutiny within the collaboration. In most cases, several iterations were necessary, with checking and re-checking, until the result was judged correct and acceptable by the collaboration.
Clearly, a modern electronic tool for the redaction of Internal Notes, allowing the capture of problems, questions, and answers based on new analysis, with sequential versions, would be of great help (though, of course, most redaction tools have these facilities already).
Electronic notebooks should be easy to link: in the case of CMS, for example, that information is (or should be) available in the DAS system, and the submission form provides the option to link to CDS, INSPIRE and Indico, where the presentation of the Internal Note’s content should ideally reside.

Physics Notes were documents meant to go PUBLIC, either as presentations at conferences (conference reports) or as real publications. Hence, they contained all the information necessary for a concise presentation of the result to the general public and to those who are interested. In most cases, a Physics Note constituted the first version of a real publication, which was refined and polished by the Editorial Board prior to submission to a journal.

These three “levels” of an analysis are still essential today. However, in contrast to previous times, all this can now be done efficiently with sophisticated electronic tools, like CAP, which help by capturing information from the start of an analysis and can preserve all research objects/materials throughout the whole research project.

Below, a few “use-cases” are enumerated, ordered by increasing level of sophistication.

USE CASES

1. The person having done (part of) an analysis is leaving the collaboration and has to hand over the know-how to other collaboration members.

He will have to hand over his electronic log-book, which captures the whole history of his work (Internal Notes on the subject, if they exist, are already the property of the collaboration).
The full understanding of the log-book contents would be facilitated if, from the beginning, a log-book tool with standard features were used (e.g. allowing the “line of progress” to be highlighted without dismissing the trials and errors or dead ends).
The question of access restrictions thus arises. Should the electronic log-books be regarded as the exclusive property of individual researchers (like in the past) or alternatively, should there be proviso for (limited) access, e.g., for the group of people working on the same subject, or for the whole collaboration?
This can be decided for each collaboration individually and can also be flexible: access can start closed or very restrictive and then open up upon publication (either of the internal note or of the paper).
In any case, for the sake of a painless transfer of information, good log-book “tools” should be provided which allow easy recording, linking, and modification without deleting (only adding information), in a standard way, accepted by the whole collaboration.

CAP can help by capturing information from the very beginning, e.g. through a connection with the respective job databases. This will allow all the changes made to the analysis code to be captured, and thus resemble an electronic log-book. Seeing as the job databases are usually accessible to the whole collaboration, the information captured should be searchable by the whole collaboration or the respective subgroups (note: even in a common infrastructure, one could imagine scenarios in which only the submitter can “read” the detailed information). This is an important aspect to be discussed within CAP development and internally in the collaborations.


2. A newcomer would like to join a group working on some physics subject.

Clearly, he NEEDS to get access to the electronic log-book of the individual(s) already working on the subject in order to avoid duplicating previous tests, speculations, dead ends, etc. An open, annotated or structured electronic log-book would allow data and its documentation to be found easily.

CAP and the search options it will provide (e.g. by physics group, data set, final-state particles...) will allow newcomers to get an overview of the analyses already ongoing or finished, and thus ideally de-duplicate efforts.
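As an illustration of what such search options might look like, the sketch below filters a small in-memory set of analysis records by metadata. The record fields ("final_state", "working_group") and the function are purely hypothetical, not the actual CAP schema or API.

```python
# Illustrative sketch only: the record fields are hypothetical and do not
# reflect the actual CAP data model.

def search_analyses(records, final_state=None, working_group=None):
    """Return records matching all of the given metadata filters."""
    results = []
    for rec in records:
        if final_state and final_state not in rec.get("final_state", []):
            continue
        if working_group and rec.get("working_group") != working_group:
            continue
        results.append(rec)
    return results

analyses = [
    {"title": "Z -> mumu cross-section", "final_state": ["mu+", "mu-"],
     "working_group": "Standard Model"},
    {"title": "H -> gamma gamma search", "final_state": ["gamma", "gamma"],
     "working_group": "Higgs"},
]

# A newcomer could see at a glance which analyses already cover a final state.
matches = search_analyses(analyses, final_state="mu+")
```

The same filtering idea would extend to any metadata field CAP captures (datasets, software versions, trigger paths, etc.).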


3. In a large collaboration, it may occur that two (groups of) people work independently on the same subject. This may raise painful human problems but may be beneficial for the robustness and credibility of a result.

Typically, the two groups will “confront” each other at the level of the Internal Notes, where the results are presented for the first time, side by side, to the collaboration. It is likely that only one of the analyses will be retained for publication, while the other serves as “back-up”; in other circumstances, they might be asked to combine the methods. The choice will be made on the basis of a detailed scrutiny of both (going back to the log-books) and on objective criteria, e.g.: which of the two analyses is more efficient? Which is safer and more stable (that is, which has the smaller systematic errors)?
In the process of comparing, it is essential to get easy access to all ancillary material (datasets, selections, analysis software, etc.) of each of the two analyses, using direct links defined within the (electronic) Internal Notes.

In the future, these cases will become easier to manage and to evaluate. The increased documentation, accessibility, and transparency of the analyses carried out within the collaboration will help further such parallel investigations. Groups should know about each other early, to coordinate joint or parallel efforts. CAP allows them to search for ongoing and previous analyses to build on or reuse. In case a conflict has to be resolved, both analyses will be documented in CAP and can be reviewed or even combined.


4. There is a conflict between results of two collaborations on the same subject (before or after publication).

Here the procedure is similar to that described under 3), but this time the comparison may start at the Physics Note or publication level. However, my experience is that one is very soon led back to the Internal Note level, discussing the glorious detail that has disappeared from the Physics Notes or publications.

CAP will help to resolve such conflicts by providing detailed information that allows both analyses to be reproduced and the origin of the conflicting results to be traced. Ideally, the analyses can then be combined into a joint result and publication.


5. A previous analysis has to be repeated.

The reasons can be various and fairly natural: a) the knowledge of the experimental environment (e.g., detector properties) has improved, and a re-analysis promises a more precise result; b) older data have to be statistically combined with new data for a smaller statistical error; c) the theory input (which is an essential part of most analyses) has considerably improved since the last publication, and a re-analysis promises a more relevant overall result.
In all these cases, the re-analysis is likely to start at the Physics Note level but soon descends to the Internal Note level with its cascade of links to ancillary material and even log-books.

CAP will make the repetition of analyses easier as it will provide all the information needed for replication and thus will allow the adaptation of code to new/other datasets or the improvement of a previous analysis.


6. Data from several experiments, on the same physics subject, have to be statistically combined (e.g. LHC “Legacy” papers).

It is erroneous to think that this can be done efficiently at the level of the (published) results. On the contrary, one has to go back to the glorious details of each experiment, find a common “platform” where the data (which differ from experiment to experiment in their information content and format) can be handled equivalently and summed. One has to analyze the systematic errors of each experiment in order to recognize common and individual systematic errors, which have to be handled differently.
Typically, dedicated working groups, with members from the different experiments, are dealing with such a combination and, again, the most likely level to start with would be that of the (electronic) Internal Note with its links to all the ancillary material.
(However, here a new question of access policy may arise since Internal Notes are internal by their nature while the dedicated working group holds members from several collaborations.)

CAP, or even the CODP, can facilitate this work by mapping the collaboration-specific data processing frameworks to a general framework that allows the comparison or combination of data from different collaborations. An ontology of consistent search terms will make it possible to identify suitable datasets and will provide information about the software environment needed to analyse the data. This was addressed at the DASPOS/CERN workshop on May 18th/19th 2015 [https://indico.cern.ch/event/394900/].
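A minimal sketch of what such a mapping could look like: each collaboration's metadata keys are translated into a shared vocabulary before records are compared. Both the collaboration-specific field names and the common terms below are invented for illustration; they are not the actual CAP/CODP ontology.

```python
# Hypothetical field names: neither the collaboration schemas nor the common
# vocabulary shown here are the real CAP/CODP ontology.

COMMON_TERMS = {
    "cms":  {"primary_dataset": "dataset", "cmssw_version": "software_version"},
    "lhcb": {"bk_path": "dataset", "davinci_version": "software_version"},
}

def to_common(collaboration, record):
    """Rename collaboration-specific metadata keys to the shared vocabulary."""
    mapping = COMMON_TERMS[collaboration]
    return {mapping.get(key, key): value for key, value in record.items()}

cms_rec = to_common("cms", {"primary_dataset": "/DoubleMu/Run2012A",
                            "cmssw_version": "5_3_x"})
lhcb_rec = to_common("lhcb", {"bk_path": "/LHCb/Collision12/DIMUON.DST",
                              "davinci_version": "v33r1"})
# Both records now expose the same "dataset" and "software_version" keys,
# so they can be searched and compared side by side.
```

A real ontology would of course also have to reconcile units, formats and semantics, not only key names.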


7. A working group or management member within a collaboration wishes to know who else has worked on a particular dataset, software piece or MC.

Currently, there is often a missing link between the “primary data” and user-generated files. This makes it difficult to understand who does what and when, i.e. to build on each other’s work.

The CAP platform will provide visibility and central storage for user-created data. These datasets are currently available only to the researcher analysing them, and perhaps to his/her group; there are no options to search for them or to check whether someone else is or was working on the same subject. They are stored locally, neither preserved nor accessible to the collaboration. CAP will offer researchers permanent storage and documentation for their created data, and will allow colleagues to search these datasets and their metadata, making the collaboration’s analyses more accessible within the experiments. Unlike the use cases described before, this one goes beyond metadata-related features and actually tackles the preservation of data that would otherwise be lost to the collaboration.


8. A presentation or publication is submitted for internal/collaboration review and approval: lack of comprehensive metadata.

A collaboration member would like to present his/her analysis at a conference (or submit a paper to a journal). S/he is requested to submit the details of the analysis for peer review and approval. This requires documentation of the data and parameters used. Thus, one needs to find and compile these details again.

CAP could make this much easier. The researcher has been using the tool to take snapshots of his/her progress. Furthermore, s/he was asked to enter the analysis details into CAP when requesting data access earlier. Thus, all the information needed for the approval is already in there. S/he chooses the kind of approval s/he needs (conference, paper…) and receives the relevant information. One could also imagine a workflow in which the details are submitted automatically to the committee, or in which the review committee is granted access to the analysis record in CAP.
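One way such a workflow could look in code, sketched with hypothetical field names and approval types (none of which are an actual CAP workflow definition): given the analysis record already in CAP, the fields relevant for the chosen approval type are compiled automatically.

```python
# Illustrative only: the approval types and required fields are assumptions.

REQUIRED_FIELDS = {
    "conference": ["title", "datasets", "selection_cuts", "results"],
    "paper":      ["title", "datasets", "selection_cuts", "results",
                   "systematics", "internal_note"],
}

def compile_approval(record, approval_type):
    """Extract from the record the details a review committee needs."""
    missing = [f for f in REQUIRED_FIELDS[approval_type] if f not in record]
    if missing:
        raise ValueError("record incomplete, missing: " + ", ".join(missing))
    return {f: record[f] for f in REQUIRED_FIELDS[approval_type]}

record = {
    "title": "Example analysis", "datasets": ["/Data/Run2012B"],
    "selection_cuts": "pt > 20 GeV", "results": "see attached plots",
    "systematics": "5% luminosity", "internal_note": "CDS-1234",
}
package = compile_approval(record, "conference")
```

The same record could later be re-compiled for a paper submission, with the stricter field list, without the researcher re-collecting anything.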


9. Preparing for Open Data Sharing.

Today, there is as yet no comprehensive tool to prepare more complex datasets for public release; the documentation and assembly have to be done almost from scratch.

CAP will offer APIs to enable easy information exchange between CAP, CODP, INSPIRE, CDS and other platforms, including external ones. Thus, data can be published openly with a single click, which either pushes it to the requested publishing platform or links the analysis information from that platform and makes it publicly accessible on CAP.
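A sketch of what such an exchange could look like. The platform names come from the text, but the export function and payload formats are purely illustrative assumptions about APIs that do not yet exist in this form.

```python
# Purely illustrative: CAP's actual APIs are still to be defined; the payload
# formats and the dispatch table below are assumptions.

def export_record(record, platform):
    """Build the payload that would be pushed to the chosen platform."""
    exporters = {
        "CODP":    lambda r: {"title": r["title"], "files": r["files"]},
        "INSPIRE": lambda r: {"title": r["title"], "doi": r.get("doi")},
        "CDS":     lambda r: {"title": r["title"], "report_number": r.get("note")},
    }
    if platform not in exporters:
        raise ValueError("unsupported platform: " + platform)
    return exporters[platform](record)

record = {"title": "Example analysis", "files": ["selection.root"],
          "doi": "10.0000/example"}
payload = export_record(record, "CODP")
```

The "one button" in the text would then amount to choosing the target platform and sending the corresponding payload over the API.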


10. Potential future scenario: a publication is sent to a journal which requests data access for peer review.

Right now, there is no tool to enable such a workflow (which goes beyond HEPData dissemination practices). More and more high-class journals demand access to the data for their peer review and for the published dataset. (Note that such requests could also come from funders.)

CAP could enable a “closed”, invitation-only access service to allow reviewers limited access to the data and results under review. If required, results could be published openly afterwards (see #9); if not, they stay untouched on CAP.


Conclusion: plenty of use cases can profit from a kind of “standard annotated log-book” such as CAP, which links to ancillary material or to the location of the material captured in the submission (or provides the files directly).

  • What could the user experience look like in the future?
    [Two more complex scenarios to illustrate the user experience in CAP. Please note that both are examples envisioning how it could look (based on knowledge from early summer 2015); the described functionalities are subject to change depending on technological and internal (political) developments.]

Scenario A (2015): Student S (CMS) goes through the published analyses from 2012 and documents them in CAP

Person S starts his search for a paper in CAP by a keyword/author/... search through the published analyses that were imported from CADI (CMS Analysis Database Interface). He chooses a record that doesn’t have any analysis information attached [automatically imported records that have not yet been manually edited will be highlighted in the search results]. He opens the paper (through a link to INSPIRE or CDS) and the presentation of the analysis results (which ideally is linked from the record; if this link does not exist, he will search for the presentation on Indico and create the link). From the information in there, he will try to fill in as much metadata as possible to allow future searches over e.g. final-state particles, primary datasets, etc. Ideally, he works in close collaboration with some physics analysis groups who can help him create links to the primary datasets and even to the software.

Scenario B (in 2016/17): Experimentalist E (LHCb) starts a new analysis by using CAP

Person E starts a new submission in CAP. In the first step, he enters the Stripping Line for the DST(s) (Data Summary Tape) he will use for his analysis, by providing the path in the LHCb bookkeeping database (BKDB). Based on this location link, important metadata for search will be extracted automatically, such as the year, the reconstruction software, the stripping software or the particles analysed (plus possibly trigger information, magnet status, etc.). E then has the option to edit or correct the extracted metadata. Furthermore, he can add other data paths or MC simulation data. In the next step, he adds the analysis code and related information. Part of the metadata (such as the DaVinci version) might already be pre-filled, as it was possible to extract this information from the DST. He then has two options to provide the code he used to do the first filtering and selection of events:

  • a link to Urania, the LHCb high-level physics analysis software repository
  • a direct upload of his python code to CAP

In addition to the code, he has to provide run instructions. These can either be copied into a provided text field or uploaded as a README file. In a last step, he has to provide the output data of this first selection, on which his next analysis steps will be based. This can be a direct file upload or a link to e.g. a public AFS folder. In any case, a copy of the output ROOT file will be stored long-term in CAP.
The next analysis steps follow the same idea: the input data is the output ROOT file from the previous step; E uploads or links his code and the corresponding run instructions, together with the output files (data and MC) from this analysis step.
Once his analysis is advanced enough to be presented, he can share the record with his work group leader to get feedback and/or the approval to take it to the review committee. The review committee then has the option to check and repeat his analysis with the information, code and data provided in CAP.
He can later add the internal notes through a link to CDS, as well as the internal approval presentation via a link to Indico. In addition, the discussion about the analysis can be captured through a link to the analysis discussion mailing list.
After the publication of the analysis, he should add the publication information (DOI, arXiv, link to INSPIRE record) to permanently link the analysis to the published paper.
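The stepwise structure of Scenario B could translate into a submission record along these lines. All field names, paths and values are invented for illustration and do not reflect the actual CAP data model or the LHCb bookkeeping schema; the point is only that each step's input is the previous step's output, so the chain stays traceable.

```python
# Invented field names and values; not the actual CAP or LHCb schema.

submission = {
    "input_data": {
        "bk_path": "/LHCb/Collision12/Beam4000GeV/Stripping20/DIMUON.DST",
        "extracted_metadata": {"year": 2012, "stripping": "Stripping20",
                               "davinci_version": "v33r1"},
    },
    "steps": [
        {"code": "selection.py", "instructions": "README",
         "output": "selection.root"},
        {"code": "fit.py", "instructions": "run: python fit.py selection.root",
         "output": "fit_results.root"},
    ],
    "links": {"internal_note": None, "presentation": None, "publication": None},
}

def step_inputs(submission):
    """First step reads the bookkeeping data; each later step reads the
    previous step's output, making the analysis chain reproducible."""
    inputs = [submission["input_data"]["bk_path"]]
    inputs += [step["output"] for step in submission["steps"][:-1]]
    return inputs
```

The "links" section would be filled in gradually, as described above, once the internal note, the approval presentation and finally the publication exist.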