How to best preserve the computing environment (e.g., software versions and dependencies)? #12
-
Group discussion (11 June 2024)
Software versions/dependencies
Containers
Virtual Machines
Ansible
ToDo
Notes
Open questions
-
Apptainer only runs on Linux. It needs to run in a virtual machine on Windows and Mac. Docker has a Windows installer.
-
ENCORE aims to provide a method for the broader scientific community. It should not rely too much on specific tools. The only exception at this moment is git/GitHub, which is already integrated into ENCORE through, for example, the github.txt file and the step-by-step guide. Our goal is to avoid imposing particular tools on researchers to maintain flexibility and inclusivity. However, reproducibility is crucial for 'environments,' necessitating an exception for widely accepted tools. Here’s the plan for selecting these tools:
Selected tools should be OS independent. Additionally, we need to account for programming languages beyond R and Python, such as MATLAB, C++, etc. We should evaluate the applicability of environment managers like Conda for these languages to ensure comprehensive support. GitHub Codespaces also needs to be considered since we are already using GitHub. In summary, while extending ENCORE, our focus will be on selecting tools that ensure reproducibility, are widely accepted, and maintain flexibility across different programming environments.
-
UTM
Just a side note, FYI: VirtualBox isn’t the best tool for Apple Silicon-based Macs (i.e., all the new machines: M1, M2, M3). I am currently playing around with UTM (which is mentioned in the link in the above article as one of the better ones). It works pretty OK, but of course I am stumbling on some issues …
Anyway, a lot of things to explore and help each other out with!
-
Podman
I didn't look into this, but perhaps Podman (https://podman.io/) is useful? This was suggested by Frans van der Kloet (UvA). Podman is a daemonless, open-source, Linux-native tool designed to make it easy to find, run, build, share, and deploy applications using Open Container Initiative (OCI) containers and container images. Can I run Podman on Windows? Very instructive video [here].
-
C. Titus Brown
This blog post provides some interesting thoughts on reproducibility using containers/VMs: http://ivory.idyll.org/blog/2017-pof-software-archivability.html
-
Should we, alternatively, disseminate a project online (in a controlled environment)? What about performance issues? Is it possible to decide (upfront) which projects should use containers/VMs? A container/VM should not restrict further use, extension, or modification by peers.
-
Computing environment
One challenge that is only partially addressed by ENCORE concerns the preservation of the full computing environment. This environment is defined by (interdependencies of) the operating system, software tools, versions and dependencies, programming language libraries, etc. It is very important that the computing environment is preserved as much as possible.
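One approach is proposed by Gruning and co-workers (Gruning, 2018). They proposed a software stack of interconnected technologies to preserve the computing environment (Figure 1). This stack comprises three layers: package management (Conda), containerization (e.g., Docker or Singularity), and virtualization (virtual machines).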
Gruning, B., Chilton, J., Koster, J., Dale, R., Soranzo, N., van den Beek, M., Goecks, J., Backofen, R., Nekrutenko, A., & Taylor, J. (2018). Practical Computational Reproducibility in the Life Sciences. Cell Syst, 6(6), 631-635. https://doi.org/10.1016/j.cels.2018.03.014
Figure 1. Software stack of interconnected technologies that enables computational reproducibility. For R projects, renv might be used as an alternative to conda. Docker or Singularity may be used for containerization. Copied and modified from Gruning (2018).
Brief explanation of the stack
Since Conda packages are frequently updated, recreating a Conda virtual environment at a later point in time from a specification that lists only the top-level tools and versions may easily result in slightly different dependencies being installed. Therefore, containerization is needed to provide the next level of isolation. Containers (e.g., Docker, Singularity) run directly on the host operating system’s kernel but encapsulate every other aspect of the runtime environment. Note that Docker still has security concerns, while Singularity provides better security for multi-user systems and can run Docker images. Containers provide isolated and reproducible compute environments, but still depend on the operating system kernel version and the underlying hardware. Even greater isolation can be achieved through virtualization, which runs the analysis within an emulated virtual machine (VM) with precisely defined hardware specifications. Virtualization provides the third layer of the reproducibility stack.
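To illustrate why a top-level-only specification is fragile, here is a minimal Python sketch that flags conda-style specs that do not fix both a version and a build string. The helper names and the example specs (including the build string) are hypothetical and only for illustration; this is not part of conda, and it assumes the `name=version=build` form that a full environment export typically uses.

```python
def is_fully_pinned(spec):
    """True only for specs of the form name=version=build (assumed format)."""
    spec = spec.strip()
    # Range operators, wildcards, or inner spaces mean the spec is not exact.
    if any(ch in spec for ch in "<>*!~|, "):
        return False
    parts = spec.split("=")
    # Exactly three non-empty parts: name, version, build.
    return len(parts) == 3 and all(parts)


def unpinned_specs(specs):
    """Return the specs that could resolve differently at a later date."""
    return [s for s in specs if not is_fully_pinned(s)]


# Hypothetical environment specification mixing loose and exact entries.
example_specs = [
    "python=3.11",                 # version only: build may still change
    "numpy",                       # no version at all
    "pandas>=2.0",                 # version range
    "samtools=1.19.2=h50ea8bc_1",  # version and (made-up) build string
]
print(unpinned_specs(example_specs))
# → ['python=3.11', 'numpy', 'pandas>=2.0']
```

Anything this check reports is a place where a later re-creation of the environment may pull in different packages, which is the gap that containers and VMs are meant to close.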
Conda vs pip
Conda and pip serve different use cases despite having similar features. Conda is a language-agnostic package and environment manager, while pip (https://pypi.org/project/pip/) is a Python package manager.
Conda is a packaging tool and installer that aims to do more than pip does: it handles library dependencies outside of the Python packages as well as the Python packages themselves. Conda also creates virtual environments, like virtualenv does.
With conda you can install much more than just Python libraries. You can install entire software stacks such as Python + Django + Celery + PostgreSQL + nginx. You can install R, R libraries, Node.js, Java programs, C and C++ programs and libraries, Perl programs, etc. conda has an env system that allows you to have all of these installed across multiple different environments. Also, conda is able to do all these software and package installations in an isolated, userspace manner. This is critical because it means that you can install complex software stacks on a system without needing root privileges. In a lot of ways, conda serves as a lightweight userspace alternative to Docker for isolating software stacks.
On the other hand, pip can only install Python packages, and on multi-user systems it quite often breaks installations by overwriting global system dependencies and/or the user's dependency stack. This is why people who rely only on pip MUST use virtualenv, but even then pip sometimes installs to the wrong place. In general, pip is risky to use outside an isolated environment: it is easy to break your user Python library stack or even the entire server's installation. Tread carefully any time you use the globally available system-installed pip.
The thing that a lot of people do not understand is that conda and pip are NOT mutually exclusive. In fact, you are supposed to use them together. The intended usage is to run the pip that lives inside your conda environment rather than the system-wide pip.
This is an important distinction. When you install conda, it brings its own version of pip (and Python) that will automatically become available when you activate conda (conda activate myenv), and will automatically install packages into your currently activated conda env. See also https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html.
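A quick way to check which pip you are about to use is to compare the running Python interpreter's prefix with the `CONDA_PREFIX` environment variable that `conda activate` sets. The sketch below is a minimal illustration under that assumption; the function name is hypothetical.

```python
import os
import sys


def pip_installs_into_conda_env(prefix=None, conda_prefix=None):
    """True if this interpreter (and thus `python -m pip`) belongs to the active conda env.

    Both arguments are optional and exist mainly to make the check testable.
    """
    prefix = sys.prefix if prefix is None else prefix
    if conda_prefix is None:
        # CONDA_PREFIX is set by `conda activate`; it is absent outside conda.
        conda_prefix = os.environ.get("CONDA_PREFIX")
    if not conda_prefix:
        return False
    return os.path.realpath(prefix) == os.path.realpath(conda_prefix)


if __name__ == "__main__":
    if pip_installs_into_conda_env():
        print("pip will install into the active conda environment")
    else:
        print("warning: this Python is not the active conda environment's Python")
```

Running `python -m pip install ...` with the same interpreter then installs into that environment rather than into a system-wide location.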
Issues may arise when using pip and conda together. When combining conda and pip, it is best to use an isolated conda environment. Only after conda has been used to install as many packages as possible should pip be used to install any remaining software. If modifications are needed to the environment, it is best to create a new environment rather than running conda after pip. When appropriate, conda and pip requirements should be stored in text files.
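The "conda first, pip last" order can be captured in a single environment file. The following Python sketch renders a minimal `environment.yml` in which conda dependencies come first and pip-only packages go into the nested `pip:` section. The function name, environment name, and package specs are hypothetical, and channels are deliberately left out.

```python
def render_environment_yml(name, conda_specs, pip_specs=()):
    """Render a minimal environment.yml: conda packages first, pip packages last."""
    lines = [f"name: {name}", "dependencies:"]
    lines += [f"  - {spec}" for spec in conda_specs]
    if pip_specs:
        # pip itself should be a conda dependency so the env's own pip is used.
        if not any(spec.split("=")[0] == "pip" for spec in conda_specs):
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {spec}" for spec in pip_specs]
    return "\n".join(lines) + "\n"


# Hypothetical project: two conda packages and one package only available via pip.
print(render_environment_yml(
    "myenv",
    ["python=3.11", "numpy=1.26"],
    ["somepkg==1.0"],
))
```

The resulting text can be saved as `environment.yml` and kept with the project, so that both the conda and the pip requirements are stored in one version-controlled file.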
And as always, make sure to save the installation commands with version-locked dependencies for both conda and pip for every project where they are used.
Use in ENCORE
Since software versions and dependencies are a main obstacle to reproducibility, it is required to use a package manager of some sort (e.g., conda, renv) and to ensure that Compendium Recipients can reproduce the software environment. For example, conda can export the environment using 'conda list --explicit > myenv.txt', which the Compendium Recipient can import using 'conda create --name myenv --file myenv.txt'. Note that such an explicit file is specific to the platform on which it was created.
Preferably, also provide a file with versions of all software (including packages, libraries) used in the project. For example, in R one can use 'installed.packages()', while in conda one can use 'conda list --explicit > myenv.txt'.
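For a plain Python project, a comparable version file can be produced with the standard library alone. The sketch below is one possible approach, not an ENCORE requirement: it lists every installed distribution as `name==version` and writes the list, together with the Python version and platform, to a text file. The function names and the output file name are hypothetical.

```python
import platform
from importlib.metadata import distributions


def installed_versions():
    """Return sorted 'name==version' strings for all installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_version_report(path):
    """Write interpreter, platform, and package versions; return the package count."""
    pins = installed_versions()
    header = f"# Python {platform.python_version()} on {platform.platform()}"
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join([header] + pins) + "\n")
    return len(pins)


if __name__ == "__main__":
    count = write_version_report("versions.txt")  # hypothetical file name
    print(f"recorded {count} packages")
```

Such a file documents what was installed; it complements, rather than replaces, the package manager's own export.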
In the context of ENCORE, we are still working on best practices for using containers and/or VMs.