How to best preserve the computing environment (e.g., software versions and dependencies)? #12

ahcvankampen · 2024-05-23T08:14:11Z

ahcvankampen
May 23, 2024
Maintainer

Computing environment

One challenge that is only partially addressed by ENCORE concerns the preservation of the full computing environment. This environment is defined by (interdependencies of) the operating system, software tools, versions and dependencies, programming language libraries, etc. It is very important that the computing environment is preserved as much as possible.

One approach was proposed by Gruning and co-workers (Gruning, 2018). They proposed a software stack of interconnected technologies to preserve the computing environment (Figure 1). This stack comprises:

Conda to manage software packages and dependencies (https://docs.conda.io/en/latest/). Conda provides a virtual execution environment. With conda you can also manage R packages, but renv is a good alternative (https://rstudio.github.io/renv/index.html).
Containers such as Docker (https://www.docker.com/) or Singularity (https://docs.sylabs.io/guides/3.5/user-guide/introduction.html) to provide an isolated environment for the software. That is, a container bundles everything required to run the software except the operating system kernel, which it shares with the host.
Virtual Machines (VM) for hardware virtualization to achieve complete isolation and reproducibility. Virtualization can be achieved via (commercial) clouds or, on a local computer, using for example Workstation Player from VMware (https://www.vmware.com/) or VirtualBox from Oracle (https://www.virtualbox.org/).

Gruning, B., Chilton, J., Koster, J., Dale, R., Soranzo, N., van den Beek, M., Goecks, J., Backofen, R., Nekrutenko, A., & Taylor, J. (2018). Practical Computational Reproducibility in the Life Sciences. Cell Syst, 6(6), 631-635. https://doi.org/10.1016/j.cels.2018.03.014

Figure 1. Software stack of interconnected technologies that enables computational reproducibility. For R projects, renv might be used as an alternative to conda. Docker or Singularity may be used for containerization. Copied and modified from Gruning (2018).

Brief explanation of the stack

Since Conda packages are frequently updated, a Conda virtual environment that is created by specifying only the top-level tools and versions may end up with slightly different dependencies when it is recreated later from the same specification. Therefore, containerization is needed to provide the next level of isolation. Containers (e.g., Docker, Singularity) run directly on the host operating system’s kernel but encapsulate every other aspect of the runtime environment. Note that the Docker daemon typically runs with root privileges, which raises security concerns on shared machines, while Singularity was designed for multi-user systems and can run Docker images. Containers provide isolated and reproducible compute environments, but still depend on the operating system kernel version and the underlying hardware. Even greater isolation can be achieved through virtualization, which runs the analysis within an emulated virtual machine (VM) with precisely defined hardware specifications. Virtualization thus provides the third layer of the reproducibility stack.

Conda vs pip

Conda and pip serve different use cases despite having similar features. Conda is a language-agnostic package and environment manager, while pip (https://pypi.org/project/pip/) is a Python package manager.

Conda is a packaging tool and installer that aims to do more than pip does: it handles library dependencies outside of the Python packages as well as the Python packages themselves. Conda also creates virtual environments, as virtualenv does.

With conda you can install much more than just Python libraries. You can install entire software stacks such as Python + Django + Celery + PostgreSQL + nginx. You can install R, R libraries, Node.js, Java programs, C and C++ programs and libraries, Perl programs, etc. conda has an env system that allows you to have all of these installed across multiple different environments. Also, conda is able to do all these software and package installations in an isolated, userspace manner. This is critical because it means that you can install complex software stacks on a system without needing root privileges. In a lot of ways, conda serves as a lightweight userspace alternative to Docker for isolating software stacks.

On the other hand, pip can only install Python packages, and on multi-user systems it can break global system dependencies and/or the user's dependency stack when it is run outside an isolated environment. This is why people who rely only on pip MUST use virtualenv (or venv), but even then pip sometimes installs to the wrong places. In general, it is easy to damage your user Python library stack, or even the entire server's installation, with pip. Tread carefully any time you use the globally available system-installed pip.

The thing that a lot of people do not understand is that conda and pip are NOT mutually exclusive. In fact, you are supposed to use them together. The intended usage is this:

First, you install and/or set up conda for your project (including a conda env as needed) and install all the packages you can from conda channels.
Second, with the conda env activated, you use the pip included in that env to install any remaining pip-only dependencies into your project's conda env.

This is an important distinction. When you install conda, it brings its own version of pip (and Python). Once pip is installed in a conda env, it becomes available when you activate that env (conda activate myenv) and automatically installs packages into it. See also https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html.

conda install -n myenv pip
conda activate myenv
pip <pip_subcommand>

Issues may arise when using pip and conda together. When combining conda and pip, it is best to use an isolated conda environment. Only after conda has been used to install as many packages as possible should pip be used to install any remaining software. If modifications are needed to the environment, it is best to create a new environment rather than running conda after pip. When appropriate, conda and pip requirements should be stored in text files.

And as always, make sure to save the installation commands with version-locked dependencies for both conda and pip for every project where they are used.
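
The two-step order described above can also be recorded in a single environment file, so that the version-locked conda and pip requirements travel with the project. The following is a minimal sketch rather than a prescribed ENCORE layout: the environment name myenv follows the example above, while the file name environment.yml, the conda-forge channel, and all package names and version numbers are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Write an environment file that pins the conda packages first and lists
# the remaining pip-only packages in a nested "pip:" section.
# All package names and versions below are illustrative placeholders.
cat > environment.yml <<'EOF'
name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy=1.26
  - pip
  - pip:
      - requests==2.31.0
EOF

echo "Wrote environment.yml"

# Build the environment only when explicitly asked to (pass --create),
# because this step needs conda and network access.
if [ "${1:-}" = "--create" ]; then
    conda env create --file environment.yml
fi
```

Running the script without arguments only writes environment.yml; passing --create also builds the environment. conda installs the packages listed under dependencies first and then hands the nested pip list to the pip inside the new environment, which matches the order recommended above.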

Use in ENCORE

Since software versions and dependencies are a main obstacle to reproducibility, it is required to use a package manager of some sort (e.g., conda, renv) and to ensure that Compendium Recipients can reproduce the software environment. For example, conda can export the environment with 'conda list --explicit > myenv.txt', and the Compendium Recipient can recreate it with 'conda create --name myenv --file myenv.txt'.

Preferably, also provide a file with versions of all software (including packages, libraries) used in the project. For example, in R one can use 'installed.packages()', while in conda one can use 'conda list --explicit > myenv.txt'.
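
One way to collect such version files is a small script that is run inside the activated project environment. This is a sketch under stated assumptions, not an official ENCORE tool: the output file names (platform.txt, myenv.txt, requirements.txt, r-packages.csv) are hypothetical, and each tool is only queried if it is found on the PATH.

```shell
#!/usr/bin/env bash
set -eu

# Record the operating system and hardware architecture; explicit conda
# spec files are specific to the platform they were exported on.
uname -srm > platform.txt

# Exact conda package builds of the active environment.
if command -v conda >/dev/null 2>&1; then
    conda list --explicit > myenv.txt
fi

# Exact versions of the packages that pip can see.
if command -v python3 >/dev/null 2>&1 && python3 -m pip --version >/dev/null 2>&1; then
    python3 -m pip freeze > requirements.txt
fi

# Name and version of every installed R package.
if command -v Rscript >/dev/null 2>&1; then
    Rscript -e 'ip <- installed.packages(); write.csv(ip[, c("Package", "Version")], "r-packages.csv", row.names = FALSE)'
fi

echo "Recorded software versions for: $(cat platform.txt)"
```

Because myenv.txt lists exact builds for one platform, a recipient on another operating system may need a less strict export, such as 'conda env export --from-history > environment.yml', which keeps only the packages that were explicitly requested.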

In the context of ENCORE, we are still working on best practices for using containers and/or VMs.

ahcvankampen · 2024-06-11T14:48:03Z

ahcvankampen
Jun 11, 2024
Maintainer Author

Group discussion (11 June 2024)
We need to make a decision about what software to use.
We need to gain experience with this software.

Software versions/dependencies
It makes sense to use conda and/or renv. These are widely used. We already have experience with them.

Containers
It is probably best to use Apptainer (formerly known as Singularity). Unlike Docker, whose daemon normally requires root privileges to run a container, Apptainer is designed for ease of use on shared multi-user systems and in high-performance computing (HPC) environments. Apptainer can run Docker images, and it can be used with GPUs and MPI applications.
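
As a rough illustration of running one and the same Docker image with different container runtimes, the sketch below tries Apptainer first and falls back to Podman or Docker. The image python:3.11-slim from Docker Hub is only a placeholder, and the order in which the runtimes are tried is an assumption made for this example, not an ENCORE decision.

```shell
#!/usr/bin/env bash
set -u

image="python:3.11-slim"   # placeholder image on Docker Hub

if command -v apptainer >/dev/null 2>&1; then
    # The docker:// prefix makes Apptainer fetch and convert the Docker image.
    apptainer exec "docker://${image}" python --version
elif command -v podman >/dev/null 2>&1; then
    # Podman may ask which registry to use unless the name is fully qualified.
    podman run --rm "docker.io/library/${image}" python --version
elif command -v docker >/dev/null 2>&1; then
    docker run --rm "docker.io/library/${image}" python --version
else
    echo "No container runtime found on this machine." >&2
fi
```

Each branch runs the same command (python --version) inside the same image, so the printed Python version should not depend on which runtime happens to be installed.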

Virtual Machines
Not clear if these are really necessary at this stage. First gain more experience with containers/conda/renv.

Ansible
It may make sense to use Ansible, which is an open source IT automation engine that automates provisioning, configuration management, application deployment, orchestration, and many other IT processes.

ToDo

hands on workshop (September 2024)
identify the caveats
develop expertise and best practices
update ENCORE documentation (e.g., readme files, step-by-step guide)

Notes

GitHub Codespaces can be used to set up Docker containers.

Open questions

What are the limitations?
How do the different conda versions compare?
At what stage of a project should we start using conda/renv/containers?
pip vs conda / renv vs conda
Are there differences between operating systems that we need to take care of?


barberavanschaik · 2024-06-12T08:45:24Z

barberavanschaik
Jun 12, 2024
Maintainer

Apptainer only runs on Linux. It needs to run in a virtual machine on Windows and Mac.
https://apptainer.org/docs/admin/main/installation.html#installation-on-windows-or-mac

Docker has a Windows installer:
https://docs.docker.com/desktop/install/windows-install/


ahcvankampen · 2024-06-18T07:27:39Z

ahcvankampen
Jun 18, 2024
Maintainer Author

ENCORE aims to provide a method for the broader scientific community. It should not rely too much on specific tools. The only exception at this moment is git/GitHub, which is already integrated into ENCORE through, for example, the github.txt file and the step-by-step guide. Our goal is to avoid imposing particular tools on researchers to maintain flexibility and inclusivity.

However, reproducibility is crucial for 'environments,' necessitating an exception for widely accepted tools. Here’s the plan for selecting these tools:

Conda: Conda is a clear choice due to its popularity and broad acceptance in the scientific community.
Containers: For containerization, we can consider Docker and Apptainer (formerly Singularity), as they are leading tools in this area.
Virtual Machines (VMs): Selecting the best VM tool (VirtualBox?) for ENCORE requires further investigation. Which one is most widely accepted by the scientific community? What are the pros and cons?
Cloud Services: We should avoid using SURF Cloud-based VMs (or any other cloud).
Automation: While Ansible is valuable, it should be categorized under ENCORE_AUTOMATION rather than as a core part of ENCORE itself.

Selected tools should be OS independent.

Additionally, we need to account for programming languages beyond R and Python, such as MATLAB, C++, etc. We should evaluate the applicability of environment managers like Conda for these languages to ensure comprehensive support.

GitHub Codespaces also needs to be considered since we are already using GitHub.

In summary, while extending ENCORE, our focus will be on selecting tools that ensure reproducibility, are widely accepted, and maintain flexibility across different programming environments.


ahcvankampen · 2024-06-18T08:09:56Z

ahcvankampen
Jun 18, 2024
Maintainer Author

UTM

Just a side note, FYI: VirtualBox isn’t the best tool for Apple Silicon-based Macs (i.e., all the new machines: M1, M2, M3):
https://machow2.com/download-virtualbox-apple-silicon/

I am currently playing around with UTM (which is mentioned in the article linked above as one of the better ones). It works pretty OK, but of course I am stumbling on some issues …

One of them is that not enough space is given to the root partition at the start, so it has to be enlarged afterwards.
The other is that virtualization (and you do not want emulation, as that is very slow!) uses the underlying OS architecture, i.e., arm64 on the M1, and not all Python packages from PyPI seem to be available for that architecture (yet). So you cannot install them using conda/pip install.
Apptainer needs a Linux VM on Windows or Mac …

Anyway, a lot of things to explore and help each other out with!


ahcvankampen · 2024-06-20T08:20:35Z

ahcvankampen
Jun 20, 2024
Maintainer Author

Podman

I didn't look into this, but perhaps Podman (https://podman.io/) is useful? This was suggested by Frans van der Kloet (UvA).

Podman is a daemonless, open source, Linux native tool designed to make it easy to find, run, build, share and deploy applications using Open Containers Initiative (OCI) Containers and Container Images.

Can I run Podman in Windows?
Yes. While "containers are Linux," Podman also runs on Mac and Windows, where it provides a native CLI and uses a guest Linux system to launch your containers (known as a Podman machine). From https://developers.redhat.com/articles/2023/09/27/how-install-and-use-podman-desktop-windows

Very instructive video [here].


ahcvankampen · 2024-06-27T08:34:51Z

ahcvankampen
Jun 27, 2024
Maintainer Author

C. Titus Brown

This blog post provides some interesting thoughts on reproducibility using containers/VMs.

Blog: http://ivory.idyll.org/blog/2017-pof-software-archivability.html

Makes remarks about ‘docker is not robust’
“My conclusion is that, on a decadal time scale, we cannot rely on software to run repeatably.”
“Virtual machines considered harmful for reproducibility”


ahcvankampen · 2024-06-27T12:28:43Z

ahcvankampen
Jun 27, 2024
Maintainer Author

Should we, alternatively, disseminate a project online (in a controlled environment)?

What about performance issues?

Is it possible to decide (upfront) which projects should use containers/VMs?

A container/VM should not restrict the further use/extension/modification by peers.

