diff --git a/docs/no_toc/02-why-containers.html b/docs/no_toc/02-why-containers.html index 89459bf..6f50b1b 100644 --- a/docs/no_toc/02-why-containers.html +++ b/docs/no_toc/02-why-containers.html @@ -139,7 +139,7 @@
-Not only are our computers very unique at any one point in time, but as time moves forward computers and the software environments change very rapidly. Some of this happens through intentional installations of programs by computer users and some are automatic software updates that they may not be aware of that are instigated by the developers of the hardware and software that make up the computer.
+Not only are our computers unique at any one point in time, but as time moves forward, computers and their software environments change very rapidly. Some of these changes happen through intentional installations of programs by computer users. Others are changes users may not be aware of, made through automatic software updates instigated by the developers of the hardware and software that make up the computer.
-If we control the computing environments, that is one less variable we need to deal with in our science. Control the computing environment = That much more reproducible science.
+If we control the computing environments, that is one less variable we need to deal with in our science. Control the computing environment = much more reproducible science.
We could think of impractical ways to control our computing environment: We could have one laptop that we ship back and forth between all our collaborators. Although this may make the computing environment slightly more controlled, clearly it is not a practical solution.
-And containerization works for aiding in reproducibility as shown by (Beaulieu-Jones and Greene 2017)
+And containerization aids in reproducibility, as shown by Beaulieu-Jones and Greene (2017).
-When a container is used, an aspect of variability in scientific analysis – the computing environment – is controlled for, the results are more reproducible!
+When a container is used, an aspect of variability in scientific analysis – the computing environment – is controlled for, and the results are more reproducible!
-But more than this, there are even more benefits to using containers:
-Installing software can be a huge headache. Bioinformatics involves using software that is often fringe and developed and maintained by small teams – or sometimes the software isn’t maintained at all. This means installation can take a lot of valuable time that scientists often don’t have.
+Besides the reasons stated above, there are even more benefits to using containers:
+Installing software can be a huge headache. Bioinformatics involves using software that is often fringe – developed and maintained by small teams – or sometimes the software isn’t maintained at all. This means installation can take a lot of valuable time that scientists often don’t have.
-Now we can run the following command but we will have to run docker ps or docker container ls and get the container ID we need to put here.
+Now we can run the following command but we will have to run docker ps and get the container ID we need to put here.
docker exec -it <REPLACE_WITH_CONTAINER_ID> bash /home/run_analysis.sh
+or, in the Exec tab for the container in the Docker Desktop app, run
+bash /home/run_analysis.sh
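Typing the container ID by hand is easy to get wrong. As a minimal sketch (assuming the container you just started is the most recently created one on your machine), you can let docker ps fill it in for you:

```shell
# -l/--latest: only the most recently created container
# -q/--quiet:  print only its ID
CONTAINER_ID=$(docker ps --latest --quiet)

# Run the analysis script inside that container
docker exec -it "$CONTAINER_ID" bash /home/run_analysis.sh
```

A similar approach works with podman ps, which accepts the same flags.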
-To share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries generally cross-compatible meaning you can pull the image from just about anywhere if you have the right command and software. You can use different container registries for different purposes.
+To share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries are generally cross-compatible, meaning you can pull the image from just about anywhere if you have the right command and software. You can use different container registries for different purposes.
This article has a nice summary of some of the most popular ones.
And here’s a TL;DR of the most common registries:
We encourage you to consider which container registries work best for your specific project and team. Here’s a starter list of considerations, roughly in order of importance.
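Whichever registry you pick, the push workflow looks much the same everywhere. Here is a minimal Docker Hub sketch; the yourusername/practice-image name is a placeholder, not a real repository:

```shell
# Log in to Docker Hub (prompts for your credentials)
docker login

# Tag the local image with your Docker Hub namespace and a version
# (yourusername is a placeholder; substitute your own account)
docker tag practice-image yourusername/practice-image:1.0

# Push the tagged image to the registry
docker push yourusername/practice-image:1.0

# Collaborators can then pull it anywhere with:
#   docker pull yourusername/practice-image:1.0
```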
COPYing change? Do I need to use --no-cache to force a rebuild of everything so the changes are seen?

Now you have the basics of using containers but this is really just the beginning! As you continue to work with containers you will encounter errors and need to troubleshoot. This table has a quick rundown on some of the most common errors:
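For the --no-cache question above, a minimal sketch of forcing a full rebuild (the image name here is a placeholder):

```shell
# Rebuild every layer from scratch, ignoring the layer cache, so any
# files brought in by COPY are guaranteed to be re-copied.
docker build --no-cache -t practice-image .
```

Use this sparingly: a full rebuild is slow, and Docker's build cache normally detects changed COPY sources on its own, so --no-cache is mainly a troubleshooting tool.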
diff --git a/docs/no_toc/search.json b/docs/no_toc/search.json index 77c4486..1c18861 100644 --- a/docs/no_toc/search.json +++ b/docs/no_toc/search.json @@ -74,7 +74,7 @@ "href": "02-why-containers.html", "title": "\n2 Why Containers?\n", "section": "", - "text": "2.1 The problem\nIn today’s data driven world, science is driven by computer work. But each of these computers is unique. This goes far beyond “Mac vs PC”. Every computer has a special configuration of software and software versions that is installed on it. Some of this is determined by the user of the computer, some due to the design by the company that builds and sells the computers, and some is even controlled by institutions and their IT departments that manage the computers.\nSoftware programs can give us a concrete example of what differing computing environments can look like by printing out this information. This side-by-side example below shows two different computers’ computing environments. This printout comes from using sessionInfo() in the R programming language. You can see that not only do these two computing environments differ by operating system, but also by software packages installed, software packages loaded, and the versions of these software packages.\nNot only are our computers very unique at any one point in time, but as time moves forward computers and the software environments change very rapidly. Some of this happens through intentional installations of programs by computer users and some are automatic software updates that they may not be aware of that are instigated by the developers of the hardware and software that make up the computer.\nComputing environments are a moving target.\nThis can be a real problem for science because prior research has shown that these computing environments can affect results (Beaulieu-Jones and Greene 2017). 
In a genomic analysis for example, where the output might be a list of genes, differing computing environments may result in different numbers of significant genes!", + "text": "2.1 The problem\nIn today’s data driven world, science is driven by computer work. But each of these computers is unique. This goes far beyond “Mac vs PC”. Every computer has a special configuration of software and software versions that is installed on it. Some of this is determined by the user of the computer, some due to the design by the company that builds and sells the computers, and some is even controlled by institutions and their IT departments that manage the computers.\nSoftware programs can give us a concrete example of what differing computing environments can look like by printing out this information. This side-by-side example below shows two different computers’ computing environments. This printout comes from using sessionInfo() in the R programming language. You can see that not only do these two computing environments differ by operating system, but also by software packages installed, software packages loaded, and the versions of these software packages.\nNot only are our computers very unique at any one point in time, but as time moves forward computers and the software environments change very rapidly. These changes might happen through intentional installations of programs by computer users. Some are changes users may not be aware of, which took place through automatic software updates that are instigated by the developers of the hardware and software that make up the computer.\nComputing environments are a moving target.\nThis can be a real problem for science because prior research has shown that these computing environments can affect results (Beaulieu-Jones and Greene 2017). 
In a genomic analysis for example, where the output might be a list of genes, differing computing environments may result in different numbers of significant genes!", "crumbs": [ "2 Why Containers?" ] @@ -84,7 +84,7 @@ "href": "02-why-containers.html#containers-as-an-aid-for-reproducibility", "title": "\n2 Why Containers?\n", "section": "\n2.2 Containers as an aid for reproducibility", - "text": "2.2 Containers as an aid for reproducibility\nScience progresses when data and hypotheses are thoroughly and sequentially tested at three levels: repeatability, reproducibility, and replicability. If results are not repeatable, they won’t be reproducible or replicable. These three concepts represent the pillars of ensuring a study’s reliability and validity.\nFor the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher and produce the same results and conclusion.\nGenerally speaking, the more variables involved in a system, the messier things get, and the less clarity we have in what we are observing. It turns out computing environments are another variable that can affect reproducibility.\n\n\n\n\n\n\n\n\nAn important note: although your results can be reproducibly wrong (you’re coming to a faulty conclusion consistently) they can NOT be irreproducibly right.\n\n\n\n\n\n\n\n\nIf we control the computing environments, that is one less variable we need to deal with in our science. Control the computing environment = That much more reproducible science.\nWe could think of impractical ways to control our computing environment: We could have one laptop that we ship back and forth between all our collaborators. Although this may make the computing environment slightly more controlled, clearly it is not a practical solution.\n\n\n\n\n\n\n\n\nThat’s where containers come in.\nA container is kind of like your computer is running a computer inside of it. 
It is isolated from the rest of your computer so your science can be more reproducible.\n\n\n\n\n\n\n\n\nContainerization allows computing environments to be shared over the internet easily using libraries like Docker Hub.\n\n\n\n\n\n\n\n\nSharing computing environments and using those shared environments guarantee that the same computing environment is being used no matter where the analysis is being run.\n\n\n\n\n\n\n\n\nAnd containerization works for aiding in reproducibility as shown by (Beaulieu-Jones and Greene 2017)\n\n\n\n\n\n\n\n\nWhen a container is used, an aspect of variability in scientific analysis – the computing environment – is controlled for, the results are more reproducible!", + "text": "2.2 Containers as an aid for reproducibility\nScience progresses when data and hypotheses are thoroughly and sequentially tested at three levels: repeatability, reproducibility, and replicability. If results are not repeatable, they won’t be reproducible or replicable. These three concepts represent the pillars of ensuring a study’s reliability and validity.\nFor the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher and produce the same results and conclusion.\nGenerally speaking, the more variables involved in a system, the messier things get, and the less clarity we have in what we are observing. It turns out computing environments are another variable that can affect reproducibility.\n\n\n\n\n\n\n\n\nAn important note: although your results can be reproducibly wrong (you’re coming to a faulty conclusion consistently) they can NOT be irreproducibly right.\n\n\n\n\n\n\n\n\nIf we control the computing environments, that is one less variable we need to deal with in our science. 
Control the computing environment = much more reproducible science.\nWe could think of impractical ways to control our computing environment: We could have one laptop that we ship back and forth between all our collaborators. Although this may make the computing environment slightly more controlled, clearly it is not a practical solution.\n\n\n\n\n\n\n\n\nThat’s where containers come in.\nA container is kind of like your computer is running a computer inside of it. It is isolated from the rest of your computer so your science can be more reproducible.\n\n\n\n\n\n\n\n\nContainerization allows computing environments to be shared over the internet easily using libraries like Docker Hub.\n\n\n\n\n\n\n\n\nSharing computing environments and using those shared environments guarantee that the same computing environment is being used no matter where the analysis is being run.\n\n\n\n\n\n\n\n\nAnd containerization aids in reproducibility as shown by (Beaulieu-Jones and Greene 2017)\n\n\n\n\n\n\n\n\nWhen a container is used, an aspect of variability in scientific analysis – the computing environment – is controlled for, and as a result, the results are more reproducible!", "crumbs": [ "2 Why Containers?" ] @@ -94,7 +94,7 @@ "href": "02-why-containers.html#top-reasons-for-containers", "title": "\n2 Why Containers?\n", "section": "\n2.3 Top reasons for containers!", - "text": "2.3 Top reasons for containers!\nBut more than this, there are even more benefits to using containers:\nInstalling software can be a huge headache. Bioinformatics involves using software that is often fringe and developed and maintained by small teams – or sometimes the software isn’t maintained at all. This means installation can take a lot of valuable time that scientists often don’t have.\n\n\n\n\n\n\n\n\n\n2.3.1 Unit Testing\nYou may not think of yourself as a software developer if you primarily write code for analyses. But this is still software! Just a different kind. 
In fact any kind of scientific code can still benefit from testing and automation. Our companion course about GitHub Actions and Continuous Integration / Continuous Deployment principles go into more detail about this.\nBut containers and automated testing of code go hand in hand. Rather than having your collaborator test it, it may be worth your while to have the code automatically tested, or the analysis automatically re-run upon the creation of a pull request.\nUnit testing then is a way to test each individual component of a code base. Whatever the smallest unit you can break your code down into should be tested. Each function should have a reproducible example that is re-run upon introducing new changes in a pull request. This way it will save you time by letting you know which part of the code may have broken with new changes.\nContainers assist with unit testing by allowing for a standard computing environment as well as ways to easily test code as it would be run in different operating systems: Macs, PCs, Linuxes, etc.\nTo summarize:", + "text": "2.3 Top reasons for containers!\nBeside reasons stated above, there are even more benefits to using containers:\nInstalling software can be a huge headache. Bioinformatics involves using software that is often fringe – developed and maintained by small teams – or sometimes the software isn’t maintained at all. This means installation can take a lot of valuable time that scientists often don’t have.\n\n\n\n\n\n\n\n\n\n2.3.1 Unit Testing\nYou may not think of yourself as a software developer if you primarily write code for analyses. But this is still software! Just a different kind. In fact any kind of scientific code can still benefit from testing and automation. Our companion course about GitHub Actions and Continuous Integration / Continuous Deployment principles go into more detail about this.\nBut containers and automated testing of code go hand in hand. 
Rather than having your collaborator test it, it may be worth your while to have the code automatically tested, or the analysis automatically re-run upon the creation of a pull request.\nUnit testing then is a way to test each individual component of a code base. Whatever the smallest unit you can break your code down into should be tested. Each function should have a reproducible example that is re-run upon introducing new changes in a pull request. This way it will save you time by letting you know which part of the code may have broken with new changes.\nContainers assist with unit testing by allowing for a standard computing environment as well as ways to easily test code as it would be run in different operating systems: Macs, PCs, Linuxes, etc.\nTo summarize:", "crumbs": [ "2 Why Containers?" ] @@ -144,7 +144,7 @@ "href": "04-using-volumes.html#activity-instructions", "title": "\n4 Using Volumes\n", "section": "", - "text": "4.1.1 Docker\nOur container is separate from our computer so if we want to use a file from our computer we have to attach it using a “volume”.\n\n4.1.1.0.1 Step 1: Let’s add our containers-for-scientists-sandbox files\nLet’s point a volume to our workshop files so we have them on our container.\nWe can specify a particular file path on our computer or give it $PWD. Then optionally we can give a : and a file path where we’d like it to be stored on the container. Otherwise it will be stored at the absolute top of the container. Note that $PWD is a special environment variable that stores the absolute path of the current working directory. You will need to be in the containers-for-scientists-sandbox-main for this to work.\n\n\n\n\n\n\n\n\n Now we can run:\ndocker run -v $PWD:/home cansav09/practice-image:1\nIf you have a windows machine you may have to run this variant instead. 
This version has a different ${} around the pwd part.\ndocker run -v ${pwd}:/home cansav09/practice-image:1\nIn Docker desktop you can specify a portal like this:\n\n\n\n\n\n\n\n\n\n4.1.1.1 Step 2: Retry calling the script\n Now we can run the following command but we will have to run docker ps or docker container ls and get the container ID we need to put here.\ndocker exec -it <REPLACE_WITH_CONTAINER_ID> bash /home/run_analysis.sh\n\n\n\n\n\n\n\n\nNow we have a new error! What does this mean?\nQuestion: Does our container have all of the same software that our computer has?\n\n\n\n\n\n\n\n\n\n4.1.2 Podman\nOur container is separate from our computer so if we want to use a file we have to attach it using a “volume”.\n\n4.1.2.0.1 Step 1: Let’s add our containers-for-scientists-sandbox files\nLet’s point a volume to our workshop files so we have them on our container.\nWe can specify a particular file path on our computer or give it $PWD Then optionally we can give a : and a file path we’d like this to be stored on on the container. Otherwise it will be stored at the absolute top of the container.\n Now we can run:\npodman run -v $pwd:/home cansav09/practice-image:1\nIf you have a windows machine you may have to run this variant instead. 
This version has a different ${} around the pwd part.\npodman run -v ${pwd}:/home cansav09/practice-image:1\n\n\n\n\n\n\n\n\n\n4.1.2.1 Step 2: Retry calling the script\n Now we can run the following command but we will have to run podman ps and get the container ID we need to put here.\npodman exec -it <REPLACE_WITH_CONTAINER_ID> bash /home/run_analysis.sh\nNow we have a new error:\nError in loadNamespace(x): There is no package called 'rmarkdown'\nWhat does this mean?\nQuestion: Does our container have all of the same software that our computer has?", + "text": "4.1.1 Docker\nOur container is separate from our computer so if we want to use a file from our computer we have to attach it using a “volume”.\n\n4.1.1.0.1 Step 1: Let’s add our containers-for-scientists-sandbox files\nLet’s point a volume to our workshop files so we have them on our container.\nWe can specify a particular file path on our computer or give it $PWD. Then optionally we can give a : and a file path where we’d like it to be stored on the container. Otherwise it will be stored at the absolute top of the container. Note that $PWD is a special environment variable that stores the absolute path of the current working directory. You will need to be in the containers-for-scientists-sandbox-main for this to work.\n\n\n\n\n\n\n\n\n Now we can run:\ndocker run -v $PWD:/home cansav09/practice-image:1\nIf you have a windows machine you may have to run this variant instead. 
This version has a different ${} around the pwd part.\ndocker run -v ${pwd}:/home cansav09/practice-image:1\nIn Docker desktop you can specify a portal like this:\n\n\n\n\n\n\n\n\n\n4.1.1.1 Step 2: Retry calling the script\n Now we can run the following command but we will have to run docker ps and get the container ID we need to put here.\ndocker exec -it <REPLACE_WITH_CONTAINER_ID> bash /home/run_analysis.sh\nor in the exec tab of the container in Docker desktop app, run\nbash /home/run_analysis.sh\n\n\n\n\n\n\n\n\nNow we have a new error! What does this mean?\nQuestion: Does our container have all of the same software that our computer has?\n\n\n\n\n\n\n\n\n\n4.1.2 Podman\nOur container is separate from our computer so if we want to use a file we have to attach it using a “volume”.\n\n4.1.2.0.1 Step 1: Let’s add our containers-for-scientists-sandbox files\nLet’s point a volume to our workshop files so we have them on our container.\nWe can specify a particular file path on our computer or give it $PWD Then optionally we can give a : and a file path we’d like this to be stored on on the container. Otherwise it will be stored at the absolute top of the container.\n Now we can run:\npodman run -v $pwd:/home cansav09/practice-image:1\nIf you have a windows machine you may have to run this variant instead. 
This version has a different ${} around the pwd part.\npodman run -v ${pwd}:/home cansav09/practice-image:1\n\n\n\n\n\n\n\n\n\n4.1.2.1 Step 2: Retry calling the script\n Now we can run the following command but we will have to run podman ps and get the container ID we need to put here.\npodman exec -it <REPLACE_WITH_CONTAINER_ID> bash /home/run_analysis.sh\nNow we have a new error:\nError in loadNamespace(x): There is no package called 'rmarkdown'\nWhat does this mean?\nQuestion: Does our container have all of the same software that our computer has?", "crumbs": [ "4 Using Volumes" ] @@ -304,7 +304,7 @@ "href": "07-sharing-images.html#container-registries", "title": "\n7 Best practices for sharing images\n", "section": "\n7.7 Container Registries", - "text": "7.7 Container Registries\nTo share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries generally cross-compatible meaning you can pull the image from just about anywhere if you have the right command and software. You can use different container registries for different purposes.\nThis article has a nice summary of some of the most popular ones.\nAnd here’s a TL;DR of the most common registries:\n\n\nDockerhub – widely used, a default\n\nAmazon Web Services Container Registry - options for keeping private\n\n\nGithub container registry - If you are using GitHub packages works with that nicely\n\nSingularity – if you need more robust security\n\nWe encourage you to consider what container registries work best for you specific project and team. 
Here’s a starter list of considerations you may want to think of, roughly in the order of importance.\n\nIf you have protected data and security concerns (like we discussed earlier in this chapter) you may need to pick a container registry that allows privacy and strong security.\nPrice – not all container registries are free, but many of them are what kind of budget do you have for this purpose? Paying is generally not a necessity so don’t pay for a container registry subscription unless you need to.\nWhat tools are you already using? For example GitHub, Azure, and AWS have their own container registries, if you already are using these services you may consider using their associated registry. (Note GitHub actions works quite seamlessly with Dockerhub, so personally I haven’t had a reason to use GitHub Container Registry but it is an option.)\nIs there an industry standard? Where are your collaborators or those at your institution storing your images?\n\nWhile there are lots of container registry options, for the purposes of this tutorial, we’ll use Dockerhub. Dockerhub is one of the first container registries and still remains one of the largest. For most purposes, using Dockerhub will be just fine.", + "text": "7.7 Container Registries\nTo share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries are generally cross-compatible, meaning you can pull the image from just about anywhere if you have the right command and software. 
You can use different container registries for different purposes.\nThis article has a nice summary of some of the most popular ones.\nAnd here’s a TL;DR of the most common registries:\n\n\nDockerhub – widely used, a default\n\nAmazon Web Services Container Registry - options for keeping private\n\n\nGithub container registry - If you are using GitHub packages works with that nicely\n\nSingularity – if you need more robust security\n\nWe encourage you to consider which container registries work best for your specific project and team. Here’s a starter list of considerations, roughly in order of importance.\n\nIf you have protected data and security concerns (like we discussed earlier in this chapter) you may need to pick a container registry that allows privacy and strong security.\nPrice – not all container registries are free. Think about what kind of budget you have for this purpose. Paying is generally not a necessity, so don’t pay for a container registry subscription unless you need to.\nWhat tools are you already using? For example GitHub, Azure, and AWS have their own container registries; if you already are using these services you may consider using their associated registry. (Note GitHub actions works quite seamlessly with Dockerhub, so personally I haven’t had a reason to use GitHub Container Registry but it is an option.)\nIs there an industry standard? Where are your collaborators or those at your institution storing your images?\n\nWhile there are lots of container registry options, for the purposes of this tutorial, we’ll use Dockerhub. Dockerhub is one of the first container registries and still remains one of the largest. For most purposes, using Dockerhub will be just fine.", "crumbs": [ "7 Best practices for sharing images" ]