From 5d67b55c3b278f1cd2cfd291888f6650bd8112e1 Mon Sep 17 00:00:00 2001
From: Hugo Tavares
Date: Thu, 24 Oct 2024 15:49:56 +0100
Subject: [PATCH] update software instructions

---
 setup.md                 | 218 +++++++++++++++++----------------------
 utils/envs/alignment.yml |  14 +++
 utils/envs/assembly.yml  |  15 +++
 utils/envs/mags.yml      |  11 ++
 utils/training_setup.sh  |  46 +++++++++
 5 files changed, 178 insertions(+), 126 deletions(-)
 create mode 100644 utils/envs/alignment.yml
 create mode 100644 utils/envs/assembly.yml
 create mode 100644 utils/envs/mags.yml
 create mode 100644 utils/training_setup.sh

diff --git a/setup.md b/setup.md
index 7e202a7..ef6bba8 100644
--- a/setup.md
+++ b/setup.md
@@ -19,206 +19,172 @@ If you want to setup your own computer to run the analysis demonstrated on this
 ## Data
-The data used in these materials is not yet publicly available.
-We will add a link to the data in due time.
+You can download the data used in the workshop from the following link:
+
+
+
+
+
 ## Software
-### R and RStudio
+### General setup
 ::: {.panel-tabset group="os"}
 #### Windows
-Download and install all these using default options:
+- Set up the **Windows Subsystem for Linux (WSL)** following [these instructions](https://cambiotraining.github.io/software-installation/materials/wsl.html).
+- From a WSL terminal (previous step), install **Mamba** following the **Linux instructions** on [this page](https://cambiotraining.github.io/software-installation/materials/mamba.html).
+- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).
-- [R](https://cran.r-project.org/bin/windows/base/release.html)
-- [RTools](https://cran.r-project.org/bin/windows/Rtools/)
-- [RStudio](https://www.rstudio.com/products/rstudio/download/#download)
 #### Mac OS
-Download and install all these using default options:
-
-- [R](https://cran.r-project.org/bin/macosx/)
-- [RStudio](https://www.rstudio.com/products/rstudio/download/#download)
+- **Set up macOS** by following [these instructions](https://cambiotraining.github.io/software-installation/materials/macos.html).
+- Install **Mamba** using [these instructions](https://cambiotraining.github.io/software-installation/materials/mamba.html).
+- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).
 #### Linux
-- Go to the [R installation](https://cran.r-project.org/bin/linux/) folder and look at the instructions for your distribution.
-- Download the [RStudio](https://www.rstudio.com/products/rstudio/download/#download) installer for your distribution and install it using your package manager.
+- Install **Mamba** using [these instructions](https://cambiotraining.github.io/software-installation/materials/mamba.html).
+- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).
+
 :::
-#### R Packages
+### R Packages
 Open RStudio.
 In the R console, run the following commands to install all the necessary packages:
 ```r
 install.packages("BiocManager")
-BiocManager::install(c("dada2", "phyloseq", "Biostrings", "ggplot2", "reshape2", "readxl", "tidyverse"))
+BiocManager::install(c("dada2",
+                       "phyloseq",
+                       "Biostrings",
+                       "ggplot2",
+                       "reshape2",
+                       "readxl",
+                       "tidyverse"))
 ```
-### Linux {#sec-install-linux}
-
-:::{.panel-tabset}
-#### Fresh Installation
+### Bioinformatics software
-The recommendation for bioinformatic analysis is to have a dedicated computer running a Linux distribution.
-The kind of distribution you choose is not critical, but we recommend **Ubuntu** if you are unsure.
+We can install the software used in the course using `mamba`.
+Due to the large number of programs, we recommend installing them in separate environments to avoid package version conflicts.
+The following commands pin each program to the latest version available at the time of writing.
+You may want to search [anaconda.org](https://anaconda.org/) for newer versions.
-You can follow the [installation tutorial on the Ubuntu webpage](https://ubuntu.com/tutorials/install-ubuntu-desktop#1-overview).
+```bash
+mamba create -n alignment fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 metaphlan=4.1.1 mash=2.3 multiqc==1.25.1
-:::{.callout-warning}
-Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.
-:::
+mamba create -n assembly fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 spades=4.0.0 bbmap=39.10 flash=1.2.11 multiqc==1.25.1
-#### Windows WSL
+mamba create -n mags maxbin2=2.2.7 prokka=1.14.6 gtdbtk=2.4.0 abricate=1.0.1 checkm-genome=1.2.3
+```
-The **Windows Subsystem for Linux (WSL2)** runs a compiled version of Ubuntu natively on Windows.
+From now on, you can use these packages by activating the respective software environment with `mamba activate alignment`, `mamba activate assembly` or `mamba activate mags`.
-There are detailed instructions on how to install WSL on the [Microsoft documentation page](https://learn.microsoft.com/en-us/windows/wsl/install).
-But briefly:
-- Click the Windows key and search for _Windows PowerShell_, right-click on the app and choose **Run as administrator**.
-- Answer "Yes" when it asks if you want the App to make changes on your computer.
-- A terminal will open; run the command: `wsl --install`.
-  Progress bars will show while installing "Virtual Machine Platform", "Windows Subsystem for Linux" and finally "Ubuntu" (this process can take a long time).
-  - **Note:** it has happened to us in the past that the terminal freezes at the step of installing "Ubuntu". If it is frozen for ~1h at that step, press Ctrl + C and hopefully you will get a message saying "Ubuntu installed successfully".
-- After installation completes, restart your computer.
-- After restart, a terminal window will open asking you to create a username and password.
-  If it doesn't, click the Windows key and search for _Ubuntu_, click on the App and it should open a new terminal.
-  - You can use the same username and password that you have on Windows, or a different one - it's your choice. Spaces and other special characters are not allowed for your Ubuntu username.
-  - **Note:** when you type your password nothing seems to be happening as the cursor doesn't move. However, the terminal is recording your password as you type. You will be asked to type the new password again to confirm it, so you can always try again if you get it wrong the first time.
+### Databases
-You should now have access to a Ubuntu Linux terminal.
-This behaves very much like a regular Ubuntu server.
+Some of the programs used require us to download public databases in addition to their installation.
+These files can be quite large, so we recommend that you use shared storage if you're working in a team.
+We also recommend that you keep track of the database versions used (e.g. by saving them in explicitly named folders), in case new updates are released in the future and you want to reproduce an analysis.
-##### Configuring WSL2
+#### CheckM (1.5 GiB)
-After installation, it is useful to **create shortcuts to your files on Windows**.
-Your main `C:\` drive is located in `/mnt/c/` and other drives will be equally available based on their letter.
-To create shortcuts to commonly-used directories you use _symbolic links_.
-Here are some commands to automatically create shortcuts to your Windows "Documents", "Desktop" and "Downloads" folders (copy/paste these commands on the terminal):
+First activate the environment:
 ```bash
-ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("MyDocuments")' | tr -d '\r')) ~/Documents
-ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("Desktop")' | tr -d '\r')) ~/Desktop
-ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("UserProfile")' | tr -d '\r'))/Downloads ~/Downloads
+mamba activate mags
 ```
-You may also want to **configure the Windows terminal to automatically open _WSL2_** (instead of the default Windows Command Prompt or Powershell):
-
-- Search for and open the " Terminal" application.
-- Click on the down arrow in the toolbar.
-- Click on " Settings".
-- Under "Default Profile" select " Ubuntu".
-
-
-#### Virtual Machine
-
-Another way to run Linux within Windows (or macOS) is to install a Virtual Machine.
-However, this is mostly suitable for practicing and **not suitable for real data analysis**.
-
-Details for installing Ubuntu on VirtualBox is given on [this page](https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#1-overview).
-Make sure to do these things, while you are setting it up:
-
-- In Step 2 "Create a user profile": make sure to tick the Guest Additions option.
-- In Step 2 "Define the Virtual Machine’s resources":
-  - Assign at least 4 CPUs and 16000MB of RAM. At the very minimum you need 2 CPUs to run an Ubuntu VM.
-  - Set at least 100GB as disk size, more if you have it available (note, this will not take 100GB of space on your computer, but it will allow using up to a maximum of that value, which is useful as we are working with sequencing data).
-
-Once the installation completes, login to the Ubuntu Virtual machine, open a terminal and do the following:
-
-- Run `su` command.
-- Enter your user password. Your terminal should change to start with `root@`
-- Type the command: `usermod -a -G sudo YOUR-USERNAME-HERE`.
-- Close the terminal and restart the virtual machine.
-
-These commands will add your newly created user to the "sudo" (admin) group.
-:::
-
+The [CheckM documentation](https://github.com/Ecogenomics/CheckM/wiki/Installation#required-reference-data) gives the link to its database file.
-After making a fresh install of Ubuntu (using any of the methods above), open a terminal and run the following commands to update your system and install some essential packages:
+We will download this database to a directory in our home called `~/databases/checkmdb_20150116`, but you can change this if you prefer to save it elsewhere.
+We use the date of the latest version of the database in the directory name for reference.
 ```bash
-sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
-sudo apt install -y git
-sudo apt install -y default-jre
+# create variable with output directory name for our database
+# change this to be a directory of your choice
+checkm_db="$HOME/databases/checkmdb_20150116"
+mkdir -p $checkm_db
 ```
-
-### _Conda_
-
-We recommend using the _Conda_ package manager to install your software.
-In particular, the newest implementation called _Mamba_.
-
-To install _Mamba_, run the following commands from the terminal:
+Download and decompress the file:
 ```bash
-wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
-bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3
-rm Miniforge3-$(uname)-$(uname -m).sh
-$HOME/miniforge3/bin/mamba init
+wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
+tar -xzvf checkm_db.tar.gz -C $checkm_db
+rm checkm_db.tar.gz
 ```
-Restart your terminal (or open a new one) and confirm that your shell now starts with the word `(base)`.
-Then run the following commands:
+After downloading, you need to run the following command to configure CheckM:
 ```bash
-conda config --add channels defaults
-conda config --add channels bioconda
-conda config --add channels conda-forge
-conda config --set remote_read_timeout_secs 1000
+checkm data setRoot $checkm_db
 ```
-
-### Bioinformatics Software
-
-We can install all the software with `mamba`:
+Alternatively, you can set an environment variable specifically in your Conda/Mamba environment:
 ```bash
-mamba create -n metagen
-
-mamba install -n metagen fastqc multiqc cutadapt trimmomatic bowtie2 samtools metaphlan mash SPAdes bbmap flash maxbin2 prokka gtdbtk abricate checkm-genome
+conda env config vars set CHECKM_DATA_PATH="$checkm_db" -n mags
 ```
-From now on, you can use these packages, by activating the software environment using `mamba activate metagen`.
-
-
-### Databases
-
-Some of the programs used require us to download public databases in addition to their installation.
-To follow these instructions, make sure you have activated the software environment first: `mamba activate metagen`.
-#### CheckM (275MB)
+#### GTDB-Tk (40GB)
-The [CheckM documentation](https://github.com/Ecogenomics/CheckM/wiki/Installation#required-reference-data) gives the link to its database file.
+The [GTDB-Tk documentation](https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data) gives the link to its database files.
-We will download this databases to a directory called `checkmdb`, but you can change this if you prefer to save it elsewhere.
+We will download this database to a directory in our home called `~/databases/gtdbtk_r220`, but you can change this if you prefer to save it elsewhere.
+We use the name of the latest version of the database in the directory name for our reference.
 ```bash
-mkdir checkmdb
+# create variable with output directory name for our database
+# change this to be a directory of your choice
+gtdbtk_db="$HOME/databases/gtdbtk_r220"
+mkdir -p $gtdbtk_db
 ```
 Download and decompress the file:
 ```bash
-wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
-tar -xzvf checkm_db.tar.gz -C checkmdb
+wget -O gtdbtk_db.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
+tar -xzvf gtdbtk_db.tar.gz -C $gtdbtk_db
-rm checkm_db.tar.gz
+rm gtdbtk_db.tar.gz
 ```
-After downloading, we run the following command to configure CheckM:
+Finally, we need to configure an environment variable to tell GTDB-Tk where to find the database.
+We define this for our Conda/Mamba environment called `mags`:
 ```bash
-checkm data setRoot $(pwd)/checkmdb/
+conda env config vars set GTDBTK_DATA_PATH="$gtdbtk_db" -n mags
 ```
-#### GTDB-Tk (40GB)
+#### MetaPhlAn (24 GiB)
+
+MetaPhlAn provides a command to download the latest database from its server ([instructions](https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4#pre-requisites)).
+First we activate our environment and create the directory for the database:
+
+```bash
+mamba activate alignment
+
+# create variable with output directory name for our database
+# change this to be a directory of your choice
+metaphlan_db="$HOME/databases/metaphlan"
+mkdir -p $metaphlan_db
+```
-This program offers a script to automatically download the database:
+Then we can run the download command (this can take a long time to finish):
 ```bash
-download-db.sh
+metaphlan --install --bowtie2db $metaphlan_db
 ```
+
+Finally, we need to configure an environment variable to tell MetaPhlAn where to find the database.
+We define this for our Conda/Mamba environment called `alignment`:
+
+```bash
+conda env config vars set DEFAULT_DB_FOLDER="$metaphlan_db" -n alignment
+```
\ No newline at end of file
diff --git a/utils/envs/alignment.yml b/utils/envs/alignment.yml
new file mode 100644
index 0000000..8c0241d
--- /dev/null
+++ b/utils/envs/alignment.yml
@@ -0,0 +1,14 @@
+name: alignment
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - fastqc==0.12.1
+  - cutadapt==4.9
+  - trimmomatic==0.39
+  - bowtie2==2.5.4
+  - samtools==1.21
+  - metaphlan==4.1.1
+  - mash==2.3
+  - multiqc==1.25.1
diff --git a/utils/envs/assembly.yml b/utils/envs/assembly.yml
new file mode 100644
index 0000000..1be5133
--- /dev/null
+++ b/utils/envs/assembly.yml
@@ -0,0 +1,15 @@
+name: assembly
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - fastqc==0.12.1
+  - cutadapt==4.9
+  - trimmomatic==0.39
+  - bowtie2==2.5.4
+  - samtools==1.21
+  - spades==4.0.0
+  - bbmap==39.10
+  - flash==1.2.11
+  - multiqc==1.25.1
diff --git a/utils/envs/mags.yml b/utils/envs/mags.yml
new file mode 100644
index 0000000..08322cd
--- /dev/null
+++ b/utils/envs/mags.yml
@@ -0,0 +1,11 @@
+name: mags
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - maxbin2==2.2.7
+  - prokka==1.14.6
+  - gtdbtk==2.4.0
+  - abricate==1.0.1
+  - checkm-genome==1.2.3
\ No newline at end of file
diff --git a/utils/training_setup.sh b/utils/training_setup.sh
new file mode 100644
index 0000000..b7aad76
--- /dev/null
+++ b/utils/training_setup.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+set -e
+
+# Environments
+mamba create -n alignment fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 metaphlan=4.1.1 mash=2.3 multiqc==1.25.1
+
+mamba create -n assembly fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 spades=4.0.0 bbmap=39.10 flash=1.2.11 multiqc==1.25.1
+
+mamba create -n mags maxbin2=2.2.7 prokka=1.14.6 gtdbtk=2.4.0 abricate=1.0.1 checkm-genome=1.2.3
+
+
+# CheckM database
+
+checkm_db="$HOME/Course_Materials/databases/checkmdb"
+mkdir -p $checkm_db
+
+wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
+tar -xzvf checkm_db.tar.gz -C $checkm_db
+rm checkm_db.tar.gz
+
+conda env config vars set CHECKM_DATA_PATH="$checkm_db" -n mags
+
+
+# GTDB-Tk database
+
+gtdbtk_db="$HOME/Course_Materials/databases/gtdbtk"
+mkdir -p $gtdbtk_db
+
+wget -O gtdbtk_db.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
+tar -xzvf gtdbtk_db.tar.gz -C $gtdbtk_db
+rm gtdbtk_db.tar.gz
+
+conda env config vars set GTDBTK_DATA_PATH="$gtdbtk_db" -n mags
+
+
+# MetaPhlAn database
+
+mamba activate alignment
+
+metaphlan_db="$HOME/databases/metaphlan"
+mkdir -p $metaphlan_db
+
+metaphlan --install --bowtie2db $metaphlan_db
+
+conda env config vars set DEFAULT_DB_FOLDER="$metaphlan_db" -n alignment
\ No newline at end of file
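
The CheckM and GTDB-Tk steps in `utils/training_setup.sh` follow the same pattern: create a directory, extract a downloaded archive into it, then delete the archive. A minimal sketch of how that pattern could be factored out, assuming a hypothetical helper named `unpack_db` (it is not part of this patch) that takes the archive and destination as arguments, so the file that is deleted is always the one that was just extracted:

```shell
# Hypothetical helper, not part of the patch: unpack a database archive
# into a target directory, then delete that same archive.
# Usage: unpack_db <archive.tar.gz> <destination-directory>
unpack_db() {
  archive="$1"
  dest="$2"

  # Create the destination directory if it does not exist yet.
  mkdir -p "$dest" || return 1

  # Extract the gzip-compressed tarball into the destination.
  tar -xzf "$archive" -C "$dest" || return 1

  # Remove the archive only after a successful extraction.
  rm -- "$archive"
}
```

With this in place, each database section would reduce to a `wget -O` call followed by, for example, `unpack_db gtdbtk_db.tar.gz "$gtdbtk_db"`. Quoting the variables also keeps the commands working if the chosen directory contains spaces.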