Commit

update software instructions
tavareshugo committed Oct 24, 2024
1 parent 8bb6766 commit 5d67b55
Showing 5 changed files with 178 additions and 126 deletions.
218 changes: 92 additions & 126 deletions setup.md
Original file line number Diff line number Diff line change
@@ -19,206 +19,172 @@ If you want to setup your own computer to run the analysis demonstrated on this

## Data

The data used in these materials is not yet publicly available.
We will add a link to the data in due time.
You can download the data used in the workshop from the following link:

<a href="https://www.dropbox.com/scl/fo/fmzzgerrv98plnklzthh0/h?rlkey=x0zvsweshldl4m056q48jqe5d&st=qmrtfhso&dl=0">
<button class="btn"><i class="fa fa-download"></i> Download</button>
</a>


## Software

### R and RStudio
### General setup

::: {.panel-tabset group="os"}
#### Windows

Download and install all these using default options:
- Setup the **Windows Subsystem for Linux (WSL)** following [these instructions](https://cambiotraining.github.io/software-installation/materials/wsl.html).
- From a WSL terminal (previous step), install **Mamba** following the **Linux instructions** on [this page](https://cambiotraining.github.io/software-installation/materials/mamba.html).
- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).

- [R](https://cran.r-project.org/bin/windows/base/release.html)
- [RTools](https://cran.r-project.org/bin/windows/Rtools/)
- [RStudio](https://www.rstudio.com/products/rstudio/download/#download)

#### Mac OS

Download and install all these using default options:

- [R](https://cran.r-project.org/bin/macosx/)
- [RStudio](https://www.rstudio.com/products/rstudio/download/#download)
- **Setup your macOS** by following [these instructions](https://cambiotraining.github.io/software-installation/materials/macos.html).
- Install **Mamba** using [these instructions](https://cambiotraining.github.io/software-installation/materials/mamba.html).
- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).

#### Linux

- Go to the [R installation](https://cran.r-project.org/bin/linux/) folder and look at the instructions for your distribution.
- Download the [RStudio](https://www.rstudio.com/products/rstudio/download/#download) installer for your distribution and install it using your package manager.
- Install **Mamba** using [these instructions](https://cambiotraining.github.io/software-installation/materials/mamba.html).
- Install **R** and **RStudio** following [these instructions](https://cambiotraining.github.io/software-installation/materials/r-base.html).

:::

#### R Packages
### R Packages

Open RStudio.
In the R console, run the following commands to install all the necessary packages:

```r
install.packages("BiocManager")
BiocManager::install(c("dada2", "phyloseq", "Biostrings", "ggplot2", "reshape2", "readxl", "tidyverse"))
BiocManager::install(c("dada2",
"phyloseq",
"Biostrings",
"ggplot2",
"reshape2",
"readxl",
"tidyverse"))
```


### Linux {#sec-install-linux}

:::{.panel-tabset}
#### Fresh Installation
### Bioinformatics software

The recommendation for bioinformatic analysis is to have a dedicated computer running a Linux distribution.
The kind of distribution you choose is not critical, but we recommend **Ubuntu** if you are unsure.
We can install the software used in the course using `mamba`.
Due to the large number of programs, we recommend installing them in separate environments to avoid package version conflicts.
The following commands install the latest version of each program at the time of writing.
You may want to search [anaconda.org](https://anaconda.org/) for the latest versions available.

You can follow the [installation tutorial on the Ubuntu webpage](https://ubuntu.com/tutorials/install-ubuntu-desktop#1-overview).
```bash
mamba create -n alignment fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 metaphlan=4.1.1 mash=2.3 multiqc==1.25.1

:::{.callout-warning}
Installing Ubuntu on the computer will remove any other operating system you had previously installed, and can lead to data loss.
:::
mamba create -n assembly fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 spades=4.0.0 bbmap=39.10 flash=1.2.11 multiqc==1.25.1

#### Windows WSL
mamba create -n mags maxbin2=2.2.7 prokka=1.14.6 gtdbtk=2.4.0 abricate=1.0.1 checkm-genome=1.2.3
```

The **Windows Subsystem for Linux (WSL2)** runs a compiled version of Ubuntu natively on Windows.
From now on, you can use these packages by activating the respective software environment with `mamba activate alignment`, `mamba activate assembly` or `mamba activate mags`.
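
One quick way to confirm that an environment was created as expected is to activate it and check that its programs can be found on the `PATH`. The sketch below is only illustrative: `check_tools` is a hypothetical helper (it is not part of the course materials), and it assumes that each executable has the same name as the tool passed to it.

```bash
# Hypothetical helper (not part of the course materials):
# print every named program that cannot be found on the PATH
# and return a non-zero status if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage, after running `mamba activate alignment`:
#   check_tools fastqc cutadapt bowtie2 samtools
```

The function prints nothing when every program is found, so any output points at a package that may need reinstalling in that environment.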

There are detailed instructions on how to install WSL on the [Microsoft documentation page](https://learn.microsoft.com/en-us/windows/wsl/install).
But briefly:

- Click the Windows key and search for _Windows PowerShell_, right-click on the app and choose **Run as administrator**.
- Answer "Yes" when it asks if you want the App to make changes on your computer.
- A terminal will open; run the command: `wsl --install`.
Progress bars will show while installing "Virtual Machine Platform", "Windows Subsystem for Linux" and finally "Ubuntu" (this process can take a long time).
- **Note:** it has happened to us in the past that the terminal freezes at the step of installing "Ubuntu". If it is frozen for ~1h at that step, press <kbd>Ctrl + C</kbd> and hopefully you will get a message saying "Ubuntu installed successfully".
- After installation completes, restart your computer.
- After restart, a terminal window will open asking you to create a username and password.
If it doesn't, click the Windows key and search for _Ubuntu_, click on the App and it should open a new terminal.
- You can use the same username and password that you have on Windows, or a different one - it's your choice. Spaces and other special characters are not allowed for your Ubuntu username.
- **Note:** when you type your password nothing seems to be happening as the cursor doesn't move. However, the terminal is recording your password as you type. You will be asked to type the new password again to confirm it, so you can always try again if you get it wrong the first time.
### Databases

You should now have access to an Ubuntu Linux terminal.
This behaves very much like a regular Ubuntu server.
Some of the programs used require us to download public databases in addition to their installation.
These files can be quite large, so we recommend that you use shared storage if you're working in a team.
We also recommend that you keep track of the database versions used (e.g. saving them in explicit folder names), in case new updates are released in the future and you want to reproduce an analysis.
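
As a minimal sketch of the naming convention suggested above, the database name and version can be combined into the directory path. The helper `versioned_db_dir` and the variable `db_root` are hypothetical names used only for illustration; the version labels are the ones used in the sections that follow.

```bash
# Hypothetical helper: build a directory path that records the database
# name and its version, i.e. <root>/<name>_<version>
versioned_db_dir() {
  printf '%s/%s_%s\n' "$1" "$2" "$3"
}

# Example: one root directory with one folder per database version
db_root="$HOME/databases"
versioned_db_dir "$db_root" checkmdb 20150116
versioned_db_dir "$db_root" gtdbtk r220
```

Keeping the version in the path means a newer release can be downloaded alongside the old one without overwriting it.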

##### Configuring WSL2
#### CheckM (1.5 GiB)

After installation, it is useful to **create shortcuts to your files on Windows**.
Your main `C:\` drive is located in `/mnt/c/` and other drives will be equally available based on their letter.
To create shortcuts to commonly-used directories you use _symbolic links_.
Here are some commands to automatically create shortcuts to your Windows "Documents", "Desktop" and "Downloads" folders (copy/paste these commands on the terminal):
First activate the environment:

```bash
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("MyDocuments")' | tr -d '\r')) ~/Documents
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("Desktop")' | tr -d '\r')) ~/Desktop
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("UserProfile")' | tr -d '\r'))/Downloads ~/Downloads
mamba activate mags
```

You may also want to **configure the Windows terminal to automatically open _WSL2_** (instead of the default Windows Command Prompt or PowerShell):

- Search for and open the "<i class="fa-solid fa-terminal"></i> Terminal" application.
- Click on the down arrow <i class="fa-solid fa-chevron-down"></i> in the toolbar.
- Click on "<i class="fa-solid fa-gear"></i> Settings".
- Under "Default Profile" select "<i class="fa-brands fa-linux"></i> Ubuntu".


#### Virtual Machine

Another way to run Linux within Windows (or macOS) is to install a Virtual Machine.
However, this is mostly suitable for practicing and **not suitable for real data analysis**.

Details for installing Ubuntu on VirtualBox are given on [this page](https://ubuntu.com/tutorials/how-to-run-ubuntu-desktop-on-a-virtual-machine-using-virtualbox#1-overview).
Make sure to do these things, while you are setting it up:

- In Step 2 "Create a user profile": make sure to tick the Guest Additions option.
- In Step 2 "Define the Virtual Machine’s resources":
- Assign at least 4 CPUs and 16000MB of RAM. At the very minimum you need 2 CPUs to run an Ubuntu VM.
- Set at least 100GB as disk size, more if you have it available (note, this will not take 100GB of space on your computer, but it will allow using up to a maximum of that value, which is useful as we are working with sequencing data).

Once the installation completes, login to the Ubuntu Virtual machine, open a terminal and do the following:

- Run `su` command.
- Enter your user password. Your terminal should change to start with `root@`
- Type the command: `usermod -a -G sudo YOUR-USERNAME-HERE`.
- Close the terminal and restart the virtual machine.

These commands will add your newly created user to the "sudo" (admin) group.
:::

The [CheckM documentation](https://github.com/Ecogenomics/CheckM/wiki/Installation#required-reference-data) gives the link to its database file.

After making a fresh install of Ubuntu (using any of the methods above), open a terminal and run the following commands to update your system and install some essential packages:
We will download this database to a directory in our home called `~/databases/checkmdb_20150116`, but you can change this if you prefer to save it elsewhere.
We use the date of the latest version of the database in the directory name for reference.

```bash
sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
sudo apt install -y git
sudo apt install -y default-jre
# create variable with output directory name for our database
# change this to be a directory of your choice
checkm_db="$HOME/databases/checkmdb_20150116"
mkdir -p $checkm_db
```


### _Conda_

We recommend using the _Conda_ package manager to install your software.
In particular, the newest implementation called _Mamba_.

To install _Mamba_, run the following commands from the terminal:
Download and decompress the file:

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/miniforge3
rm Miniforge3-$(uname)-$(uname -m).sh
$HOME/miniforge3/bin/mamba init
wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzvf checkm_db.tar.gz -C $checkm_db
rm checkm_db.tar.gz
```

Restart your terminal (or open a new one) and confirm that your shell now starts with the word `(base)`.
Then run the following commands:
After downloading, you need to run the following command to configure CheckM:

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set remote_read_timeout_secs 1000
checkm data setRoot $checkm_db
```


### Bioinformatics Software

We can install all the software with `mamba`:
Alternatively, you can set an environment variable specifically in your Conda/Mamba environment:

```bash
mamba create -n metagen

mamba install -n metagen fastqc multiqc cutadapt trimmomatic bowtie2 samtools metaphlan mash SPAdes bbmap flash maxbin2 prokka gtdbtk abricate checkm-genome
conda env config vars set CHECKM_DATA_PATH="$checkm_db" -n mags
```

From now on, you can use these packages, by activating the software environment using `mamba activate metagen`.


### Databases

Some of the programs used require us to download public databases in addition to their installation.
To follow these instructions, make sure you have activated the software environment first: `mamba activate metagen`.

#### CheckM (275MB)
#### GTDB-Tk (40GB)

The [CheckM documentation](https://github.com/Ecogenomics/CheckM/wiki/Installation#required-reference-data) gives the link to its database file.
The [GTDB-Tk documentation](https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data) gives the link to its database files.

We will download this database to a directory called `checkmdb`, but you can change this if you prefer to save it elsewhere.
We will download this database to a directory in our home called `~/databases/gtdbtk_r220`, but you can change this if you prefer to save it elsewhere.
We use the name of the latest version of the database in the directory name for our reference.

```bash
mkdir checkmdb
# create variable with output directory name for our database
# change this to be a directory of your choice
gtdbtk_db="$HOME/databases/gtdbtk_r220"
mkdir -p $gtdbtk_db
```

Download and decompress the file:

```bash
wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzvf checkm_db.tar.gz -C checkmdb
wget -O gtdbtk_db.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar -xzvf gtdbtk_db.tar.gz -C $gtdbtk_db
rm gtdbtk_db.tar.gz
```

After downloading, we run the following command to configure CheckM:
Finally, we need to configure an environment variable to tell GTDB-Tk where to find the database.
We define this for our Conda/Mamba environment called `mags`:

```bash
checkm data setRoot $(pwd)/checkmdb/
conda env config vars set GTDBTK_DATA_PATH="$gtdbtk_db" -n mags
```

#### GTDB-Tk (40GB)
#### MetaPhlAn (24 GiB)

MetaPhlAn provides a command to download the latest database from its server ([instructions](https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4#pre-requisites)).
First we activate our environment and create the directory for the database:

```bash
mamba activate alignment

# create variable with output directory name for our database
# change this to be a directory of your choice
metaphlan_db="$HOME/databases/metaphlan"
mkdir -p $metaphlan_db
```

This program offers a script to automatically download the database:
Then we can run the download command (this can take a long time to finish):

```bash
download-db.sh
metaphlan --install --bowtie2db $metaphlan_db
```

Finally, we need to configure an environment variable to tell MetaPhlAn where to find the database.
We define this for our Conda/Mamba environment called `alignment`:

```bash
conda env config vars set DEFAULT_DB_FOLDER="$metaphlan_db" -n alignment
```
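
Variables defined with `conda env config vars set` only become visible after the environment is activated again, so it can be worth checking them before starting an analysis. The following is a minimal sketch, assuming the variable names used above; `check_db_dir` is a hypothetical helper, not something provided by Conda or by any of the tools.

```bash
# Hypothetical helper: check that a database path is set and is a directory.
# Usage: check_db_dir NAME VALUE
check_db_dir() {
  if [ -z "$2" ]; then
    echo "$1 is not set"
    return 1
  elif [ ! -d "$2" ]; then
    echo "$1 points to a missing directory: $2"
    return 1
  fi
  echo "$1 OK: $2"
}

# Example usage, after re-activating the `mags` environment:
#   check_db_dir CHECKM_DATA_PATH "${CHECKM_DATA_PATH:-}"
#   check_db_dir GTDBTK_DATA_PATH "${GTDBTK_DATA_PATH:-}"
```

The variables currently stored for an environment can also be listed with `conda env config vars list -n mags`.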
14 changes: 14 additions & 0 deletions utils/envs/alignment.yml
@@ -0,0 +1,14 @@
name: alignment
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- fastqc==0.12.1
- cutadapt==4.9
- trimmomatic==0.39
- bowtie2==2.5.4
- samtools==1.21
- metaphlan==4.1.1
- mash==2.3
- multiqc==1.25.1
15 changes: 15 additions & 0 deletions utils/envs/assembly.yml
@@ -0,0 +1,15 @@
name: assembly
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- fastqc==0.12.1
- cutadapt==4.9
- trimmomatic==0.39
- bowtie2==2.5.4
- samtools==1.21
- spades==4.0.0
- bbmap==39.10
- flash==1.2.11
- multiqc==1.25.1
11 changes: 11 additions & 0 deletions utils/envs/mags.yml
@@ -0,0 +1,11 @@
name: mags
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- maxbin2==2.2.7
- prokka==1.14.6
- gtdbtk==2.4.0
- abricate==1.0.1
- checkm-genome==1.2.3
46 changes: 46 additions & 0 deletions utils/training_setup.sh
@@ -0,0 +1,46 @@
#!/bin/bash

set -e

# Environments
mamba create -n alignment fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 metaphlan=4.1.1 mash=2.3 multiqc==1.25.1

mamba create -n assembly fastqc=0.12.1 cutadapt=4.9 trimmomatic=0.39 bowtie2=2.5.4 samtools=1.21 spades=4.0.0 bbmap=39.10 flash=1.2.11 multiqc==1.25.1

mamba create -n mags maxbin2=2.2.7 prokka=1.14.6 gtdbtk=2.4.0 abricate=1.0.1 checkm-genome=1.2.3


# CheckM database

checkm_db="$HOME/Course_Materials/databases/checkmdb"
mkdir -p $checkm_db

wget -O checkm_db.tar.gz https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xzvf checkm_db.tar.gz -C $checkm_db
rm checkm_db.tar.gz

conda env config vars set CHECKM_DATA_PATH="$checkm_db" -n mags


# GTDB-tk database

gtdbtk_db="$HOME/Course_Materials/databases/gtdbtk"
mkdir -p $gtdbtk_db

wget -O gtdbtk_db.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar -xzvf gtdbtk_db.tar.gz -C $gtdbtk_db
rm gtdbtk_db.tar.gz

conda env config vars set GTDBTK_DATA_PATH="$gtdbtk_db" -n mags


# MetaPhlAn database

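# Load Conda's shell functions so that activation works in a non-interactive script
eval "$(conda shell.bash hook)"
conda activate alignment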

metaphlan_db="$HOME/databases/metaphlan"
mkdir -p $metaphlan_db

metaphlan --install --bowtie2db $metaphlan_db

conda env config vars set DEFAULT_DB_FOLDER="$metaphlan_db" -n alignment
