diff --git a/CONTRIBUTORS.yaml b/CONTRIBUTORS.yaml
index a63406c620196c..a5c71a1cc625cd 100644
--- a/CONTRIBUTORS.yaml
+++ b/CONTRIBUTORS.yaml
@@ -1374,6 +1374,10 @@ Sofokli5:
orcid: 0000-0002-4833-4726
joined: 2022-05
+sophia120199:
+ name: Sophia Hampe
+ joined: 2023-04
+
stephanierobin:
name: Stéphanie Robin
email: stephanie.robin@inrae.fr
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00000000000000..649ec88e232fd1
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,12 @@
+# Security Policy
+
+## Supported Versions
+
+The GTN does not have a notion of 'supported versions', as our site is completely static and does not go through a managed release cycle, unlike Galaxy.
+All dynamic features implemented in our site are strictly client-side, which generally provides acceptable security guarantees and low risk of user-facing exploitation.
+
+There are occasionally vulnerabilities in the tools we use to build the site; however, those vulnerabilities generally affect neither us nor the output produced by Jekyll and related tools.
+
+## Reporting a Vulnerability
+
+Please report any vulnerabilities using [GitHub private reporting](https://github.com/galaxyproject/training-material/security/advisories/new). One of the maintainers will acknowledge your report within 5 (European) business days.
diff --git a/faqs/galaxy/interactive_tools_jupyter_launch.md b/faqs/galaxy/interactive_tools_jupyter_launch.md
index 312425ea384b16..46c7b5994d895a 100644
--- a/faqs/galaxy/interactive_tools_jupyter_launch.md
+++ b/faqs/galaxy/interactive_tools_jupyter_launch.md
@@ -11,7 +11,7 @@ contributors: [annefou,shiltemann,nomadscientist]
>
> > Run JupyterLab
> >
-> > 1. {% tool [Interactive Jupyter Notebook](interactive_tool_jupyter_notebook) %}:
+> > 1. {% tool [Interactive Jupyter Notebook](interactive_tool_jupyter_notebook) %}. Note that on some Galaxies this is called {% tool [Interactive JupyTool and notebook](interactive_tool_jupyter_notebook) %}:
> > 2. Click Run Tool
> > 3. The tool will start running and will stay running permanently
> >
diff --git a/learning-pathways/intro-to-galaxy-and-ecology.md b/learning-pathways/intro-to-galaxy-and-ecology.md
index b1e6565c3da51f..51da0c8ca5b247 100644
--- a/learning-pathways/intro-to-galaxy-and-ecology.md
+++ b/learning-pathways/intro-to-galaxy-and-ecology.md
@@ -6,8 +6,7 @@ title: Introduction to Galaxy and Ecological data analysis
description: |
This learning path aims to teach you the basics of Galaxy and analysis of ecological data.
You will learn how to use Galaxy for analysis, and will be guided through the common
- steps of biodiversity data analysis; download, check and filter GBIF data and analyze abundance data through modeling
- sequences.
+ steps of biodiversity data analysis: download, check, filter and explore biodiversity data and analyze abundance data through modeling.
priority: 1
@@ -31,8 +30,7 @@ pathway:
topic: ecology
- name: gbif_cleaning
topic: ecology
- - name: PAMPA-toolsuite-tutorial
- topic: ecology
+
- section: "Module 3: Basics of Biodiversity abundance data analysis"
description: Working on abundance data, you often want to analyze it through modeling to compute and analyze biodiversity metrics.
diff --git a/topics/admin/tutorials/ansible-galaxy/tutorial.md b/topics/admin/tutorials/ansible-galaxy/tutorial.md
index 81d4a42e59a009..95c4b0b7e8ff3d 100644
--- a/topics/admin/tutorials/ansible-galaxy/tutorial.md
+++ b/topics/admin/tutorials/ansible-galaxy/tutorial.md
@@ -1779,7 +1779,7 @@ Galaxy is now configured with an admin user, a database, and a place to store da
> > Mar 16 01:15:15 gat systemd[1]: galaxy-gunicorn.service: Consumed 3.381s CPU time.
> > ```
> >
-> > Check your /srv/galaxy/config/galaxy.yml and ensure that it lines up exactly with what you expect.
+> > Check your /srv/galaxy/config/galaxy.yml and ensure that it lines up exactly with what you expect. You might observe a warning that `Dynamic handlers are configured in Gravity but Galaxy is not configured to assign jobs to handlers dynamically`. We will address this [below](#job-configuration), and you can disregard it for now.
> {: .tip}
>
> 6. Some things to note:
@@ -1814,7 +1814,7 @@ With this we have:
- PostgreSQL running
- Galaxy running (managed by Gravity + systemd)
-Although Gunicorn can server HTTP for us directly, a reverse proxy in front of Gunicorn can automatically compress selected content, and we can easily apply caching headers to specific types of content like CSS or images. It is also necessary if we want to serve multiple sites at once, e.g. with a group website at `/` and Galaxy at `/galaxy`. Lastly, it can provide authentication as well, as noted in the [External Authentication]({{ site.baseurl }}/topics/admin/tutorials/external-auth/tutorial.html) tutorial.
+Although Gunicorn can serve HTTP for us directly, a reverse proxy in front of Gunicorn can automatically compress selected content, and we can easily apply caching headers to specific types of content like CSS or images. It is also necessary if we want to serve multiple sites at once, e.g. with a group website at `/` and Galaxy at `/galaxy`. Lastly, it can provide authentication as well, as noted in the [External Authentication]({{ site.baseurl }}/topics/admin/tutorials/external-auth/tutorial.html) tutorial.
For this, we will use NGINX (pronounced "engine X" /ˌɛndʒɪnˈɛks/ EN-jin-EKS). It is possible to configure Galaxy with Apache and potentially other webservers but this is not the configuration that receives the most testing. We recommend NGINX unless you have a specific need for Apache.
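As an illustration of what the reverse proxy buys us, here is a minimal sketch of an NGINX server block showing compression and caching headers in front of Gunicorn. This is an assumption-laden fragment, not the configuration this tutorial deploys (that is generated by an Ansible role later on): the port, paths and upstream address below are invented.

```nginx
# Illustrative only: compression and caching in front of Gunicorn.
server {
    listen 80;

    gzip on;                              # compress selected content
    gzip_types text/css application/javascript application/json;

    # Apply caching headers to static content such as CSS or images.
    location /static/ {
        root /srv/galaxy/server;          # invented path
        expires 24h;
        add_header Cache-Control public;
    }

    # Everything else is proxied through to Gunicorn.
    location / {
        proxy_pass http://127.0.0.1:8080; # invented upstream
    }
}
```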
diff --git a/topics/admin/tutorials/interactive-tools/tutorial.md b/topics/admin/tutorials/interactive-tools/tutorial.md
index 6d7bd23bbad8ba..001ad515ecdfe4 100644
--- a/topics/admin/tutorials/interactive-tools/tutorial.md
+++ b/topics/admin/tutorials/interactive-tools/tutorial.md
@@ -112,7 +112,7 @@ We will use several Ansible roles for this tutorial. In order to avoid repetetiv
>
> ```yaml
> - src: geerlingguy.docker
-> version: 2.6.0
+> version: 6.1.0
> - src: usegalaxy_eu.gie_proxy
> version: 0.0.2
> ```
@@ -409,7 +409,7 @@ As we use Let's Encrypt in staging mode, the wildcard certificates generated wit
> #certbot_auth_method: --webroot
> ```
>
-> Although this is not explicitly required (setting `cerbot_dns_provider` as we do overrides this setting), doing so is less confusing in the future, since it makes it clear that the "webroot" method for Let's Encrypt WEB-01 challenges is no longer in use for this server.
+> Although this is not explicitly required (setting `certbot_dns_provider` as we do overrides this setting), doing so is less confusing in the future, since it makes it clear that the "webroot" method for Let's Encrypt WEB-01 challenges is no longer in use for this server.
>
> - Add the following lines to your `group_vars/galaxyservers.yml` file:
>
@@ -515,7 +515,7 @@ As we use Let's Encrypt in staging mode, the wildcard certificates generated wit
> #certbot_auth_method: --webroot
> ```
>
-> Although this is not explicitly required (setting `cerbot_dns_provider` as we do overrides this setting), doing so is less confusing in the future, since it makes it clear that the "webroot" method for Let's Encrypt WEB-01 challenges is no longer in use for this server.
+> Although this is not explicitly required (setting `certbot_dns_provider` as we do overrides this setting), doing so is less confusing in the future, since it makes it clear that the "webroot" method for Let's Encrypt WEB-01 challenges is no longer in use for this server.
>
> - Add the following lines to your `group_vars/galaxyservers.yml` file:
>
@@ -586,7 +586,7 @@ A few Interactive Tool wrappers are provided with Galaxy, but they are [commente
>
> ```
>
-> 2. We need to modify `job_conf.xml` to instruct Galaxy on how run Interactive Tools (and specifically, how to run them in Docker). We will begin with a basic job conf:
+> 2. We need to modify `job_conf.xml` to instruct Galaxy on how to run Interactive Tools (and specifically, how to run them in Docker). We will begin with a basic job conf:
>
> Create `templates/galaxy/config/job_conf.xml.j2` with the following contents:
>
@@ -644,7 +644,7 @@ A few Interactive Tool wrappers are provided with Galaxy, but they are [commente
> ```
> {% endraw %}
>
-> Next, inform `galaxyproject.galaxy` of where you would like the `job_conf.xml` to reside, that GxITs should be enabled, and where the GxIT map database can be found:
> Next, inform `galaxyproject.galaxy` of where you would like the `job_conf.xml` to reside, that GxITs should be enabled, and where the GxIT map database can be found. Watch out for conflicting configuration left over from previous tutorials (e.g. `job_config: ...`):
>
> {% raw %}
> ```yaml
diff --git a/topics/genome-annotation/tutorials/apollo/slides.html b/topics/genome-annotation/tutorials/apollo/slides.html
index 65c9ffb182c549..142794f1a3fd28 100644
--- a/topics/genome-annotation/tutorials/apollo/slides.html
+++ b/topics/genome-annotation/tutorials/apollo/slides.html
@@ -83,9 +83,9 @@
- A Human finds problems algorithms can't
-.pull-left[.image-90[![Schema showing how automated annotation, experimental evidences (cDNAs, HMM domain searches, RNASeq, similarity with other species), and human analysis are used by Apollo to manually curate an annotation](../../images/apollo/apollo_workflow.png)]]
+.pull-left[.image-75[![Schema showing how automated annotation, experimental evidences (cDNAs, HMM domain searches, RNASeq, similarity with other species), and human analysis are used by Apollo to manually curate an annotation](../../images/apollo/apollo_workflow.png)]]
-.pull-right[.image-40[![Apollo screenshot showing how RNASeq reads align mostly within some exons limits, but not perfectly](../../images/apollo/rnaseq_cov.png)]]
+.pull-right[.image-25[![Apollo screenshot showing how RNASeq reads align mostly within some exons limits, but not perfectly](../../images/apollo/rnaseq_cov.png)]]
???
diff --git a/topics/metagenomics/images/metagenomics-nanopore/kmers-kraken.jpg b/topics/metagenomics/faqs/images/kmers-kraken.jpg
similarity index 100%
rename from topics/metagenomics/images/metagenomics-nanopore/kmers-kraken.jpg
rename to topics/metagenomics/faqs/images/kmers-kraken.jpg
diff --git a/topics/metagenomics/faqs/kraken.md b/topics/metagenomics/faqs/kraken.md
new file mode 100644
index 00000000000000..043a1d09395f19
--- /dev/null
+++ b/topics/metagenomics/faqs/kraken.md
@@ -0,0 +1,17 @@
+---
+title: Kraken2 and the k-mer approach for taxonomy classification
+area: format
+box_type: details
+layout: faq
+contributors: [bebatut]
+---
+
+In the $$k$$-mer approach for taxonomy classification, we use a database containing DNA sequences of genomes whose taxonomy we already know. On a computer, the genome sequences are broken into short pieces of length $$k$$ (called $$k$$-mers), usually 30bp.
+
+**Kraken** examines the $$k$$-mers within the query sequence, searches for them in the database, looks for where these are placed within the taxonomy tree inside the database, makes the classification with the most probable position, then maps $$k$$-mers to the lowest common ancestor (LCA) of all genomes known to contain the given $$k$$-mer.
+
+![Kraken2]({{site.baseurl}}/topics/metagenomics/faqs/images/kmers-kraken.jpg "Kraken sequence classification algorithm. To classify a sequence, each k-mer in the sequence is mapped to the lowest common ancestor (LCA, i.e. the lowest node) of the genomes that contain that k-mer in the database. The taxa associated with the sequence's k-mers, as well as the taxa's ancestors, form a pruned subtree of the general taxonomy tree, which is used for classification. In the classification tree, each node has a weight equal to the number of k-mers in the sequence associated with the node's taxon. Each root-to-leaf (RTL) path in the classification tree is scored by adding all weights in the path, and the maximal RTL path in the classification tree is the classification path (nodes highlighted in yellow). The leaf of this classification path (the orange, leftmost leaf in the classification tree) is the classification used for the query sequence. Source: {% cite Wood2014 %}")
+
+__Kraken2__ uses a compact hash table, a probabilistic data structure that allows for faster queries and lower memory requirements. It applies a spaced seed mask of _s_ spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the $$k$$-mer.
+
+You can find more information about the __Kraken2__ algorithm in the paper [_Improved metagenomic analysis with Kraken 2_](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0).
\ No newline at end of file
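The LCA mapping and root-to-leaf scoring described in this FAQ can be sketched in a few lines of Python. This is an illustrative toy only: the taxonomy, genomes and 5-mers below are invented, and real Kraken2 uses 30bp-plus k-mers, minimizers and a compact hash table rather than a Python dictionary.

```python
# Toy sketch of Kraken-style k-mer/LCA classification (illustrative only).
from collections import Counter

# Hypothetical taxonomy: child -> parent (the root's parent is None).
PARENT = {"root": None, "Bacteria": "root", "Escherichia": "Bacteria",
          "E. coli": "Escherichia", "Salmonella": "Bacteria"}

def ancestors(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa."""
    seen = set(ancestors(a))
    for t in ancestors(b):
        if t in seen:
            return t

def build_db(genomes, k):
    """Map each k-mer to the LCA of all genomes containing it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = taxon if kmer not in db else lca(db[kmer], taxon)
    return db

def classify(read, db, k):
    """Weight each taxon by its k-mer hits, then call the weighted taxon
    whose root-to-leaf path accumulates the highest total weight."""
    weights = Counter(db[read[i:i + k]] for i in range(len(read) - k + 1)
                      if read[i:i + k] in db)
    if not weights:
        return "unclassified"
    return max(weights, key=lambda t: sum(weights[a] for a in ancestors(t)))

# Toy usage: two made-up genomes, a read whose 5-mers all hit E. coli.
db = build_db({"E. coli": "ACGTACGT", "Salmonella": "ACGTTTTT"}, k=5)
print(classify("ACGTACG", db, k=5))  # -> E. coli
```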
diff --git a/topics/metagenomics/faqs/taxon.md b/topics/metagenomics/faqs/taxon.md
index 4261bbf11da095..4fa07cfe6ee931 100644
--- a/topics/metagenomics/faqs/taxon.md
+++ b/topics/metagenomics/faqs/taxon.md
@@ -29,6 +29,8 @@ Family | Felidae
Genus | *Felis*
Species | *F. catus*
+From this classification, one can generate a tree of life, also known as a phylogenetic tree. It is a rooted tree that describes the relationships of all life on Earth. At the root sits the "last universal common ancestor", and the three main branches (in taxonomy also called domains) are bacteria, archaea and eukaryotes. Most important here is the idea that all life on Earth is derived from a common ancestor; therefore, when comparing any two species, you will, sooner or later, find an ancestor common to both.
+
Let's explore taxonomy in the Tree of Life, using [Lifemap](https://lifemap.univ-lyon1.fr/)
diff --git a/topics/metagenomics/tutorials/beer-data-analysis/tutorial.md b/topics/metagenomics/tutorials/beer-data-analysis/tutorial.md
index c211cd8fbcc9d2..4155eb8fd27b9f 100644
--- a/topics/metagenomics/tutorials/beer-data-analysis/tutorial.md
+++ b/topics/metagenomics/tutorials/beer-data-analysis/tutorial.md
@@ -317,15 +317,7 @@ One of the main aims in microbiome data analysis is to identify the organisms se
Taxonomic assignment or classification is the process of assigning an **Operational Taxonomic Unit** (OTU, that is, a group of related individuals / a taxon) to sequences. To assign an OTU to a sequence, the sequence is compared against a database, and this comparison can be done in different ways, with different bioinformatics tools. Here we will use **Kraken2** ({% cite wood2019improved %}).
-> Kraken2 and the k-mer approach for taxonomy classification
->
-> In the $$k$$-mer approach for taxonomy classification, we use a database containing DNA sequences of genomes whose taxonomy we already know. On a computer, the genome sequences are broken into short pieces of length $$k$$ (called $$k$$-mers), usually 30bp.
->
-> **Kraken** examines the $$k$$-mers within the query sequence, searches for them in the database, looks for where these are placed within the taxonomy tree inside the database, makes the classification with the most probable position, then maps $$k$$-mers to the lowest common ancestor (LCA) of all genomes known to contain the given $$k$$-mer.
->
-> ![Kraken2](../../images/metagenomics-nanopore/kmers-kraken.jpg "Kraken sequence classification algorithm. To classify a sequence, each k-mer in the sequence is mapped to the lowest common ancestor (LCA, i.e. the lowest node) of the genomes that contain that k-mer in the database. The taxa associated with the sequence's k-mers, as well as the taxa's ancestors, form a pruned subtree of the general taxonomy tree, which is used for classification. In the classification tree, each node has a weight equal to the number of k-mers in the sequence associated with the node's taxon. Each root-to-leaf (RTL) path in the classification tree is scored by adding all weights in the path, and the maximal RTL path in the classification tree is the classification path (nodes highlighted in yellow). The leaf of this classification path (the orange, leftmost leaf in the classification tree) is the classification used for the query sequence. Source: {% cite Wood2014 %}")
->
-{: .details}
+{% snippet topics/metagenomics/faqs/kraken.md %}
> Kraken2
>
diff --git a/topics/metagenomics/tutorials/nanopore-16S-metagenomics/tutorial.md b/topics/metagenomics/tutorials/nanopore-16S-metagenomics/tutorial.md
index ceebbdf7b77586..33f5b5f3c2995d 100644
--- a/topics/metagenomics/tutorials/nanopore-16S-metagenomics/tutorial.md
+++ b/topics/metagenomics/tutorials/nanopore-16S-metagenomics/tutorial.md
@@ -264,13 +264,7 @@ One of the key steps in metagenomic data analysis is to identify the taxon to wh
To perform the taxonomic classification we will use __Kraken2__ ({% cite Wood2019 %}). This tool uses the minimizer method to sample the k-mers (all the read's subsequences of length _k_) in a deterministic fashion in order to reduce memory consumption and processing time. In addition, it masks low-complexity sequences from reference sequences by using __dustmasker__.
-
->
-> __Kraken2__ uses a compact hash table, a probabilistic data structure that allows for faster queries and lower memory requirements. It applies a spaced seed mask of _s_ spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the k-mer.
-> You can find more information about the __Kraken2__ algorithm in the paper [_Improved metagenomic analysis with Kraken 2_](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0).
-{: .comment}
-
-![Taxonomic classification](../../images/metagenomics-nanopore/kmers-kraken.jpg "Kraken2 sequence classification algorithm. To classify a sequence, each l-mer is mapped to the lowest common ancestor (LCA) of the genomes that contain that l-mer in a database. In the classification tree, each node has a weight equal to the number of l-mers in the sequence associated with the node’s taxon. Image originally published in {% cite Wood2014 %}.")
+{% snippet topics/metagenomics/faqs/kraken.md %}
For this tutorial, we will use the __SILVA database__ ({% cite Quast2012 %}). It includes over 3.2 million 16S rRNA sequences from the _Bacteria_, _Archaea_ and _Eukaryota_ domains.
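The minimizer sampling mentioned above can be illustrated with a small sketch. This is a toy only: it uses plain lexicographic ordering of forward-strand k-mers, whereas Kraken2 actually uses hashed, spaced-seed minimizers of canonical k-mers.

```python
# Toy sketch of minimizer sampling (illustrative only).
def minimizers(seq, k, w):
    """For every window of w consecutive k-mers, keep the lexicographically
    smallest one. Consecutive windows usually share their minimizer, so far
    fewer k-mers need to be stored than the full k-mer set."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for i in range(len(kmers) - w + 1):
        m = min(kmers[i:i + w])
        if not picked or m != picked[-1]:
            picked.append(m)
    return picked

# Toy usage: 7 overlapping 4-mers are sampled down to 3 minimizers.
print(minimizers("ACGTACGTTT", k=4, w=3))  # -> ['ACGT', 'CGTA', 'ACGT']
```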
diff --git a/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md b/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md
index 15d13acfa53b6f..ac016adeb48426 100644
--- a/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md
+++ b/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md
@@ -376,15 +376,7 @@ In this section we would like to identify the different organisms found in our s
In the previous section we ran **Kraken2** with the **Kalamari** database, which also performs taxonomy profiling, but with a database designed to include all possible host sequences. In the following part, we run **Kraken2** again, this time with one of its built-in databases, **Standard PlusPF**, which can give us more insight into pathogen candidate species than **Kalamari**. You can test this yourself by comparing the reports of both **Kraken2** runs.
-> Kraken2 and the k-mer approach for taxonomy classification
->
-> In the $$k$$-mer approach for taxonomy classification, we use a database containing DNA sequences of genomes whose taxonomy we already know. On a computer, the genome sequences are broken into short pieces of length $$k$$ (called $$k$$-mers), usually 30bp.
->
-> **Kraken** examines the $$k$$-mers within the query sequence, searches for them in the database, looks for where these are placed within the taxonomy tree inside the database, makes the classification with the most probable position, then maps $$k$$-mers to the lowest common ancestor (LCA) of all genomes known to contain the given $$k$$-mer.
->
-> ![Kraken2](../../images/metagenomics-nanopore/kmers-kraken.jpg "Kraken sequence classification algorithm. To classify a sequence, each k-mer in the sequence is mapped to the lowest common ancestor (LCA, i.e. the lowest node) of the genomes that contain that k-mer in the database. The taxa associated with the sequence's k-mers, as well as the taxa's ancestors, form a pruned subtree of the general taxonomy tree, which is used for classification. In the classification tree, each node has a weight equal to the number of k-mers in the sequence associated with the node's taxon. Each root-to-leaf (RTL) path in the classification tree is scored by adding all weights in the path, and the maximal RTL path in the classification tree is the classification path (nodes highlighted in yellow). The leaf of this classification path (the orange, leftmost leaf in the classification tree) is the classification used for the query sequence. Source: {% cite Wood2014 %}")
->
-{: .details}
+{% snippet topics/metagenomics/faqs/kraken.md %}
@@ -516,7 +508,7 @@ To look for these genes and determine the strain of the bacteria we are testing
1. Genome assembly to get contigs, i.e. longer sequences, using **metaflye** ({% cite flye %}) then assembly polishing using [__medaka consensus pipeline__](https://github.com/nanoporetech/medaka) and visualizing the assembly graph using **Bandage Image** ({% cite Wick2015 %})
2. Generate an **MLST** report with **MLST** tool that scans genomes against PubMLST schemes
-3. Generate reports with **AMR** genes and **VF** using [__ABRicate__](https://github.com/tseemann/abricate)
+3. Generate reports with **AMR** genes and **VF** using [__ABRicate__](https://github.com/tseemann/abricate)
As outputs, we will get our **FASTA** and **Tabular** files to track genes and visualize our pathogenic identification. For that we will need one more file to create a report and we can upload it directly:
@@ -952,7 +944,7 @@ To identify variants, we
-4. **Filter variants** to keep only the pass and good quality variants using **SnpSift Filter** ({% cite Cingolani2012 %})
+4. **Filter variants** to keep only the pass and good quality variants using **SnpSift Filter** ({% cite Cingolani2012 %})
>
> [__LoFreq filter__](https://csb5.github.io/lofreq/) can also be used instead; both tools perform comparably and produce results quickly.
@@ -1198,9 +1190,9 @@ We use **Heatmap w ggplot** tool along with other tabular manipulating tools to
> > 3. Both samples were spiked with the same pathogen species, _S. enterica_, but not the same strain:
> >
> > - `Barcode10_Spike2` sample is spiked with _S. enterica subsp. enterica_ strain
-> > - `Barcode11_Spike2b` sample is spiked with _S. enterica subsp. houtenae_ strain.
-> >
-> > This can be the main cause of the big similarities and the few differences of the bacteria pathogen **VF** gene products found between both of the two samples.
+> > - `Barcode11_Spike2b` sample is spiked with _S. enterica subsp. houtenae_ strain.
+> >
+> > This can be the main cause of the big similarities and the few differences of the bacteria pathogen **VF** gene products found between both of the two samples.
> > Other factors, such as the **time** and **location** of sampling, may cause further differences. Knowing the metadata of the samples used as input for the workflows in real life helps us understand what actually happened. For example, if no pathogen is found in the first samples but genes start being detected from the 7th or 8th sample onwards, we can identify where and when the pathogen entered the host and stop the source of contamination.
> >
> {: .solution}
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/CAMI_software_ranking.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/CAMI_software_ranking.png
new file mode 100644
index 00000000000000..16e23afe3f94f1
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/CAMI_software_ranking.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-kraken.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-kraken.png
new file mode 100644
index 00000000000000..898a178af37bcc
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-kraken.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-metaphlan.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-metaphlan.png
new file mode 100644
index 00000000000000..94b695518782b5
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona-metaphlan.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/krona_bacteria.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona_bacteria.png
new file mode 100644
index 00000000000000..2af7ea3d4456d9
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/krona_bacteria.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-comparison-domain.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-comparison-domain.png
new file mode 100644
index 00000000000000..5e2bc3fc842934
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-comparison-domain.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-proteobacteria.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-proteobacteria.png
new file mode 100644
index 00000000000000..97e631821d27e3
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-proteobacteria.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-results-overview.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-results-overview.png
new file mode 100644
index 00000000000000..adadb580291ce0
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-results-overview.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JC1A.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JC1A.png
new file mode 100644
index 00000000000000..7284e26a71b8f9
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JC1A.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JP4D.png b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JP4D.png
new file mode 100644
index 00000000000000..f6c30793dfb108
Binary files /dev/null and b/topics/metagenomics/tutorials/taxonomic-profiling/images/pavian-kraken-sankey-JP4D.png differ
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/krona-kraken.html b/topics/metagenomics/tutorials/taxonomic-profiling/krona-kraken.html
new file mode 100644
index 00000000000000..2a49a709c93812
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/krona-kraken.html
@@ -0,0 +1,11137 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.bib b/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.bib
new file mode 100644
index 00000000000000..e70ee3f8a3c087
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.bib
@@ -0,0 +1,172 @@
+
+@book{Bik.2014,
+ author = {Bik, Holly M.},
+ year = {2014},
+ title = {Phinch: An interactive, exploratory data visualization framework for --Omic datasets},
+ doi = {10.1101/009944}
+}
+
+@article{blanco2023extending,
+ title={Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4},
+ author={Blanco-M{\'\i}guez, Aitor and Beghini, Francesco and Cumbo, Fabio and McIver, Lauren J and Thompson, Kelsey N and Zolfo, Moreno and Manghi, Paolo and Dubois, Leonard and Huang, Kun D and Thomas, Andrew Maltez and others},
+ journal={Nature Biotechnology},
+ pages={1--12},
+ year={2023},
+ publisher={Nature Publishing Group US New York},
+ doi={10.1038/s41587-023-01688-w}
+}
+
+
+@article{Breitwieser.2020,
+ abstract = {SUMMARY
+
+Pavian is a web application for exploring classification results from metagenomics experiments. With Pavian, researchers can analyze, visualize and transform results from various classifiers, such as Kraken, Centrifuge and MetaPhlAn, using interactive data tables, heatmaps and Sankey flow diagrams. An interactive alignment coverage viewer can help in the validation of matches to a particular genome, which can be crucial when using metagenomics experiments for pathogen detection.
+
+AVAILABILITY AND IMPLEMENTATION
+
+Pavian is implemented in the R language as a modular Shiny web app and is freely available under GPL-3 from http://github.com/fbreitwieser/pavian.},
+ author = {Breitwieser, Florian P. and Salzberg, Steven L.},
+ year = {2020},
+ title = {Pavian: interactive analysis of metagenomics data for microbiome studies and pathogen identification},
+ pages = {1303--1304},
+ volume = {36},
+ number = {4},
+ journal = {Bioinformatics (Oxford, England)},
+ doi = {10.1093/bioinformatics/btz715}
+}
+
+@article{Lu.2017,
+ author = {Lu, Jennifer and Breitwieser, Florian P. and Thielen, Peter and Salzberg, Steven L.},
+ year = {2017},
+ title = {Bracken: estimating species abundance in metagenomics data},
+ pages = {e104},
+ volume = {3},
+ journal = {PeerJ Computer Science},
+ doi = {10.7717/peerj-cs.104}
+}
+
+@article{Meyer.2022,
+ abstract = {Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.},
+ author = {Meyer, Fernando and Fritz, Adrian and Deng, Zhi-Luo and Koslicki, David and Lesker, Till Robin and Gurevich, Alexey and Robertson, Gary and Alser, Mohammed and Antipov, Dmitry and Beghini, Francesco and Bertrand, Denis and Brito, Jaqueline J. and Brown, C. Titus and Buchmann, Jan and Bulu{\c{c}}, Aydin and Chen, Bo and Chikhi, Rayan and Clausen, Philip T. L. C. and Cristian, Alexandru and Dabrowski, Piotr Wojciech and Darling, Aaron E. and Egan, Rob and Eskin, Eleazar and Georganas, Evangelos and Goltsman, Eugene and Gray, Melissa A. and Hansen, Lars Hestbjerg and Hofmeyr, Steven and Huang, Pingqin and Irber, Luiz and Jia, Huijue and J{\o}rgensen, Tue Sparholt and Kieser, Silas D. and Klemetsen, Terje and Kola, Axel and Kolmogorov, Mikhail and Korobeynikov, Anton and Kwan, Jason and LaPierre, Nathan and Lemaitre, Claire and Li, Chenhao and Limasset, Antoine and Malcher-Miranda, Fabio and Mangul, Serghei and Marcelino, Vanessa R. and Marchet, Camille and Marijon, Pierre and Meleshko, Dmitry and Mende, Daniel R. and Milanese, Alessio and Nagarajan, Niranjan and Nissen, Jakob and Nurk, Sergey and Oliker, Leonid and Paoli, Lucas and Peterlongo, Pierre and Piro, Vitor C. and Porter, Jacob S. and Rasmussen, Simon and Rees, Evan R. and Reinert, Knut and Renard, Bernhard and Robertsen, Espen Mikal and Rosen, Gail L. 
+and Ruscheweyh, Hans-Joachim and Sarwal, Varuni and Segata, Nicola and Seiler, Enrico and Shi, Lizhen and Sun, Fengzhu and Sunagawa, Shinichi and S{\o}rensen, S{\o}ren Johannes and Thomas, Ashleigh and Tong, Chengxuan and Trajkovski, Mirko and Tremblay, Julien and Uritskiy, Gherman and Vicedomini, Riccardo and Wang, Zhengyang and Wang, Ziye and Wang, Zhong and Warren, Andrew and Willassen, Nils Peder and Yelick, Katherine and You, Ronghui and Zeller, Georg and Zhao, Zhengqiao and Zhu, Shanfeng and Zhu, Jie and Garrido-Oter, Ruben and Gastmeier, Petra and Hacquard, Stephane and H{\"a}u{\ss}ler, Susanne and Khaledi, Ariane and Maechler, Friederike and Mesny, Fantin and Radutoiu, Simona and Schulze-Lefert, Paul and Smit, Nathiana and Strowig, Till and Bremges, Andreas and Sczyrba, Alexander and McHardy, Alice Carolyn},
+ year = {2022},
+ title = {Critical Assessment of Metagenome Interpretation: the second round of challenges},
+ pages = {429--440},
+ volume = {19},
+ number = {4},
+ journal = {Nature methods},
+ doi = {10.1038/s41592-022-01431-4}
+}
+
+
+@article{Okie.2020,
+ abstract = {Several universal genomic traits affect trade-offs in the capacity, cost, and efficiency of the biochemical information processing that underpins metabolism and reproduction. We analyzed the role of these traits in mediating the responses of a planktonic microbial community to nutrient enrichment in an oligotrophic, phosphorus-deficient pond in Cuatro Ci{\'e}negas, Mexico. This is one of the first whole-ecosystem experiments to involve replicated metagenomic assessment. Mean bacterial genome size, GC content, total number of tRNA genes, total number of rRNA genes, and codon usage bias in ribosomal protein sequences were all higher in the fertilized treatment, as predicted on the basis of the assumption that oligotrophy favors lower information-processing costs whereas copiotrophy favors higher processing rates. Contrasting changes in trait variances also suggested differences between traits in mediating assembly under copiotrophic versus oligotrophic conditions. Trade-offs in information-processing traits are apparently sufficiently pronounced to play a role in community assembly because the major components of metabolism-information, energy, and nutrient requirements-are fine-tuned to an organism's growth and trophic strategy.},
+ author = {Okie, Jordan G. and Poret-Peterson, Amisha T. and Lee, Zarraz Mp and Richter, Alexander and Alcaraz, Luis D. and Eguiarte, Luis E. and Siefert, Janet L. and Souza, Valeria and Dupont, Chris L. and Elser, James J.},
+ year = {2020},
+ title = {Genomic adaptations in information processing underpin trophic strategy in a whole-ecosystem nutrient enrichment experiment},
+ volume = {9},
+ journal = {eLife},
+ doi = {10.7554/eLife.49816}
+}
+
+
+@article{Ondov.2011,
+ abstract = {BACKGROUND
+
+A critical output of metagenomic studies is the estimation of abundances of taxonomical or functional groups. The inherent uncertainty in assignments to these groups makes it important to consider both their hierarchical contexts and their prediction confidence. The current tools for visualizing metagenomic data, however, omit or distort quantitative hierarchical relationships and lack the facility for displaying secondary variables.
+
+RESULTS
+
+Here we present Krona, a new visualization tool that allows intuitive exploration of relative abundances and confidences within the complex hierarchies of metagenomic classifications. Krona combines a variant of radial, space-filling displays with parametric coloring and interactive polar-coordinate zooming. The HTML5 and JavaScript implementation enables fully interactive charts that can be explored with any modern Web browser, without the need for installed software or plug-ins. This Web-based architecture also allows each chart to be an independent document, making them easy to share via e-mail or post to a standard Web server. To illustrate Krona's utility, we describe its application to various metagenomic data sets and its compatibility with popular metagenomic analysis tools.
+
+CONCLUSIONS
+
+Krona is both a powerful metagenomic visualization tool and a demonstration of the potential of HTML5 for highly accessible bioinformatic visualizations. Its rich and interactive displays facilitate more informed interpretations of metagenomic analyses, while its implementation as a browser-based application makes it extremely portable and easily adopted into existing analysis packages. Both the Krona rendering code and conversion tools are freely available under a BSD open-source license, and available from: http://krona.sourceforge.net.},
+ author = {Ondov, Brian D. and Bergman, Nicholas H. and Phillippy, Adam M.},
+ year = {2011},
+ title = {Interactive metagenomic visualization in a Web browser},
+ pages = {385},
+ volume = {12},
+ journal = {BMC Bioinformatics},
+ doi = {10.1186/1471-2105-12-385}
+}
+
+
+@article{Sczyrba.2017,
+ abstract = {Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from $\sim$700 newly sequenced microorganisms and $\sim$600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.},
+ author = {Sczyrba, Alexander and Hofmann, Peter and Belmann, Peter and Koslicki, David and Janssen, Stefan and Dr{\"o}ge, Johannes and Gregor, Ivan and Majda, Stephan and Fiedler, Jessika and Dahms, Eik and Bremges, Andreas and Fritz, Adrian and Garrido-Oter, Ruben and J{\o}rgensen, Tue Sparholt and Shapiro, Nicole and Blood, Philip D. and Gurevich, Alexey and Bai, Yang and Turaev, Dmitrij and DeMaere, Matthew Z. and Chikhi, Rayan and Nagarajan, Niranjan and Quince, Christopher and Meyer, Fernando and Balvo{\v{c}}iūt{\.{e}}, Monika and Hansen, Lars Hestbjerg and S{\o}rensen, S{\o}ren J. and Chia, Burton K. H. and Denis, Bertrand and Froula, Jeff L. and Wang, Zhong and Egan, Robert and {Don Kang}, Dongwan and Cook, Jeffrey J. and Deltel, Charles and Beckstette, Michael and Lemaitre, Claire and Peterlongo, Pierre and Rizk, Guillaume and Lavenier, Dominique and Wu, Yu-Wei and Singer, Steven W. and Jain, Chirag and Strous, Marc and Klingenberg, Heiner and Meinicke, Peter and Barton, Michael D. and Lingner, Thomas and Lin, Hsin-Hung and Liao, Yu-Chieh and Silva, Genivaldo Gueiros Z. and Cuevas, Daniel A. and Edwards, Robert A. and Saha, Surya and Piro, Vitor C. and Renard, Bernhard Y. and Pop, Mihai and Klenk, Hans-Peter and G{\"o}ker, Markus and Kyrpides, Nikos C. and Woyke, Tanja and Vorholt, Julia A. and Schulze-Lefert, Paul and Rubin, Edward M. and Darling, Aaron E. and Rattei, Thomas and McHardy, Alice C.},
+ year = {2017},
+ title = {Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software},
+ pages = {1063--1071},
+ volume = {14},
+ number = {11},
+ journal = {Nature methods},
+ doi = {10.1038/nmeth.4458}
+}
+
+@misc{whipps1988mycoparasitism,
+ title={Mycoparasitism and plant disease control},
+ author={Whipps, John M and Lewis, Karen and Cooke, RC},
+ journal={Fungi in biological control systems},
+ volume={1988},
+ pages={161--187},
+ year={1988},
+ publisher={Manchester University Press Manchester},
+ url={https://www.jstor.org/stable/2434885}
+}
+
+
+@article{Wood.2019,
+ abstract = {Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85{\%}, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.},
+ author = {Wood, Derrick E. and Lu, Jennifer and Langmead, Ben},
+ year = {2019},
+ title = {Improved metagenomic analysis with Kraken 2},
+ keywords = {Alignment-free methods;Metagenomics;Metagenomics classification;Metagenomics/methods;Microbiome;Minimizers;Probabilistic data structures;Software},
+ pages = {257},
+ volume = {20},
+ number = {1},
+ journal = {Genome biology},
+ doi = {10.1186/s13059-019-1891-0},
+ file = {Kraken2:Attachments/Kraken2.pdf:application/pdf}
+}
+
+
+@article{Wood.2014,
+ abstract = {Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.},
+ author = {Wood, Derrick E. and Salzberg, Steven L.},
+ year = {2014},
+ title = {Kraken: ultrafast metagenomic sequence classification using exact alignments},
+ keywords = {Archaea/classification/genetics;Bacteria/classification/genetics;Classification;Humans;Metagenome;Metagenomics/methods;Sensitivity and Specificity;Sequence Alignment/methods;Sequence Analysis, DNA/methods;Software},
+ pages = {R46},
+ volume = {15},
+ number = {3},
+ journal = {Genome biology},
+ doi = {10.1186/gb-2014-15-3-r46},
+ file = {Kraken:Attachments/Kraken.pdf:application/pdf}
+}
+
+
+@article{Ye.2019,
+ abstract = {Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.},
+ author = {Ye, Simon H. and Siddle, Katherine J. and Park, Daniel J. and Sabeti, Pardis C.},
+ year = {2019},
+ title = {Benchmarking Metagenomics Tools for Taxonomic Classification},
+ pages = {779--794},
+ volume = {178},
+ number = {4},
+ journal = {Cell},
+ doi = {10.1016/j.cell.2019.07.010},
+ file = {Ye, Siddle et al. 2019 - Benchmarking Metagenomics Tools for Taxonomic:Attachments/Ye, Siddle et al. 2019 - Benchmarking Metagenomics Tools for Taxonomic.pdf:application/pdf}
+}
+
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.md b/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.md
new file mode 100644
index 00000000000000..dd1c538753a44a
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/tutorial.md
@@ -0,0 +1,768 @@
+---
+layout: tutorial_hands_on
+title: Taxonomic Profiling and Visualization of Metagenomic Data
+zenodo_link: https://zenodo.org/record/7871630
+questions:
+- Which species (or genera, families, ...) are present in my sample?
+- What are the different approaches and tools to get the community profile of my sample?
+- How can we visualize and compare community profiles?
+objectives:
+- Explain what taxonomic assignment is
+- Explain how taxonomic assignment works
+- Apply Kraken to assign taxonomic labels
+- Apply Krona and Pavian to visualize results of assignment and understand the output
+- Identify taxonomic classification tool that fits best depending on their data
+level: Introductory
+key_points:
+- There are 3 different types of taxonomic assignment (genome, gene or k-mer based)
+- Kraken or MetaPhlAn can be used in Galaxy for taxonomic assignment
+- Visualization of community profiles can be done using Krona or Pavian
+- The taxonomic classification tool to use depends on the data
+time_estimation: 2H
+contributions:
+ authorship:
+ - sophia120199
+ - bebatut
+tags:
+- metagenomics
+- taxonomic profiling
+---
+
+# Introduction
+
+The term **"microbiome"** describes "a characteristic microbial community occupying a reasonably well-defined habitat which has distinct physio-chemical properties. The term thus not only refers to the microorganisms involved but also encompasses their theatre of activity" ({% cite whipps1988mycoparasitism %}).
+
+Microbiome data can be gathered from different environments such as soil, water or the human gut. The central biological question is usually how the microbiome present at a specific site influences this environment. To study a microbiome, we need to use indirect methods like metagenomics or metatranscriptomics.
+
+**Metagenomic samples** contain DNA from the different organisms at the specific site where the sample was collected. Metagenomic data can be used to find out which organisms coexist in that niche and which genes are present in the different organisms.
+Metatranscriptomic samples contain the transcribed gene products, i.e. RNA, and therefore allow studying not only the presence of genes but also their expression in the given environment. The following tutorial focuses on metagenomic data, but the principle is the same for metatranscriptomic data.
+
+The investigation of microorganisms present at a specific site and their relative abundance is also called **"microbial community profiling"**.
+The basis for this is finding out which microorganisms are present in the sample. This is possible for all known microbes, i.e. those for which a species-specific DNA sequence is known.
+
+To do so, we try to **identify the taxon** to which each individual read belongs.
+
+{% snippet topics/metagenomics/faqs/taxon.md %}
+
+In metagenomics, we start with sequences derived from DNA fragments isolated from the sample of interest. Ideally, we would recover DNA from all microbes present in the sample. The underlying idea of taxonomic assignment is to compare the DNA sequences found in the sample (reads) to DNA sequences in a database. When reads match database sequences of known microbes, we can derive a list of the microbes present in the sample.
+
+When talking about taxonomic assignment or taxonomic classification, we usually mean one of two methods that, in practice, are often used interchangeably:
+- **taxonomic binning**: clustering of individual sequence reads based on similarity criteria and assignment of the clusters to reference taxa
+- **taxonomic profiling**: classification of individual reads to reference taxa to extract the relative abundances of the different taxa
+
+## Taxonomic profiling
+
+Tools for taxonomic profiling can be divided into three groups. All of them require a pre-computed database built from previously sequenced microbial DNA or protein sequences.
+1. **DNA-to-DNA**: comparison of sequencing reads with genomic databases of DNA sequences, with tools like Kraken ({% cite Wood.2014 %})
+2. **DNA-to-Protein**: comparison of sequencing reads with protein databases (more computationally intensive because all six frames of potential DNA-to-amino-acid translation have to be analyzed), with tools like DIAMOND
+3. **Marker based**: search for marker genes (e.g. the 16S rRNA sequence) in reads, which is quick but introduces bias, with tools like MetaPhlAn ({% cite blanco2023extending %})
+
+The comparison of reads to database sequences can be done in different ways, leading to three different types of taxonomic assignment:
+
+- **Genome based** approach
+
+ Reads are aligned to reference genomes. Genome abundance is measured from the depth and breadth of coverage. Furthermore, genes can be analyzed in their genomic context. Advantages of this method are its high detection accuracy, that the unclassified percentage is known, that all SNVs can be detected and that high-resolution genomic comparisons are possible. Its computational cost is medium.
+
+- **Gene based** approach
+
+ Reads are aligned to reference genes. Marker genes are then used to estimate species abundance. Furthermore, genes can be analyzed in isolation for presence or absence under a specific condition. The major advantage is the detection of the pangenome (the entire set of genes within a species). Major disadvantages are the high computational cost, the low detection accuracy and that the unclassified percentage is unknown. Intragenic SNVs can still be detected, and low-resolution genomic comparison is possible.
+
+- **k-mer based** approach
+
+ Both the database sequences and the sample's DNA are broken into strings of length $$k$$ (k-mers) for comparison. For each k-mer, a lowest common ancestor (LCA) tree is derived from all the genomes in the database containing that k-mer, and the abundance of k-mers within the tree is counted. This is the basis for a root-to-leaf path calculation, where the path with the highest score is used to classify the read. By counting the abundance of k-mers, an estimation of the relative abundance of taxa is also possible. The major advantage of k-mer based analysis is its low computational cost. Major disadvantages are the low detection accuracy, that the unclassified percentage is unknown, and that no gene detection, SNV detection or genomic comparison is possible. An example of a k-mer based analysis tool is Kraken, which will be used in this tutorial.
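
To make the k-mer idea concrete, here is a minimal Python sketch (a toy database with made-up sequences and a simple majority vote; real Kraken maps each k-mer to an LCA taxon and scores root-to-leaf paths in the taxonomy tree):

```python
from collections import Counter

def kmers(seq, k=5):
    """Yield every substring of length k in the sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy database mapping each k-mer to a taxon. Real k-mer databases map
# each k-mer to the lowest common ancestor (LCA) of all genomes that
# contain it; a flat k-mer -> taxon map is a simplification.
reference = {
    "Escherichia coli": "ATGGCGTACGTTAGC",
    "Bacillus subtilis": "ATGCCGTTAGGCATC",
}
db = {}
for taxon, genome in reference.items():
    for km in kmers(genome):
        db[km] = taxon

def classify(read, k=5):
    """Assign the taxon with the most k-mer hits (majority vote)."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    return hits.most_common(1)[0][0] if hits else "unclassified"

print(classify("ATGGCGTACG"))  # every k-mer hits the E. coli sequence
print(classify("TTTTTTTTTT"))  # no k-mer in the database -> unclassified
```

Counting how many reads end up on each taxon across a whole sample then yields a relative abundance profile.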
+
+After this theoretical introduction, let's now get hands on analyzing an actual dataset!
+
+## Background on data
+
+The dataset we will use for this tutorial comes from an oasis in the Mexican desert called Cuatro Ciénegas ({% cite Okie.2020 %}). The researchers were interested in genomic traits that affect the rates and costs of biochemical information processing within cells. They performed a whole-ecosystem experiment, in which they fertilized the pond to achieve nutrient-enriched conditions.
+
+Here we will use 2 datasets:
+- `JP4D`: a microbiome sample collected from the Lagunita Fertilized Pond
+- `JC1A`: a **control** sample from a control mesocosm.
+
+The datasets differ in size but, according to the authors, this does not matter for their analysis of genomic traits. They also underline that differences between the two samples reflect trait-mediated ecological dynamics rather than microevolutionary changes, as the experiment lasted only 32 days. This means that, depending on the available nutrients, specific lineages within the pond grow more successfully than others because of their genomic traits.
+
+The samples are referred to by the first four characters of the file names.
+The data form a collection of paired-end reads, with R1 being the forward reads and R2 the reverse reads. The reads have already been trimmed using __cutadapt__, as explained in the [Quality control tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}).
+
+>
+>
+> In this tutorial, we will cover:
+>
+> 1. TOC
+> {:toc}
+>
+{: .agenda}
+
+# Prepare Galaxy and data
+
+Any analysis should get its own Galaxy history. So let's start by creating a new one:
+
+> Data upload
+>
+> 1. Create a new history for this analysis
+>
+> {% snippet faqs/galaxy/histories_create_new.md %}
+>
+> 2. Rename the history
+>
+> {% snippet faqs/galaxy/histories_rename.md %}
+>
+{: .hands_on}
+
+Now we need to import the data.
+
+> Import datasets
+>
+> 1. Import the following samples via link from [Zenodo]({{ page.zenodo_link }}) or Galaxy shared data libraries:
+>
+> ```text
+> {{ page.zenodo_link }}/files/JC1A_R1.fastq.gz
+> {{ page.zenodo_link }}/files/JC1A_R2.fastq.gz
+> {{ page.zenodo_link }}/files/JP4D_R1.fastq.gz
+> {{ page.zenodo_link }}/files/JP4D_R2.fastq.gz
+> ```
+>
+> {% snippet faqs/galaxy/datasets_import_via_link.md %}
+> {% snippet faqs/galaxy/datasets_import_from_data_library.md %}
+>
+> 2. Create a paired collection.
+>
+> {% snippet faqs/galaxy/collections_build_list_paired.md %}
+>
+{: .hands_on}
+
+# k-mer based taxonomic assignment of metagenomic data
+
+Our input data are the DNA reads of the microbes present at Cuatro Ciénegas.
+
+To find out which microorganisms are present, we will compare the reads of the sample to a reference database, i.e. sequences of known microorganisms stored in a database, using **Kraken2** ({% cite Wood.2019 %}).
+
+{% snippet topics/metagenomics/faqs/kraken.md %}
+
+For this tutorial, we will use the Standard (archaea, bacteria, viral, plasmid, human, UniVec_Core) plus protozoa & fungi database, which means:
+
+- archaea: RefSeq complete archaeal genomes/proteins
+- bacteria: RefSeq complete bacterial genomes/proteins
+- plasmid: RefSeq plasmid nucleotide/protein sequences
+- viral: RefSeq complete viral genomes/proteins
+- human: GRCh38 human genome/proteins
+- fungi: RefSeq complete fungal genomes/proteins
+- protozoa: RefSeq complete protozoan genomes/proteins
+- UniVec_Core: A subset of UniVec, NCBI-supplied database of vector, adapter, linker, and primer sequences that may be contaminating sequencing projects and/or assemblies, chosen to minimize false positive hits to the vector database
+
+> Assign taxonomic labels with Kraken2
+>
+> 1. {% tool [Kraken2](toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/2.1.1+galaxy1) %} with the following parameters:
+> - *"Single or paired reads"*: `Paired Collection`
+> - {% icon param-collection %} *"Collection of paired reads"*: Input paired collection
+> - *"Confidence"*: `0.1`
+>
+> A confidence score of 0.1 means that at least 10% of the k-mers must match entries in the database. This value can be reduced if a less restrictive taxonomic assignment is desired.
+>
+> - In *"Create Report"*:
+> - *"Print a report with aggregrate counts/clade to file"*: `Yes`
+> - *"Select a Kraken2 database"*: most recent `Prebuilt Refseq indexes: PlusPF`
+>
+{: .hands_on}
+
+**Kraken2** will create two outputs for each dataset:
+
+- **Classification**: tabular files with one line for each sequence classified by Kraken and 5 columns:
+
+ 1. `C`/`U`: a single letter indicating whether the sequence is classified or unclassified
+ 2. Sequence ID as in the input file
+ 3. NCBI taxonomy ID assigned to the sequence, or 0 if unclassified
+ 4. Length of sequence in bp (`read1|read2` for paired reads)
+ 5. A space-delimited list indicating the lowest common ancestor (LCA) mapping of each k-mer in the sequence
+
+ For example, `562:13 561:4 A:31 0:1 562:3` would indicate that:
+ 1. The first 13 k-mers mapped to taxonomy ID #562
+ 2. The next 4 k-mers mapped to taxonomy ID #561
+ 3. The next 31 k-mers contained an ambiguous nucleotide
+ 4. The next k-mer was not in the database
+ 5. The last 3 k-mers mapped to taxonomy ID #562
+
+ `|:|` indicates end of first read, start of second read for paired reads
+
+ For JC1A:
+
+ ```
+ Column 1 Column 2 Column 3 Column 4 Column 5
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998 0 151|190 A:18 0:14 2055:5 0:1 2220095:5 0:74 |:| 0:3 A:54 2:1 0:32 204455:1 2823043:5 0:60
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:15782:2187 0 169|173 0:101 37329:1 0:33 |:| 0:10 2751189:5 0:30 1883:2 0:39 2609255:5 0:48
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:11745:2196 0 235|214 0:173 2282523:5 2746321:2 0:21 |:| 0:65 2746321:2 2282523:5 0:108
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:18358:2213 0 251|251 0:35 281093:5 0:3 651822:5 0:145 106591:3 0:21 |:| 0:64 106591:3 0:145 651822:5
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:14892:2226 0 68|59 0:34 |:| 0:25
+ U MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:18764:2247 0 146|146 0:112 |:| 0:112
+ C MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:12147:2252 9606 220|220 9606:148 0:19 9606:19 |:| 9606:19 0:19 9606:148
+ ```
+
+ >
+ >
+ > For JC1A sample
+ > 1. Is the first sequence in the file classified or unclassified?
+ > 2. What is the taxonomy ID assigned to the first classified sequence?
+ > 3. What is the corresponding taxon?
+ >
+ > >
+ > > 1. Unclassified (`U` in the first column)
+ > > 2. 9606, on line 7 (the first line with `C` in the first column)
+ > > 3. 9606 corresponds to *Homo sapiens* according to the NCBI Taxonomy.
+ > {: .solution}
+ {: .question}
+
+- **Report**: tabular files with one line per taxon and 6 columns or fields
+
+ 1. Percentage of fragments covered by the clade rooted at this taxon
+ 2. Number of fragments covered by the clade rooted at this taxon
+ 3. Number of fragments assigned directly to this taxon
+ 4. A rank code, indicating
+ - (U)nclassified
+ - (R)oot
+ - (D)omain
+ - (K)ingdom
+ - (P)hylum
+ - (C)lass
+ - (O)rder
+ - (F)amily
+ - (G)enus, or
+ - (S)pecies
+
+ Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., `G2` is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
+
+ 5. NCBI taxonomic ID number
+ 6. Indented scientific name
+
+ ```
+ Column 1 Column 2 Column 3 Column 4 Column 5 Column 6
+ 76.86 105399 105399 U 0 unclassified
+ 23.14 31740 1197 R 1 root
+ 22.20 30448 312 R1 131567 cellular organisms
+ 12.58 17254 3767 D 2 Bacteria
+ 8.77 12027 2867 P 1224 Proteobacteria
+ 4.94 6779 3494 C 28211 Alphaproteobacteria
+ 1.30 1782 1085 O 204455 Rhodobacterales
+ 0.43 593 461 F 31989 Rhodobacteraceae
+ 0.05 74 53 G 265 Paracoccus
+ ```
+
+ >
+ >
+ > 1. What are the percentages of unclassified reads for JC1A and JP4D?
+ > 2. Which kingdoms are found in JC1A and JP4D?
+ > 3. Where might the eukaryotic DNA come from?
+ > 4. How diverse are the Proteobacteria in JC1A and JP4D?
+ >
+ > >
+ > >
+ > > 1. 78% for JC1A and 90% for JP4D
+ > > 2. Kingdoms:
+ > > - JC1A: 13% Bacteria, 10% Eukaryota, 0.03% Viruses
+ > > - JP4D: 10% Bacteria, 0.7% Eukaryota
+ > > 3. It seems to be human contamination
+ > > 4. JC1A seems to have a high diversity of Proteobacteria classes and species, while JP4D seems dominated by Alphaproteobacteria
+ > {: .solution}
+ >
+ {: .question}
+
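As an illustration of the two formats described above, the following Python sketch parses one line of each output (the helper names and example lines are ours, not part of Kraken2):

```python
def parse_classification(line):
    """Split one line of the Kraken2 classification output."""
    status, seq_id, taxid, length, lca_map = line.rstrip("\n").split("\t")
    # The LCA map lists taxid:count pairs; ' |:| ' separates the two
    # reads of a pair.
    read1, _, read2 = lca_map.partition(" |:| ")
    return {"classified": status == "C", "seq_id": seq_id,
            "taxid": int(taxid), "length": length,
            "kmer_map": (read1, read2)}

def parse_report_line(line):
    """Split one line of the Kraken2 report output."""
    pct, clade_reads, direct_reads, rank, taxid, name = (
        line.rstrip("\n").split("\t"))
    return {"percent": float(pct),
            "clade_reads": int(clade_reads),    # reads in the whole clade
            "direct_reads": int(direct_reads),  # reads assigned exactly here
            "rank": rank,                       # U, R, D, K, P, C, O, F, G, S
            "taxid": int(taxid),
            # Two spaces of indentation per level encode the tree depth.
            "depth": (len(name) - len(name.lstrip())) // 2,
            "name": name.strip()}

row = parse_report_line("8.77\t12027\t2867\tP\t1224\t    Proteobacteria\n")
print(row["rank"], row["name"], row["clade_reads"])  # P Proteobacteria 12027
```
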
+Getting an overview of the assignments directly from the **Kraken2** outputs is not straightforward. We can use visualisation tools for that.
+
+A "simple and worthwhile addition to Kraken for better abundance estimates" ({% cite Ye.2019 %}) is __Bracken__ (Bayesian Reestimation of Abundance after Classification with Kraken). Instead of relying only on the proportions of classified reads, it takes a probabilistic approach to generate the final abundance profiles. It works by re-distributing reads in the taxonomic tree: "Reads assigned to nodes above the species level are distributed down to the species nodes, while reads assigned at the strain level are re-distributed upward to their parent species" ({% cite Lu.2017 %}).
+
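The redistribution idea can be sketched in a few lines (a deliberately naive proportional model with made-up numbers, not Bracken's actual Bayesian computation, which uses k-mer similarities between reference genomes):

```python
# Reads that Kraken could only place at the genus level, plus reads
# already assigned to the two species of that genus (toy numbers).
genus_level_reads = 100
species_reads = {"species A": 30, "species B": 10}

# Push the genus-level reads down to the species, in proportion to the
# reads each species already received.
total = sum(species_reads.values())
abundance = {sp: n + genus_level_reads * n / total
             for sp, n in species_reads.items()}
print(abundance)  # {'species A': 105.0, 'species B': 35.0}
```
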
+> Estimate species abundance with Bracken
+>
+> 1. {% tool [Estimate Abundance at Taxonomic Level](toolshed.g2.bx.psu.edu/repos/iuc/bracken/est_abundance/2.7+galaxy1) %} with the following parameters:
+> - {% icon param-collection %} *"Kraken report file"*: **Report** output of **Kraken**
+> - *"Select a kmer distribution"*: `PlusPF (2021-05-17)`
+>
+> It is important to choose the same database as the one used for Kraken2.
+>
+> - *"Produce Kraken-Style Bracken report"*: `yes`
+>
+{: .hands_on}
+
+# Visualization of taxonomic assignment
+
+Once we have assigned the corresponding taxa to each sequence, the next step is to properly visualize the data. There are several tools for that:
+- __Krona pie chart__ tool ({% cite Ondov.2011 %})
+- __Phinch__ ({% cite Bik.2014 %})
+- __Pavian__ ({% cite Breitwieser.2020 %})
+
+## Visualisation using Krona
+
+__Krona__ creates an interactive HTML file in which hierarchical data can be explored with zoomable, multi-layered pie charts. With this tool, we can easily visualize the composition of the bacterial communities and compare how the populations of microorganisms change according to the conditions of the environment.
+
+Kraken outputs cannot be given directly to **Krona**; they first need to be converted.
+
+__KrakenTools__ ({% cite Lu.2017 %}) is a suite of tools for working with Kraken outputs. It includes a tool designed to translate the results of the Kraken metagenomic classifier to the full representation of the NCBI taxonomy. The output of this tool can be directly visualized with the Krona tool.
+
+> Convert Kraken report file
+>
+> 1. {% tool [Krakentools: Convert kraken report file](toolshed.g2.bx.psu.edu/repos/iuc/krakentools_kreport2krona/krakentools_kreport2krona/1.2+galaxy0) %} with the following parameters:
+> - {% icon param-collection %} *"Kraken report file"*: **Report** collection of **Kraken**
+>
+> 2. Inspect the generated output for JC1A
+{: .hands_on}
+
+
+>
+>
+> ```
+> 3868 k__Bacteria
+> 2867 k__Bacteria p__Proteobacteria
+> 3494 k__Bacteria p__Proteobacteria c__Alphaproteobacteria
+> 1085 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales
+> 461 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae
+> 53 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus
+> 10 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_pantotrophus
+> 6 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_sanguinis
+> 4 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_sp._AK26
+> 1 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_sp._MA
+> ```
+>
+> 1. What are the different columns?
+> 2. What are the lines?
+>
+> >
+> >
+> > 1. Column 1 corresponds to the number of fragments assigned to a taxon; the following columns give the taxonomic lineage (from kingdom to species)
+> > 2. Each line is a taxon with its full hierarchy and the number of reads assigned to it
+> {: .solution}
+>
+{: .question}
+
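The conversion performed by KrakenTools can be sketched as follows (our own simplification: it reconstructs the lineage from the report's indentation and omits the `k__`/`p__` rank prefixes that the real tool adds):

```python
def report_to_krona(report_lines):
    """Turn Kraken-style report rows into Krona text lines.

    Each output line holds the number of reads assigned directly to a
    taxon, followed by its tab-separated lineage.
    """
    lineage = []  # names of the ancestors, one per depth level
    out = []
    for line in report_lines:
        pct, clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
        depth = (len(name) - len(name.lstrip())) // 2
        lineage = lineage[:depth] + [name.strip()]
        if int(direct) > 0 and rank not in ("U", "R"):
            out.append("\t".join([direct] + lineage))
    return out

demo = [
    "12.58\t17254\t3767\tD\t2\tBacteria\n",
    "8.77\t12027\t2867\tP\t1224\t  Proteobacteria\n",
]
for row in report_to_krona(demo):
    print(row)
```
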
+Let's now run **Krona**
+
+> Generate Krona visualisation
+> 1. {% tool [Krona pie chart](toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1+galaxy0) %} with the following parameters:
+> - *"Type of input data"*: `Tabular`
+> - {% icon param-collection %} *"Input file"*: output of **Krakentools**
+>
+> 2. Inspect the generated file
+{: .hands_on}
+
+
+
+>
+>
+> 1. What are the percentages of unclassified reads for JC1A and JP4D?
+> 2. Which kingdoms are found in JC1A and JP4D?
+> 3. Where might the eukaryotic DNA come from?
+> 4. How diverse are the Proteobacteria in JC1A and JP4D?
+>
+> >
+> >
+> > 1. 78% for JC1A and 90% for JP4D
+> > 2. Kingdoms:
+> > - JC1A: 13% Bacteria, 10% Eukaryota, 0.03% Viruses
+> > - JP4D: 10% Bacteria, 0.7% Eukaryota
+> > 3. It seems to be human contamination
+> > 4. JC1A seems to have a high diversity of Proteobacteria classes and species, while JP4D seems dominated by Alphaproteobacteria
+> {: .solution}
+>
+{: .question}
+
+
+## Visualization using Pavian
+
+__Pavian__ (pathogen visualization and more) ({% cite Breitwieser.2020 %}) is an interactive visualization tool for metagenomic data. It was developed for the clinical metagenomic problem to find a disease-causing pathogen in a patient sample, but it is useful to analyze and visualize any kind of metagenomics data.
+
+> Launch Pavian
+> 1. {% tool [Pavian](interactive_tool_pavian) %} with the following parameters:
+> - {% icon param-collection %} *"Kraken and MetaPhlAn-style reports"*: **Report** collection of **Kraken**
+>
+{: .hands_on}
+
+**Pavian** runs as a Galaxy Interactive Tool. You can access it once it turns orange.
+
+> Interact with Pavian
+>
+> 1. Open **Pavian**
+>
+> {% snippet faqs/galaxy/interactive_tools_open.md %}
+>
+> 2. Import data
+> 1. Click on `Use data on server`
+> 2. Select both samples
+> 3. Click on `Read selected directories`
+> 4. Check that the table in `Available sample sets` looks like this:
+>
+> X | FormatOK | Include | Name | ReportFile | ReportFilePath
+> --- | --- | --- | --- | --- | ---
+> 1 | X | X | JP4AD | JP4AD | /home/shiny//JP4AD
+> 2 | X | X | JC1A | JC1A | /home/shiny//JC1A
+>
+> 5. Click on `Save table`
+>
+> 3. Click on `Results Overview` in the left panel
+>
+{: .hands_on}
+
+This page shows the summary of the classifications in the selected sample set:
+
+![Screenshot of a table with 2 rows (JP4D and JC1A) and columns: Name, Number of raw reads, Classified reads, Chordate reads, Artificial reads, Unclassified reads, Microbial reads, Bacterial reads, Viral reads, Fungal reads, Protozoan reads. The cells have a bar chart that shows the relation of the cell value to other cell values in the same category](./images/pavian-kraken-results-overview.png)
+
+>
+>
+> 1. Do both samples have the same size?
+> 2. What are the percentages of classified reads for JC1A and JP4D?
+> 3. Are the percentages of bacterial reads similar?
+>
+> >
+> >
+> > 1. JP4D has many more reads than JC1A
+> > 2. 10.2% for JP4D and 23.1% for JC1A
+> > 3. 12.6% for JC1A and 9.44% for JP4D, so similar orders of magnitude
+> {: .solution}
+>
+{: .question}
+
+Let's now inspect the read assignments per sample.
+
+> Inspect samples with Pavian
+>
+> 1. Click on `Sample` in the left panel
+> 2. Select `JC1A` in the `Select sample` drop-down on the top
+>
+> The first view gives a Sankey diagram for one sample:
+>
+> ![Sankey plot with taxonomy hierarchy from domain on the left to species on the right](./images/pavian-kraken-sankey-JC1A.png)
+>
+> >
+> >
+> > 1. What is a Sankey diagram?
+> > 2. What are the different sets of values represented along the horizontal axis?
+> >
+> > >
+> > >
+> > > 1. A Sankey diagram is a visualization used to depict a flow from one set of values to another
+> > > 2. The taxonomy hierarchy from domain on the left to species on the right
+> > {: .solution}
+> >
+> {: .question}
+>
+> 3. Click on `Proteobacteria` in the Sankey plot
+> 4. Inspect the created graph on the right
+>
+> ![Bar plot with the number of reads across all samples for Proteobacteria. The JP4D bar is much higher than the JC1A one. The turquoise bar shows the number of reads identified at the specific taxon; the orange bar shows the number of reads identified at children of the specified taxon.](./images/pavian-kraken-proteobacteria.png)
+>
+> >
+> >
+> > 1. Are the numbers of reads assigned to Proteobacteria similar for both samples?
+> > 2. Why?
+> >
+> > >
+> > >
+> > > 1. JP4D has many more reads assigned to Proteobacteria than JC1A
+> > > 2. JP4D has many more reads initially
+> > {: .solution}
+> >
+> {: .question}
+{: .hands_on}
+
+Let's now compare both samples.
+
+> Compare samples with Pavian
+>
+> 1. Click on `Comparison` in the left panel
+> 2. Select `%` and deselect `Reads` in the blue drop-down at the top
+> 3. Click on the `Domain` green button
+>
+> ![Screenshot of a table with 4 rows (Bacteria, Eukaryota, Archaea, Viruses) and 7 columns (Name, Rank, TID, Max, JP4D, JC1A, Lineage)](./images/pavian-kraken-sankey-JC1A.png)
+>
+> >
+> >
+> > Is the proportion of Bacteria similar in both samples?
+> >
+> > >
+> > >
+> > > JP4D has a much higher proportion of Bacteria (>93%) than JC1A (57%), which contains quite a lot of Eukaryota
+> > {: .solution}
+> >
+> {: .question}
+>
+> 4. Select `Homo sapiens` in the `Filter taxa` box below the green buttons
+>
+> >
+> >
+> > Is the proportion of Bacteria similar in both samples?
+> >
+> > >
+> > >
+> > > After filtering out human reads, both samples have a similar proportion of Bacteria
+> > {: .solution}
+> >
+> {: .question}
+>
+> 5. Click on the `Class` green button
+>
+> >
+> >
+> > 1. How do the class-level diversities of the two samples compare?
+> > 2. What could this mean biologically, given that JC1A is a control and JP4D a sample from a fertilized pond?
+> >
+> > >
+> > >
+> > > 1. JP4D seems highly dominated by the Alphaproteobacteria class. JC1A also has a majority of Alphaproteobacteria, but with significant proportions of Betaproteobacteria, Gammaproteobacteria, Flavobacteria and Actinomycetia
+> > > 2. Alphaproteobacteria seem to have a survival advantage in the new environment. According to the authors, this correlates with specific genomic traits that enable them to cope better with high nutrient availability.
+> > {: .solution}
+> >
+> {: .question}
+>
+{: .hands_on}
+
+Once you are done with Pavian, you should delete it in your history so the corresponding job is killed.
+
+> Visualize the taxonomic classification with Phinch
+>
+> __Phinch__ ({% cite Bik.2014 %}) is another tool for visualizing large biological datasets like our taxonomic classification. It can generate Taxonomy Bar Charts, Bubble Charts, Sankey Diagrams, Donut Partitions and Attributes Column Charts.
+>
+> As a first step, we need to convert the Kraken output file into a kraken-biom file to make it accessible for Phinch. For this, we need to add a metadata file, provided on [Zenodo]({{ page.zenodo_link }}/files/metadata.tabular).
+> When generating a metadata file for your own data, you can take this as an example and apply the [general guidelines](http://qiime.org/documentation/file_formats.html).
+>
+> > Phinch
+> > 1. Import the metadata tabular from [Zenodo]({{ page.zenodo_link }}) or Galaxy shared data libraries:
+> >
+> > ```text
+> > {{ page.zenodo_link }}/files/metadata.tabular
+> > ```
+> >
+> > 2. Use {% tool [Kraken-biom](toolshed.g2.bx.psu.edu/repos/iuc/kraken_biom/kraken_biom/1.2.0+galaxy1) %} to convert the Kraken2 report into the correct format for Phinch, with the following parameters:
+> > - {% icon param-collection %} *"Input"*: **Report** output of **Kraken2**
+> > - {% icon param-file %} *"Sample Metadata file"*: Metadata tabular
+> > - *"Output format"*: `JSON`
+> >
+> > 3. {% tool [Phinch Visualisation](interactive_tool_phinch) %} with the following parameters:
+> > - *"Input"*: `Kraken-biom output file`
+> {: .hands_on}
+>
+> **Phinch** runs as a Galaxy Interactive Tool. You can access it once it turns orange.
+>
+> > Interact with Phinch
+> >
+> > 1. Open **Phinch**
+> >
+> > {% snippet faqs/galaxy/interactive_tools_open.md %}
+> >
+> > The first page shows an overview of your samples. Here, you can further filter your data, for example by date or location, depending on which information you provided in your metadata file.
+> >
+> > >
+> > >
+> > > 1. How many sequence reads do the samples contain?
+> > >
+> > > >
+> > > >
+> > > > 1. JC1A (Phinch name: 0) contains 56008 reads, while JP4D (Phinch name: 1) contains 242438 reads
+> > > {: .solution}
+> > {: .question}
+> >
+> > 2. Click on `Proceed to gallery` to see an overview of all visualization options.
+> {: .hands_on}
+>
+> Let's have a look at the **taxonomy bar chart**. Here, you can see the abundance of different taxa depicted in different colors in your samples. At the top of the chart you can select which rank should be shown. You can also change the display options, for example to switch between value and percentage.
+>
+> >
+> >
+> > 1. What information can you get from hovering over a sample?
+> > 2. What percentages of the sample reads are Bacteria and Eukaryota?
+> >
+> > >
+> > >
+> > > 1. The taxon’s name and the taxonomy occurrence in the sample
+> > > 2. Choose kingdom and hover over the bars to find the "taxonomy occurrence in this sample":
+> > > - Sample 0: 75.65% Bacteria; 24.51% Eukaryota
+> > > - Sample 1: 92.70% Bacteria; 6.87% Eukaryota
+> > {: .solution}
+> {: .question}
+>
+> Let's go back to the gallery and choose the **bubble chart**. Here, you can find the distribution of taxa across the whole dataset at the rank that you can choose above the chart. When hovering over the bubbles, you get additional information concerning the taxon.
+>
+> >
+> >
+> > 1. Which is the most abundant Class?
+> > 2. How many of its reads are found in each sample?
+> >
+> > >
+> > >
+> > > To order the bubbles by size, you can choose the `list` option right next to the taxonomy level. Clicking on one bubble gives you a direct comparison of the distribution of this taxon across the samples.
+> > > 1. The most abundant Class is Alphaproteobacteria
+> > > 2. With 18,114 reads in sample 0 and 153,230 reads in sample 1
+> > {: .solution}
+> {: .question}
+>
+>
+> Another display option is the **Sankey diagram**, which depicts the abundance of taxa as a flow chart. Again, you can choose the taxonomic level you want to show in the diagram. When you click on one bar of the diagram, this part is enlarged for a better view.
+>
+> The **donut partition** summarizes the microbial community according to non-numerical attributes. In the drop-down menu in the top right corner, you can switch between the different attributes provided in the metadata file. In our case, you can for example choose the 'environmental medium' to see the difference between sediment and water. (This doesn't reveal much in our very simple dataset, as it shows the same result as sorting by sample 0 and 1, but when attributes vary across samples it can be an interesting visualization option.) Clicking on one part of the donut also shows the distribution of the taxon across the samples. On the right-hand side you can additionally choose between a dynamic y-axis and a standard y-axis, the latter for comparing different donuts with each other.
+>
+> The **attributes column chart** summarizes the microbial community according to numerical attributes. In the drop-down menu in the top right corner, you can switch between the different attributes provided in the metadata file. In our case, you can for example choose the 'geographic location'. (Again, this doesn't reveal much in our very simple dataset, as it shows the same result as sorting by sample 0 and 1, but when attributes vary across samples it can be an interesting visualization option.)
+>
+> Once you are done with Phinch, you should delete it in your history so the corresponding job is killed.
+{: .details}
+
+# Choosing the right tool
+
+When it comes to taxonomic assignment of metagenomic data, **Kraken2** is not the only tool available. Several papers have benchmarked the different tools ({% cite Meyer.2022 %}, {% cite Sczyrba.2017 %}, {% cite Ye.2019 %}).
+
+> Benchmarking taxonomic classification tools
+>
+> The benchmarking papers present different methods for comparing the available tools:
+> - The **CAMI challenge** is based on results from different labs that each analyzed the CAMI dataset and sent their results back to the authors.
+> - {% cite Ye.2019 %} performed all the analysis themselves
+>
+> Additionally, the datasets used for both benchmarking approaches differ:
+> - **CAMI**: only ~30%-40% of reads are simulated from known taxa while the rest of the reads are from novel taxa, plasmids or simulated evolved strains.
+> - {% cite Ye.2019 %} used International Metagenomics and Microbiome Standards Alliance (IMMSA) datasets, in which the taxa are better characterized.
+>
+> When benchmarking different classification tools, several metrics are used to compare their performance:
+> 1. **Precision**: proportion of true positive species identified in the sample divided by number of total species identified by the method
+> 2. **Recall**: proportion of true positive species divided by the number of distinct species actually in the sample
+> 3. **Area under the precision-recall curve (AUPR)**: on a precision-recall curve, each point represents the precision and recall scores at a specific abundance threshold; the area under this curve summarizes the trade-off in a single number
+> 4. **L2 distance**: representation of abundance profiles → how accurately the abundance of each species or genera in the resulting classification reflects the abundance of each species in the original biological sample (“ground truth”)
+>
+> When it comes to taxonomic profiling, i.e. investigating the abundance of specific taxa, the biggest problem is abundance bias. It is introduced during DNA isolation (which might work better for some organisms than for others) and by PCR duplicates during PCR amplification.
+{: .details}
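The metrics above can be sketched in a few lines of Python (a minimal illustration with made-up species sets and abundance profiles, not data from this tutorial):

```python
import math

def precision_recall(predicted, truth):
    """Precision and recall over sets of identified species."""
    tp = len(predicted & truth)  # true positives: species found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

def l2_distance(pred_abund, true_abund):
    """L2 distance between two abundance profiles (dicts: taxon -> fraction)."""
    taxa = set(pred_abund) | set(true_abund)
    return math.sqrt(sum((pred_abund.get(t, 0.0) - true_abund.get(t, 0.0)) ** 2
                         for t in taxa))

truth = {"A", "B", "C", "D"}      # species actually in the sample
predicted = {"A", "B", "E"}       # species reported by a classifier
print(precision_recall(predicted, truth))  # (0.6666666666666666, 0.5)
```

Sweeping an abundance threshold over `predicted` and recomputing these scores at each cutoff yields the precision-recall curve, whose area is the AUPR.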
+
+> Profiling tools
+>
+> Profilers, which are tools that investigate relative abundances of taxa within a dataset, fall into three groups depending on their performance:
+> 1. Profilers that correctly predict relative abundances
+> 2. Precise profilers (suitable, when many false positives would increase cost and effort in downstream analysis)
+> 3. Profilers with high recall (suitable for pathogen detection, when the failure of detecting an organism can have severe negative consequences)
+>
+> However, some characteristics are common to all profilers:
+> - Most profilers only perform well down to the family level
+> - Drastic decrease in performance between family and genus level, while little change between order and family level
+> - Poorer performance of all profilers on CAMI datasets compared to International Metagenomics and Microbiome Standards Alliance (IMMSA) datasets
+> - Fidelity of abundance estimates decreases notably when viruses and plasmids are present
+> - High numbers of false positive calls at low abundance
+> - Taxonomic profilers vs profiles from taxonomic binning: precision and recall of the taxonomic binners were comparable to those of the profilers, while abundance estimation at higher ranks was more problematic for the binners
+{: .details}
+
+Tool | Approach | Available in Galaxy | CAMI challenge | {% cite Ye.2019 %}
+--- | --- | --- | --- | ---
+mOTUs | | No | Most memory efficient |
+MetaPhlAn | Marker genes | Yes | | Recommended for low computational requirements (< 2 Gb of memory)
+DUDes | | No | |
+FOCUS | | No | Fast, most memory efficient |
+Bracken | k-mer | Yes | Fast |
+Kraken | k-mer | Yes | Fastest; most memory efficient | Good performance metrics; very fast on large numbers of samples; allows custom databases when high amounts of memory (>100 Gb) are available
+
+> Using a marker gene approach with MetaPhlAn
+>
+> In this tutorial, we also follow the marker-gene approach using **MetaPhlAn** ({% cite truong2015metaphlan2 %}). MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) at species level. MetaPhlAn 4 relies on ~5.1M unique clade-specific marker genes identified from ~1M microbial genomes (~236,600 reference genomes (bacterial, archaeal, viral and eukaryotic) and 771,500 metagenome-assembled genomes) spanning 26,970 species-level genome bins.
+>
+> It allows:
+> - unambiguous taxonomic assignments;
+> - accurate estimation of organismal relative abundance;
+> - species-level resolution for bacteria, archaea, eukaryotes and viruses;
+> - strain identification and tracking
+> - orders of magnitude speedups compared to existing methods.
+> - microbiota strain-level population genomics
+> - MetaPhlAn clade-abundance estimation
+>
+> The basic usage of MetaPhlAn consists of identifying the clades (from phyla to species, and strains in particular cases) present in a microbiome sample and estimating their relative abundance.
+>
+> **MetaPhlAn** in Galaxy cannot directly take a paired collection as input but expects 2 collections: one with forward reads and one with reverse reads. Before launching **MetaPhlAn**, we therefore need to split our input paired collection.
+>
+> > Assign taxonomic labels with MetaPhlAn
+> >
+> > 1. {% tool [Unzip collection](__UNZIP_COLLECTION__) %} with the following parameters:
+> > - {% icon param-collection %} *"Paired input to unzip"*: Input paired collection
+> >
+> > 2. {% tool [MetaPhlAn](toolshed.g2.bx.psu.edu/repos/iuc/metaphlan/metaphlan/4.0.6+galaxy1) %} with the following parameters:
+> > - In *"Inputs"*
+> > - *"Input(s)"*: `Fasta/FastQ file(s) with microbiota reads`
+> > - *"Fasta/FastQ file(s) with microbiota reads"*: `Paired-end files`
+> > - {% icon param-collection %} *"Forward paired-end Fasta/FastQ file with microbiota reads"*: output of **Unzip collection** with **forward** in the name
+> > - {% icon param-collection %} *"Reverse paired-end Fasta/FastQ file with microbiota reads"*: output of **Unzip collection** with **reverse** in the name
+> > - In *"Outputs"*:
+> > - *"Output for Krona?"*: `Yes`
+> {: .hands_on}
+>
+>
+> 5 files and a collection are generated by **MetaPhlAn** {% icon tool %}:
+>
+> - `Predicted taxon relative abundances`: tabular files with the **community profile**
+>
+> ```
+> #SampleID Metaphlan_Analysis
+> #clade_name NCBI_tax_id relative_abundance additional_species
+> k__Bacteria 2 100.0
+> k__Bacteria|p__Bacteroidetes 2|976 94.38814
+> k__Bacteria|p__Proteobacteria 2|1224 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935 2|976| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria 2|1224|28211 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935|o__OFGB45935 2|976|| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales 2|1224|28211|204455 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935|o__OFGB45935|f__FGB45935 2|976||| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales|f__Rhodobacteraceae 2|1224|28211|204455|31989 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935|o__OFGB45935|f__FGB45935|g__GGB56609 2|976|||| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales|f__Rhodobacteraceae|g__Phycocomes 2|1224|28211|204455|31989|2873978 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935|o__OFGB45935|f__FGB45935|g__GGB56609|s__GGB56609_SGB78025 2|976||||| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales|f__Rhodobacteraceae|g__Phycocomes|s__Phycocomes_zhengii 2|1224|28211|204455|31989|2873978|2056810 5.61186
+> k__Bacteria|p__Bacteroidetes|c__CFGB45935|o__OFGB45935|f__FGB45935|g__GGB56609|s__GGB56609_SGB78025|t__SGB78025 2|976|||||| 94.38814
+> k__Bacteria|p__Proteobacteria|c__Alphaproteobacteria|o__Rhodobacterales|f__Rhodobacteraceae|g__Phycocomes|s__Phycocomes_zhengii|t__SGB31485 2|1224|28211|204455|31989|2873978|2056810| 5.61186
+> ```
+>
+> Each line contains 4 columns:
+> 1. the lineage with different taxonomic levels, from high-level taxa (kingdom: `k__`) to more precise taxa
+> 2. the NCBI taxon IDs of the taxonomic levels in the lineage
+> 3. the relative abundance of the lineage in our sample
+> 4. any additional species
+>
+> >
+> >
+> > 1. Which kingdoms have been identified for JC1A?
+> > 2. And for JP4D?
+> > 3. How does the diversity for JP4D compare to the **Kraken** results?
+> >
+> > >
+> > > 1. All reads were left unclassified, so no kingdom was identified
+> > > 2. Bacteria, no eukaryotes
+> > > 3. The diversity is much reduced for JP4D using **MetaPhlAn**, compared to the diversity identified with **Kraken**
+> > {: .solution}
+> {: .question}
+>
+> - `Predicted taxon relative abundances for Krona`: same information as the previous file but formatted for visualization using **Krona**
+>
+> ```
+> Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10
+> 94.38814 Bacteria Bacteroidetes CFGB45935 OFGB45935 FGB45935 GGB56609 GGB56609 SGB78025
+> 5.61186 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Phycocomes Phycocomes zhengii
+> 94.38814 Bacteria Bacteroidetes CFGB45935 OFGB45935 FGB45935 GGB56609 GGB56609 SGB78025 SGB78025
+> 5.61186 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Phycocomes Phycocomes zhengii SGB31485
+> ```
+>
+> Each line represents an identified taxon with up to 10 columns:
+> - Column 1: the percentage of reads assigned to the taxon
+> - Columns 2-10: the taxon description at the different taxonomic levels, from kingdom to more precise taxa
+>
+> - A **collection** with the same information as in the tabular file but split into different files, one per taxonomic level
+> - `BIOM file` with the **community profile** in BIOM format, a common and standard format in microbiomics, used as input for tools like mothur or QIIME.
+>
+> - `Bowtie2 output` and `SAM file` with the results of the sequence mapping on the reference database
+>
+> Let's now run **Krona** to visualize the communities
+>
+>
+> > Generate Krona visualisation
+> > 1. {% tool [Krona pie chart](toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1) %} with the following parameters:
+> > - *"Type of input data"*: `tabular`
+> > - {% icon param-collection %} *"Input file"*: Krona output of **Metaphlan**
+> >
+> > 2. Inspect the generated file
+> {: .hands_on}
+>
+> As pointed out before, the community looks much less diverse than with **Kraken**. This is probably because the reference database is not yet complete enough to identify all taxa, or because the input data does not contain enough reads to cover the marker genes. Indeed, no taxon was identified for JC1A, which contains far fewer reads than JP4D. **Kraken** is also known to produce a high number of false positives.
+>
+{: .details}
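For downstream analysis outside Galaxy, the profile at a single rank can be pulled out of the `Predicted taxon relative abundances` table with a short Python sketch (assuming the 4 tab-separated columns described above; the helper name `phylum_profile` is ours, not part of MetaPhlAn):

```python
def phylum_profile(lines):
    """Extract phylum-level relative abundances from a MetaPhlAn table.

    Each data line has 4 tab-separated columns: lineage (pipe-separated,
    with rank prefixes), NCBI tax IDs, relative abundance, additional species.
    """
    profile = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        lineage = cols[0].split("|")
        # Keep only rows whose most specific level is a phylum (p__)
        if lineage[-1].startswith("p__"):
            profile[lineage[-1][3:]] = float(cols[2])
    return profile

table = [
    "#SampleID\tMetaphlan_Analysis",
    "k__Bacteria\t2\t100.0\t",
    "k__Bacteria|p__Bacteroidetes\t2|976\t94.38814\t",
    "k__Bacteria|p__Proteobacteria\t2|1224\t5.61186\t",
]
print(phylum_profile(table))  # {'Bacteroidetes': 94.38814, 'Proteobacteria': 5.61186}
```

Changing the rank prefix (e.g. `c__` for class or `s__` for species) extracts the profile at any other level.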
+
+# Conclusion
+
+In this tutorial, we looked at how to obtain the community profile from microbiome data. We applied **Kraken2** to assign taxonomic labels to two microbiome sample datasets. We then visualized the results using **Krona**, **Pavian** and **Phinch** to analyze and compare the datasets. We also discussed important considerations when choosing the right tool for taxonomic assignment. Finally, we ran **MetaPhlAn** on the same datasets and compared the results to **Kraken2**.
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/workflows/index.md b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/index.md
new file mode 100644
index 00000000000000..e092e0ae66ddd4
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/index.md
@@ -0,0 +1,3 @@
+---
+layout: workflow-list
+---
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow-test.yml b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow-test.yml
new file mode 100644
index 00000000000000..6acbb58a157ef6
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow-test.yml
@@ -0,0 +1,66 @@
+---
+- doc: Test for the workflow
+ job:
+ raw-reads:
+ class: Collection
+ collection_type: 'list:paired'
+ elements:
+ - class: Collection
+ type: paired
+ identifier: JP4D
+ elements:
+ - identifier: forward
+ class: File
+ location: https://zenodo.org/record/7871630/files/JP4D_R1.fastqsanger.gz
+ filetype: fastqsanger
+ - identifier: reverse
+ class: File
+ location: https://zenodo.org/record/7871630/files/JP4D_R2.fastqsanger.gz
+ filetype: fastqsanger
+ - class: Collection
+ type: paired
+ identifier: JC1A
+ elements:
+ - identifier: forward
+ class: File
+ location: https://zenodo.org/record/7871630/files/JC1A_R1.fastqsanger.gz
+ filetype: fastqsanger
+ - identifier: reverse
+ class: File
+ location: https://zenodo.org/record/7871630/files/JC1A_2.fastqsanger.gz
+ filetype: fastqsanger
+ metadata:
+ class: File
+ location: metadata.tabular
+ filetype: tabular
+ outputs:
+ kraken_report:
+ element_tests:
+ JC1A:
+ asserts:
+ has_text:
+ text: 'Rhodobacterales'
+ JP4D:
+ asserts:
+ has_text:
+ text: 'Alphaproteobacteria'
+ krakentool_report:
+ element_tests:
+ JC1A:
+ asserts:
+ has_text:
+ text: 'o__Rhodobacterales'
+ JP4D:
+ asserts:
+ has_text:
+ text: 'o__Rhodobacterales'
+ metaphlan_output:
+ element_tests:
+ JC1A:
+ asserts:
+ has_text:
+ text: 'UNCLASSIFIED'
+ JP4D:
+ asserts:
+ has_text:
+ text: 'c__Alphaproteobacteria|o__Rhodobacterales'
\ No newline at end of file
diff --git a/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow.ga b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow.ga
new file mode 100644
index 00000000000000..7d7ab375b63fbf
--- /dev/null
+++ b/topics/metagenomics/tutorials/taxonomic-profiling/workflows/main-workflow.ga
@@ -0,0 +1,499 @@
+{
+ "a_galaxy_workflow": "true",
+ "annotation": "Taxonomic Profiling and Visualization of Metagenomic Data",
+ "creator": [
+ {
+ "class": "Person",
+ "identifier": "0000-0001-9852-1987",
+ "name": "B\u00e9r\u00e9nice Batut"
+ }
+ ],
+ "format-version": "0.1",
+ "license": "MIT",
+ "name": "Taxonomic Profiling and Visualization of Metagenomic Data",
+ "steps": {
+ "0": {
+ "annotation": "",
+ "content_id": null,
+ "errors": null,
+ "id": 0,
+ "input_connections": {},
+ "inputs": [
+ {
+ "description": "",
+ "name": "raw-reads"
+ }
+ ],
+ "label": "raw-reads",
+ "name": "raw-reads",
+ "outputs": [],
+ "position": {
+ "left": 0,
+ "top": 32
+ },
+ "tool_id": null,
+ "tool_state": "{\"name\": \"raw-reads\",\"optional\": false, \"tag\": null, \"collection_type\": \"list:paired\"}",
+ "tool_version": null,
+ "type": "data_collection_input",
+ "uuid": "8ceaae85-1ed7-4488-b839-53d36e642378",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "1": {
+ "annotation": "",
+ "content_id": null,
+ "errors": null,
+ "id": 1,
+ "input_connections": {},
+ "inputs": [
+ {
+ "description": "",
+ "name": "metadata"
+ }
+ ],
+ "label": "metadata",
+ "name": "metadata",
+ "outputs": [],
+ "position": {
+ "left": 35.48228574271107,
+ "top": 575.2289046420843
+ },
+ "tool_id": null,
+ "tool_state": "{\"name\": \"metadata\", \"optional\": false, \"tag\": null}",
+ "tool_version": null,
+ "type": "data_input",
+ "uuid": "715bbeb3-cb9a-49dd-a9c5-1f184f3eb34a",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "2": {
+ "annotation": "",
+ "content_id": "__UNZIP_COLLECTION__",
+ "errors": null,
+ "id": 2,
+ "input_connections": {
+ "input": {
+ "id": 0,
+ "output_name": "output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Unzip collection",
+ "outputs": [
+ {
+ "name": "forward",
+ "type": "input"
+ },
+ {
+ "name": "reverse",
+ "type": "input"
+ }
+ ],
+ "position": {
+ "left": 279.99761394978844,
+ "top": 1
+ },
+ "post_job_actions": {},
+ "tool_id": "__UNZIP_COLLECTION__",
+ "tool_state": "{\"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "1.0.0",
+ "type": "tool",
+ "uuid": "7e93e712-a30a-4e01-9b92-b56ac144e29c",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "3": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/2.1.1+galaxy1",
+ "errors": null,
+ "id": 3,
+ "input_connections": {
+ "single_paired|input_pair": {
+ "id": 0,
+ "output_name": "output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Kraken2",
+ "outputs": [
+ {
+ "name": "report_output",
+ "type": "tabular"
+ },
+ {
+ "name": "output",
+ "type": "tabular"
+ }
+ ],
+ "position": {
+ "left": 279.9980390107984,
+ "top": 464
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/2.1.1+galaxy1",
+ "tool_shed_repository": {
+ "changeset_revision": "e674066930b2",
+ "name": "kraken2",
+ "owner": "iuc",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"confidence\": \"0.1\", \"kraken2_database\": \"2022-09-04T165121Z_standard_prebuilt_pluspf_2022-06-07\", \"min_base_quality\": \"0\", \"minimum_hit_groups\": \"2\", \"quick\": false, \"report\": {\"create_report\": true, \"use_mpa_style\": false, \"report_zero_counts\": false, \"report_minimizer_data\": false}, \"single_paired\": {\"single_paired_selector\": \"collection\", \"__current_case__\": 0, \"input_pair\": {\"__class__\": \"ConnectedValue\"}}, \"split_reads\": false, \"use_names\": false, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "2.1.1+galaxy1",
+ "type": "tool",
+ "uuid": "ec3d15fe-c551-4986-b56b-6f92fe781a4e",
+ "when": null,
+ "workflow_outputs": [
+ {"output_name": "report_output", "label": "kraken_report"}
+ ]
+ },
+ "4": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/metaphlan/metaphlan/4.0.6+galaxy1",
+ "errors": null,
+ "id": 4,
+ "input_connections": {
+ "inputs|in|raw_in|in_f": {
+ "id": 2,
+ "output_name": "forward"
+ },
+ "inputs|in|raw_in|in_r": {
+ "id": 2,
+ "output_name": "reverse"
+ }
+ },
+ "inputs": [
+ {
+ "description": "runtime parameter for tool MetaPhlAn",
+ "name": "analysis"
+ }
+ ],
+ "label": null,
+ "name": "MetaPhlAn",
+ "outputs": [
+ {
+ "name": "output_file",
+ "type": "tabular"
+ },
+ {
+ "name": "bowtie2out",
+ "type": "tabular"
+ },
+ {
+ "name": "sam_output_file",
+ "type": "sam"
+ },
+ {
+ "name": "biom_output_file",
+ "type": "biom1"
+ },
+ {
+ "name": "krona_output_file",
+ "type": "tabular"
+ }
+ ],
+ "position": {
+ "left": 559.9960780215968,
+ "top": 0
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/metaphlan/metaphlan/4.0.6+galaxy1",
+ "tool_shed_repository": {
+ "changeset_revision": "1a6cdf55390f",
+ "name": "metaphlan",
+ "owner": "iuc",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"analysis\": {\"analysis_type\": {\"t\": \"rel_ab\", \"__current_case__\": 0, \"tax_lev\": {\"tax_lev\": \"a\", \"__current_case__\": 0, \"split_levels\": false}}, \"min_cu_len\": \"2000\", \"min_alignment_len\": null, \"organism_profiling\": [\"add_viruses\"], \"stat\": \"tavg_g\", \"stat_q\": \"0.2\", \"perc_nonzero\": \"0.33\", \"ignore_markers\": {\"__class__\": \"RuntimeValue\"}, \"avoid_disqm\": true}, \"inputs\": {\"in\": {\"selector\": \"raw\", \"__current_case__\": 0, \"raw_in\": {\"selector\": \"paired\", \"__current_case__\": 2, \"in_f\": {\"__class__\": \"ConnectedValue\"}, \"in_r\": {\"__class__\": \"ConnectedValue\"}}, \"read_min_len\": \"70\", \"mapping\": {\"bt2_ps\": \"very-sensitive\", \"min_mapq_val\": \"5\"}}, \"db\": {\"db_selector\": \"cached\", \"__current_case__\": 0, \"cached_db\": \"mpa_vOct22_CHOCOPhlAnSGB_202212-03042023\"}}, \"out\": {\"sample_id_key\": \"SampleID\", \"sample_id\": \"Metaphlan_Analysis\", \"use_group_representative\": false, \"legacy_output\": false, \"CAMI_format_output\": false, \"unknown_estimation\": false, \"krona_output\": true}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "4.0.6+galaxy1",
+ "type": "tool",
+ "uuid": "12d35121-341e-46da-aa4c-c48d4318bf6d",
+ "when": null,
+ "workflow_outputs": [
+ {"output_name": "output_file", "label": "metaphlan_output"}
+ ]
+ },
+ "5": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/krakentools_kreport2krona/krakentools_kreport2krona/1.2+galaxy0",
+ "errors": null,
+ "id": 5,
+ "input_connections": {
+ "report": {
+ "id": 3,
+ "output_name": "report_output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Krakentools: Convert kraken report file",
+ "outputs": [
+ {
+ "name": "output",
+ "type": "tabular"
+ }
+ ],
+ "position": {
+ "left": 559.9960852997306,
+ "top": 459
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/krakentools_kreport2krona/krakentools_kreport2krona/1.2+galaxy0",
+ "tool_shed_repository": {
+ "changeset_revision": "88d274322340",
+ "name": "krakentools_kreport2krona",
+ "owner": "iuc",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"intermediate_ranks\": false, \"report\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "1.2+galaxy0",
+ "type": "tool",
+ "uuid": "5de62f93-4e3a-4ea6-92c6-c3afcc088bdc",
+ "when": null,
+ "workflow_outputs": [
+ {"output_name": "output", "label": "krakentool_report"}
+ ]
+ },
+ "6": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/kraken_biom/kraken_biom/1.2.0+galaxy1",
+ "errors": null,
+ "id": 6,
+ "input_connections": {
+ "kraken_reports": {
+ "id": 3,
+ "output_name": "report_output"
+ },
+ "metadata": {
+ "id": 1,
+ "output_name": "output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Kraken-biom",
+ "outputs": [
+ {
+ "name": "biomOutput",
+ "type": "tabular"
+ }
+ ],
+ "position": {
+ "left": 559.9960852997306,
+ "top": 633
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/kraken_biom/kraken_biom/1.2.0+galaxy1",
+ "tool_shed_repository": {
+ "changeset_revision": "65eb9962d272",
+ "name": "kraken_biom",
+ "owner": "iuc",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"fmt\": \"hdf5\", \"kraken_reports\": {\"__class__\": \"ConnectedValue\"}, \"max\": \"O\", \"metadata\": {\"__class__\": \"ConnectedValue\"}, \"min\": \"S\", \"otu_fp\": false, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "1.2.0+galaxy1",
+ "type": "tool",
+ "uuid": "09f1648c-a10b-4804-8978-312dbac46e14",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "7": {
+ "annotation": "",
+ "content_id": "interactive_tool_pavian",
+ "errors": null,
+ "id": 7,
+ "input_connections": {
+ "infile": {
+ "id": 3,
+ "output_name": "report_output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Pavian",
+ "outputs": [
+ {
+ "name": "outfile",
+ "type": "txt"
+ }
+ ],
+ "position": {
+ "left": 559.9960780215968,
+ "top": 858
+ },
+ "post_job_actions": {},
+ "tool_id": "interactive_tool_pavian",
+ "tool_state": "{\"infile\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "1.0",
+ "type": "tool",
+ "uuid": "7e867e22-013a-41e7-b31f-a1f08cd52ee0",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "8": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/bracken/est_abundance/2.7+galaxy1",
+ "errors": null,
+ "id": 8,
+ "input_connections": {
+ "input": {
+ "id": 3,
+ "output_name": "report_output"
+ }
+ },
+ "inputs": [
+ {
+ "description": "runtime parameter for tool Estimate Abundance at Taxonomic Level",
+ "name": "input"
+ }
+ ],
+ "label": null,
+ "name": "Estimate Abundance at Taxonomic Level",
+ "outputs": [
+ {
+ "name": "report",
+ "type": "tabular"
+ },
+ {
+ "name": "kraken_report",
+ "type": "tabular"
+ }
+ ],
+ "position": {
+ "left": 559.5,
+ "top": 1027.5
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/iuc/bracken/est_abundance/2.7+galaxy1",
+ "tool_shed_repository": {
+ "changeset_revision": "79450f7fd718",
+ "name": "bracken",
+ "owner": "iuc",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"input\": {\"__class__\": \"RuntimeValue\"}, \"kmer_distr\": \"681f11a5-d5a8-4589-8e92-35b5da780cfa\", \"level\": \"S\", \"logfile_output\": false, \"out_report\": true, \"threshold\": \"10\", \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "2.7+galaxy1",
+ "type": "tool",
+ "uuid": "0ab3c44e-ed66-420f-b091-2581364635bf",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "9": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1+galaxy0",
+ "errors": null,
+ "id": 9,
+ "input_connections": {
+ "type_of_data|input": {
+ "id": 4,
+ "output_name": "krona_output_file"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Krona pie chart",
+ "outputs": [
+ {
+ "name": "output",
+ "type": "html"
+ }
+ ],
+ "position": {
+ "left": 839.9937065276528,
+ "top": 174
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1+galaxy0",
+ "tool_shed_repository": {
+ "changeset_revision": "e9005d1f3cfd",
+ "name": "taxonomy_krona_chart",
+ "owner": "crs4",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"combine_inputs\": false, \"root_name\": \"Root\", \"type_of_data\": {\"type_of_data_selector\": \"text\", \"__current_case__\": 1, \"input\": {\"__class__\": \"ConnectedValue\"}}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "2.7.1+galaxy0",
+ "type": "tool",
+ "uuid": "c2629c5e-7bbc-4bb7-a50a-8cf875ccebea",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "10": {
+ "annotation": "",
+ "content_id": "toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1+galaxy0",
+ "errors": null,
+ "id": 10,
+ "input_connections": {
+ "type_of_data|input": {
+ "id": 5,
+ "output_name": "output"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Krona pie chart",
+ "outputs": [
+ {
+ "name": "output",
+ "type": "html"
+ }
+ ],
+ "position": {
+ "left": 839.9937065276528,
+ "top": 491
+ },
+ "post_job_actions": {},
+ "tool_id": "toolshed.g2.bx.psu.edu/repos/crs4/taxonomy_krona_chart/taxonomy_krona_chart/2.7.1+galaxy0",
+ "tool_shed_repository": {
+ "changeset_revision": "e9005d1f3cfd",
+ "name": "taxonomy_krona_chart",
+ "owner": "crs4",
+ "tool_shed": "toolshed.g2.bx.psu.edu"
+ },
+ "tool_state": "{\"combine_inputs\": false, \"root_name\": \"Root\", \"type_of_data\": {\"type_of_data_selector\": \"text\", \"__current_case__\": 1, \"input\": {\"__class__\": \"ConnectedValue\"}}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "2.7.1+galaxy0",
+ "type": "tool",
+ "uuid": "5e144abe-de6e-443d-a88a-9c5ba50d3631",
+ "when": null,
+ "workflow_outputs": []
+ },
+ "11": {
+ "annotation": "",
+ "content_id": "interactive_tool_phinch",
+ "errors": null,
+ "id": 11,
+ "input_connections": {
+ "infile": {
+ "id": 6,
+ "output_name": "biomOutput"
+ }
+ },
+ "inputs": [],
+ "label": null,
+ "name": "Phinch Visualisation",
+ "outputs": [
+ {
+ "name": "outfile",
+ "type": "txt"
+ }
+ ],
+ "position": {
+ "left": 839.9937065276528,
+ "top": 690
+ },
+ "post_job_actions": {},
+ "tool_id": "interactive_tool_phinch",
+ "tool_state": "{\"infile\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": null, \"__rerun_remap_job_id__\": null}",
+ "tool_version": "0.1",
+ "type": "tool",
+ "uuid": "1fd294fd-8c05-4323-b2ed-458f89038e1e",
+ "when": null,
+ "workflow_outputs": []
+ }
+ },
+ "tags": [
+ "Metagenomics"
+ ],
+ "uuid": "db73fdcb-51b9-4830-a45a-acd1697cfa39",
+ "version": 4
+}
\ No newline at end of file
diff --git a/topics/single-cell/images/scrna-casestudy-monocle/switch_kernel.jpg b/topics/single-cell/images/scrna-casestudy-monocle/switch_kernel.jpg
new file mode 100644
index 00000000000000..6f1ca32d20cb0e
Binary files /dev/null and b/topics/single-cell/images/scrna-casestudy-monocle/switch_kernel.jpg differ
diff --git a/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/preamble.md b/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/preamble.md
index a62e5da3441ff4..065dd55abae62d 100644
--- a/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/preamble.md
+++ b/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/preamble.md
@@ -1,50 +1,44 @@
# Introduction
-This tutorial is the next one in the [Single-cell RNA-seq: Case Study]({% link topics/single-cell/index.md %}) series. This tutorial focuses on trajectory analysis using [monocle3](https://cole-trapnell-lab.github.io/monocle3/), similarly to the [Monocle3 in Galaxy tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}). However, in this tutorial we will use the R programming language that hides behind the user-friendly Galaxy tools. Sometimes you might encounter limitations when working with Galaxy tools, or you might want to make a wee modification that has to be done manually. It is therefore useful to be able to switch between R and Galaxy smoothly. If you do not feel confident using R, [this tutorial]({% link topics/data-science/tutorials/r-basics/tutorial.md %}) is a good place to start. However, our tutorial is quite straightforward to follow and at the end you will feel like a programmer! On the other hand, if you are not confident with the biological or statistical theory behind trajectory analysis, check out the [slide deck]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/slides.html %}). With those resources (including the previous case study tutorials) you are well-equipped to go through this tutorial with ease. Let’s get started!
+This tutorial is the next one in the [Single-cell RNA-seq: Case Study]({% link topics/single-cell/index.md %}) series. This tutorial focuses on trajectory analysis using [monocle3](https://cole-trapnell-lab.github.io/monocle3/), similar to the [Monocle3 in Galaxy tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}). However, in this tutorial we will use the R programming language that hides behind the user-friendly Galaxy tools. Sometimes you might encounter limitations when working with Galaxy tools, or you might want to make a wee modification that has to be done manually. It is therefore useful to be able to switch to R. If you do not feel confident using R, [this tutorial]({% link topics/data-science/tutorials/r-basics/tutorial.md %}) is a good place to start. However, our tutorial is quite straightforward to follow and at the end you will feel like a programmer! On the other hand, if you are not confident with the biological or statistical theory behind trajectory analysis, check out the [slide deck]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/slides.html %}). With those resources (including the previous case study tutorials) you are well-equipped to go through this tutorial with ease. Let’s get started!
>
> This tutorial is significantly based on the [Monocle3 documentation](https://cole-trapnell-lab.github.io/monocle3/docs/introduction/).
{: .comment}
-## Get data
-In the [Monocle3 in Galaxy tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}), we showed that Monocle3 works great with annotated data, but what if your data is not annotated yet? Is it still possible to use Monocle? The answer is yes, Monocle can annotate cells according to their type, which you will perform in this tutorial.
+## Optional: Get data into a Galaxy history
+In the [Monocle3 in Galaxy tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}), we showed that Monocle3 works great with annotated data, but what if your data is not annotated yet? Is it still possible to use Monocle? The answer is yes, Monocle can annotate cells according to their type.
-First, we need to retrieve the appropriate data. We will continue to work on the case study data from a mouse model of fetal growth restriction {% cite Bacon2018 %} (see [the study in Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6945/results/tsne) and [the project submission](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6945/)). We will use the filtered AnnData object, before normalisation and annotation, generated in the [filtering tutorial]({% link topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md %}) as step `20: Filtered Object`. However for this tutorial, you don't have to download any files on your computer or even import files to Galaxy! We will show you the whole analysis in R, starting from AnnData object. If you wish, you can get this file to your Galaxy history, following the step below, but you can also jump directly into JupyLab.
+First, we need to retrieve the appropriate data. We will continue to work on the case study data from a mouse model of fetal growth restriction {% cite Bacon2018 %} (see [the study in Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6945/results/tsne) and [the project submission](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6945/)). We will use the filtered AnnData object, before normalisation and annotation, generated in the [filtering tutorial]({% link topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.md %}) as step `20: Filtered Object`. However, for this tutorial, you don't have to download any files to your computer or even import files to Galaxy! We will show you the whole analysis in R, starting from the AnnData object. That said, if you'd like to examine the datasets in a history, the instructions are below.
-> Option 1: Data upload - Import history
->
-> 1. Import history from: [input history](https://usegalaxy.eu/u/wendi.bacon.training/h/cs4trajectories--monocle3--rstudio---input)
+> Optional data upload into Galaxy history
>
+> You have three options for importing the input data into a Galaxy history.
+>
+> 1. You can import a history from the [input history](https://usegalaxy.eu/u/wendi.bacon.training/h/cs4trajectories--monocle3--rstudio---input); import the files from [Zenodo]({{ page.zenodo_link }}); or import the files from the shared data library (`GTN - Material` -> `{{ page.topic_name }}`
+> -> `{{ page.title }}`):
>
> {% snippet faqs/galaxy/histories_import.md %}
>
-> 2. **Rename** {% icon galaxy-pencil %} the the history to your name of choice.
->
-{: .hands_on}
-
-> Option 2: Data upload - Add to history
->
-> 1. Create a new history for this tutorial
-> 2. Import the files from [Zenodo]({{ page.zenodo_link }}) or from
-> the shared data library (`GTN - Material` -> `{{ page.topic_name }}`
-> -> `{{ page.title }}`):
->
> ```
> {{ page.zenodo_link }}/files/AnnData_filtered.h5ad
> ```
>
+> {% snippet faqs/galaxy/datasets_import_from_data_library.md %}
> {% snippet faqs/galaxy/datasets_import_via_link.md %}
>
-> 3. Check that the datatype is `h5ad`
+> 2. Check that the datatype is `h5ad`
>
> {% snippet faqs/galaxy/datasets_change_datatype.md datatype="h5ad" %}
>
+> 3. **Rename** {% icon galaxy-pencil %} the history to your name of choice.
+>
{: .hands_on}
## Launching JupyterLab
-{% icon warning %} Please note: this is only currently available on the [usegalaxy.eu](https://usegalaxy.eu) and [usegalaxy.org](https://usegalaxy.org) sites.
+{% icon warning %} Please note: this is currently only available on the [usegalaxy.eu](https://usegalaxy.eu) site.
> Data uploads & JupyterLab
> There are a few ways of importing and uploading data into JupyterLab. You might find yourself accidentally doing this differently than the tutorial, and that's ok. There are a few key steps where you will call files from a location - if these don't work for you, check that the file location is correct and change accordingly!
@@ -106,12 +100,18 @@ If you followed the {% icon tip %} tip above, you should already have your Jupyt
>Installing the packages
>
> 1. Navigate back to the `Terminal`
-> 2. In the Terminal tab open, write the following, preferably one line at a time:
+> 2. In the open Terminal tab, write the following, one line at a time:
> ```
->conda install -c conda-forge -c bioconda r-monocle3
->conda install -c conda-forge -c bioconda anndata
->conda install -c conda-forge r-viridislite
->conda install -c conda-forge bioconductor-biomart
+>conda install -y -c conda-forge -c bioconda r-monocle3
+>```
+>```
+>conda install -y -c conda-forge -c bioconda anndata
+>```
+>```
+>conda install -y -c conda-forge r-viridislite
+>```
+>```
+>conda install -y -c conda-forge bioconductor-biomart
>```
> 3. If you are asked at any point `Proceed ([y]/n)?`, type `y` - surely we want to proceed!
>
@@ -139,7 +139,7 @@ Installation will take a while, so in the meantime, when it's running, you can o
>library(monocle3)
>```
> > Installation errors
-> > It may happen that you will encounter some problems with installation of monocle3 when using RStudio Galaxy instance or RStudio Cloud. It might be due to using older versions of R or required packages or lack of required dependencies. If it happens, you would need to carefully read the error messages and follow the suggestions. If you are facing any difficulties with installation process, it is recommended that you consult your problem with additional online resources. It is more likely that RStudio Cloud or Galaxy tool would fail rather than local RStudio. To make your analysis stress-free, you can follow the Jupyter Notebook instead, which should not give you installation issues.
+> > You may encounter problems installing monocle3 when using the RStudio Galaxy instance or RStudio Cloud. This can be due to older versions of R or of the required packages, or to missing dependencies. If it happens, read the error messages carefully and follow their suggestions. If you are still facing difficulties with the installation process, it is recommended that you consult additional online resources. RStudio Cloud or the Galaxy tool is more likely to fail than a local RStudio installation. To make your analysis stress-free, you can follow the Jupyter Notebook instead, which should not give you installation issues.
> {: .warning}
>
{: .tip}
diff --git a/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.md b/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.md
index 6f52ac68e21132..18ff68b6813cc5 100644
--- a/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.md
+++ b/topics/single-cell/tutorials/scrna-case_monocle3-rstudio/tutorial.md
@@ -1,7 +1,7 @@
---
layout: tutorial_hands_on
-title: 'Trajectory Analysis: Monocle3 in R'
+title: 'Inferring Trajectories using Monocle3 (R)'
subtopic: single-cell-CS
priority: 6
zenodo_link: 'https://zenodo.org/record/7455590'
@@ -16,16 +16,14 @@ questions:
objectives:
- Identify which operations are necessary to transform an AnnData object into the files needed for Monocle
- Describe the Monocle3 functions in R
-- Execute tools and functions to switch between Galaxy and R fluently
-- Recognise steps that can only be performed in R, but not with Galaxy tools
-- Repeat the Monocle3 workflow and choose the right parameter values
+- Recognise steps that can be performed in R, but not with current Galaxy tools
+- Repeat the Monocle3 workflow and choose appropriate parameter values
- Compare the outputs from Scanpy, Monocle in Galaxy and Monocle in R
- Describe differential expression analysis methods
-time_estimation: 2H
+time_estimation: 3H
key_points:
-- Being able to switch between Galaxy and R when using Monocle is useful, particularly when you need to modify the CDS object manually.
- Monocle3 in R gives more flexibility when it comes to differential expression analysis and plotting, but Galaxy offers great reproducibility and ease of analysis.
- Comparing the output of several different methods applied on the same dataset might be useful to confirm the results, to ensure that the findings are reliable and even sometimes to find a new piece of information.
@@ -59,8 +57,8 @@ notebook:
snippet: topics/single-cell/tutorials/scrna-case_monocle3-rstudio/preamble.md
---
-## Setting the environment and files upload
-Once the installation is done, we should load the needed packages into our notebook.
+## Setting up the environment and file upload
+Once the installation is done, we should load the needed packages into our notebook. Navigate back to your `notebook`. If you are using our prepopulated notebook, you can follow the tutorial from there. Otherwise, input the following into your fresh notebook.
```r
install.packages("Rcpp") # needed for reduce_dimension to avoid AnnoyAngular error; library(Rcpp) might work as well depending on the version
@@ -70,6 +68,7 @@ install.packages("Rcpp") # needed for reduce_dimension to avoid AnnoyAngular
library(monocle3)
library(biomaRt)
library(magrittr) # needed for %>%
+library(viridisLite)
```
```r
@@ -88,7 +87,7 @@ Then we have to read in this h5ad file:
```r
ann <- anndata::read_h5ad('AnnData.h5ad')
```
-Now we store all the information we need in AnnData object. However, Monocle uses *cell_data_set class* to hold expression data,which requires three input files: `expression_matrix`, `cell_metadata` and `gene_metadata`. Therefore, we have to somehow 'transfer' that information from our AnnData object to Monocle's cell_data_set (cds). AnnData stores a data matrix `X` together with annotations of observations `obs` and variables `var`, so we can extract those parameters and use them for further analysis.
+Now we store all the information we need in AnnData object. However, Monocle uses *cell_data_set class* to hold expression data, which requires three input files: `expression_matrix`, `cell_metadata` and `gene_metadata`. Therefore, we have to somehow 'transfer' that information from our AnnData object to Monocle's cell_data_set (cds). AnnData stores a data matrix `X` together with annotations of observations `obs` and variables `var`, so we can extract those parameters and use them for further analysis.
```r
expression_matrix <- ann$X
@@ -98,7 +97,7 @@ gene_metadata <- ann$var
> Uploading files from your computer is also an option!
>
-> If you already have files containing the expression matrix, genes and cells metadata, you can upload them to JupyLab and generate cds file from them! For example, if you first downloaded the files from Galaxy your files will have `.tabular` extension. In this case, we will use `read.delim()` function to read them in. In this function, the first argument is the file path - in our case, the files are in the same folder as the notebook, so the file path is the same as the file name. You can always check that by right-clicking on the file and choosing `Copy path`. The second argument, `row.names=1` takes the column number of the data file from which to take the row names.
+> If you already have files containing the expression matrix, genes and cells metadata, you can upload them to JupyLab and generate a cds file from them instead. For example, if you first downloaded the files from Galaxy, your files will have the `.tabular` extension. In this case, we will use the `read.delim()` function to read them in. In this function, the first argument is the file path - in our case, the files are in the same folder as the notebook, so the file path is the same as the file name. You can always check that by right-clicking on the file and choosing `Copy path`. The second argument, `row.names=1` takes the column number of the data file from which to take the row names.
> ```r
> # read in the files
> cell_metadata <- read.delim('cells.tabular', row.names=1)
@@ -112,7 +111,7 @@ gene_metadata <- ann$var
> >
> > >
> > >
-> > > This allows us to ensure that the expression value matrix has the same number of columns as the `cell_metadata` has rows and the same number of rows as the `gene_metadata` has rows. Importantly, row names of the `cell_metadata` object should match the column names of the expression matrix and row names of the `gene_metadata` object should match row names of the expression matrix.
+> > > This allows us to ensure that the expression value matrix has the same number of columns as the `cell_metadata` has rows and the same number of rows as the `gene_metadata` has rows. Importantly, the row names of the `cell_metadata` object should match the column names of the expression matrix and the row names of the `gene_metadata` object should match the row names of the expression matrix.
> > >
> > {: .solution}
> {: .question}
@@ -147,18 +146,18 @@ gene_metadata <- ann$var
{: .tip}
-According to [Monocle3 documentation](https://cole-trapnell-lab.github.io/monocle3/docs/starting/), `expression_matrix` should have genes as rows and cells as columns. Let's check if that's the case here.
+According to the [Monocle3 documentation](https://cole-trapnell-lab.github.io/monocle3/docs/starting/), the `expression_matrix` should have genes as rows and cells as columns. Let's check if that's the case here.
```r
-expression_matrix # preview the content of the file by calling its name
+head(expression_matrix, c(5, 5)) # preview the first 5 rows and 5 columns of the matrix
```
We can see that in our matrix rows are cells and genes are columns, so we have to transpose the matrix simply using function `t()`. But before doing so, we will change its type from dataframe to matrix - this is Monocle's requirement to generate cell_data_set afterwards.
```r
expression_matrix <- as.matrix(expression_matrix) # change the type to matrix
expression_matrix <- t(expression_matrix) # transpose the matrix
```
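+After transposing, a quick sanity check can confirm that the matrix now lines up with the metadata (a sketch using the objects defined above; `stopifnot()` raises an error if any condition is not met):
+
+```r
+# genes should now be rows and cells columns
+stopifnot(nrow(expression_matrix) == nrow(gene_metadata))
+stopifnot(ncol(expression_matrix) == nrow(cell_metadata))
+```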
-Another condition we have to satisfy if that one of the columns of the `gene_metadata` should be named "gene_short_name", which represents the gene symbol for each gene. Some functions won't work without that. Do we have such a column? Let's check.
+Another condition we have to satisfy is that one of the columns of the `gene_metadata` should be named "gene_short_name", which represents the gene symbol for each gene. Some functions won't work without that. Do we have such a column? Let's check.
```r
-gene_metadata # preview the content of the file to check the name of the column containing gene symbols
+head(gene_metadata) # preview the first six rows of the file to check the name of the column containing gene symbols
```
The second column indeed contains gene symbols, but is called "Symbol" instead of "gene_short_name". That can be easily changed by a simple assignment, as long as we know the number of the column that we want to modify. In our case the gene symbols are stored in column 2. We can access the column names by `colnames()`.
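+One way to do the renaming can be sketched as follows (assuming, as noted above, that the gene symbols sit in column 2):
+
+```r
+colnames(gene_metadata)[2] <- "gene_short_name"  # rename the second column
+```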
@@ -168,7 +167,7 @@ colnames(gene_metadata) # see the changes
```
## Generating CDS object
-Now let’s store our files in one object – the cell_data_set. This is the main class used by Monocle to hold single cell expression data. The class is derived from the Bioconductor SingleCellExperiment class. It's similar to Python's AnnData storing a data matrix together with annotations of observations and variables. There are three ways of creating CDS object in monocle:
+Now let’s store our files in one object – the `cell_data_set`. This is the main class used by Monocle to hold single-cell expression data. The class is derived from the Bioconductor SingleCellExperiment class. Similar to Python's AnnData, the cell_data_set stores a data matrix together with annotations of observations and variables. There are three ways of creating a CDS object in Monocle:
- Using ```new_cell_data_set() ``` function with three data frames as arguments (not their paths!): expression matrix (can also be a sparseMatrix), cell metadata and gene metadata
- Using ```load_cellranger_data() ``` function providing the path to the folder containing 10X Genomics Cell Ranger output files. This function takes an argument `umi_cutoff` that determines how many reads a cell must have to be included
- Using ```load_mm_data() ``` function providing the paths to matrix file and two metadata files (features and cell information).
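+As a sketch, the first option looks like this (using the three objects prepared earlier):
+
+```r
+# combine the expression matrix, cell metadata and gene metadata into one CDS object
+cds <- new_cell_data_set(expression_matrix,
+                         cell_metadata = cell_metadata,
+                         gene_metadata = gene_metadata)
+```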
@@ -182,7 +181,7 @@ We are now ready to process our data!
> Format conversion
>
-> Since Monocle’s CDS object is analogous to Python's AnnData, why don’t we use some kind of conversion between those two formats? There is indeed a package called `sceasy` that helps easy conversion of different single-cell data formats to each other. However, when we tested this conversion on our dataset and then used Monocle to plot the expression of genes, the plots were not correct – the expression was shown to be ideantical throughout the sample. For comparison, Seurat did well when plotting gene expression of the same converted object! Although conversion functions are very handy, you have to be aware that their output might be interpreted differently by certain packages. Therefore, to make sure that the analysis is reliable, we decided to generate CDS object directly using Monocle’s function.
+> Since Monocle’s CDS object is analogous to Python's AnnData, why don’t we use some kind of conversion between those two formats? There is indeed a package called `sceasy` that facilitates easy conversion between different single-cell data formats. However, when we tested this conversion on our dataset and then used Monocle to plot the expression of genes, the plots were not correct – the expression was shown to be identical throughout the sample. For comparison, Seurat did well when plotting gene expression of the same converted object! Although conversion functions are very handy, you have to be aware that their output might be interpreted differently by certain packages. Therefore, to make sure that the analysis is reliable, we decided to generate the CDS object directly using Monocle’s function.
> ![Comparison between plots of gene expression generated by Monocle and Seurat using the CDS object that was converted from AnnData using SCEasy tool. Gene expression in Monocle plots is identical throughout the sample, while the expression of the same genes plotted by Seurat is noticeable only in specific clusters.](../../images/scrna-casestudy-monocle/monocle_seurat.png "Comparison between plots of gene expression generated by Monocle and Seurat using the CDS object that was converted from AnnData using SCEasy tool.")
>
{: .details}
@@ -193,11 +192,11 @@ We are now ready to process our data!
>
{: .warning}
-If you remember the very first tutorial, we were starting with gene IDs and adding gene symbols based on the Ensembl GTF file.
+If you remember the very first tutorial in the case study, we started with gene IDs and added gene symbols based on the Ensembl GTF file.
But what if we didn’t have the gene symbols in our CDS object and wanted to add them now? Of course - it's possible! We will also base this annotation on Ensembl - the genome database – with the use of the library BioMart. We will use the same archive as in the Alevin tutorial (Genome assembly GRCm38) to get the gene names. Please note that the updated version (GRCm39) is available, but some of the gene IDs are not in that Ensembl database. The code below is written in a way that it will work for the updated dataset too, but will produce ‘NA’ where the corresponding gene name couldn’t be found.
```r
cds_extra <- cds # assign our CDS to a new object for the demonstration purpose
-rownames(fData(cds_extra)) # preview of the gene IDs as rownames
+head(rownames(fData(cds_extra))) # preview of the gene IDs as rownames
```
```r
# get relevant gene names
@@ -223,7 +222,7 @@ genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),
```
```r
# see the resulting data
-genes
+head(genes)
```
```r
# replace IDs for gene names
@@ -267,7 +266,9 @@ Do you remember the Monocle workflow introduced in the previous tutorial? Here i
![Monocle workflow: scRNA-seq dataset, pre-process data (normalise, remove batch effects), non-linear dimensionality reduction (t-SNE, UMAP), cluster cells, compare clusters (identify top markers, targeted contrasts), trajectory analysis](../../images/scrna-casestudy-monocle/monocle3_new_workflow.png "Workflow provided by Monocle3 documentation")
## Pre-processing
-Let’s start with normalisation and pre-processing that can be performed using the function `preprocess_cds()`. The argument `num_dim` is the number of principal components that will be computed when using PCA during normalisation. Then you can check that you're using enough PCs to capture most of the variation in gene expression across all the cells in the data set. Note that “PCA” is the default method of pre-processing in Monocle3, so although we can specify this in our function, we don’t have to.
+Let’s start with normalisation and pre-processing that can be performed using the function `preprocess_cds()`. The argument `num_dim` is the number of principal components that will be used. You can check that you're using enough PCs to capture most of the variation in gene expression across all the cells in the data set. Note that “PCA” is the default method of pre-processing in Monocle3, so although we can specify this in our function, we don’t have to.
+
+Note that this step can take a while - around 5 minutes or so, due to the high number of PCs calculated. Feel free to use a `num_dim` value around 100 to accelerate preprocessing and compare the results.
```r
# PCA pre-processing with 210 principal components
@@ -280,19 +281,19 @@ plot_pc_variance_explained(cds_preprocessing)
![Plot of variation in gene expression vs PCA components, decreasing exponentially.](../../images/scrna-casestudy-monocle/pca_plot.jpg " Plot of variation in gene expression vs PCA components.")
-The plot shows that actually using more than ~100 PCs captures only a small amount of additional variation. However, if we look at how the cells are plotted on 2D graph when using different values of PCs, it is easier to imagine how the `num_dim` actually affects the output. Therefore, for this demonstration we will use the value of 210, which, compared to the results from the previous tutorial, makes the most sense for our dataset. However, it will take quite some time to run this, and this is why generally smaller PCs are preferable, so feel free to use `num_dim` value around 100 to accelerate preprocessing and compare the results.
+The plot shows that using more than ~100 PCs captures only a small amount of additional variation. However, if we look at how the cells are plotted on a 2D graph when using different values of PCs, it is easier to visualise how `num_dim` actually affects the output. We will use the value of 210, which, compared to the results from the previous tutorial, makes the most sense for our dataset.
![Six plots showing only the shaded shape of how the cells are clustered depending on the num_dim argument. The general trend is maintained though.](../../images/scrna-casestudy-monocle/num_dim.jpg "The 'shape' of the plot showing how the cells are clustered depending on the 'num_dim' argument.")
## Batch correction and Dimensionality reduction
-Our dataset actually comprises data from 7 samples, so there is a risk that the batch effects can be observed. Those are systematic differences in the transcriptome of cells measured in different experimental batches. However, we can use Monocle to deal with that!
-First, let’s check how our dataset looks like in terms of batch effects. We can do that by colouring the cells by batch. This information is stored in our CDS object from `cell_metadata` file. Before asking Monocle to plot anything, let’s check the exact column name of the batch information column.
+Our dataset actually comprises data from 7 samples, so there is a risk that batch effects will impact analysis. Batch effects are systematic differences in the transcriptome of cells measured in different experimental batches. However, we can use Monocle to deal with that!
+First, let’s check how our dataset looks in terms of batch effects. We can do that by colouring the cells by batch. This information is stored in our CDS object from the `cell_metadata` file. Before asking Monocle to plot anything, let’s check the exact column name of the batch information column.
```r
colnames(colData(cds_preprocessing)) # check column names
```
In our case it’s indeed ‘batch’, but your data might have another name (e.g. "plate"), so make sure you use the correct argument value.
-Then, we have to reduce dimension before attempting to plot. The [previous tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}) introduced the methods of dimensionality reduction in Monocle. Of course you can replicate what we did in Galaxy to compare the output of dimensionality reduction using different methods, simply by changing the `reduction_method` argument. Options currently supported by Monocle are "UMAP", "tSNE", "PCA", "LSI", and "Aligned". However, as for now, let’s just recall that UMAP gave the best results, so we will use UMAP here as well.
+In order to plot our cells on a 2D graph, we need to reduce the number of dimensions. The [previous tutorial]({% link topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md %}) introduced the methods of dimensionality reduction in Monocle. You can replicate what we did in Galaxy to compare the output of dimensionality reduction using different methods, simply by changing the `reduction_method` argument. Options currently supported by Monocle are "UMAP", "tSNE", "PCA", "LSI", and "Aligned". However, for now, let’s just recall that UMAP gave the best results, so we will use it here as well.
```r
# reduce dimension first
cds_preprocessing_UMAP <- reduce_dimension(cds_preprocessing, reduction_method = "UMAP", preprocess_method = "PCA")
@@ -308,7 +309,7 @@ We can see that upper and lower right branches mostly consist of N705 and N706,
cds_batch <- align_cds(cds_preprocessing_UMAP, preprocess_method = "PCA", alignment_group = "batch")
```
-To see the changes, we have to run UMAP again, but this time on the aligned dataset, so we will specify that `preprocess_method ` is now "Aligned" and not “PCA”, however after having used `align_cds()` Monocle would use "Aligned" argument automatically if no `preprocess_method` was specified.
+To see the changes, we have to run UMAP again, but this time on the aligned dataset, so we will specify `preprocess_method` as "Aligned" rather than "PCA". Note that after `align_cds()`, Monocle would use "Aligned" automatically if no `preprocess_method` was specified.
```r
# dimensionality reduction after alignment
cds_red_dim <- reduce_dimension(cds_batch, preprocess_method = "Aligned", reduction_method = "UMAP")
@@ -320,7 +321,20 @@ plot_cells(cds_red_dim, color_cells_by="batch", label_cell_groups=FALSE)
![Left image showing dataset before batch correction: upper and lower right branches mostly consist of N705 and N706. Right image showing the dataset after batch correction: the cells from all the samples are evenly spread throughout the whole dataset.](../../images/scrna-casestudy-monocle/batch_correction.png "Comparison of the dataset before and after batch correction.")
-Do you see this? That’s amazing! Batch correction did a great job here! Now the dataset is nicely aligned, and the cells from all the samples are evenly spread throughout the whole dataset. It is worth mentioning that removing batch effects was done using mutual nearest neighbor alignment, a technique introduced by John Marioni's lab ({% cite Haghverdi_2018 %}) and supported by Aaron Lun's package [batchelor](https://bioconductor.org/packages/release/bioc/html/batchelor.html).
+> <question-title></question-title>
+>
+> Does your plot look the same as the one in the Figure?
+>
+> > <solution-title></solution-title>
+> >
+> > Your plot might be slightly different from the one shown in the Figure, but that is fine as long as you see analogous patterns. Some of the libraries you're using might have been updated, giving non-identical output. However, the principle behind the analysis is still the same, so you can safely follow the tutorial. Just keep your eyes open and... THINK!
+> >
+> {: .solution}
+>
+{: .question}
+
+Do you see this? It’s amazing! Batch correction did a great job here! Now the dataset is nicely aligned, and the cells from all the samples are evenly spread throughout the whole dataset. It is worth mentioning that removing batch effects was done using mutual nearest neighbour alignment, a technique introduced by John Marioni's lab ({% cite Haghverdi_2018 %}) and supported by Aaron Lun's package [batchelor](https://bioconductor.org/packages/release/bioc/html/batchelor.html). Also, due to the machine learning elements of this technique - as well as the fact that packages are updated regularly - your plots may not look identical to the ones pictured here. Nevertheless, the interpretation should be the same - the batch corrected plot should show better batch distribution than the uncorrected one.
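If you are curious how the mutual nearest neighbour (MNN) idea works, here is a deliberately tiny base-R illustration of the concept - this is not Monocle's or batchelor's implementation, just a sketch with two made-up 'batches' of two cells each:

```r
# two made-up batches: rows are cells, columns are genes (hypothetical values)
batch1 <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE)
batch2 <- matrix(c(1, 1, 20, 20), ncol = 2, byrow = TRUE)

# for each cell, the index of its nearest neighbour in the other batch
nn_12 <- apply(batch1, 1, function(cell) which.min(colSums((t(batch2) - cell)^2)))
nn_21 <- apply(batch2, 1, function(cell) which.min(colSums((t(batch1) - cell)^2)))

# cell i of batch1 and cell j of batch2 form an MNN pair when each is
# the other's nearest neighbour; such pairs anchor the batch correction
mnn_cells_batch1 <- which(nn_21[nn_12] == seq_len(nrow(batch1)))
mnn_cells_batch1 # cell 1 of batch1 (paired with cell 1 of batch2)
```

Cells that are each other's nearest neighbour across batches are assumed to be of the same type, and the differences between such pairs are then used to estimate and remove the batch effect.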
+
Now we can move to the next step and perform dimensionality reduction.
@@ -373,13 +387,14 @@ plot_cells(cds_clustered, reduction_method = "UMAP", color_cells_by = 'cluster',
## Clustering: partitions
-OK, what about partitions? They were also created during the clustering step and it’s important to check them before learning the trajectory because it is performed only within one partition, so it is essential that all the cells that we want to analyse in pseudotime belong to the same partition.
+OK, what about partitions? They were also created during the clustering step, and it’s important to check them before calculating the trajectory, because trajectories are inferred only within a single partition. It is essential that all the cells that we want to analyse in pseudotime belong to the same partition.
```r
# see the partitions
plot_cells(cds_clustered, reduction_method = "UMAP", color_cells_by = 'partition', label_cell_groups=FALSE)
```
-We can see that there are 3 partitions identified in `cds_clustered` object. Ideally, we would like to combine partitions 1 and 2 to draw a trajectory through all those cells (we can ignore cells in partition 3). Sometimes using the default values might result in multiple partitions while you only need one. Then you would have to change the q-value cutoff in `partition_qval`. The default is 0.05 and by increasing this value you can increase the span of partitions, meaning that you would get fewer partitions. When trying different values of q-value, you also have to check if the clusters didn't change. It's all about finding a balance between the value of `resolution` and `partition_qval` so that both clusters and partitions are satisfactory enough for downstream analysis. Let's try that on our dataset.
+While your plot might be slightly different due to package updates, we can see that there are 3 partitions identified in the `cds_clustered` object. Ideally, we would like to combine partitions 1 and 2 to draw a trajectory through all those cells (we can ignore the cells in the outlier partition). Sometimes using the default values might result in multiple partitions when you only need one. In that case, you would have to change the q-value cutoff in `partition_qval`. The default is 0.05, and by increasing this value you can increase the span of partitions, meaning that you would get fewer of them. When trying different q-values, you also have to check that the clusters didn't change. It's all about finding a balance between the values of `resolution` and `partition_qval` so that both clusters and partitions are satisfactory for downstream analysis. Let's try that on our dataset.
+
```r
# changing the partition q-value
cds_clustered <- cluster_cells(cds_red_dim, reduction_method = "UMAP", resolution = 0.0002, partition_qval = 1)
@@ -392,7 +407,20 @@ plot_cells(cds_clustered, reduction_method = "UMAP", color_cells_by = 'partition
# check if clusters didn't change
plot_cells(cds_clustered, reduction_method = "UMAP", color_cells_by = 'cluster', label_cell_groups=FALSE)
```
-Voila - it worked as expected! Now we have cells from partition 1 and 2 in one partition, we still have 7 clusters, so we can learn the trajectory. However, in some cases even this method might not be enough. Then, there is a last resort… assigning cells to a partition manually.
+
+> <question-title></question-title>
+>
+> Did the clusters change after changing the partition q-value?
+>
+> > <solution-title></solution-title>
+> >
+> > It might be the case that after changing the partition q-value, you notice that additional clusters have appeared. In that situation, you might either play around with the `resolution` and `partition_qval` values, go forward with the current clustering (adjusting the parameters accordingly), or check the other method of assigning cells to one partition given below.
+> >
+> {: .solution}
+>
+{: .question}
+
+Now that we have the cells of interest in one partition and still have reasonable clusters, we can learn the trajectory. However, in some cases even this method might not be enough. Then, there is a last resort… assigning cells to a partition manually.
## Additional step: assigning cells to one partition
> Additional step
@@ -443,7 +471,7 @@ plot_cells(cds_partitions_extra, reduction_method = "UMAP", color_cells_by = 'pa
There are two main approaches to assigning cell types to clusters that we’ve just identified – supervised and unsupervised, both based on gene expression in each cluster.
## Supervised approach
-Supervised approach relies on the fact that when having a reference, we know which cell types to expect and we can simply check the expression of marker genes specific to the expected cell types. Let’s then check the markers mentioned in the original paper {% cite Bacon2018 %}.
+The supervised approach relies on prior knowledge of which cell types to expect. We can simply check the expression of marker genes specific to the expected cell types. Let’s then check the markers mentioned in the original paper {% cite Bacon2018 %}.
| Marker | Cell type |
|--------------------|
@@ -468,6 +496,7 @@ plot_cells(cds_clustered, genes=c('Il2ra','Cd8b1','Cd8a','Cd4','Itm2a','Aif1','H
>
> >
> >
+> > Keep in mind that these results refer to our numbered clusters, while yours might be slightly different.
> > - `Il2ra` (DN): mostly expressed in cluster 4
> > - `Cd8b1, Cd8a` (DP middle): expressed in clusters 1, 6, and highly in cluster 2
> > - `Cd4` (DP late): average expression in clusters 1, 6, 2 and high expression in cluster 5
@@ -480,7 +509,7 @@ If you remember, this gene was found to be expressed in the previous Scanpy tuto
> {: .solution}
{: .question}
-Having identified which cluster corresponds to a specific cell type, we can finally run some code to add the annotation to our CDS object. First, we will create a new column called `cell_type` in `colData()` - this is where the information about the cells is stored (eg. batch, genotype, sex, etc) - and initialize it with the values of clusters. Then, we will get the `dplyr` package which will be used for clusters annotation.
+Having identified which cluster corresponds to a specific cell type, we can finally run some code to add the annotation to our CDS object. First, we will create a new column called `cell_type` in `colData()` - this is where the information about the cells is stored (eg. batch, genotype, sex, etc) - and initialize it with the values of clusters. Then, we will get the `dplyr` package which will be used for cluster annotation.
```r
# just to keep the objects tidy and not overwrite them so that you can come back to any point of the analysis
@@ -500,6 +529,7 @@ colData(cds_annotated)$cell_type <- dplyr::recode(colData(cds_annotated)$cell_ty
'5'='DP-L', # double positive – late middle T-cell
'6'='DP-M3', # double positive – middle T-cell (3rd cluster)
-'7'='Unknown') # no info for now, so call it ‘Unknown’
+'7'='Unknown', # no info for now, so call it ‘Unknown’
+ '8'='Unknown') # no info for now, so call it ‘Unknown’
```
```r
# check the annotation
@@ -509,7 +539,7 @@ plot_cells(cds_annotated, color_cells_by="cell_type", label_cell_groups=FALSE)
![A plot showing the identified clusters, now each coloured by the assigned cell type.](../../images/scrna-casestudy-monocle/annotated.png "Our annotated dataset.")
## Unsupervised approach
-But what if we don’t have any reference that we can use to assign our clusters? In that case, we will turn to the mentioned unsupervised approach - we will check what are the specifically expressed genes for each cluster. Then we can identify the cell types by looking up what cell types the found genes are markers for. That’s a more tedious process, but sometimes can lead to exciting and unexpected results.
+But what if we don’t have any reference that we can use to assign our clusters? In that case, we will turn to the mentioned unsupervised approach - we will check which genes are specifically expressed in each cluster. Then we can identify the cell types by looking up which cell types those genes are markers for. That’s a more tedious process, but it can sometimes lead to exciting and unexpected results.
We will use Monocle’s function `top_markers()` and store the information about specifically expressed genes for each cluster in the data frame `marker_test`.
```r
# find top marker genes in each cluster
@@ -539,14 +569,14 @@ You can group the cells by any categorical variable in `colData(cds_clustered)`.
>
{: .question}
-We can now use data in `marker_test` to rank the cells based on one of the specificity metrics and take the top gene(s) for each cluster. We will filter the expressing cells that constitute more than 10% of the cell group and we will take 2 genes in each cluster with the highest `pseudo_R2` value (you can of course modify this value and choose more genes to be selected).
+We can now use data in `marker_test` to rank the cells based on one of the specificity metrics and take the top gene(s) for each cluster. We will filter for genes expressed in more than 10% of the cells in each group, and we will take the one gene in each cluster with the highest `pseudo_R2` value (you can of course modify this number and choose more genes to be selected).
```r
# filter the ‘marker_test’ data frame
top_specific_markers <- marker_test %>%
- dplyr::filter(fraction_expressing >= 0.10) %>%
- dplyr::group_by(cell_group) %>%
- dplyr::top_n(2, pseudo_R2)
+ dplyr::filter(fraction_expressing >= 0.10) %>% # set the fraction of expressing cells
+ dplyr::group_by(cell_group) %>% # set a group to which the cells belong
+ dplyr::top_n(1, pseudo_R2) # set the number of top genes and the variable from 'marker_test' to rank by
```
```r
# store the names of the marker genes
@@ -563,12 +593,25 @@ plot_genes_by_group(cds_clustered, # our CDS object
ordering_type="maximal_on_diag") # how to order the genes / groups on the dot plot
```
-![A dot plot showing the expression of genes and fraction of cells that express found markers in each group. On the diagonal there are two genes corresponding to each cluster with the highest pseudo_R2 score in their group.](../../images/scrna-casestudy-monocle/gene_group.png "A plot of the expression and fraction of cells that express found markers in each group.")
+> <tip-title>Unexpectedly too many genes on y-axis?</tip-title>
+>
+> If you notice that on your dot plot any cluster has more genes than you specified (here we set one gene per cluster with `top_n(1, pseudo_R2)`), go back to the code and create a new cell, right after assigning the `top_specific_markers` object and just before `top_marker_names`. In the new cell, paste and execute the following:
+>
+> ```r
+> top_specific_markers <- top_specific_markers %>%
+> dplyr::distinct(pseudo_R2, .keep_all=TRUE) # select only one row if there are multiple rows with the same value in 'pseudo_R2' column
+> ```
+> Then you can execute `top_marker_names` again and see how the plot looks now. Better, right?
+> This problem may arise when several genes in one cluster have the same value of the chosen variable from `marker_test` (in our case, `pseudo_R2`). It is most likely to happen in small, rather insignificant clusters.
+{: .tip}
+
+
+![A dot plot showing the expression of genes and fraction of cells that express found markers in each group. On the diagonal there are two genes corresponding to each cluster with the highest pseudo_R2 score in their group.](../../images/scrna-casestudy-monocle/gene_group.png "A plot of the expression and fraction of cells that express found markers in each group. Here, the top two genes from each cluster were plotted on the diagonal based on the highest pseudo_R2 score. Shown is the result of changing the value from 1 to 2 in 'top_n(1, pseudo_R2)' in the code above.")
Look at this – we have identified some more marker genes specific to each cluster! However, sometimes it happens that the found genes are not as specific as one would expect, and they appear across the whole sample. Therefore, it is a good idea to plot all those marker genes and check how they appear in the bigger picture.
```r
-# plot all the identified genes to check their expression
+# plot the identified genes to check their expression
plot_cells(cds_clustered, genes=c('Pcdhgc4','Pcdhga5','Gm5559','Gm10359','Ccr9','Cd8b1','Plac8', 'Il2ra', 'Cd52', 'Tmsb10', 'Mki67', 'Hmgb2', 'Pclaf', 'Npm1'), reduction_method = "UMAP")
```
@@ -644,7 +687,8 @@ plot_cells(cds_trajectory,
We have to tell Monocle where to start ordering the cells, i.e. where we expect the analysed biological process to begin. Thanks to our biological knowledge, we know that the beginning of the trajectory should be at the DN cluster.
There are a couple of ways to specify the root cells:
-1. Use `root_pr_nodes` argument in `order_cells()` function.
+## Option 1: Root nodes
+Here, you will use the `root_pr_nodes` argument of the `order_cells()` function.
To find the names of the principal points, you have to plot the learned trajectory again, specifying `label_principal_points = TRUE`
```r
@@ -690,7 +734,8 @@ DN_node_id # check the node found
cds_order_1_helper <- order_cells(cds_trajectory, root_pr_nodes = DN_node_id)
```
-2. Use `root_cells` argument in `order_cells()` function.
+## Option 2: Root cells
+Here, you will use the `root_cells` argument of the `order_cells()` function.
Specify a vector of starting cell IDs. You can provide only one cell as well as all cells of a given type.
@@ -720,6 +765,8 @@ cds_order_2 <- order_cells(cds_trajectory, root_cells = DN_cells)
>
{: .tip}
+# Plotting in pseudotime
+
You can use any `cds_order` object for the downstream analysis. Let’s pick one and assign it to an object with a shorter and more general name.
```r
cds_order <- cds_order_1_helper
@@ -852,6 +899,19 @@ pheatmap::pheatmap(agg_mat, cluster_rows=TRUE, cluster_cols=TRUE,
![A heatmap showing modules of co-regulated genes across the clusters. Modules listed vertically while clusters horizontally. Some modules are highly specific to certain partitions of cells, while others are shared across multiple partitions.](../../images/scrna-casestudy-monocle/heatmap.png "Heatmap showing modules of co-regulated genes across the clusters.")
You can also visualise the modules using `plot_cells()` function. We've chosen some modules to see how the differences on a heatmap correlate with the expression shown on our starting plot.
+
+> <question-title></question-title>
+>
+> Which modules to plot?
+>
+> > <solution-title></solution-title>
+> >
+> > This is totally up to you! It might be the case that you got a different numbering of modules, in which case using the numbers specified in the code below won't make much sense. Just look at your heatmap, compare the differences between modules and think about which ones would be the most interesting to visualise.
+> >
+> {: .solution}
+>
+{: .question}
+
```r
# see the chosen modules across the whole sample
plot_cells(cds_order,
@@ -860,7 +920,7 @@ plot_cells(cds_order,
color_cells_by="cluster",
show_trajectory_graph=FALSE)
```
-![Plots showing expression of the genes belonging to specified modules across whole sample. Module 19 highly expressed in cluster 2 but also low expression thtoughout the sample, module 38 low expression but throughout the sample, module 41 highly expressed in cluster 5 only, module 42 almost no expression, except cluster 6.](../../images/scrna-casestudy-monocle/modules_plot.png "Plots of expression of the genes belonging to specified modules across whole sample.")
+![Plots showing expression of the genes belonging to specified modules across whole sample.](../../images/scrna-casestudy-monocle/modules_plot.png "Plots of expression of the genes belonging to specified modules across whole sample.")
With the visualisation methods above, you can now come back to the generated data frame `gene_module_df`, filter genes that belong to the module of interest and check their functions to get some more evidence for the correct biological interpretation.
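For instance, pulling the genes of one module out of `gene_module_df` could look like the sketch below, shown on a hypothetical toy data frame with the same `id` and `module` columns (the gene names and module numbers here are made up; yours will differ):

```r
# toy stand-in for 'gene_module_df' as produced by find_gene_modules()
gene_module_df <- data.frame(
  id     = c("Cd8a", "Cd4", "Mki67", "Hmgb2", "Il2ra"),
  module = c(41, 41, 42, 42, 19)
)

# extract the genes assigned to module 41
module_41_genes <- gene_module_df$id[gene_module_df$module == 41]
module_41_genes # "Cd8a" "Cd4"
```

You can then look these genes up in a resource such as Ensembl to interpret the module biologically.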
@@ -878,6 +938,73 @@ cds_3d <- reduce_dimension(cds_order, preprocess_method = 'Aligned', max_compone
plot_cells_3d(cds_3d, color_cells_by="cell_type")
```
+# Export your data, figures, and notebook
+
+Don’t forget to save and export your data! First, we will get Jupyter to see those as files.
+
+## Export plots
+If you want to export your plot, you have to make sure that you assigned it to an object. For example, if you want to save the plot of cells in pseudotime, simply assign the function you used to generate this plot to an object. Here we call this object `plot_pseudotime`, like so:
+
+```r
+plot_pseudotime <- plot_cells(cds_order,
+ color_cells_by = "pseudotime",
+ label_cell_groups=FALSE,
+ label_leaves=FALSE,
+ label_branch_points=FALSE)
+```
+
+Then, if you want to save the plot as PDF:
+
+```r
+pdf("plot_pseudotime.pdf") # open the graphical device and specify the directory and the name of the output pdf file
+plot_pseudotime # specify the object that your plot is assigned to
+dev.off() # close the graphical device
+```
+
+The procedure is very similar if you want to export the file as PNG (or JPEG – just replace `png` with `jpeg`):
+```r
+png("plot_pseudotime.png", # open the graphical device and specify the directory and the name of the output png file
+    width=600, height=400) # optionally you can specify the width and height of the final plot
+plot_pseudotime # specify the object that your plot is assigned to
+dev.off() # close the graphical device
+```
+
+However, it often happens that the quality of the exported PNG and JPEG files is not perfect. For best results, we recommend exporting to SVG:
+```r
+svg("plot_pseudotime.svg") # open the graphical device and specify the directory and the name of the output svg file
+plot_pseudotime # specify the object that your plot is assigned to
+dev.off() # close the graphical device
+```
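Since the plots returned by Monocle's plotting functions are ggplot objects, another option (assuming the `ggplot2` package is available in your environment) is `ggsave()`, which infers the output format from the file extension. A sketch on a toy plot rather than the real `plot_pseudotime`:

```r
library(ggplot2)

# toy plot standing in for 'plot_pseudotime'
toy_plot <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) +
  geom_point()

# ggsave() opens and closes the graphical device for you;
# the format is taken from the file extension
ggsave("toy_plot.png", plot = toy_plot, width = 6, height = 4)
```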
+
+You can do the same with any plot that you want to save! You will find the saved figures in the left panel of your JupyterLab. You can right-click on them and download directly onto your computer. You can also push them into your Galaxy history. To do so, you have to change Kernel to Python3 (either click on `Kernel` -> `Change Kernel...` in the upper left corner of your JupyterLab or click on the displayed current kernel in the upper right corner and change it).
+![Figure showing the JupyterLab interface with an arrow pointing to the left corner, showing the option `Kernel` -> `Change Kernel...` and another arrow pointing to the right corner, showing the icon of the current kernel. The pop-up window asks which kernel should be chosen instead.](../../images/scrna-casestudy-monocle/switch_kernel.jpg "Two ways of switching kernel.")
+
+Check in the upper right corner that the selected kernel is Python3, and run the following:
+```python
+put("marker_file.txt")
+put("plot_pseudotime.pdf")
+put("plot_pseudotime.png")
+```
+
+In this way you can push all the files you've saved into your Galaxy history. You can do the same with this notebook itself. The cell below will only work if you haven’t changed the name of the notebook; if you renamed it, simply type its new name in the parentheses.
+```python
+put("single-cell-scrna-case_monocle3-rstudio.ipynb")
+```
+
+Now you can check your Galaxy history to make sure all of your files have made it back.
+
+
+# After Jupyter
+
+Congratulations! You've made it through Jupyter!
+
+> <hands-on-title>Closing JupyterLab</hands-on-title>
+> 1. Click **User**: **Active Interactive Tools**
+> 2. Tick the box of your Jupyter Interactive Tool, and click **Stop**
+{: .hands_on}
+
+If you want to run this notebook again, or share it with others, it now exists in your history. You can use this 'finished' version just the same way as you downloaded the directions file and uploaded it into the Jupyter environment.
+
# Conclusion
diff --git a/topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md b/topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md
index 82d53c7e978978..5585fbdd9dada2 100644
--- a/topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md
+++ b/topics/single-cell/tutorials/scrna-case_monocle3-trajectories/tutorial.md
@@ -1,7 +1,7 @@
---
layout: tutorial_hands_on
-title: 'Trajectory Analysis using Monocle3'
+title: 'Inferring trajectories using Monocle3'
subtopic: single-cell-CS
priority: 5
zenodo_link: 'https://zenodo.org/record/7078524'