You will learn how to run the workflow with your data.
Prerequisites:
- Create and activate the covSampler conda environment.
- Prepare your sequence data and metadata.
Before we start running the workflow, let's get familiar with the workflow configuration profile.
You can configure the workflow by specifying values in the workflow configuration profile.
All profiles are in the my_profiles/ directory. Each project has its own profile.
For the example project, the corresponding profile is in my_profiles/example_project/. You can use it to better understand the workflow configuration profile.
The contents of my_profiles/example_project/config.yaml:
configfile: my_profiles/example_project/parameters.yaml
cores: 2
printshellcmds: True
configfile
- type: str
- description: Configuration file of this project workflow
- format: my_profiles/<project_name>/parameters.yaml
cores
- type: int
- description: Maximum number of cores the workflow may use
printshellcmds
- type: bool
- description: Print the shell commands that are executed
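The three keys above can be sanity-checked before a long run. The following is a minimal sketch, not part of covSampler: `check_profile` is a hypothetical helper, and it assumes the profile has already been loaded into a plain Python dict (for example with a YAML library of your choice).

```python
import re

# Expected Python types for the three profile keys described above.
EXPECTED_TYPES = {"configfile": str, "cores": int, "printshellcmds": bool}
# Path format taken from the "format" line: my_profiles/<project_name>/parameters.yaml
CONFIGFILE_PATTERN = re.compile(r"my_profiles/[^/]+/parameters\.yaml")


def check_profile(profile):
    """Return a list of problems; an empty list means the profile looks fine."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in profile:
            problems.append(f"missing key: {key}")
            continue
        value = profile[key]
        # bool is a subclass of int in Python, so True/False must not pass as a core count.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            problems.append(
                f"{key} should be {expected.__name__}, got {type(value).__name__}"
            )
    configfile = profile.get("configfile")
    if isinstance(configfile, str) and not CONFIGFILE_PATTERN.fullmatch(configfile):
        problems.append(
            "configfile should look like my_profiles/<project_name>/parameters.yaml"
        )
    return problems


# The example profile from this section, written as a dict.
example_profile = {
    "configfile": "my_profiles/example_project/parameters.yaml",
    "cores": 2,
    "printshellcmds": True,
}
print(check_profile(example_profile))
```

An empty list means all three keys are present, have the expected types, and `configfile` follows the documented path format.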
The contents of my_profiles/example_project/parameters.yaml:
data_directory:
  data/example_project
sequence_min_length:
  27000
covizu_tree:
  False
genbank_accession:
  False
start_subsampling:
  False
subsampling:
  output_path:
    results/example_project/subsamples.txt
  description:
    Example subsampling
  location:
    Global
  date_start:
    2020-01-01
  date_end:
    2022-01-01
  variants:
    - Lineage/WHO/Alpha
    - Lineage/Nextstrain_clade/20I (Alpha, V1)
    - Lineage/Pango_lineage/B.1.1.7
    - Site/Nucleotide/A23403G
    - Site/Amino_acid/S:N501Y
    - Site/Amino_acid/S:H69-
    - Site/Amino_acid/ORF7a:Q62*
  size:
    500
  characteristic:
    Representative
  seed:
    2019
  temporally_even:
    False
data_directory
- type: str
- description: Project data directory
- format: data/<project_name>
sequence_min_length
- type: int
- description: Sequences shorter than this length are removed
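To make the boundary explicit: a sequence whose length equals the threshold is kept, and only strictly shorter ones are dropped. The sketch below illustrates that rule on toy data. `filter_by_min_length` is a hypothetical helper, and it assumes length means the raw number of characters in the sequence string; how covSampler itself counts gaps or ambiguous bases is not described here.

```python
def filter_by_min_length(sequences, min_length):
    """Keep only sequences with length >= min_length.

    `sequences` maps a sequence name to its sequence string.
    """
    return {name: seq for name, seq in sequences.items() if len(seq) >= min_length}


# Toy data with a toy threshold of 5 instead of 27000.
toy_sequences = {
    "too_short": "ACGT",      # length 4, removed
    "exactly_min": "ACGTA",   # length 5, kept
    "long_enough": "ACGTACG", # length 7, kept
}
kept = filter_by_min_length(toy_sequences, min_length=5)
print(sorted(kept))
```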
covizu_tree
- type: bool
- description: Remove sequences whose corresponding Pango lineages are not in the provided CoVizu time-scaled tree. This parameter is mainly designed for the workflow of the covSampler web application. In general, keep it set to False.
genbank_accession
- type: bool
- description: Return GenBank accessions. This parameter is designed for the workflow of the covSampler web application. Please keep it set to False.
start_subsampling
- type: bool
- description: Start subsampling
- note: This pipeline consists of two parts: 1) data processing and 2) subsampling.
  1) Before subsampling, you must process the data. In this stage, set start_subsampling to False.
  2) After data processing, you can perform subsampling. In this stage, set start_subsampling to True.
subsampling
output_path
- type: str
- description: Output path of subsamples
- format: results/<project_name>/<subsamples_file_name>.txt
description
- type: False or str
- description: Description recorded in the output file
- examples:
  False
  Global subsampling 2022-01-01
  My subsampling project
location
- type: str
- description: Location of subsamples
- format: <Continent>/(<Country>)/(<Division>)
- examples:
  Global
  Europe
  Europe/United Kingdom
  Europe/United Kingdom/England
- note: For each project, after the data processing part has run, all available values are recorded in data/<project_name>/args/locations.txt.
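The location format can be read as up to three slash-separated levels, with `Global` as a special value meaning no geographic restriction. A minimal sketch of that reading follows. `parse_location` is a hypothetical helper, not a covSampler function, and it only checks the shape of the string, not whether the value actually appears in locations.txt.

```python
LEVELS = ("continent", "country", "division")


def parse_location(location):
    """Split a location string into continent, country and division.

    Missing levels are returned as None. "Global" yields None for all levels.
    """
    if location == "Global":
        return dict.fromkeys(LEVELS)
    parts = location.split("/")
    if len(parts) > len(LEVELS) or any(not part.strip() for part in parts):
        raise ValueError(f"invalid location: {location!r}")
    # Pad with None so the result always has all three keys.
    padded = parts + [None] * (len(LEVELS) - len(parts))
    return dict(zip(LEVELS, padded))


print(parse_location("Europe/United Kingdom/England"))
print(parse_location("Europe"))
print(parse_location("Global"))
```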
date_start and date_end
- type: str
- description: Date range of subsamples
- format: YYYY-MM-DD
- examples:
  2020-01-01
  2022-01-01
- note: For each project, after the data processing part has run, all available values are recorded in data/<project_name>/args/dates.txt.
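A date range is only meaningful if both values follow YYYY-MM-DD and the start is not after the end. The following sketch checks both conditions with the standard library. `parse_date_range` is a hypothetical helper, and rejecting a start later than the end is an assumption of this sketch rather than documented covSampler behaviour.

```python
import re
from datetime import date, datetime

# Strict YYYY-MM-DD shape; strptime alone would also accept "2020-1-1".
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date(text):
    """Parse a YYYY-MM-DD string into a date, raising ValueError otherwise."""
    if not DATE_PATTERN.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_date_range(date_start, date_end):
    """Return (start, end) as dates and check that start <= end."""
    start, end = parse_date(date_start), parse_date(date_end)
    if start > end:
        raise ValueError(f"date_start {start} is after date_end {end}")
    return start, end


start, end = parse_date_range("2020-01-01", "2022-01-01")
print(start, end)
```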
variants
- type: False or str
- description: Variant of subsamples. There are three submodules in the variants module: Nonspecific, Lineage, and Site. Nonspecific means no restrictions on the lineage or mutations of subsamples. In the Lineage submodule, WHO clade, Pango lineage, and Nextstrain clade are available. In the Site submodule, Nucleotide and Amino acid are available.
- format:
  - Nonspecific: False
  - Lineage:
    - WHO: - Lineage/WHO/<WHO clade>
    - Pango lineage: - Lineage/Pango_lineage/<Pango lineage>
    - Nextstrain clade: - Lineage/Nextstrain_clade/<Nextstrain clade>
  - Site:
    - Nucleotide: - Site/Nucleotide/<ref><site><mut>
    - Amino acid: - Site/Amino_acid/<gene>:<ref><site><mut> ("-" = deletion, "*" = stop)
- examples:
  False
  - Lineage/WHO/Alpha
  - Lineage/Pango_lineage/B.1.1.7
  - Lineage/Nextstrain_clade/20I (Alpha, V1)
  - Site/Nucleotide/A23403G
  - Site/Amino_acid/S:N501Y
  - Site/Amino_acid/S:H69-
  - Site/Amino_acid/ORF7a:Q62*
- note:
  - For each project, after the data processing part has run, all available values are recorded in data/<project_name>/args/who_variants.txt, data/<project_name>/args/pango_lineages.txt, data/<project_name>/args/nextstrain_clades.txt, data/<project_name>/args/nucleotide.txt, and data/<project_name>/args/amino_acid.txt.
  - You can use just one query or put multiple queries together:
    # --- one query example ---
    variants:
      - Lineage/WHO/Alpha
    # --- multiple queries example ---
    variants:
      - Lineage/WHO/Alpha
      - Lineage/Nextstrain_clade/20I (Alpha, V1)
      - Lineage/Pango_lineage/B.1.1.7
      - Site/Nucleotide/A23403G
      - Site/Amino_acid/S:N501Y
      - Site/Amino_acid/S:H69-
      - Site/Amino_acid/ORF7a:Q62*
  - The SARS-CoV-2 genome map may be helpful when specifying an amino acid mutation.
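The query formats above are regular enough to check mechanically before starting a run. The sketch below parses each query into its parts. `parse_variant_query` and `parse_variants` are hypothetical helpers, and the regular expressions are deliberately loose assumptions (any uppercase letter for a base or residue, "-" allowed as the mutated nucleotide); they check the shape of a query, not whether the value is listed in the args/ files.

```python
import re

LINEAGE_KINDS = {"WHO", "Pango_lineage", "Nextstrain_clade"}
# <ref><site><mut>, e.g. A23403G
NUCLEOTIDE = re.compile(r"([A-Z])(\d+)([A-Z-])")
# <gene>:<ref><site><mut>, e.g. S:N501Y, S:H69- (deletion), ORF7a:Q62* (stop)
AMINO_ACID = re.compile(r"([^:]+):([A-Z*])(\d+)([A-Z*-])")


def parse_variant_query(query):
    """Parse one variant query string into a dict, or raise ValueError."""
    # Split at most twice so that values containing "/" stay intact.
    parts = query.split("/", 2)
    if len(parts) == 3 and parts[2]:
        module, kind, value = parts
        if module == "Lineage" and kind in LINEAGE_KINDS:
            return {"module": module, "kind": kind, "value": value}
        if module == "Site" and kind == "Nucleotide":
            match = NUCLEOTIDE.fullmatch(value)
            if match:
                ref, site, mut = match.groups()
                return {"module": module, "kind": kind,
                        "ref": ref, "site": int(site), "mut": mut}
        if module == "Site" and kind == "Amino_acid":
            match = AMINO_ACID.fullmatch(value)
            if match:
                gene, ref, site, mut = match.groups()
                return {"module": module, "kind": kind, "gene": gene,
                        "ref": ref, "site": int(site), "mut": mut}
    raise ValueError(f"unrecognized variant query: {query!r}")


def parse_variants(variants):
    """Accept False (Nonspecific) or a list of query strings."""
    if variants is False:
        return []
    return [parse_variant_query(query) for query in variants]


example_variants = [
    "Lineage/WHO/Alpha",
    "Lineage/Nextstrain_clade/20I (Alpha, V1)",
    "Lineage/Pango_lineage/B.1.1.7",
    "Site/Nucleotide/A23403G",
    "Site/Amino_acid/S:N501Y",
    "Site/Amino_acid/S:H69-",
    "Site/Amino_acid/ORF7a:Q62*",
]
parsed = parse_variants(example_variants)
print(parsed[-1])
```

A failed parse raises `ValueError` naming the offending query, which is cheaper to discover than a failed workflow run.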
size
- type: int
- description: Number of subsamples
characteristic
- type: str
- description: Two subsampling characteristics, Comprehensive and Representative, are available. Choose one according to your application scenario.
  - Comprehensive:
    1) Geographic distribution: Evenly distributed across continents.
    2) Temporal distribution: Similar to the original data set.
    3) Genetic variation: Higher genetic diversity (compared to representative subsampling).
    4) Application scenario:
       a. When geographic bias (at the continent level) is unacceptable.
       b. When exploring the phylogeny of specific viruses in a rich and diverse genetic background.
       c. When exploring the phylogeny of specific viruses with low similarity to widely prevalent variants in the original data set.
  - Representative:
    1) Geographic distribution: Consistent with the original data set.
    2) Temporal distribution: Similar to the original data set.
    3) Genetic variation: Similar to the original data set.
    4) Application scenario: When the subsamples are expected to approximately reflect the spatiotemporal distribution and key phylogenetic relationships of genomes in the original data set.
  In addition, you can perform subsampling multiple times with different parameters (characteristic and/or range) to get the final subsamples.
- format: Representative or Comprehensive
seed
- type: int
- description: Seed number for pseudorandom subsampling
- note: covSampler (v2.0.0) applies a pseudorandom strategy for subsampling from each virus cluster. If covSampler is reinitialized with the same seed, it will produce the same results.
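The reproducibility property described in the note is the general behaviour of a seeded pseudorandom generator. The sketch below illustrates it with Python's `random` module on toy clusters. It is not covSampler's selection logic: `pick_one_per_cluster`, the cluster names, and the choice of exactly one pick per cluster are all assumptions made for illustration.

```python
import random


def pick_one_per_cluster(clusters, seed):
    """Pick one member from each cluster using a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    # Iterate in sorted order so the result does not depend on dict ordering.
    return [rng.choice(members) for _, members in sorted(clusters.items())]


toy_clusters = {
    "cluster_a": ["seq1", "seq2", "seq3"],
    "cluster_b": ["seq4", "seq5"],
    "cluster_c": ["seq6", "seq7", "seq8", "seq9"],
}

first_run = pick_one_per_cluster(toy_clusters, seed=2019)
second_run = pick_one_per_cluster(toy_clusters, seed=2019)
print(first_run == second_run)  # same seed, same picks
```

Changing the seed may change the picks, which is useful when you want several independent subsamples from the same processed data.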
temporally_even
- type: bool
- description: The number of subsamples is the same for each month
- note:
  - The subsamples within each month still fit the selected Comprehensive or Representative distribution characteristic.
  - The temporally even option takes a few more minutes.
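To get a feel for what "the same for each month" means for a given size and date range, the sketch below splits a total evenly over the calendar months touched by the range. `months_in_range` and `even_monthly_quota` are hypothetical helpers. How covSampler handles a remainder, partial months, or months with too few sequences is not described here, so the remainder rule in this sketch (earlier months get one extra) is purely an assumption.

```python
from datetime import date


def months_in_range(start, end):
    """List every calendar month from start to end (inclusive) as 'YYYY-MM'."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return months


def even_monthly_quota(size, start, end):
    """Split `size` as evenly as possible over the months in the range."""
    months = months_in_range(start, end)
    if not months:
        raise ValueError("date range is empty")
    base, extra = divmod(size, len(months))
    # Assumption: the first `extra` months receive one additional subsample.
    return {m: base + (1 if i < extra else 0) for i, m in enumerate(months)}


# Values from the example parameters: size 500, 2020-01-01 to 2022-01-01.
quota = even_monthly_quota(500, date(2020, 1, 1), date(2022, 1, 1))
print(len(quota), quota["2020-01"], sum(quota.values()))
```

With the example parameters the range touches 25 calendar months, so an even split of 500 gives 20 per month.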
Before subsampling, you should process the data for subsampling.
- Create your project (here named tutorial_project) profile folder in the my_profiles/ directory.
- Copy the config.yaml and parameters.yaml files from my_profiles/example_project to my_profiles/tutorial_project. Now, the my_profiles/ directory structure should look like this:
  my_profiles
  ├── README.md
  ├── example_project
  │   ├── config.yaml
  │   └── parameters.yaml
  └── tutorial_project
      ├── config.yaml
      └── parameters.yaml
- Change parameters in my_profiles/tutorial_project/config.yaml and my_profiles/tutorial_project/parameters.yaml.
  my_profiles/tutorial_project/config.yaml:
    configfile: change the path to your project's parameters.yaml (my_profiles/tutorial_project/parameters.yaml)
  my_profiles/tutorial_project/parameters.yaml:
    data_directory: change the path to your project's data directory (data/tutorial_project)
    start_subsampling: False
    subsampling: you can change these parameters after data processing
  Other parameters not mentioned can be adjusted as required.
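The three steps above (create the folder, copy the two files, point the paths at the new project) can also be scripted. The following is a minimal sketch, not a covSampler tool: `create_profile` is a hypothetical helper that copies the example profile and replaces every occurrence of the template project name with the new one. That plain text replacement also rewrites the output path under results/, so review the copied files afterwards. The demonstration runs in a temporary directory with shortened file contents so that it does not touch a real checkout.

```python
import tempfile
from pathlib import Path

PROFILE_FILES = ("config.yaml", "parameters.yaml")


def create_profile(profiles_dir, project, template="example_project"):
    """Copy the template profile to a new project and rewrite its paths."""
    src = Path(profiles_dir) / template
    dst = Path(profiles_dir) / project
    dst.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing profile
    for name in PROFILE_FILES:
        text = (src / name).read_text(encoding="utf-8")
        # Replace the template project name wherever it appears in a path.
        (dst / name).write_text(text.replace(template, project), encoding="utf-8")
    return dst


# Demonstration in a throwaway directory with abbreviated example files.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "my_profiles" / "example_project"
    example.mkdir(parents=True)
    (example / "config.yaml").write_text(
        "configfile: my_profiles/example_project/parameters.yaml\n"
        "cores: 2\n"
        "printshellcmds: True\n",
        encoding="utf-8",
    )
    (example / "parameters.yaml").write_text(
        "data_directory:\n  data/example_project\n"
        "start_subsampling:\n  False\n",
        encoding="utf-8",
    )
    new_profile = create_profile(Path(tmp) / "my_profiles", "tutorial_project")
    new_config = (new_profile / "config.yaml").read_text(encoding="utf-8")
    new_parameters = (new_profile / "parameters.yaml").read_text(encoding="utf-8")

print(new_config)
print(new_parameters)
```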
Change directory to the covSampler directory if you are not there.
cd covSampler
Run the command for data processing.
snakemake --profile my_profiles/tutorial_project
The workflow can take a while to run. With millions of sequences, it may run for several days.
After data processing, you can change the parameters for subsampling.
my_profiles/tutorial_project/parameters.yaml:
start_subsampling: True
subsampling: you can change these parameters as required
Change directory to the covSampler directory if you are not there.
cd covSampler
Run the command for subsampling. (This is the second run of the same command; the first run was for data processing.)
snakemake --profile my_profiles/tutorial_project
When the run finishes, the subsamples are written to the output file path you specified.
If you want to subsample the same data set multiple times, you don't need to re-process the data.
Just change the subsampling parameters in my_profiles/tutorial_project/parameters.yaml. Don't forget to change the output file path.
Then, change directory to the covSampler directory and run the command again to get your new subsamples.
snakemake --profile my_profiles/tutorial_project