diff --git a/docs/changelog.md b/docs/changelog.md index d4747ec..66e77b0 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -1,5 +1,9 @@ # Changelog +## [0.12.4] -- 2023-08-01 +- Fixed SRA convert +- Added how to convert SRA + ## [0.12.3] -- 2023-06-21 - Fixed preserving order of project keys (#119) diff --git a/docs/sra_convert.md b/docs/sra_convert.md index 2ae7a14..14f0725 100644 --- a/docs/sra_convert.md +++ b/docs/sra_convert.md @@ -15,4 +15,4 @@ This effectively makes it easier to interact with *project-level* management of ## Tutorial -See the [tutorial](raw-data-downloading.md) for an example of how to use `sraconvert`. \ No newline at end of file +See the [how-to](how_to_convert_fastq_from_sra.md) for an example of how to use `sraconvert`. \ No newline at end of file diff --git a/docs_jupyter/build/processed-data-downloading.md b/docs_jupyter/build/processed-data-downloading.md index cd080ee..b851a61 100644 --- a/docs_jupyter/build/processed-data-downloading.md +++ b/docs_jupyter/build/processed-data-downloading.md @@ -24,6 +24,11 @@ Calling geofetch will do 4 tasks: Complete details about geofetch outputs is cataloged in the [metadata outputs reference](metadata_output.md). +from IPython.core.display import SVG +SVG(filename='logo.svg') + +![arguments_outputs.svg](attachment:arguments_outputs.svg) + ## Download the data First, create the metadata for processed data (by adding --processed and --just-metadata): diff --git a/docs_jupyter/build/raw-data-downloading.md b/docs_jupyter/build/raw-data-downloading.md index f4b6485..54539e3 100644 --- a/docs_jupyter/build/raw-data-downloading.md +++ b/docs_jupyter/build/raw-data-downloading.md @@ -382,313 +382,6 @@ Writing: /home/bnt4me/Virginia/repos/geof2/geofetch/docs_jupyter/red_algae/GSE67 ``` -## Convert to fastq format - -Now the `.sra` files have been downloaded. The project that was automatically created by GEO contained an amendment for sra file conversion. 
This project expects you to have an environment variable called `SRARAW` that points to the location where `prefetch` stores your `.sra` files. We also should define a `$SRAFQ` variable to point to where we ant the fastq files stored. In this command below, we set these on the fly for this command, but you can also just use globals. - -We'll use `-d` first to do a dry run: - - -```bash -SRARAW=${HOME}/ncbi/public/sra/ SRAFQ=red_algae/fastq \ - looper run red_algae/red_algae_config.yaml -a sra_convert -p local -d -``` - -```.output -Looper version: 1.2.0-dev -Command: run -Using amendments: sra_convert -Activating compute package 'local' -## [1 of 4] sample: Cm_BlueLight_Rep1; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub -Dry run, not submitted -## [2 of 4] sample: Cm_BlueLight_Rep2; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub -Dry run, not submitted -## [3 of 4] sample: Cm_Darkness_Rep1; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub -Dry run, not submitted -## [4 of 4] sample: Cm_Darkness_Rep2; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub -Dry run, not submitted - -Looper finished -Samples valid for job generation: 4 of 4 -Commands submitted: 4 of 4 -Jobs submitted: 4 -Dry run. No jobs were actually submitted. 
- -``` - -And now the real thing: - - -```bash -SRARAW=${HOME}/ncbi/public/sra/ SRAFQ=red_algae/fastq \ - looper run red_algae/red_algae_config.yaml -a sra_convert -p local \ - --command-extra=--keep-sra -``` - -```.output -Looper version: 1.2.0-dev -Command: run -Using amendments: sra_convert -Activating compute package 'local' -## [1 of 4] sample: Cm_BlueLight_Rep1; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub -Compute node: zither -Start time: 2020-05-21 17:40:56 -Using outfolder: red_algae/results_pipeline/SRX969073 -### Pipeline run code and environment: - -* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930183.sra --sample-name SRX969073 -O red_algae/results_pipeline --keep-sra` -* Compute host: zither -* Working dir: /home/nsheff/code/geofetch/docs_jupyter -* Outfolder: red_algae/results_pipeline/SRX969073/ -* Pipeline started at: (05-21 17:40:57) elapsed: 0.0 _TIME_ - -### Version log: - -* Python version: 3.7.5 -* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper` -* Pypiper version: 0.12.1 -* Pipeline dir: `/home/nsheff/.local/bin` -* Pipeline version: None - -### Arguments passed to pipeline: - -* `bamfolder`: `` -* `config_file`: `sraconvert.yaml` -* `format`: `fastq` -* `fqfolder`: `red_algae/fastq` -* `keep_sra`: `True` -* `logdev`: `False` -* `mode`: `convert` -* `output_parent`: `red_algae/results_pipeline` -* `recover`: `False` -* `sample_name`: `['SRX969073']` -* `silent`: `False` -* `srafolder`: `/home/nsheff/ncbi/public/sra/` -* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930183.sra']` -* `verbosity`: `None` - ----------------------------------------- - -Processing 1 of 1 files: SRR1930183 -Target to produce: `red_algae/fastq/SRR1930183_1.fastq.gz` - -> `fastq-dump /home/nsheff/ncbi/public/sra//SRR1930183.sra 
--split-files --gzip -O red_algae/fastq` (9436) -
-Read 1068319 spots for /home/nsheff/ncbi/public/sra//SRR1930183.sra -Written 1068319 spots for /home/nsheff/ncbi/public/sra//SRR1930183.sra --Command completed. Elapsed time: 0:00:38. Running peak memory: 0.067GB. - PID: 9436; Command: fastq-dump; Return code: 0; Memory used: 0.067GB - -Already completed files: [] - -### Pipeline completed. Epilogue -* Elapsed time (this run): 0:00:38 -* Total elapsed time (all runs): 0:00:38 -* Peak memory (this run): 0.0666 GB -* Pipeline completed time: 2020-05-21 17:41:35 -## [2 of 4] sample: Cm_BlueLight_Rep2; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub -Compute node: zither -Start time: 2020-05-21 17:41:36 -Using outfolder: red_algae/results_pipeline/SRX969074 -### Pipeline run code and environment: - -* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930184.sra --sample-name SRX969074 -O red_algae/results_pipeline --keep-sra` -* Compute host: zither -* Working dir: /home/nsheff/code/geofetch/docs_jupyter -* Outfolder: red_algae/results_pipeline/SRX969074/ -* Pipeline started at: (05-21 17:41:36) elapsed: 0.0 _TIME_ - -### Version log: - -* Python version: 3.7.5 -* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper` -* Pypiper version: 0.12.1 -* Pipeline dir: `/home/nsheff/.local/bin` -* Pipeline version: None - -### Arguments passed to pipeline: - -* `bamfolder`: `` -* `config_file`: `sraconvert.yaml` -* `format`: `fastq` -* `fqfolder`: `red_algae/fastq` -* `keep_sra`: `True` -* `logdev`: `False` -* `mode`: `convert` -* `output_parent`: `red_algae/results_pipeline` -* `recover`: `False` -* `sample_name`: `['SRX969074']` -* `silent`: `False` -* `srafolder`: `/home/nsheff/ncbi/public/sra/` -* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930184.sra']` -* `verbosity`: `None` - 
----------------------------------------- - -Processing 1 of 1 files: SRR1930184 -Target exists: `red_algae/fastq/SRR1930184_1.fastq.gz` -Already completed files: [] - -### Pipeline completed. Epilogue -* Elapsed time (this run): 0:00:00 -* Total elapsed time (all runs): 0:00:00 -* Peak memory (this run): 0 GB -* Pipeline completed time: 2020-05-21 17:41:36 -## [3 of 4] sample: Cm_Darkness_Rep1; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub -Compute node: zither -Start time: 2020-05-21 17:41:36 -Using outfolder: red_algae/results_pipeline/SRX969075 -### Pipeline run code and environment: - -* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930185.sra --sample-name SRX969075 -O red_algae/results_pipeline --keep-sra` -* Compute host: zither -* Working dir: /home/nsheff/code/geofetch/docs_jupyter -* Outfolder: red_algae/results_pipeline/SRX969075/ -* Pipeline started at: (05-21 17:41:36) elapsed: 0.0 _TIME_ - -### Version log: - -* Python version: 3.7.5 -* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper` -* Pypiper version: 0.12.1 -* Pipeline dir: `/home/nsheff/.local/bin` -* Pipeline version: None - -### Arguments passed to pipeline: - -* `bamfolder`: `` -* `config_file`: `sraconvert.yaml` -* `format`: `fastq` -* `fqfolder`: `red_algae/fastq` -* `keep_sra`: `True` -* `logdev`: `False` -* `mode`: `convert` -* `output_parent`: `red_algae/results_pipeline` -* `recover`: `False` -* `sample_name`: `['SRX969075']` -* `silent`: `False` -* `srafolder`: `/home/nsheff/ncbi/public/sra/` -* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930185.sra']` -* `verbosity`: `None` - ----------------------------------------- - -Processing 1 of 1 files: SRR1930185 -Target to produce: `red_algae/fastq/SRR1930185_1.fastq.gz` - -> `fastq-dump 
/home/nsheff/ncbi/public/sra//SRR1930185.sra --split-files --gzip -O red_algae/fastq` (9607) -
-Read 1707508 spots for /home/nsheff/ncbi/public/sra//SRR1930185.sra -Written 1707508 spots for /home/nsheff/ncbi/public/sra//SRR1930185.sra --Command completed. Elapsed time: 0:01:01. Running peak memory: 0.066GB. - PID: 9607; Command: fastq-dump; Return code: 0; Memory used: 0.066GB - -Already completed files: [] - -### Pipeline completed. Epilogue -* Elapsed time (this run): 0:01:01 -* Total elapsed time (all runs): 0:01:01 -* Peak memory (this run): 0.0656 GB -* Pipeline completed time: 2020-05-21 17:42:37 -## [4 of 4] sample: Cm_Darkness_Rep2; pipeline: sra_convert -Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub -Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub -Compute node: zither -Start time: 2020-05-21 17:42:38 -Using outfolder: red_algae/results_pipeline/SRX969076 -### Pipeline run code and environment: - -* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930186.sra --sample-name SRX969076 -O red_algae/results_pipeline --keep-sra` -* Compute host: zither -* Working dir: /home/nsheff/code/geofetch/docs_jupyter -* Outfolder: red_algae/results_pipeline/SRX969076/ -* Pipeline started at: (05-21 17:42:38) elapsed: 0.0 _TIME_ - -### Version log: - -* Python version: 3.7.5 -* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper` -* Pypiper version: 0.12.1 -* Pipeline dir: `/home/nsheff/.local/bin` -* Pipeline version: None - -### Arguments passed to pipeline: - -* `bamfolder`: `` -* `config_file`: `sraconvert.yaml` -* `format`: `fastq` -* `fqfolder`: `red_algae/fastq` -* `keep_sra`: `True` -* `logdev`: `False` -* `mode`: `convert` -* `output_parent`: `red_algae/results_pipeline` -* `recover`: `False` -* `sample_name`: `['SRX969076']` -* `silent`: `False` -* `srafolder`: `/home/nsheff/ncbi/public/sra/` -* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930186.sra']` -* `verbosity`: `None` - 
----------------------------------------- - -Processing 1 of 1 files: SRR1930186 -Target to produce: `red_algae/fastq/SRR1930186_1.fastq.gz` - -> `fastq-dump /home/nsheff/ncbi/public/sra//SRR1930186.sra --split-files --gzip -O red_algae/fastq` (9780) -
-Read 1224029 spots for /home/nsheff/ncbi/public/sra//SRR1930186.sra -Written 1224029 spots for /home/nsheff/ncbi/public/sra//SRR1930186.sra --Command completed. Elapsed time: 0:00:44. Running peak memory: 0.067GB. - PID: 9780; Command: fastq-dump; Return code: 0; Memory used: 0.067GB - -Already completed files: [] - -### Pipeline completed. Epilogue -* Elapsed time (this run): 0:00:44 -* Total elapsed time (all runs): 0:00:44 -* Peak memory (this run): 0.0673 GB -* Pipeline completed time: 2020-05-21 17:43:22 - -Looper finished -Samples valid for job generation: 4 of 4 -Commands submitted: 4 of 4 -Jobs submitted: 4 - -``` - -Now that's done, let's take a look in the `red_algae/fastq` folder (where we set the `$SRAFQ` variable). - - -```bash -ls red_algae/fastq -``` - -```.output -SRR1930183_1.fastq.gz SRR1930184_2.fastq.gz SRR1930186_1.fastq.gz -SRR1930183_2.fastq.gz SRR1930185_1.fastq.gz SRR1930186_2.fastq.gz -SRR1930184_1.fastq.gz SRR1930185_2.fastq.gz - -``` - -By default, the sra conversion script will delete the `.sra` files after they have been converted to fastq. You can keep them if you want by passing `--keep-sra`, which you can do by passing `--command-extra=--keep-sra` to your `looper run` command. 
- ## Finalize the project config and sample annotation diff --git a/docs_jupyter/how_to_convert_fastq_from_sra.ipynb b/docs_jupyter/how_to_convert_fastq_from_sra.ipynb new file mode 100644 index 0000000..86758e7 --- /dev/null +++ b/docs_jupyter/how_to_convert_fastq_from_sra.ipynb @@ -0,0 +1,736 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b5093d6d", + "metadata": {}, + "source": [ + "## How to extract fastq files from SRA" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "5d04aca7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "geofetch 0.12.4\n" + ] + } + ], + "source": [ + "geofetch --version" + ] + }, + { + "cell_type": "markdown", + "id": "51be28fa", + "metadata": {}, + "source": [ + "1) Download SRA files and PEP using GEOfetch\n", + "\n", + "Add flags: \n", + "a) `--add-convert-modifier` (To add looper configurations for conversion)\n", + "b) `--discard-soft` (To delete soft files. We don't need them :D)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "5d1d2a6a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Metadata folder: /home/bnt4me/virginia/repos/geofetch/docs_jupyter/red_algae\n", + "Trying GSE67303 (not a file) as accession...\n", + "Skipped 0 accessions. 
Starting now.\n", + "\u001B[38;5;200mProcessing accession 1 of 1: 'GSE67303'\u001B[0m\n", + "Processed 4 samples.\n", + "Expanding metadata list...\n", + "Found SRA Project accession: SRP056574\n", + "Downloading SRP056574 sra metadata\n", + "Parsing SRA file to download SRR records\n", + "Getting SRR: SRR1930183 in (GSE67303)\n", + "\n", + "2023-08-01T17:04:12 prefetch.2.11.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.\n", + "2023-08-01T17:04:12 prefetch.2.11.3: 1) Downloading 'SRR1930183'...\n", + "2023-08-01T17:04:12 prefetch.2.11.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.\n", + "2023-08-01T17:04:12 prefetch.2.11.3: Downloading via HTTPS...\n", + "2023-08-01T17:04:14 prefetch.2.11.3: HTTPS download succeed\n", + "2023-08-01T17:04:15 prefetch.2.11.3: 'SRR1930183' is valid\n", + "2023-08-01T17:04:15 prefetch.2.11.3: 1) 'SRR1930183' was downloaded successfully\n", + "2023-08-01T17:04:15 prefetch.2.11.3: 'SRR1930183' has 0 unresolved dependencies\n", + "Getting SRR: SRR1930184 in (GSE67303)\n", + "\n", + "2023-08-01T17:04:15 prefetch.2.11.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.\n", + "2023-08-01T17:04:16 prefetch.2.11.3: 1) Downloading 'SRR1930184'...\n", + "2023-08-01T17:04:16 prefetch.2.11.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.\n", + "2023-08-01T17:04:16 prefetch.2.11.3: Downloading via HTTPS...\n", + "2023-08-01T17:04:17 prefetch.2.11.3: HTTPS download succeed\n", + "2023-08-01T17:04:18 prefetch.2.11.3: 'SRR1930184' is valid\n", + "2023-08-01T17:04:18 prefetch.2.11.3: 1) 'SRR1930184' was downloaded successfully\n", + "2023-08-01T17:04:18 prefetch.2.11.3: 'SRR1930184' has 0 unresolved dependencies\n", + "Getting SRR: SRR1930185 in (GSE67303)\n", + "\n", + 
"2023-08-01T17:04:19 prefetch.2.11.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.\n", + "2023-08-01T17:04:19 prefetch.2.11.3: 1) Downloading 'SRR1930185'...\n", + "2023-08-01T17:04:19 prefetch.2.11.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.\n", + "2023-08-01T17:04:19 prefetch.2.11.3: Downloading via HTTPS...\n", + "2023-08-01T17:04:22 prefetch.2.11.3: HTTPS download succeed\n", + "2023-08-01T17:04:22 prefetch.2.11.3: 'SRR1930185' is valid\n", + "2023-08-01T17:04:22 prefetch.2.11.3: 1) 'SRR1930185' was downloaded successfully\n", + "2023-08-01T17:04:22 prefetch.2.11.3: 'SRR1930185' has 0 unresolved dependencies\n", + "Getting SRR: SRR1930186 in (GSE67303)\n", + "\n", + "2023-08-01T17:04:22 prefetch.2.11.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.\n", + "2023-08-01T17:04:23 prefetch.2.11.3: 1) Downloading 'SRR1930186'...\n", + "2023-08-01T17:04:23 prefetch.2.11.3: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.\n", + "2023-08-01T17:04:23 prefetch.2.11.3: Downloading via HTTPS...\n", + "2023-08-01T17:04:25 prefetch.2.11.3: HTTPS download succeed\n", + "2023-08-01T17:04:25 prefetch.2.11.3: 'SRR1930186' is valid\n", + "2023-08-01T17:04:25 prefetch.2.11.3: 1) 'SRR1930186' was downloaded successfully\n", + "2023-08-01T17:04:25 prefetch.2.11.3: 'SRR1930186' has 0 unresolved dependencies\n", + "Finished processing 1 accession(s)\n", + "Cleaning soft files ...\n", + "Creating complete project annotation sheets and config file...\n", + "\u001B[92mSample annotation sheet: /home/bnt4me/virginia/repos/geofetch/docs_jupyter/red_algae/GSE67303_PEP/GSE67303_PEP_raw.csv . 
Saved!\u001B[0m\n", + "\u001B[92mFile has been saved successfully\u001B[0m\n", + " Config file: /home/bnt4me/virginia/repos/geofetch/docs_jupyter/red_algae/GSE67303_PEP/GSE67303_PEP.yaml\n" + ] + } + ], + "source": [ + "geofetch -i GSE67303 -n red_algae -m `pwd` --add-convert-modifier --discard-soft" + ] + }, + { + "cell_type": "markdown", + "id": "a6b24693", + "metadata": {}, + "source": [ + "Let's see if files were downloaded:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "37def9a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001B[0m\u001B[01;34mbuild\u001B[0m python-usage.ipynb \u001B[01;34mSRR1930184\u001B[0m\n", + "\u001B[01;34mcode\u001B[0m raw-data-downloading.ipynb \u001B[01;34mSRR1930185\u001B[0m\n", + "how_to_fastq_from_sra.ipynb \u001B[01;34mred_algae\u001B[0m \u001B[01;34mSRR1930186\u001B[0m\n", + "processed-data-downloading.ipynb \u001B[01;34mSRR1930183\u001B[0m\n" + ] + } + ], + "source": [ + "ls" + ] + }, + { + "cell_type": "markdown", + "id": "6831883b", + "metadata": {}, + "source": [ + "Now let's check what our config file looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c13991dd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "# Autogenerated by geofetch\n", + "\n", + "name: GSE67303\n", + "pep_version: 2.1.0\n", + "sample_table: GSE67303_PEP_raw.csv\n", + "\n", + "\"experiment_metadata\":\n", + " \"series_contact_address\": \"930 N University Ave\"\n", + " \"series_contact_city\": \"Ann Arbor\"\n", + " \"series_contact_country\": \"USA\"\n", + " \"series_contact_department\": \"Chemistry\"\n", + " \"series_contact_email\": \"mtardu@umich.edu\"\n", + " \"series_contact_institute\": \"University of Michigan\"\n", + " \"series_contact_laboratory\": \"Koutmou Lab\"\n", + " \"series_contact_name\": \"mehmet,,tardu\"\n", + " \"series_contact_state\": \"MI\"\n", + " 
\"series_contact_zip_postal_code\": \"48109\"\n", + " \"series_contributor\": \"Mehmet,,Tardu + Ugur,M,Dikbas + Ibrahim,,Baris + Ibrahim,H,Kavakli\"\n", + " \"series_geo_accession\": \"GSE67303\"\n", + " \"series_last_update_date\": \"May 15 2019\"\n", + " \"series_overall_design\": \"Identification of blue light and red light regulated genes\\\n", + " \\ by deep sequencing in biological duplicates. qRT-PCR was performed to verify\\\n", + " \\ the RNA-seq results.\"\n", + " \"series_platform_id\": \"GPL19949\"\n", + " \"series_platform_organism\": \"Cyanidioschyzon merolae strain 10D\"\n", + " \"series_platform_taxid\": \"280699\"\n", + " \"series_pubmed_id\": \"27614431\"\n", + " \"series_relation\": \"BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA279462\\\n", + " \\ + SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP056574\"\n", + " \"series_sample_id\": \"GSM1644066 + GSM1644067 + GSM1644068 + GSM1644069\"\n", + " \"series_sample_organism\": \"Cyanidioschyzon merolae strain 10D\"\n", + " \"series_sample_taxid\": \"280699\"\n", + " \"series_status\": \"Public on Sep 01 2016\"\n", + " \"series_submission_date\": \"Mar 26 2015\"\n", + " \"series_summary\": \"Light is one of the main environmental cues that affects the\\\n", + " \\ physiology and behavior of many organisms. The effect of light on genome-wide\\\n", + " \\ transcriptional regulation has been well-studied in green algae and plants,\\\n", + " \\ but not in red algae. Cyanidioschyzon merolae is used as a model red algae,\\\n", + " \\ and is suitable for studies on transcriptomics because of its compact genome\\\n", + " \\ with a relatively small number of genes. In addition, complete genome sequences\\\n", + " \\ of the nucleus, mitochondrion, and chloroplast of this organism have been determined.\\\n", + " \\ Together, these attributes make C. merolae an ideal model organism to study\\\n", + " \\ the response to light stimuli at the transcriptional and the systems biology\\\n", + " \\ levels. 
Previous studies have shown that light significantly affects cell signaling\\\n", + " \\ in this organism, but there are no reports on its blue light- and red light-mediated\\\n", + " \\ transcriptional responses. We investigated the direct effects of blue and red\\\n", + " \\ light at the transcriptional level using RNA-seq. Blue and red light were found\\\n", + " \\ to regulate 35% of the total genes in C. merolae. Blue light affected the transcription\\\n", + " \\ of genes involved protein synthesis while red light specifically regulated the\\\n", + " \\ transcription of genes involved in photosynthesis and DNA repair. Blue or red\\\n", + " \\ light regulated genes involved in carbon metabolism and pigment biosynthesis.\\\n", + " \\ Overall, our data showed that red and blue light regulate the majority of the\\\n", + " \\ cellular, cell division, and repair processes in C. merolae.\"\n", + " \"series_supplementary_file\": \"ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE67nnn/GSE67303/suppl/GSE67303_DEG_cuffdiff.xlsx\"\n", + " \"series_title\": \"RNA-seq analysis of the transcriptional response to blue and red\\\n", + " \\ light in the extremophilic red alga, Cyanidioschyzon merolae\"\n", + " \"series_type\": \"Expression profiling by high throughput sequencing\"\n", + "\n", + "\n", + "\n", + "sample_modifiers:\n", + " append:\n", + " # Project metadata:\n", + " sample_treatment_protocol_ch1: \"Cells were exposed to blue-light (15 µmole m-2s-1) for 30 minutes\"\n", + " sample_growth_protocol_ch1: \"Cyanidioschyzon merolae cells were grown in 2xMA media\"\n", + " sample_extract_protocol_ch1: \"Dark kept and blue-light exposed C.merolae cells were removed and RNA was harvested using Trizol reagent. 
Illumina TruSeq RNA Sample Prep Kit (Cat#RS-122-2001) was used with 1 ug of total RNA for the construction of sequencing libraries., RNA libraries were prepared for sequencing using standard Illumina protocols\"\n", + " sample_data_processing: \"The purified cDNA library was sequenced on Illumina''s MiSeq sequencing platform following vendor''s instruction for running the instrument., Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Cyanidioschyzon merolae 10D reference genome (assembly ID:ASM9120v1) using TopHat (v2.0.5)., Differential expression analysis was conducted by using cuffdiff tool in cufflink suite (v2.2)\"\n", + " supplementary_files_format_and_content: \"Excel spreadsheet includes FPKM values for Darkness and Blue-Light exposed samples with p and q values of cuffdiff output.\"\n", + " # End of project metadata\n", + " \n", + "\n", + " # Adding sra convert looper pipeline\n", + " SRR_files: SRA\n", + "\n", + " derive:\n", + " attributes: [read1, read2, SRR_files]\n", + " sources:\n", + " SRA: \"${SRABAM}/{srr}.bam\"\n", + " FQ: \"${SRAFQ}/{srr}.fastq.gz\"\n", + " FQ1: \"${SRAFQ}/{srr}_1.fastq.gz\"\n", + " FQ2: \"${SRAFQ}/{srr}_2.fastq.gz\"\n", + " imply:\n", + " - if:\n", + " organism: \"Mus musculus\"\n", + " then:\n", + " genome: mm10\n", + " - if:\n", + " organism: \"Homo sapiens\"\n", + " then:\n", + " genome: hg38\n", + " - if:\n", + " read_type: \"PAIRED\"\n", + " then:\n", + " read1: FQ1\n", + " read2: FQ2\n", + " - if:\n", + " read_type: \"SINGLE\"\n", + " then:\n", + " read1: FQ1\n", + "\n", + "project_modifiers:\n", + " amend:\n", + " sra_convert:\n", + " looper:\n", + " results_subdir: sra_convert_results\n", + " sample_modifiers:\n", + " append:\n", + " SRR_files: SRA\n", + " pipeline_interfaces: ${CODE}/geofetch/pipeline_interface_convert.yaml\n", + " derive:\n", + " attributes: [read1, read2, SRR_files]\n", + " sources:\n", + " SRA: \"${SRARAW}/{srr}/{srr}.sra\"\n", + 
" FQ: \"${SRAFQ}/{srr}.fastq.gz\"\n", + " FQ1: \"${SRAFQ}/{srr}_1.fastq.gz\"\n", + " FQ2: \"${SRAFQ}/{srr}_2.fastq.gz\"\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "cat ./red_algae/GSE67303_PEP/GSE67303_PEP.yaml" + ] + }, + { + "cell_type": "markdown", + "id": "13a128a6", + "metadata": {}, + "source": [ + "To run pipeline, you should set up few enviromental variables:\n", + "1) SRARAW - folder where SRA files were downloaded\n", + "2) SRAFQ -folder where fastq should be produced\n", + "3) CODE - (first you should clone geofetch), and $CODE is where geofetch folder is located" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "d4af5280", + "metadata": {}, + "outputs": [], + "source": [ + "# Set SRARAW env\n", + "export SRARAW=`pwd`" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "981f6073", + "metadata": {}, + "outputs": [], + "source": [ + "# Create folder where you want to store fq\n", + "mkdir fq_folder" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c2cb5330", + "metadata": {}, + "outputs": [], + "source": [ + "# Set SRAFQ env\n", + "export SRAFQ=`pwd`/fq_folder" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "45bee81f", + "metadata": {}, + "outputs": [], + "source": [ + "# Unfortunately you have to pull gefetch folder from github, and set CODE variable:\n", + "mkdir code && cd code && git clone https://github.com/pepkit/geofetch.git && export CODE=`pwd` && cd .." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "1153dab2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001B[0m\u001B[01;34mbuild\u001B[0m processed-data-downloading.ipynb \u001B[01;34mSRR1930183\u001B[0m\n", + "\u001B[01;34mcode\u001B[0m python-usage.ipynb \u001B[01;34mSRR1930184\u001B[0m\n", + "\u001B[01;34mfq_folder\u001B[0m raw-data-downloading.ipynb \u001B[01;34mSRR1930185\u001B[0m\n", + "how_to_fastq_from_sra.ipynb \u001B[01;34mred_algae\u001B[0m \u001B[01;34mSRR1930186\u001B[0m\n" + ] + } + ], + "source": [ + "ls" + ] + }, + { + "cell_type": "markdown", + "id": "d03578ac", + "metadata": {}, + "source": [ + "### Now install looper if you don't have it" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "b4aa8176", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "looper 1.4.3\n", + "\u001B[0m\n" + ] + } + ], + "source": [ + "looper --version" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "0bcd03a7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001B[0m\u001B[01;34mGSE67303_PEP\u001B[0m\n" + ] + } + ], + "source": [ + "ls red_algae" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "a9a67e5c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Looper version: 1.4.3\n", + "Command: run\n", + "Using default config. 
No config found in env var: ['DIVCFG']\n", + "Using amendments: sra_convert\n", + "Activating compute package 'local'\n", + "Pipestat compatible: False\n", + "\u001B[36m## [1 of 4] sample: cm_bluelight_rep1; pipeline: sra_convert\u001B[0m\n", + "Writing script to /home/bnt4me/virginia/repos/geofetch/docs_jupyter/submission/sra_convert_cm_bluelight_rep1.sub\n", + "Job script (n=1; 0.06Gb): ./submission/sra_convert_cm_bluelight_rep1.sub\n", + "Compute node: bnt4me-Precision-5560\n", + "Start time: 2023-08-01 13:06:42\n", + "Using outfolder: ./sra_convert_results/SRR1930183\n", + "### Pipeline run code and environment:\n", + "\n", + "* Command: `/home/bnt4me/virginia/venv/jupyter/bin/sraconvert --srr /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930183/SRR1930183.sra -O ./sra_convert_results`\n", + "* Compute host: bnt4me-Precision-5560\n", + "* Working dir: /home/bnt4me/virginia/repos/geofetch/docs_jupyter\n", + "* Outfolder: ./sra_convert_results/SRR1930183/\n", + "* Pipeline started at: (08-01 13:06:42) elapsed: 0.0 _TIME_\n", + "\n", + "### Version log:\n", + "\n", + "* Python version: 3.10.6\n", + "* Pypiper dir: `/home/bnt4me/virginia/venv/jupyter/lib/python3.10/site-packages/pypiper`\n", + "* Pypiper version: 0.12.3\n", + "* Pipeline dir: `/home/bnt4me/virginia/venv/jupyter/bin`\n", + "* Pipeline version: None\n", + "\n", + "### Arguments passed to pipeline:\n", + "\n", + "* `bamfolder`: ``\n", + "* `config_file`: `sraconvert.yaml`\n", + "* `format`: `fastq`\n", + "* `fqfolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder`\n", + "* `keep_sra`: `False`\n", + "* `logdev`: `False`\n", + "* `mode`: `convert`\n", + "* `output_parent`: `./sra_convert_results`\n", + "* `recover`: `False`\n", + "* `sample_name`: `None`\n", + "* `silent`: `False`\n", + "* `srafolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter`\n", + "* `srr`: `['/home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930183/SRR1930183.sra']`\n", + "* `verbosity`: 
`None`\n", + "\n", + "----------------------------------------\n", + "\n", + "Processing 1 of 1 files: SRR1930183\n", + "Target to produce: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder/SRR1930183_1.fastq.gz` \n", + "\n", + "> `fasterq-dump /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930183/SRR1930183.sra -O /home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder` (744928)\n", + "
\n", + "spots read : 1,068,319\n", + "reads read : 2,136,638\n", + "reads written : 2,136,638\n", + "\n", + "Command completed. Elapsed time: 0:00:02. Running peak memory: 0.08GB. \n", + " PID: 744928;\tCommand: fasterq-dump;\tReturn code: 0;\tMemory used: 0.08GB\n", + "\n", + "Already completed files: []\n", + "\n", + "### Pipeline completed. Epilogue\n", + "* Elapsed time (this run): 0:00:02\n", + "* Total elapsed time (all runs): 0:00:02\n", + "* Peak memory (this run): 0.0803 GB\n", + "* Pipeline completed time: 2023-08-01 13:06:44\n", + "\u001B[36m## [2 of 4] sample: cm_bluelight_rep2; pipeline: sra_convert\u001B[0m\n", + "Writing script to /home/bnt4me/virginia/repos/geofetch/docs_jupyter/submission/sra_convert_cm_bluelight_rep2.sub\n", + "Job script (n=1; 0.04Gb): ./submission/sra_convert_cm_bluelight_rep2.sub\n", + "Compute node: bnt4me-Precision-5560\n", + "Start time: 2023-08-01 13:06:44\n", + "Using outfolder: ./sra_convert_results/SRR1930184\n", + "### Pipeline run code and environment:\n", + "\n", + "* Command: `/home/bnt4me/virginia/venv/jupyter/bin/sraconvert --srr /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930184/SRR1930184.sra -O ./sra_convert_results`\n", + "* Compute host: bnt4me-Precision-5560\n", + "* Working dir: /home/bnt4me/virginia/repos/geofetch/docs_jupyter\n", + "* Outfolder: ./sra_convert_results/SRR1930184/\n", + "* Pipeline started at: (08-01 13:06:45) elapsed: 0.0 _TIME_\n", + "\n", + "### Version log:\n", + "\n", + "* Python version: 3.10.6\n", + "* Pypiper dir: `/home/bnt4me/virginia/venv/jupyter/lib/python3.10/site-packages/pypiper`\n", + "* Pypiper version: 0.12.3\n", + "* Pipeline dir: `/home/bnt4me/virginia/venv/jupyter/bin`\n", + "* Pipeline version: None\n", + "\n", + "### Arguments passed to pipeline:\n", + "\n", + "* `bamfolder`: ``\n", + "* `config_file`: `sraconvert.yaml`\n", + "* `format`: `fastq`\n", + "* `fqfolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder`\n", + "* `keep_sra`: 
`False`\n", + "* `logdev`: `False`\n", + "* `mode`: `convert`\n", + "* `output_parent`: `./sra_convert_results`\n", + "* `recover`: `False`\n", + "* `sample_name`: `None`\n", + "* `silent`: `False`\n", + "* `srafolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter`\n", + "* `srr`: `['/home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930184/SRR1930184.sra']`\n", + "* `verbosity`: `None`\n", + "\n", + "----------------------------------------\n", + "\n", + "Processing 1 of 1 files: SRR1930184\n", + "Target to produce: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder/SRR1930184_1.fastq.gz` \n", + "\n", + "> `fasterq-dump /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930184/SRR1930184.sra -O /home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder` (744973)\n", + "
\n", + "spots read : 762,229\n", + "reads read : 1,524,458\n", + "reads written : 1,524,458\n", + "\n", + "Command completed. Elapsed time: 0:00:02. Running peak memory: 0.012GB. \n", + " PID: 744973;\tCommand: fasterq-dump;\tReturn code: 0;\tMemory used: 0.012GB\n", + "\n", + "Already completed files: []\n", + "\n", + "### Pipeline completed. Epilogue\n", + "* Elapsed time (this run): 0:00:02\n", + "* Total elapsed time (all runs): 0:00:02\n", + "* Peak memory (this run): 0.0118 GB\n", + "* Pipeline completed time: 2023-08-01 13:06:47\n", + "\u001B[36m## [3 of 4] sample: cm_darkness_rep1; pipeline: sra_convert\u001B[0m\n", + "Writing script to /home/bnt4me/virginia/repos/geofetch/docs_jupyter/submission/sra_convert_cm_darkness_rep1.sub\n", + "Job script (n=1; 0.09Gb): ./submission/sra_convert_cm_darkness_rep1.sub\n", + "Compute node: bnt4me-Precision-5560\n", + "Start time: 2023-08-01 13:06:47\n", + "Using outfolder: ./sra_convert_results/SRR1930185\n", + "### Pipeline run code and environment:\n", + "\n", + "* Command: `/home/bnt4me/virginia/venv/jupyter/bin/sraconvert --srr /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930185/SRR1930185.sra -O ./sra_convert_results`\n", + "* Compute host: bnt4me-Precision-5560\n", + "* Working dir: /home/bnt4me/virginia/repos/geofetch/docs_jupyter\n", + "* Outfolder: ./sra_convert_results/SRR1930185/\n", + "* Pipeline started at: (08-01 13:06:47) elapsed: 0.0 _TIME_\n", + "\n", + "### Version log:\n", + "\n", + "* Python version: 3.10.6\n", + "* Pypiper dir: `/home/bnt4me/virginia/venv/jupyter/lib/python3.10/site-packages/pypiper`\n", + "* Pypiper version: 0.12.3\n", + "* Pipeline dir: `/home/bnt4me/virginia/venv/jupyter/bin`\n", + "* Pipeline version: None\n", + "\n", + "### Arguments passed to pipeline:\n", + "\n", + "* `bamfolder`: ``\n", + "* `config_file`: `sraconvert.yaml`\n", + "* `format`: `fastq`\n", + "* `fqfolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder`\n", + "* `keep_sra`: `False`\n", + 
"* `logdev`: `False`\n", + "* `mode`: `convert`\n", + "* `output_parent`: `./sra_convert_results`\n", + "* `recover`: `False`\n", + "* `sample_name`: `None`\n", + "* `silent`: `False`\n", + "* `srafolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter`\n", + "* `srr`: `['/home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930185/SRR1930185.sra']`\n", + "* `verbosity`: `None`\n", + "\n", + "----------------------------------------\n", + "\n", + "Processing 1 of 1 files: SRR1930185\n", + "Target to produce: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder/SRR1930185_1.fastq.gz` \n", + "\n", + "> `fasterq-dump /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930185/SRR1930185.sra -O /home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder` (745021)\n", + "
\n", + "spots read : 1,707,508\n", + "reads read : 3,415,016\n", + "reads written : 3,415,016\n", + "\n", + "Command completed. Elapsed time: 0:00:03. Running peak memory: 0.079GB. \n", + " PID: 745021;\tCommand: fasterq-dump;\tReturn code: 0;\tMemory used: 0.079GB\n", + "\n", + "Already completed files: []\n", + "\n", + "### Pipeline completed. Epilogue\n", + "* Elapsed time (this run): 0:00:03\n", + "* Total elapsed time (all runs): 0:00:03\n", + "* Peak memory (this run): 0.0793 GB\n", + "* Pipeline completed time: 2023-08-01 13:06:50\n", + "\u001B[36m## [4 of 4] sample: cm_darkness_rep2; pipeline: sra_convert\u001B[0m\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Writing script to /home/bnt4me/virginia/repos/geofetch/docs_jupyter/submission/sra_convert_cm_darkness_rep2.sub\n", + "Job script (n=1; 0.07Gb): ./submission/sra_convert_cm_darkness_rep2.sub\n", + "Compute node: bnt4me-Precision-5560\n", + "Start time: 2023-08-01 13:06:50\n", + "Using outfolder: ./sra_convert_results/SRR1930186\n", + "### Pipeline run code and environment:\n", + "\n", + "* Command: `/home/bnt4me/virginia/venv/jupyter/bin/sraconvert --srr /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930186/SRR1930186.sra -O ./sra_convert_results`\n", + "* Compute host: bnt4me-Precision-5560\n", + "* Working dir: /home/bnt4me/virginia/repos/geofetch/docs_jupyter\n", + "* Outfolder: ./sra_convert_results/SRR1930186/\n", + "* Pipeline started at: (08-01 13:06:51) elapsed: 0.0 _TIME_\n", + "\n", + "### Version log:\n", + "\n", + "* Python version: 3.10.6\n", + "* Pypiper dir: `/home/bnt4me/virginia/venv/jupyter/lib/python3.10/site-packages/pypiper`\n", + "* Pypiper version: 0.12.3\n", + "* Pipeline dir: `/home/bnt4me/virginia/venv/jupyter/bin`\n", + "* Pipeline version: None\n", + "\n", + "### Arguments passed to pipeline:\n", + "\n", + "* `bamfolder`: ``\n", + "* `config_file`: `sraconvert.yaml`\n", + "* `format`: `fastq`\n", + "* `fqfolder`: 
`/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder`\n", + "* `keep_sra`: `False`\n", + "* `logdev`: `False`\n", + "* `mode`: `convert`\n", + "* `output_parent`: `./sra_convert_results`\n", + "* `recover`: `False`\n", + "* `sample_name`: `None`\n", + "* `silent`: `False`\n", + "* `srafolder`: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter`\n", + "* `srr`: `['/home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930186/SRR1930186.sra']`\n", + "* `verbosity`: `None`\n", + "\n", + "----------------------------------------\n", + "\n", + "Processing 1 of 1 files: SRR1930186\n", + "Target to produce: `/home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder/SRR1930186_1.fastq.gz` \n", + "\n", + "> `fasterq-dump /home/bnt4me/virginia/repos/geofetch/docs_jupyter/SRR1930186/SRR1930186.sra -O /home/bnt4me/virginia/repos/geofetch/docs_jupyter/fq_folder` (745069)\n", + "
\n", + "spots read : 1,224,029\n", + "reads read : 2,448,058\n", + "reads written : 2,448,058\n", + "\n", + "Command completed. Elapsed time: 0:00:02. Running peak memory: 0.081GB. \n", + " PID: 745069;\tCommand: fasterq-dump;\tReturn code: 0;\tMemory used: 0.081GB\n", + "\n", + "Already completed files: []\n", + "\n", + "### Pipeline completed. Epilogue\n", + "* Elapsed time (this run): 0:00:02\n", + "* Total elapsed time (all runs): 0:00:02\n", + "* Peak memory (this run): 0.0813 GB\n", + "* Pipeline completed time: 2023-08-01 13:06:53\n", + "\n", + "Looper finished\n", + "Samples valid for job generation: 4 of 4\n", + "Commands submitted: 4 of 4\n", + "Jobs submitted: 4\n", + "\u001B[0m\n" + ] + } + ], + "source": [ + "looper run red_algae/GSE67303_PEP/GSE67303_PEP.yaml -a sra_convert -p local --output-dir ." + ] + }, + { + "cell_type": "markdown", + "id": "36d24512", + "metadata": {}, + "source": [ + "### Check if everything worked:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "2a79f578", + "metadata": {}, + "outputs": [], + "source": [ + "cd fq_folder" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "fefdf187", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SRR1930183_1.fastq SRR1930184_1.fastq SRR1930185_1.fastq SRR1930186_1.fastq\n", + "SRR1930183_2.fastq SRR1930184_2.fastq SRR1930185_2.fastq SRR1930186_2.fastq\n" + ] + } + ], + "source": [ + "ls" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Bash", + "language": "bash", + "name": "bash" + }, + "language_info": { + "codemirror_mode": "shell", + "file_extension": ".sh", + "mimetype": "text/x-sh", + "name": "bash" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs_jupyter/raw-data-downloading.ipynb b/docs_jupyter/raw-data-downloading.ipynb index 94373f7..831e98c 100644 --- a/docs_jupyter/raw-data-downloading.ipynb +++ b/docs_jupyter/raw-data-downloading.ipynb @@ -476,362 
+476,6 @@ "geofetch -i GSE67303 -n red_algae -m `pwd`" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Convert to fastq format\n", - "\n", - "Now the `.sra` files have been downloaded. The project that was automatically created by GEO contained an amendment for sra file conversion. This project expects you to have an environment variable called `SRARAW` that points to the location where `prefetch` stores your `.sra` files. We also should define a `$SRAFQ` variable to point to where we ant the fastq files stored. In this command below, we set these on the fly for this command, but you can also just use globals.\n", - "\n", - "We'll use `-d` first to do a dry run:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looper version: 1.2.0-dev\n", - "Command: run\n", - "Using amendments: sra_convert\n", - "Activating compute package 'local'\n", - "\u001b[36m## [1 of 4] sample: Cm_BlueLight_Rep1; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub\n", - "Dry run, not submitted\n", - "\u001b[36m## [2 of 4] sample: Cm_BlueLight_Rep2; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub\n", - "Dry run, not submitted\n", - "\u001b[36m## [3 of 4] sample: Cm_Darkness_Rep1; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub\n", - "Dry run, not submitted\n", - "\u001b[36m## [4 of 4] sample: 
Cm_Darkness_Rep2; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub\n", - "Dry run, not submitted\n", - "\n", - "Looper finished\n", - "Samples valid for job generation: 4 of 4\n", - "Commands submitted: 4 of 4\n", - "Jobs submitted: 4\n", - "Dry run. No jobs were actually submitted.\n", - "\u001b[0m" - ] - } - ], - "source": [ - "SRARAW=${HOME}/ncbi/public/sra/ SRAFQ=red_algae/fastq \\\n", - " looper run red_algae/red_algae_config.yaml -a sra_convert -p local -d" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And now the real thing:" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Looper version: 1.2.0-dev\n", - "Command: run\n", - "Using amendments: sra_convert\n", - "Activating compute package 'local'\n", - "\u001b[36m## [1 of 4] sample: Cm_BlueLight_Rep1; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep1.sub\n", - "Compute node: zither\n", - "Start time: 2020-05-21 17:40:56\n", - "Using outfolder: red_algae/results_pipeline/SRX969073\n", - "### Pipeline run code and environment:\n", - "\n", - "* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930183.sra --sample-name SRX969073 -O red_algae/results_pipeline --keep-sra`\n", - "* Compute host: zither\n", - "* Working dir: /home/nsheff/code/geofetch/docs_jupyter\n", - "* Outfolder: red_algae/results_pipeline/SRX969073/\n", - "* Pipeline started at: (05-21 17:40:57) elapsed: 0.0 _TIME_\n", - "\n", - "### Version log:\n", - "\n", - "* Python version: 3.7.5\n", - "* Pypiper dir: 
`/home/nsheff/.local/lib/python3.7/site-packages/pypiper`\n", - "* Pypiper version: 0.12.1\n", - "* Pipeline dir: `/home/nsheff/.local/bin`\n", - "* Pipeline version: None\n", - "\n", - "### Arguments passed to pipeline:\n", - "\n", - "* `bamfolder`: ``\n", - "* `config_file`: `sraconvert.yaml`\n", - "* `format`: `fastq`\n", - "* `fqfolder`: `red_algae/fastq`\n", - "* `keep_sra`: `True`\n", - "* `logdev`: `False`\n", - "* `mode`: `convert`\n", - "* `output_parent`: `red_algae/results_pipeline`\n", - "* `recover`: `False`\n", - "* `sample_name`: `['SRX969073']`\n", - "* `silent`: `False`\n", - "* `srafolder`: `/home/nsheff/ncbi/public/sra/`\n", - "* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930183.sra']`\n", - "* `verbosity`: `None`\n", - "\n", - "----------------------------------------\n", - "\n", - "Processing 1 of 1 files: SRR1930183\n", - "Target to produce: `red_algae/fastq/SRR1930183_1.fastq.gz` \n", - "\n", - "> `fastq-dump /home/nsheff/ncbi/public/sra//SRR1930183.sra --split-files --gzip -O red_algae/fastq` (9436)\n", - "
\n", - "Read 1068319 spots for /home/nsheff/ncbi/public/sra//SRR1930183.sra\n", - "Written 1068319 spots for /home/nsheff/ncbi/public/sra//SRR1930183.sra\n", - "\n", - "Command completed. Elapsed time: 0:00:38. Running peak memory: 0.067GB. \n", - " PID: 9436;\tCommand: fastq-dump;\tReturn code: 0;\tMemory used: 0.067GB\n", - "\n", - "Already completed files: []\n", - "\n", - "### Pipeline completed. Epilogue\n", - "* Elapsed time (this run): 0:00:38\n", - "* Total elapsed time (all runs): 0:00:38\n", - "* Peak memory (this run): 0.0666 GB\n", - "* Pipeline completed time: 2020-05-21 17:41:35\n", - "\u001b[36m## [2 of 4] sample: Cm_BlueLight_Rep2; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_BlueLight_Rep2.sub\n", - "Compute node: zither\n", - "Start time: 2020-05-21 17:41:36\n", - "Using outfolder: red_algae/results_pipeline/SRX969074\n", - "### Pipeline run code and environment:\n", - "\n", - "* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930184.sra --sample-name SRX969074 -O red_algae/results_pipeline --keep-sra`\n", - "* Compute host: zither\n", - "* Working dir: /home/nsheff/code/geofetch/docs_jupyter\n", - "* Outfolder: red_algae/results_pipeline/SRX969074/\n", - "* Pipeline started at: (05-21 17:41:36) elapsed: 0.0 _TIME_\n", - "\n", - "### Version log:\n", - "\n", - "* Python version: 3.7.5\n", - "* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper`\n", - "* Pypiper version: 0.12.1\n", - "* Pipeline dir: `/home/nsheff/.local/bin`\n", - "* Pipeline version: None\n", - "\n", - "### Arguments passed to pipeline:\n", - "\n", - "* `bamfolder`: ``\n", - "* `config_file`: `sraconvert.yaml`\n", - "* `format`: `fastq`\n", - "* `fqfolder`: `red_algae/fastq`\n", - "* `keep_sra`: `True`\n", - "* `logdev`: `False`\n", - "* `mode`: 
`convert`\n", - "* `output_parent`: `red_algae/results_pipeline`\n", - "* `recover`: `False`\n", - "* `sample_name`: `['SRX969074']`\n", - "* `silent`: `False`\n", - "* `srafolder`: `/home/nsheff/ncbi/public/sra/`\n", - "* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930184.sra']`\n", - "* `verbosity`: `None`\n", - "\n", - "----------------------------------------\n", - "\n", - "Processing 1 of 1 files: SRR1930184\n", - "Target exists: `red_algae/fastq/SRR1930184_1.fastq.gz` \n", - "Already completed files: []\n", - "\n", - "### Pipeline completed. Epilogue\n", - "* Elapsed time (this run): 0:00:00\n", - "* Total elapsed time (all runs): 0:00:00\n", - "* Peak memory (this run): 0 GB\n", - "* Pipeline completed time: 2020-05-21 17:41:36\n", - "\u001b[36m## [3 of 4] sample: Cm_Darkness_Rep1; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep1.sub\n", - "Compute node: zither\n", - "Start time: 2020-05-21 17:41:36\n", - "Using outfolder: red_algae/results_pipeline/SRX969075\n", - "### Pipeline run code and environment:\n", - "\n", - "* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930185.sra --sample-name SRX969075 -O red_algae/results_pipeline --keep-sra`\n", - "* Compute host: zither\n", - "* Working dir: /home/nsheff/code/geofetch/docs_jupyter\n", - "* Outfolder: red_algae/results_pipeline/SRX969075/\n", - "* Pipeline started at: (05-21 17:41:36) elapsed: 0.0 _TIME_\n", - "\n", - "### Version log:\n", - "\n", - "* Python version: 3.7.5\n", - "* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper`\n", - "* Pypiper version: 0.12.1\n", - "* Pipeline dir: `/home/nsheff/.local/bin`\n", - "* Pipeline version: None\n", - "\n", - "### Arguments passed to pipeline:\n", - "\n", - "* `bamfolder`: ``\n", - "* `config_file`: `sraconvert.yaml`\n", 
- "* `format`: `fastq`\n", - "* `fqfolder`: `red_algae/fastq`\n", - "* `keep_sra`: `True`\n", - "* `logdev`: `False`\n", - "* `mode`: `convert`\n", - "* `output_parent`: `red_algae/results_pipeline`\n", - "* `recover`: `False`\n", - "* `sample_name`: `['SRX969075']`\n", - "* `silent`: `False`\n", - "* `srafolder`: `/home/nsheff/ncbi/public/sra/`\n", - "* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930185.sra']`\n", - "* `verbosity`: `None`\n", - "\n", - "----------------------------------------\n", - "\n", - "Processing 1 of 1 files: SRR1930185\n", - "Target to produce: `red_algae/fastq/SRR1930185_1.fastq.gz` \n", - "\n", - "> `fastq-dump /home/nsheff/ncbi/public/sra//SRR1930185.sra --split-files --gzip -O red_algae/fastq` (9607)\n", - "
\n", - "Read 1707508 spots for /home/nsheff/ncbi/public/sra//SRR1930185.sra\n", - "Written 1707508 spots for /home/nsheff/ncbi/public/sra//SRR1930185.sra\n", - "\n", - "Command completed. Elapsed time: 0:01:01. Running peak memory: 0.066GB. \n", - " PID: 9607;\tCommand: fastq-dump;\tReturn code: 0;\tMemory used: 0.066GB\n", - "\n", - "Already completed files: []\n", - "\n", - "### Pipeline completed. Epilogue\n", - "* Elapsed time (this run): 0:01:01\n", - "* Total elapsed time (all runs): 0:01:01\n", - "* Peak memory (this run): 0.0656 GB\n", - "* Pipeline completed time: 2020-05-21 17:42:37\n", - "\u001b[36m## [4 of 4] sample: Cm_Darkness_Rep2; pipeline: sra_convert\u001b[0m\n", - "Writing script to /home/nsheff/code/geofetch/docs_jupyter/red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub\n", - "Job script (n=1; 0.00Gb): red_algae/submission/sra_convert_Cm_Darkness_Rep2.sub\n", - "Compute node: zither\n", - "Start time: 2020-05-21 17:42:38\n", - "Using outfolder: red_algae/results_pipeline/SRX969076\n", - "### Pipeline run code and environment:\n", - "\n", - "* Command: `/home/nsheff/.local/bin/sraconvert --srr /home/nsheff/ncbi/public/sra//SRR1930186.sra --sample-name SRX969076 -O red_algae/results_pipeline --keep-sra`\n", - "* Compute host: zither\n", - "* Working dir: /home/nsheff/code/geofetch/docs_jupyter\n", - "* Outfolder: red_algae/results_pipeline/SRX969076/\n", - "* Pipeline started at: (05-21 17:42:38) elapsed: 0.0 _TIME_\n", - "\n", - "### Version log:\n", - "\n", - "* Python version: 3.7.5\n", - "* Pypiper dir: `/home/nsheff/.local/lib/python3.7/site-packages/pypiper`\n", - "* Pypiper version: 0.12.1\n", - "* Pipeline dir: `/home/nsheff/.local/bin`\n", - "* Pipeline version: None\n", - "\n", - "### Arguments passed to pipeline:\n", - "\n", - "* `bamfolder`: ``\n", - "* `config_file`: `sraconvert.yaml`\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* `format`: `fastq`\n", - "* `fqfolder`: `red_algae/fastq`\n", - "* 
`keep_sra`: `True`\n", - "* `logdev`: `False`\n", - "* `mode`: `convert`\n", - "* `output_parent`: `red_algae/results_pipeline`\n", - "* `recover`: `False`\n", - "* `sample_name`: `['SRX969076']`\n", - "* `silent`: `False`\n", - "* `srafolder`: `/home/nsheff/ncbi/public/sra/`\n", - "* `srr`: `['/home/nsheff/ncbi/public/sra//SRR1930186.sra']`\n", - "* `verbosity`: `None`\n", - "\n", - "----------------------------------------\n", - "\n", - "Processing 1 of 1 files: SRR1930186\n", - "Target to produce: `red_algae/fastq/SRR1930186_1.fastq.gz` \n", - "\n", - "> `fastq-dump /home/nsheff/ncbi/public/sra//SRR1930186.sra --split-files --gzip -O red_algae/fastq` (9780)\n", - "
\n", - "Read 1224029 spots for /home/nsheff/ncbi/public/sra//SRR1930186.sra\n", - "Written 1224029 spots for /home/nsheff/ncbi/public/sra//SRR1930186.sra\n", - "\n", - "Command completed. Elapsed time: 0:00:44. Running peak memory: 0.067GB. \n", - " PID: 9780;\tCommand: fastq-dump;\tReturn code: 0;\tMemory used: 0.067GB\n", - "\n", - "Already completed files: []\n", - "\n", - "### Pipeline completed. Epilogue\n", - "* Elapsed time (this run): 0:00:44\n", - "* Total elapsed time (all runs): 0:00:44\n", - "* Peak memory (this run): 0.0673 GB\n", - "* Pipeline completed time: 2020-05-21 17:43:22\n", - "\n", - "Looper finished\n", - "Samples valid for job generation: 4 of 4\n", - "Commands submitted: 4 of 4\n", - "Jobs submitted: 4\n", - "\u001b[0m" - ] - } - ], - "source": [ - "SRARAW=${HOME}/ncbi/public/sra/ SRAFQ=red_algae/fastq \\\n", - " looper run red_algae/red_algae_config.yaml -a sra_convert -p local \\\n", - " --command-extra=--keep-sra" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that's done, let's take a look in the `red_algae/fastq` folder (where we set the `$SRAFQ` variable)." - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[0m\u001b[01;31mSRR1930183_1.fastq.gz\u001b[0m \u001b[01;31mSRR1930184_2.fastq.gz\u001b[0m \u001b[01;31mSRR1930186_1.fastq.gz\u001b[0m\n", - "\u001b[01;31mSRR1930183_2.fastq.gz\u001b[0m \u001b[01;31mSRR1930185_1.fastq.gz\u001b[0m \u001b[01;31mSRR1930186_2.fastq.gz\u001b[0m\n", - "\u001b[01;31mSRR1930184_1.fastq.gz\u001b[0m \u001b[01;31mSRR1930185_2.fastq.gz\u001b[0m\n" - ] - } - ], - "source": [ - "ls red_algae/fastq" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "By default, the sra conversion script will delete the `.sra` files after they have been converted to fastq. 
You can keep them if you want by passing `--keep-sra`, which you can do by passing `--command-extra=--keep-sra` to your `looper run` command." - ] - }, { "cell_type": "markdown", "metadata": {}, diff --git a/geofetch/__init__.py b/geofetch/__init__.py index da89d27..5065195 100644 --- a/geofetch/__init__.py +++ b/geofetch/__init__.py @@ -9,4 +9,9 @@ __author__ = ["Oleksandr Khoroshevskyi", "Vince Reuter", "Nathan Sheffield"] __all__ = ["Finder", "Geofetcher"] -logmuse.init_logger("geofetch") +_LOGGER = logmuse.init_logger("geofetch") +coloredlogs.install( + logger=_LOGGER, + datefmt="%H:%M:%S", + fmt="[%(levelname)s] [%(asctime)s] %(message)s", +) diff --git a/geofetch/_version.py b/geofetch/_version.py index 8e1395b..6dd4954 100644 --- a/geofetch/_version.py +++ b/geofetch/_version.py @@ -1 +1 @@ -__version__ = "0.12.3" +__version__ = "0.12.4" diff --git a/geofetch/geofetch.py b/geofetch/geofetch.py index 1a67e0f..4703686 100755 --- a/geofetch/geofetch.py +++ b/geofetch/geofetch.py @@ -1435,7 +1435,7 @@ def _sra_to_bam_conversion_fastq_damp( # check to make sure it worked cmd = ( - "fastq-dump --split-3 -O " + "fasterq-dump --split-3 -O " + os.path.realpath(self.sra_folder) + " " + os.path.join(self.sra_folder, run_name + ".sra") diff --git a/geofetch/looper_sra_convert.yaml b/geofetch/looper_sra_convert.yaml index bf5905d..94525f1 100644 --- a/geofetch/looper_sra_convert.yaml +++ b/geofetch/looper_sra_convert.yaml @@ -4,10 +4,10 @@ derive: attributes: [read1, read2, SRR_files] sources: - SRA: "${SRABAM}/{SRR}.bam" - FQ: "${SRAFQ}/{SRR}.fastq.gz" - FQ1: "${SRAFQ}/{SRR}_1.fastq.gz" - FQ2: "${SRAFQ}/{SRR}_2.fastq.gz" + SRA: "${SRABAM}/{srr}.bam" + FQ: "${SRAFQ}/{srr}.fastq.gz" + FQ1: "${SRAFQ}/{srr}_1.fastq.gz" + FQ2: "${SRAFQ}/{srr}_2.fastq.gz" imply: - if: organism: "Mus musculus" @@ -39,7 +39,7 @@ project_modifiers: derive: attributes: [read1, read2, SRR_files] sources: - SRA: "${SRARAW}/{SRR}.sra" - FQ: "${SRAFQ}/{SRR}.fastq.gz" - FQ1: 
"${SRAFQ}/{SRR}_1.fastq.gz" - FQ2: "${SRAFQ}/{SRR}_2.fastq.gz" + SRA: "${SRARAW}/{srr}/{srr}.sra" + FQ: "${SRAFQ}/{srr}.fastq.gz" + FQ1: "${SRAFQ}/{srr}_1.fastq.gz" + FQ2: "${SRAFQ}/{srr}_2.fastq.gz" diff --git a/geofetch/sraconvert.py b/geofetch/sraconvert.py index a320c02..d524895 100755 --- a/geofetch/sraconvert.py +++ b/geofetch/sraconvert.py @@ -143,7 +143,7 @@ def main(): # fastq-dump --split-files will produce *_1.fastq and *_2.fastq # for paired-end data, and only *_1.fastq for single-end data. outfile = "{fq_prefix}_1.fastq.gz".format(fq_prefix=fq_prefix) - cmd = "fastq-dump {data_source} --split-files --gzip -O {outfolder}".format( + cmd = "fasterq-dump {data_source} -O {outfolder}".format( data_source=infile, outfolder=args.fqfolder, nofail=True ) elif args.format == "bam": diff --git a/mkdocs.yml b/mkdocs.yml index e3a6c2f..e8bc1b9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -18,6 +18,7 @@ nav: - How-to Guides: - Specifying samples to download: file-specification.md - Set SRA data download location: howto-location.md + - Run SRA convert: how_to_convert_fastq_from_sra.md - Reference: - Metadata output: metadata_output.md - Usage: usage.md