Expanse str gwas update #52

Open. Wants to merge 10 commits into base: master.
doc/Expanse.rst: 25 changes (20 additions, 5 deletions)
on the login nodes, which will cause weird error messages if you try to use them

First: :code:`module load slurm`. I like to put this in my :code:`.bashrc`

.. _getting_an_interactive_node_on_expanse:

Grabbing an interactive node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   srun --partition=ind-shared --pty --nodes=1 --ntasks-per-node=1 --mem=50G -t 24:00:00 --wait=0 --account=ddp268 /bin/bash

``--pty`` is what specifically makes this an interactive session.

Running a script noninteractively with SLURM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First add special :code:`SBATCH` flags to the script

.. code-block:: bash
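
   #!/bin/bash
   # An illustrative sketch of an SBATCH header; the partition, account,
   # memory, and time values below are assumptions borrowed from the
   # interactive srun example above. Adjust them to your allocation.
   #SBATCH --partition=ind-shared
   #SBATCH --account=ddp268
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --mem=50G
   #SBATCH -t 24:00:00
   #SBATCH --output=slurm_output.log

   # ... the commands the script should run go here ...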

Some notes:
environment variables to set such values, you must pass them to the :code:`sbatch` command directly
(e.g. :code:`sbatch --output=$SOMEWHERE/out slurm_script.sh`)

.. _increasing_job_runtime_up_to_one_week:

Increasing job runtime up to one week
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should be able to submit jobs/grab interactive nodes for up to two days. If you want to be able to do so for up to a week:

* first ask the Expanse team for :code:`ind-shared-oneweek` permissions
* then add :code:`--qos ind-shared-oneweek` to your interactive node/noninteractive job submission and increase the time you're requesting for that node/job.
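
For example, the interactive :code:`srun` command shown in :ref:`getting_an_interactive_node_on_expanse` would become something like this (a sketch; the week-long time value is illustrative):

.. code-block:: bash

   srun --partition=ind-shared --qos ind-shared-oneweek --pty --nodes=1 --ntasks-per-node=1 --mem=50G -t 7-00:00:00 --wait=0 --account=ddp268 /bin/bash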

Managing noninteractive jobs
----------------------------

* :code:`squeue -u <user>` - look at your jobs
* :code:`-p <partition>` - look at a specific partition
To make singularity work, I add the following to my :code:`.bashrc`:
       export SINGULARITY_TMPDIR="/scratch/$USER/job_$SLURM_JOB_ID"
   fi

If you want to run inside a singularity image, first grab an interactive node (or put this in a script that you submit) and then:

.. code-block:: bash
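
   # A sketch: run a command inside a singularity container.
   # The image URL and script name here are assumptions; substitute your own.
   singularity exec docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.3 \
       bash your_script.sh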

doc/UKB_Expanse_STR_GWAS.rst: 112 changes (49 additions, 63 deletions)
UKB Expanse SNP/indel/STR GWAS and fine-mapping
===============================================

This guide will show you how to run a GWAS against a UK Biobank phenotype on Expanse.
The GWAS will include SNPs, indels, and STRs. This uses the WDL and scripts pipeline
written for the UKB blood-traits imputed STRs paper.

Setting up the GWAS and WDL inputs
----------------------------------

First, check out `my paper's repository <https://github.com/LiterallyUniqueLogin/ukbiobank_strs>`_ into some directory you manage.

Then, choose a phenotype you want to perform the GWAS against.
You can explore UKB phenotypes `here <https://biobank.ndph.ox.ac.uk/showcase/index.cgi>`__.
You'll need the data field ID of the phenotype, and the data field IDs of any fields
you wish to use as categorical covariates.
Caveats:
(otherwise age calculations will be thrown off or may crash)
* Currently, only categorical covariates are supported

Create a json file for input, setting :code:`script_dir` to the root of the git repo you checked out above, and all the other fields as appropriate:

.. code-block:: json

   {
       "expanse_gwas.script_dir": "repo_source",
       "expanse_gwas.phenotype_name": "your_phenotype_name",
       "expanse_gwas.phenotype_id": "its_data_field_ID",
       "expanse_gwas.categorical_covariate_names": ["a_list_of_categorical_covariates"],
       "expanse_gwas.categorical_covariate_ids": ["their_IDs"]
   }

Create a json options file specifying where you want your output to be written:

.. code-block:: json

   {
       "final_workflow_outputs_dir": "your_output_directory"
   }



Running the GWAS
----------------

Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Singularity.
The two docker containers you'll want to cache with Singularity prior to your run are
:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` and
:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.4`.
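
One way to cache them ahead of time is with :code:`singularity pull` (a sketch, assuming :code:`singularity` is available in your session):

.. code-block:: bash

   singularity pull docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.3
   singularity pull docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.4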

The WDL workflow file you'll point Cromwell to is in the repo you checked out, at :code:`workflow/expanse_wdl/gwas.wdl`. Run
Cromwell as usual, following the standard Cromwell instructions, with your input file and that workflow.
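
That invocation generally looks something like the following (a sketch of a standard Cromwell run; the placeholder paths are yours to fill in):

.. code-block:: bash

   java \
       -Dconfig.file=<path_to_your_cromwell.conf_file> \
       -jar <path_to_the_cromwell_jar_you_downloaded> \
       run \
       -i <path_to_your_input_file> \
       -o <path_to_your_options_file> \
       <repo_root>/workflow/expanse_wdl/gwas.wdl \
       > <path_to_your_output.log>

You can then follow along in another window with :code:`tail -f <path_to_your_output.log>`.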

If you want to run the GWAS only in a specific subpopulation of the UKB, see :ref:`running_on_a_subpopulation` below.

What the GWAS does
------------------

The full details are in the methods of the paper `here <https://www.biorxiv.org/content/10.1101/2022.08.01.502370v3>`_. In short, this pipeline:

* Gets a sample list of QCed, unrelated white brits that has the specified phenotype and each specified covariate
* Includes age at time of measurement, genetic PCs and sex as additional covariates.
* Rank quantile normalizes the phenotype (this in theory gives better power when working with non-normal phenotype data,
  but makes output effect sizes and standard errors uninterpretable and incomparable with those from other runs or studies)
* Performs a GWAS for each imputed SNP, indel and STR of the transformed phenotype against the genotype of that variant
and all the covariates.
* Calculates the peaks of the signals across the genome.
* Gets a sample list of participants in a similar manner as white brits, but for the five ethnicities:
[black, south_asian, chinese, irish, white_other]
* Runs the GWAS for STRs in those populations *only* on the regions containing a variant with p<5e-8 in the White Brits.

Output file names
-----------------

These will all be located in :code:`your_output_dir/expanse_gwas`.

Final outputs:

* PLINK GWAS output for imputed SNPs and indels in white_brits :code:`white_brits_snp_gwas.tab`
* associaTR GWAS output for imputed STRs in white_brits :code:`white_brits_str_gwas.tab`
* associaTR GWAS output for imputed STRs in the other ethnicities :code:`<ethnicity>_str_gwas.tab`
* List of GWAS peaks across all variant types, each at least 250kb apart, in white brits: :code:`peaks.tab`
* List of regions for followup fine-mapping across all variant types, in white brits: :code:`finemapping_regions.tab`

Intermediate outputs potentially useful for debugging:

* Lists of all the participants used in the GWAS after all subsetting, entitled :code:`<ethnicity>.samples`
* The shared covars array :code:`shared_covars.npy` and their names :code:`covar_names.txt`
* The (original) untransformed phenotype data, deposited for your reference, :code:`<ethnicity>_original_pheno.npy`
* The transformed phenotype data used in the regression plus all the covariates you specified :code:`<ethnicity>_pheno.npy`, as well as the names of those covariates :code:`<ethnicity>_pheno_covar_names.txt`

.. _running_on_a_subpopulation:

Running on a subpopulation
--------------------------
of sample IDs into a file, one per line, with the first line having the header '

.. code-block:: json

"expanse_gwas.subpop_sample_list": "your_sample_file"
"gwas.subpop_sample_list": "your_sample_file"

to the json input file.

This subpopulation file must contain all samples of all ethnicities that you want included
(i.e. any samples not included will be omitted).
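
A minimal sketch of building such a file by hand (the :code:`ID` header and the sample IDs here are illustrative assumptions; use real UKB sample IDs):

.. code-block:: bash

   # one header line, then one sample ID per line (IDs below are fabricated)
   printf 'ID\n1000001\n1000002\n1000003\n' > my_subpop.sample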

Note that providing this file doesn't change the pipeline's workflow:

* Samples that fail QC will still be removed.
* Analyses will still be split per ethnicity.
* Each ethnicity's sample list will still be shrunk to remove related participants.
* You should include some samples from each ethnicity or the workflow will probably fail;
  you'll still likely get GWAS results from the ethnicities you included, but you'll have to dig for those
  instead of getting them put into the output location you asked for.

You may find the files at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/sample_qc/runs/<ethnicity>/no_phenotype/combined.sample`
helpful for building your subpopulation - that location contains the QCed (but not yet unrelated) samples for the six ethnicities used in the imputed UKB STRs paper.
Running fine-mapping
--------------------

TODO