Expanse str gwas update #52

Open. Wants to merge 10 commits into base: master.
doc/Expanse.rst: 25 changes (20 additions, 5 deletions)
on the login nodes, which will cause weird error messages if you try to use them

First: :code:`module load slurm`. I like to put this in my :code:`.bashrc`

.. _getting_an_interactive_node_on_expanse:

Grabbing an interactive node
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   srun --partition=ind-shared --pty --nodes=1 --ntasks-per-node=1 --mem=50G -t 24:00:00 --wait=0 --account=ddp268 /bin/bash

``--pty`` is what specifically makes this an interactive session.

Running a script noninteractively with SLURM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First add special :code:`SBATCH` flags to the script

.. code-block:: bash
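
   #!/bin/bash
   # An illustrative sketch of an SBATCH header; the partition, account,
   # memory, and time values below are assumptions borrowed from the
   # interactive srun example above. Adjust them to your allocation.
   #SBATCH --partition=ind-shared
   #SBATCH --account=ddp268
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --mem=50G
   #SBATCH -t 24:00:00
   #SBATCH --output=slurm_output.log

   # ... the commands the script should run go here ...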

Some notes:
environment variables to set such values, you must pass them to the :code:`sbatch` command directly
(e.g. :code:`sbatch --output=$SOMEWHERE/out slurm_script.sh`)

.. _increasing_job_runtime_up_to_one_week:

Increasing job runtime up to one week
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should be able to submit jobs/grab interactive nodes for up to two days. If you want to be able to do so for up to a week:

* first ask the Expanse team for :code:`ind-shared-oneweek` permissions
* then add :code:`--qos ind-shared-oneweek` to your interactive node/noninteractive job submission and increase the time you're requesting for that node/job.
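
For example, the interactive :code:`srun` command shown in :ref:`getting_an_interactive_node_on_expanse` would become something like this (a sketch; the week-long time value is illustrative):

.. code-block:: bash

   srun --partition=ind-shared --qos ind-shared-oneweek --pty --nodes=1 --ntasks-per-node=1 --mem=50G -t 7-00:00:00 --wait=0 --account=ddp268 /bin/bash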

Managing noninteractive jobs
----------------------------

* :code:`squeue -u <user>` - look at your jobs
* :code:`-p <partition>` - look at a specific partition
To make singularity work, I add the following to my :code:`.bashrc`:
       export SINGULARITY_TMPDIR="/scratch/$USER/job_$SLURM_JOB_ID"
   fi

If you want to run inside a singularity image, first grab an interactive node (or put this in a script that you submit) and then:

.. code-block:: bash
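
   # A sketch: run a command inside a singularity container.
   # The image URL and script name here are assumptions; substitute your own.
   singularity exec docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.3 \
       bash your_script.sh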

doc/UKB_Expanse_STR_GWAS.rst: 112 changes (49 additions, 63 deletions)
UKB Expanse SNP/indel/STR GWAS and fine-mapping
===============================================

This guide will show you how to run a GWAS against a UK Biobank phenotype on Expanse.
The GWAS will include SNPs, indels, and STRs. This uses the WDL and scripts pipeline
written for the UKB blood-traits imputed STRs paper.

Setting up the GWAS and WDL inputs
----------------------------------

First, check out `my paper's repository <https://github.com/LiterallyUniqueLogin/ukbiobank_strs>`_ into some directory you manage.

Then, choose a phenotype you want to perform the GWAS against.
You can explore UKB phenotypes `here <https://biobank.ndph.ox.ac.uk/showcase/index.cgi>`__.
You'll need the data field ID of the phenotype, and the data field IDs of any fields
you wish to use as categorical covariates.
Caveats:
(otherwise age calculations will be thrown off or may crash)
* Currently, only categorical covariates are supported

Create a json file for input, setting :code:`script_dir` to the root of the git repo you checked out above, and all the other fields as appropriate:

.. code-block:: json

   {
       "expanse_gwas.script_dir": "repo_source",
       "expanse_gwas.phenotype_name": "your_phenotype_name",
       "expanse_gwas.phenotype_id": "its_data_field_ID",
       "expanse_gwas.categorical_covariate_names": ["a_list_of_categorical_covariates"],
       "expanse_gwas.categorical_covariate_ids": ["their_IDs"]
   }

Create a json options file specifying where you want your output to be written:

.. code-block:: json

   {
       "final_workflow_outputs_dir": "your_output_directory"
   }



Running the GWAS
----------------

Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Singularity.
The two docker containers you'll want to cache with Singularity prior to your run are
:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` and
:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.4`.
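
One way to cache them ahead of time is with :code:`singularity pull` (a sketch, assuming :code:`singularity` is available in your session):

.. code-block:: bash

   singularity pull docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.3
   singularity pull docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.4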

The WDL workflow file you'll point Cromwell to is in the repo you checked out, at :code:`workflow/expanse_wdl/gwas.wdl`. Run
Cromwell as usual, following the standard Cromwell instructions, with your input file and that workflow.
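
That invocation generally looks something like the following (a sketch of a standard Cromwell run; the placeholder paths are yours to fill in):

.. code-block:: bash

   java \
       -Dconfig.file=<path_to_your_cromwell.conf_file> \
       -jar <path_to_the_cromwell_jar_you_downloaded> \
       run \
       -i <path_to_your_input_file> \
       -o <path_to_your_options_file> \
       <repo_root>/workflow/expanse_wdl/gwas.wdl \
       > <path_to_your_output.log>

You can then follow along in another window with :code:`tail -f <path_to_your_output.log>`.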

If you want to run the GWAS only in a specific subpopulation of the UKB, see :ref:`running_on_a_subpopulation` below.

What the GWAS does
------------------

The full details are in the methods of the paper `here <https://www.biorxiv.org/content/10.1101/2022.08.01.502370v3>`_. In short, this pipeline:

* Gets a sample list of QCed, unrelated white brits that has the specified phenotype and each specified covariate
* Includes age at time of measurement, genetic PCs and sex as additional covariates.
* Rank quantile normalizes the phenotype (this in theory gives better power when working with non-normal phenotype data,
  but makes output effect sizes and standard errors uninterpretable and incomparable with those from other runs or studies)
* Performs a GWAS for each imputed SNP, indel and STR of the transformed phenotype against the genotype of that variant
and all the covariates.
* Calculates the peaks of the signals across the genome.
* Gets a sample list of participants in a similar manner as white brits, but for the five ethnicities:
[black, south_asian, chinese, irish, white_other]
* Runs the GWAS for STRs in those populations *only* on the regions containing a variant with p<5e-8 in the White Brits.

Output file names
-----------------

These will all be located in :code:`your_output_dir/expanse_gwas`.

Final outputs:

* PLINK GWAS output for imputed SNPs and indels in white_brits :code:`white_brits_snp_gwas.tab`
* associaTR GWAS output for imputed STRs in white_brits :code:`white_brits_str_gwas.tab`
* associaTR GWAS output for imputed STRs in the other ethnicities :code:`<ethnicity>_str_gwas.tab`
* List of GWAS peaks across all variant types, each at least 250kb apart, in white brits: :code:`peaks.tab`
* List of regions for followup fine-mapping across all variant types, in white brits: :code:`finemapping_regions.tab`

Intermediate outputs potentially useful for debugging:

* Lists of all the participants used in the GWAS after all subsetting, entitled :code:`<ethnicity>.samples`
* The shared covars array :code:`shared_covars.npy` and their names :code:`covar_names.txt`
* The (original) untransformed phenotype data, deposited for your reference, :code:`<ethnicity>_original_pheno.npy`
* The transformed phenotype data used in the regression plus all the covariates you specified :code:`<ethnicity>_pheno.npy`, as well as the names of those covariates :code:`<ethnicity>_pheno_covar_names.txt`

.. _running_on_a_subpopulation:

Running on a subpopulation
--------------------------
of sample IDs into a file, one per line, with the first line having the header '

.. code-block:: json

"expanse_gwas.subpop_sample_list": "your_sample_file"
"gwas.subpop_sample_list": "your_sample_file"

to the json input file.

This subpopulation file must contain all samples of all ethnicities that you want included
(i.e. any samples not included will be omitted).
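
A minimal sketch of building such a file by hand (the :code:`ID` header and the sample IDs here are illustrative assumptions; use real UKB sample IDs):

.. code-block:: bash

   # one header line, then one sample ID per line (IDs below are fabricated)
   printf 'ID\n1000001\n1000002\n1000003\n' > my_subpop.sample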

Note that providing this file doesn't change the pipeline's workflow:

* Samples that fail QC will still be removed.
* Analyses will still be split per ethnicity.
* Each ethnicity's sample list will still be shrunk to remove related participants.
* You should include some samples from each ethnicity or the workflow will probably fail;
  you'll still likely get GWAS results from the ethnicities you included, but you'll have to dig for those
  instead of getting them put into the output location you asked for.

You may find the files at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/sample_qc/runs/<ethnicity>/no_phenotype/combined.sample`
helpful for building your subpopulation - that location contains the QCed (but not yet unrelated) samples for the six ethnicities used in the imputed UKB STRs paper.
Running fine-mapping
--------------------

TODO