From 1ec4d7660c2245c2d2d391b3b0540e6be1a4ce7d Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Mon, 4 Dec 2023 16:12:31 -0800 Subject: [PATCH 01/10] Working on cleaning up the docs --- doc/Expanse.rst | 23 ++- doc/UKB_Expanse_STR_GWAS.rst | 9 +- doc/WDL.rst | 373 ++++++++++++++++++----------------- 3 files changed, 220 insertions(+), 185 deletions(-) diff --git a/doc/Expanse.rst b/doc/Expanse.rst index c5cef17..d2d928a 100644 --- a/doc/Expanse.rst +++ b/doc/Expanse.rst @@ -34,7 +34,10 @@ on the login nodes, which will cause weird error messages if you try to use them First: :code:`module load slurm`. I like to put this in my `.bashrc` -Grabbing an interactive job: +.. _getting_an_interactive_node_on_expanse: + +Grabbing an interactive node +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash @@ -42,7 +45,10 @@ Grabbing an interactive job: ``--pty`` is what specifically makes this treated as an interactive session -To run a script noninteractively with SLURM: first add special :code:`SBATCH` flags to the script +Running a script noninteractively with SLURM +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +First add special :code:`SBATCH` flags to the script .. code-block:: bash @@ -90,8 +96,17 @@ Some notes: environment variables to set such values, you must pass them to the :code:`sbatch` command directly (e.g. :code:`sbatch --output=$SOMEWHERE/out slurm_script.sh`) -Managing jobs -------------- +.. _increasing_job_runtime_up_to_one_week: + +Increasing job runtime up to one week +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You should be able to submit jobs/grab interactive nodes for up to two days. If you want to be able to do so for up to a week, +* first ask the Expanse team for :code:`ind-shared-oneweek` permissions +* then add :code:`--qos ind-shared-oneweek` to your interactive node/noninteractive job submission and increase the time you're requesting for that node/job. + +Managing noninteractive jobs +---------------------------- * :code:`squeue -u ` - look at your jobs * :code:`-p ` - look at a specific partition diff --git a/doc/UKB_Expanse_STR_GWAS.rst b/doc/UKB_Expanse_STR_GWAS.rst index 267ad88..2ab7656 100644 --- a/doc/UKB_Expanse_STR_GWAS.rst +++ b/doc/UKB_Expanse_STR_GWAS.rst @@ -5,6 +5,9 @@ This guide will show you how to run a GWAS against a UK Biobank phenotype on Exp The GWAS will include both SNPs, indels and STRs. This uses the WDL and scripts pipeline written for the UKB blood-traits imputed STRs paper. +Setting up the GWAS and WDL inputs +---------------------------------- + First, choose a phenotype you want to perform the GWAS against. You can explore UKB phenotypes `here `__. You'll need the data field ID of the phenotype, and the data field IDs of any fields @@ -37,7 +40,11 @@ Create a json options file specifying where you want your output to be written: } -Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`. +Running the GWAS +---------------- + +Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Singularity. 
+The docker container you'll want to cache with Singularity is :code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` In the cromwell.conf file you create, add this: diff --git a/doc/WDL.rst b/doc/WDL.rst index f2c8ff8..4b9cc57 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -1,155 +1,77 @@ WDL === -Last update: 2023/01/25 +Last update: 2023/12/04 -You'll want to read the "Constraints on how you write your WDL" sections -for each runtime environment you plan to execute your WDL in before -beginning to write your WDL +WDL is a configuration language. You use an executor to run it. If you're using Expanse or All of Us, +you'll use Cromwell as the executor. If you're using DNANexus, you'll use dxCompiler. -* `WDL 1.0 spec `_ - (it's quite readable!) -* `differences between WDL versions `_ - -WDL by itself is just configuration, it needs an executor to run it. If you're using Expanse or All of Us, -you'll use Cromwell. If you're using DNANexus, you'll use dxCompiler. You'll need to understand -the ins and outs of those in addition to how to write WDL. - -TODO intro to writing WDL - all the below assumes you can write basic WDL and is about running it -or making it runnable on certain platforms +The first sections below are about getting set up with those executors on the platform you're using. +You should read those whether you're planning on writing your own WDL or running someone else's. -Containers ----------- -You'll likely want to specify a container with the :code:`docker` runtime flag as it's -necessary to execute your WDL on cloud platforms. (Cromwell doesn't support the -equivalent :code:`container` flag). - -Constraints imposed by runtime environments: - -* If running All of Us, seems like you'll need to host on Google Container Registry? (not tested) -* If running with Cromwell on Expanse, will need to either store the image locally, or host - on one of the following supported environments: quay.io, dockerhub, google container registry (GCR) - or google artifact registry (GAR). I'm not sure storing locally will work though, - as I'm not sure you can get call caching to work with that - haven't tried. -* No constraints for UKB RAP as far as I know - you can upload the docker container to DNA Nexus, - or pull from an cloud container registry. - -quay.io is my cloud container registry of choice. Terminology: - -* quay.io - Red Hat's cloud container registry -* Red Hat Quay - Red Hat's private deployment container registry service -* Project quay - an open source version of Red Hat Quay where you can - deploy and stand up your own private container registry - -It's my container registry of choice because it has free accounts -(though this isn't super clear from their pricing docs), doesn't charge -for public containers, and because at least -so far I haven't found any pull restrictions. If you do run into issues, -I'd recommend moving to GCR. Yang has tried Dockerhub, but that has really -restrictive pull limits if you're using the free account. The paid account -isn't such an issue (only $7/mo.) but Yang couldn't figure out how to get -the authentication to work on UKB RAP so that you could log in from each task -before pulling the docker container so as to circumvent the pull limit. - -Repositories in quay.io start as private, even on the free account -which in theory hasn't paid for private repos (not sure why?). -After pushing to them for the first time, -sign into the web interface, select the repo, click on the wheel icon -on the left (settings) and click Make Public. 
- -To push to quay.io after building your docker image, do - -.. code-block:: bash - - docker login --username quay.io - docker tag : quay.io//: - docker push quay.io//: - -depending on how you configured docker, you may need to run those commands with sudo. - -Tips on building a container with conda -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -* Use :code:`continuumio/miniconda3` as the base container. -* Put :code:`RUN conda init --system bash` in your Dockerfile -* See the section about conda and dxCompiler below to get - a script for activating conda. Then either configure that to run - automatically with the Dockerfile commands ENTRYPOINT - or SHELL if you're running the container with run or shell, or make sure - to call that script manually as part of the container exec invocation. +After that there's some information if you're planning to learn to write your own WDL pipelines. +Note that each executor has different constraints on the WDL you write, so if you're writing your own WDL, +first figure out what platforms you want it to run on and then read the "Constraints" sections +for those executors/platforms before beginning to write your WDL. .. _WDL_with_Cromwell_on_Expanse: WDL with Cromwell on Expanse (or other clusters) ------------------------------------------------ -Constraints on how you write your WDL -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Constraints (important if you're writing your own WDL) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Cromwell only supports WDL 1.0, not 1.1 or development (2.0) -Running -^^^^^^^ +Getting Cromwell +^^^^^^^^^^^^^^^^ -Requires Java. Download the JAR file from `here `__. -(Can ignore womtool) +First, install java. -Cromwell will require configuration before working well (see below), but just as an intro: +Jonathan compiled Cromwell from source with two changes to make it run better on Expanse. You can access that JAR +at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/utilities/cromwell-86-90af36d-SNAP-600ReadWaitTimeout-FingerprintNoTimestamp.jar`. -#. Grab an interactive node. -#. Copy the config below to a location you want and modify it as necessary, and get your WDL workflow. -#. Stand up the MySQL server (see below) -#. Have singularity cache whatever containers you plan to use (see the Singularity section of the Expanse notes) -#. To run the WDL in cromwell on the interactive node, run the command :code:`java -Dconfig.file= -jar /cromwell-.jar run ` -#. If you want to submit jobs for each task instead of running them directly on the interactive node, - change which :code:`backend.default` is commented out in the config. +Alternatively, you can download Cromwell's JAR official file from `here `__. You can +ignore the womtool JAR at that location. -Either way, you need to be running Cromwell on an interactive node. -If you want more logs from cromwell for debugging purposes (you don't), prepend the flag :code:`-DLOG_LEVEL=DEBUG` - -Cromwell has a server mode where you stand it up and can inspect running jobs through a web interface. I haven't -learned how to use that, so I'm not documenting it here. +Running +^^^^^^^ -Inputs and outputs -^^^^^^^^^^^^^^^^^^ +Once you have the WDL pipeline you want to run, here are the steps for running it with Cromwell: -If you're using a container then your inputs cannot contain symlinks. +#. See :ref:`_cromwell_configuration` below for setting up your cromwell configuration. +#. 
If you're running with Docker containers, see :ref:`_Using_Singularity_to_run_Docker_containers` for setting up your :code:`.bashrc` file to make singularity work on Expanse, + and then cache the singularity images before you start your job, also documented there. +#. Start by :ref:`_getting_an_interactive_node_on_Expanse`. That should last for as long as the entire pipeline you are running with WDL. + Depending on how long it will take, consider :ref:`_increasing_job_runtime_up_to_one_week`. +#. If you want to enable call-caching, stand up the MySQL server on the interactive node (see below) +#. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json ` + to run the WDL using Cromwell. Feel free to omit the input and options flags if you're not using them. -* If they do, you'll get something like file does not exist errors. -* It's possible symlinks to files underneath the root of the run will work, but not to files outside of the run root. I'd just avoid. -* Instead of symlinks, use hardlinks. +Note: Cromwell has a server mode where you stand it up and can inspect running jobs through a web interface. As I (Jonathan) haven't +learned how to use that, so I'm not documenting it here. -Cromwell will dump its outputs to :code:`cromwell-executions///call-/execution` -That folder can also be used to inspect the stdout and stderr of that task for debugging. -Worfklow run ids are unhelpful randomly generated strings. To figure out which belongs to your -most recent run, you can look at the logs on the terminal for that run, or use -:code:`ls -t` to sort them by recency, e.g. :code:`cd cromwell-executions/ | ls -t | head -1`. -To check a task's inputs, looks at :code:`cromwell-executions///call-/inputs//` -If you use subworkflows in your WDL then those workflows will be represented by nested folders between -the base workflow and the end task leaf. If your task has multiple inputs, then you'll have to look -at all the input folders with arbitrary numbers to determine which is the input you're looking for. -If you move task outputs from those folders they will no longer be available for call caching (see below), -so don't do that. I would instead hard link or copy them if you want the output in a more memorable location. +Cromwell -Cromwell's outputs will keep growing as you keep running it if you don't delete them. And due to randomized workflow run IDs it'll be very -hard to track which workflows have results important to caching and which errored out or are no longer needed. -No clue how to make managing that easier. +.. _cromwell_configuration: Configuration ^^^^^^^^^^^^^ -I recommend you make a copy of my config `here `. +I (Jonathan) recommend you make a copy of my config `here `. Another reference is the `example config `_ -from Cromwell's docs, but it doesn't explain everything or have every option +from Cromwell's docs, but it doesn't explain everything or have every option you might want. After copying my config, you will need to: * swap my email address for yours -* Either set up call caching below, or call-caching.enabled = False - If you disable it, then every time you run a job it will be run again from scratch -* When running jobs, if you want to run them all on the local node, change - :code:`default = "SLRUM"` to :code:`default = "Local"` +* Either set up :ref:`_call_caching_with_Cromwell`, or set :code:`call-caching.enabled = False`. 
+ If you disable it, then every time you run a job it will be run again from the beginning instead of reusing intermediate results that finished successfully. +* When running jobs, if you want to run them all on the cluster, make sure under backend that :code:`default = "SLURM"`. If you only have a small number of jobs and + you'd rather run them on your local node for debugging purposes or because the Expanse queue is backed up right now, instead change that to :code:`default = "Local"` -Note that +If you want to understand the config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: text @@ -169,54 +91,57 @@ is equivalent to :code:`foo.bar.baz = "bop"` are ignored. Runtime attributes with :code:`?` or that have defaults :code:`= ` are optional, runtime attributes that are just declared (e.g. :code:`String dx_timeout`) are required. +.. _call_caching_with_cromwell: + Call caching with Cromwell ^^^^^^^^^^^^^^^^^^^^^^^^^^ -Call caching allows you to reuse results of an old call in place of rerunning it if they have -the same inputs. This is generally necessary for developing most large workflows. (In general -these tasks may have different runtime-attributes and still be equivalent for call-caching, -docker is the main exception, see below) +Call caching allows you to reuse results of a successful previous run of a WDL task in place of rerunning that task. +Note that the task being reused must have had the exact same inputs and docker file as the task being replaced. -You need to configure Cromwell with a database to store the cache results. While its unpleasantly complex, -I'd if you want call caching I'd recommend the MySQL database as the others do not function well. -This requires a running MySQL server. +Call caching is generally helpful for large workflows where you might find an error halfway through your workflow run +and want to restart the workflow without having to rerun everything from the beginning. Unfortunately, this requires configuring Cromwell with a database to store the cache results +which is unpleasantly complex, as it requires running a MySQL server. -First, make sure you've set up your :code:`.bashrc` to handle :ref:`Using_Singularity_to_run_Docker_containers` - -Then, from the node which you plan to execute cromwell from, run: +To enable call caching, you will need to do the following once: +* make sure you've set up your :code:`.bashrc` to handle :ref:`Using_Singularity_to_run_Docker_containers` +* :code:`cd` into the directory you want to launch cromwell from and make the following directories: .. code-block:: bash - singularity run --containall --env MYSQL_ROOT_PASSWORD=pass --bind :/var/lib/mysql --bind :/var/run/mysqld docker://mysql > 2>&1 & + mkdir -p cromwell-executions/mysql_var_run_mysqld + mkdir -p cromwell-executions/mysqldb -This uses the default mysql docker continaer from DockerHub to start a mysql server. Here :code:`` should -be an absolute path to the directory where you want to store the MySQL database, :code:`` should be an absolute -path to a directory where MySQL can store some working files (I have it as be a sibling directory to :code:``), and :code:`` -should be a path to a file where you want MySQL to write its log for the current session (for debugging if necessary). -So, for example +Then, each time you want to run Cromwell, after logging in to the interactive node but before running Cromwell, run .. 
code-block:: bash singularity run --containall --env MYSQL_ROOT_PASSWORD=pass --bind ${PWD}/cromwell-executions/mysqldb:/var/lib/mysql --bind ${PWD}/cromwell-executions/mysql_var_run_mysqld:/var/run/mysqld docker://mysql > cromwell-executions/mysql.run.log 2>&1 & -To take down the MySQL server, just kill the process from that command. - -Note: I've configured the MySQL database with a dummy user and password (user = root, password = pass) -which is not secure. I'm just assuming the Expanse nodes are secure enough already and no one -malicious is on them. Also, this uses the default MySQL port (3306). You may need to change that -if someone's already taken that port. +This starts a MySQL server running on the interactive node by using singularity to run the the default MySQL docker. +This command stores the MySQL log at :code:`cromwell-executions/mysql.run.log` if you need it for debuging. -The first time you stand up the mysql database with those paths, you'll need to run the following: +The first time you stand up MySQL, you'll need to run the following: .. code-block:: bash # start an interactive my sql session - mysql -h localhost -P --protocol tcp -u root -ppass cromwell + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell # from within the mysql prompt create database cromwell; exit; You should now (finally!) be good to go with call caching. +Debugging MySQL issues +~~~~~~~~~~~~~~~~~~~~~~ + +To take down the MySQL server, just kill the process spawned by that command. + +Note: I've configured the MySQL database with a dummy user and password (user = root, password = pass) +which is not secure. I'm just assuming the Expanse nodes are secure enough already and no one +malicious is on them. Also, this uses the default MySQL port (3306). You may need to change that +(I don't know how) if someone's already taken that port. + *Debugging tip if cromwell hangs at* :code:`[info] Running with database db.url = jdbc:mysql://localhost/cromwell?rewriteBatchedStatements=true`: If the previous cromwell execution didn't shut down cleanly (say, you kill it because it's hanging) then the MySQL server may remain locked and @@ -224,15 +149,19 @@ uninteractable, causing the next cromwell session to hang. To fix this, run: .. code-block:: bash - mysql -h localhost -P --protocol tcp -u root -ppass cromwell \ + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell \ < <(echo "update DATABASECHANGELOGLOCK set locked=0, lockgranted=null, lockedby=null where id=1;" ) + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell \ + < <(echo "update SQLMETADATADATABASECHANGELOGLOCK set locked=0, lockgranted=null, lockedby=null where id=1;" ) To check this has worked, you can run: .. code-block:: bash - mysql -h localhost -P --protocol tcp -u root -ppass cromwell \ + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell \ < <(echo "select * from DATABASECHANGELOGLOCK;") + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell \ + < <(echo "select * from SQLMETADATADATABASECHANGELOGLOCK;") that should return output something like: @@ -240,6 +169,8 @@ that should return output something like: ID LOCKED LOCKGRANTED LOCKEDBY 1 \0 NULL NULL + ID LOCKED LOCKGRANTED LOCKEDBY + 1 \0 NULL NULL *Debugging tip if the mysql log at path3 says* :code:`another process is using this socket` @@ -249,51 +180,34 @@ Delete the lock files at `/*lock`, kill the mysql server and then restart .. 
code-block:: bash - mysql -h localhost -P --protocol tcp -u root -ppass cromwell + mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell Notice there is no space between the -p and the password, unlike all the other flags. Unexpected call caching behaviors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you set the docker runtime attribute for a task -then for call caching Cromwell insists on trying to find -the corresponding docker image and using its digest (i.e. hash code) -as one of the keys for caching that task (not just the docker string -itself) (see `here `__). If cromwell can't figure out how to locate the docker image -then it simply refuses to try to load the call from cache. -Cromwell's log method of telling you this is very unclear, I think -it's something like "task not eligible for call caching". +then Cromwell insists on looking up the +corresponding docker image and using its digest (i.e. hash code) +as one of the keys for caching that task. This is unintuitive because it's not just using the string +in the runtime attribute as the cache key (see `here `__). +Moreover, if cromwell can't figure out how to locate the docker image's digest during this process, +then it simply refuses to try to load the call from cache at all, with a very inspecific +log message to the effect of "task not eligible for call caching". Because of this design choice, I'm not sure if you can get Cromwell -call caching to work with local docker image tarballs. +call caching to work with local docker image tarballs, which cause the image digest lookup step to fail. -Another unexpected input to call caching seems to be the backend +Another surprising behavior is that call caching seems to be backend specific (though I've not seen this confirmed in the docs), so for instance -if you run your job sometimes with SLURM and sometimes on an interactive -node, I can't seem to use the results of one in the other. - -Other call caching optimizations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Even with the above, my caching was quite slow, I think one of these options sped it up. -Not 100% sure which. They both have some details that might be worth knowing. - -* :code:`backend.SLURM.filesystems.local.caching.check-sibling-md5: true`. In theory - this means that if your input file is `foo.txt.` and you have `foo.txt.md5` in the same directory - then instead of hashing the entirety of `foo.txt` you just read the md5 from the nearby file. - This can be used to avoid hashing large input files more than once. Just use - :code:`md5sum $file | awk '{print $1}'> ${file}.md5` to write the md5 checksum. -* :code:`backend.SLURM.filesystems.local.caching.fingerprint-size: true`. This isn't documented - anywhere that I saw, but does exist in the `code `_ - This reduces the amount of file that's read by the hashing strategy. Note that this means that two files - with the first MB of data identical and the sam mod time will be treated as identical, even if the - remaining MBs differ - -Disabling call caching -~~~~~~~~~~~~~~~~~~~~~~ +if you run your job sometimes with SLURM and sometimes locally on an interactive +node, I can't seem to use the cached results of one for the other. + +Disabling call caching for a task +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Add -.. code-block: text +.. code-block:: text meta { volatile: true @@ -301,11 +215,36 @@ Add to a task definition to prevent it from being cached. +Cromwell Inputs and outputs +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you're using a container then your inputs cannot contain symlinks. 
+ +* If they do, you'll get something like file does not exist errors. +* It's possible symlinks to files underneath the root of the run will work, but not to files outside of the run root. I'd just avoid. +* Instead of symlinks, use hardlinks. + +Cromwell will dump its outputs to :code:`cromwell-executions///call-/execution` +That folder can also be used to inspect the stdout and stderr of that task for debugging. +Worfklow run ids are unhelpful randomly generated strings. To figure out which belongs to your +most recent run, you can look at the logs on the terminal for that run, or use +:code:`ls -t` to sort them by recency, e.g. :code:`cd cromwell-executions/ | ls -t | head -1`. +To check a task's inputs, looks at :code:`cromwell-executions///call-/inputs//` +If you use subworkflows in your WDL then those workflows will be represented by nested folders between +the base workflow and the end task leaf. If your task has multiple inputs, then you'll have to look +at all the input folders with arbitrary numbers to determine which is the input you're looking for. +If you move task outputs from those folders they will no longer be available for call caching (see below), +so don't do that. I would instead hard link or copy them if you want the output in a more memorable location. + +Cromwell's outputs will keep growing as you keep running it if you don't delete them. And due to randomized workflow run IDs it'll be very +hard to track which workflows have results important to caching and which errored out or are no longer needed. +No clue how to make managing that easier. + WDL with dxCompiler on DNANexus/UKB Research Analysis Platform -------------------------------------------------------------- -Constraints on how you write your WDL -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Constraints (important if you're writing your own WDL) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Unlike Cromwell, dxCompiler supports WDL 1.1. So if you don't need your WDL to be cross-platform, you can use those features. @@ -387,6 +326,80 @@ Constraints on how you write your WDL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Cromwell only supports WDL 1.0, not 1.1 or development (2.0) + +!!!! TODO + +I recommend these links for learning WDL. Supplement with tutorials as helps. + +* `WDL 1.0 spec `_ + (it's quite readable!) +* `differences between WDL versions `_ + + +TODO intro to writing WDL - all the below assumes you can write basic WDL and is about running it +or making it runnable on certain platforms + +Containers +---------- +You'll likely want to specify a container with the :code:`docker` runtime flag as it's +necessary to execute your WDL on cloud platforms. (Cromwell doesn't support the +equivalent :code:`container` flag). + +Constraints imposed by runtime environments: + +* If running All of Us, seems like you'll need to host on Google Container Registry? (not tested) +* If running with Cromwell on Expanse, will need to either store the image locally, or host + on one of the following supported environments: quay.io, dockerhub, google container registry (GCR) + or google artifact registry (GAR). I'm not sure storing locally will work though, + as I'm not sure you can get call caching to work with that - haven't tried. +* No constraints for UKB RAP as far as I know - you can upload the docker container to DNA Nexus, + or pull from an cloud container registry. + +quay.io is my cloud container registry of choice. 
Terminology: + +* quay.io - Red Hat's cloud container registry +* Red Hat Quay - Red Hat's private deployment container registry service +* Project quay - an open source version of Red Hat Quay where you can + deploy and stand up your own private container registry + +It's my container registry of choice because it has free accounts +(though this isn't super clear from their pricing docs), doesn't charge +for public containers, and because at least +so far I haven't found any pull restrictions. If you do run into issues, +I'd recommend moving to GCR. Yang has tried Dockerhub, but that has really +restrictive pull limits if you're using the free account. The paid account +isn't such an issue (only $7/mo.) but Yang couldn't figure out how to get +the authentication to work on UKB RAP so that you could log in from each task +before pulling the docker container so as to circumvent the pull limit. + +Repositories in quay.io start as private, even on the free account +which in theory hasn't paid for private repos (not sure why?). +After pushing to them for the first time, +sign into the web interface, select the repo, click on the wheel icon +on the left (settings) and click Make Public. + +To push to quay.io after building your docker image, do + +.. code-block:: bash + + docker login --username quay.io + docker tag : quay.io//: + docker push quay.io//: + +depending on how you configured docker, you may need to run those commands with sudo. + +Tips on building a container with conda +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* Use :code:`continuumio/miniconda3` as the base container. +* Put :code:`RUN conda init --system bash` in your Dockerfile +* See the section about conda and dxCompiler below to get + a script for activating conda. Then either configure that to run + automatically with the Dockerfile commands ENTRYPOINT + or SHELL if you're running the container with run or shell, or make sure + to call that script manually as part of the container exec invocation. + + Gotchas ------- (I'm unclear if these gotchas only exist for Cromwell running WDL 1.0 or for all versions of WDL and also for dxCompiler) From 86803e199d228cb83d5257fc78afb6365a83274e Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Mon, 4 Dec 2023 17:19:02 -0800 Subject: [PATCH 02/10] More updates --- doc/UKB_Expanse_STR_GWAS.rst | 7 +- doc/WDL.rst | 161 ++++++++++++++++++++++------------- 2 files changed, 105 insertions(+), 63 deletions(-) diff --git a/doc/UKB_Expanse_STR_GWAS.rst b/doc/UKB_Expanse_STR_GWAS.rst index 2ab7656..4964516 100644 --- a/doc/UKB_Expanse_STR_GWAS.rst +++ b/doc/UKB_Expanse_STR_GWAS.rst @@ -25,6 +25,7 @@ Create a json file for input: .. code-block:: json { + "expanse_gwas.script_dir": "...", "expanse_gwas.phenotype_name": "your_phenotype_name", "expanse_gwas.phenotype_id": "its_ID", "expanse_gwas.categorical_covariate_names": ["a_list_of_categorical_covariates"], @@ -48,12 +49,6 @@ The docker container you'll want to cache with Singularity is :code:`quay.io/the In the cromwell.conf file you create, add this: -.. code-block:: text - - workflow-options { - workflow-log-dir = "absolute_path_to_store_cromwell_logs_in" - } - And then for the two lines that says :code:`root = "cromwell-executions`, change them to an absolute path to the location you want all of your Cromwell run's work to be stored in. diff --git a/doc/WDL.rst b/doc/WDL.rst index 4b9cc57..6c3cf6e 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -9,7 +9,7 @@ you'll use Cromwell as the executor. 
If you're using DNANexus, you'll use dxComp The first sections below are about getting set up with those executors on the platform you're using. You should read those whether you're planning on writing your own WDL or running someone else's. -After that there's some information if you're planning to learn to write your own WDL pipelines. +After that there's some information if you're planning to learn to write your own WDL workflows. Note that each executor has different constraints on the WDL you write, so if you're writing your own WDL, first figure out what platforms you want it to run on and then read the "Constraints" sections for those executors/platforms before beginning to write your WDL. @@ -34,24 +34,57 @@ at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/utilities/cromwell-86-9 Alternatively, you can download Cromwell's JAR official file from `here `__. You can ignore the womtool JAR at that location. +Specifying inputs to WDL workflows with Cromwell +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Cromwell passes inputs to WDL workflows via a JSON file which looks like: + +.. code-block:: json + + { + "workflow_name.input1_name": "value1", + "workflow_name.input2_name": "value2", + ... + } + +Add `-i .json` to your Cromwell run command to pass the input file. + +If the WDL workflow you're running uses containers (e.g. Docker) then file inputs to the workflow cannot be symlinks. +If the files are symlinks, you'll get something like "file does not exist" errors. Instead of symlinks, use hardlinks. + +Specifying Cromwell output locations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:ref:`_Cromwells_execution_directory` is confusingly organized, so it's hard to find the outputs of a Cromwell run +if you don't tell it to put them anywhere specific. Instead, add an options JSON file to your Cromwell run with `-o .json` +and tell it to put the workflow's outputs in the location you'd like: + +.. code-block:: json + + { + "final_workflow_outputs_dir": "" + } + Running ^^^^^^^ -Once you have the WDL pipeline you want to run, here are the steps for running it with Cromwell: +Here are the steps for running it with Cromwell: #. See :ref:`_cromwell_configuration` below for setting up your cromwell configuration. #. If you're running with Docker containers, see :ref:`_Using_Singularity_to_run_Docker_containers` for setting up your :code:`.bashrc` file to make singularity work on Expanse, and then cache the singularity images before you start your job, also documented there. -#. Start by :ref:`_getting_an_interactive_node_on_Expanse`. That should last for as long as the entire pipeline you are running with WDL. +#. Start by :ref:`_getting_an_interactive_node_on_Expanse`. You should set that to last for as long as the entire WDL workflow you are running with Cromwell. Depending on how long it will take, consider :ref:`_increasing_job_runtime_up_to_one_week`. #. If you want to enable call-caching, stand up the MySQL server on the interactive node (see below) -#. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json ` +#. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json | tee .txt` to run the WDL using Cromwell. Feel free to omit the input and options flags if you're not using them. Note: Cromwell has a server mode where you stand it up and can inspect running jobs through a web interface. As I (Jonathan) haven't learned how to use that, so I'm not documenting it here. 
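+
+While the workflow is running, you can follow its progress from a second terminal by tailing the log file you told
+:code:`tee` to write in the run command above. A quick sketch (:code:`workflow.log` here is just a stand-in for whatever
+log file name you chose):
+
+.. code-block:: bash
+
+    # follow Cromwell's log live; swap in the file name you passed to tee
+    tail -f workflow.log
+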
-Cromwell +If you need help debugging, start by looking at Cromwell's log file, which will be written to the log file you specified at the end of the command above. +If the workflow completed successfully, the lines toward the end of the log should tell you where it put the workflow's outputs (if you didn't specify an output location above). +If a task failed and you want to inspect its intermediate inputs/outputs for debugging, see :ref:`_Cromwells_execution_directory`. .. _cromwell_configuration: @@ -215,26 +248,42 @@ Add to a task definition to prevent it from being cached. -Cromwell Inputs and outputs -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -If you're using a container then your inputs cannot contain symlinks. +.. _Cromwells_execution_directory: -* If they do, you'll get something like file does not exist errors. -* It's possible symlinks to files underneath the root of the run will work, but not to files outside of the run root. I'd just avoid. -* Instead of symlinks, use hardlinks. +Cromwell's execution directory +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Cromwell will dump its outputs to :code:`cromwell-executions///call-/execution` -That folder can also be used to inspect the stdout and stderr of that task for debugging. +Cromwell runs its executions (including task inputs and outputs) in :code:`cromwell-executions//` Worfklow run ids are unhelpful randomly generated strings. To figure out which belongs to your most recent run, you can look at the logs on the terminal for that run, or use :code:`ls -t` to sort them by recency, e.g. :code:`cd cromwell-executions/ | ls -t | head -1`. -To check a task's inputs, looks at :code:`cromwell-executions///call-/inputs//` -If you use subworkflows in your WDL then those workflows will be represented by nested folders between -the base workflow and the end task leaf. If your task has multiple inputs, then you'll have to look -at all the input folders with arbitrary numbers to determine which is the input you're looking for. -If you move task outputs from those folders they will no longer be available for call caching (see below), -so don't do that. I would instead hard link or copy them if you want the output in a more memorable location. +Once you're in the your workflow run's folder, you should see one folder named `call-` +for each task called in the workflow. The task folder will contain two important directories :code:`inputs` and :code:`executions`. +:code:`inputs` contains a bunch of subfolders with random numbers, each of which contain one or more input files (input files +originally stored in the same directory will be put into the same inputs subdirectory). Note that input files will be named +by their original filenames, not by the variable names they were referred to in the task, so it can be hard to match which inputs +in this directory correspond to which inputs in the task. :code:`executions` contains a number of useful files for debugging: + +* :code:`rc` contains the return code of the task (if it completed) +* :code:`script.submit` is the script used to submit the task to SLURM (not sure if this is present on local runs) +* :code:`stdout.submit` and :code:`stderr.submit` are the stdout/err for the job submission to SLURM. 
+* :code:`script` contains the script that Cromwell executed to run this task on a SLRUM node (which is the command section of the task wrapped in + some autogenerated code) +* :code:`stdout` and :code:`stderr` are the stdout/err for the actual run of the task (if you didn't capture them inside + WDL with :code:`stdout()` or :code:`stderr()`). +* All the output files generated by the task should be in this folder as well. + If you move task outputs from this folders they will no longer be available for call caching, + so don't do that. Instead, hard or symlink them to another location. + +If the task was call cached, then instead `call-` will contain `cacheCopy/execution` as a subdirectory +and there will be no inputs folder you can cross reference against (which can make debugging harder). + +If the workflow you called in turn called subworkflows, those workflows will be represented by nested folders between +the base workflow and the end task leaf, looking something like: +:code:`cromwell-executions///call-///call-...` +If a task or subworkflow is called in a scatter block, then between the `call-` folder and its +usual contents there will be a bunch of `shard-` folders which contain each of the scattered subcalls. All this nesting +can get a bit overwhelming when you're trying to debug. Cromwell's outputs will keep growing as you keep running it if you don't delete them. And due to randomized workflow run IDs it'll be very hard to track which workflows have results important to caching and which errored out or are no longer needed. @@ -327,21 +376,49 @@ Constraints on how you write your WDL Cromwell only supports WDL 1.0, not 1.1 or development (2.0) -!!!! TODO +Learning WDL +------------ -I recommend these links for learning WDL. Supplement with tutorials as helps. +I recommend these links for learning WDL. There are also good tutorials you can find for parts of the spec you're confused by. * `WDL 1.0 spec `_ (it's quite readable!) * `differences between WDL versions `_ +WDL Gotchas +^^^^^^^^^^^ + +(I'm unclear if these gotchas only exist for Cromwell running WDL 1.0 or for all versions of WDL and also for dxCompiler) + +* There are no :code:`else` statements to pair with :code:`if` statements. Instead + write :code:`if (x) {}`, then :code:`if (!x) {}`, and then use :code:`select_first()` + to condense the results of both branches to single variables. +* For whatever reason, trying :code:`my_array[x+1]` will fail at compile time. Instead, write + :code:`Int x_plus_one = x + 1` and then :code:`my_array[x_plus_one]`. +* There is no array slicing. If you want to scatter over :code:`item in my_array[1:]`, instead + scatter over :code:`idx in range(length(my_array)-1)` and manually access the array at + `Int idx_plus_one = idx + 1` +* If you want to create an array literal that's easier to specify via a list comprehension than to type it all out, + do so by writing out the expression inside a scatter block in a worfklow. There's no way to get list comprehensions to work + anywhere in tasks or within the input or output sections of a workflow. +* The :code:`glob()` library function can only be used within tasks, not within workflows. + It will not error out at language examination time but at runtime if used within a workflow. +* The :code:`write_XXX()` functions will fail in weird ways if used in a workflow and not a task. +* The :code:`write_XXX()` functions will not accept :code:`Array[X?]`, only :code:`Array[X]`. 
+ +These gotchas I know only apply to WDL 1.0 (but perhaps to both Cromwell and dxCompiler?) + +* The :code:`write_objects()` function will crash when passed an empty array of structs + instead of writing a header line and no content rows. +* The :code:`write_objects()` function will crash at runtime when passed a struct with a member + that is a compound type (struct, map, array, object). +* While structs can contain members of multiple types, maps cannot, and so to create such a struct + it must be assigned from an object literal and not a map literal. -TODO intro to writing WDL - all the below assumes you can write basic WDL and is about running it -or making it runnable on certain platforms +Using Docker containers from WDL +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Containers ----------- -You'll likely want to specify a container with the :code:`docker` runtime flag as it's +You'll likely want to specify a container within each tasks' :code:`docker` runtime flag as that's necessary to execute your WDL on cloud platforms. (Cromwell doesn't support the equivalent :code:`container` flag). @@ -399,33 +476,3 @@ Tips on building a container with conda or SHELL if you're running the container with run or shell, or make sure to call that script manually as part of the container exec invocation. - -Gotchas -------- -(I'm unclear if these gotchas only exist for Cromwell running WDL 1.0 or for all versions of WDL and also for dxCompiler) - -* There are no :code:`else` statements to pair with :code:`if` statements. Instead - write :code:`if (x) {}`, then :code:`if (!x) {}`, and then use :code:`select_first()` - to condense the results of both branches to single variables. -* For whatever reason, trying :code:`my_array[x+1]` will fail at compile time. Instead, write - :code:`Int x_plus_one = x + 1` and then :code:`my_array[x_plus_one]`. -* There is no array slicing. If you want to scatter over :code:`item in my_array[1:]`, instead - scatter over :code:`idx in range(length(my_array)-1)` and manually access the array at - `Int idx_plus_one = idx + 1` -* If you want to create an array literal that's easier to specify via a list comprehension than to type it all out, - do so by writing out the expression inside a scatter block in a worfklow. There's no way to get list comprehensions to work - anywhere in tasks or within the input or output sections of a workflow. -* The :code:`glob()` library function can only be used within tasks, not within workflows. - It will not error out at language examination time but at runtime if used within a workflow. -* The :code:`write_XXX()` functions will fail in weird ways if used in a workflow and not a task. -* The :code:`write_XXX()` functions will not accept :code:`Array[X?]`, only :code:`Array[X]`. - -These gotchas I know only apply to WDL 1.0 (but perhaps to both Cromwell and dxCompiler?) - -* The :code:`write_objects()` function will crash when passed an empty array of structs - instead of writing a header line and no content rows. -* The :code:`write_objects()` function will crash at runtime when passed a struct with a member - that is a compound type (struct, map, array, object). -* While structs can contain members of multiple types, maps cannot, and so to create such a struct - it must be assigned from an object literal and not a map literal. 
- From ad88d1885e7059f9f2138a2bcf91ecc1b1f04ebf Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Tue, 5 Dec 2023 10:06:29 -0800 Subject: [PATCH 03/10] Still updating --- doc/WDL.rst | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/doc/WDL.rst b/doc/WDL.rst index 6c3cf6e..949a18d 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -31,7 +31,7 @@ First, install java. Jonathan compiled Cromwell from source with two changes to make it run better on Expanse. You can access that JAR at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/utilities/cromwell-86-90af36d-SNAP-600ReadWaitTimeout-FingerprintNoTimestamp.jar`. -Alternatively, you can download Cromwell's JAR official file from `here `__. You can +Alternatively, you can download Cromwell's official JAR file from `here `__. You can ignore the womtool JAR at that location. Specifying inputs to WDL workflows with Cromwell @@ -47,7 +47,7 @@ Cromwell passes inputs to WDL workflows via a JSON file which looks like: ... } -Add `-i .json` to your Cromwell run command to pass the input file. +Add `-i .json` to your Cromwell run command to pass in the input file. If the WDL workflow you're running uses containers (e.g. Docker) then file inputs to the workflow cannot be symlinks. If the files are symlinks, you'll get something like "file does not exist" errors. Instead of symlinks, use hardlinks. @@ -55,7 +55,7 @@ If the files are symlinks, you'll get something like "file does not exist" error Specifying Cromwell output locations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -:ref:`_Cromwells_execution_directory` is confusingly organized, so it's hard to find the outputs of a Cromwell run +:ref:`_Cromwells_execution_directory` is confusingly organized, so it's hard to find the outputs of the final tasks(s) in a Cromwell run if you don't tell it to put them anywhere specific. Instead, add an options JSON file to your Cromwell run with `-o .json` and tell it to put the workflow's outputs in the location you'd like: @@ -72,10 +72,11 @@ Here are the steps for running it with Cromwell: #. See :ref:`_cromwell_configuration` below for setting up your cromwell configuration. #. If you're running with Docker containers, see :ref:`_Using_Singularity_to_run_Docker_containers` for setting up your :code:`.bashrc` file to make singularity work on Expanse, - and then cache the singularity images before you start your job, also documented there. + and then cache the singularity images before you start your job (also documented there). #. Start by :ref:`_getting_an_interactive_node_on_Expanse`. You should set that to last for as long as the entire WDL workflow you are running with Cromwell. Depending on how long it will take, consider :ref:`_increasing_job_runtime_up_to_one_week`. -#. If you want to enable call-caching, stand up the MySQL server on the interactive node (see below) +#. To enable :ref:`call-caching <_call_caching_with_Cromwell>`, (create the necessary directories, if this is your first time) and stand up the MySQL server on the interactive node (and + then create the cromwell database, if this is your first time) #. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json | tee .txt` to run the WDL using Cromwell. Feel free to omit the input and options flags if you're not using them. 
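+
+For reference, a filled-in version of that command might look something like the sketch below. The JAR path is the Expanse
+copy mentioned above; every other file name is a hypothetical placeholder, so substitute your own config, inputs, options,
+and workflow files:
+
+.. code-block:: bash
+
+    # hypothetical example invocation; adjust the file names to your own setup
+    java -Dconfig.file=cromwell.conf \
+        -jar /expanse/projects/gymreklab/jmargoli/ukbiobank/utilities/cromwell-86-90af36d-SNAP-600ReadWaitTimeout-FingerprintNoTimestamp.jar \
+        run \
+        -i gwas_inputs.json \
+        -o gwas_options.json \
+        gwas.wdl | tee gwas_run.log
+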
From 197502b600f62ee3e25fc95c441cf8b0cc812717 Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Tue, 5 Dec 2023 10:19:50 -0800 Subject: [PATCH 04/10] Fix references --- doc/Expanse.rst | 2 +- doc/WDL.rst | 31 +++++++++++++++++++------------ 2 files changed, 20 insertions(+), 13 deletions(-) diff --git a/doc/Expanse.rst b/doc/Expanse.rst index d2d928a..49bd010 100644 --- a/doc/Expanse.rst +++ b/doc/Expanse.rst @@ -148,7 +148,7 @@ To make singularity work, I add the following to my :code:`.bashrc`: export SINGULARITY_TMPDIR="/scratch/$USER/job_$SLURM_JOB_ID" fi -The general idea is, first grab an interactive node (or put this in a script that you submit) and then: +If you want to run inside a singularity image, first grab an interactive node (or put this in a script that you submit) and then: .. code-block:: bash diff --git a/doc/WDL.rst b/doc/WDL.rst index 949a18d..79687b7 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -44,7 +44,7 @@ Cromwell passes inputs to WDL workflows via a JSON file which looks like: { "workflow_name.input1_name": "value1", "workflow_name.input2_name": "value2", - ... + "...": "..." } Add `-i .json` to your Cromwell run command to pass in the input file. @@ -55,7 +55,7 @@ If the files are symlinks, you'll get something like "file does not exist" error Specifying Cromwell output locations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -:ref:`_Cromwells_execution_directory` is confusingly organized, so it's hard to find the outputs of the final tasks(s) in a Cromwell run +:ref:`Cromwells_execution_directory` is confusingly organized, so it's hard to find the outputs of the final tasks(s) in a Cromwell run if you don't tell it to put them anywhere specific. Instead, add an options JSON file to your Cromwell run with `-o .json` and tell it to put the workflow's outputs in the location you'd like: @@ -68,16 +68,23 @@ and tell it to put the workflow's outputs in the location you'd like: Running ^^^^^^^ -Here are the steps for running it with Cromwell: +Here are the steps you need for running Cromwell the first time: -#. See :ref:`_cromwell_configuration` below for setting up your cromwell configuration. -#. If you're running with Docker containers, see :ref:`_Using_Singularity_to_run_Docker_containers` for setting up your :code:`.bashrc` file to make singularity work on Expanse, +#. See :ref:`cromwell_configuration` below for setting up your cromwell configuration. +#. If you're running with Docker containers, see :ref:`Using_Singularity_to_run_Docker_containers` for setting up your :code:`.bashrc` file to make singularity work on Expanse, and then cache the singularity images before you start your job (also documented there). -#. Start by :ref:`_getting_an_interactive_node_on_Expanse`. You should set that to last for as long as the entire WDL workflow you are running with Cromwell. - Depending on how long it will take, consider :ref:`_increasing_job_runtime_up_to_one_week`. -#. To enable :ref:`call-caching <_call_caching_with_Cromwell>`, (create the necessary directories, if this is your first time) and stand up the MySQL server on the interactive node (and - then create the cromwell database, if this is your first time) -#. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json | tee .txt` + +And here are the steps you'll perform each time you run Cromwell: + +#. Start by :ref:`getting_an_interactive_node_on_Expanse`. You should set that to last for as long as the entire WDL workflow you are running with Cromwell. 
+ Depending on how long it will take, consider :ref:`increasing_job_runtime_up_to_one_week`. +#. Enable :ref:`call-caching `, which outlines the following steps: + + #. First time only: create the necessary directories + #. Each time: stand up the MySQL server on the interactive node + #. First time only: create the the cromwell database + +#. From the interactive node, execute the command :code:`java -Dconfig.file= -jar .jar run -i .json -o .json | tee .txt` to run the WDL using Cromwell. Feel free to omit the input and options flags if you're not using them. Note: Cromwell has a server mode where you stand it up and can inspect running jobs through a web interface. As I (Jonathan) haven't @@ -85,7 +92,7 @@ learned how to use that, so I'm not documenting it here. If you need help debugging, start by looking at Cromwell's log file, which will be written to the log file you specified at the end of the command above. If the workflow completed successfully, the lines toward the end of the log should tell you where it put the workflow's outputs (if you didn't specify an output location above). -If a task failed and you want to inspect its intermediate inputs/outputs for debugging, see :ref:`_Cromwells_execution_directory`. +If a task failed and you want to inspect its intermediate inputs/outputs for debugging, see :ref:`Cromwells_execution_directory`. .. _cromwell_configuration: @@ -99,7 +106,7 @@ from Cromwell's docs, but it doesn't explain everything or have every option you After copying my config, you will need to: * swap my email address for yours -* Either set up :ref:`_call_caching_with_Cromwell`, or set :code:`call-caching.enabled = False`. +* Either set up :ref:`call_caching_with_Cromwell`, or set :code:`call-caching.enabled = False`. If you disable it, then every time you run a job it will be run again from the beginning instead of reusing intermediate results that finished successfully. * When running jobs, if you want to run them all on the cluster, make sure under backend that :code:`default = "SLURM"`. If you only have a small number of jobs and you'd rather run them on your local node for debugging purposes or because the Expanse queue is backed up right now, instead change that to :code:`default = "Local"` From 05ef9288c3b1ac6abc8416d82c6420c34d321d97 Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Tue, 5 Dec 2023 11:25:22 -0800 Subject: [PATCH 05/10] Updated instructions for GWAS, still need to add for fine-mapping --- doc/UKB_Expanse_STR_GWAS.rst | 102 +++++++++++++++-------------------- doc/WDL.rst | 7 ++- 2 files changed, 48 insertions(+), 61 deletions(-) diff --git a/doc/UKB_Expanse_STR_GWAS.rst b/doc/UKB_Expanse_STR_GWAS.rst index 4964516..c9f04ae 100644 --- a/doc/UKB_Expanse_STR_GWAS.rst +++ b/doc/UKB_Expanse_STR_GWAS.rst @@ -1,5 +1,5 @@ -UKB Expanse SNP/indel/STR GWAS -============================== +UKB Expanse SNP/indel/STR GWAS and fine-mapping +=============================================== This guide will show you how to run a GWAS against a UK Biobank phenotype on Expanse. The GWAS will include both SNPs, indels and STRs. This uses the WDL and scripts pipeline @@ -8,7 +8,9 @@ written for the UKB blood-traits imputed STRs paper. Setting up the GWAS and WDL inputs ---------------------------------- -First, choose a phenotype you want to perform the GWAS against. +First, check out `my paper's repository `_ into some directory you manage. + +Then, choose a phenotype you want to perform the GWAS against. You can explore UKB phenotypes `here `__. 
You'll need the data field ID of the phenotype, and the data field IDs of any fields you wish to use as categorical covariates. @@ -20,61 +22,36 @@ Caveats: (otherwise age calculations will be thrown off or may crash) * Currently, only categorical covariates are supported -Create a json file for input: +Create a json file for input, setting script_dir to the root of the git repo you checked out above, and all the others as appropriate: .. code-block:: json { - "expanse_gwas.script_dir": "...", + "expanse_gwas.script_dir": "repo_source" "expanse_gwas.phenotype_name": "your_phenotype_name", - "expanse_gwas.phenotype_id": "its_ID", + "expanse_gwas.phenotype_id": "its_data_field_ID", "expanse_gwas.categorical_covariate_names": ["a_list_of_categorical_covariates"], "expanse_gwas.categorical_covariate_ids": ["their_IDs"] } -Create a json options file specifying where you want your output to be written: - -.. code-block:: json - - { - "final_workflow_outputs_dir": "your_output_directory" - } - - Running the GWAS ---------------- Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Singularity. -The docker container you'll want to cache with Singularity is :code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` - -In the cromwell.conf file you create, add this: - -And then for the two lines that says :code:`root = "cromwell-executions`, change them to an -absolute path to the location you want all of your Cromwell run's work to be stored in. +The two docker containers you'll want to cache with Singularity prior to your run are +:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` and +:code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.4` -Once you're ready to run WDL +The WDL workflow file you'll point Cromwell to is in the repo you checked out at :code:`workflow/expanse_wdl/gwas.wdl`. Run +Cromwell as normal using the standard Cromwell instructions with your input and that workflow. -.. code-block:: bash - - # you need to be in this directory for the WDL config to find the scripts - # appropriately, but all the work, outputs and logs will be written to locations - # you've specified and not this directory - cd /expanse/projects/gymreklab/jmargoli/ukbiobank - - java \ - -Dconfig.file= \ - -jar \ - run \ - -i \ - -o \ - /expanse/projects/gymreklab/jmargoli/ukbiobank/workflow/expanse_targets/expanse_gwas_workflow.wdl \ - > - -Then you can follow along in another window with :code:`tail -f ` +If you want to run the GWAS only in a specific subpopulation of the UKB, see :ref:`running_on_a_subpopulation` below. What the GWAS does ------------------ +The full details are in the methods of the paper `here `_. In short, this pipeline: + * Gets a sample list of QCed, unrelated white brits that has the specified phenotype and each specified covariate * Includes age at time of measurement, genetic PCs and sex as additional covariates. * Rank quantile normalizes the phenotype (this in theory gives better power to work with non-normal phenotype data, @@ -82,23 +59,29 @@ What the GWAS does * Performs a GWAS for each imputed SNP, indel and STR of the transformed phenotype against the genotype of that variant and all the covariates. * Calculates the peaks of the signals across the genome. 
-* Gets a sample list of participants in the same manner as white brits, but for the five ethnicities: +* Gets a sample list of participants in a similar manner as white brits, but for the five ethnicities: [black, south_asian, chinese, irish, white_other] -* Runs the GWAS for STRs ONLY in those populations on the subset of regions containing a variant with p<5e-8 in the White Brits. +* Runs the GWAS for STRs in those populations *only* on the regions containing a variant with p<5e-8 in the White Brits. + +Output file names +----------------- -Output files ------------- +Final outputs: -Will all be located in :code:`your_output_dir/expanse_gwas`. Unfortunately, they paths to them -will also include IDs which are random alphanumeric strings with dashes in them. +* PLINK GWAS output for imputed SNPs and indels in white_brits :code:`white_brits_snp_gwas.tab` +* associaTR GWAS output for imputed STRs in white_brits :code:`white_brits_str_gwas.tab` +* associaTR GWAS output for imputed STRs in the other ethnicities :code:`_str_gwas.tab` +* List of GWAS peaks across all variant types, at least 250kb separate, in white brits: :code:`peaks.tab` +* List of regions for followup fine-mapping regions in all variant types, in white brits: :code:`finemapping_regions.tab` -* PLINK GWAS output for imputed SNPs and indels in white_brits :code:`workflow_ID/call-gwas/gwas/subworkflow_ID/call-plink_snp_association/execution/out.tab` -* associaTR GWAS output for imputed STRs in white_brits :code:`workflow_id/call-gwas/gwas/subworkflow_id/call-my_str_gwas_/execution/out.tab` -* associaTR GWAS output for imputed STRs in the other ethnicities: - :code:`workflow_ID/call-gwas/gwas/subworkflow_ID/call-ethnic_my_str_gwas_/shard_X/execution/out.tab` where X in shard_X is a number from 0 to 4 indicating - the index of the ethnicity in the list of ethnicities above -* List of GWAS peaks across all variant types in white brits: :code:`workflow_id/call-gwas/gwas/subworkflow_id/call-generate_peaks/execution/peaks.tab` -* Other intermediate outputs will also be there if you want to look at those. +Intermediate outputs potentially useful for debugging: + +* Lists of all the participants used in the GWAS after all subsetting, entitled :code:`.samples` +* The shared covars array :code:`shared_covars.npy` and their names :code:`covar_names.txt` +* The (original) untransformed phenotype data, deposited for your reference, :code:`_original_pheno.npy` +* The transformed phenotype data used in the regression plus all the covariates you specified :code:`_pheno.npy`, as well as the names of those covariates :code:`_pheno_covar_names.txt` + + .. _running_on_a_subpopulation: Running on a subpopulation -------------------------- @@ -108,19 +91,20 @@ of sample IDs into a file, one per line, with the first line having the header ' .. code-block:: json - "expanse_gwas.subpop_sample_list": "your_sample_file" + "gwas.subpop_sample_list": "your_sample_file" to the json input file. This subpopulation file must contain all samples of all ethnicities that you want included -(so any samples not included will be omitted). +(i.e. any samples not included will be omitted). + +Note that providing this file doesn't change the pipeline's workflow: * Samples that fail QC will still be removed. * Analyses will still be split per ethnicity. 
* Each ethnicity's sample list will still be shrunk to remove related participants -* You should include some samples from each ethnicity or the workflow will probably fail - - you'll still likely get GWAS results from the ethnicities you included, but you'll have to dig for those - instead of getting them put into the output location you asked for. -You may find the files at :code:`/expanse/projects/gymreklab/jmargoli/ukbiobank/sample_qc/runs//no_phenotype/combined.sample` -helpful for building your subpopulation - those location contains the QCed (but not yet unrelated) samples for the six ethincities used in the imputed UKB STRs paper. +Running fine-mapping +-------------------- + +TODO diff --git a/doc/WDL.rst b/doc/WDL.rst index 79687b7..32f13b7 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -111,6 +111,8 @@ After copying my config, you will need to: * When running jobs, if you want to run them all on the cluster, make sure under backend that :code:`default = "SLURM"`. If you only have a small number of jobs and you'd rather run them on your local node for debugging purposes or because the Expanse queue is backed up right now, instead change that to :code:`default = "Local"` +Note that this is configured to put cromwell's execution directory in the subfolder :code:`cromwell-executions` of wherever you launch Cromwell from. + If you want to understand the config file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -159,9 +161,10 @@ Then, each time you want to run Cromwell, after logging in to the interactive no singularity run --containall --env MYSQL_ROOT_PASSWORD=pass --bind ${PWD}/cromwell-executions/mysqldb:/var/lib/mysql --bind ${PWD}/cromwell-executions/mysql_var_run_mysqld:/var/run/mysqld docker://mysql > cromwell-executions/mysql.run.log 2>&1 & This starts a MySQL server running on the interactive node by using singularity to run the the default MySQL docker. -This command stores the MySQL log at :code:`cromwell-executions/mysql.run.log` if you need it for debuging. +This command stores the MySQL log at :code:`cromwell-executions/mysql.run.log`. +Wait to proceed till the last line in that file says :code:`X Plugin ready for connections`. -The first time you stand up MySQL, you'll need to run the following: +After that, if this is your first time running MySQL this way, you'll need to run the following: .. code-block:: bash From 470998a6876b22f845902ab1037e86c255d32dad Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Mon, 11 Dec 2023 15:14:50 -0800 Subject: [PATCH 06/10] Some clarifications from chatting with Tara --- doc/Expanse.rst | 24 ++++++++++++++++-------- doc/UKB_Expanse_STR_GWAS.rst | 12 +++++++++--- doc/WDL.rst | 4 +++- 3 files changed, 28 insertions(+), 12 deletions(-) diff --git a/doc/Expanse.rst b/doc/Expanse.rst index 49bd010..6f9c8c3 100644 --- a/doc/Expanse.rst +++ b/doc/Expanse.rst @@ -148,12 +148,27 @@ To make singularity work, I add the following to my :code:`.bashrc`: export SINGULARITY_TMPDIR="/scratch/$USER/job_$SLURM_JOB_ID" fi +Caching Singularity images +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If you want to cache a singularity image on disk from a Docker source for the GWAS pipeline, and don't +need to interact with Singularity beyond that, grab an interactive node and on it simply run + +.. 
code-block:: bash + + singularity exec docker:// /bin/bash -c "echo pulled the image" + +Running with Singularity images +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + If you want to run inside a singularity image, first grab an interactive node (or put this in a script that you submit) and then: .. code-block:: bash singularity exec --containall docker:// +Singularity should only be used on compute nodes, not the login nodes. + You'll notice the first time you run a new docker image Singularity takes a while (~10min) building it into a singularity image. They are cached at :code:`$SINGUALRITY_CACHEDIR` if that's set or :code:`~/.singularity/cache` otherwise. For Expanse, IIRC the home directory is @@ -164,13 +179,7 @@ Any calls to :code:`singularity exec|shell|pull` will cache the image. I wouldn' trust that the cache is thread-safe, so if you're going to kick off a bunch of jobs, either cache the image before hand, or have them all check. -To cache the image beforehand: - -.. code-block:: bash - - singularity exec docker:// /bin/bash -c "echo pulled the image" - -or, to check in a synchronized manner: +To cache the image beforehand, see above. To check in a synchronized manner: .. code-block:: bash @@ -183,7 +192,6 @@ or, to check in a synchronized manner: flock --verbose --exclusive --timeout 900 $LOCK_FILE \ SINGULARITY_TMPDIR=/scratch/$USER/job_$SLURM_JOB_ID singularity exec --containall docker:// echo "successfully pulled image" - Singularity run tips ^^^^^^^^^^^^^^^^^^^^ diff --git a/doc/UKB_Expanse_STR_GWAS.rst b/doc/UKB_Expanse_STR_GWAS.rst index c9f04ae..d58be6e 100644 --- a/doc/UKB_Expanse_STR_GWAS.rst +++ b/doc/UKB_Expanse_STR_GWAS.rst @@ -13,7 +13,9 @@ First, check out `my paper's repository `__. You'll need the data field ID of the phenotype, and the data field IDs of any fields -you wish to use as categorical covariates. +you wish to use as categorical covariates. Sex, age at measurement, and genetic PCs 1-40 +calculated by the UKB team `here `_ are +automatically included as covariates and should not be specified in the input file below. Caveats: @@ -22,7 +24,10 @@ Caveats: (otherwise age calculations will be thrown off or may crash) * Currently, only categorical covariates are supported -Create a json file for input, setting script_dir to the root of the git repo you checked out above, and all the others as appropriate: +Create a json file for input, setting script_dir to the root of the git repo you checked out above, and all the others as appropriate. +Covariate and phenotype IDs should be integers, don't include a suffix similar to :code:`-0.0` specifying the +measurement number and the array number. If you do not wish to include covariates, simply pass in empty lists - do +not omit the full covariate lines from the input json file. .. code-block:: json @@ -37,7 +42,8 @@ Create a json file for input, setting script_dir to the root of the git repo you Running the GWAS ---------------- -Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Singularity. +Then, get set up with :ref:`WDL_with_Cromwell_on_Expanse`, including the bit about Docker and Singularity +which are needed for this pipeline. 
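+
+As a concrete illustration of the Singularity part of that setup: caching an image ahead of time is a single
+command run from an interactive node. This is just a sketch of the caching command from the Expanse docs with
+one of this pipeline's images (both are named in the next paragraph) filled in:
+
+.. code-block:: bash
+
+   singularity exec docker://quay.io/thedevilinthedetails/work/ukb_strs:v1.3 /bin/bash -c "echo pulled the image"
+
+Expect the first pull of a new image to take a while (~10 minutes) while Singularity converts it; later runs reuse the cached image.
+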
The two docker containers you'll want to cache with Singularity prior to your run are :code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.3` and :code:`quay.io/thedevilinthedetails/work/ukb_strs:v1.4` diff --git a/doc/WDL.rst b/doc/WDL.rst index 32f13b7..92ad3fc 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -77,7 +77,9 @@ Here are the steps you need for running Cromwell the first time: And here are the steps you'll perform each time you run Cromwell: #. Start by :ref:`getting_an_interactive_node_on_Expanse`. You should set that to last for as long as the entire WDL workflow you are running with Cromwell. - Depending on how long it will take, consider :ref:`increasing_job_runtime_up_to_one_week`. + Depending on how long it will take, consider :ref:`increasing_job_runtime_up_to_one_week`. If you're submitting jobs to the cluster by running with the :code:`SLURM` configuration, + this head node does not need much memory (4GB should be fine). If you're running everything on the head node with the :code:`Local` configuration, then grab as much memory as your + pipeline will need at any one time. #. Enable :ref:`call-caching `, which outlines the following steps: #. First time only: create the necessary directories From a88282adf0ae732c1e9ec021a5865edbe58451ab Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Tue, 12 Dec 2023 10:06:49 -0800 Subject: [PATCH 07/10] Don't access cromwell directory while trying to create it. Shorten process for creating cromwell directory --- doc/WDL.rst | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/doc/WDL.rst b/doc/WDL.rst index 92ad3fc..e352cf3 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -170,11 +170,8 @@ After that, if this is your first time running MySQL this way, you'll need to ru .. code-block:: bash - # start an interactive my sql session - mysql -h localhost -P 3306 --protocol tcp -u root -ppass cromwell - # from within the mysql prompt - create database cromwell; - exit; + mysql -h localhost -P 3306 --protocol tcp -u root -ppass \ + < <(echo "create database cromwell;" ) You should now (finally!) be good to go with call caching. From ef7ed41e1de86545e3bc9ee61ae4da3e0eabee75 Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Tue, 12 Dec 2023 11:42:51 -0800 Subject: [PATCH 08/10] Forget relative paths option --- doc/WDL.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/doc/WDL.rst b/doc/WDL.rst index e352cf3..7d92ef0 100644 --- a/doc/WDL.rst +++ b/doc/WDL.rst @@ -62,9 +62,14 @@ and tell it to put the workflow's outputs in the location you'd like: .. code-block:: json { - "final_workflow_outputs_dir": "" + "final_workflow_outputs_dir": "", + "use_relative_output_paths": true } +Note that this will cause Cromwell to fail after the workflow has all but succeeded +if any of your workflow's file outputs have the same file names +(thus leading to a conflict when you want them in the same directory). + Running ^^^^^^^ From df9da6c55e96f4586a40b072517724472be3c7e2 Mon Sep 17 00:00:00 2001 From: Jonathan Margoliash Date: Wed, 17 Jan 2024 14:09:16 -0800 Subject: [PATCH 09/10] User links which don't create named references in rst. Fix one tip and add another. Remove trailing whitespace. 
---
 doc/Expanse.rst | 10 ++---
 doc/UKB_Expanse_STR_GWAS.rst | 6 +--
 doc/WDL.rst | 74 +++++++++++++++++++-----------------
 3 files changed, 47 insertions(+), 43 deletions(-)

diff --git a/doc/Expanse.rst b/doc/Expanse.rst
index 6f9c8c3..832bd94 100644
--- a/doc/Expanse.rst
+++ b/doc/Expanse.rst
@@ -5,7 +5,7 @@ Last update: 2023/01/10
 
 Expanse uses SLURM to schedule and run jobs
 
-`Expanse user guide `_
+`Expanse user guide `__
 
 Getting an account, logging in and setting up
 ---------------------------------------------
@@ -43,7 +43,7 @@ Grabbing an interactive node
 
    srun --partition=ind-shared --pty --nodes=1 --ntasks-per-node=1 --mem=50G -t 24:00:00 --wait=0 --account=ddp268 /bin/bash
 
-``--pty`` is what specifically makes this treated as an interactive session
+:code:`--pty` is what specifically makes this treated as an interactive session
 
 Running a script noninteractively with SLURM
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -88,7 +88,7 @@ than necessary. Some notes:
 
 * The output flag determines the file that stdout is written to. This must be a file, not a directory.
-  You can use some placeholders in the output location such as `%x` for job name and `%j` for job id.
+  You can use some placeholders in the output location such as :code:`%x` for job name and :code:`%j` for job id.
 * Use the error flag to choose stderr's output location, if not specified goes to the output location.
 * There may be an optional shebang line at the start of the file, but no blank or other lines between
   the beginning and the :code:`#SBATCH` lines
@@ -134,7 +134,7 @@ containers in a secure manner on cluster computers.
 
 Terminology:
 
-* `SingularityCE `_ is open source
+* `SingularityCE `__ is open source
 * Sylabs is the company that owns SingularityPro which is just a supported version
   of singularity
 
@@ -204,7 +204,7 @@ Singularity run tips
 
 * To run a shell script: :code:`singularity exec --containall docker:// /bin/bash -c "