diff --git a/docs/src/about.md b/docs/src/about.md index ae6db00..cdb5e41 100644 --- a/docs/src/about.md +++ b/docs/src/about.md @@ -1,6 +1,6 @@ # About -This package aggregates a series of meta information about Kerblam!. +This page aggregates a series of meta information about Kerblam!. ### License The project is licensed with the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE). @@ -8,8 +8,8 @@ Read [here](https://choosealicense.com/licenses/mit/) for the [choose a license] entry of the license. ### Citing -If you want to cite Kerblam!, provide a link to the Github repository or use -the following Zenodo DOI: [doi.org/10.5281/zenodo.10664806](https://zenodo.org/doi/10.5281/zenodo.10664806) +If you want or need to cite Kerblam!, provide a link to the Github repository or use +the following Zenodo DOI: [doi.org/10.5281/zenodo.10664806](https://zenodo.org/doi/10.5281/zenodo.10664806). ### Naming This project is named after the fictitious online shop/delivery company in @@ -22,14 +22,14 @@ The Kerblam! logo is written in the [Kwark Font](https://www.1001fonts.com/kwark This book is rendered by [`mdbook`](https://github.com/rust-lang/mdBook), and is written as a series of markdown files. -Its source code is available in [the Kerblam! repo](https://github.com/MrHedmad/kerblam). +Its source code is available in [the Kerblam! repo](https://github.com/MrHedmad/kerblam) +under the `./docs/` folder. The book hosted online always refers to the [latest Kerblam! release](https://github.com/MrHedmad/kerblam/releases). - If you are looking for older or newer versions of this book, you should read the markdown files directly [on Github](https://github.com/MrHedmad/kerblam/tree/main/docs), where you can select which tag to view from the top bar, or clone the repository -locally, checkout to the commit you like, and rebuiding from source. +locally, checkout to the commit you like, and rebuild from source. 
If you're interested, read [the development guide](dev/contributing.html) to learn more. diff --git a/docs/src/install.md b/docs/src/install.md index 81b6636..f355220 100644 --- a/docs/src/install.md +++ b/docs/src/install.md @@ -4,14 +4,15 @@ You have a few options when installing Kerblam!. ### Requirements Currently, Kerblam! only supports mac OS (both intel and apple chips) and GNU linux. Other unix/linux versions *may* work, but are untested. -It also uses binaries that it assumes are already installed: +It also uses binaries that it assumes are already installed and visible from your `$PATH`: - GNU `make`: [gnu.org/software/make](https://gnu.org/software/make); - `git`: [git-scm.com](https://git-scm.com/) - Docker (as `docker`) and/or Podman (as `podman`): [docker.com](https://docker.com/) and/or [podman.io](https://podman.io); - `tar`: [gnu.org/software/tar](https://www.gnu.org/software/tar/). +- `bash`: [gnu.org/software/bash](https://www.gnu.org/software/bash/). -If you can use `git`, `make`, `tar` and `docker` or `podman` from your CLI, +If you can use `git`, `make`, `tar`, `bash` and `docker` or `podman` from your CLI, you're good to go! Most if not all of these tools come pre-packaged in most linux distros. @@ -27,8 +28,10 @@ You can always install or update to the latest version with: ```bash curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh ``` +Be warned that the above command executes a script downloaded from the internet. You can [click here](https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh) -to download the same installer script and inspect it before you run it, if you'd like. +or manually follow the fetched URL above to download the same installer script +and inspect it before you run it, if you'd like. 
### Install from source If you want to install the latest version from source, install Rust and `cargo`, then run: diff --git a/docs/src/landing.md b/docs/src/landing.md index b7a321c..6e12750 100644 --- a/docs/src/landing.md +++ b/docs/src/landing.md @@ -1,25 +1,16 @@ +![If you want it, kerblam it!](https://gist.github.com/MrHedmad/a5c719bbc22d982425fcd23f9e1d448c/raw/53833514b7701501c1c6b696156504b39b668f81/kerblam.dev_fig.png) -![If you want it, Kerblam it!](https://raw.githubusercontent.com/MrHedmad/kerblam/main/docs/images/logo.png) - -**Kerblam! is a Rust command line tool to manage the execution of scientific data +Kerblam! is a Rust command line tool to manage the execution of scientific data analysis, where having reproducible results and sharing the executed pipelines is important. It makes it easy to write multiple analysis pipelines and select -what data is analysed.** +what data is analysed. With Kerblam! your analyses will be less bloated, more organized, and more reproducible. -##### Click on the images to see the same videos on asciinema.org! - -[![If you see this, open an issue. The GIF is dead.](https://s9.gifyu.com/images/SFNkp.gif)](https://asciinema.org/a/641448) - -After you execute your pipelines, you can export them to others for reproduction: - -[![If you see this, open an issue. The GIF is dead.](https://s9.gifyu.com/images/SF6tA.gif)](https://asciinema.org/a/643038) - Kerblam! is a Free and Open Source Software, hosted on Github at [MrHedmad/kerblam](https://github.com/MrHedmad/kerblam). -The code is licensed with the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE). +The code is licensed under the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE). Use the sidebar to jump to a specific section. If you have never used Kerblam! 
before, you can read the documentation from start diff --git a/docs/src/quickstart.md b/docs/src/quickstart.md index 99a8acd..d277525 100644 --- a/docs/src/quickstart.md +++ b/docs/src/quickstart.md @@ -8,29 +8,27 @@ Kerblam! is a *project manager*. It helps you write clean, concise data analysis pipelines, and takes care of chores for you. Every Kerblam! project has a `kerblam.toml` file in its root. -Kerblam! looks for files in different folders relative to the `kerblam.toml` -file to manage your project. +When Kerblam! looks for files, it does it relative to the position of the +`kerblam.toml` file and in specific, pre-determined folders. This helps you keep everything in its place, so that others that are unfamiliar -with your project can understand it if they ever need to review it. +with your project can understand it if they ever need to look at it. -These folders are as follows: -- `kerblam.toml`: This file contains the options for Kerblam!. - It is often empty for simple projects. -- `data/`: Where all the project's data is saved. +These folders, relative to where the `kerblam.toml` file is, are: +- `./data/`: Where all the project's data is saved. Intermediate data files are specifically saved here. -- `data/in/`: Input data files are saved and should be looked for in here. -- `data/out/`: Output data files are saved and should be looked for in here. -- `src/`: Code you want to be executed should be saved here. -- `src/pipes/`: Makefiles and bash build scripts should be saved here. +- `./data/in/`: Input data files are saved and should be looked for in here. +- `./data/out/`: Output data files are saved and should be looked for in here. +- `./src/`: Code you want to be executed should be saved here. +- `./src/pipes/`: Makefiles and bash build scripts should be saved here. They have to be written as if they were saved in `./`. -- `src/dockerfiles/`: Container build scripts should be saved here. 
+- `./src/dockerfiles/`: Container build scripts should be saved here. > Any sub-folder of one of these specific folders (with the exception of > `src/pipes` and `src/dockerfiles`) contains the same type of files as the > parent directory. For instance, `data/in/fastq` is treated as if it contains > input data by Kerblam! just as the `data/in` directory is. -You can configure almost all of these paths in the `kerblam.toml`, if you so desire. +You can configure almost all of these paths in the `kerblam.toml` file, if you so desire. This is mostly done for compatibility reasons with non-kerblam! projects. New projects that wish to use Kerblam! are strongly encouraged to follow the standard folder structure, however. @@ -40,21 +38,21 @@ standard folder structure, however. > your choices in the `kerblam.toml` file. If you want to convert an existing project to use Kerblam!, you can take a look -at [the `kerblam.toml` section of the documentation](kerblam.toml.html). +at [the `kerblam.toml` section of the documentation](kerblam.toml.html) to +learn how to configure these paths. -If you follow this standard (or you write proper configuration), Kerblam! gives -you a bunch of benefits: +If you follow this standard (or you write proper configuration), you can use +Kerblam! to do a bunch of things: - You can run pipelines written in `make` or arbitrary shell files in `src/pipes/` as if you ran them from the root directory of your project by simply using - `kerblam run `. + `kerblam run `; - You can wrap your pipelines in docker containers by just writing new dockerfiles in `src/dockerfiles`, with essentially just the installation - of the dependencies. + of the dependencies, letting Kerblam! take care of the rest; - If you have wrapped up pipelines, you can export them for later execution (or to send them to a reviewer) with `kerblam package ` without needing - to edit your dockerfiles. - - If you have a package from someone else, you can run it with - `kerblam replay`. 
+ to edit your dockerfiles; +- If you have a package from someone else, you can run it with `kerblam replay`. - You can fetch remote data from the internet with `kerblam data fetch`, see how much disk space your project's data is using with `kerblam data` and safely cleanup all the files that are not needed to re-run your project with @@ -67,3 +65,6 @@ The rest of this tutorial walks you through every feature. I hope you enjoy Kerblam! and that it makes your projects easier to understand, run and reproduce! + +> If you like Kerblam!, please consider [leaving a star on Github](https://github.com/MrHedmad/kerblam/stargazers). +> Thank you for supporting Kerblam! diff --git a/docs/src/tutorial/dockerfiles.md b/docs/src/tutorial/dockerfiles.md index 4faf6f0..9328f18 100644 --- a/docs/src/tutorial/dockerfiles.md +++ b/docs/src/tutorial/dockerfiles.md @@ -21,20 +21,28 @@ using `COPY . .`. ### The `data` directory is excluded from packages If you have a `COPY . .` directive in the dockerfile, it will behave differently when you `kerblam run` versus when you `kerblam package`. -In a run, **the current, local directory is used as-is as a build context**. + +When you run `kerblam package`, Kerblam! will create a temporary build context +with no input data. +This is what you want: Kerblam! needs to separately package your (precious) +input data on the side, and copy in the container only code and other execution-specific +files. + +In a run, the current local project directory is used as-is as a build context. This means that the `data` directory will be copied over. At the same time, Kerblam! will also *mount* the same directory to the running container, so the copied files will be "overwritten" by the live mountpoint while to container is running. -This generally means that copying the whole data directory is useless. +This generally means that copying the whole data directory is useless in a run, +and that it cannot be done during packaging. 
Therefore, a best practice is to ignore the contents of the data folders in the `.dockerignore` file. This makes no difference while packaging containers but a big difference when running them, as docker skips copying the useless data files. -To do this in a standard Kerblam! project, add this to your `.dockerignore`: +To do this in a standard Kerblam! project, simply add this to your `.dockerignore`: ``` # Ignore the intermediate/output directory data @@ -68,4 +76,4 @@ you place the `COPY . .` directive near the bottom of the dockerfile. This way, you can essentially work exclusively in docker and never install anything locally. -Kerblam! will name the pipelines as `_kerblam_runtime`. +Kerblam! will name the containers for the pipelines as `_kerblam_runtime`. diff --git a/docs/src/tutorial/intro_data.md b/docs/src/tutorial/intro_data.md index 023001b..2416177 100644 --- a/docs/src/tutorial/intro_data.md +++ b/docs/src/tutorial/intro_data.md @@ -5,7 +5,7 @@ project. If you follow open science guidelines, chances are that a lot of your data is FAIR, and you can fetch it remotely. -Kerblam! is perfect to work with such data. The next sections outline what +Kerblam! is perfect to work with such data. The next tutorial sections outline what Kerblam! can do to help you work with data. Remember that Kerblam! recognizes what data is what by the location where you @@ -31,6 +31,3 @@ The total size of all the files in the `./data/` folder is then broken down between categories: the `Total` data size, how much data can be removed with `kerblam data clean` or `kerblam data pack`, and how many files are specified to be downloaded but are not yet present locally. - -You can manipulate your data with `kerblam data` in several ways. -In the following sections we explain every one of these ways. 
diff --git a/docs/src/tutorial/package_data.md b/docs/src/tutorial/package_data.md index 362c162..3bc2c22 100644 --- a/docs/src/tutorial/package_data.md +++ b/docs/src/tutorial/package_data.md @@ -12,7 +12,3 @@ non-remotely-available `.data/in` files and the files in `./data/out`. You can also pass the `--cleanup` flag to also delete them after packing. You can then share the data pack with others. - -This is pretty useful if you have [packaged a pipeline](package_pipes.html) and -would like to send just the precious input data to whomever needs to reproduce -your work. diff --git a/docs/src/tutorial/package_pipes.md b/docs/src/tutorial/package_pipes.md index a30723e..5a2a730 100644 --- a/docs/src/tutorial/package_pipes.md +++ b/docs/src/tutorial/package_pipes.md @@ -5,23 +5,23 @@ It allows you to package everything needed to execute a pipeline in a docker container and export it for execution later. You must have a matching dockerfile for every pipeline that you want to package, -or Kerblam! wont know what to package your pipeline into. +or Kerblam! won't know what to package your pipeline into. For example, say that you have a `process` pipe that uses `make` to run, and requires both a remotely-downloaded `remote.txt` file and a local-only `precious.txt` file. -If you execute +If you execute: ```bash kerblam package process --tag my_process_package ``` Kerblam! will: -- Create a temporary context; +- Create a temporary build context; - Copy all non-data files to the temporary context; - Build the specified dockerfile as normal, but using this temporary context; - Create a new `Dockerfile` that: - Inherits from the image built before; - - Copies the Kerblam! executable to the root of the dockerfile; + - Copies the Kerblam! executable to the root of the container; - Configure the default execution command to something suitable for execution (just like `kerblam run` does, but "baked in"). 
- Build the docker container and tag it with `my_process_package`; @@ -54,7 +54,7 @@ The responsibility of having the resulting docker work in the long-term is up to you, not Kerblam! For most cases, just having `kerblam run` work is enough for the resulting package made by `kerblam package` to work, but depending on your docker -files this might not be the case.\ +files this might not be the case. Kerblam! does not test the resulting package - it's up to you to do that. It's best to try your packaged pipeline once before shipping it off. diff --git a/docs/src/tutorial/pipe_docstrings.md b/docs/src/tutorial/pipe_docstrings.md index 4f24525..7843c82 100644 --- a/docs/src/tutorial/pipe_docstrings.md +++ b/docs/src/tutorial/pipe_docstrings.md @@ -15,7 +15,7 @@ in the makefile/shellfile itself. Using the same example as above: #? Calculate the sums of the input metrics #? #? The script takes the input metrics, then calculates the row-wise sums. -#? These are important since the metrics refer to the calculation. +#? These are useful since we can refer to this calculation later. ./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py cat $< | ./src/calc_sum.py > $@ diff --git a/docs/src/tutorial/run.md b/docs/src/tutorial/run.md index eb87839..eb27b54 100644 --- a/docs/src/tutorial/run.md +++ b/docs/src/tutorial/run.md @@ -12,7 +12,6 @@ installed on your system this way, e.g. `snakemake` or `nextflow`. Make has a special execution policy to allow it to work with as little boilerplate as possible. - You can read more on Make [in the GNU Make book](https://www.gnu.org/software/make/manual/make.pdf). `kerblam run` supports the following flags: @@ -22,11 +21,19 @@ You can read more on Make [in the GNU Make book](https://www.gnu.org/software/ma - `--local` (`-l`): Skip [running in a container](run_containers.html), if a container is available, preferring a local run. 
+In short, `kerblam run` does something similar to this: +- Move your `pipe.sh` or `pipe.makefile` file to the root of the project, + under the name `executor`; +- Launch `make -f executor` or `bash executor` for you. + +This is why pipelines are written as if they are executed in the root of the +project, because they are. + ## Data Profiles - Running the same pipelines on different data You can run your same pipelines, *as-is*, on different data thanks to data profiles. -By default, Kerblam! will use your `./data/in/` folder as-is when executing pipes. +By default, Kerblam! will use your untouched `./data/in/` folder when executing pipes. If you want the same pipes to run on different sets of input data, Kerblam! can temporarily swap out your real data with this 'substitute' data during execution. @@ -37,8 +44,7 @@ to this alternative one. However, you then have to maintain two essentially identical pipelines, and you are prone to adding errors while you modify it (what if you forget to change one reference to the original file?). -You can use `kerblam` to do the same, but in a declarative, less-error prone and -easy way. +You can use `kerblam` to do the same, but in an easy, declarative and less-error-prone way. Define in your `kerblam.toml` file a new section under `data.profiles`: ```toml @@ -51,15 +57,19 @@ You can then run the same makefile with the new data with: ``` kerblam run process_csv --profile alternate ``` + +> Paths under every profile section are relative to the input data directory, +> by default `data/in`. + Under the hood, Kerblam! will: - Rename `input.csv` to `input.csv.original`; - Move `different_input.csv` to `input.csv`; - Run the analysis as normal; -- When the run ends (or the analysis crashes), Kerblam! will undo the move - and rename `input.csv.original` back to `input.csv`. +- When the run ends (it finishes, it crashes or you kill it), Kerblam! 
will undo both actions: + it moves `different_input.csv` back to its original place and + renames `input.csv.original` back to `input.csv`. -This effectively causes the makefile run with different input data in this -alternate run. +This effectively causes the makefile to run with different input data. > Careful that the *output* data will (most likely) be saved as the > same file names as a "normal" run! @@ -69,7 +79,7 @@ alternate run. > If you really want to, use the `KERBLAM_PROFILE` environment variable > described below and change the output paths accordingly. -This is most commonly useful to run the pipelines on test data that is faster to +Profiles are most commonly useful to run the pipelines on test data that is faster to process or that produces pre-defined outputs. For example, you could define something similar to: ```toml @@ -82,17 +92,22 @@ And execute your test run with `kerblam run pipe --profile test`. The profiles feature is used so commonly for test data that Kerblam! will automatically make a `test` profile for you, swapping all input files in the `./data/in` folder that start with `test_xxx` with their "regular" counterparts `xxx`. -For example, the profile above is redundant!\ +For example, the profile above is redundant! + If you write a `[data.profiles.test]` profile yourself, Kerblam! will not modify it in any way, effectively disabling the automatic test profile feature. -All file paths specified under the `profiles` tab must be relative to the `./data/in/` -folder. - Kerblam! tries its best to cleanup after itself (e.g. undo profiles, delete temporary files, etc...) when you use `kerblam run`, even if the pipe fails, and even if you kill your pipe with `CTRL-C`. +> If your pipeline is unresponsive to a `CTRL-C`, pressing it twice (two +> `SIGINT` signals in a row) will kill Kerblam! instead, leaving the child +> process to be cleaned up by the OS and the profile (if any) not cleaned up. 
+> +> This is to allow you to stop whatever Kerblam! or the pipe is doing in +> case of emergency. + Kerblam! will run the pipelines with the environment variable `KERBLAM_PROFILE` set to whatever the name of the profile is. In this way, you can detect from inside the pipeline if you are in a profile or not. diff --git a/docs/src/tutorial/run_containers.md b/docs/src/tutorial/run_containers.md index 67d2edf..cedafe3 100644 --- a/docs/src/tutorial/run_containers.md +++ b/docs/src/tutorial/run_containers.md @@ -1,10 +1,12 @@ # Containerized Execution of Pipelines -Kerblam! is primarely useful to ergonomically run pipelines inside containers. +Kerblam! can ergonomically run pipelines inside containers for you, making your work +easier to reproduce. If Kerblam! finds a container recipe (such as a Dockerfile) of the same name as one of your pipes in the `./src/dockerfiles/` folder -(e.g. `./src/dockerfiles/process_csv.dockerfile`), it will use it automatically -when you execute a pipeline (e.g. `kerblam run process_csv`). +(e.g. `./src/dockerfiles/process_csv.dockerfile` for the `./src/pipes/process_csv.makefile` pipe), +it will use it automatically when you execute a pipeline (e.g. `kerblam run process_csv`) +to run the pipeline inside a container. Specifically, it will do something similar to this: - Copy the pipeline to the root of the directory (as it does normally when you @@ -22,7 +24,7 @@ Kerblam! run your projects in docker environments, so you can tweak your dependencies and tooling (which might be different than your dev environment) and execute in a protected, reproducible environment. -Kerblam! will build the container images without moving the recipies around. +Kerblam! will build the container images without moving the recipes around (this is what the `-f` flag does). The `.dockerfile` in the build context (next to the `kerblam.toml`) is shared by all pipes. 
@@ -58,7 +60,7 @@ venv and simply run `kerblam run process_csv` to build the container and run your code inside it. -If you run `kerblam run` without a pipeline (or with the wrong pipeline), you +If you run `kerblam run` without a pipeline (or with a nonexistent pipeline), you will get the list of available pipelines. You can see at a glance what pipelines have an associated dockerfile as they are prepended with a little whale (🐋): @@ -120,7 +122,8 @@ If you change the working directory, let Kerblam! know by setting the workdir = "/app" ``` In this way, Kerblam! will run the containers with the proper paths. -**This option applies to *ALL* containers managed by Kerblam!** -There is currently no way to configure a different working directory for every -specific dockerfile. +> **This option applies to *ALL* containers managed by Kerblam!** +> +> There is currently no way to configure a different working directory for every +> specific dockerfile.