Update documentation (#85)
MrHedmad authored Mar 19, 2024
2 parents 32a2355 + 1044421 commit 7be0281
Showing 11 changed files with 96 additions and 82 deletions.
12 changes: 6 additions & 6 deletions docs/src/about.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
# About

This package aggregates a series of meta information about Kerblam!.
This page aggregates a series of meta information about Kerblam!.

### License
The project is licensed with the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE).
Read [here](https://choosealicense.com/licenses/mit/) for the [choose a license](https://choosealicense.com)
entry of the license.

### Citing
If you want to cite Kerblam!, provide a link to the Github repository or use
the following Zenodo DOI: [doi.org/10.5281/zenodo.10664806](https://zenodo.org/doi/10.5281/zenodo.10664806)
If you want or need to cite Kerblam!, provide a link to the Github repository or use
the following Zenodo DOI: [doi.org/10.5281/zenodo.10664806](https://zenodo.org/doi/10.5281/zenodo.10664806).

### Naming
This project is named after the fictitious online shop/delivery company in
@@ -22,14 +22,14 @@ The Kerblam! logo is written in the [Kwark Font](https://www.1001fonts.com/kwark

This book is rendered by [`mdbook`](https://github.com/rust-lang/mdBook), and
is written as a series of markdown files.
Its source code is available in [the Kerblam! repo](https://github.com/MrHedmad/kerblam).
Its source code is available in [the Kerblam! repo](https://github.com/MrHedmad/kerblam)
under the `./docs/` folder.

The book hosted online always refers to the
[latest Kerblam! release](https://github.com/MrHedmad/kerblam/releases).

If you are looking for older or newer versions of this book, you should
read the markdown files directly [on Github](https://github.com/MrHedmad/kerblam/tree/main/docs),
where you can select which tag to view from the top bar, or clone the repository
locally, checkout to the commit you like, and rebuiding from source.
locally, checkout to the commit you like, and rebuild from source.
If you're interested, read [the development guide](dev/contributing.html) to
learn more.
9 changes: 6 additions & 3 deletions docs/src/install.md
@@ -4,14 +4,15 @@ You have a few options when installing Kerblam!.
### Requirements
Currently, Kerblam! only supports mac OS (both intel and apple chips) and GNU linux.
Other unix/linux versions *may* work, but are untested.
It also uses binaries that it assumes are already installed:
It also uses binaries that it assumes are already installed and visible from your `$PATH`:
- GNU `make`: [gnu.org/software/make](https://gnu.org/software/make);
- `git`: [git-scm.com](https://git-scm.com/)
- Docker (as `docker`) and/or Podman (as `podman`):
[docker.com](https://docker.com/) and/or [podman.io](https://podman.io);
- `tar`: [gnu.org/software/tar](https://www.gnu.org/software/tar/).
- `bash`: [gnu.org/software/bash](https://www.gnu.org/software/bash/).

If you can use `git`, `make`, `tar` and `docker` or `podman` from your CLI,
If you can use `git`, `make`, `tar`, `bash` and `docker` or `podman` from your CLI,
you're good to go!
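
A quick way to check the requirements above is to ask the shell whether each binary resolves from your `$PATH`. The following is only a sketch: `missing_tools` is a hypothetical helper written for this example, not something shipped with Kerblam!.

```bash
# Print the name of every tool in the argument list that is not on $PATH.
# `missing_tools` is a hypothetical helper, not a Kerblam! command.
missing_tools() {
    for tool in "$@"; do
        # `command -v` succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# All of these are required.
for tool in $(missing_tools git make tar bash); do
    echo "Missing required tool: $tool"
done

# Only one container runtime is needed: docker or podman.
if [ -n "$(missing_tools docker)" ] && [ -n "$(missing_tools podman)" ]; then
    echo "Missing container runtime: install docker or podman"
fi
```

If the script prints nothing, every listed tool was found.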

Most if not all of these tools come pre-packaged in most linux distros.
@@ -27,8 +28,10 @@ You can always install or update to the latest version with:
```bash
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh
```
Be warned that the above command executes a script downloaded from the internet.
You can [click here](https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh)
to download the same installer script and inspect it before you run it, if you'd like.
or manually follow the fetched URL above to download the same installer script
and inspect it before you run it, if you'd like.

### Install from source
If you want to install the latest version from source, install Rust and `cargo`, then run:
17 changes: 4 additions & 13 deletions docs/src/landing.md
@@ -1,25 +1,16 @@
![If you want it, kerblam it!](https://gist.github.com/MrHedmad/a5c719bbc22d982425fcd23f9e1d448c/raw/53833514b7701501c1c6b696156504b39b668f81/kerblam.dev_fig.png)

![If you want it, Kerblam it!](https://raw.githubusercontent.com/MrHedmad/kerblam/main/docs/images/logo.png)

**Kerblam! is a Rust command line tool to manage the execution of scientific data
Kerblam! is a Rust command line tool to manage the execution of scientific data
analysis, where having reproducible results and sharing the executed pipelines
is important. It makes it easy to write multiple analysis pipelines and select
what data is analysed.**
what data is analysed.

With Kerblam! your analyses will be less bloated, more organized, and more
reproducible.

##### Click on the images to see the same videos on asciinema.org!

[![If you see this, open an issue. The GIF is dead.](https://s9.gifyu.com/images/SFNkp.gif)](https://asciinema.org/a/641448)

After you execute your pipelines, you can export them to others for reproduction:

[![If you see this, open an issue. The GIF is dead.](https://s9.gifyu.com/images/SF6tA.gif)](https://asciinema.org/a/643038)

Kerblam! is a Free and Open Source Software, hosted on Github at
[MrHedmad/kerblam](https://github.com/MrHedmad/kerblam).
The code is licensed with the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE).
The code is licensed under the [MIT License](https://github.com/MrHedmad/kerblam/blob/main/LICENSE).

Use the sidebar to jump to a specific section.
If you have never used Kerblam! before, you can read the documentation from start
43 changes: 22 additions & 21 deletions docs/src/quickstart.md
@@ -8,29 +8,27 @@ Kerblam! is a *project manager*. It helps you write clean, concise data analysis
pipelines, and takes care of chores for you.

Every Kerblam! project has a `kerblam.toml` file in its root.
Kerblam! looks for files in different folders relative to the `kerblam.toml`
file to manage your project.
When Kerblam! looks for files, it does it relative to the position of the
`kerblam.toml` file and in specific, pre-determined folders.
This helps you keep everything in its place, so that others that are unfamiliar
with your project can understand it if they ever need to review it.
with your project can understand it if they ever need to look at it.

These folders are as follows:
- `kerblam.toml`: This file contains the options for Kerblam!.
It is often empty for simple projects.
- `data/`: Where all the project's data is saved.
These folders, relative to where the `kerblam.toml` file is, are:
- `./data/`: Where all the project's data is saved.
Intermediate data files are specifically saved here.
- `data/in/`: Input data files are saved and should be looked for in here.
- `data/out/`: Output data files are saved and should be looked for in here.
- `src/`: Code you want to be executed should be saved here.
- `src/pipes/`: Makefiles and bash build scripts should be saved here.
- `./data/in/`: Input data files are saved and should be looked for in here.
- `./data/out/`: Output data files are saved and should be looked for in here.
- `./src/`: Code you want to be executed should be saved here.
- `./src/pipes/`: Makefiles and bash build scripts should be saved here.
They have to be written as if they were saved in `./`.
- `src/dockerfiles/`: Container build scripts should be saved here.
- `./src/dockerfiles/`: Container build scripts should be saved here.
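
The default layout listed above can be created by hand with a few shell commands. This is a minimal sketch: `make_layout` is a hypothetical helper for illustration only, and it assumes you keep the default paths rather than reconfiguring them.

```bash
# Create the default Kerblam! folder structure inside the given directory.
# `make_layout` is a hypothetical helper, not a Kerblam! command.
make_layout() {
    project="$1"
    mkdir -p "$project/data/in" "$project/data/out" \
             "$project/src/pipes" "$project/src/dockerfiles"
    # Every Kerblam! project has a kerblam.toml file in its root.
    touch "$project/kerblam.toml"
}
```

For example, `make_layout my_project` would create `my_project/kerblam.toml` next to the `data/` and `src/` trees.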

> Any sub-folder of one of these specific folders (with the exception of
> `src/pipes` and `src/dockerfiles`) contains the same type of files as the
> parent directory. For instance, `data/in/fastq` is treated as if it contains
> input data by Kerblam! just as the `data/in` directory is.

You can configure almost all of these paths in the `kerblam.toml`, if you so desire.
You can configure almost all of these paths in the `kerblam.toml` file, if you so desire.
This is mostly done for compatibility reasons with non-kerblam! projects.
New projects that wish to use Kerblam! are strongly encouraged to follow the
standard folder structure, however.
@@ -40,21 +38,21 @@ standard folder structure, however.
> your choices in the `kerblam.toml` file.

If you want to convert an existing project to use Kerblam!, you can take a look
at [the `kerblam.toml` section of the documentation](kerblam.toml.html).
at [the `kerblam.toml` section of the documentation](kerblam.toml.html) to
learn how to configure these paths.

If you follow this standard (or you write proper configuration), Kerblam! gives
you a bunch of benefits:
If you follow this standard (or you write proper configuration), you can use
Kerblam! to do a bunch of things:
- You can run pipelines written in `make` or arbitrary shell files in `src/pipes/`
as if you ran them from the root directory of your project by simply using
`kerblam run <pipe>`.
`kerblam run <pipe>`;
- You can wrap your pipelines in docker containers by just writing new
dockerfiles in `src/dockerfiles`, with essentially just the installation
of the dependencies.
of the dependencies, letting Kerblam! take care of the rest;
- If you have wrapped up pipelines, you can export them for later execution
(or to send them to a reviewer) with `kerblam package <pipe>` without needing
to edit your dockerfiles.
- If you have a package from someone else, you can run it with
`kerblam replay`.
to edit your dockerfiles;
- If you have a package from someone else, you can run it with `kerblam replay`.
- You can fetch remote data from the internet with `kerblam data fetch`, see
how much disk space your project's data is using with `kerblam data` and
safely cleanup all the files that are not needed to re-run your project with
@@ -67,3 +65,6 @@ The rest of this tutorial walks you through every feature.

I hope you enjoy Kerblam! and that it makes your projects easier to understand,
run and reproduce!

> If you like Kerblam!, please consider [leaving a star on Github](https://github.com/MrHedmad/kerblam/stargazers).
> Thank you for supporting Kerblam!
16 changes: 12 additions & 4 deletions docs/src/tutorial/dockerfiles.md
@@ -21,20 +21,28 @@ using `COPY . .`.
### The `data` directory is excluded from packages
If you have a `COPY . .` directive in the dockerfile, it will behave differently
when you `kerblam run` versus when you `kerblam package`.
In a run, **the current, local directory is used as-is as a build context**.

When you run `kerblam package`, Kerblam! will create a temporary build context
with no input data.
This is what you want: Kerblam! needs to separately package your (precious)
input data on the side, and copy in the container only code and other execution-specific
files.

In a run, the current local project directory is used as-is as a build context.
This means that the `data` directory will be copied over.
At the same time, Kerblam! will also *mount* the same directory to the running
container, so the copied files will be "overwritten" by the live mountpoint
while the container is running.

This generally means that copying the whole data directory is useless.
This generally means that copying the whole data directory is useless in a run,
and that it cannot be done during packaging.

Therefore, a best practice is to ignore the contents of the data folders in the
`.dockerignore` file.
This makes no difference while packaging containers but a big difference when
running them, as docker skips copying the useless data files.

To do this in a standard Kerblam! project, add this to your `.dockerignore`:
To do this in a standard Kerblam! project, simply add this to your `.dockerignore`:
```
# Ignore the intermediate/output directory
data
@@ -68,4 +76,4 @@ you place the `COPY . .` directive near the bottom of the dockerfile.
This way, you can essentially work exclusively in docker and never install
anything locally.

Kerblam! will name the pipelines as `<pipeline name>_kerblam_runtime`.
Kerblam! will name the containers for the pipelines as `<pipeline name>_kerblam_runtime`.
5 changes: 1 addition & 4 deletions docs/src/tutorial/intro_data.md
@@ -5,7 +5,7 @@ project.
If you follow open science guidelines, chances are that a lot of your data is
FAIR, and you can fetch it remotely.

Kerblam! is perfect to work with such data. The next sections outline what
Kerblam! is perfect to work with such data. The next tutorial sections outline what
Kerblam! can do to help you work with data.

Remember that Kerblam! recognizes what data is what by the location where you
@@ -31,6 +31,3 @@ The total size of all the files in the `./data/` folder is then broken down
between categories: the `Total` data size, how much data can be removed with
`kerblam data clean` or `kerblam data pack`, and how many files are specified
to be downloaded but are not yet present locally.

You can manipulate your data with `kerblam data` in several ways.
In the following sections we explain every one of these ways.
4 changes: 0 additions & 4 deletions docs/src/tutorial/package_data.md
@@ -12,7 +12,3 @@ non-remotely-available `.data/in` files and the files in `./data/out`.
You can also pass the `--cleanup` flag to also delete them after packing.

You can then share the data pack with others.

This is pretty useful if you have [packaged a pipeline](package_pipes.html) and
would like to send just the precious input data to whomever needs to reproduce
your work.
10 changes: 5 additions & 5 deletions docs/src/tutorial/package_pipes.md
@@ -5,23 +5,23 @@ It allows you to package everything needed to execute a pipeline in a docker
container and export it for execution later.

You must have a matching dockerfile for every pipeline that you want to package,
or Kerblam! wont know what to package your pipeline into.
or Kerblam! won't know what to package your pipeline into.

For example, say that you have a `process` pipe that uses `make` to run, and
requires both a remotely-downloaded `remote.txt` file and a local-only
`precious.txt` file.

If you execute
If you execute:
```bash
kerblam package process --tag my_process_package
```
Kerblam! will:
- Create a temporary context;
- Create a temporary build context;
- Copy all non-data files to the temporary context;
- Build the specified dockerfile as normal, but using this temporary context;
- Create a new `Dockerfile` that:
- Inherits from the image built before;
- Copies the Kerblam! executable to the root of the dockerfile;
- Copies the Kerblam! executable to the root of the container;
- Configure the default execution command to something suitable for execution
(just like `kerblam run` does, but "baked in").
- Build the docker container and tag it with `my_process_package`;
@@ -54,7 +54,7 @@ The responsibility of having the resulting docker work in the long-term is
up to you, not Kerblam!
For most cases, just having `kerblam run` work is enough for the resulting
package made by `kerblam package` to work, but depending on your docker
files this might not be the case.\
files this might not be the case.
Kerblam! does not test the resulting package - it's up to you to do that.
It's best to try your packaged pipeline once before shipping it off.

2 changes: 1 addition & 1 deletion docs/src/tutorial/pipe_docstrings.md
@@ -15,7 +15,7 @@ in the makefile/shellfile itself. Using the same example as above:
#? Calculate the sums of the input metrics
#?
#? The script takes the input metrics, then calculates the row-wise sums.
#? These are important since the metrics refer to the calculation.
#? These are useful since we can refer to this calculation later.

./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
cat $< | ./src/calc_sum.py > $@
41 changes: 28 additions & 13 deletions docs/src/tutorial/run.md
@@ -12,7 +12,6 @@ installed on your system this way, e.g. `snakemake` or `nextflow`.

Make has a special execution policy to allow it to work with as little boilerplate
as possible.

You can read more on Make [in the GNU Make book](https://www.gnu.org/software/make/manual/make.pdf).

`kerblam run` supports the following flags:
@@ -22,11 +21,19 @@ You can read more on Make [in the GNU Make book](https://www.gnu.org/software/ma
- `--local` (`-l`): Skip [running in a container](run_containers.html), if a
container is available, preferring a local run.

In short, `kerblam run` does something similar to this:
- Move your `pipe.sh` or `pipe.makefile` file in the root of the project,
under the name `executor`;
- Launch `make -f executor` or `bash executor` for you.

This is why pipelines are written as if they are executed in the root of the
project, because they are.
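
The two steps above can be imitated with a small shell function. This is a rough sketch of the idea, not Kerblam!'s actual code: `run_pipe` is a hypothetical name, it must be called from the project root, and it copies the pipe instead of moving it so that the original file is never at risk.

```bash
# Rough emulation of what `kerblam run <pipe>` is described as doing.
# `run_pipe` is a hypothetical helper; call it from the project root.
run_pipe() {
    local pipe="$1"
    local status=0
    if [ -f "src/pipes/$pipe.makefile" ]; then
        # Place the makefile in the project root under the name `executor`.
        cp "src/pipes/$pipe.makefile" executor
        make -f executor || status=$?
    elif [ -f "src/pipes/$pipe.sh" ]; then
        # Same idea for shell pipes, run with bash.
        cp "src/pipes/$pipe.sh" executor
        bash executor || status=$?
    else
        echo "No pipe named '$pipe' in src/pipes/" >&2
        return 1
    fi
    # Remove the temporary copy and report the pipe's exit code.
    rm -f executor
    return "$status"
}
```

Because `executor` sits in the project root while it runs, relative paths such as `./data/in/input.csv` inside the pipe resolve from the root.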

## Data Profiles - Running the same pipelines on different data

You can run your same pipelines, *as-is*, on different data thanks to data profiles.

By default, Kerblam! will use your `./data/in/` folder as-is when executing pipes.
By default, Kerblam! will use your untouched `./data/in/` folder when executing pipes.
If you want the same pipes to run on different sets of input data, Kerblam! can
temporarily swap out your real data with this 'substitute' data during execution.

@@ -37,8 +44,7 @@ to this alternative one.
However, you then have to maintain two essentially identical
pipelines, and you are prone to adding errors while you modify it (what if you
forget to change one reference to the original file?).
You can use `kerblam` to do the same, but in a declarative, less-error prone and
easy way.
You can use `kerblam` to do the same, but in an easy, declarative and less-error-prone way.

Define in your `kerblam.toml` file a new section under `data.profiles`:
```toml
@@ -51,15 +57,19 @@ You can then run the same makefile with the new data with:
```
kerblam run process_csv --profile alternate
```

> Paths under every profile section are relative to the input data directory,
> by default `data/in`.

Under the hood, Kerblam! will:
- Rename `input.csv` to `input.csv.original`;
- Move `different_input.csv` to `input.csv`;
- Run the analysis as normal;
- When the run ends (or the analysis crashes), Kerblam! will undo the move
and rename `input.csv.original` back to `input.csv`.
- When the run ends (it finishes, it crashes or you kill it), Kerblam! will undo both actions:
it moves `different_input.csv` back to its original place and
renames `input.csv.original` back to `input.csv`.

This effectively causes the makefile run with different input data in this
alternate run.
This effectively causes the makefile to run with different input data.
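
The swap-and-restore sequence described above can be sketched in plain shell. This is an illustration under stated assumptions, not Kerblam!'s implementation: `run_with_swap` is a hypothetical helper, and it only restores the files when the command finishes or fails, not when the shell itself is killed by a signal.

```bash
# Sketch of a profile swap: temporarily replace one input file with another.
# Usage: run_with_swap <input dir> <original> <substitute> <command...>
# `run_with_swap` is a hypothetical helper, not a Kerblam! command.
run_with_swap() {
    local dir="$1" original="$2" substitute="$3"
    shift 3
    local status=0
    # Set the real file aside, then put the substitute in its place.
    mv "$dir/$original" "$dir/$original.original"
    mv "$dir/$substitute" "$dir/$original"
    # Run the analysis; remember the exit code instead of aborting.
    "$@" || status=$?
    # Undo both actions, in reverse order.
    mv "$dir/$original" "$dir/$substitute"
    mv "$dir/$original.original" "$dir/$original"
    return "$status"
}
```

For instance, `run_with_swap data/in input.csv different_input.csv bash my_analysis.sh` (where `my_analysis.sh` is a made-up script) would let the script read `different_input.csv` under the name `input.csv`.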

> Careful that the *output* data will (most likely) be saved as the
> same file names as a "normal" run!
@@ -69,7 +79,7 @@ alternate run.
> If you really want to, use the `KERBLAM_PROFILE` environment variable
> described below and change the output paths accordingly.

This is most commonly useful to run the pipelines on test data that is faster to
Profiles are most commonly useful to run the pipelines on test data that is faster to
process or that produces pre-defined outputs. For example, you could define
something similar to:
```toml
@@ -82,17 +92,22 @@ And execute your test run with `kerblam run pipe --profile test`.
The profiles feature is used so commonly for test data that Kerblam! will
automatically make a `test` profile for you, swapping all input files in the
`./data/in` folder that start with `test_xxx` with their "regular" counterparts `xxx`.
For example, the profile above is redundant!\
For example, the profile above is redundant!

If you write a `[data.profiles.test]` profile yourself, Kerblam! will not
modify it in any way, effectively disabling the automatic test profile feature.

All file paths specified under the `profiles` tab must be relative to the `./data/in/`
folder.

Kerblam! tries its best to cleanup after itself (e.g. undo profiles,
delete temporary files, etc...) when you use `kerblam run`, even if the pipe
fails, and even if you kill your pipe with `CTRL-C`.

> If your pipeline is unresponsive to a `CTRL-C`, pressing it twice (two
> `SIGTERM` signals in a row) will kill Kerblam! instead, leaving the child
> process to be cleaned up by the OS and the (eventual) profile not cleaned up.
>
> This is to allow you to stop whatever Kerblam! or the pipe is doing in
> case of emergency.

Kerblam! will run the pipelines with the environment variable `KERBLAM_PROFILE`
set to whatever the name of the profile is.
In this way, you can detect from inside the pipeline if you are in a profile or not.
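
As a hedged illustration, a shell pipe could branch on the variable like this. The sketch assumes that `KERBLAM_PROFILE` is unset or empty when no profile is active, which this text does not state, and both `output_path` and the file names are hypothetical.

```bash
# Choose an output file name that depends on the active profile, if any.
# Assumption: KERBLAM_PROFILE is unset or empty when no profile is in use.
# `output_path` and the file names below are hypothetical.
output_path() {
    if [ -n "${KERBLAM_PROFILE:-}" ]; then
        echo "data/out/output_${KERBLAM_PROFILE}.csv"
    else
        echo "data/out/output.csv"
    fi
}
```

Writing profile runs to a separate file in this way keeps them from overwriting the results of a normal run.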