From 430d75cba5f6b1efbed39156cb75c5a75a4516ae Mon Sep 17 00:00:00 2001
From: nsheff
Date: Fri, 24 May 2024 17:04:18 -0400
Subject: [PATCH] clean up some docs

---
 docs/bbconf/README.md                  | 16 +++++++---------
 docs/citations.md                      |  6 +++---
 docs/geniml/manuscripts/gharavi2021.md |  7 +++++++
 docs/geniml/manuscripts/gharavi2024.md | 14 ++++++++++++++
 mkdocs.yml                             |  6 ++++--
 5 files changed, 35 insertions(+), 14 deletions(-)
 create mode 100644 docs/geniml/manuscripts/gharavi2021.md
 create mode 100644 docs/geniml/manuscripts/gharavi2024.md

diff --git a/docs/bbconf/README.md b/docs/bbconf/README.md
index d213de3..9e237cf 100644
--- a/docs/bbconf/README.md
+++ b/docs/bbconf/README.md
@@ -8,7 +8,7 @@
 [![coverage](https://coverage-badge.samuelcolvin.workers.dev/databio/bbconf.svg)](https://coverage-badge.samuelcolvin.workers.dev/redirect/databio/bbconf)
 
 
-*BEDBASE* project configuration package (agent)
+*BEDbase* project configuration package (agent)
 
 ## What is this?
 
@@ -18,18 +18,16 @@ It formalizes communication pathways for pipelines and downstream tools, ensurin
 
 ---
 
-**Documentation**: https://docs.bedbase.org/bedboss
-
-**Source Code**: https://github.com/databio/bbconf
+## Installation
 
----
+To install `bbconf` use this command:
 
-## Installation
-To install `bbclient` use this command:
 ```
-pip install bbclient
+pip install bbconf
 ```
-or install the latest version from the GitHub repository:
+
+or, install the latest version from the GitHub repository:
+
 ```
 pip install git+https://github.com/databio/bbconf.git
 ```
diff --git a/docs/citations.md b/docs/citations.md
index afaec36..4978637 100644
--- a/docs/citations.md
+++ b/docs/citations.md
@@ -11,12 +11,12 @@ Thanks for citing us! If you use BEDbase, geniml, or their components in your re
 
 | If you use... | Please cite ... |
 |---------------|-----------------|
-| `geniml` region set evaluations | Zheng et al. (2023) *bioRxiv* |
 | `region2vec` embeddings | Gharavi et al. (2021) *Bioinformatics* |
-| `bedspace` search and embeddings | Gharavi et al. (2023) *bioRxiv* |
+| `bedspace` search and embeddings | Gharavi et al. (2024) *Bioengineering* |
+| `scEmbed` single-cell embedding framework | LeRoy et al. (2023) *bioRxiv* |
+| `geniml` region set evaluations | Zheng et al. (2023) *bioRxiv* |
 | `geniml hmm` module | Rymuza et al. (2023) *bioRxiv* |
 | `bedbase` database | Unpublished |
-| `scEmbed` single-cell embedding framework | LeRoy et al. (2023) *bioRxiv* |
 
 
 
diff --git a/docs/geniml/manuscripts/gharavi2021.md b/docs/geniml/manuscripts/gharavi2021.md
new file mode 100644
index 0000000..7aea002
--- /dev/null
+++ b/docs/geniml/manuscripts/gharavi2021.md
@@ -0,0 +1,7 @@
+# Embeddings of genomic region sets capture rich biological associations in low dimensions
+
+## Relevant tutorials
+
+This paper was our first publication showing how to build and evaluate region set embeddings using region-set2vec, based on word2vec.
+
+See: [train Region2Vec embeddings](../tutorials/region2vec.md)
\ No newline at end of file
diff --git a/docs/geniml/manuscripts/gharavi2024.md b/docs/geniml/manuscripts/gharavi2024.md
new file mode 100644
index 0000000..bf8ee11
--- /dev/null
+++ b/docs/geniml/manuscripts/gharavi2024.md
@@ -0,0 +1,14 @@
+# Joint representation learning for retrieval and annotation of genomic interval sets
+
+Paper: [Manuscript at *Bioengineering*](https://dx.doi.org/10.3390/bioengineering11030263)
+
+## Abstract
+
+As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
+
+## Relevant tutorials
+
+This paper trained BEDspace models (using StarSpace with BED files). See these tutorials:
+
+- [How to use BEDSpace to jointly embed regions and metadata](../tutorials/bedspace.md)
+
diff --git a/mkdocs.yml b/mkdocs.yml
index d9fc149..63b40cf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -107,8 +107,10 @@ nav:
       - Create evaluation dataset with bedshift: geniml/tutorials/bedshift-evaluation-guide.md
       - Create search backend: geniml/tutorials/text2bednn-search-interface.md
     - Reference:
-      - Manuscripts:
-        - Rymuza2024: geniml/manuscripts/rymuza2024.md
+      - Published manuscripts:
+        - Gharavi 2021: geniml/manuscripts/gharavi2021.md
+        - Rymuza 2024: geniml/manuscripts/rymuza2024.md
+        - Gharavi 2024: geniml/manuscripts/gharavi2024.md
       - How to cite: citations.md
       - API documentation: geniml/autodoc_build/geniml.md
       - Support: geniml/support.md