diff --git a/docs/bedboss/bedbase_configuration.yaml b/docs/bedboss/bedbase_configuration.yaml
deleted file mode 100644
index 081be1d..0000000
--- a/docs/bedboss/bedbase_configuration.yaml
+++ /dev/null
@@ -1,23 +0,0 @@
-path:
- pipeline_output_path: $BEDBOSS_OUTPUT_PATH # do not change it
- bedstat_dir: bedstat_output
- remote_url_base: null
- bedbuncher_dir: bedbucher_output
-database:
- host: localhost
- port: 5432
- password: docker
- user: postgres
- name: pep-db
- dialect: postgresql
- driver: psycopg2
-server:
- host: 0.0.0.0
- port: 8000
-remotes:
- http:
- prefix: https://data.bedbase.org/
- description: HTTP compatible path
- s3:
- prefix: s3://data.bedbase.org/
- description: S3 compatible path
\ No newline at end of file
diff --git a/docs/bedboss/changelog.md b/docs/bedboss/changelog.md
deleted file mode 100644
index 5026ad7..0000000
--- a/docs/bedboss/changelog.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Changelog
-
-This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
-
-## [0.1.0a1] - 2023-08-02
-### Added
-- Initial alpha release
diff --git a/docs/bedboss/how-to-configure.md b/docs/bedboss/how-to-configure.md
index d41da3a..27dda82 100644
--- a/docs/bedboss/how-to-configure.md
+++ b/docs/bedboss/how-to-configure.md
@@ -1,11 +1,11 @@
-# How to create bedbase config file (for bedstat)
+# How to create bedbase config file
### The bedbase config file is a YAML file with 5 parts:
-- path to output files
-- database credentials
-- qdrant credentials (can be skipped for if indexing is not needed)
-- server information
-- remote info (can be skipped for bedboss)
+- paths and vector models
+- relational database credentials
+- qdrant credentials
+- server information
+- remote info
### Example:
```yaml
@@ -14,7 +14,7 @@ path:
pipeline_output_path: /data/outputs
bedstat_dir: outputs/bedstat_output
bedbuncher_dir: outputs/bedbuncher_output
- region2vec: databio/r2v-ChIP-atlas-hg38
+ region2vec: databio/r2v-ChIP-atlas-hg38-v2
vec2vec: databio/v2v-MiniLM-v2-ATAC-hg38
text2vec: sentence-transformers/all-MiniLM-L6-v2
database:
@@ -23,8 +23,11 @@ database:
password: $POSTGRES_PASSWORD
user: $POSTGRES_USER
name: bedbase
+ bed_table: bedfiles
+ bedset_table: bedsets
+ relationship_table: bedset_bedfiles
dialect: postgresql
- driver: psycopg
+ driver: psycopg2
qdrant:
host: $QDRANT_HOST
port: 6333
@@ -35,26 +38,9 @@ server:
port: 8000
remotes:
http:
- prefix: http://data.bedbase.org/
- description: HTTP compatible path
- s3:
- prefix: s3://data.bedbase.org/
- description: S3 compatible path
-access_methods:
- http:
- type: "https"
- description: HTTP compatible path
prefix: https://data2.bedbase.org/
+ description: HTTP compatible path
s3:
- type: "s3"
- description: S3 compatible path
prefix: s3://data2.bedbase.org/
- local:
- type: "https"
- description: How to serve local files.
- prefix: /static/
-```
-
-Download example bedbase configuration file here: Example bedbase configuration file
-
-.
\ No newline at end of file
+ description: S3 compatible path
+```
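+
+The `$VARIABLE` entries above (e.g. `$POSTGRES_PASSWORD`, `$POSTGRES_USER`, `$QDRANT_HOST`) are expected to be populated from environment variables, so set them before running bedboss. A minimal sketch (values are placeholders):
+```bash
+export POSTGRES_USER=postgres
+export POSTGRES_PASSWORD=<your_postgres_password>
+export QDRANT_HOST=localhost
+```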
\ No newline at end of file
diff --git a/docs/bedboss/how-to-create-database.md b/docs/bedboss/how-to-create-database.md
index 08ee2f2..7dbac03 100644
--- a/docs/bedboss/how-to-create-database.md
+++ b/docs/bedboss/how-to-create-database.md
@@ -1,9 +1,14 @@
-# How to create bedbase database
+# How to create BEDbase database
-To run bedstat, bedbuncher and bedmbed we need to create postgres database.
+To run bedboss and upload data to the database, we need to create a Postgres database or use an existing one.
+---
+### To create a local database:
We are initiating postgres db in docker.
-If you don't have docker installed, you can install it with `sudo apt-get update && apt-get install docker-engine -y`.
+If you don't have docker installed, you can install it with
+```bash
+sudo apt-get update && sudo apt-get install docker-engine -y
+```
Now, create a persistent volume to house PostgreSQL data:
@@ -16,7 +21,9 @@ docker run -d --name bedbase-postgres -p 5432:5432 \
-e POSTGRES_PASSWORD=bedbasepassword \
-e POSTGRES_USER=postgres \
-e POSTGRES_DB=postgres \
- -v postgres-data:/var/lib/postgresql/data postgres:13
+ -v postgres-data:/var/lib/postgresql/data \
+ postgres:13
```
Now the Postgres container is running and we can run the pipelines.
+To connect to the database, update the credentials in your `bedbase_config.yaml` file to match the container.
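+
+For example, the `database` section of the bedbase config file matching the container above could look like this (a sketch; adjust the host and credentials to your own setup):
+```yaml
+database:
+  host: localhost
+  port: 5432
+  user: postgres
+  password: bedbasepassword
+  name: postgres
+  dialect: postgresql
+  driver: psycopg2
+```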
diff --git a/docs/bedboss/how-to-develop.md b/docs/bedboss/how-to-develop.md
new file mode 100644
index 0000000..37d900b
--- /dev/null
+++ b/docs/bedboss/how-to-develop.md
@@ -0,0 +1 @@
+### 🚧 docs in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/how-to-install-r-dependencies.md b/docs/bedboss/how-to-install-r-dependencies.md
index 9619097..ec54fc6 100644
--- a/docs/bedboss/how-to-install-r-dependencies.md
+++ b/docs/bedboss/how-to-install-r-dependencies.md
@@ -1,12 +1,7 @@
# How to install R dependencies
-1. Install R:
-2. Download this script: Install R dependencies
-3. Install dependencies by running this command in your terminal:
-
- ```
- Rscript installRdeps.R
- ```
-
-4. Run `bash_requirements_test.sh` to check if everything was installed correctly (located in test folder:
-[Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh))
+0. Install bedboss
+1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
+2. Download this script: [installRdeps.R](https://github.com/databio/bedboss/blob/dev/scripts/installRdeps.R)
+3. Install dependencies by running this command in your terminal: `Rscript installRdeps.R`
+4. Run `bedboss requirements-check` to check if everything was installed correctly.
diff --git a/docs/bedboss/how-to-run-from-python.md b/docs/bedboss/how-to-run-from-python.md
deleted file mode 100644
index c45c814..0000000
--- a/docs/bedboss/how-to-run-from-python.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# How to run bedboss as a Python API
-
-## Install bedboss
-
-```bash
-pip install bedboss
-```
-
-## Run bedboss all
-
-```python
-from bedboss import run_all
-
-run_all(
- sample_name="example_sample_name",
- input_file="example/path/to/input_file",
- input_type="bed",
- outfolder="example/path/to/outfolder",
- genome="hg38",
- bedbase_config="example/path/to/bedbase_config.yaml",
- # + another optional arguments
-)
-
-
-```
-
-
-## Run bedboss all-pep
-
-```python
-from bedboss import run_all_by_pep
-
-run_all_by_pep(
- pep="example/path/to/pep.yaml"
-)
-```
-
-## Run bedboss make
-
-```python
-from bedboss import BedMaker
-
-BedMaker(
- input_file="example/path/to/input_file",
- input_type="bed",
- output_bed="example/path/to/output_bed",
- output_bigbed="example/path/to/output_bigbed",
- sample_name="example_sample_name",
- genome="hg38",
-)
-
-```
-
-## Run bedboss stat
-
-```python
-from bedboss import bedstat
-
-bedstat(
- bedfile="example/path/to/bedfile.bed",
- bedbase_config="example/path/to/bedbase_config.yaml",
- genome="hg38",
- outfolder="example/path/to/outfolder",
-)
-
-```
-
-## Run bedboss qc
-
-```python
-from bedboss import bedqc
-
-bedqc(
- bedfile="example/path/to/bedfile.bed",
- outfolder="example/path/to/outfolder",
-)
-```
\ No newline at end of file
diff --git a/docs/bedboss/installRdeps.R b/docs/bedboss/installRdeps.R
deleted file mode 100644
index 6e6627e..0000000
--- a/docs/bedboss/installRdeps.R
+++ /dev/null
@@ -1,29 +0,0 @@
-.install_pkg = function(p, bioc=FALSE) {
- if(!require(package = p, character.only=TRUE)) {
- if(bioc) {
- BiocManager::install(pkgs = p)
- } else {
- install.packages(pkgs = p)
- }
- }
-}
-
-.install_pkg("R.utils")
-.install_pkg("BiocManager")
-.install_pkg("optparse")
-.install_pkg("devtools")
-.install_pkg("GenomicRanges", bioc=TRUE)
-.install_pkg("GenomicFeatures", bioc=TRUE)
-.install_pkg("ensembldb", bioc=TRUE)
-.install_pkg("LOLA", bioc=TRUE)
-.install_pkg("BSgenome", bioc=TRUE)
-.install_pkg("ExperimentHub", bioc=TRUE)
-.install_pkg("AnnotationHub", bioc=TRUE)
-.install_pkg("conflicted")
-if(!require(package = "GenomicDistributions", character.only=TRUE)) {
- devtools::install_github("databio/GenomicDistributions")
-}
-options(timeout=1000)
-if(!require(package = "GenomicDistributionsData", character.only=TRUE)) {
- install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.2.tar.gz", repos=NULL)
-}
diff --git a/docs/bedboss/templates/usage.template b/docs/bedboss/templates/usage.template
deleted file mode 100644
index d01300f..0000000
--- a/docs/bedboss/templates/usage.template
+++ /dev/null
@@ -1,22 +0,0 @@
-# Usage reference
-
-BEDboss is command-line tool-warehouse of 3 pipelines for genomic interval files
-
-BEDboss include: bedmaker, bedqc, bedstat. This pipelines can be run using next positional arguments:
-
-- `bedbase all`: Runs all pipelines one in order: bedmaker -> bedqc -> bedstat
-
-- `bedbase insert`: Runs all pipelines one in order by using PEP file and creates bedset: bedmaker -> bedqc -> bedstat -> bedbuncher
-
-- `bedbase make`: Creates Bed and BigBed files from other type of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]
-
-- `bedbase qc`: Runs Quality control for bed file (Works only with bed files)
-
-- `bedbase stat`: Runs statistics for bed and bigbed files.
-
-- `bedbase bunch`: Creates bedset from PEP file
-
-- `bedbase index`: Creates bed file vectors and inserts to qdrant database
-
-Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:
-
diff --git a/docs/bedboss/tutorials/bedbuncher_tutorial.md b/docs/bedboss/tutorials/bedbuncher_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedbuncher_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedclassifier_tutorial.md b/docs/bedboss/tutorials/bedclassifier_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedclassifier_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedindex_tutorial.md b/docs/bedboss/tutorials/bedindex_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedindex_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedmaker_tutorial.md b/docs/bedboss/tutorials/bedmaker_tutorial.md
new file mode 100644
index 0000000..76a748f
--- /dev/null
+++ b/docs/bedboss/tutorials/bedmaker_tutorial.md
@@ -0,0 +1,39 @@
+## BEDmaker
+
+The BEDmaker is a tool that allows you to convert various file types into BED format and bigBed format.
+Currently supported formats are:
+- bed
+- bigBed
+- bigWig
+- wig
+
+Before running the pipeline, you have to install bedboss and check that the bedmaker requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+
+### Run BEDmaker from command line
+```bash
+bedboss make \
+ --input-file path/to/input/file \
+    --input-type bed \
+ --output-folder path/to/output/dir \
+ --genome hg38 \
+    --sample-name sample1 \
+ --bigbed "path/to/bigbedfile.bigbed" # optional
+```
+
+### Run BEDmaker from within Python
+```python
+from bedboss.bedmaker.bedmaker import make_all
+
+make_all(
+ input_file="path/to/input/file",
+ input_type="bed",
+ output_folder="path/to/output/dir",
+ genome="hg38",
+ sample_name="sample1",
+ bigbed="path/to/bigbedfile.bigbed" # optional
+)
+```
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedqc_tutorial.md b/docs/bedboss/tutorials/bedqc_tutorial.md
new file mode 100644
index 0000000..3ff9a46
--- /dev/null
+++ b/docs/bedboss/tutorials/bedqc_tutorial.md
@@ -0,0 +1,32 @@
+## BEDqc
+
+BEDqc is a tool for quality control of BED files.
+Currently, it checks:
+- maximum file size,
+- maximum number of regions,
+- minimum region width threshold
+
+----
+### Run BEDqc from command line
+```bash
+bedboss qc \
+ --bedfile path/to/bedfile.bed \
+    --outfolder path/to/output/dir
+```
+
+---
+
+### Run BEDqc from within Python
+```python
+from bedboss import bedqc
+
+bedqc.run_bedqc(
+ bedfile="path/to/bedfile.bed",
+    outfolder="path/to/output/dir",
+ max_file_size=1000000, # optional
+ max_number_of_regions=1000, # optional
+ min_region_width=10, # optional
+)
+```
+
+If the file does not pass quality control, an error will be raised and this information will be added to the log file.
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedstat_tutorial.md b/docs/bedboss/tutorials/bedstat_tutorial.md
new file mode 100644
index 0000000..cc34146
--- /dev/null
+++ b/docs/bedboss/tutorials/bedstat_tutorial.md
@@ -0,0 +1,55 @@
+# BEDstats
+
+BEDstats is a tool that calculates the statistics of a BED file and provides plots to visualize the results.
+
+It produces BED file Statistics:
+
+- **GC content**. The average GC content of the region set.
+- **Number of regions**. The total number of regions in the BED file.
+- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
+- **Mean region width**. The average region width of the region set.
+- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
+- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
+- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
+- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
+- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
+- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
+- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
+
+---
+
+### Step 1: Install all dependencies
+
+First, you have to install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see the list of missing packages.
+
+
+### Step 2: Run bedstats
+
+#### Run BEDstats from command line
+```bash
+bedboss stats \
+ --bedfile path/to/bedfile.bed \
+ --outfolder path/to/output/dir \
+ --genome hg38 \
+ --bigbed "path/to/bigbedfile.bigbed" # optional
+```
+
+----
+#### Run BEDstats from within Python
+```python
+from bedboss import bedstat
+
+bedstat(
+ bedfile="path/to/bedfile.bed",
+ outfolder="path/to/output/dir",
+ genome="hg19",
+ bigbed="path/to/bigbedfile.bigbed", # optional
+)
+```
+
+After running BEDstats, the plots will be saved in the output directory and all statistics will be saved in the output file.
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/tutorial_all.md b/docs/bedboss/tutorials/tutorial_all.md
new file mode 100644
index 0000000..fb7d23c
--- /dev/null
+++ b/docs/bedboss/tutorials/tutorial_all.md
@@ -0,0 +1,61 @@
+## Bedboss all
+
+Bedboss `all` is intended to run on a single sample (BED file) and run all bedboss pipelines:
+bedmaker (+ bedclassifier + bedqc) -> bedstat. After that, it can optionally run bedbuncher, index the file in Qdrant, and upload metadata to PEPhub.
+
+### Step 1: Install all dependencies
+
+First, you have to install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see the list of missing packages.
+
+### Step 2: Create bedconf.yaml file
+To run bedboss, you need to create a bedconf.yaml file with the configuration.
+Detailed instructions are in the configuration section.
+
+### Step 3: Run bedboss
+To run bedboss, run the following command:
+```bash
+bedboss all \
+ --bedbase-config bedconf.yaml \
+ --input-file path/to/bedfile.bed \
+ --output-dir path/to/output/dir \
+ --input-type bed \
+    --genome hg38
+```
+
+The above command will run bedboss on the BED file and create bedstat output in the output directory.
+It includes only the required parameters. For more details, please check the usage section.
+
+By default, results will be uploaded only to the postgres database.
+- To upload results to PEPhub, you need to make the `databio` org available on GitHub, then log in to PEPhub, and add the `--upload-pephub` flag to the command.
+- To upload results to Qdrant, you need to add the `--upload-qdrant` flag to the command.
+- To upload the actual files to s3, you need to add the `--upload-s3` flag to the command. Before uploading, you have to set the necessary environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL (see the example after this list).
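+
+A minimal example of setting these variables before running the command (the values are placeholders for your own S3-compatible storage):
+```bash
+export AWS_ACCESS_KEY_ID=<your_access_key_id>
+export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
+export AWS_ENDPOINT_URL=<your_s3_endpoint_url>
+```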
+
+
+---
+
+### Run bedboss all from within Python
+
+To run bedboss all from within Python, instead of using the command line in step 3, you can use the following code:
+
+```python
+from bedboss import bedboss
+
+bedboss.run_all(
+ sample_name="sample1",
+ input_file="path/to/bedfile.bed",
+ input_type="bed",
+ outfolder="path/to/output/dir",
+ genome="hg38",
+ bedbase_config="bedconf.yaml",
+ narrowpeak=False, # optional
+    standardize=True, # optional
+ other_metadata=None, # optional
+ upload_pephub=True, # optional
+)
+```
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/tutorial_insert.md b/docs/bedboss/tutorials/tutorial_insert.md
new file mode 100644
index 0000000..2073bd3
--- /dev/null
+++ b/docs/bedboss/tutorials/tutorial_insert.md
@@ -0,0 +1,64 @@
+## Bedboss insert
+
+Bedboss insert is intended to run on each sample in a provided PEP.
+The PEP can be provided as a file or as a PEPhub registry path.
+
+
+### Step 1: Install all dependencies
+
+First, you have to install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see the list of missing packages.
+
+### Step 2: Create bedconf.yaml file
+To run bedboss insert, you need to create a bedconf.yaml file with the configuration.
+Detailed instructions are in the configuration section.
+
+### Step 3: Create PEP with bed files.
+A BEDboss PEP should contain the following fields: `sample_name`, `input_file`, `input_type`, and `genome` (see the example sample table below).
+Before running bedboss, you need to validate the provided PEP with the [bedboss_insert schema](https://schema.databio.org/?namespace=pipelines&schema=bedboss).
+The easiest way to do so is to use [PEPhub](https://pephub.databio.org/), where you can create a new PEP and validate it against the schema.
+Example PEP: [https://pephub.databio.org/databio/excluderanges?tag=bedbase](https://pephub.databio.org/databio/excluderanges?tag=bedbase)
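+
+A minimal sample table for such a PEP could look like this (a sketch; sample names, paths, and genome are placeholders):
+```csv
+sample_name,input_file,input_type,genome
+sample1,data/sample1.bed,bed,hg38
+sample2,data/sample2.bigBed,bigbed,hg38
+```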
+
+### Step 4: Run bedboss insert
+To run bedboss insert, run the following command:
+```bash
+bedboss insert \
+ --bedbase-config bedconf.yaml \
+ --pep path/to/pep.yaml \
+ --output-folder path/to/output/dir
+
+```
+
+The above command will run bedboss on each sample in the PEP and create bedstat output in the output directory.
+It includes only the required parameters. For more details, please check the usage section.
+
+By default, results will be uploaded only to the postgres database.
+- To upload results to PEPhub, you need to make the `databio` org available on GitHub, then log in to PEPhub, and add the `--upload-pephub` flag to the command.
+- To upload results to Qdrant, you need to add the `--upload-qdrant` flag to the command.
+- To upload the actual files to s3, you need to add the `--upload-s3` flag to the command. Before uploading, you have to set the necessary environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL.
+- To create a bedset from the provided PEP samples, you need to add the `--create-bedset` flag to the command.
+
+
+---
+
+### Run bedboss insert from within Python
+
+To run bedboss insert from within Python, instead of using the command line in step 4, you can use the following code:
+
+```python
+from bedboss import bedboss
+
+bedboss.insert_pep(
+ bedbase_config="bedconf.yaml",
+ pep="path/to/pep.yaml",
+ output_folder="path/to/output/dir",
+ upload_pephub=True, # optional
+ upload_qdrant=True, # optional
+ upload_s3=True, # optional
+ create_bedset=True # optional
+)
+```
\ No newline at end of file
diff --git a/docs/geniml/tutorials/text2bednn-search-interface.md b/docs/geniml/tutorials/text2bednn-search-interface.md
index e62a2e9..5c284ee 100644
--- a/docs/geniml/tutorials/text2bednn-search-interface.md
+++ b/docs/geniml/tutorials/text2bednn-search-interface.md
@@ -1,107 +1,42 @@
# How to create a natural language search backend for BED files
-The metadata of each BED file / region set is needed to build a natural language search backend. Embedding vectors of BED
-files are created by `Region2Vec`, and embedding vectors of metadata are created by [`SentenceTransformers`](https://www.sbert.net/). `Embed2EmbedNN`,
-a feedforward neural network (FNN), is trained to learn the embedding vectors of metadata from the embedding vectors of BED
-files. When a natural language query string is given, it will first be encoded to a vector by `SentenceTransformers`, and that
-vector will be encoded to a query vector by the FNN. `search` backend can perform k-nearest neighbors (KNN) search among the
-stored embedding vectors of BED files, and the BED files whose embedding vectors are closest to that query vector are the
-search results.
+The metadata of each BED file is needed to build a natural language search backend. Embedding vectors of BED files are created by
+`Region2Vec`, and metadata embedding vectors are created by [`FastEmbed`](https://github.com/qdrant/fastembed), [`SentenceTransformers`](https://www.sbert.net/), or other text embedding models.
-## Upload metadata and regions from files
-`RegionSetInfo` is a [`dataclass`](https://docs.python.org/3/library/dataclasses.html) that can store information about a BED file, which includes the file name, metadata, and the
-embedding vectors of region set and metadata. A list of RegionSetInfo can be created with a folder of BED files and a file of their
-metadata by `SentenceTransformers` and `Region2VecExModel`. The first column of metadata file must match the BED file names
-(the first column contains BED file names, or strings which BED file names start with), and is sorted by the first column. It can be
-sorted by a terminal command:
-```
-sort -k1 1 metadata_file > new_metadata_file
-```
-Example code to build a list of RegionSetInfo
-
-```python
-from geniml.text2bednn.utils import build_regionset_info_list_from_files
-from geniml.region2vec.main import Region2VecExModel
-from fastembed.embedding import FlagEmbedding
-
-# load Region2Vec from hugging face
-r2v_model = Region2VecExModel("databio/r2v-ChIP-atlas")
-# load natural language embedding model
-nl_model = FlagEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
-# folder of bed file
-bed_folder = "path/to/folders/of/bed/files"
-# path for metadata file
-metadata_path = "path/to/file/of/metadata"
-
-# list of RegionSetInfo
-ri_list = build_regionset_info_list_from_files(bed_folder, metadata_path, r2v_model, nl_model)
-```
+`Vec2VecFNN`, a feedforward neural network (FNN), is trained to map vectors from the embedding space of natural language to the embedding
+space of BED files. When a natural language query string is given, it will first be encoded to a vector by the text embedding model, and that
+vector will be encoded to a query vector by the FNN. The `search` backend can perform k-nearest neighbors (KNN) search among the stored BED
+file embedding vectors, and the BED files whose embedding vectors are closest to that query vector are the search results.
-## Upload metadata and regions from PEP
-A list of RegionSetInfo can also be created with a [`PEP`](https://pep.databio.org/en/latest/), which includes a `.csv` that stores metadata, and a `.yaml` as a a metadata validation
-framework.
-
-Example code to build a list of RegionSetInfo from a PEP:
-
-```python
-from geniml.text2bednn.utils import build_regionset_info_list_from_PEP
-
-# columns in the csv of PEP that contains metadata information
-columns = return [
- "tissue",
- "cell_line",
- "tissue_lineage",
- "tissue_description",
- "diagnosis",
- "sample_name",
- "antibody",
- ]
-
-# path to the yaml file
-yaml_path = "path/to/framework/yaml/file"
-
-ri_list_PEP = build_regionset_info_list_from_PEP(
- yaml_path,
- col_names,
- r2v_model,
- nl_model,
- )
-```
+## Store embedding vectors
+It is recommended to use `geniml.search.backend.HNSWBackend` to store embedding vectors. In the `HNSWBackend` that stores each BED file embedding
+vector, the `payload` should contain the name of the BED file. In the `HNSWBackend` that stores the embedding vector of each
+metadata string, the `payload` should contain the names of the BED files that have that string in their metadata.
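+
+A minimal sketch of loading vectors into these two backends, assuming the `HNSWBackend(local_index_path=...)` constructor and the positional `load(vectors, payloads)` call shown in earlier geniml docs (check the current geniml API for exact signatures; the vectors, dimensions, and file names below are placeholders):
+
+```python
+import numpy as np
+from geniml.search.backend import HNSWBackend  # import path assumed from the class path above
+
+# backend for BED file embeddings; each payload keeps the BED file name
+bed_backend = HNSWBackend(local_index_path="bed_index.bin")
+bed_vectors = np.random.rand(2, 100).astype(np.float32)  # placeholder Region2Vec embeddings
+bed_backend.load(bed_vectors, [{"name": "fileA.bed"}, {"name": "fileB.bed"}])
+
+# backend for metadata string embeddings; each payload keeps the names of matching BED files
+nl_backend = HNSWBackend(local_index_path="metadata_index.bin")
+meta_vectors = np.random.rand(2, 384).astype(np.float32)  # placeholder text embeddings
+nl_backend.load(meta_vectors, [{"files": ["fileA.bed"]}, {"files": ["fileA.bed", "fileB.bed"]}])
+```
+
+These `nl_backend` and `bed_backend` stores are what the training code below reads vector pairs from.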
## Train the model
-The list of RegionSetInfo can be split into 3 lists, which represent the training set, validating set, and testing set. The embedding
-vectors of metadata will be X, and the embedding vectors of the region set will be Y.
+Training a `Vec2VecFNN` needs x-y pairs of vectors (x: metadata embedding vector; y: BED embedding vector). A pair of a metadata embedding
+vector with the embedding vector of a BED file listed in its payload is a target pair; otherwise it is a non-target pair. Non-target pairs are sampled for the
+contrastive loss. Here is sample code to generate pairs from the storage backends and train the model:
```python
-from sklearn.model_selection import train_test_split
-from geniml.text2bednn.utils region_info_list_to_vectors
-from geniml.text2bednn.text2bednn import Vec2VecFNN
+from geniml.text2bednn.text2bednn import Vec2VecFNN
+from geniml.text2bednn.utils import vec_pairs  # note: import path assumed
+
+# initialize the model that maps metadata embeddings to BED embeddings
+v2v_torch_contrast = Vec2VecFNN()
+
+# target is an array of 1 (target) and -1 (non-target)
+X, Y, target = vec_pairs(
+ nl_backend, # HNSWBackend that store metadata embedding vectors
+ bed_backend, # HNSWBackend that store BED embedding vectors
+ "name", # key to file name in BED backend payloads
+ "files", # key to matching files in metadata backend payloads
+ True, # sample non-target pairs
+    1.0  # number of non-target pairs / number of target pairs = 1
+)
+
+# train without validation data
+v2v_torch_contrast.train(
+ X,
+ Y,
+ folder_path="path/to/folder/for/checkpoint",
+ loss_func="cosine_embedding_loss", # right now "cosine_embedding_loss" is the only contrastive loss function available
+ training_target=target,
+)
-# split the list of RegionInfoSet into different data set
-train_list, validate_list = train_test_split(ri_list, test_size=0.2)
-
-# get the embedding vectors
-train_X, train_Y = region_info_list_to_vectors(train_list)
-validate_X, validate_Y = region_info_list_to_vectors(validate_list)
-
-# train the neural network
-v2vnn = Vec2VecFNN()
-v2vnn.train(train_X, train_Y, validating_data=(validate_X, validate_Y), num_epochs=50)
-```
-
-## Load the vectors and information to search backend
-[`qdrant-client`](https://github.com/qdrant/qdrant-client) and [`hnswlib`](https://github.com/nmslib/hnswlib) can store vectors and perform k-nearest neighbors (KNN) search with a given query vector, so we
-created one database backend (`QdrantBackend`) and one local file backend (`HNSWBackend`) that can store the embedding
-vectors for KNN search. `HNSWBackend` will create a .bin file with given path, which saves the searching index.
-
-```python
-from geniml.text2bednn.utils import prepare_vectors_for_database
-
-# loading data to search backend
-embeddings, labels = prepare_vectors_for_database(ri_list)
-
-# search backend
-hnsw_backend = HNSWBackend(local_index_path="path/to/local/index.bin")
-hnsw_backend.load(embeddings, labels)
```
## text2bednn search interface
@@ -119,3 +54,26 @@ query_term = "human, kidney, blood"
# perform KNN search with K = 5, the id of stored vectors and the distance / similarity score will be returned
ids, scores = file_interface.nl_vec_search(query_term, 5)
```
+
+### Evaluate search performance
+With a dictionary that maps each query string to the ids of its relevant results in the search backend, in this format:
+```
+{
+    <query string>: [
+        <id of a relevant result>,
+ ...
+ ],
+ ...
+}
+```
+`TextToBedNNSearchInterface` can return [mean average precision](https://www.youtube.com/watch?v=pM6DJ0ZZee0&t=157s), [average AUC-ROC](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf), and [average R-Precision](https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_491). Here is example code:
+```python
+query_dict = {
+ "metadata string 1": [2, 3],
+ "metadata string 12": [1],
+ "metadata string 3": [2, 4, 5],
+ "metadata string 1": [0]
+}
+
+MAP, AUC, RP = file_interface.eval(query_dict)
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index 2a0cdae..621f1b1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -62,18 +62,20 @@ nav:
- Changelog: changelog.md
- BEDboss:
- BEDBoss: bedboss/README.md
- - Tutorials:
- - BEDbase tutorial: bedboss/code/bedbase-tutorial.md
- - BEDmaker tutorial: bedboss/code/bedmaker-tutorial.md
- - BEDqc tutorial: bedboss/code/bedqc-tutorial.md
- - BEDstat tutorial: bedboss/code/bedstat-tutorial.md
- - Everything tutorial: bedboss/code/tutorial-all.md
+      - Tutorials:
+ - BEDboss-all pipeline: bedboss/tutorials/tutorial_all.md
+ - BEDboss insert: bedboss/tutorials/tutorial_insert.md
+ - BEDmaker tutorial: bedboss/tutorials/bedmaker_tutorial.md
+ - BEDqc tutorial: bedboss/tutorials/bedqc_tutorial.md
+ - BEDstat tutorial: bedboss/tutorials/bedstat_tutorial.md
+          - BEDbuncher tutorial: bedboss/tutorials/bedbuncher_tutorial.md
+          - BEDindex tutorial: bedboss/tutorials/bedindex_tutorial.md
+          - BEDclassifier tutorial: bedboss/tutorials/bedclassifier_tutorial.md
- How to guides:
- - Configure bedboss: bedboss/how-to-configure.md
- - Run from Python: bedboss/how-to-run-from-python.md
- - Install R dependencies: bedboss/how-to-install-r-dependencies.md
- Create BEDbase database: bedboss/how-to-create-database.md
- - BEDboss insert: bedboss/bedboss-insert.md
+ - Create config file: bedboss/how-to-configure.md
+ - Install R dependencies: bedboss/how-to-install-r-dependencies.md
+        - Development process: bedboss/how-to-develop.md
- Reference:
- How to cite: citations.md
- Usage: bedboss/usage.md