diff --git a/docs/bedboss/bedbase_configuration.yaml b/docs/bedboss/bedbase_configuration.yaml deleted file mode 100644 index 081be1d..0000000 --- a/docs/bedboss/bedbase_configuration.yaml +++ /dev/null @@ -1,23 +0,0 @@ -path: - pipeline_output_path: $BEDBOSS_OUTPUT_PATH # do not change it - bedstat_dir: bedstat_output - remote_url_base: null - bedbuncher_dir: bedbucher_output -database: - host: localhost - port: 5432 - password: docker - user: postgres - name: pep-db - dialect: postgresql - driver: psycopg2 -server: - host: 0.0.0.0 - port: 8000 -remotes: - http: - prefix: https://data.bedbase.org/ - description: HTTP compatible path - s3: - prefix: s3://data.bedbase.org/ - description: S3 compatible path \ No newline at end of file diff --git a/docs/bedboss/changelog.md b/docs/bedboss/changelog.md deleted file mode 100644 index 5026ad7..0000000 --- a/docs/bedboss/changelog.md +++ /dev/null @@ -1,7 +0,0 @@ -# Changelog - -This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format. - -## [0.1.0a1] - 2023-08-02 -### Added -- Initial alpha release diff --git a/docs/bedboss/how-to-configure.md b/docs/bedboss/how-to-configure.md index d41da3a..27dda82 100644 --- a/docs/bedboss/how-to-configure.md +++ b/docs/bedboss/how-to-configure.md @@ -1,11 +1,11 @@ -# How to create bedbase config file (for bedstat) +# How to create bedbase config file ### Bedbase config file is yaml file with 4 parts: -- path to output files -- database credentials -- qdrant credentials (can be skipped for if indexing is not needed) -- server information -- remote info (can be skipped for bedboss) +- paths and vector models +- relational database credentials +- qdrant credentials +- server information +- remote info ### Example: ```yaml @@ -14,7 +14,7 @@ path: pipeline_output_path: /data/outputs bedstat_dir: outputs/bedstat_output bedbuncher_dir: outputs/bedbuncher_output - region2vec: databio/r2v-ChIP-atlas-hg38 + region2vec: databio/r2v-ChIP-atlas-hg38-v2 vec2vec: databio/v2v-MiniLM-v2-ATAC-hg38 text2vec: sentence-transformers/all-MiniLM-L6-v2 database: @@ -23,8 +23,11 @@ database: password: $POSTGRES_PASSWORD user: $POSTGRES_USER name: bedbase + bed_table: bedfiles + bedset_table: bedsets + relationship_table: bedset_bedfiles dialect: postgresql - driver: psycopg + driver: psycopg2 qdrant: host: $QDRANT_HOST port: 6333 @@ -35,26 +38,9 @@ server: port: 8000 remotes: http: - prefix: http://data.bedbase.org/ - description: HTTP compatible path - s3: - prefix: s3://data.bedbase.org/ - description: S3 compatible path -access_methods: - http: - type: "https" - description: HTTP compatible path prefix: https://data2.bedbase.org/ + description: HTTP compatible path s3: - type: "s3" - description: S3 compatible path prefix: s3://data2.bedbase.org/ - local: - type: "https" - description: How to serve local files. - prefix: /static/ -``` - -Download example bedbase configuration file here: Example bedbase configuration file - -. \ No newline at end of file + description: S3 compatible path +``` \ No newline at end of file diff --git a/docs/bedboss/how-to-create-database.md b/docs/bedboss/how-to-create-database.md index 08ee2f2..7dbac03 100644 --- a/docs/bedboss/how-to-create-database.md +++ b/docs/bedboss/how-to-create-database.md @@ -1,9 +1,14 @@ -# How to create bedbase database +# How to create BEDbase database -To run bedstat, bedbuncher and bedmbed we need to create postgres database. 
+To run bedboss and upload data to the database, we need to create a PostgreSQL database or use an existing one.
+---
+### To create a local database:
 We are initiating postgres db in docker.
-If you don't have docker installed, you can install it with `sudo apt-get update && apt-get install docker-engine -y`.
+If you don't have docker installed, you can install it with:
+```bash
+sudo apt-get update && apt-get install docker-engine -y
+```
 Now, create a persistent volume to house PostgreSQL data:
@@ -16,7 +21,9 @@
 docker run -d --name bedbase-postgres -p 5432:5432 \
   -e POSTGRES_PASSWORD=bedbasepassword \
   -e POSTGRES_USER=postgres \
   -e POSTGRES_DB=postgres \
-  -v postgres-data:/var/lib/postgresql/data postgres:13
+  -v postgres-data:/var/lib/postgresql/data \
+  postgres:13
 ```
 Now we have created docker and can run pipelines.
+To connect to the database, update your credentials in the `bedbase_config.yaml` file.
diff --git a/docs/bedboss/how-to-develop.md b/docs/bedboss/how-to-develop.md
new file mode 100644
index 0000000..37d900b
--- /dev/null
+++ b/docs/bedboss/how-to-develop.md
@@ -0,0 +1 @@
+### 🚧 docs in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/how-to-install-r-dependencies.md b/docs/bedboss/how-to-install-r-dependencies.md
index 9619097..ec54fc6 100644
--- a/docs/bedboss/how-to-install-r-dependencies.md
+++ b/docs/bedboss/how-to-install-r-dependencies.md
@@ -1,12 +1,7 @@
 # How to install R dependencies
-1. Install R:
-2. Download this script: Install R dependencies
-3. Install dependencies by running this command in your terminal:
-
- ```
- Rscript installRdeps.R
- ```
-
-4. Run `bash_requirements_test.sh` to check if everything was installed correctly (located in test folder:
-[Bash requirement tests](https://github.com/bedbase/bedboss/blob/68910f5142a95d92c27ef53eafb9c35599af2fbd/test/bash_requirements_test.sh))
+0. Install bedboss
+1. Install R: https://cran.r-project.org/bin/linux/ubuntu/fullREADME.html
+2. Download this script: [installRdeps.R](https://github.com/databio/bedboss/blob/dev/scripts/installRdeps.R)
+3. Install dependencies by running this command in your terminal: `Rscript installRdeps.R`
+4. Run `bedboss requirements-check` to check if everything was installed correctly.
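+
+Taken together, a typical install session might look like the following sketch (illustrative only: the raw-file URL is derived from the script link above, and `pip install bedboss` is assumed to be the standard install route for the package):
+
+```bash
+# steps 0-1: install the bedboss Python package and R (see the CRAN link above)
+pip install bedboss
+
+# step 2: download the helper script that installs the R packages
+wget https://raw.githubusercontent.com/databio/bedboss/dev/scripts/installRdeps.R
+
+# step 3: install the R dependencies
+Rscript installRdeps.R
+
+# step 4: verify that everything is in place
+bedboss requirements-check
+```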
diff --git a/docs/bedboss/how-to-run-from-python.md b/docs/bedboss/how-to-run-from-python.md deleted file mode 100644 index c45c814..0000000 --- a/docs/bedboss/how-to-run-from-python.md +++ /dev/null @@ -1,77 +0,0 @@ -# How to run bedboss as a Python API - -## Install bedboss - -```bash -pip install bedboss -``` - -## Run bedboss all - -```python -from bedboss import run_all - -run_all( - sample_name="example_sample_name", - input_file="example/path/to/input_file", - input_type="bed", - outfolder="example/path/to/outfolder", - genome="hg38", - bedbase_config="example/path/to/bedbase_config.yaml", - # + another optional arguments -) - - -``` - - -## Run bedboss all-pep - -```python -from bedboss import run_all_by_pep - -run_all_by_pep( - pep="example/path/to/pep.yaml" -) -``` - -## Run bedboss make - -```python -from bedboss import BedMaker - -BedMaker( - input_file="example/path/to/input_file", - input_type="bed", - output_bed="example/path/to/output_bed", - output_bigbed="example/path/to/output_bigbed", - sample_name="example_sample_name", - genome="hg38", -) - -``` - -## Run bedboss stat - -```python -from bedboss import bedstat - -bedstat( - bedfile="example/path/to/bedfile.bed", - bedbase_config="example/path/to/bedbase_config.yaml", - genome="hg38", - outfolder="example/path/to/outfolder", -) - -``` - -## Run bedboss qc - -```python -from bedboss import bedqc - -bedqc( - bedfile="example/path/to/bedfile.bed", - outfolder="example/path/to/outfolder", -) -``` \ No newline at end of file diff --git a/docs/bedboss/installRdeps.R b/docs/bedboss/installRdeps.R deleted file mode 100644 index 6e6627e..0000000 --- a/docs/bedboss/installRdeps.R +++ /dev/null @@ -1,29 +0,0 @@ -.install_pkg = function(p, bioc=FALSE) { - if(!require(package = p, character.only=TRUE)) { - if(bioc) { - BiocManager::install(pkgs = p) - } else { - install.packages(pkgs = p) - } - } -} - -.install_pkg("R.utils") -.install_pkg("BiocManager") -.install_pkg("optparse") -.install_pkg("devtools") -.install_pkg("GenomicRanges", bioc=TRUE) -.install_pkg("GenomicFeatures", bioc=TRUE) -.install_pkg("ensembldb", bioc=TRUE) -.install_pkg("LOLA", bioc=TRUE) -.install_pkg("BSgenome", bioc=TRUE) -.install_pkg("ExperimentHub", bioc=TRUE) -.install_pkg("AnnotationHub", bioc=TRUE) -.install_pkg("conflicted") -if(!require(package = "GenomicDistributions", character.only=TRUE)) { - devtools::install_github("databio/GenomicDistributions") -} -options(timeout=1000) -if(!require(package = "GenomicDistributionsData", character.only=TRUE)) { - install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.2.tar.gz", repos=NULL) -} diff --git a/docs/bedboss/templates/usage.template b/docs/bedboss/templates/usage.template deleted file mode 100644 index d01300f..0000000 --- a/docs/bedboss/templates/usage.template +++ /dev/null @@ -1,22 +0,0 @@ -# Usage reference - -BEDboss is command-line tool-warehouse of 3 pipelines for genomic interval files - -BEDboss include: bedmaker, bedqc, bedstat. 
This pipelines can be run using next positional arguments:
-
-- `bedbase all`: Runs all pipelines one in order: bedmaker -> bedqc -> bedstat
-
-- `bedbase insert`: Runs all pipelines one in order by using PEP file and creates bedset: bedmaker -> bedqc -> bedstat -> bedbuncher
-
-- `bedbase make`: Creates Bed and BigBed files from other type of genomic interval files [bigwig|bedgraph|bed|bigbed|wig]
-
-- `bedbase qc`: Runs Quality control for bed file (Works only with bed files)
-
-- `bedbase stat`: Runs statistics for bed and bigbed files.
-
-- `bedbase bunch`: Creates bedset from PEP file
-
-- `bedbase index`: Creates bed file vectors and inserts to qdrant database
-
-Here you can see the command-line usage instructions for the main bedboss command and for each subcommand:
-
diff --git a/docs/bedboss/tutorials/bedbuncher_tutorial.md b/docs/bedboss/tutorials/bedbuncher_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedbuncher_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedclassifier_tutorial.md b/docs/bedboss/tutorials/bedclassifier_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedclassifier_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedindex_tutorial.md b/docs/bedboss/tutorials/bedindex_tutorial.md
new file mode 100644
index 0000000..f32c8b8
--- /dev/null
+++ b/docs/bedboss/tutorials/bedindex_tutorial.md
@@ -0,0 +1 @@
+### 🚧 Tutorial in progress! Stay tuned for updates. We're working hard to bring you valuable content soon!
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedmaker_tutorial.md b/docs/bedboss/tutorials/bedmaker_tutorial.md
new file mode 100644
index 0000000..76a748f
--- /dev/null
+++ b/docs/bedboss/tutorials/bedmaker_tutorial.md
@@ -0,0 +1,39 @@
+## BEDmaker
+
+The BEDmaker is a tool that allows you to convert various file types into BED and bigBed formats.
+Currently supported formats are:
+- bed
+- bigBed
+- bigWig
+- wig
+
+Before running the pipeline, you have to install bedboss and check that the BEDmaker requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+
+### Run BEDmaker from command line
+```bash
+bedboss make \
+    --input-file path/to/input/file \
+    --input-type bed \
+    --output-folder path/to/output/dir \
+    --genome hg38 \
+    --sample-name sample1 \
+    --bigbed "path/to/bigbedfile.bigbed" # optional
+```
+
+### Run BEDmaker from within Python
+```python
+from bedboss.bedmaker.bedmaker import make_all
+
+make_all(
+    input_file="path/to/input/file",
+    input_type="bed",
+    output_folder="path/to/output/dir",
+    genome="hg38",
+    sample_name="sample1",
+    bigbed="path/to/bigbedfile.bigbed"  # optional
+)
+```
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedqc_tutorial.md b/docs/bedboss/tutorials/bedqc_tutorial.md
new file mode 100644
index 0000000..3ff9a46
--- /dev/null
+++ b/docs/bedboss/tutorials/bedqc_tutorial.md
@@ -0,0 +1,32 @@
+## BEDqc
+
+BEDqc is a tool for quality control of BED files.
+For now, it checks:
+- maximum file size,
+- maximum number of regions,
+- minimum region width threshold
+
+----
+### Run BEDqc from command line
+```bash
+bedboss qc \
+    --bedfile path/to/bedfile.bed \
+    --outfolder path/to/output/dir
+```
+
+---
+
+### Run BEDqc from within Python
+```python
+from bedboss import bedqc
+
+bedqc.run_bedqc(
+    bedfile="path/to/bedfile.bed",
+    outfolder="path/to/output/dir",
+    max_file_size=1000000,  # optional
+    max_number_of_regions=1000,  # optional
+    min_region_width=10,  # optional
+)
+```
+
+If the file does not pass quality control, an error will be raised and this information will be added to the log file.
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/bedstat_tutorial.md b/docs/bedboss/tutorials/bedstat_tutorial.md
new file mode 100644
index 0000000..cc34146
--- /dev/null
+++ b/docs/bedboss/tutorials/bedstat_tutorial.md
@@ -0,0 +1,55 @@
+# BEDstats
+
+BEDstats is a tool that calculates the statistics of a BED file and provides plots to visualize the results.
+
+It produces the following BED file statistics:
+
+- **GC content**. The average GC content of the region set.
+- **Number of regions**. The total number of regions in the BED file.
+- **Median TSS distance**. The median absolute distance to the Transcription Start Sites (TSS).
+- **Mean region width**. The average region width of the region set.
+- **Exon percentage**. The percentage of the regions in the BED file that are annotated as exon.
+- **Intron percentage**. The percentage of the regions in the BED file that are annotated as intron.
+- **Promoter prox percentage**. The percentage of the regions in the BED file that are annotated as promoter-prox.
+- **Intergenic percentage**. The percentage of the regions in the BED file that are annotated as intergenic.
+- **Promoter core percentage**. The percentage of the regions in the BED file that are annotated as promoter-core.
+- **5' UTR percentage**. The percentage of the regions in the BED file that are annotated as 5'-UTR.
+- **3' UTR percentage**. The percentage of the regions in the BED file that are annotated as 3'-UTR.
+
+---
+
+### Step 1: Install all dependencies
+
+First, install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see a list of missing packages.
+
+
+### Step 2: Run bedstats
+
+#### Run BEDstats from command line
+```bash
+bedboss stats \
+    --bedfile path/to/bedfile.bed \
+    --outfolder path/to/output/dir \
+    --genome hg38 \
+    --bigbed "path/to/bigbedfile.bigbed" # optional
+```
+
+----
+#### Run BEDstats from within Python
+```python
+from bedboss import bedstat
+
+bedstat(
+    bedfile="path/to/bedfile.bed",
+    outfolder="path/to/output/dir",
+    genome="hg19",
+    bigbed="path/to/bigbedfile.bigbed",  # optional
+)
+```
+
+After running BEDstats, you will find the plots in the output directory, and all statistics will be saved in the output file.
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/tutorial_all.md b/docs/bedboss/tutorials/tutorial_all.md
new file mode 100644
index 0000000..fb7d23c
--- /dev/null
+++ b/docs/bedboss/tutorials/tutorial_all.md
@@ -0,0 +1,61 @@
+## Bedboss all
+
+Bedboss run-all is intended to run on a single sample (BED file) and run all bedboss pipelines:
+bedmaker (+ bedclassifier + bedqc) -> bedstat. After that, it can optionally run bedbuncher, index the file in Qdrant, and upload metadata to PEPhub.
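+
+Step 1 below assumes bedboss is already installed. If it is not, a typical setup looks like this minimal sketch (assuming the package is installed from PyPI under the name `bedboss`):
+
+```bash
+# install the bedboss Python package
+pip install bedboss
+
+# confirm that all pipeline requirements are satisfied (repeated in Step 1)
+bedboss requirements-check
+```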
+
+### Step 1: Install all dependencies
+
+First, install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see a list of missing packages.
+
+### Step 2: Create bedconf.yaml file
+To run bedboss, you need to create a bedconf.yaml file with the configuration.
+Detailed instructions are in the configuration section.
+
+### Step 3: Run bedboss
+To run bedboss, run the following command:
+```bash
+bedboss all \
+    --bedbase-config bedconf.yaml \
+    --input-file path/to/bedfile.bed \
+    --output-dir path/to/output/dir \
+    --input-type bed \
+    --genome hg38
+
+```
+
+The above command will run bedboss on the BED file and create a bedstat file in the output directory.
+It contains only the required parameters. For more details, please check the usage section.
+
+By default, results will be uploaded only to the PostgreSQL database.
+- To upload results to PEPhub, you need access to the `databio` org on GitHub; then log in to PEPhub and add the `--upload-pephub` flag to the command.
+- To upload results to Qdrant, add the `--upload-qdrant` flag to the command.
+- To upload the actual files to S3, add the `--upload-s3` flag to the command; before uploading, you have to set all necessary environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL.
+
+
+---
+
+### Run bedboss all from within Python
+
+To run bedboss all from within Python, instead of using the command line in step 3, you can use the following code:
+
+```python
+from bedboss import bedboss
+
+bedboss.run_all(
+    sample_name="sample1",
+    input_file="path/to/bedfile.bed",
+    input_type="bed",
+    outfolder="path/to/output/dir",
+    genome="hg38",
+    bedbase_config="bedconf.yaml",
+    narrowpeak=False,  # optional
+    standardize=True,  # optional
+    other_metadata=None,  # optional
+    upload_pephub=True,  # optional
+)
+```
\ No newline at end of file
diff --git a/docs/bedboss/tutorials/tutorial_insert.md b/docs/bedboss/tutorials/tutorial_insert.md
new file mode 100644
index 0000000..2073bd3
--- /dev/null
+++ b/docs/bedboss/tutorials/tutorial_insert.md
@@ -0,0 +1,64 @@
+## Bedboss insert
+
+Bedboss insert is intended to run on each sample in the provided PEP.
+The PEP can be provided as a file or as a PEPhub registry path.
+
+
+### Step 1: Install all dependencies
+
+First, install bedboss and check that all requirements are satisfied.
+To do so, run the following command:
+```bash
+bedboss requirements-check
+```
+If requirements are not satisfied, you will see a list of missing packages.
+
+### Step 2: Create bedconf.yaml file
+To run bedboss insert, you need to create a bedconf.yaml file with the configuration.
+Detailed instructions are in the configuration section.
+
+### Step 3: Create a PEP with BED files
+The bedboss PEP should contain the following fields: sample_name, input_file, input_type, genome.
+Before running bedboss, you need to validate the provided PEP against the [bedboss_insert schema](https://schema.databio.org/?namespace=pipelines&schema=bedboss).
+The easiest way to do so is to use [PEPhub](https://pephub.databio.org/), where you can create a new PEP and validate it against the schema.
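+
+For orientation, a minimal sample sheet for such a PEP might look like the sketch below (illustrative file names and values only; the linked schema and example PEP are the authoritative reference):
+
+```csv
+sample_name,input_file,input_type,genome
+sample1,data/sample1.bed.gz,bed,hg38
+sample2,data/sample2.bigWig,bigwig,hg38
+```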
+Example PEP: [https://pephub.databio.org/databio/excluderanges?tag=bedbase](https://pephub.databio.org/databio/excluderanges?tag=bedbase)
+
+### Step 4: Run bedboss insert
+To run bedboss insert, run the following command:
+```bash
+bedboss insert \
+    --bedbase-config bedconf.yaml \
+    --pep path/to/pep.yaml \
+    --output-folder path/to/output/dir
+
+```
+
+The above command will run bedboss on every BED file in the PEP and create bedstat files in the output directory.
+It contains only the required parameters. For more details, please check the usage section.
+
+By default, results will be uploaded only to the PostgreSQL database.
+- To upload results to PEPhub, you need access to the `databio` org on GitHub; then log in to PEPhub and add the `--upload-pephub` flag to the command.
+- To upload results to Qdrant, add the `--upload-qdrant` flag to the command.
+- To upload the actual files to S3, add the `--upload-s3` flag to the command; before uploading, you have to set all necessary environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL.
+- To create a bedset from the provided PEP files, add the `--create-bedset` flag to the command.
+
+
+---
+
+### Run bedboss insert from within Python
+
+To run bedboss insert from within Python, instead of using the command line in step 4, you can use the following code:
+
+```python
+from bedboss import bedboss
+
+bedboss.insert_pep(
+    bedbase_config="bedconf.yaml",
+    pep="path/to/pep.yaml",
+    output_folder="path/to/output/dir",
+    upload_pephub=True,  # optional
+    upload_qdrant=True,  # optional
+    upload_s3=True,  # optional
+    create_bedset=True  # optional
+)
+```
\ No newline at end of file
diff --git a/docs/geniml/tutorials/text2bednn-search-interface.md b/docs/geniml/tutorials/text2bednn-search-interface.md
index e62a2e9..5c284ee 100644
--- a/docs/geniml/tutorials/text2bednn-search-interface.md
+++ b/docs/geniml/tutorials/text2bednn-search-interface.md
@@ -1,107 +1,42 @@
 # How to create a natural language search backend for BED files
-The metadata of each BED file / region set is needed to build a natural language search backend. Embedding vectors of BED
-files are created by `Region2Vec`, and embedding vectors of metadata are created by [`SentenceTransformers`](https://www.sbert.net/). `Embed2EmbedNN`,
-a feedforward neural network (FNN), is trained to learn the embedding vectors of metadata from the embedding vectors of BED
-files. When a natural language query string is given, it will first be encoded to a vector by `SentenceTransformers`, and that
-vector will be encoded to a query vector by the FNN. `search` backend can perform k-nearest neighbors (KNN) search among the
-stored embedding vectors of BED files, and the BED files whose embedding vectors are closest to that query vector are the
-search results.
+The metadata of each BED file is needed to build a natural language search backend. BED file embedding vectors are created by
+`Region2Vec`, and metadata embedding vectors are created by [`FastEmbed`](https://github.com/qdrant/fastembed), [`SentenceTransformers`](https://www.sbert.net/), or other text embedding models.
-## Upload metadata and regions from files
-`RegionSetInfo` is a [`dataclass`](https://docs.python.org/3/library/dataclasses.html) that can store information about a BED file, which includes the file name, metadata, and the
-embedding vectors of region set and metadata.
A list of RegionSetInfo can be created with a folder of BED files and a file of their
-metadata by `SentenceTransformers` and `Region2VecExModel`. The first column of metadata file must match the BED file names
-(the first column contains BED file names, or strings which BED file names start with), and is sorted by the first column. It can be
-sorted by a terminal command:
-```
-sort -k1 1 metadata_file > new_metadata_file
-```
-Example code to build a list of RegionSetInfo
-
-```python
-from geniml.text2bednn.utils import build_regionset_info_list_from_files
-from geniml.region2vec.main import Region2VecExModel
-from fastembed.embedding import FlagEmbedding
-
-# load Region2Vec from hugging face
-r2v_model = Region2VecExModel("databio/r2v-ChIP-atlas")
-# load natural language embedding model
-nl_model = FlagEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
-# folder of bed file
-bed_folder = "path/to/folders/of/bed/files"
-# path for metadata file
-metadata_path = "path/to/file/of/metadata"
-
-# list of RegionSetInfo
-ri_list = build_regionset_info_list_from_files(bed_folder, metadata_path, r2v_model, nl_model)
-```
+`Vec2VecFNN`, a feedforward neural network (FNN), is trained to map vectors from the natural language embedding space to the embedding
+space of BED files. When a natural language query string is given, it will first be encoded to a vector by the text embedding model, and that
+vector will be encoded to a query vector by the FNN. The `search` backend can perform k-nearest neighbors (KNN) search among the stored BED
+file embedding vectors, and the BED files whose embedding vectors are closest to that query vector are the search results.
-## Upload metadata and regions from PEP
-A list of RegionSetInfo can also be created with a [`PEP`](https://pep.databio.org/en/latest/), which includes a `.csv` that stores metadata, and a `.yaml` as a a metadata validation
-framework.
-
-Example code to build a list of RegionSetInfo from a PEP:
-
-```python
-from geniml.text2bednn.utils import build_regionset_info_list_from_PEP
-
-# columns in the csv of PEP that contains metadata information
-columns = return [
-    "tissue",
-    "cell_line",
-    "tissue_lineage",
-    "tissue_description",
-    "diagnosis",
-    "sample_name",
-    "antibody",
-    ]
-
-# path to the yaml file
-yaml_path = "path/to/framework/yaml/file"
-
-ri_list_PEP = build_regionset_info_list_from_PEP(
-    yaml_path,
-    col_names,
-    r2v_model,
-    nl_model,
-    )
-```
+## Store embedding vectors
+It is recommended to use `geniml.search.backend.HNSWBackend` to store embedding vectors. In the `HNSWBackend` that stores each BED file embedding
+vector, the `payload` should contain the name of the BED file. In the `HNSWBackend` that stores the embedding vector of each
+metadata string, the `payload` should contain the names of the BED files that have that string in their metadata.
 ## Train the model
-The list of RegionSetInfo can be split into 3 lists, which represent the training set, validating set, and testing set. The embedding
-vectors of metadata will be X, and the embedding vectors of the region set will be Y.
+Training a `Vec2VecFNN` needs x-y pairs of vectors (x: metadata embedding vector; y: BED embedding vector). A pair of a metadata embedding
+vector with an embedding vector of a BED file in its payload is a target pair; otherwise, it is a non-target pair. Non-target pairs are sampled for
+contrastive loss.
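+
+For orientation, the payloads described above might be structured like the rough sketch below (hypothetical file names; the `"name"` and `"files"` keys are the same ones passed to `vec_pairs` in the next snippet):
+
+```python
+# payload stored with one BED file embedding vector in the BED backend
+bed_payload = {"name": "sample1.bed"}
+
+# payload stored with one metadata-string embedding vector in the metadata backend;
+# it lists the BED files whose metadata contains that string
+metadata_payload = {"files": ["sample1.bed", "sample7.bed"]}
+```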
Here is sample code to generate pairs from the storage backends and train the model:
 ```python
-from sklearn.model_selection import train_test_split
-from geniml.text2bednn.utils region_info_list_to_vectors
-from geniml.text2bednn.text2bednn import Vec2VecFNN
-
+# generate training pairs with the geniml helper vec_pairs (import not shown);
+# target is an array of 1 (target) and -1 (non-target)
+X, Y, target = vec_pairs(
+    nl_backend,  # HNSWBackend that stores metadata embedding vectors
+    bed_backend,  # HNSWBackend that stores BED embedding vectors
+    "name",  # key to file name in BED backend payloads
+    "files",  # key to matching files in metadata backend payloads
+    True,  # sample non-target pairs
+    1.0  # ratio: number of non-target pairs / number of target pairs = 1
+)
+
+# train without validation data (v2v_torch_contrast is a Vec2VecFNN instance created beforehand)
+v2v_torch_contrast.train(
+    X,
+    Y,
+    folder_path="path/to/folder/for/checkpoint",
+    loss_func="cosine_embedding_loss",  # right now "cosine_embedding_loss" is the only contrastive loss function available
+    training_target=target,
+)
-# split the list of RegionInfoSet into different data set
-train_list, validate_list = train_test_split(ri_list, test_size=0.2)
-
-# get the embedding vectors
-train_X, train_Y = region_info_list_to_vectors(train_list)
-validate_X, validate_Y = region_info_list_to_vectors(validate_list)
-
-# train the neural network
-v2vnn = Vec2VecFNN()
-v2vnn.train(train_X, train_Y, validating_data=(validate_X, validate_Y), num_epochs=50)
-```
-
-## Load the vectors and information to search backend
-[`qdrant-client`](https://github.com/qdrant/qdrant-client) and [`hnswlib`](https://github.com/nmslib/hnswlib) can store vectors and perform k-nearest neighbors (KNN) search with a given query vector, so we
-created one database backend (`QdrantBackend`) and one local file backend (`HNSWBackend`) that can store the embedding
-vectors for KNN search. `HNSWBackend` will create a .bin file with given path, which saves the searching index.
-
-```python
-from geniml.text2bednn.utils import prepare_vectors_for_database
-
-# loading data to search backend
-embeddings, labels = prepare_vectors_for_database(ri_list)
-
-# search backend
-hnsw_backend = HNSWBackend(local_index_path="path/to/local/index.bin")
-hnsw_backend.load(embeddings, labels)
 ```
 ## text2bednn search interface
@@ -119,3 +54,26 @@ query_term = "human, kidney, blood"
 # perform KNN search with K = 5, the id of stored vectors and the distance / similarity score will be returned
 ids, scores = file_interface.nl_vec_search(query_term, 5)
 ```
+
+### Evaluate search performance
+With a dictionary that contains query strings and the ids of relevant query results in the search backend, in this format:
+```
+{
+    <query string>: [
+        <id of relevant result in the backend>,
+        ...
+    ],
+    ...
+}
+```
+`TextToBedNNSearchInterface` can return [mean average precision](https://www.youtube.com/watch?v=pM6DJ0ZZee0&t=157s), [average AUC-ROC](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf), and [average R-Precision](https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_491). Here is example code:
+```python
+query_dict = {
+    "metadata string 1": [2, 3],
+    "metadata string 12": [1],
+    "metadata string 3": [2, 4, 5],
+    "metadata string 4": [0]
+}
+
+MAP, AUC, RP = file_interface.eval(query_dict)
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index 2a0cdae..621f1b1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -62,18 +62,20 @@ nav:
   - Changelog: changelog.md
   - BEDboss:
     - BEDBoss: bedboss/README.md
-    - Tutorials:
-      - BEDbase tutorial: bedboss/code/bedbase-tutorial.md
-      - BEDmaker tutorial: bedboss/code/bedmaker-tutorial.md
-      - BEDqc tutorial: bedboss/code/bedqc-tutorial.md
-      - BEDstat tutorial: bedboss/code/bedstat-tutorial.md
-      - Everything tutorial: bedboss/code/tutorial-all.md
+    - Tutorials:
+      - BEDboss-all pipeline: bedboss/tutorials/tutorial_all.md
+      - BEDboss insert: bedboss/tutorials/tutorial_insert.md
+      - BEDmaker tutorial: bedboss/tutorials/bedmaker_tutorial.md
+      - BEDqc tutorial: bedboss/tutorials/bedqc_tutorial.md
+      - BEDstat tutorial: bedboss/tutorials/bedstat_tutorial.md
+      - BEDbuncher tutorial: bedboss/tutorials/bedbuncher_tutorial.md
+      - BEDindex tutorial: bedboss/tutorials/bedindex_tutorial.md
+      - BEDclassifier tutorial: bedboss/tutorials/bedclassifier_tutorial.md
     - How to guides:
-      - Configure bedboss: bedboss/how-to-configure.md
-      - Run from Python: bedboss/how-to-run-from-python.md
-      - Install R dependencies: bedboss/how-to-install-r-dependencies.md
       - Create BEDbase database: bedboss/how-to-create-database.md
-      - BEDboss insert: bedboss/bedboss-insert.md
+      - Create config file: bedboss/how-to-configure.md
+      - Install R dependencies: bedboss/how-to-install-r-dependencies.md
+      - Development process: bedboss/how-to-develop.md
   - Reference:
     - How to cite: citations.md
     - Usage: bedboss/usage.md