From 738d3ee34ac48df8c4f0c627f307876a10d3dad0 Mon Sep 17 00:00:00 2001 From: Julia820 Date: Sun, 12 May 2024 19:04:40 +0200 Subject: [PATCH] separate assessment --- docs/geniml/code/assess-universe.md | 15065 ++++++++++++++++ .../code/create-consensus-peaks-python.md | 442 +- docs/geniml/notebooks/assess-universe.ipynb | 390 + .../create-consensus-peaks-python.ipynb | 305 +- .../tutorials/create-consensus-peaks.md | 25 +- mkdocs.yml | 2 +- 6 files changed, 15459 insertions(+), 770 deletions(-) create mode 100644 docs/geniml/code/assess-universe.md create mode 100644 docs/geniml/notebooks/assess-universe.ipynb diff --git a/docs/geniml/code/assess-universe.md b/docs/geniml/code/assess-universe.md new file mode 100644 index 0000000..97b1b35 --- /dev/null +++ b/docs/geniml/code/assess-universe.md @@ -0,0 +1,15065 @@ + + + + + +assess-universe + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ +
+
+ +
+ + + + +
+ + + + +
+
+ +
+ + + + +
+ + + + +
+ + + + +
+
+ +
+ + + + +
+ + + + +
+ + + + + + + + + diff --git a/docs/geniml/code/create-consensus-peaks-python.md b/docs/geniml/code/create-consensus-peaks-python.md index 5cacdc5..1c2d143 100644 --- a/docs/geniml/code/create-consensus-peaks-python.md +++ b/docs/geniml/code/create-consensus-peaks-python.md @@ -14811,451 +14811,11 @@ body[data-format='mobile'] .jp-OutputArea-child .jp-OutputArea-output {
-
- - - - -
- - - - -
- - - - -
-
- -
- - - - -
- - - - -
-
- -
- - - - -
- - - - -
- -
diff --git a/docs/geniml/notebooks/assess-universe.ipynb b/docs/geniml/notebooks/assess-universe.ipynb new file mode 100644 index 0000000..32dd7ff --- /dev/null +++ b/docs/geniml/notebooks/assess-universe.ipynb @@ -0,0 +1,390 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to assess universe fit to a collection of BED files\n", + "\n", + "## Introduction\n", + "\n", + "In this tutorial, you will see how to assess the fit of a given universe to a collection of files. (Tutorials on creating different universes from files can be found [here](../tutorials/create-consensus-peaks.md) and [here](create-consensus-peaks-python.md).) Choosing which universe best represents the data can be challenging. To help with this decision, we created three different metrics for assessing universe fit to a collection of regions: a base-level overlap score, a region boundary score, and a likelihood score. The fit of a universe can be assessed with either the CLI or Python functions, depending on the use case. With the CLI you can create a file with the values of the assessment metrics for each file in the collection, while the Python functions return measures of universe fit to the whole collection. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CLI\n", "\n", "Using the CLI you can calculate both the base-level overlap score and the region boundary score separately for each file in the collection and then summarize the results. To calculate them you need the raw files as well as the analyzed universe. 
It is also necessary to choose at least one assessment metric to calculate: \n", + "\n", + "* `--overlap` - to calculate the base-pair overlap between the universe and the regions in the file, the number of base pairs only in the universe, and the number of base pairs only in the file, which can be used to calculate the F10 score; \n", + "* `--distance` - to calculate the median distance from the regions in the raw file to the universe;\n", + "* `--distance-universe-to-file` - to calculate the median distance from the universe to the regions in the raw file;\n", + "* `--distance-flexible` - to calculate the median distance from the regions in the raw file to the universe, taking into account universe flexibility;\n", + "* `--distance-flexible-universe-to-file` - to calculate the median distance from the universe to the regions in the raw file, taking into account universe flexibility. \n", + "\n", + "Here is an example that calculates all possible metrics for the HMM universe:\n", + "\n", + "```\n", + " geniml assess-universe --raw-data-folder raw/ \\\n", + " --file-list file_list.txt \\\n", + " --universe universe_hmm.bed \\\n", + " --folder-out . \\\n", + " --pref test_assess \\\n", + " --overlap \\\n", + " --distance \\\n", + " --distance-universe-to-file \\\n", + " --distance-flexible \\\n", + " --distance-flexible-universe-to-file\n", + "```\n", + "The resulting file, test_assess_data.csv, contains columns with the raw calculated metrics for each file: *file*, *univers/file*, *file/universe*, *universe&file*, *median_dist_file_to_universe*, *median_dist_file_to_universe_flex*, *median_dist_universe_to_file*, *median_dist_universe_to_file_flex*. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python functions\n", "\n", "The file created with the CLI can be further summarized into metrics that assess the fit of a universe to the whole collection, such as the base-level overlap score (F10) and the region boundary score (RBS)."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
fileunivers/filefile/universeuniverse&filemedian_dist_file_to_universemedian_dist_file_to_universe_flexmedian_dist_universe_to_filemedian_dist_universe_to_file_flex
0test_1.bed2506403363027.00.076.50.0
1test_2.bed1803146433327.00.070.07.5
2test_3.bed29490318728.00.0225.0224.5
3test_4.bed2071546406527.00.0116.5105.5
\n", + "
" + ], + "text/plain": [ + " file univers/file file/universe universe&file \\\n", + "0 test_1.bed 2506 403 3630 \n", + "1 test_2.bed 1803 146 4333 \n", + "2 test_3.bed 2949 0 3187 \n", + "3 test_4.bed 2071 546 4065 \n", + "\n", + " median_dist_file_to_universe median_dist_file_to_universe_flex \\\n", + "0 27.0 0.0 \n", + "1 27.0 0.0 \n", + "2 28.0 0.0 \n", + "3 27.0 0.0 \n", + "\n", + " median_dist_universe_to_file median_dist_universe_to_file_flex \n", + "0 76.5 0.0 \n", + "1 70.0 7.5 \n", + "2 225.0 224.5 \n", + "3 116.5 105.5 " + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file\n", + "import pandas as pd\n", + "\n", + "assessment_file_path = \"test_assess_data.csv\"\n", + "df = pd.read_csv(assessment_file_path)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Universe \n", + "F10: 0.93\n", + "RBS: 0.77\n", + "flexible RBS: 0.98\n" + ] + } + ], + "source": [ + "rbs = get_rbs_from_assessment_file(assessment_file_path)\n", + "f_10 = get_f_10_score_from_assessment_file(assessment_file_path)\n", + "rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)\n", + "print(f\"Universe \\nF10: {f_10:.2f}\\nRBS: {rbs:.2f}\\nflexible RBS: {rbs_flex:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or all of this metrics can be directly calculated from the universe and raw files including a likelihood score (LH):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe F10: 0.93'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_f_10_score\n", + "\n", + "f10 = get_f_10_score(\n", + " \"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\",\n", + " 1)\n", + "\n", + "f\"Universe F10: {f10:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe RBS: 0.77'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_mean_rbs\n", + "rbs = get_mean_rbs(\"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\", 1)\n", + "f\"Universe RBS: {rbs:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe LH: -127156.87'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_likelihood\n", + "lh = get_likelihood(\n", + " \"model.tar\",\n", + " \"universe_hmm.bed\",\n", + " \"coverage/\"\n", + ")\n", + "f\"Universe LH: {lh:.2f}\" " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both region boundary score and likelihood can be also calculated taking into account universe flexibility:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe flexible RBS: 0.98'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from geniml.assess.assess import get_mean_rbs\n", + "rbs_flex = get_mean_rbs(\n", + " \"raw/\",\n", + " 'file_list.txt',\n", + " \"universe_hmm.bed\",\n", + " 1,\n", + " flexible=True)\n", + "f\"Universe flexible RBS: 
{rbs_flex:.2f}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'Universe flexible LH: -127156.87'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lh_flex = get_likelihood(\n", + " \"model.tar\",\n", + " \"universe_hmm.bed\",\n", + " \"coverage/\"\n", + ")\n", + "f\"Universe flexible LH: {lh_flex:.2f}\" " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/geniml/notebooks/create-consensus-peaks-python.ipynb b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb index cb386d7..c531813 100644 --- a/docs/geniml/notebooks/create-consensus-peaks-python.ipynb +++ b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb @@ -173,311 +173,8 @@ "source": [ "# How to assess new universe?\n", "\n", - "So far you used many different methods for creating new universes. But choosing, which universe represents data the best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collections: a base-level overlap score (F10), a region boundary distance score (RBD), and a likelihood score (LH). Here we present an example, which calculates all these metrics for HMM universe:" + "So far you used many different methods for creating new universes. But choosing, which universe represents data the best can be challenging. 
To help with this, we created a tutorial, available [here](../code/assess-universe.md), that presents different methods for assessing universe fit to a collection of files." ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "6b36ce9d-5412-4ba8-afe7-964909806e0d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Universe F10: 0.93'" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from geniml.assess.assess import get_f_10_score\n", - "\n", - "f10 = get_f_10_score(\n", - " \"raw/\",\n", - " 'file_list.txt',\n", - " \"universe_hmm.bed\",\n", - " 1)\n", - "\n", - "f\"Universe F10: {f10:.2f}\"" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "7b29421c-8543-4fc9-ac83-ad6f7d0f70df", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Universe RBS: 0.77'" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from geniml.assess.assess import get_mean_rbs\n", - "rbs = get_mean_rbs(\"raw/\",\n", - " 'file_list.txt',\n", - " \"universe_hmm.bed\", 1)\n", - "f\"Universe RBS: {rbs:.2f}\"" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "711fbb4f-4502-499e-8b5d-e879e26a0124", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Universe LH: -127156.87'" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from geniml.assess.assess import get_likelihood\n", - "lh = get_likelihood(\n", - " \"model.tar\",\n", - " \"universe_hmm.bed\",\n", - " \"coverage/\"\n", - ")\n", - "f\"Universe LH: {lh:.2f}\" " - ] - }, - { - "cell_type": "markdown", - "id": "171e1240-e12a-450a-9df9-ad1f0d97e398", - "metadata": {}, - "source": [ - "Both region boundary score and likelihood can be also calculated taking into account universe flexibility:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - 
"id": "fa597da1-d452-4583-973d-92263512b38e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Universe flexible RBS: 0.98'" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from geniml.assess.assess import get_mean_rbs\n", - "rbs_flex = get_mean_rbs(\n", - " \"raw/\",\n", - " 'file_list.txt',\n", - " \"universe_hmm.bed\",\n", - " 1,\n", - " flexible=True)\n", - "f\"Universe flexible RBS: {rbs_flex:.2f}\"" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "315a1dce-e045-41a0-b1ed-e06d117bebaa", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Universe flexible LH: -127156.87'" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "lh_flex = get_likelihood(\n", - " \"model.tar\",\n", - " \"universe_hmm.bed\",\n", - " \"coverage/\"\n", - ")\n", - "f\"Universe flexible LH: {lh_flex:.2f}\" " - ] - }, - { - "cell_type": "markdown", - "id": "ece5e3df-647f-46ab-bd95-4add00ebdfd5", - "metadata": {}, - "source": [ - "In CLI version of this [tutorial](../tutorials/create-consensus-peaks.md) it was shown how to calculate an assessment file with all the metrics. This file can be further summarized into specific metrics assessing the fit of a universe to a whole collection. " - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "462ffa69-3867-42b3-84f9-0cbe394dbd20", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
fileunivers/filefile/universeuniverse&filemedian_dist_file_to_universemedian_dist_file_to_universe_flexmedian_dist_universe_to_filemedian_dist_universe_to_file_flex
0test_1.bed2506403363027.00.076.50.0
1test_2.bed1803146433327.00.070.07.5
2test_3.bed29490318728.00.0225.0224.5
3test_4.bed2071546406527.00.0116.5105.5
\n", - "
" - ], - "text/plain": [ - " file univers/file file/universe universe&file \\\n", - "0 test_1.bed 2506 403 3630 \n", - "1 test_2.bed 1803 146 4333 \n", - "2 test_3.bed 2949 0 3187 \n", - "3 test_4.bed 2071 546 4065 \n", - "\n", - " median_dist_file_to_universe median_dist_file_to_universe_flex \\\n", - "0 27.0 0.0 \n", - "1 27.0 0.0 \n", - "2 28.0 0.0 \n", - "3 27.0 0.0 \n", - "\n", - " median_dist_universe_to_file median_dist_universe_to_file_flex \n", - "0 76.5 0.0 \n", - "1 70.0 7.5 \n", - "2 225.0 224.5 \n", - "3 116.5 105.5 " - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file\n", - "import pandas as pd\n", - "\n", - "assessment_file_path = \"test_assess_data.csv\"\n", - "df = pd.read_csv(assessment_file_path)\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "4f9f3a13", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Universe \n", - "F10: 0.93\n", - "RBS: 0.77\n", - "flexible RBS: 0.98\n" - ] - } - ], - "source": [ - "rbs = get_rbs_from_assessment_file(assessment_file_path)\n", - "f_10 = get_f_10_score_from_assessment_file(assessment_file_path)\n", - "rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)\n", - "print(f\"Universe \\nF10: {f_10:.2f}\\nRBS: {rbs:.2f}\\nflexible RBS: {rbs_flex:.2f}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76201732-509f-46ca-a703-ec804a03e097", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/docs/geniml/tutorials/create-consensus-peaks.md b/docs/geniml/tutorials/create-consensus-peaks.md index 4eba331..c8f8db3 100644 --- a/docs/geniml/tutorials/create-consensus-peaks.md +++ b/docs/geniml/tutorials/create-consensus-peaks.md @@ -81,27 +81,4 @@ geniml build-universe hmm --coverage-folder coverage/ \ # 
How to assess new universe? -So far you used many different methods for creating new universes. But choosing, which universe represents data the best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collections: a base-level overlap score, a region boundary score, and a likelihood score. The two first metrics can be calculated separately for each file in the collections and than summarized. To calculate them you need raw files as well as the analyzed universe. It is also necessary to choose at least one assessment metric to be calculated: - -* `--overlap` - to calculate base pair overlap between universe and regions in the file, number of base pair only in the universe, number of base pair only in the file, which can be used to calculate F10 score; -* `--distance` - to calculate median of distance form regions in the raw file to the universe; -* `--distance-universe-to-file` - to calculate median of distance form the universe to regions in the raw file; -* `--distance-flexible` - to calculate median of distance form regions in the raw file to the universe taking into account universe flexibility; -* `--distance-flexible-universe-to-file` - - to calculate median of distance form the universe to regions in the raw file taking into account universe flexibility. - -Here we present an example, which calculates all possible metrics for HMM universe: - -``` - geniml assess-universe --raw-data-folder raw/ \ - --file-list file_list.txt \ - --universe universe_hmm.bed \ - --folder-out . 
\ - --pref test_assess \ - --overlap \ - --distance \ - --distance-universe-to-file \ - --distance-flexible \ - --distance-flexible-universe-to-file -``` -The resulting file is called test_assess_data.csv, and contains columns with the raw calculated metrics for each file: *file*, *univers/file*, *file/universe*, *universe&file*, *median_dist_file_to_universe*, *median_dist_file_to_universe_flex*, *median_dist_universe_to_file*, *median_dist_universe_to_file_flex*. -More information about assessing fit of universe to a collection of files can be found in jupyter notebook version of this tutorial tha can be found [here](../code/create-consensus-peaks-python.md). \ No newline at end of file +So far you used many different methods for creating new universes. But choosing which universe best represents the data can be challenging. To help with this, we created a tutorial, available [here](../code/assess-universe.md), that presents different methods for assessing universe fit to a collection of files. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 0b6a5e5..d9fc149 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -101,7 +101,7 @@ nav: - Evaluate embeddings: geniml/tutorials/evaluation.md - Create consensus peaks with CLI: geniml/tutorials/create-consensus-peaks.md - Create consensus peaks with Python: geniml/code/create-consensus-peaks-python.md - - Assess universe fit: geniml/tutorials/assess-universe.md + - Assess universe fit: geniml/code/assess-universe.md - Fine-tune embeddings: geniml/tutorials/fine-tune-region2vec-model.md - Randomize bed files: geniml/tutorials/bedshift.md - Create evaluation dataset with bedshift: geniml/tutorials/bedshift-evaluation-guide.md