diff --git a/docs/geniml/code/create-consensus-peaks-python.md b/docs/geniml/code/create-consensus-peaks-python.md
index 54345c3..5cacdc5 100644
--- a/docs/geniml/code/create-consensus-peaks-python.md
+++ b/docs/geniml/code/create-consensus-peaks-python.md
diff --git a/docs/geniml/notebooks/create-consensus-peaks-python.ipynb b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb
index 0b7a405..cb386d7 100644
--- a/docs/geniml/notebooks/create-consensus-peaks-python.ipynb
+++ b/docs/geniml/notebooks/create-consensus-peaks-python.ipynb
@@ -8,7 +8,7 @@
     "# How to build a new universe?\n",
     "\n",
     "## Data preprocessing\n",
-    "This is a jupyter version of CLI tutorial that can be found [here](../tutorials/create-consensus-peaks.md). You will use here python functions insted of CLI to build and assess diffrent universe. Fielse that you will use here can be downlodead from XXX. In there you will find a compressed folder:\n",
+    "This is a Jupyter version of the CLI tutorial that can be found [here](../tutorials/create-consensus-peaks.md). Here you will use Python functions instead of the CLI to build and assess different universes. The files used here can be downloaded from XXX. There you will find a compressed folder:\n",
     "\n",
     "```\n",
     "consensus:\n",
@@ -21,9 +21,9 @@
     "    chrom.sizes\n",
     "```\n",
     "\n",
-    "In the raw folder there are example BED files used in this tutorial and file withe names of files we will analyze.\n",
+    "In the raw folder there are example BED files used in this tutorial, and file_list.txt contains the names of the files you will analyze. Additionally, there is a file with chromosome sizes, which you will use to preprocess the data.\n",
     "\n",
-    "It assummes that you alread have files of the genome coverage by the analzed colletion. The example of how to creat them can be found [here](../tutorials/create-consensus-peaks.md)."
+    "Here we assume that you already have files with the genome coverage of the analyzed collection. An example of how to create them can be found [here](../tutorials/create-consensus-peaks.md)."
    ]
   },
   {
@@ -46,7 +46,7 @@
    "outputs": [],
    "source": [
     "from geniml.universe.cc_universe import cc_universe\n",
-    "cc_universe(\"coverage/\", file_out=\"universe_cc_py.bed\")"
+    "cc_universe(\"coverage/\", file_out=\"universe_cc.bed\")"
    ]
   },
   {
@@ -54,8 +54,8 @@
    "id": "56af8029-97b1-4bd6-81a9-af86a6b6c83e",
    "metadata": {},
    "source": [
-    "Depending on the task the universe can be smooth by setting ```merge``` option with the distance beloved witch peaks should be merged together and \n",
-    "`filter_size` with minimum size of peak that should be part of the universe. Instead of it using maximum likelihood cutoff one can also defined cutoff with `cutoff` option. If it is set to 1 the result is union universe, and when to number of files it wil produce intersection universe:"
+    "Depending on the task, the universe can be smoothed by setting the ```merge``` option with the distance below which peaks should be merged together and \n",
+    "`filter_size` with the minimum size of a peak that should be part of the universe. Instead of using the maximum likelihood cutoff one can also define a cutoff with the `cutoff` option. If it is set to 1 the result is the union universe, and when it is set to the number of files it will produce the intersection universe:"
    ]
   },
   {
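For readers following along, here is a minimal sketch of how the smoothing and cutoff options described in this cell might look in code. It assumes that `merge`, `filter_size`, and `cutoff` are accepted as keyword arguments of `cc_universe`, as the prose suggests; the numeric values and output file names are illustrative only.

```python
from geniml.universe.cc_universe import cc_universe

# Smoothed coverage-cutoff universe: merge peaks separated by less than 100 bp
# and drop peaks shorter than 20 bp (illustrative values, not recommendations).
cc_universe("coverage/", file_out="universe_cc_smooth.bed", merge=100, filter_size=20)

# Fixed cutoffs instead of the maximum likelihood cutoff:
cc_universe("coverage/", file_out="universe_union.bed", cutoff=1)         # union universe
cc_universe("coverage/", file_out="universe_intersection.bed", cutoff=4)  # intersection for 4 files
```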
@@ -87,7 +87,7 @@
    "source": [
     "from geniml.universe.ccf_universe import ccf_universe\n",
     "\n",
-    "ccf_universe(\"coverage/\", file_out=\"universe_ccf_py.bed\")"
+    "ccf_universe(\"coverage/\", file_out=\"universe_ccf.bed\")"
    ]
   },
   {
@@ -109,14 +109,14 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Function 'main' executed in 0.0007min\n"
+      "Function 'main' executed in 0.0001min\n"
      ]
     }
    ],
    "source": [
     "from geniml.likelihood.build_model import main\n",
     "\n",
-    "main(\"model_py.tar\", \"coverage/\",\n",
+    "main(\"model.tar\", \"coverage/\",\n",
     "     \"all\",\n",
     "     file_no=4)"
    ]
   },
   {
@@ -126,7 +126,7 @@
    "id": "d05ebdb6-cf66-4b0c-abbd-2fc75f093350",
    "metadata": {},
    "source": [
-    "The resulting tar archiver contains LH model that can be used for building flexible universes called a maximum likelihood universe (ML):"
+    "The resulting tar archive contains the LH model. This model can be used as a scoring function that assigns to each position the probability of it being a start, core, or end of a region. It can be used both for universe assessment and for universe building. Combining the LH model with an optimization algorithm for building flexible universes results in the maximum likelihood (ML) universe:"
    ]
   },
   {
@@ -134,22 +134,14 @@
    "execution_count": 5,
    "id": "d50948d0-f46b-4eb0-a2ab-9121f315ac21",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "numba is installed\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from geniml.universe.ml_universe import ml_universe\n",
     "\n",
-    "ml_universe(\"model_py.tar\",\n",
+    "ml_universe(\"model.tar\",\n",
     "            \"coverage\",\n",
     "            \"all\",\n",
-    "            \"universe_ml_py.bed\")"
+    "            \"universe_ml.bed\")"
    ]
   },
   {
@@ -171,7 +163,7 @@
     "from geniml.universe.hmm_universe import hmm_universe\n",
     "\n",
     "hmm_universe(\"coverage/\",\n",
-    "             \"universe_hmm_py.bed\")"
+    "             \"universe_hmm.bed\")"
    ]
   },
   {
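Putting the pieces above together, the four builders can be run back to back on the same coverage folder. This sketch only restates the calls shown in the cells above; the import alias for `main` is added here for readability.

```python
from geniml.universe.cc_universe import cc_universe
from geniml.universe.ccf_universe import ccf_universe
from geniml.universe.ml_universe import ml_universe
from geniml.universe.hmm_universe import hmm_universe
from geniml.likelihood.build_model import main as build_lh_model

# Coverage-cutoff universes need only the coverage tracks.
cc_universe("coverage/", file_out="universe_cc.bed")
ccf_universe("coverage/", file_out="universe_ccf.bed")

# The ML universe additionally needs the likelihood model built from the same coverage.
build_lh_model("model.tar", "coverage/", "all", file_no=4)
ml_universe("model.tar", "coverage", "all", "universe_ml.bed")

# The HMM universe is built directly from the coverage folder.
hmm_universe("coverage/", "universe_hmm.bed")
```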
@@ -181,7 +173,7 @@
    "source": [
     "# How to assess new universe?\n",
     "\n",
-    "So far you used many different methods for creating new universes. But choosing, which universe represents data the best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collections: a base-level overlap score (F10), a region boundary distance score (RBD), and a likelihood score (LH). Here we present an example, which calculates all metrics for HMM universe:"
+    "So far you have used many different methods for creating new universes. But choosing which universe represents the data best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collections: a base-level overlap score (F10), a region boundary distance score (RBD), and a likelihood score (LH). Here we present an example that calculates all these metrics for the HMM universe:"
    ]
   },
   {
@@ -207,7 +199,7 @@
     "f10 = get_f_10_score(\n",
     "    \"raw/\",\n",
     "    'file_list.txt',\n",
-    "    \"universe_hmm_py.bed\",\n",
+    "    \"universe_hmm.bed\",\n",
     "    1)\n",
     "\n",
     "f\"Universe F10: {f10:.2f}\""
    ]
   },
   {
@@ -234,7 +226,7 @@
     "from geniml.assess.assess import get_mean_rbs\n",
     "rbs = get_mean_rbs(\"raw/\",\n",
     "                   'file_list.txt',\n",
-    "                   \"universe_hmm_py.bed\", 1)\n",
+    "                   \"universe_hmm.bed\", 1)\n",
     "f\"Universe RBS: {rbs:.2f}\""
    ]
   },
@@ -258,8 +250,8 @@
    "source": [
     "from geniml.assess.assess import get_likelihood\n",
     "lh = get_likelihood(\n",
-    "    \"model_py.tar\",\n",
-    "    \"universe_hmm_py.bed\",\n",
+    "    \"model.tar\",\n",
+    "    \"universe_hmm.bed\",\n",
     "    \"coverage/\"\n",
     ")\n",
     "f\"Universe LH: {lh:.2f}\" "
    ]
   },
   {
@@ -270,7 +262,7 @@
    "id": "171e1240-e12a-450a-9df9-ad1f0d97e398",
    "metadata": {},
    "source": [
-    "Both region baounary score and likelihood can cacluated taking into account universe flexiblility:"
+    "Both the region boundary score and the likelihood can also be calculated taking into account universe flexibility:"
    ]
   },
   {
@@ -295,7 +287,7 @@
     "rbs_flex = get_mean_rbs(\n",
     "    \"raw/\",\n",
     "    'file_list.txt',\n",
-    "    \"universe_hmm_py.bed\",\n",
+    "    \"universe_hmm.bed\",\n",
     "    1,\n",
     "    flexible=True)\n",
     "f\"Universe flexible RBS: {rbs_flex:.2f}\""
    ]
   },
   {
@@ -320,8 +312,8 @@
    ],
    "source": [
     "lh_flex = get_likelihood(\n",
-    "    \"model_py.tar\",\n",
-    "    \"universe_hmm_py.bed\",\n",
+    "    \"model.tar\",\n",
+    "    \"universe_hmm.bed\",\n",
     "    \"coverage/\"\n",
     ")\n",
     "f\"Universe flexible LH: {lh_flex:.2f}\" "
    ]
   },
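To pick between the universes built earlier, the three metrics can be looped over all of them. This sketch reuses the exact calls and paths from the cells above; the list of universe files simply reflects the ones produced in this tutorial.

```python
from geniml.assess.assess import get_f_10_score, get_mean_rbs, get_likelihood

# Universes produced earlier in this notebook.
universes = ["universe_cc.bed", "universe_ccf.bed", "universe_ml.bed", "universe_hmm.bed"]

for universe in universes:
    f10 = get_f_10_score("raw/", "file_list.txt", universe, 1)
    rbs = get_mean_rbs("raw/", "file_list.txt", universe, 1)
    lh = get_likelihood("model.tar", universe, "coverage/")
    print(f"{universe}: F10={f10:.2f}  RBS={rbs:.2f}  LH={lh:.2f}")
```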
@@ -337,31 +329,155 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 12,
    "id": "462ffa69-3867-42b3-84f9-0cbe394dbd20",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "         file  univers/file  file/universe  universe&file  \\\n",
+       "0  test_1.bed          2506            403           3630   \n",
+       "1  test_2.bed          1803            146           4333   \n",
+       "2  test_3.bed          2949              0           3187   \n",
+       "3  test_4.bed          2071            546           4065   \n",
+       "\n",
+       "   median_dist_file_to_universe  median_dist_file_to_universe_flex  \\\n",
+       "0                          27.0                                0.0   \n",
+       "1                          27.0                                0.0   \n",
+       "2                          28.0                                0.0   \n",
+       "3                          27.0                                0.0   \n",
+       "\n",
+       "   median_dist_universe_to_file  median_dist_universe_to_file_flex  \n",
+       "0                          76.5                                0.0  \n",
+       "1                          70.0                                7.5  \n",
+       "2                         225.0                              224.5  \n",
+       "3                         116.5                              105.5  "
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file\n",
     "import pandas as pd\n",
     "\n",
     "assessment_file_path = \"test_assess_data.csv\"\n",
-    "df = pd.read(assessment_file_path)\n",
+    "df = pd.read_csv(assessment_file_path)\n",
     "df.head()"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 13,
    "id": "4f9f3a13",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Universe \n",
+      "F10: 0.93\n",
+      "RBS: 0.77\n",
+      "flexible RBS: 0.98\n"
+     ]
+    }
+   ],
    "source": [
     "rbs = get_rbs_from_assessment_file(assessment_file_path)\n",
     "f_10 = get_f_10_score_from_assessment_file(assessment_file_path)\n",
-    "rbs_flex = get_f_10_score_from_assessment_file(assessment_file_path, flexible=True)\n",
-    "f\"Universe\\nF10: {f_10:.2f}\\nRBS: {rbs:.2f}\\nflexible RBS: {rbs_flex:.2f}\""
+    "rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)\n",
+    "print(f\"Universe \\nF10: {f_10:.2f}\\nRBS: {rbs:.2f}\\nflexible RBS: {rbs_flex:.2f}\")"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "76201732-509f-46ca-a703-ec804a03e097",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
diff --git a/docs/geniml/tutorials/create-consensus-peaks.md b/docs/geniml/tutorials/create-consensus-peaks.md
index b45c9a9..4eba331 100644
--- a/docs/geniml/tutorials/create-consensus-peaks.md
+++ b/docs/geniml/tutorials/create-consensus-peaks.md
@@ -14,7 +14,7 @@ consensus:
     chrom.sizes
 ```
 
-In the raw folder there are example BED files used in this tutorial and in file_list.txt are names of files we will analyze. Additionally there is a file with chromosome sizes, which you will use to preprocess the data.
+In the raw folder there are example BED files used in this tutorial, and file_list.txt contains the names of the files you will analyze. Additionally, there is a file with chromosome sizes, which you will use to preprocess the data.
 
 To build any kind of a universe you need bigWig files with genome coverage by the analyzed collection, which can be made it using [uniwig](https://github.com/databio/uniwig/). First we have to combine all the analyzed files into one BED file:
 
@@ -42,10 +42,10 @@ geniml build-universe cc --coverage-folder coverage/ \
 
 Depending on the task the universe can be smooth by setting `--merge` flag
 with the distance beloved witch peaks should be merged together and
-`--filter-size` with minimum size of peak that should be part of the universe. Instead of it using maximum likelihood cutoff one can also defined cutoff with `--cutoff` flag. If it is set to 1 the result is union universe, and when to number of analyzed files it wil produce intersection universe.
+`--filter-size` with the minimum size of a peak that should be part of the universe. Instead of using the maximum likelihood cutoff one can also define a cutoff with the `--cutoff` flag. If it is set to 1 the result is the union universe, and when it is set to the number of analyzed files it will produce the intersection universe.
 
 ## Coverage cutoff flexible universe
-A more complex version of coverage cutoff universe is coverage cutoff flexible universe (CCF). In contrast to its' fixed version it produces flexible universe. It builds confidence interval around the maximum likelihood cutoff. This results in two values one for the cutoff for boundaries, and the other one for the region core. Despite the fact that the CFF universe is more complex it is build using the same input as the CC universe:
+A more complex version of the coverage cutoff universe is the coverage cutoff flexible (CCF) universe. In contrast to its fixed version it produces a flexible universe. It builds a confidence interval around the maximum likelihood cutoff. This results in two values: one cutoff for the boundaries, and the other for the region core. Despite the fact that the CCF universe is more complex, it is built using the same input as the CC universe:
 
 ```
 geniml build-universe ccf --coverage-folder coverage/ \
@@ -62,7 +62,7 @@ geniml lh build_model --model-file model.tar \
                       --file-no `wc -l file_list.txt`
 ```
 
-The resulting tar archiver contains LH model. This model can be used as a scoring function that assigns to each position probability of it being a start, core or end. It can be both used for universe assessment and universe building. Combination of LH model and optimization algorithm is for building flexible universes called a maximum likelihood universe (ML):
+The resulting tar archive contains the LH model. This model can be used as a scoring function that assigns to each position the probability of it being a start, core, or end of a region. It can be used both for universe assessment and for universe building. Combining the LH model with an optimization algorithm for building flexible universes results in the maximum likelihood (ML) universe:
 
 ```
 geniml build-universe ml --model-file model.tar \
@@ -83,7 +83,7 @@ geniml build-universe hmm --coverage-folder coverage/ \
 
 So far you used many different methods for creating new universes. But choosing, which universe represents data the best can be challenging. To help with this decision we created three different metrics for assessing universe fit to the region collections: a base-level overlap score, a region boundary score, and a likelihood score. The two first metrics can be calculated separately for each file in the collections and than summarized. To calculate them you need raw files as well as the analyzed universe. It is also necessary to choose at least one assessment metric to be calculated:
-* `--overlap` - to calculate base pair overlap between universe and regions in the file, number of base pair in only the universe, number of base pair in only the file, which can be used to calculate F10 score;
+* `--overlap` - to calculate the base pair overlap between the universe and regions in the file, the number of base pairs only in the universe, and the number of base pairs only in the file, which can be used to calculate the F10 score;
 * `--distance` - to calculate median of distance form regions in the raw file to the universe;
 * `--distance-universe-to-file` - to calculate median of distance form the universe to regions in the raw file;
 * `--distance-flexible` - to calculate median of distance form regions in the raw file to the universe taking into account universe flexibility;
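Since the assessment results are written as a plain CSV with one row per query file (see the notebook table above for the column names), they can also be inspected with pandas to find files that a universe represents poorly. A small sketch; `test_assess_data.csv` is the example file used in the notebook, and the column names are taken from that table.

```python
import pandas as pd

# Per-file view of the assessment results.
df = pd.read_csv("test_assess_data.csv")

# Rank query files by how far universe regions are from the file's regions,
# which highlights files the universe represents poorly (e.g. test_3.bed above).
cols = ["file", "median_dist_file_to_universe", "median_dist_universe_to_file"]
print(df[cols].sort_values("median_dist_universe_to_file", ascending=False))
```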