How to build a new universe?
Data preprocessing
This is a Jupyter version of the CLI tutorial that can be found here. You will use Python functions instead of the CLI to build and assess different universes. The files used here can be downloaded from XXX. There you will find a compressed folder:
consensus:
  raw
    test_1.bed
    test_2.bed
    test_3.bed
    test_4.bed
    file_list.txt
  chrom.sizes

In the raw folder there are example BED files used in this tutorial, and file_list.txt contains the names of the files you will analyze. Additionally, there is a file with chromosome sizes, which you will use to preprocess the data.
Here we assume that you already have files of the genome coverage by the analyzed collection. An example of how to create them can be found here.
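Before moving on, it may help to confirm that the unpacked data matches this layout. Below is a minimal sanity check using only the standard library (the consensus/raw path follows the listing above; adjust it if you unpacked the archive elsewhere):

from pathlib import Path

# Location of the example data, following the folder listing above
raw = Path("consensus/raw")

# file_list.txt names the BED files that will be analyzed
with open(raw / "file_list.txt") as f:
    bed_files = [line.strip() for line in f if line.strip()]

# Verify that every listed BED file is present
for name in bed_files:
    assert (raw / name).exists(), f"missing file: {name}"
print(f"Found all {len(bed_files)} BED files")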
Coverage cutoff universe
First, you will create a coverage cutoff universe (CC). This is the simplest type of universe: it includes only genomic positions with coverage greater than or equal to a cutoff x. By default, this cutoff is calculated using a simple likelihood model of the probability of a position appearing in the collection. The universe can be built just from the genome coverage:
from geniml.universe.cc_universe import cc_universe

cc_universe("coverage/", file_out="universe_cc.bed")
Depending on the task, the universe can be smoothed by setting the merge option with the distance below which peaks should be merged together, and filter_size with the minimum size of a peak that should be part of the universe. Instead of using the maximum likelihood cutoff, one can also define the cutoff with the cutoff option. If it is set to 1 the result is a union universe, and when set to the number of files it will produce an intersection universe:
cc_universe("coverage/", file_out="universe_union.bed", cutoff=1)
+
+cc_universe("coverage/", file_out="universe_union.bed", cutoff=1)
cc_universe("coverage/", file_out="universe_intersection.bed", cutoff=4)
+
+
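The smoothing options described above can be combined with any cutoff. A sketch using the merge and filter_size options named in the paragraph above (the values 100 and 10, and the output file name, are illustrative, not recommendations):

cc_universe("coverage/",
            file_out="universe_cc_smooth.bed",
            merge=100,       # merge peaks closer than 100 bp (assumed value)
            filter_size=10)  # drop peaks shorter than 10 bp (assumed value)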
Coverage cutoff flexible universe
A more complex version of the coverage cutoff universe is the coverage cutoff flexible universe (CCF). In contrast to its fixed version, it produces flexible universes. It uses two cutoffs calculated based on the maximum likelihood cutoff, making a confidence interval around the optimal cutoff value. Despite the fact that the CCF universe is more complex, it is built using the same input as the CC universe:
from geniml.universe.ccf_universe import ccf_universe

ccf_universe("coverage/", file_out="universe_ccf.bed")
Maximum likelihood universe
In the previous examples both CC and CCF universes used a simple likelihood model to calculate the cutoff. However, we also developed a more complex likelihood model that takes into account the positions of starts and ends of the regions in the collection. This LH model can be built based on the coverage files:
from geniml.likelihood.build_model import main

main("model.tar", "coverage/",
     "all",
     file_no=4)
Function 'main' executed in 0.0001min
The resulting tar archive contains the LH model. This model can be used as a scoring function that assigns to each position the probability of it being a start, core, or end of a region. It can be used both for universe assessment and for universe building. Combining the LH model with an optimization algorithm for building flexible universes results in a maximum likelihood universe (ML):
from geniml.universe.ml_universe import ml_universe

ml_universe("model.tar",
            "coverage",
            "all",
            "universe_ml.bed")
HMM
The fourth presented method of creating universes utilizes Hidden Markov Models. In this approach the parts of flexible regions are hidden states of the model, while the genome coverage by the collection provides the emissions. The resulting universe is called a Hidden Markov Model universe. It can be built based only on the genome coverage by the collection:
from geniml.universe.hmm_universe import hmm_universe

hmm_universe("coverage/",
             "universe_hmm.bed")
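Before assessing the universes, you can peek at what was produced. A quick look with pandas; since the exact column layout of a flexible-universe BED file is not described here, the file is read without assuming a header:

import pandas as pd

# Load the HMM universe as a headerless tab-separated file
universe = pd.read_csv("universe_hmm.bed", sep="\t", header=None)
print(f"{len(universe)} regions in the universe")
print(universe.head())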
How to assess new universe?
So far you have used many different methods for creating new universes. But choosing which universe represents the data best can be challenging. To help with this decision we created three different metrics for assessing a universe's fit to the region collection: a base-level overlap score (F10), a region boundary distance score (RBD), and a likelihood score (LH). Here we present an example that calculates all these metrics for the HMM universe:
from geniml.assess.assess import get_f_10_score

f10 = get_f_10_score(
    "raw/",
    'file_list.txt',
    "universe_hmm.bed",
    1)
f"Universe F10: {f10:.2f}"
'Universe F10: 0.93'
from geniml.assess.assess import get_mean_rbs

rbs = get_mean_rbs("raw/",
                   'file_list.txt',
                   "universe_hmm.bed", 1)
f"Universe RBS: {rbs:.2f}"
'Universe RBS: 0.77'
from geniml.assess.assess import get_likelihood

lh = get_likelihood(
    "model.tar",
    "universe_hmm.bed",
    "coverage/"
)
f"Universe LH: {lh:.2f}"
'Universe LH: -127156.87'
Both the region boundary score and the likelihood can also be calculated taking into account universe flexibility:
from geniml.assess.assess import get_mean_rbs

rbs_flex = get_mean_rbs(
    "raw/",
    'file_list.txt',
    "universe_hmm.bed",
    1,
    flexible=True)
f"Universe flexible RBS: {rbs_flex:.2f}"
'Universe flexible RBS: 0.98'
lh_flex = get_likelihood(
    "model.tar",
    "universe_hmm.bed",
    "coverage/"
)
f"Universe flexible LH: {lh_flex:.2f}"
'Universe flexible LH: -127156.87'
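The same three functions can be used to score every universe built earlier in this tutorial, which makes the comparison explicit. A sketch, assuming the output files from the previous sections are still in the working directory:

from geniml.assess.assess import get_f_10_score, get_mean_rbs, get_likelihood

# Universes produced in the sections above
for universe in ["universe_cc.bed", "universe_ccf.bed",
                 "universe_ml.bed", "universe_hmm.bed"]:
    f10 = get_f_10_score("raw/", 'file_list.txt', universe, 1)
    rbs = get_mean_rbs("raw/", 'file_list.txt', universe, 1)
    lh = get_likelihood("model.tar", universe, "coverage/")
    print(f"{universe}: F10={f10:.2f} RBS={rbs:.2f} LH={lh:.2f}")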
In the CLI version of this tutorial it was shown how to calculate an assessment file with all the metrics. This file can be further summarized into specific metrics assessing the fit of a universe to the whole collection.
from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file
import pandas as pd

assessment_file_path = "test_assess_data.csv"
df = pd.read_csv(assessment_file_path)
df.head()
   file        univers/file  file/universe  universe&file  median_dist_file_to_universe  median_dist_file_to_universe_flex  median_dist_universe_to_file  median_dist_universe_to_file_flex
0  test_1.bed          2506            403           3630                          27.0                                0.0                          76.5                                0.0
1  test_2.bed          1803            146           4333                          27.0                                0.0                          70.0                                7.5
2  test_3.bed          2949              0           3187                          28.0                                0.0                         225.0                              224.5
3  test_4.bed          2071            546           4065                          27.0                                0.0                         116.5                              105.5
rbs = get_rbs_from_assessment_file(assessment_file_path)
f_10 = get_f_10_score_from_assessment_file(assessment_file_path)
rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)
print(f"Universe\nF10: {f_10:.2f}\nRBS: {rbs:.2f}\nflexible RBS: {rbs_flex:.2f}")

Universe
F10: 0.93
RBS: 0.77
flexible RBS: 0.98