Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is omh metric? #4

Open
jianshu93 opened this issue Dec 28, 2022 · 3 comments
Open

is omh metric? #4

jianshu93 opened this issue Dec 28, 2022 · 3 comments

Comments

@jianshu93
Copy link

Hello Team,

several questions related to omh and edit distance:

  1. For edit distance in the edlib (https://github.com/Martinsos/edlib), there can be global, infix and prefix method (e.g., gaps opened at beginning or end of either sequences are not penalized or penalized, semi-global or global). For its sensitive hash, I think it should be global (gaps opened at beginning or end of either sequences are penalized) because all kmers of 2 sequences are considered.
  2. weighted jaccard are metric (https://en.wikipedia.org/wiki/Metric_space) so distance approximated by its LSH will also be metric. Is OMH also metric?

Thanks,

Jianshu

@gmarcais
Copy link
Collaborator

Hi Jianshu,

thank you for your questions regarding OMH.

  1. That's correct. OMH is a LSH sensitive (with a gap) to the (global) edit distance, not to a local alignment like Smith-Waterman.
  2. There are different LSH at play here. MinHash is an unbiased estimator of the Jaccard, in other words it is a gap-less LSH (LSH according to Charikar's definition). OMH is a gaped LSH (following Indik's definition): it does not directly estimate the edit distance. Rather, it can distinguish with high-probability poorly aligning sequence pairs from well aligning sequence pairs. The question whether OMH is metric is not really valid. It is not a measure at all.

Cheers

@jianshu93
Copy link
Author

Hello Guillaume,

Thanks for the quick response during holiday! For the second question, my question then would be the distance/similarity estimated by OMH (the value calculated by omh_compute between 2 sequences, there must be a value telling whether high-probability poorly aligning sequence pairs and well aligning sequence pairs can be differentiated, like Figure 4 in the paper, correct me If I am understanding the paper in a wrong way) metric? Thanks for pointing out the indik's definition. The reason I am asking is also related to nearest neighbor search, but not based on LSH, but based on tree-like or graph like space partitioning algorithms, which requires that the distance used must be metric. I want to feed the output from omh_compute (a distance) to graph or tree like algorithms like this one here, which requires the distance used must be metric (or at least close).

Thanks,

Jianshu

@jianshu93
Copy link
Author

Question on to what edit dissimilarity between 2 sequences, order Min Hash is still accurate. I can see that for 0.45 edit dissimilarity, it is still accurate and can differentiate those 2 sequences. Another question is, if 2 sequences are not the same length, say one is only 1/5 of another, then will order min hash can still accurately accproxiamte so called, semi-global alignment identity?

Thanks,

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants