examples to run? #6

Open
jianshu93 opened this issue Dec 2, 2023 · 1 comment

@jianshu93

Hello Team,

I am trying to run the Perl script to get output for many sequences. Is there an example input for the Perl script? (I have already successfully compiled everything needed.) I want to compare the OMH-estimated Jaccard index with the exact Jaccard index for 1000 sequences, all versus all.

Thanks,

Jianshu

@danfdeblasio
Collaborator

Hi Jianshu,

Happy to hear you're looking to use OMH, and thanks for pointing out this issue. I think it stems mainly from some missing documentation.

I made the random sequences README a little more verbose, and I hope it helps. At the last minute we added the k-mer size to the input of the Perl script, and this was not documented. The input file is expected to be the captured standard output from the generator Python script, but as long as it's a tab-delimited list of pairs of sequences, the script should read it fine and output a list of OMH values (in the 4th column of another tab-delimited file).
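For the all-versus-all case, a pairs file in that shape could be produced with something like the sketch below. This is only an illustration, not part of the repository: the function names `kmer_set`, `exact_jaccard`, and `pair_lines` are hypothetical, and the "exact Jaccard" here is assumed to be the plain set Jaccard over distinct k-mers, which may or may not be the definition used by the generator script.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exact_jaccard(a, b, k):
    """Set Jaccard over distinct k-mers (assumed definition, see lead-in)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def pair_lines(seqs):
    """Yield one tab-delimited line per unordered pair of sequences."""
    for a, b in combinations(seqs, 2):
        yield f"{a}\t{b}"


if __name__ == "__main__":
    # Hypothetical usage: print the pairs and redirect stdout to a file
    # that is then given to the Perl script together with the k-mer size.
    seqs = ["AACAACACCC", "AACACAAACC", "ACACCAACAC"]
    for line in pair_lines(seqs):
        print(line)
```

With k=5, `exact_jaccard("AACAACACCC", "AACACAAACC", 5)` comes out to 1/11 (about 0.0909), since the two sequences share one 5-mer out of 11 distinct ones. For 1000 sequences this yields 499,500 pairs.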

As an example, the output from generate_random_pairs.py would be something like (here this is k=5, n=10 with --trim on):

AACAACACCC AACACAAACC 4 4 6 0.09090909090909091 0.09090909090909091 3
ACACCAACAC ACCCCAAACC 4 4 5 0.0 0.0 3
CAACAAACCC CACAACACAA 6 1 6 0.1 0.1 4
AAACCACA00 AACCCAACAC 6 1 5 0.0 0.0 4
AACAACAAAA AACCCCAAAC 6 1 4 0.0 0.0 2
ACCAAAACCC ACCACCAACC 4 4 5 0.09090909090909091 0.09090909090909091 3
CACACACACA CCCCCAACAC 6 1 5 0.0 0.0 5
ACACCCACAA ACCCACCAC0 6 1 6 0.2 0.2 5
ACACAACAA0 ACACCAACAC 4 4 7 0.09090909090909091 0.09090909090909091 3
AAACCCCCAA AACCCCCA00 4 4 8 0.5 0.5 3
...

The output of the Perl script is then something like (again with k=5 specified):

4 0.09090909090909091 0.09090909090909091 0 3
4 0.0 0.0 0 3
6 0.1 0.1 0 4
6 0.0 0.0 0 4
6 0.0 0.0 0 2
4 0.09090909090909091 0.09090909090909091 0 3
6 0.0 0.0 0 5
6 0.2 0.2 0.0222222 5
4 0.09090909090909091 0.09090909090909091 0 3
4 0.5 0.5 0.214286 3
...

The columns other than the 4th are copied from the input (those values are computed in the Python script).

So if you were to input just sequence pairs, you would have extraneous tabs in the output, but it should work.
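To pull the OMH estimates back out of that output for comparison, a minimal sketch could look like the following. It assumes tab-delimited lines with the OMH value in the 4th column, as described above; the function name `omh_values` is hypothetical, and the handling of empty columns (when only sequence pairs were given as input) is a guess about how the extraneous tabs show up.

```python
def omh_values(lines):
    """Collect the 4th tab-delimited column of each line as a float.

    Lines with fewer than four columns, or an empty 4th column, are skipped.
    """
    values = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[3]:
            continue
        values.append(float(fields[3]))
    return values


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pipe the Perl script's output into this script.
    for value in omh_values(sys.stdin):
        print(value)
```

Because the values are returned in line order, they can be matched one to one with the pairs in the input file.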

Let me know if you have additional questions.

Dan

@danfdeblasio danfdeblasio self-assigned this Dec 4, 2023