Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speeding up deep graphs #2

Open
ekg opened this issue May 17, 2022 · 20 comments
Open

speeding up deep graphs #2

ekg opened this issue May 17, 2022 · 20 comments

Comments

@ekg
Copy link

ekg commented May 17, 2022

On deep graphs (2k-fold) I'm seeing gfaffix taking quite a bit of time. It's essentially single threaded, right? Is there a possible way to adapt it to operate in parallel?

@danydoerr
Copy link
Member

danydoerr commented May 18, 2022

There is potential for speedup by multithreading parts of GFAffix, especially if the bottle neck is the graph traversal. I'd be interested in knowing which step in the algorithm affects the running time on these deep graphs. The graph traversal can be parallelized, but it's not embarrassingly parallelizable, as the graph editing (which is intertwined in the graph traversal) can only be done safely in a single thread.

@natir
Copy link
Contributor

natir commented May 19, 2022

If you give me some sample dataset I can try to perform runtime analysis and try to improve running time by parallelized or not.

@danydoerr
Copy link
Member

@ekg do you have such a "deep graph" handy?

@natir
Copy link
Contributor

natir commented May 19, 2022

Or just smaller but similar graph.

@ekg
Copy link
Author

ekg commented May 19, 2022 via email

@danydoerr
Copy link
Member

Do these deep graphs have a particular high node degree, or is the number of nodes exceptionally high?

@ekg
Copy link
Author

ekg commented May 19, 2022 via email

@danydoerr
Copy link
Member

GFAffix should scale linearly with path depth, overall this shouldn't dominate the runtime. But node degree is certainly a bottle neck.

@natir
Copy link
Contributor

natir commented May 19, 2022

@ekg Could you give me advise how to build a similar graph, type of data, pipeline, tools, etc…

@ekg
Copy link
Author

ekg commented Jun 8, 2022

@natir sorry to take a while here. I may need to share this somehow, but the graph is rather large and will take 4 days to build if you do so from scratch. I'll see if I can get a simpler test case together.

@danydoerr
Copy link
Member

@ekg Just so you know that your issue is not forgotten: I have a solution for parallelizing GFAffix and will work on this sometime soonish.

@AndreaGuarracino
Copy link

@danydoerr any updates on the parallelization?

@danydoerr
Copy link
Member

danydoerr commented Dec 20, 2024

@AndreaGuarracino thanks for asking. I'm preparing a release that should be out in the next few days. Do you want to have already a binary to test on? I'm also changing the cli...

@AndreaGuarracino
Copy link

AndreaGuarracino commented Dec 20, 2024 via email

@danydoerr
Copy link
Member

danydoerr commented Dec 20, 2024

@AndreaGuarracino
Copy link

Thx! Now I've noticed your refined_deduplication branch!

Looks extremely great. With 1 thread it is already a bit faster than the current main branch. And it is also slimmer in memory.

@danydoerr
Copy link
Member

danydoerr commented Dec 20, 2024 via email

@danydoerr
Copy link
Member

@AndreaGuarracino do you have ungfaffixed graphs of the new HPRC assemblies that I can test on?

@AndreaGuarracino
Copy link

AndreaGuarracino commented Dec 21, 2024 via email

@danydoerr
Copy link
Member

New version is released now. It still requires improvement in speed and memory usage for these large graphs. The bottleneck is now only the i/o part of gfaffix, and I already have some concrete ideas how to improve it. I'll leave this issue open for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants