ReproZip hangs on fit_transform() method from sklearn.decomposition and phate #388
Comments
I can't reproduce this; I ran it successfully in an Ubuntu 18.04 VM with reprozip 1.1 (from pip), using the Olivetti faces dataset. Can you try running with increased verbosity?
Unfortunately there doesn't seem to be anything wrong in that log; it looks like PCA is running. I assume you have waited long enough and it never completes? ReproZip shouldn't slow down that process anyway. Unless I can reproduce this locally, I am not sure I'll be able to fix it, sorry.
When I run it "normally" (without ReproZip), a single iteration of fit_transform() takes 0.03 s. With ReproZip tracing, I waited for 30 minutes and nothing happened. However, processor usage stays between 4% and 6% during that time, so it looks like something is going on. I will try running it on a different machine and with Python 3.8, and will share the findings.
I've managed to run reprozip trace after downgrading Python from version 3.9.15 to 3.8.0. It no longer hangs on fit_transform(). However, a single iteration of fit_transform() still takes much longer under reprozip trace (25 seconds) than in a "normal" run (160 milliseconds). Could you please check whether the execution time on your side is comparable between python file_name.py and reprozip trace python file_name.py, for the PCA code that I shared with you and the Olivetti faces dataset?
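For anyone comparing the two runs, per-iteration timings can be collected from inside the script itself, so the same numbers are printed with and without the tracer. The following is only a minimal sketch: the helper names `time_call` and `time_iterations` are hypothetical, and the workload is a stand-in rather than the PCA code from this issue.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def time_iterations(fn, n_iter, *args, **kwargs):
    """Call fn n_iter times and return the elapsed seconds of each call."""
    return [time_call(fn, *args, **kwargs)[1] for _ in range(n_iter)]


if __name__ == "__main__":
    # Stand-in workload (an assumption); in the real script this would be
    # the fit_transform() call being investigated.
    timings = time_iterations(sum, 3, range(100_000))
    for i, elapsed in enumerate(timings, 1):
        print(f"iteration {i}: {elapsed:.4f} s")
```

Running the script once as python file_name.py and once as reprozip trace python file_name.py would then give directly comparable per-iteration figures.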
Ubuntu 18.04 does not have Python 3.9, so I am not sure how to reproduce your setup. Did you compile Python from source?
I just meant reproducing it in Python 3.8.0, just like you did before, but this time checking how long it takes to perform ps.
How long are those commands taking respectively? There should be no overhead for the computing process, only for figuring out dependencies to write the
For single iteration of For 5 iterations of
I can reproduce the slowness if I increase the number of iterations. It seems that scikit-learn uses threads (15 for me) that call
I'm stumped. It looks like a bug in scikit-learn, to be honest. If I slow down reprozip further, making the sched_yield longer, the Python code yields even more and never completes. In any case, you can trace your program with a low number of iterations and then change it back to the proper number before packing. The reprozip tracer is not used during reproduction, so it shouldn't be a problem. Sorry I can't help further!
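One way to make the iteration count easy to lower for tracing is to read it from the environment with a fallback to the real value. This is a sketch under assumptions: the variable name `N_ITERATIONS`, the default of 100, and the helper `get_iterations` are all hypothetical and not part of ReproZip or the original script.

```python
import os

# Hypothetical "proper" iteration count used outside of tracing.
DEFAULT_ITERATIONS = 100


def get_iterations(env=os.environ, default=DEFAULT_ITERATIONS):
    """Read the iteration count from N_ITERATIONS, else use the default."""
    raw = env.get("N_ITERATIONS")
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("N_ITERATIONS must be a positive integer")
    return value


if __name__ == "__main__":
    for i in range(get_iterations()):
        # Placeholder for one iteration of the real experiment.
        print(f"iteration {i + 1}")
```

Tracing with the variable set to 1 would keep the traced run short. The low value used at trace time may be recorded along with the run, so it would still need to be changed back before packing, as suggested above.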
Thank you! That tip with tracing the program with a low number of iterations should do the trick!
Update:
export OPENBLAS_NUM_THREADS=1
Actually, it was the last of the above-mentioned exports that eventually made things run, so either only the last one is needed or all four are. It runs slower because threading is switched off inside the methods called from scikit-learn, but at least it doesn't hang. Multiprocessing outside scikit-learn works fine, so it is still possible to run the algos in parallel.
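The same setting can also be applied from inside the script instead of the shell, as long as it happens before the numerical libraries are loaded, since OpenBLAS reads the variable when it is first loaded. A minimal sketch follows: `limit_blas_threads` is a hypothetical helper, and only `OPENBLAS_NUM_THREADS` is set because it is the only variable named in this thread.

```python
import os


def limit_blas_threads(n_threads=1, env=os.environ):
    """Set OPENBLAS_NUM_THREADS and return the value that was stored.

    Must run before NumPy or scikit-learn are imported, otherwise the
    already-loaded BLAS library will not see the new value.
    """
    env["OPENBLAS_NUM_THREADS"] = str(n_threads)
    return env["OPENBLAS_NUM_THREADS"]


limit_blas_threads(1)

# Import the numerical libraries only after the variable is set, e.g.:
# from sklearn.decomposition import PCA
```

Setting the variable at the very top of the script means the traced command does not depend on the shell it was launched from.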
Thanks @milech, that should help narrow it down. Hopefully I can find where the issue is. ReproZip should not interfere with OpenMP like this.
One more thing that I forgot to mention. Setting those environment variables made it work only in this configuration:
Package Version
certifi 2022.12.7
python==3.9.15
When running reprozip trace on an experiment containing dimensionality reduction algorithms like PCA or t-SNE from scikit-learn, or PHATE from the phate library, it hangs on the fit_transform() method.
Versions of libraries:
scikit-learn 1.2.0
phate 1.0.10
pandas 1.3.5
System: Ubuntu 18.04.6 LTS
Sample code to reproduce the issue: