Transferring ports while minimizing races from other threads #35
Are you using a portserver in your TensorFlow OSS workflows? If so, the portserver would have marked the picked port as allocated and prevented it from being handed out to anything else until the requesting process has died or released it.
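For context, a minimal sketch of what the client side of a portserver setup looks like, assuming a portserver is already running; the `@unittest-portserver` address here is just a placeholder for whatever address the server was started with:

```python
import portpicker

# Ask a running portserver for a port. The server records the allocation
# against this process id and will not hand the port out again until this
# process dies or returns it.
port = portpicker.pick_unused_port(portserver_address='@unittest-portserver')
print(f'test server should listen on port {port}')
```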
More on the topic of your question - transferring ports - I presume you mean the ability to transfer a file descriptor between processes on POSIX? This could work if the server code in question were willing to accept an already open & bound fd instead. That feels like an unusual thing for most servers to allow rather than doing their own bind. So unless all possible servers were modified, would it matter? A recent writeup on this old approach for added background: https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec (and its follow-up covering a newer Linux 5.6+ mechanism). Internally, Python's socket module already has support for this (sending file descriptors between processes over Unix domain sockets).
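For illustration only, here is a rough sketch of that fd-transfer approach using the send_fds/recv_fds helpers Python's socket module provides on Unix (Python 3.9+); in practice the two ends would live in different processes, which this single-process sketch glosses over:

```python
import socket

# "Picker" side: bind the port once and ship the open, bound fd over a
# Unix domain socket pair (SCM_RIGHTS under the hood).
picker_end, consumer_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('localhost', 0))            # kernel assigns a free port
port = listener.getsockname()[1]
socket.send_fds(picker_end, [b'x'], [listener.fileno()])

# "Consumer" side: receive the already-bound fd and wrap it in a socket
# object; the port was never unbound, so nothing else could have taken it.
_, fds, _, _ = socket.recv_fds(consumer_end, 1, 1)
server_sock = socket.socket(fileno=fds[0])
server_sock.listen()
```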
There has been some difficulty setting up a portserver, and it won't fully solve the problem because some of the port activity after the initial allocation from the portserver happens outside the portserver. The use case where I think a port transfer API can help is the following:
If any of P0 .. P19 is bound at step 3, then step 4 for that process would fail? I was thinking we could somehow keep P-i bound in subprocess "i" until freeing up that port right at step 4. That would be similar to transferring ownership of a port from Process M to subprocess "i". In terms of implementation, I was not thinking of the fancier fd transfer approach, but rather just keeping the port bound and unbinding it briefly during the transfer of ownership between processes and into the intended consumer. This is not 100% secure, but it should drastically reduce the flakiness, which is likely good enough.
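A rough sketch of that idea, assuming a plain pair of multiprocessing.Event objects as the handoff signal (the holder/consumer split and all names here are hypothetical, not an existing portpicker API):

```python
import multiprocessing
import socket

import portpicker

def hold_port(port, release_requested, released):
    """Keep the picked port bound until the consumer asks to take it over."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(('localhost', port))
    release_requested.wait()   # consumer is ready to bind
    holder.close()             # unbind briefly during the ownership transfer
    released.set()

if __name__ == '__main__':
    port = portpicker.pick_unused_port()
    release_requested = multiprocessing.Event()
    released = multiprocessing.Event()
    multiprocessing.Process(
        target=hold_port,
        args=(port, release_requested, released)).start()

    # ... later, in the intended consumer, immediately before binding:
    release_requested.set()
    released.wait()
    consumer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    consumer.bind(('localhost', port))   # a small race window remains here
```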
I'm not sure what you mean by that. The port server won't hand out the same port again until the process id associated with the request dies (or the port is otherwise returned). In your example this would likely be the orchestrating "Process M". But if other processes also want to request a port directly, on their own behalf or for their own children, that also works with a port server.
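As a small illustration of the on-behalf-of-a-child case, pick_unused_port accepts a pid to associate the reservation with; the portserver address and the child command below are placeholders:

```python
import subprocess

import portpicker

# Start the child whose lifetime should own the reservation, then ask the
# portserver for a port keyed to that child's pid. The port is only
# recycled once the child exits.
child = subprocess.Popen(['sleep', '60'])          # placeholder workload
port = portpicker.pick_unused_port(
    pid=child.pid, portserver_address='@unittest-portserver')
```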
I meant that step 3 allocates ports from a different library which does not use the port server, so there is a chance that that library has bound to a port that was going to be used in step 4.
Being able to reliably test network software fundamentally requires controlling all port assignments across the entire software stack being run. Clarifying question:
Anyway, I'm happy to review PRs implementing something else, portpicker-wise, that users could opt into, but I cannot dedicate time to developing a new solution to a problem that no testing infrastructure I maintain runs into.
Currently, between the moment portpicker picks a port and the moment the port number reaches the thread/process that will use it, there can be a large time window during which neither portpicker nor the port consumer owns the port. As a consequence, some other user of the socket API can grab that port, causing the intended consumer to fail to bind it.
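Concretely, the racy pattern looks like this; `my_server` is a placeholder for whatever process eventually binds the port:

```python
import subprocess

import portpicker

port = portpicker.pick_unused_port()   # the port is free right now...

# ...but nothing holds it between this call and the moment the child
# process actually binds it, so any other thread or process on the
# machine can grab it in the meantime.
server = subprocess.Popen(['my_server', f'--port={port}'])
```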
This pattern is likely the cause of several flaky tests in TensorFlow OSS workflows, and it has negatively affected the productivity of GitHub contributors in the surrounding community (OpenXLA, JAX, etc.).
I wonder if we can add a facility in portpicker to transfer allocated ports from the picker thread/process to the consumer thread/process. It seems rather doable using current Python multiprocessing primitives.
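One possible shape for such a facility, sketched with the fork start method on Unix so the already-bound socket object can simply be handed to the child process (the names and overall API are assumptions, not anything portpicker provides today):

```python
import multiprocessing
import socket

def serve(sock):
    """Consumer: the socket is already bound, so just start listening."""
    sock.listen()
    # ... hand the socket to the real server code here ...

if __name__ == '__main__':
    # Picker side: bind immediately, so the port is never left unowned.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('localhost', 0))
    port = sock.getsockname()[1]

    # Pass the live socket (and its underlying fd) straight to the consumer.
    ctx = multiprocessing.get_context('fork')   # Unix-only in this sketch
    worker = ctx.Process(target=serve, args=(sock,))
    worker.start()
    print(f'clients can reach the consumer on port {port}')
    worker.join()
```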
Thoughts? @gpshead