
Transferring ports while minimizing races from other threads #35

Open
rainwoodman opened this issue Dec 1, 2023 · 6 comments

@rainwoodman
Contributor

Currently, between the moment portpicker picks a port and the moment the port number is handed to another thread / process to use, there can be a large time window in which neither portpicker nor the port consumer owns the port. As a consequence, some other user of the socket API can grab that port, causing the intended consumer to fail to bind it.
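
For concreteness, a minimal sketch of that window, assuming only `portpicker.pick_unused_port()` and the standard library (the sleep stands in for whatever happens between allocation and use):

```python
import socket
import time

import portpicker

port = portpicker.pick_unused_port()  # free at this instant...

time.sleep(5)  # ...but nothing holds it here, so any other process or
               # thread may bind it in the meantime.

# The eventual consumer can therefore lose the race.
consumer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    consumer.bind(('localhost', port))  # raises OSError (EADDRINUSE) if taken
finally:
    consumer.close()
```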

This pattern is likely the cause of several flaky tests in TensorFlow OSS workflows, and it has negatively affected the productivity of GitHub contributors in the surrounding community (OpenXLA, JAX, etc.).

I wonder if we can add some facility in portpicker to facilitate the transfer of allocated ports from the picker thread/process to the consumer thread/process. It seems rather doable using current Python multiprocessing primitives.

Thoughts? @gpshead

@gpshead
Contributor

gpshead commented Dec 1, 2023

Are you using a portserver in your TensorFlow OSS workflows? If so, the portserver would've marked it as allocated and prevented it from being handed out to anything else until the requesting process has died or released it.
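
For reference, a sketch of what portserver-backed allocation typically looks like; this assumes a portserver is already running, that `pick_unused_port` consults a `portserver_address` argument or the `PORTSERVER_ADDRESS` environment variable, and the abstract-socket name below is only an example:

```python
import os

import portpicker

# portpicker falls back to the PORTSERVER_ADDRESS environment variable...
os.environ['PORTSERVER_ADDRESS'] = '@unittest-portserver'
port = portpicker.pick_unused_port()

# ...or the address can be passed explicitly.
port = portpicker.pick_unused_port(portserver_address='@unittest-portserver')

# Either way, the portserver keeps the port reserved for this PID until the
# process exits or the port is returned, so it will not be handed out again.
```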

@gpshead
Contributor

gpshead commented Dec 1, 2023

More on the topic of your question - transferring ports - I presume you mean the ability to transfer a file descriptor between processes on POSIX? This could work if the server code in question were willing to accept an already open & bound fd instead. That feels like an unusual thing for most servers to allow rather than doing their own bind. So unless all possible servers were modified, would it matter?

A recent writeup on this old approach for added background: https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec (and its followup for a recent Linux 5.6+ pidfd_getfd approach)

Internally Python's multiprocessing.forkserver uses the old fd over unix socket transfer mechanism. I do not think it wise to reach into multiprocessing internals (those are not public APIs) but it can provide an example if needed.
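
A self-contained sketch of that fd-over-unix-domain-socket hand-off, using the public `socket.send_fds()` / `socket.recv_fds()` helpers (POSIX, Python 3.9+) rather than multiprocessing internals; the parent binds the port and the child receives the already-bound fd:

```python
import multiprocessing
import socket


def child(uds):
    # Receive one fd over the unix domain socket and rebuild a socket object.
    _, fds, _, _ = socket.recv_fds(uds, 1024, 1)
    listener = socket.socket(fileno=fds[0])  # family/type detected from the fd
    listener.listen()
    print('child listening on', listener.getsockname())
    listener.close()


if __name__ == '__main__':
    parent_uds, child_uds = socket.socketpair()

    bound = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    bound.bind(('localhost', 0))  # the port is owned from this point on

    proc = multiprocessing.Process(target=child, args=(child_uds,))
    proc.start()
    socket.send_fds(parent_uds, [b'x'], [bound.fileno()])
    bound.close()  # the duplicated fd in flight keeps the port bound
    proc.join()
```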

@rainwoodman
Contributor Author

There has been some difficulty setting up a portserver, and it would not really solve the problem, because some of the port activity after the initial allocation from the portserver happens outside the portserver. The use case where I think a port-transfer API could help is the following:

  1. Process M allocates 20 ports from portpicker, P0 ... P19, then spends a while doing some 'initialization work';
  2. Process M forks 20 subprocesses, sending one of P0 ... P19 to each subprocess;
  3. Subprocess "i" launches a bunch of servers that bind to some ports;
  4. Subprocess "i" then binds to its own port P-i.

If any of P0 ... P19 gets bound during step 3, then step 4 for that subprocess fails.

I was wondering whether we can somehow keep P-i bound in subprocess "i" until freeing up that port right at step 4. That would be similar to transferring ownership of the port from Process M to subprocess "i".

In terms of implementation, I was not thinking of the fancier fd-transfer approach, but rather of just keeping the port bound and unbinding it only briefly while ownership is transferred between processes and into the intended consumer. This is not 100% race-free, but it should drastically reduce the flakiness, which is likely good enough.
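
A rough sketch of that "keep it bound, release at the last moment" idea; none of this is portpicker API, the names are illustrative, and it assumes the POSIX 'fork' start method so the child inherits the placeholder fd. It is shown for one subprocess; with several children, each child would additionally need to close the placeholders it does not own, since fork duplicates every open fd:

```python
import multiprocessing
import socket

import portpicker


def reserve(port):
    """Hold the port with a bound, listening placeholder socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('localhost', port))
    s.listen()
    return s


def worker(port, placeholder):
    # ... step 3: launch servers that grab their own ports; they cannot take
    # `port`, because the placeholder still holds it ...
    placeholder.close()  # step 4: release the reservation ...
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('localhost', port))  # ... and rebind immediately; the race
    server.listen()                   # window shrinks to these two lines.
    server.close()


if __name__ == '__main__':
    multiprocessing.set_start_method('fork')  # child must inherit the fd
    port = portpicker.pick_unused_port()      # (tiny unguarded gap until reserve)
    placeholder = reserve(port)
    child = multiprocessing.Process(target=worker, args=(port, placeholder))
    child.start()
    placeholder.close()  # drop the parent's copy; this must happen before the
                         # child reaches step 4, or its re-bind will fail
    child.join()
```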

@gpshead
Contributor

gpshead commented Jan 11, 2024

it would not really solve the problem, because some of the port activity after the initial allocation from the portserver happens outside the portserver.

I'm not sure what you mean by that. The portserver won't hand out the same port again until the process ID associated with the request dies (or otherwise returns the port). In your example this would likely be the orchestrating "Process M". But if other processes also want to request a port directly, on their own behalf or for their own children, that also works with a portserver.

@rainwoodman
Contributor Author

I meant that step 3 allocates ports via a different library which does not use the portserver, so there is a likelihood that that library has already bound a port that was going to be used in step 4.

@gpshead
Contributor

gpshead commented Jan 16, 2024

Fundamental to being able to reliably test network software is being able to control all port assignments across the entire software stack being run.

Clarifying question:

  1. Are you talking about code running as part of the TensorFlow OSS testing workflow? In that case, aim to fix the "different library" to have port(s) passed in rather than picking randomly on its own, and plumb them through (see the sketch after this list). (Or have it use a portserver when the environment offers one.)
  2. If you are talking about software outside of the TensorFlow OSS testing workflows claiming ports on the "system", that kind of conflict is always possible. One suggestion for that case: consider running the testing workflows in containers with their own network namespace. Such forced isolation from the rest of the system, in a controlled environment, can bring reliability.
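
Regarding item 1, a small sketch of what "plumb the ports through" could look like, with hypothetical helper names (not anyone's actual API): the helper library takes an explicit port and only picks its own as a fallback.

```python
import portpicker


def start_debug_server(port=None):
    """Serve on `port`; pick a fresh one only if the caller supplied none."""
    if port is None:
        port = portpicker.pick_unused_port()
    # ... bind and serve on `port` ...
    return port


# The test orchestrator allocates every port up front (ideally from a
# portserver) and hands them down, so no component races to pick its own.
ports = [portpicker.pick_unused_port() for _ in range(3)]
for p in ports:
    start_debug_server(port=p)
```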

Anyway, I'm happy to review PRs implementing something along these lines in portpicker that users could opt into, but I cannot dedicate time to developing such a new solution for a problem that no testing infrastructure I maintain runs into.
