-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poldercast Vicinity Layer Not Returning "Nearest" Neighbours #15
Comments
Yes and this is why the ID needs to be removed. This was done to facilitate management of the network protocols utilising |
The IDs are not the problem though? In Jormungandr we can just generate the IDs by hashing IP addresses. The problem here is where the layer fetches the nodes from the sorted list. |
The list is sorted by the proximity function which is based on topic preferences, not on IDs.which is exactly what vicinity is about. |
But the available nodes are stored in a BTreeSet keyed by ID. The layer fetches all the nodes from this set and then applies the proximity sort order. Given that all pools subscribe to the same topic with the same priority, the order is basically guaranteed to be lexicographically ordered by ID. Then the layer just takes the top 20. Which are always the same nodes. |
It is a poldercast/src/poldercast/vicinity.rs Line 76 in f882186
EDITokay I answered too quickly indeed that all node selected are ordered by ID, the introduction of the new unstable sort you have added in your last PR #9 does not guarantee the original order is kept, so this is not guaranteed to be true. The the proximity function will still reorder things by topic/preferences which is what we expect from the paper. Now we can still improve things with not allowing the users to set their own ID is necessary, this was an unfortunate mistake and allows a user to trick the game. However we are still doing what we were supposed to do, and this is one of the limitation of poldercast in the context of few topics:
Also the vicinity is not supposed to change that often. This is the role of Cylon to provide random nodes. Vicinity is well defined as to be arbitrarily deterministic. Which is what is being done below: Lines 75 to 91 in f882186
BUT I see where you are coming from and in the context of being used in Jörmungandr there is only a few topics to use (blocks and block contents), so yes your analysis is correct in this context, nodes are likely to be the same for everyone and this is likely going to make propagations of events less performant. So we could do something to improve the proximity function a little bit. We could do a couple of things to improve things:
function available here: Lines 29 to 31 in f882186
and we would need to update: Lines 120 to 122 in f882186
This will make us deviate a bit more from the original paper, but would improve the vicinity selection by adding a bit more randomness. |
That vector originates from this BTreeSet: Line 10 in f882186
Which is returned by this method: Line 62 in f882186
That method collects the set in sorted order (by key value, i.e. ID) and transforms it to a vector. Which is in the same order. |
I will edit my comment accordingly. |
Does that mean the analysis is correct? |
yes, in the context of Jörmungandr you are right. It is likely the nodes selected by vicinity will be the same. I have updated my comment. Thanks a lot for spotting this issue. |
Go team Jormungandr! 🎉 |
However do you agree the title of the issue is no longer correct?. The proximity function is correct as per the paper. We are actually returning the nearest ones. Just that in the case of IOHK’s product this is not enough as it risks to return always the same kind of nodes due to the ID and the few numbers of topic. I am thinking that having an Id that is randomly generated would be better instead of a hash. I don’t recall the paper being explicit about using the IP address. Also in the case of jormungandr there are nodes that don’t have IP addresses. |
No not really. The sorting function I believe to be correct. But the last part (fetching the actual neighbours of the node) is not correct. You have 10 of the letters in the alphabet and you are trying to find the closest neighbours to M. So you sort the letters alphabetically. Then search around M to find JKL[M]NOP. At the moment, ABCDEFG (or whatever letters you have in your bucket) are always returned. This clearly cannot be the nearest neighbour as it lacks an "anchor" point (e.g. M).
Personally I'd just take the ip (and maybe the port), hash it with BLAKE and then take the first 24 characters from that. That way, any node can always validate the IP address and port against the hash. I guess one problem however is, some nodes (i.e. wallet nodes) don't necessarily know their IP address! So, they need to be random.... Thinking.... |
Node IDs are unique. So unstable sort is not a problem as there will never be two nodes in the set of nodes that have equality (only because indirectly the nodes are first sorted by ID, then by proximity, thus pool 000000000000... with topics/priorities (BLOCKS:HIGH + MESSAGES:HIGH) always at the top).
I think you nailed it there. Regardless of the implementation, i cant really see much point in the business of topics, or priority for that matter. Not unless there is a long game here where it should become important? As you say, most nodes (certainly all stake pools), are interested in the same topics (blocks and fragments) and at the same level of priority (high).
Agreed. Also why #16 is relevant to this issue.
Good start. At some point in the future will have to discuss the security of this scheme still. As its a trustless p2p system I can think of a handful of issues. For example, I can just generate the same ID as another node and enter the network, thus disrupting that node.
I think this might also be a good step in the right direction. But there is this from the PolderCast paper, which I quite like:
I'd like to think we could move in that direction long term as well. |
Reading my last comment back to myself, I'm realising that this issue is also partially resolved if #16 is addressed. Because, if vicinity is populated by the smaller random subset managed by the Cyclon layer, then indeed sorting as-is will be perfectly fine because you are always sorting on a smaller random sample set - and within that subset, all you care about is subscription+priority sort order. I see where you are coming from now. Tell me if you agree with the following:
The key thing here is this sentence from the PolderCast paper:
I guess Vicinity could maintain a bounded binary heap sorted by priority for each topic? Also... In the meantime while this and #16 is addressed - we could just disable the vicinity layer which will remove all gaming of the system and should be significantly better than what is currently going on (because of all that traffic being sent to the top N of all stake nodes ordered lexicographically by node ID). |
Another relevant piece of information that isn't currently implemented correctly:
Current implementation assumes the node specifies both topic and priority. But in-fact priority is a calculated and dynamic value used to ensure that overlay rings that are less well established have a higher priority of being processed. |
I have made a new release the changesI have made a breaking change release of poldercast. To verify though (there is still an `Id` in the `NodeProfile` to allow compatibility with previous versions (there is an `Id` randomly generated in the `NodeProfile` (Gossip)).However code using the latest version will need to:
Improvements are:
These addresses the issues with follow up improvements will be to leverage the |
An interesting pattern has shown up with nodes allocating their own IDs. Many are choosing to use IDs prefixed with a large number of
0
characters.Here is a lexicographically ordered dump of peers gossiped on the network from yesterday.
After running a node with a similar number of
0
prefixed characters, one observes an extremely large increase in the number of block announcements received.The reason is that the Vicinity layer is not taking the "nearest nodes" as described in the PolderCast Paper. Rather it is ordering them, according to the proximity function, but then simply taking the top N entries:
poldercast/src/poldercast/vicinity.rs
Lines 85 to 90 in f882186
Nearest neighbours should be taken as a window around the nodes own ID after having been sorted by the proximity function.
The text was updated successfully, but these errors were encountered: