DHT: BitTorrent vs Kademlia vs clones (Python)

I am in the middle of implementing my own DHT for an internal cluster. Since it will be used in a file-sharing program similar to BitTorrent, "Mainline DHT" was the first thing I looked at. After that, I found "entangled" (Python, DHT built on Twisted), "congress" (Python, DHT using pyev + libev) and, of course, the original "Kademlia" paper.

They have different approaches to organizing k-buckets:

1) congress and kademlia use a fixed set of 160 buckets, where bucket i covers the IDs whose XOR distance d from our own ID satisfies 2^i <= d < 2^(i+1), for 0 <= i < 160. (A minimal sketch of both rules follows the list.)

2) Mainline DHT and entangled use dynamic buckets. At startup they have only 1 bucket covering the entire ID space. Once it is filled with 8 live nodes, the bucket is split into 2 new ones, but ONLY if our own ID falls inside that bucket. If it does not, the bucket is never split. So eventually we end up with roughly 160 buckets near our own ID and a few others.
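To make the two layouts concrete, here is a minimal sketch of both rules. The function names and signatures are mine, not taken from any of those libraries:

```python
def bucket_index_fixed(own_id: int, node_id: int) -> int:
    """Fixed layout (congress / the Kademlia paper): bucket i holds IDs
    whose XOR distance d from us satisfies 2**i <= d < 2**(i + 1)."""
    distance = own_id ^ node_id
    if distance == 0:
        raise ValueError("we never store our own ID in the table")
    return distance.bit_length() - 1  # == floor(log2(distance)), in 0..159


def should_split(bucket_min: int, bucket_max: int, own_id: int,
                 live_nodes: int, k: int = 8) -> bool:
    """Dynamic layout (Mainline DHT / entangled): a full bucket is split
    only when our own ID falls inside its [bucket_min, bucket_max) range."""
    return live_nodes >= k and bucket_min <= own_id < bucket_max
```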

Both options are good enough. But I found a HUGE difference in the logic that decides whether a given ID belongs to a given bucket or not. And that is my question.

congress and kademlia treat bucket boundaries as "minimum distance from us" and "maximum distance from us". Thus our own ID will ALWAYS be in bucket0, and at most 2 other IDs can live in bucket1 (because it covers distances 2^1 <= x < 2^2), and those are ALWAYS among the closest to us. So my brain doesn't hurt, everything is fine.
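A tiny check of that property, using 8-bit IDs for readability (`own` is an arbitrary value I picked, not anything from the libraries):

```python
own = 0b10100110  # arbitrary 8-bit node ID, for illustration only

# Group every other ID by its distance-based bucket index.
buckets: dict[int, list[int]] = {}
for other in range(256):
    if other == own:
        continue
    d = own ^ other
    buckets.setdefault(d.bit_length() - 1, []).append(other)

for i in sorted(buckets):
    print(f"bucket{i}: {len(buckets[i])} IDs")
# bucket0: 1 ID, bucket1: 2 IDs, bucket2: 4 IDs, ... bucket7: 128 IDs
```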

But if you look at Mainline DHT or entangled, you will see that they treat bucket boundaries as absolute node ID values, not as XOR distances from us! So, in a theoretically full table, the IDs 0,1,2,3,4,5,6,7 will all sit in 1 bucket.
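That end state can be reproduced in a few lines; `split_around` is a hypothetical helper of mine, not code from either project. Starting from one bucket over the whole space and always splitting only the half that contains our own ID gives:

```python
def split_around(own_id: int, bits: int = 160) -> list[tuple[int, int]]:
    """Fully split a Mainline-style table: only the half containing our
    own ID keeps splitting; every far half stays a single bucket."""
    lo, hi = 0, 1 << bits
    far_buckets = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if own_id < mid:
            far_buckets.append((mid, hi))
            hi = mid
        else:
            far_buckets.append((lo, mid))
            lo = mid
    return far_buckets

own = (1 << 159) + 12345     # arbitrary ID in the upper half of the space
print(split_around(own)[0])  # (0, 2**159): IDs 0..7 all share this one bucket
```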

So: why do some implementations interpret bucket boundaries as "minimum/maximum distance from us", while others interpret them as "minimum/maximum value of a 160-bit integer"?

1 answer

Dynamically splitting buckets is actually mentioned in the Kademlia paper as an optimization applied as the routing table grows. There is no logical difference between the two approaches; it is just an optimization to save space. When implementing a fixed full-size routing table, you must find k nodes to send requests to. If the bucket your target falls into is empty or contains fewer than k nodes, you need to pick from neighbouring buckets. Given that you will not have to split the bucket closest to you any time soon, this simplifies and speeds up that search.
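As a sketch of what that neighbour-spilling looks like with a fixed table (all names here are mine; assume `buckets[i]` holds the node IDs at XOR distance 2^i .. 2^(i+1)-1 from us):

```python
def k_closest(buckets: list[list[int]], own_id: int, target: int,
              k: int = 8) -> list[int]:
    """Start at the bucket the target falls into and widen to the
    neighbouring buckets until at least k candidates are collected,
    then keep the k closest by XOR distance to the target."""
    idx = max((own_id ^ target).bit_length() - 1, 0)
    candidates = list(buckets[idx])
    lo, hi = idx - 1, idx + 1
    while len(candidates) < k and (lo >= 0 or hi < len(buckets)):
        if lo >= 0:
            candidates += buckets[lo]
            lo -= 1
        if hi < len(buckets):
            candidates += buckets[hi]
            hi += 1
    return sorted(candidates, key=lambda n: n ^ target)[:k]
```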

As for your point (1), I think you may have misread Kademlia. Routing table bucket boundaries are always relative to your own node ID, and the slice of ID space covered by a bucket doubles for each bucket farther away from you. Without this property (if, say, each bucket covered an equal range of the ID space), lookups would not work correctly, and they certainly would not take log(n) steps.
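To see why the doubling matters, here is a toy model (an assumption-laden illustration of mine, not real DHT code): if every hop can reach some node sharing a strictly longer ID prefix with the target, the hop count stays small, on the order of log2(n).

```python
import random

def lookup_hops(n: int = 10_000, bits: int = 32, seed: int = 1) -> int:
    """Idealized lookup: every hop jumps to some node that is strictly
    closer to the target in XOR terms (shares a longer prefix)."""
    rng = random.Random(seed)
    ids = rng.sample(range(1 << bits), n)
    target = rng.choice(ids)
    current = rng.choice(ids)
    hops = 0
    while current != target:
        differing = (current ^ target).bit_length()
        closer = [x for x in ids if (x ^ target).bit_length() < differing]
        current = rng.choice(closer)  # never empty: the target itself qualifies
        hops += 1
    return hops

print(lookup_hops())  # a small number of hops, roughly log2(n)
```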

Mainline DHT implements Kademlia.


Source: https://habr.com/ru/post/898904/

