C ++ some questions on boost :: unordered_map and boost :: hash

I just started working in boost and containers, and I read several articles on the Internet and on stackoverflow that boost :: unordered_map is the fastest container for large collections. So, I have this class state, which should be unique in the container (without duplicates), and there will be millions, if not billions of states in the container. Therefore, I tried to optimize it for small sizes and as few calculations as possible. I used to use boost :: ptr_vector, but when I read about stackoverflow, a vector is only good when there are not many objects in it. In my case, the state descibes sensorimotor information from the robot, so there can be a huge number of states, and therefore a quick search is the highest priority. By following up on the documentation for unordered_map, I understand that I can do two things to speed things up: use the hash function and use the equality operator to compare states based on their hash function. So, I implemented the private hash () function, which takes state information and uses boost :: hash_combine, creates a hash value of std :: size_t. The == operator compares mainly hash state values. So:

  • : std :: size_t enough to cover billions of possible hash_function combinations? To avoid duplication of states, I intend to use their hash_values.

  • When creating state_map should I use state * or a hash value as a key? ie: boost::unordered_map<State*,std::size_t> state_map; Or boost::unordered_map<std::size_t,State*> state_map;

  • Are search times with boost :: unordered_map :: iterator = state_map.find () faster than via boost :: ptr_vector and comparing each value of the iterator key?

  • Finally, any advice or recommendations on optimizing such an unordered map for quick and quick searches would be greatly appreciated.

EDIT: I saw quite a few answers, one of which is not to use boost, but C ++ 0X, the other is not to use unordered_set, but to be honest, I still want to see how boost :: unordered_set is used with a hash function. I followed the additional documentation and implemented, but I still can't figure out how to use the boost hash function with an ordered set.

+6
source share
3 answers

This is a bit confusing.

  • What you say is not โ€œthings you can do to speed things upโ€; rather, they are mandatory requirements of your type, which can be recognized as the element type of an unordered map, as well as for an unordered set (which you might want).

  • You need to provide an equality operator that compares objects, not hash values. The whole point of equality is to distinguish between elements with the same hash.

  • size_t is an unsigned integral type, 32 bits on x86 and 64 bits on x64. Since you want โ€œbillions of elements,โ€ which means a lot of gigabytes of data, I assume you have a solid x64 machine anyway.

  • The important thing is that your hash function is good, i.e. has several collisions.

  • You need a kit, not a card. Put the objects directly into the set: std::unordered_set<State> . Use a card if you are matching with something, that is, it states something else. Oh, use C ++ 0x, and don't raise if you can.

  • Using hash_combine is good.


Example for a child:

 struct State { inline bool operator==(const State &) const; /* Stuff */ }; namespace std { template <> struct hash<State> { inline std::size_t operator()(const State & s) const { /* your hash algorithm here */ } }; } std::size_t Foo(const State & s) { /* some code */ } int main() { std::unordered_set<State> states; // no extra data needed std::unordered_set<State, Foo> states; // another hash function } 
+4
source

Unordered_map is a hash table. You do not store the hash; it is done internally as a storage and retrieval method.

Given your requirements, an unordered_set may be more appropriate since your object is the only item to store.

You are a little confused - the equality operator and hash functions are not really performance elements, but are necessary for non-trivial objects for the container to work properly. A good hash function will distribute your nodes evenly across the buckets, and the equality operator will be used to eliminate any ambiguity regarding hash-based matches.

std :: size_t is great for a hash function. Remember that no hash is perfect; collisions will occur, and these collision elements are stored in a linked list in this bucket position.

Thus .find () will be O (1) in the optimal case and very close to O (1) in the middle case (and O (N) in the worst case, but a decent hash function will avoid this.)

You do not mention your platform or architecture; in billions of records, you still have to worry about running out of memory, depending on the size and size of the State object.

+2
source

forget about the hash; there is nothing (at least from your question) that suggests that you have a meaningful key;

lets you step back and rephrase your actual performance:

  • you want to quickly check for duplicates for any object objects

comment if I need to add others.

From the above goal, and from your comment, I would suggest using ordered_set , not unordered_map. Yes, an ordered search uses the binary search O (log (n)), while an unordered one uses the search O (1).

However, the difference is that with this approach, you need an ordered_component ONLY to verify that such a state does not exist already when you are going to create a new one, that is, it is in a time creation state.

In all other search queries, you actually don't need to look into order_set! because you already have a key; State *, and the key can access the value using the magic dereference operator: * key

so with this approach, you only use order_set as index to check the states when creating only time. In all other cases, you get access to your state using the pointer key dereference operator.

if all of the above was not enough to convince you, here is the last nail in the coffin of the idea to use a hash to quickly determine equality; the hash function has a small probability of collision, but as the number of states increases, this probability will become a complete certainty. Therefore, depending on your resiliency, you will be dealing with government clashes (and from your question and the number of states that you expect to deal with, it looks like you will be dealing with a lot of them).

To do this, you obviously need a comparison predicate to check all the internal properties of your state (gyroscope, motors, accelerometers, proton rays, etc.).

+2
source

Source: https://habr.com/ru/post/892721/


All Articles