Unexpectedly high collision counts with std::unordered_map hash functions

I was writing an image-processing algorithm, and at some point I needed to collect some statistics about the converted pixels to get a better idea of which direction to take with further development. The information I needed to collect was in the format:

key: RGB value, value: int

What I did was open the converted image and iterate over it, saving the values I needed into an std::unordered_map with the following signature:

 typedef std::unordered_map<boost::gil::rgb8_pixel_t, unsigned int> pixel_map_t; 

In the loop:

 for (int y = 0; y < vi.height(); y++) {
     SrcView::x_iterator dst_it = src.row_begin(y);
     for (int x = 0; x < vi.width(); x++, hits++) {
         diff_map.insert(std::make_pair(dst_it[x], /* some uint32 */));
     }
 }

I also wrote a custom hash function (in theory a perfect hash function: 256^2 * R + 256 * G + B), so collisions should be minimal regardless of the number of buckets and the hash-table layout, to a reasonable extent.

What I noticed was that insertion was terribly slow! After reaching the 11th iteration, the insertion rate worsened by about 100x, and I was getting a huge number of collisions, despite the very small number of duplicated colors in the image.

To rule out any error in my own code, I then ran a comparative analysis of unordered_map using the STL hash function with a primitive key type such as int.

Code for reference:

 std::size_t hits = 0, colls = 0;
 for (int y = 0; y < vi.height(); y++) {
     SrcView::x_iterator dst_it = src.row_begin(y);
     for (int x = 0; x < vi.width(); x++, hits++) {
         if (diff_map.find(x*y) != diff_map.cend())
             colls++;
         diff_map.insert(std::make_pair(x*y, 10));
     }
     std::cout << y << "/" << vi.height()
               << " -> buckets: " << diff_map.bucket_count()
               << "(" << std::floor(diff_map.load_factor() * 100)
               << "% load factor) [ " << colls << " collisions / "
               << hits << " hits ]" << std::endl;
 }

... and here are the results for the first iterations of the outer loop (using only the STL hash function for an int-typed key):

 0/480 -> buckets: 8(12% load factor) [ 639 collisions / 640 hits ]
 1/480 -> buckets: 4096(15% load factor) [ 640 collisions / 1280 hits ]
 2/480 -> buckets: 4096(23% load factor) [ 960 collisions / 1920 hits ]
 3/480 -> buckets: 4096(31% load factor) [ 1281 collisions / 2560 hits ]
 4/480 -> buckets: 4096(37% load factor) [ 1654 collisions / 3200 hits ]
 5/480 -> buckets: 4096(45% load factor) [ 1964 collisions / 3840 hits ]
 6/480 -> buckets: 4096(51% load factor) [ 2370 collisions / 4480 hits ]
 7/480 -> buckets: 4096(59% load factor) [ 2674 collisions / 5120 hits ]
 8/480 -> buckets: 4096(65% load factor) [ 3083 collisions / 5760 hits ]
 9/480 -> buckets: 4096(71% load factor) [ 3460 collisions / 6400 hits ]
 10/480 -> buckets: 4096(77% load factor) [ 3872 collisions / 7040 hits ]
 11/480 -> buckets: 4096(85% load factor) [ 4161 collisions / 7680 hits ]
 12/480 -> buckets: 4096(90% load factor) [ 4612 collisions / 8320 hits ]
 13/480 -> buckets: 4096(99% load factor) [ 4901 collisions / 8960 hits ]
 14/480 -> buckets: 32768(13% load factor) [ 5315 collisions / 9600 hits ]
 15/480 -> buckets: 32768(13% load factor) [ 5719 collisions / 10240 hits ]
 16/480 -> buckets: 32768(14% load factor) [ 6148 collisions / 10880 hits ]
 17/480 -> buckets: 32768(15% load factor) [ 6420 collisions / 11520 hits ]
 18/480 -> buckets: 32768(16% load factor) [ 6870 collisions / 12160 hits ]
 19/480 -> buckets: 32768(17% load factor) [ 7135 collisions / 12800 hits ]
 20/480 -> buckets: 32768(17% load factor) [ 7584 collisions / 13440 hits ]
 21/480 -> buckets: 32768(18% load factor) [ 7993 collisions / 14080 hits ]

Isn't the number of collisions too high here? STL libraries are generally high quality, but 639/640 and 640/1280 for a simple built-in key sounds strange, to say the least. Or am I doing something wrong?


And here is my hash function (in theory there shouldn't be any collisions at all, but the numbers were virtually identical):

 template<>
 struct std::hash<boost::gil::rgb8_pixel_t>
     : public std::unary_function<const boost::gil::rgb8_pixel_t&, size_t>
 {
     size_t operator()(const boost::gil::rgb8_pixel_t& key) const
     {
         size_t ret = (static_cast<size_t>(key[0]) << 16)
                    | (static_cast<size_t>(key[1]) << 8)
                    | (static_cast<size_t>(key[2]));
         // return 256 * 256 * key[0] + 256 * key[1] + key[2];
         return ret;
     }
 };

And now for the part that is no longer funny...

I wrote this hash function:

 template<>
 struct std::hash<int>
     : public std::unary_function<const int&, size_t>
 {
     size_t operator()(const int& key) const
     {
         return 5;
     }
 };

In theory, I should get a 100% collision rate, right? But the results are:

 0/480 -> buckets: 8(12% load factor) [ 639 collisions / 640 hits ]
 1/480 -> buckets: 4096(15% load factor) [ 640 collisions / 1280 hits ]
 2/480 -> buckets: 4096(23% load factor) [ 960 collisions / 1920 hits ]
 3/480 -> buckets: 4096(31% load factor) [ 1281 collisions / 2560 hits ]
 4/480 -> buckets: 4096(37% load factor) [ 1654 collisions / 3200 hits ]
 5/480 -> buckets: 4096(45% load factor) [ 1964 collisions / 3840 hits ]
 6/480 -> buckets: 4096(51% load factor) [ 2370 collisions / 4480 hits ]
 7/480 -> buckets: 4096(59% load factor) [ 2674 collisions / 5120 hits ]
 8/480 -> buckets: 4096(65% load factor) [ 3083 collisions / 5760 hits ]
 9/480 -> buckets: 4096(71% load factor) [ 3460 collisions / 6400 hits ]

Why?

Env: MSVS2010

+6
5 answers

colls does not measure collisions. If you want to measure collisions, then for each bucket b in the range [0, bucket_count()), get bucket_size(b). That tells you how many items landed in each bucket; if a bucket holds 2 or more items, you have bucket_size(b) - 1 collisions for bucket b.
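
A minimal sketch of that bucket walk, assuming diff_map has already been filled:

 // Every element beyond the first in a bucket collided with
 // another element in that bucket.
 std::size_t collisions = 0;
 for (std::size_t b = 0; b < diff_map.bucket_count(); ++b) {
     std::size_t s = diff_map.bucket_size(b);
     if (s > 1)
         collisions += s - 1;
 }
 std::cout << "real collisions: " << collisions << std::endl;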

+8

Your key space is 24 bits in size. To have zero collisions you would need a hash table as large as your data if the hash is perfect, and 25-50% larger if it is not. My guess is that you have made your hash table much, much smaller than that, so the container remaps your hash values onto far fewer buckets and causes collisions.
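
One way to act on that, sketched under the assumption that memory is not a constraint:

 // Pre-size the table so insertions never trigger a rehash.
 // 1 << 24 buckets covers the whole 24-bit RGB key space, but costs
 // on the order of a hundred MB of bucket pointers; scale it down to
 // the number of distinct colors you actually expect if that is too much.
 pixel_map_t diff_map;
 diff_map.rehash(1 << 24);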

+4

If I understand correctly what you're doing, you may be getting these collisions simply because many pixels in your image have the same color, and you repeatedly call diff_map.insert for that same color (so the quality of your hash doesn't matter). If you are doing this to compute a histogram of the colors, you probably don't want to do "diff_map.insert(std::make_pair(dst_it[x], /* some uint32 */));", but rather something like:

 auto it = diff_map.find(dst_it[x]);
 if (it == diff_map.end())
     diff_map.insert(std::make_pair(dst_it[x], 1));
 else
     ++(it->second);
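
A more compact equivalent, relying on operator[] value-initializing a missing count to zero:

 // Creates dst_it[x] with count 0 if absent, then increments it.
 ++diff_map[dst_it[x]];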

+1

I also wrote a custom hash function (in theory a perfect hash function: 256^2 * R + 256 * G + B), so collisions should be minimal regardless of the number of buckets and the hash-table layout, to a reasonable extent.

This hash function is not good. A good hash function (when you don't know the number of buckets in advance) should generate wildly different hash values for nearly identical inputs. In your case, a very simple way to achieve this is to use three tables of 256 random 32-bit values each (int32_t rand[3][256]), then hash as rand[0][R] ^ rand[1][G] ^ rand[2][B]. This scatters your values randomly across buckets with no clustering tendency for similar values: effectively an ideal hash function when the number of buckets is unknown.
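
A sketch of that table-based hash (the table is named rand_tab here to avoid clashing with std::rand, and std::rand itself is only a placeholder for any decent random source):

 #include <cstdlib>
 #include <cstddef>
 #include <boost/gil/typedefs.hpp>

 // One table of 256 random words per color channel, filled once at startup.
 static std::size_t rand_tab[3][256];

 void init_rand_tab()
 {
     for (int c = 0; c < 3; ++c)
         for (int v = 0; v < 256; ++v)
             rand_tab[c][v] = (static_cast<std::size_t>(std::rand()) << 16)
                            ^ static_cast<std::size_t>(std::rand());
 }

 struct pixel_table_hash
 {
     std::size_t operator()(const boost::gil::rgb8_pixel_t& p) const
     {
         // XOR of three random lookups: nearly identical pixels get
         // completely unrelated hash values.
         return rand_tab[0][p[0]] ^ rand_tab[1][p[1]] ^ rand_tab[2][p[2]];
     }
 };

 // Usage: pass it as the third template argument instead of
 // specializing std::hash:
 // typedef std::unordered_map<boost::gil::rgb8_pixel_t, unsigned int,
 //                            pixel_table_hash> pixel_map_t;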

You can also give the hash functions provided by the library a crack at it; they can't beat the scattering properties of a random table, but they may be faster (fewer memory reads) or slower (more, or more complex, arithmetic operations). Benchmark it if you care.

0

Even if you don't have equal values, the values may be fairly close together. I had a hard time finding good hash functions for time series, or for numbers that aren't well scattered. When unordered_map takes the hash value modulo the number of buckets, most of your values can end up in only a few buckets (if the hash values are not scattered well), and that leads to O(n) lookups.

If the hash values can't be scattered well enough, I would just use std::map (a red-black tree), where lookups are O(log n).
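
A sketch of that fallback; assuming boost::gil::rgb8_pixel_t provides no usable operator<, the map needs an explicit comparator (the lexicographic pixel_less here is a hypothetical helper):

 #include <map>
 #include <boost/gil/typedefs.hpp>

 // Channel-by-channel lexicographic ordering for RGB pixels.
 struct pixel_less
 {
     bool operator()(const boost::gil::rgb8_pixel_t& a,
                     const boost::gil::rgb8_pixel_t& b) const
     {
         if (a[0] != b[0]) return a[0] < b[0];
         if (a[1] != b[1]) return a[1] < b[1];
         return a[2] < b[2];
     }
 };

 typedef std::map<boost::gil::rgb8_pixel_t, unsigned int, pixel_less> pixel_tree_t;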

0

Source: https://habr.com/ru/post/888731/
