Faster way to read/write a std::unordered_map from/to a file

I work with some very large std::unordered_maps (hundreds of millions of records) and need to save and load them to and from a file. The way I'm currently doing this is to iterate over the map and read/write each key and value pair one at a time:

std::unordered_map<unsigned long long int, char> map;

void save(){
  std::unordered_map<unsigned long long int, char>::iterator iter;
  FILE *f = fopen("map", "wb");
  for(iter=map.begin(); iter!=map.end(); iter++){
    fwrite(&(iter->first), 8, 1, f);
    fwrite(&(iter->second), 1, 1, f);
  }
  fclose(f);
}

void load(){
  FILE *f = fopen("map", "rb");
  unsigned long long int key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);
}

But with about 624 million records, reading the map from the file took 9 minutes. Writing to the file was faster, but still took several minutes. Is there a faster way to do this?

+5
5 answers

C++ unordered_map implementations are effectively required to use chaining. There are many really good reasons why you might want this for a general-purpose hash table, as discussed here.

This has critical implications for performance. Most importantly, it means that the entries of the hash table are likely to be scattered throughout memory in a way that makes accessing each one an order of magnitude (or so) less efficient than would be the case if they could somehow be accessed serially.

Fortunately, you can build hash tables that, when almost full, give near-sequential access to adjacent elements. This is done using open addressing.

Since your hash table doesn't need to be general-purpose, you could try this.

Below, I've built a simple hash table container with open addressing and linear probing. It assumes a few things:

  • That your keys are already somehow randomly distributed. This obviates the need for a hash function (though decent hash functions are fairly simple to build, even if great hash functions are difficult).

  • That you only ever add elements to the hash table; you never delete them. If that were not the case, you would need to change the used vector into something that can hold three states: USED, UNUSED, and TOMBSTONE, where TOMBSTONE marks a deleted element and is used either to continue a linear search probe or to halt a linear insertion probe (a minimal sketch of this appears just after this list).

  • That you know the size of your hash table ahead of time, so you never need to resize/rehash it.

  • That you don't need to traverse the elements in any particular order.
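For the second assumption, here is a minimal sketch of what three-state slot markers could look like if deletion were ever needed. This is purely illustrative and not part of the code below, which packs its used flags into bits and never deletes:

#include <cstdint>
#include <vector>

enum class Slot : uint8_t { UNUSED, USED, TOMBSTONE };

//On erase, mark the slot TOMBSTONE rather than UNUSED, so a linear
//search probe keeps walking past it instead of stopping early...
void erase_slot(std::vector<Slot> &state, uint64_t loc){
  state[loc] = Slot::TOMBSTONE;
}

//...while a linear insertion probe may stop at it and reuse the slot.
bool insertable(const std::vector<Slot> &state, uint64_t loc){
  return state[loc] != Slot::USED;
}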

Of course, there are probably all kinds of excellent open-addressing hash table implementations available on the internet that solve many of the above issues. However, the simplicity of my table makes it possible to convey the important point.

The important point is this: my design stores all of the hash table's information in three vectors. That is, the memory is contiguous.

Contiguous memory is fast to allocate, fast to read from, and fast to write to. The effect of this is profound.

Using the same test setup as in my previous answer, I get the following timings:

Save. Save time = 82.9345 ms
Load. Load time = 115.111 ms

That's a 95% reduction in save time (22x faster) and a 98% reduction in load time (62x faster).

Code:

#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <iostream>
#include <limits>
#include <random>
#include <stdexcept>
#include <vector>

const int TEST_TABLE_SIZE = 10000000;

template<class K, class V>
class SimpleHash {
 public:
  int usedslots = 0;

  std::vector<K> keys;
  std::vector<V> vals;
  std::vector<uint8_t> used;

  //size0 should be a prime and about 30% larger than the maximum number needed
  SimpleHash(int size0){
    vals.resize(size0);
    keys.resize(size0);
    used.resize(size0/8+1,0);
  }

  //If the key values are already uniformly distributed, using a hash gains us
  //nothing
  uint64_t hash(const K key){
    return key;
  }

  bool isUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    return used[used_loc]&used_bit;
  }

  void setUsed(const uint64_t loc){
    const auto used_loc = loc/8;
    const auto used_bit = 1<<(loc%8);
    used[used_loc] |= used_bit;
  }

  void insert(const K key, const V val){
    uint64_t loc = hash(key)%keys.size();

    //Use linear probing. Can create infinite loops if table too full.
    while(isUsed(loc)){ loc = (loc+1)%keys.size(); }

    setUsed(loc);
    usedslots++;          //track occupancy so usedSize() is meaningful
    keys[loc] = key;
    vals[loc] = val;
  }

  V& get(const K key) {
    uint64_t loc = hash(key)%keys.size();
    while(true){
      if(!isUsed(loc))
        throw std::runtime_error("Item not present!");
      if(keys[loc]==key)
        return vals[loc];
      loc = (loc+1)%keys.size();
    }
  }

  uint64_t usedSize() const {
    return usedslots;
  }

  uint64_t size() const {
    return keys.size();
  }
};

typedef SimpleHash<uint64_t, char> table_t;

void SaveSimpleHash(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "wb");
  uint64_t size = map.size();
  fwrite(&size, 8, 1, f);
  fwrite(map.keys.data(), 8, size, f);
  fwrite(map.vals.data(), 1, size, f);
  fwrite(map.used.data(), 1, size/8+1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

table_t LoadSimpleHash(){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t size;
  fread(&size, 8, 1, f);
  table_t map(size);

  fread(map.keys.data(), 8, size, f);
  fread(map.vals.data(), 1, size, f);
  fread(map.used.data(), 1, size/8+1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;

  return map;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage

  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  table_t map(1.3*TEST_TABLE_SIZE);
  std::cout<<"Created table of size "<<map.size()<<std::endl;

  std::cout<<"Generating test data..."<<std::endl;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map.insert(key_rand(),(char)val_rand()); //Low chance of collisions, so we get quite close to the desired size

  map.insert(23,42);
  assert(map.get(23)==42);

  SaveSimpleHash(map);
  auto newmap = LoadSimpleHash();

  //Ensure that the load worked
  for(size_t i=0;i<map.keys.size();i++)
    assert(map.keys.at(i)==newmap.keys.at(i));
  for(size_t i=0;i<map.vals.size();i++)
    assert(map.vals.at(i)==newmap.vals.at(i));
  for(size_t i=0;i<map.used.size();i++)
    assert(map.used.at(i)==newmap.used.at(i));
}
+4

(Edit: I've added a new answer which achieves a 95% reduction in wall time.)

I've made a minimal working example that illustrates the problem you are trying to solve. This is something you should always do when asking a question.

Then I got rid of the unsigned long long int stuff and replaced it with uint64_t from the cstdint library. This ensures that we're working with the same data size, since unsigned long long int can mean almost anything depending on which computer/compiler you use.
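To make that point concrete, here is a quick check (mine, not part of the original code): the standard only guarantees that unsigned long long int is at least 64 bits wide, while uint64_t is exactly 64 bits wherever it is provided.

#include <climits>
#include <cstdint>

//Both assertions always hold, but only the first pins down an exact width.
static_assert(sizeof(uint64_t) * CHAR_BIT == 64,
              "uint64_t is exactly 64 bits");
static_assert(sizeof(unsigned long long int) * CHAR_BIT >= 64,
              "unsigned long long is only guaranteed to be at least 64 bits");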

The resulting MWE looks like this:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <functional>
#include <iostream>
#include <limits>
#include <random>
#include <unordered_map>
#include <vector>

typedef std::unordered_map<uint64_t, char> table_t;

const int TEST_TABLE_SIZE = 10000000;

void Save(const table_t &map){
  std::cout<<"Save. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "wb");
  for(auto iter=map.begin(); iter!=map.end(); iter++){
    fwrite(&(iter->first), 8, 1, f);
    fwrite(&(iter->second), 1, 1, f);
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void SaveLookup(const table_t &map){
  std::cout<<"SaveLookup. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::deque<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
  for(const auto &val: row)
    fwrite(&val, 8, 1, f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to
//save time
void SaveLookupVector(const table_t &map){
  std::cout<<"SaveLookupVector. ";
  const auto start = std::chrono::steady_clock::now();

  //Create a lookup table
  std::vector< std::vector<uint64_t> > lookup(256);
  for(auto &kv: map)
    lookup.at(kv.second+128).emplace_back(kv.first);

  //Save lookup table header
  FILE *f = fopen("/z/map", "wb");
  for(const auto &row: lookup){
    const uint32_t rowsize = row.size();
    fwrite(&rowsize, 4, 1, f);
  }

  //Save values
  for(const auto &row: lookup)
    fwrite(row.data(), 8, row.size(), f);
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Save time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

void Load(table_t &map){
  std::cout<<"Load. ";
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

void Load2(table_t &map){
  std::cout<<"Load with Reserve. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");
  uint64_t key;
  char val;
  while(fread(&key, 8, 1, f)){
    fread(&val, 1, 1, f);
    map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values to save time
void LoadLookup(table_t &map){
  std::cout<<"LoadLookup. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  uint64_t key;
  for(int i=0;i<256;i++){
    const char val = i-128;
    for(uint32_t v=0;v<inpsizes.at(i);v++){
      fread(&key, 8, 1, f);
      map[key] = val;
    }
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

//Take advantage of the limited range of values and contiguous memory to save time
void LoadLookupVector(table_t &map){
  std::cout<<"LoadLookupVector. ";
  map.reserve(TEST_TABLE_SIZE+TEST_TABLE_SIZE/8);
  const auto start = std::chrono::steady_clock::now();

  FILE *f = fopen("/z/map", "rb");

  //Read the header
  std::vector<uint32_t> inpsizes(256);
  for(int i=0;i<256;i++)
    fread(&inpsizes[i], 4, 1, f);

  for(int i=0;i<256;i++){
    const char val = i-128;
    std::vector<uint64_t> keys(inpsizes[i]);
    fread(keys.data(), 8, inpsizes[i], f);
    for(const auto &key: keys)
      map[key] = val;
  }
  fclose(f);

  const auto end = std::chrono::steady_clock::now();
  std::cout<<"Load time = "<< std::chrono::duration<double, std::milli> (end-start).count() << " ms" << std::endl;
}

int main(){
  //Perfectly horrendous way of seeding a PRNG, but we'll do it here for brevity
  auto generator = std::mt19937(12345); //Combination of my luggage

  //Generate values within the specified closed intervals
  auto key_rand = std::bind(std::uniform_int_distribution<uint64_t>(0,std::numeric_limits<uint64_t>::max()), generator);
  auto val_rand = std::bind(std::uniform_int_distribution<int>(std::numeric_limits<char>::lowest(),std::numeric_limits<char>::max()), generator);

  std::cout<<"Generating test data..."<<std::endl;

  //Generate a test table
  table_t map;
  for(int i=0;i<TEST_TABLE_SIZE;i++)
    map[key_rand()] = (char)val_rand(); //Low chance of collisions, so we get quite close to the desired size

  Save(map);
  { table_t map2; Load(map2); }
  { table_t map2; Load2(map2); }
  SaveLookup(map);
  SaveLookupVector(map);
  { table_t map2; LoadLookup(map2); }
  { table_t map2; LoadLookupVector(map2); }
}

On the test dataset I use, this gives me a write time of 1982 ms and a read time (with the original code) of 7467 ms. Read time seemed to be the biggest bottleneck, so I created a new Load2 function that reserves enough space in the unordered_map before reading. This reduced the read time to 4700 ms (a 37% saving).

Edit 1

Now, I note that the values of your unordered_map, being of type char, can take on only 256 distinct values. This means I can easily convert the unordered_map into a kind of lookup table in RAM. That is, instead of having:

123123 1
234234 0
345345 1
237872 1

I can change the data to look like this:

0 234234
1 123123 345345 237872

What's the advantage of this? It means I no longer have to write the values to disk, saving 1 byte per table entry. Since each entry consists of 8 bytes of key and 1 byte of value, dropping the value saves 1 byte in 9, which should cut both read and write time by about 11%, minus the cost of rearranging the data in memory (which I expect to be low, since it all happens in RAM).

Finally, once I've done the above regrouping, if the machine has a lot of spare memory I can pack everything into a vector and read/write that contiguous data to disk.

Doing all of this gives the following timings:

Save. Save time = 1836.52 ms
Load. Load time = 7114.93 ms
Load with Reserve. Load time = 4277.58 ms
SaveLookup. Save time = 1688.73 ms
SaveLookupVector. Save time = 1394.95 ms
LoadLookup. Load time = 3927.3 ms
LoadLookupVector. Load time = 3739.37 ms

Note that moving from Save to SaveLookup gives an 8% speedup, and moving from Load with Reserve to LoadLookup also gives an 8% speedup. This is right in line with our theory!

Using contiguous memory on top of that gives a total speedup of 24% over the original save time and 47% over the original load time.

+2

Unless you need the map itself in order to write the values out in order, I would suggest loading the values into a container just once - std::deque may be a good choice here, since the amount of data is large - calling std::sort once, and then iterating over the std::deque to write the values. You would gain cache performance, and the runtime complexity of std::sort is N*log(N), which is better than rebalancing your map ~624 million times or paying for the cache misses of an unordered map.
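A minimal sketch of what that could look like, assuming sizeof(unsigned long long int) is 8 as in the question's own code (the function name is mine):

#include <algorithm>
#include <cstdio>
#include <deque>
#include <unordered_map>
#include <utility>

void save_sorted(const std::unordered_map<unsigned long long int, char> &map){
  //Copy the pairs out of the map exactly once...
  std::deque<std::pair<unsigned long long int, char>> entries(map.begin(), map.end());
  //...sort them by key in a single N*log(N) pass...
  std::sort(entries.begin(), entries.end());
  //...then stream the now-ordered entries to disk sequentially.
  FILE *f = fopen("map", "wb");
  for(const auto &kv: entries){
    fwrite(&kv.first, 8, 1, f);
    fwrite(&kv.second, 1, 1, f);
  }
  fclose(f);
}

The file format is unchanged; the entries are simply written in key order.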

0

Perhaps a prefix-ordered traversal during save would help reduce the amount of internal reordering at load time?

Of course, you don't have visibility into the internal structure of the STL map containers, so the best you could do would be to simulate this by binary-sectioning the iterator as if it were linear. Given that you know the total number of nodes N, save node N/2, then N/4, then N*3/4, and so on.

This can be done algorithmically by visiting every odd multiple of N/2^p in each pass p: N/2, N*1/4, N*3/4, N*1/8, N*3/8, N*5/8, N*7/8, etc. You do need to make sure the series keeps step sizes such that N*4/8 = N/2 (without resorting to step sizes of 2^-p), and that the last pass visits every remaining node. You may find it worthwhile to pre-compute the highest pass number P (~log2(N)) and the float value S = N/2^P such that 0.5 < S <= 1.0, then scale back up for each p.
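Here is one way the visit order could be generated; this is my reading of the scheme above, using integer rounding plus a seen-set rather than the float scaling suggested:

#include <cstdint>
#include <vector>

//Produces the order N/2, N/4, 3N/4, N/8, 3N/8, 5N/8, 7N/8, ... followed by
//a final pass that picks up anything rounding skipped. Fine for N well
//below 2^63.
std::vector<uint64_t> median_first_order(const uint64_t N){
  std::vector<uint64_t> order;
  std::vector<bool> seen(N, false);
  for(uint64_t denom = 2; denom <= N; denom *= 2){
    for(uint64_t k = 1; k < denom; k += 2){  //odd multiples only
      const uint64_t idx = (uint64_t)((double)N * k / denom);
      if(idx < N && !seen[idx]){
        seen[idx] = true;
        order.push_back(idx);
      }
    }
  }
  for(uint64_t i = 0; i < N; i++)            //last pass: all the rest
    if(!seen[i])
      order.push_back(i);
  return order;
}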

But as others have said, profile first to confirm that this is actually your problem, and profile again to see whether this approach helps.

0

Since your data appears to be static, and given the number of elements, I would certainly consider using a custom structure in a binary file and then memory-mapping that file.

Opening would be instantaneous (just mmap the file).

If you write the values in sorted order, you can use a binary search on the mapped data.
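For instance, assuming POSIX mmap and the question's packed 9-byte record format (8-byte key, 1-byte value) sorted by key, a lookup could be sketched like this (my code, not a definitive implementation):

#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

bool lookup(const char *path, const uint64_t key, char &val_out){
  const int fd = open(path, O_RDONLY);
  if(fd < 0) return false;
  struct stat st;
  fstat(fd, &st);
  void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if(p == MAP_FAILED){ close(fd); return false; }
  const char *data = static_cast<const char*>(p);

  const uint64_t n = st.st_size / 9;  //9 bytes per record
  uint64_t lo = 0, hi = n;            //binary search over the records
  bool found = false;
  while(lo < hi){
    const uint64_t mid = lo + (hi - lo) / 2;
    uint64_t k;
    memcpy(&k, data + 9*mid, 8);      //unaligned-safe read of the key
    if(k == key){ val_out = data[9*mid + 8]; found = true; break; }
    if(k < key) lo = mid + 1; else hi = mid;
  }

  munmap(p, st.st_size);
  close(fd);
  return found;
}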

If that isn't good enough, you could partition your data into buckets and store a list of offsets at the start of the file - or perhaps even use a hash code.

If your keys are unique and somewhat contiguous, you could even get a smaller file by storing only the char values at file position [key] (using a special value for null entries). Of course that won't work for the full uint64 range, but depending on the data, the keys could be grouped into buckets that carry an offset.

Used with mmap, this approach would also consume much less memory.
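As a hypothetical sketch of that direct-indexed variant: once the file is mmap'ed, byte [key] of the mapping is the value itself, with a reserved sentinel (say 0) meaning "no entry".

#include <cstdint>

//No hashing and no search: the key is the file offset. Assumes the keys
//fit within the file's size and that 0 is reserved for "no entry".
char get_direct(const char *mapped, const uint64_t key){
  return mapped[key];
}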


For faster access, you could create your own hash map on disk (still with “instant loading”).

For example, say you have 1 million hashes (in your case there would be many more): you could write 1 million uint64 file positions at the beginning of the file (the hash value being the position of the uint64 that holds the file position). Each of those locations would point to a block with one or more key/value pairs, and each of those blocks would begin with a count.

If the blocks are aligned on 2 or 4 bytes, you could use a uint32 file position instead (multiplying the pos by 2 or 4).

Since the data is static, you don't need to worry about insertions or deletions, which makes this quite easy to implement.

This has the advantage that you can still mmap the whole file, and all key/value pairs with the same hash sit close together, which tends to bring them into the L1 cache (as compared to linked lists).
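A sketch of how a lookup over the mmap'ed base pointer could work, assuming a concrete layout I've guessed at (a table of num_buckets uint64 offsets, then per-bucket blocks that each start with a uint32 count followed by packed 9-byte key/value records, with key % num_buckets as the hash):

#include <cstdint>
#include <cstring>
#include <stdexcept>

char get(const uint8_t *base, const uint64_t num_buckets, const uint64_t key){
  uint64_t off;
  memcpy(&off, base + 8*(key % num_buckets), 8);  //bucket table entry
  uint32_t count;
  memcpy(&count, base + off, 4);                  //each block starts with a count
  const uint8_t *rec = base + off + 4;
  for(uint32_t i = 0; i < count; i++, rec += 9){  //9-byte key/value records
    uint64_t k;
    memcpy(&k, rec, 8);
    if(k == key)
      return (char)rec[8];
  }
  throw std::runtime_error("key not found");
}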

0

Source: https://habr.com/ru/post/1274903/

