Count the appearance of each item in a large data stream

Question


I have an N-particle simulation that runs over T timesteps. At each timestep, each particle calculates some data about itself and the other nearby particles (within a radius), which is bit-packed into a C string 4-22 bytes long (depending on the number of nearby particles). I call this a state string.

I need to count how many times each state string occurs to form a histogram. I tried using Google's Sparse Hash Map, but the memory overhead is crazy.

I ran several reduced tests (attached) over 100,000 timesteps for 500 particles. This results in a little over 18.2 million unique state strings out of 50 million possible ones, which is consistent with the actual work that needs to be done.

It ends up using 323 MB for the char * and int of each unique entry plus the actual state string itself. However, Task Manager reports 870 MB in use. That is 547 MB of overhead, or ~251.87 bits/entry, far above the roughly 4-5 bits that Google advertises.

So I figure I must be doing something wrong. But then I found this site, which showed similar results; however, I'm not sure whether its charts show only the size of the hash table or also include the size of the actual data. In addition, its code does not free any strings whose key already exists in the hash map when they are inserted (which means that if the charts do include the size of the actual data, they overstate it).

Here is some code showing the problem, along with its output:

#include <google/sparse_hash_map>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>

//String equality
struct eqstrc
{
    bool operator()(const char* s1, const char* s2) const
    {
        return (s1 == s2) || (s1 && s2 && !strcmp(s1,s2));
    }   
};

//Hashing function
template <class T>
class fnv1Hash
{
public:
    size_t operator()(const T& c) const {
            unsigned int hash = 2166136261;
            const unsigned char *key = (const unsigned char*)(c);
            size_t L = strlen((const char*)c);
            size_t i = 0;
            for(const unsigned char *s = key; i < L; ++s, ++i)
                hash = (16777619 * hash) ^ (*s);
            return (size_t)hash;
    }
};

//Function to form new string
char * new_string_from_integer(int num)
{
    int ndigits = num == 0 ? 1 : (int)log10((float)num) + 1;
    char * str = (char *)malloc(ndigits + 1);
    sprintf(str, "%d", num);
    return str;
}

typedef google::sparse_hash_map<const char*, int, fnv1Hash<const char*>, eqstrc> HashCharMap;


int main()
{
    HashCharMap hashMapChar;
    int N = 500;
    int T = 100000;

    //Fill hash table with strings
    for(int k = 0; k < T; ++k)
    {
        for(int i = 0; i < N; ++i)
        {
            char * newString = new_string_from_integer(i*k);
            std::pair<HashCharMap::iterator, bool> res =  hashMapChar.insert(HashCharMap::value_type(newString, HashCharMap::data_type()));
            (res.first)->second++;

            if(res.second == false) //If the string already in hash map, don't need this memory
                free(newString);
        }
    }

    //Count memory used by key 
    size_t dataCount = 0;
    for(HashCharMap::iterator hashCharItr = hashMapChar.begin(); hashCharItr != hashMapChar.end(); ++hashCharItr)
    {
        dataCount += sizeof(char*) + sizeof(unsigned int); //Size of data to store entries
        dataCount += (((strlen(hashCharItr->first) + 1) + 3) & ~0x03); //Size of entries, padded to 4 byte boundaries
    }
    printf("Hash Map Size: %lu\n", (unsigned long)hashMapChar.size());
    printf("Bytes written: %lu\n", (unsigned long)dataCount);

    system("pause");
}

Output

Hash Map Size: 18218975
Bytes written: 339018772
Peak Working Set (Reported by TaskManager): 891,228 K
Overhead: 560,155 K, or 251.87 bits/entry
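
The bits-per-entry figure can be reproduced from the three reported numbers. This is a minimal sketch, assuming that Task Manager's "K" means 1024 bytes; the function name `overhead_bits_per_entry` is illustrative and not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Per-entry overhead in bits: memory held by the process beyond the payload
// that was accounted for, divided by the number of distinct entries.
// Assumption: the working set is given in units of 1024 bytes.
double overhead_bits_per_entry(std::uint64_t peak_working_set_kib,
                               std::uint64_t payload_bytes,
                               std::uint64_t entries)
{
    assert(entries > 0);
    const std::uint64_t peak_bytes = peak_working_set_kib * 1024;
    assert(peak_bytes >= payload_bytes);
    const std::uint64_t overhead_bytes = peak_bytes - payload_bytes;
    return 8.0 * static_cast<double>(overhead_bytes)
               / static_cast<double>(entries);
}
```

Calling `overhead_bits_per_entry(891228, 339018772, 18218975)` gives roughly 251.87, matching the figure above.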

I tried both Google Sparse Hash Map v1.10 and v2.0.2.

So, am I doing something wrong in how I use the hash map? Or is there a better way to approach this? With these strings I would be almost as well off storing the list of strings, sorting it, and then counting consecutive entries.
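
For reference, the sort-then-count alternative could look roughly like the sketch below. It assumes all state strings fit in memory at once, and `count_by_sorting` is a hypothetical helper name. It uses `std::string` rather than `char *`, which stores an explicit length instead of relying on a terminating zero byte.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Build a histogram by sorting the strings and counting runs of equal
// neighbors. The input is taken by value so the caller's data is untouched.
std::vector<std::pair<std::string, std::size_t>>
count_by_sorting(std::vector<std::string> states)
{
    std::sort(states.begin(), states.end());

    std::vector<std::pair<std::string, std::size_t>> histogram;
    for (std::size_t i = 0; i < states.size(); ) {
        std::size_t j = i;
        // Advance j to the end of the run of strings equal to states[i].
        while (j < states.size() && states[j] == states[i])
            ++j;
        histogram.emplace_back(states[i], j - i);
        i = j;
    }
    return histogram;
}
```

The result is ordered by key, and the only per-entry bookkeeping is whatever `std::string` and the vector themselves need.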

Thanks for any help

Edit

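Each component of the string is 2 bytes, split into a 12-bit part and a 4-bit part.

First word (the particle itself): [id (12 bits) | 4-bit field]
Second word: [count N of the words that follow (12 bits) | 4-bit field]
For each of the N neighbors: [id (12 bits) | 4-bit field]

Every word is 2 bytes (16 bits), so a string is at least 4 bytes long.

For example, these three words:

0x120A 0x001B 0x136F = Particle 288 (0x120) with 4-bit value 10 (0xA). The second word holds the 4-bit value 11 (0xB) and N = 1 (0x001). The one neighbor is Particle 310 (0x136) with 4-bit value 15 (0xF).

With 0 - 9 neighbors this gives the 4-22 bytes mentioned above (in the extreme, with 500 neighbor words, a string would be 1004 bytes).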
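
The example 0x120A 0x001B 0x136F can be decoded mechanically. The sketch below splits a 16-bit word into its upper 12 bits and lower 4 bits. The helper names are hypothetical, and the length formula assumes two header words followed by one word per neighbor, which is an inference from the 4-22 byte range and the 1004 figure rather than something stated outright.

```cpp
#include <cstddef>
#include <cstdint>

// Upper 12 bits of a word (the id or count part).
inline unsigned upper12(std::uint16_t word) { return word >> 4; }

// Lower 4 bits of a word.
inline unsigned lower4(std::uint16_t word) { return word & 0xFu; }

// Inverse operation: combine a 12-bit and a 4-bit value into one word.
// Values that do not fit are masked down to their field width.
inline std::uint16_t pack_word(unsigned value12, unsigned value4)
{
    return static_cast<std::uint16_t>(((value12 & 0xFFFu) << 4)
                                      | (value4 & 0xFu));
}

// Assumed layout: 2 header words plus one word per neighbor, 2 bytes each.
inline std::size_t state_string_bytes(std::size_t neighbors)
{
    return 2 * (2 + neighbors);
}
```

For instance, `upper12(0x120A)` is 288 and `lower4(0x120A)` is 10, while `state_string_bytes(0)` and `state_string_bytes(9)` give 4 and 22.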

: , 12 , - 0x0000s. .


c++ hash

user3734029 · Jun 12 '14 at 13:42


laune · Accepted Answer · 2014-06-12T16:54:05+0000

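My results are for gcc on Linux. Allocating chunks of 4-22 bytes takes 16 bytes for lengths 1 to 12, 24 bytes for lengths 13 to 20, and 32 bytes for the rest.

So your test with the 18218975 distinct strings ("0" .. "50000000") allocates 291503600 bytes on the heap, while the sum of their lengths (plus the trailing 0) is 156681483.

Thus, 135 MB is overhead from malloc alone.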
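
The numbers here (16 bytes for requests of 1 to 12 bytes, 24 for 13 to 20, 32 beyond that) can be turned into a small model to check the totals. This is only a sketch of those measured bucket sizes on one gcc/Linux setup, not a description of malloc in general, and `modeled_chunk_bytes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Heap bytes consumed per allocation, following the bucket sizes reported
// for this particular gcc/Linux setup. Only requests of 1-22 bytes are
// modeled, since that is the range the measurements cover.
std::size_t modeled_chunk_bytes(std::size_t request)
{
    assert(request >= 1 && request <= 22);
    if (request <= 12) return 16;
    if (request <= 20) return 24;
    return 32;
}
```

Every key in the test is at most 8 digits plus the trailing 0, so each one falls in the 16-byte bucket: 18218975 * 16 = 291503600 bytes, and 291503600 - 156681483 = 134822117 bytes, which is the roughly 135 MB of overhead.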

( ?)
