Efficient way to redefine a collection based on C ++

I have a large collection (ish β†’ 100K) that maps the user ID (int) to the number of different products they bought (also int.) I need to reorganize the data as efficiently as I can to find out how many users have a different number of products. So, for example, how many users has 1 product, how many users have two products, etc.

I realized this by changing the initial data from std::map to std::multimap (where the key and value just change to the opposite.) Then I can select the number of users having N products using count(N) (although I also uniquely saved values ​​in the set, so that I can be sure of the exact number of values ​​that I repeated and their order)

The code is as follows:

 // uc is a std::map<int, int> containing the original // mapping of user identifier to the count of different // products that they've bought. std::set<int> uniqueCounts; std::multimap<int, int> cu; // This maps count to user. for ( map<int, int>::const_iterator it = uc.begin(); it != uc.end(); ++it ) { cu.insert( std::pair<int, int>( it->second, it->first ) ); uniqueCounts.insert( it->second ); } // Now write this out for ( std::set<int>::const_iterator it = uniqueCounts.begin(); it != uniqueCounts.end(); ++it ) { std::cout << "==> There are " << cu.count( *it ) << " users that have bought " << *it << " products(s)" << std::endl; } 

I just can't help but feel that this is not the most effective way to do this. Does anyone know a smart way to do this?

I am limited to this ; I cannot use Boost or C ++ 11 for this .

Oh, also, if someone asks a question, this is not homework, nor an interview question.

+6
source share
4 answers

Assuming that you know the maximum number of products that one user could buy, you can see better performance by simply using a vector to store the results of the operation. Since you will need a selection for almost every entry on the source map, which is most likely not the fastest option.

It would also reduce the overhead on the map, benefit from the location of the memory, and replace the call with a multi-map count (which is not a constant time operation) with a constant search for the time vector.

So you can do something like this:

 std::vector< int > uniqueCounts( MAX_PRODUCTS_PER_USER ); for ( map<int, int>::const_iterator it = uc.begin(); it != uc.end(); ++it ) { uniqueCounts[ uc.second ]++; } // Now write this out for ( int i = 0, std::vector< int >::const_iterator it = uniqueCounts.begin(); it != uniqueCounts.end(); ++it, ++i ) { std::cout << "==> There are " << *it << " users that have bought " << i << " products(s)" << std::endl; } 

Even if you do not know the maximum number of products, it seems that you can just guess the maximum and adapt this code to increase the size of the vector, if necessary. In any case, this will result in fewer resources than your original example.

All this assumes that you really do not need user IDs after you have processed this data, of course (and, as indicated in the comments below, the number of products purchased for each user is relatively small and a mixed set. Otherwise, you it might be better to use a map instead of a vector - you still cannot call the multimap :: count function, but you can potentially lose some other advantages)

+4
source

It depends on what you mean by "more effective." First, is it really a bottle neck? Of course, there are a lot of 100 thousand records, but if you only need this every few minutes, it is normal if the algorithm takes a couple of seconds.

The only area for improvement that I see is memory usage. If this is a concern, you can skip the multimap generation and just save the counter map, something like this (beware, my C ++ is a bit rusty):

 std::map<int, int> countFrequency; // count => how many customers with that count for ( std::map<int, int>::const_iterator it = uc.begin(); it != uc.end(); ++it ) { // If it->second is not yet in countFrequency, // the default constructor initializes it to 0. countFrequency[it->second] += 1; } // Now write this out for ( std::map<int, int>::const_iterator it = countFrequency.begin(); it != countFrequency.end(); ++it ) { std::cout << "==> There are " << it->second << " users that have bought " << it->first << " products(s)" << std::endl; } 

If a user is added and buys count elements, you can update countFrequency with

 countFrequency[count] += 1; 

If an existing user moves from oldCount to newCount , you can update countFrequency with

 countFrequency[oldCount] -= 1; countFrequency[newCount] += 1; 

Now, as an aside, I recommend using unsigned int for counting (if there is no legitimate reason for negative counting) and typedef'ing of type userID for extra readability.

+2
source

If you can, I would recommend storing both parts of the data permanently. In other words, I would save a second card that reflects the number of products bought for the number of customers who bought this many products. This map contains the exact answer to your question, if you support it. Each time a customer buys a product, let n be the number of products that this customer has purchased now. Subtract one from the value in key n-1. Add one to the value in key n. If the key range is small enough, it could be an array instead of a map. Have you ever expected one customer to buy hundreds of products?

+1
source

Only for larks here is a mixed approach that uses vector if the data is small and map to cover the case when one user bought a really absurd amount of products. I doubt that you really need the latter in the app for the store, but a more general version of the problem can benefit from this.

 typedef std::map<int, int> Map; typedef Map::const_iterator It; template <typename Container> void get_counts(const Map &source, Container &dest) { for (It it = source.begin(); it != source.end(); ++it) { ++dest[it->second]; } } template <typename Container> void print_counts(Container &people, int max_count) { for (int i = 0; i <= max_count; ++i) { if contains(people, i) { std::cout << "==> There are " << people[i] << " users that have bought " << i << " products(s)" << std::endl; } } } // As an alternative to this overloaded contains(), you could write // an overloaded print_counts -- after all the one above is not an // efficient way to iterate a sparsely-populated map. // Or you might prefer a template function that visits // each entry in the container, calling a specified functor to // will print the output, and passing it the key and value. // This is just the smallest point of customization I thought of. bool contains(const Map &c, int key) { return c.count(key); } bool contains(const std::vector<int, int> &c, int key) { // also check 0 < key < c.size() for a more general-purpose function return c[key]; } void do_everything(const Map &uc) { // first get the max product count int max_count = 0; for (It it = uc.begin(); it != uc.end(); ++it) { max_count = max(max_count, it->second); } if (max_count > uc.size()) { // or some other threshold Map counts; get_counts(uc, counts); print_counts(counts, max_count); } else { std::vector<int> counts(max_count+1); get_counts(uc, counts); print_counts(counts, max_count); } } 

Here you can reorganize, create a template for the CountReOrderer class, which takes a template parameter that tells it whether to use vector or map for counting.

+1
source

Source: https://habr.com/ru/post/917440/


All Articles