Is it bad practice to maintain two copies of the same data if they are in different data structures?

Question

Let's say I have a universal set of indexed objects, U, and a subset of these objects, S. S is large (e.g. 1,000,000 elements), but U is MUCH larger (say 100,000,000 at least).

I would like to perform two basic operations on these sets:

(1) For any integer x from 0 to (size of U) - 1, check for membership in S and, if x is not a member, add x to S; and

(2) Select (and delete) a random item from S.

To perform the first part of operation (1), it makes sense for me to keep a Boolean vector v of size U, where v[x] is true if the element x is a member of the set S.

However, since U is so much larger than S, choosing a random element in v and hoping that it is also an element of S does not make sense. That is, if U is 100 times larger than S, it will only find an element of S, on average, once every 100 attempts.
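To make that estimate concrete: assuming each attempt picks an index of v uniformly at random and independently of the others, the number of attempts until a hit follows a geometric distribution. With the example sizes from above, this works out as:

```latex
p = \frac{|S|}{|U|} = \frac{10^{6}}{10^{8}} = \frac{1}{100},
\qquad
\mathbb{E}[\text{attempts}] = \frac{1}{p} = 100 .
```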

So, to perform the second operation, it makes sense to maintain a list of the indices of the elements that are in S, and to choose a random element from that list.

With both structures in place, the first operation looks like this:

** operation 1 - check membership and add **
input: boolean vector, v
       integer vector, S
       integer, x

if v[x] is not true:
    v[x] = true
    append x to S
return

And the second operation looks like this:

** operation 2 - select and remove random element of S **
input: boolean vector, v
       integer vector, S

generate random integer x between 0 and (size of S) - 1
set v[S[x]] to false
remove S[x] from S
return
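The two operations above can be sketched in C++ roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `IndexedSubset` and its member names are hypothetical, and the removal step uses swap-and-pop (overwrite the chosen slot with the last element, then shrink the vector), which is a substitution for the plain "remove S[x] from S" in the pseudocode. It assumes the order of elements in S does not matter. Keeping both containers private means every update goes through one place, so they cannot drift out of sync.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical wrapper: owns both structures and updates them together.
class IndexedSubset {
public:
    explicit IndexedSubset(std::size_t universe_size)
        : member_(universe_size, false) {}

    // Operation 1: add x if it is not already a member.
    // Returns true if x was added, false if it was already present.
    bool add_if_absent(std::size_t x) {
        assert(x < member_.size());
        if (member_[x]) return false;
        member_[x] = true;
        items_.push_back(x);
        return true;
    }

    // Operation 2: remove and return a uniformly random member.
    // Precondition: the subset is not empty.
    template <class Rng>
    std::size_t pop_random(Rng& rng) {
        assert(!items_.empty());
        std::uniform_int_distribution<std::size_t> pick(0, items_.size() - 1);
        const std::size_t pos = pick(rng);
        const std::size_t x = items_[pos];
        items_[pos] = items_.back();  // swap-and-pop: order is not preserved
        items_.pop_back();
        member_[x] = false;
        return x;
    }

    bool contains(std::size_t x) const {
        return x < member_.size() && member_[x];
    }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<bool> member_;        // v in the question: one flag per element of U
    std::vector<std::size_t> items_;  // S in the question: indices currently in the subset
};
```

With this layout both operations take constant time (amortized for the append), at the cost of one bit per element of U plus one index per element of S.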

, , . ?

Alternatively, with only the index vector S:

** operation 1 - check membership and add**
input: integer vector, S
       integer, x

iterate over S
if x in S:
    return
else:
    append x to S
    return

and with only the Boolean vector v, without S:

** operation 2 - select and remove random element of S **
input: boolean vector, v

while true:
    generate random integer x between 0 and (size of U) - 1
    if v[x] true:
        v[x] = false
        return

, U S , U S . , ? ?

EDIT:

, , ++, , , ++, , .

+4

c++ performance data-structures

guskenny83 Nov 7 '17 at 8:51

3

() . :

, . .

, .

, , .

. . . , . , , S.

, , . , 3 .

+2

bolov Nov 7 '17 at 9:11

, " " ( , - ). - :

std::set<int>, , , (, , ) ( ). , , , , , (, 15 2 ) -, . , 64- 64 , .

3-4 (log (N)/log (64) , N * log (N)/log (64) -case) . , , , ( 1-2 , , ). , , . ( ). :

// Indicates that [first, last) are in the resulting set.
typedef void SetResults(int first, int last, void* user_data);

... , C. , , , - , , node 0s, , N * Log (N)/Log (64) , . - std::set<int> , , . , std::set. , . , , .

, - , . ( , ). 0, , , . , .

, , 1, , , - , , . 64 (64 ) , SIMD (: 512 ). , , , , 1024 , , , 64 + .

# 2 U, S bitwise and ( : S , ). . FFS/FFZ, .

:

... , , 3 ( , , 1024 + ). S , 0 , . 1 , .

, 1 U, , null. , bitwise and, S, U. . , O (N/64 +) (, ), O (N), (64 64- , SIMD).

, , S, .

. , , . , , , (SoA-), , . . , .

However, in this case, if you use one of the two data structures proposed above, you do not need to maintain a separate random-access sequence of indices in order to efficiently find a random element that exists in both sets, since you can find set intersections as quickly as you need.
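A minimal sketch of the bitwise-AND idea this answer refers to, assuming a flat bitset of 64-bit words rather than any more elaborate structure. The names `Bitset`, `set_bit`, `first_set_bit`, and `intersection` are illustrative and do not come from the answer. The find-first-set helper is a portable loop; compilers typically offer faster intrinsics for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat bitset: bit (i % 64) of word (i / 64) says whether
// index i is in the set.
using Bitset = std::vector<std::uint64_t>;

inline void set_bit(Bitset& b, std::size_t i) {
    b[i / 64] |= std::uint64_t{1} << (i % 64);
}

// Portable find-first-set. Precondition: w is non-zero.
inline unsigned first_set_bit(std::uint64_t w) {
    unsigned n = 0;
    while ((w & 1u) == 0) {
        w >>= 1;
        ++n;
    }
    return n;
}

// Collect every index present in both sets, testing 64 indices per AND.
// Assumes a and b have the same number of words.
std::vector<std::size_t> intersection(const Bitset& a, const Bitset& b) {
    assert(a.size() == b.size());
    std::vector<std::size_t> out;
    for (std::size_t w = 0; w < a.size(); ++w) {
        std::uint64_t both = a[w] & b[w];
        while (both != 0) {
            out.push_back(w * 64 + first_set_bit(both));
            both &= both - 1;  // clear the lowest set bit
        }
    }
    return out;
}
```

Words whose AND is zero are skipped in a single comparison, which is where the saving over a per-element scan comes from when the sets are sparse.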

+1

Team upvote Jan 4 '18 at 9:14


sp2danny · Accepted Answer · 2017-11-07T09:58:39+0000

One option is to keep the subset in a std::map:

#include <random>
#include <iterator>
#include <map>
#include <vector>

struct Data {};                           // Your actual Object here

constexpr auto universal_size = 100'000;  // I shrank it a little for
constexpr auto subset_size = 1'000;       // the example

std::vector<Data> U(universal_size);      // This is the indexed data store

std::map<int, Data*> S;                   // This is the subset

void add_if_not_in(int idx)               // idx is universal index.
{                                         // This is one of the
    S[idx] = &U[idx];                     // functionalities you
}                                         // requested.

void remove_by_universal_index(int idx)   // Not strictly needed.
{                                         // Removes object from 
    S.erase(idx);                         // subset, by universal
}                                         // index.

void remove_by_subset_index(int idx)      // Removes object from
{                                         // subset, by subset
    auto iter = S.begin();                // index. Used by 
    std::advance(iter, idx);              // remove_random()
    S.erase(iter);
}

std::mt19937 gen{};                       // A random generator

void remove_random()                      // The second functionality
{                                         // you requested.
    if (S.empty()) return;                // Nothing to remove.
    auto sz = static_cast<int>(S.size()); // Removes one random element
    std::uniform_int_distribution<>       // from the subset.
        dis(0, sz-1);
    auto num = dis(gen);

    remove_by_subset_index(num);          // Linear in num (std::advance)
}

void add_random()                         // Used to initialize subset.
{                                         // Adds one random element of
    auto sz = U.size();                   // universal set to subset.
    std::uniform_int_distribution<>
        dis(0, sz-1);
    auto idx = dis(gen);

    add_if_not_in(idx);    
}

void setup()                              // Initialize subset.
{                                         // Just add random until
    while (S.size() < subset_size)        // size is specified.
        add_random();
}

int main()                                // Try it 
{
    setup();
    add_random();
    remove_random();
}

http://coliru.stacked-crooked.com/a/8236da0ccaf05079
