The fastest way to remove duplicates from a vector<>

As the title says, I have a few ways to do this in mind, but I don't know which is the fastest.

So, let's say that we have vector<int> vals with some values.

1

After filling vals:

    sort(vals.begin(), vals.end());
    auto last = unique(vals.begin(), vals.end());
    vals.erase(last, vals.end());

2

Convert to a set after filling vals:

    set<int> s( vals.begin(), vals.end() );
    vals.assign( s.begin(), s.end() );

3

When I add values, I check whether each one is already in my vector:

    if( find(vals.begin(), vals.end(), myVal) == vals.end() )
        vals.push_back(myVal); // add myVal only if it is not already there

4

Use a set from the beginning (see the sketch below).
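
Just a sketch of what I have in mind for 4 (the sample values are made up; in my real code they come from elsewhere):

    #include <set>
    #include <vector>

    int main()
    {
        int input[] = { 4, 3, 5, 2, 4, 3 };   // placeholder for my real source of values

        std::set<int> s;
        for (int v : input)
            s.insert(v);                      // the set silently drops duplicates

        // if I still need a vector afterwards:
        std::vector<int> vals(s.begin(), s.end());   // 2, 3, 4, 5 (unique and sorted)
    }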

OK, I have these 4 methods; my questions are:

1. Of 1, 2 and 3, which is the fastest?
2. Is 4 faster than the first 3?
3. With 2, after converting the vector to a set, is it more convenient to use the set to do what I need to do, or should I do vals.assign( .. ) and continue with my vector?

+5
4 answers

Question 1: both 1 and 2 are O(n log n), while 3 is O(n^2). Between 1 and 2, it depends on the data.

Question 2: 4 is also O(n log n) and may be better than 1 and 2 if you have many duplicates, because it only ever stores one copy of each value. Imagine a million values that are all equal.

Question 3: Well, it really depends on what you need to do.

The only thing that can be said without knowing more is that your alternative number 3 is asymptotically worse than the others.

If you use C++11 and don't need ordering, you can use std::unordered_set, which is a hash table and can be significantly faster than std::set.
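
For instance, a minimal sketch of your alternative 2 with unordered_set swapped in (the function name is mine, just for illustration):

    #include <unordered_set>
    #include <vector>

    void remove_duplicates(std::vector<int>& vals)
    {
        // average O(n); the result is unique but in no particular order
        std::unordered_set<int> s(vals.begin(), vals.end());
        vals.assign(s.begin(), s.end());
    }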

+4

Option 1 will beat all the others. Its complexity is only O(N log N), and the vector's contiguous memory keeps the constant factors low.

std::set usually suffers a lot from its non-contiguous allocations. It is not just that accessing the elements is slow; building the set in the first place also takes a lot of time.
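
If you want to check this on your own data, a rough timing sketch along these lines is enough (the sizes and the random input here are arbitrary; real numbers will depend on your data and compiler):

    #include <algorithm>
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <set>
    #include <vector>

    int main()
    {
        std::vector<int> vals(1000000);
        for (int& v : vals) v = std::rand() % 1000;          // plenty of duplicates

        std::vector<int> a = vals, b = vals;

        auto t0 = std::chrono::steady_clock::now();
        std::sort(a.begin(), a.end());                       // option 1
        a.erase(std::unique(a.begin(), a.end()), a.end());
        auto t1 = std::chrono::steady_clock::now();

        std::set<int> s(b.begin(), b.end());                 // option 2
        b.assign(s.begin(), s.end());
        auto t2 = std::chrono::steady_clock::now();

        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        std::cout << "sort + unique: " << duration_cast<milliseconds>(t1 - t0).count() << " ms\n"
                  << "set:           " << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
    }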

+3

All of these methods have their drawbacks, although (1) is worth a look.

But take a look at this fifth option: remember that you can access the vector's data buffer with the data() function. Then, bearing in mind that no reallocation will happen, since the vector only shrinks, apply the algorithm we learned at school:

    #include <cstddef>

    /* Reference: Gang of Four, I think.
       Returns the new logical length so the caller can shrink the vector afterwards. */
    std::size_t unduplicate(int* arr, std::size_t length)
    {
        if (length < 2)
            return length;              // nothing to deduplicate

        int* begin = arr;
        int *it, *end = arr + length - 1;
        for (it = arr + 1; arr < end; arr++, it = arr + 1) {
            while (it <= end) {
                if (*it == *arr) {
                    *it = *end--;       // overwrite the duplicate with the current last element
                } else {
                    ++it;
                }
            }
        }
        return static_cast<std::size_t>(end - begin + 1);
    }

    // in the calling code:
    vals.resize(unduplicate(vals.data(), vals.size()));

And resize the vector at the end, if that is what is required. This is never worse than O(N^2), so it beats the insertion-sort or sort-and-then-remove approaches.

Your 4th option might be an idea if you can accept it. Profile the performance. Otherwise, use my algorithm from the 1960s.

+1

Recently, I ran into a similar problem and experimented with 1, 2 and 4, as well as with the unordered_set version of 4. In the end, the best result was the last one: 4 with unordered_set instead of set.

By the way, this empirical result is not very surprising if you consider that both set and sort do a bit more work than is needed here: they guarantee the relative order of unequal elements. For example, the input 4,3,5,2,4,3 leads to the sorted output of unique values 2,3,4,5. This is unnecessary if you can live with the unique values in arbitrary order, i.e. 3,4,2,5. When you use unordered_set, it guarantees only uniqueness, not order, and therefore it does not have to do the extra work of ordering distinct elements.
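
For reference, the winning variant looks roughly like this (simplified; my real code builds the container while the values are produced):

    #include <unordered_set>
    #include <vector>

    int main()
    {
        int input[] = { 4, 3, 5, 2, 4, 3 };

        std::unordered_set<int> s;
        for (int v : input)
            s.insert(v);                 // duplicates are dropped on insertion, no ordering work

        // if a vector is needed at the end: unique values in arbitrary order, e.g. 3, 4, 2, 5
        std::vector<int> vals(s.begin(), s.end());
    }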

0

Source: https://habr.com/ru/post/1236212/

