Loading an STL set with pre-sorted data, C++

I work with C++ in Visual Studio 2010. I have an STL set that I save to a file when my program shuts down. The next time the program starts, I load the (sorted) data back into the set. I am trying to optimize this loading step and am having problems. I suspect the cause is frequent rebalancing, and I'm looking for a way to avoid it.

At first I did this without any optimization, using plain set::insert(const value_type& x).

Time: ~5.5 minutes

Then I tried the version of insert() where you pass a hint for the insertion position:

    iterator insert(iterator position, const value_type& x);

Roughly, I did this:

    set<int> My_Set;
    set<int>::iterator It;
    It = My_Set.insert(0).first;       // insert(value) returns a pair<iterator, bool>
    for (int I = 1; I < 1000; I++)
    {
        It = My_Set.insert(It, I);     // remember the previous insertion iterator as the hint
    }

Time: ~5.4 minutes

Hardly any improvement! I don't think the problem is the overhead of reading from the file - commenting out the insert() reduces the time to 2 seconds. I also don't think the problem is the cost of copying my object - it is a plain-old-data object containing an int and a char.

The only thing I can think of is that the set is constantly being rebalanced.

1.) Do you agree with my guess?

2.) Is there a way to “pause” rebalancing while I load the set, and then rebalance once at the end? (Or... would that even help?)

3.) Is there a smarter way to load the sorted data, i.e. not simply going from lowest to highest? For example, alternating my inserts so the tree doesn't have to rebalance as often? (Example: insert 1, 1000, 2, 999, 3, 998, ...)

+4
3 answers

How many elements are we talking about?

I did a short test with 10,000,000 integers (prepared in a vector) and inserted them into a set in different ways.

Prepare input:

    std::vector<int> input;
    for(int i = 0; i < 10*1000*1000; ++i)
    {
        input.push_back(i);
    }


Insert into the set element by element with insert():

Release: 2.4 seconds / Debug: 110.8 seconds

    std::set<int> mySet;
    std::for_each(input.cbegin(), input.cend(),
                  [&mySet](int value) { mySet.insert(value); });


Insert into the set using insert(itBegin, itEnd):

Release: 0.9 seconds / Debug: 47.5 seconds

    std::set<int> mySet;
    mySet.insert(input.cbegin(), input.cend());

    // this is also possible - same execution time:
    // std::set<int> mySet(input.cbegin(), input.cend());

So the insertion can be sped up considerably, but even the slow path should take nowhere near several minutes.


EDIT:

In the meantime I also ran the test in debug mode - wow - I knew debug performance was worse, but it is much worse than I thought. With 50,000,000 elements the debug build fails with a bad allocation, so I updated the post to 10,000,000 elements and listed times for both the release and the debug build.

You can see the huge difference here - roughly a factor of 50 between debug and release for the faster solution.

In addition, the fast path (insert(itBegin, itEnd)) appears to be linear in the number of elements (with pre-sorted data!). The previous test had five times as many elements, and the insertion time dropped from 4.6 to 0.9 seconds - about a factor of five.
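
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the comparison. The element count matches the test above; the timing uses clock(), and the exact numbers will of course vary by machine, compiler, and build settings:

    #include <cstdio>
    #include <ctime>
    #include <set>
    #include <vector>

    int main()
    {
        // Prepare sorted input, as in the test above.
        std::vector<int> input;
        for (int i = 0; i < 10 * 1000 * 1000; ++i)
            input.push_back(i);

        // Variant 1: element-by-element insert, no hint.
        std::clock_t t0 = std::clock();
        std::set<int> s1;
        for (std::vector<int>::const_iterator it = input.begin(); it != input.end(); ++it)
            s1.insert(*it);
        std::clock_t t1 = std::clock();

        // Variant 2: range insert - the implementation can append the sorted data cheaply.
        std::set<int> s2;
        s2.insert(input.begin(), input.end());
        std::clock_t t2 = std::clock();

        std::printf("element-wise: %.2f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
        std::printf("range insert: %.2f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    }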

+5

Have you tried the range constructor?

    #include <set>
    #include <fstream>
    #include <algorithm>
    #include <iterator>

    int main()
    {
        std::ifstream file("Plop");
        std::set<int> myset;

        std::copy(std::istream_iterator<int>(file),
                  std::istream_iterator<int>(),
                  std::inserter(myset, myset.end()));
    }
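
Strictly speaking, the snippet above goes through std::copy plus an inserter; the range constructor itself can consume the stream directly. A minimal sketch (the extra parentheses around the first argument avoid the most-vexing-parse problem):

    #include <fstream>
    #include <iterator>
    #include <set>

    int main()
    {
        std::ifstream file("Plop");

        // Build the set straight from a pair of stream iterators.
        std::set<int> myset((std::istream_iterator<int>(file)),
                            std::istream_iterator<int>());
    }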

I tried 4 methods with the items [0, 10,000,000) (sorted in the file):

    void t1(std::set<int>& data, std::istream& file)
    {
        int x;
        while(file >> x) { data.insert(x); }
    }

    void t2(std::set<int>& data, std::istream& file)
    {
        int x;
        while(file >> x) { data.insert(data.end(), x); }
    }

    void t3(std::set<int>& data, std::istream& file)
    {
        std::set<int>::iterator it = data.begin();
        int x;
        while(file >> x) { it = data.insert(it, x); }
    }

    void t4(std::set<int>& data, std::istream& file)
    {
        std::copy(std::istream_iterator<int>(file),
                  std::istream_iterator<int>(),
                  std::inserter(data, data.end()));
    }

Time measured with clock(), averaged over 3 runs (normal) and 3 runs (-O4):

    Plain Data:
                    Normal       -O4
                  =========  =========
    t1 Result:     21057300    6748061
    t2 Result:      6580081    4752549
    t3 Result:      6675929    4786003
    t4 Result:      8452749    6460603

Conclusion 1: for sorted data:

    Best:  data.insert(data.end(), <item>);   // hint with end()
    Worst: data.insert(<item>);               // no hint

Conclusion 2: optimization counts.
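
Applied to the question's setup, the winning variant (t2) would look roughly like the sketch below. MyRecord, its ordering, and the binary file layout are placeholder assumptions standing in for the poster's actual POD type, not something taken from the original post:

    #include <fstream>
    #include <set>

    // Stand-in for the poster's plain-old-data element (an int plus a char).
    struct MyRecord
    {
        int  id;
        char tag;
    };

    // Assumed ordering; the real set presumably defines its own comparison.
    bool operator<(const MyRecord& a, const MyRecord& b)
    {
        return a.id < b.id;
    }

    void load(std::set<MyRecord>& data, std::istream& file)
    {
        MyRecord r;
        // The file was written in sorted order, so every record belongs at the
        // right-most position; hinting with end() avoids a search from the root.
        while (file.read(reinterpret_cast<char*>(&r), sizeof r))
            data.insert(data.end(), r);
    }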

+2

It is possible that the set is being rebalanced. How many items are you REALLY inserting in those 5.6 minutes? If your element set is large enough, you may be hitting physical RAM limits and thrashing, or just suffering very bad cache misses.

There is definitely no way to disable rebalancing. If you could, the set could break its invariants, which would be bad.

  • Get a profiler and profile your code, rather than guessing where the time goes.
  • Have you tried the two-parameter insert with end() as the hint instead of the previous iterator, as another data point?
  • For comparison, have you tried timing inserts into a pre-reserved vector instead?
  • Can you get away with a different kind of container, such as a heap or a (sorted) vector? (See the sketch after this list.)
  • If you can load into a vector quickly, do that, then random_shuffle it, then try inserting that into the set and see what happens.
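
A minimal sketch of the sorted-vector alternative from the list above (the file name, the int element type, and the reserve count are illustrative; lookups then go through std::binary_search or std::lower_bound instead of set::find):

    #include <algorithm>
    #include <fstream>
    #include <iterator>
    #include <vector>

    int main()
    {
        std::ifstream file("Plop");

        // Load straight into a vector; reserve up front if the count is known.
        std::vector<int> data;
        data.reserve(10 * 1000 * 1000);
        data.assign(std::istream_iterator<int>(file), std::istream_iterator<int>());

        // The file is already sorted; otherwise sort once after loading:
        // std::sort(data.begin(), data.end());

        // Set-style membership test against the sorted vector.
        bool found = std::binary_search(data.begin(), data.end(), 12345);
        (void)found;
    }
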
+1

Source: https://habr.com/ru/post/1344992/

