Multithreaded heap de/allocation

I have a very large array of ~30M objects of approximately 80 bytes apiece - that's ~2.2 GB for those keeping track - stored on disk. The actual size of each object varies slightly, because each one has a child QMap<quint32, QVariant>.

Unpacking these objects from the raw data is expensive, so I implemented a multithreaded read that sequentially pulls a few MB from disk and then hands each raw block of data to a worker thread to unpack in parallel via QtConcurrent. The objects are created (via new) on the heap inside the worker threads and passed back to the main thread for the next step. On completion, the objects are deleted in the main thread.

Single-threaded, this deletion is reasonably fast (~4-5 seconds). Multithreaded across 4 threads, however, it is painfully slow (~26-36 seconds). Profiling with Very Sleepy indicates that the time is spent in MSVCR100's free, so it is the deletion itself that is slow.

Searching around SO suggests that allocating and deleting on different threads should be safe. What is the source of the slowdown, and what can I do about it?

Edit: here is a code sample that gives the idea of what's happening. For troubleshooting I removed the disk I/O from this example entirely and simply create the objects and then delete them:

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */ }
    ~MyObject() {}
    QMap<quint32, QVariant> map;
    //...other members
};

//...

QList<MyObject*> results;

/* set up the mapped lambda functor (QtConcurrent requires std::function when returning a value) */
std::function<QList<MyObject*>(quint64)> importMap =
    [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1); //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda

/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
    results.append(chunkimported);
}; //end import reduce lambda

/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);

/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1); //4 for the multithreaded runs
QElapsedTimer tnew;
tnew.start();
QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce,
    QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce)
    .waitForFinished(); //block until all chunks are mapped and reduced, or the timer measures nothing
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);

//do stuff with the objects here

/* delete the objects */
QElapsedTimer tdelete;
tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);

Here are the results with and without the data insertions into MyObject::map, and with 1 or 4 threads available to QtConcurrent:

  • 1 thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
  • 4 threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
  • 1 thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
  • 4 threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds

In both scenarios, deleting the objects takes significantly longer when they were created in parallel on 4 threads than when they were created sequentially on 1 thread, and the QMap insertions aggravate the effect dramatically.

+6
5 answers

(promoting my comment to an answer)

This may be because in a single thread all allocations are sequential, so the frees are sequential too. With multithreaded allocation the blocks are more interleaved, so free has to do more work to clean up after each release.

+2

This is rather speculative, but I assume the OS memory manager has a single synchronization point, since it serves a single pool of virtual memory; throwing more threads at it won't improve speed, it will just throttle it with synchronization overhead. Thread safety combined with concurrent access always carries a penalty, so the more threads you throw at it, the bigger the penalty you pay.

30M allocations is quite a lot regardless of their size, and the allocator's bookkeeping consumes significant memory as well. I would recommend investing the time to implement memory pools: preallocate monolithic chunks of memory and use placement new to construct the objects inside those pools. That will save a huge amount of CPU time and a significant amount of memory, and it will also improve cache friendliness and the cache hit rate by reducing fragmentation.
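A minimal sketch of what such a pool could look like (the ObjectPool type and its interface are illustrative, not from the question), assuming all pooled objects can be destroyed together when the pool is torn down:

#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Arena-style pool: grabs monolithic blocks up front and carves
// fixed-size slots out of them with placement new. Everything is
// destroyed together when the pool itself is destroyed.
template <typename T, std::size_t BlockCapacity = 500000>
class ObjectPool {
public:
    ~ObjectPool() {
        for (T* obj : live_) obj->~T();               // run destructors
        for (char* block : blocks_) ::operator delete(block);
    }

    template <typename... Args>
    T* create(Args&&... args) {
        if (blocks_.empty() || used_ == BlockCapacity) {
            blocks_.push_back(static_cast<char*>(
                ::operator new(sizeof(T) * BlockCapacity)));
            used_ = 0;
        }
        void* slot = blocks_.back() + sizeof(T) * used_++;
        T* obj = new (slot) T(std::forward<Args>(args)...); // placement new
        live_.push_back(obj);
        return obj;
    }

private:
    std::vector<char*> blocks_;   // monolithic chunks
    std::vector<T*> live_;        // constructed objects, for destruction
    std::size_t used_ = 0;        // slots used in the current block
};

With one pool per worker thread, 30M deletions collapse into a few dozen operator delete calls. Note, though, that a pooled MyObject still owns a QMap whose nodes live on the regular heap, so the map contents would need pooling too to remove the bottleneck entirely.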

To put it as a metaphor: putting 4 cooks at one stove won't make the cooking 4 times faster; it will make each cook at least 4 times slower, plus the time they waste contending for the shared resource. That is pretty much what you are seeing in practice.

+7

If you allocate from multiple threads into the same memory pool, you create a bottleneck when freeing, because the units being deleted sequentially are not adjacent to one another.

Since your units are of a fixed size, you should be able to get O(1) performance out of your allocator/deallocator. A fixed-size unit allocator that puts a set of same-sized blocks on a free list and then pushes/pops them as needed is what you should look into.
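A sketch of such a fixed-size unit allocator (names are illustrative, not from the answer): freed slots are threaded onto an intrusive singly linked list, so both operations are a pointer push/pop.

#include <cstddef>
#include <new>

// Fixed-size allocator over one slab: free slots form an intrusive
// singly linked list. allocate() pops the head, deallocate() pushes
// a slot back -- both O(1), with no per-slot bookkeeping.
class FreeListAllocator {
    struct Node { Node* next; };
public:
    FreeListAllocator(std::size_t slotSize, std::size_t slotCount)
        : slotSize_(slotSize < sizeof(Node) ? sizeof(Node) : slotSize) {
        slab_ = static_cast<char*>(::operator new(slotSize_ * slotCount));
        for (std::size_t i = 0; i < slotCount; ++i)
            push(slab_ + i * slotSize_);
    }
    ~FreeListAllocator() { ::operator delete(slab_); }

    void* allocate() {                    // O(1): pop the free-list head
        if (!head_) return nullptr;       // slab exhausted
        Node* n = head_;
        head_ = n->next;
        return n;
    }
    void deallocate(void* p) { push(p); } // O(1): push the slot back

private:
    void push(void* p) {
        Node* n = static_cast<Node*>(p);
        n->next = head_;
        head_ = n;
    }
    char* slab_ = nullptr;
    Node* head_ = nullptr;
    std::size_t slotSize_;
};

This sketch assumes slotSize is a multiple of the required alignment (the ~80-byte objects in the question would qualify) and omits growth and thread safety; using one allocator per thread sidesteps the latter.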

+1

Memory allocation and free are notoriously slow because the runtime serializes access to the heap. That serialization is what makes new and free thread-safe, but it also slows them down considerably.

It is common practice to preallocate one large block of memory when each item has a fixed size.

Another way is to use memory-mapped files to bypass allocation altogether. Qt has cross-platform memory mapping support that you could try. By the way, how do you serialize the QMap?
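In Qt that mapping is done through QFile::map; a minimal sketch (Qt 6 assumed, file name hypothetical) that also shows QDataStream deserializing a QMap, which is one answer to the serialization question:

#include <QFile>
#include <QDataStream>
#include <QMap>
#include <QVariant>

QFile file("objects.bin");                 // hypothetical data file
if (file.open(QIODevice::ReadOnly)) {
    // Map the file into the address space: no read() copies,
    // the OS pages the data in on demand.
    if (uchar* base = file.map(0, file.size())) {
        QByteArray raw = QByteArray::fromRawData(
            reinterpret_cast<const char*>(base), file.size());
        QDataStream stream(raw);
        QMap<quint32, QVariant> map;
        stream >> map;                     // QDataStream can stream a QMap directly
        file.unmap(base);
    }
    file.close();
}

Note that Qt 5's int-sized QByteArray caps fromRawData at 2 GB, so a file of the size in the question would have to be mapped and decoded in chunks there.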

+1

I would be tempted to allocate a relatively large block of memory for each thread, and within that thread use it as if it were a stack (or a circular buffer). This works well if you always add new objects at one end and remove them from the other, or if you can release a whole group of objects in one step (as happens with the stack when a function call returns). Otherwise you need the general new/delete functionality you get from the heap, which, as you've discovered, can be a major performance bottleneck in some cases.
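A sketch of that per-thread, stack-like arena (the StackArena name is mine): allocation bumps a pointer, and a whole group of objects is released in one step by rewinding to a saved marker.

#include <cstddef>
#include <new>

// Per-thread block used like a stack: alloc() bumps a pointer,
// rewind() releases everything allocated since a marker in O(1).
// No bounds checking -- a sketch, not production code.
class StackArena {
public:
    explicit StackArena(std::size_t bytes)
        : base_(static_cast<char*>(::operator new(bytes))), top_(base_) {}
    ~StackArena() { ::operator delete(base_); }

    void* alloc(std::size_t n) {
        void* p = top_;
        top_ += (n + 7) & ~std::size_t(7);  // keep 8-byte alignment
        return p;
    }
    char* marker() const { return top_; }   // remember a position...
    void rewind(char* m) { top_ = m; }      // ...and free back to it at once

private:
    char* base_;
    char* top_;
};

Objects holding a QMap would still need their destructors run before a rewind, since the map owns heap memory of its own.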

Edit: I think we're actually missing the point here. It really doesn't make sense for the deletion at the end to be that slow. If I've understood the code correctly, at that point only the main thread is running, isn't it?

0

Source: https://habr.com/ru/post/1013359/

