I have a very large array of ~30 M objects of approximately 80 bytes apiece, which works out to ~2.2 GB for those keeping track (30,000,000 × 80 B ≈ 2.4 × 10⁹ bytes ≈ 2.2 GiB), stored on disk. The actual size of each object varies slightly, because each one has a child QMap<quint32, QVariant>.
Unpacking these objects from the raw data is expensive, so I implemented a multi-threaded read operation that sequentially pulls a few MB at a time from disk and then hands each raw block off to be unpacked in parallel via QtConcurrent (each worker wraps its block in a stream and deserializes it). My objects are created (via new) on the heap inside the worker threads and then passed back to the main thread for the next step. Upon completion, these objects are deleted in the main thread.
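For context, the pipeline shape is roughly the following. This is only a sketch: readBlocksFromDisk and the stream-unpacking constructor are placeholders standing in for the real code.

#include <QtConcurrent>
#include <QDataStream>
#include <functional>

/* sketch: raw blocks are read sequentially on the main thread */
QList<QByteArray> blocks = readBlocksFromDisk(); //placeholder: a few MB per block

/* each block is wrapped in a stream and unpacked on a pool thread;
   the objects are new'ed inside the workers */
std::function<QList<MyObject*>(const QByteArray&)> unpack =
    [](const QByteArray& raw) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    QDataStream stream(raw);
    while(!stream.atEnd())
        objs.append(new MyObject(stream)); //placeholder unpacking ctor
    return objs;
};
QFuture<QList<MyObject*>> unpacked = QtConcurrent::mapped(blocks, unpack);
unpacked.waitForFinished();
/* the unpacked objects are then consumed, and later deleted, on the main thread */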
In a single-threaded run, this deletion is relatively fast (~4-5 seconds). However, when the objects were created on 4 threads, the deletion is incredibly slow (~26-36 seconds). Profiling with Very Sleepy indicates that the time is spent inside free in MSVCR100, so it is the deallocation itself that is slow.
Searching around SO suggests that allocating on one thread and deleting on another should be safe. What is the source of the contention, and what can I do about it?
Edit: here is some example code that conveys the idea of what's happening. For troubleshooting, I removed the disk I/O from this example entirely and simply create the objects and then delete them:
#include <QtConcurrent>
#include <QThreadPool>
#include <QMap>
#include <QVariant>
#include <QElapsedTimer>
#include <functional>

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */ }
    ~MyObject() {}

    QMap<quint32, QVariant> map;
    //...other members
};

//...

QList<MyObject*> results;

/* set up the mapped lambda functor (QtConcurrent reqs std::function if returning) */
std::function<QList<MyObject*>(quint64 chunksize)> importMap =
    [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1); //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda

/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*> chunkimported)
{
    results.append(chunkimported); //append the whole chunk to the flat list
}; //end import reduce lambda

/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);

/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1); //4 for the multithreaded run
QElapsedTimer tnew;
tnew.start();
QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce,
    QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce).result(); //block until done
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);

//do stuff with the objects here

/* delete the objects */
QElapsedTimer tdelete;
tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);
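(mappedReduced farms importMap out to the pool threads one chunk at a time, runs importReduce on a single thread in chunk order because of OrderedReduce | SequentialReduce, and returns a QFuture, so the .result() call blocks until the whole pipeline has drained before the timer is read.)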
Here are the results with and without the data insertion into MyObject::map, and with 1 or 4 threads available to QtConcurrent:
- 1 Thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
- 4 Threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
- 1 Thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
- 4 Threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds
In both scenarios, deleting the objects takes significantly longer when they were created in parallel on 4 threads than when they were created sequentially on 1 thread, and the gap is aggravated dramatically by doing the QMap insertions in parallel.
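For reference, here is a sketch of the kind of counter-experiment that would isolate the cross-thread frees: delete each chunk on a pool thread instead of on the main thread. It assumes the results were kept as a hypothetical per-chunk list chunks (a QList<QList<MyObject*>>) instead of being flattened into results as above.

/* sketch only: free on pool threads rather than the main thread; "chunks"
   is a hypothetical QList<QList<MyObject*>>, one sub-list per import chunk */
std::function<void(QList<MyObject*>&)> deleteChunk =
    [](QList<MyObject*>& chunk)
{
    qDeleteAll(chunk); //the frees now happen on a worker thread
    chunk.clear();
};
QtConcurrent::map(chunks, deleteChunk).waitForFinished();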