How can I speed up a Mac app that runs 5,000 independent tasks?

I have a long-running (5-10 hours) Mac app that processes 5,000 items. Each item is processed through a series of transformations (using Saxon), runs a set of scripts (in Python and Racket), collects data, and serializes it as a set of XML files, an SQLite database, and a Core Data database. Each item is completely independent of every other item.

In short, it does a lot of work, takes a long time, and appears to be highly parallelizable.

After loading all the items that need to be processed, the application uses GCD to parallelize the work with dispatch_apply:

    dispatch_apply(numberOfItems,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
                   ^(size_t i) {
        @autoreleasepool {
            // ... per-item work ...
        }
    });

I am running the application on a 12-core (24 hyper-threaded) Mac Pro, so I would expect 24 items to be in flight at any given time. However, I found that the number of items actually being processed varies between 8 and 24. This literally adds hours to the runtime (versus keeping 24 items in flight throughout).

On the one hand, maybe GCD is really, really smart and is already giving me maximum throughput. But I worry that, because most of the work happens in scripts spawned by this application, GCD is reasoning from incomplete information and not making the best decisions.

Any ideas how to increase performance? After correctness, the number one desired attribute is reducing this application's execution time. I don't care about power consumption, monopolizing the Mac Pro, or anything else.

UPDATE: Actually, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, *the amount of work being done by other processes*, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like other processes doing work (even ones this application spawns) will adversely affect scheduling inside the application.

It would be nice to be able to say, "run these blocks concurrently, one per core, and don't try to be any smarter about it."

1 answer

If you are bound and determined, you can explicitly create 24 threads with the NSThread API and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would be noticeably worse.
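For concreteness, here is a minimal sketch of that NSThread approach (the Worker class, processItem, and the way the item count is passed are my placeholders, not code from the question): each detached thread claims the next index from a shared atomic counter and exits when the work runs out.

    #import <Foundation/Foundation.h>
    #import <stdatomic.h>

    static _Atomic size_t gNextItem = 0;

    @interface Worker : NSObject
    @end

    @implementation Worker
    - (void)run:(NSNumber *)itemCount {
        size_t i;
        // Claim the next unclaimed index; stop once every item has been taken.
        while ((i = atomic_fetch_add(&gNextItem, 1)) < itemCount.unsignedIntegerValue) {
            @autoreleasepool {
                // processItem(i); // hypothetical per-item work from the question
            }
        }
    }
    @end

    // One thread per logical core:
    // for (NSUInteger t = 0; t < 24; t++)
    //     [NSThread detachNewThreadSelector:@selector(run:)
    //                              toTarget:[Worker new]
    //                            withObject:@(numberOfItems)];

The point stands, though: a hand-rolled pool like this does nothing about the blocking described below, which is the real issue.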

GCD works most efficiently when the items being processed never block. However, the workload you describe is fairly complex and abounds with opportunities for your threads to block. First, you are spawning a bunch of other processes. Right there, that means you are already relying on the OS to divide time and resources between your main task and these subordinate tasks. Other than setting the OS priority of each subprocess, the scheduler has no way of knowing which processes are more important than others, and by default your subprocesses will have the same priority as their parent. That said, it doesn't sound like you have much to gain by adjusting process priorities. I am assuming that you block the main task's threads while waiting for the subordinate tasks to complete. That effectively parks those threads -- they can do no useful work. But, like I said, I don't think much would be gained by adjusting the OS priorities of your subordinate tasks, because this really sounds like an I/O-bound workflow...
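To make the "parked thread" point concrete, here is a sketch of what each GCD block presumably does when it runs one of those scripts (this is my assumption about the setup; the script path and arguments are hypothetical):

    // Inside the dispatch_apply block for item i:
    NSTask *task = [[NSTask alloc] init];
    task.launchPath = @"/usr/bin/python";
    task.arguments = @[@"transform_item.py", [NSString stringWithFormat:@"%zu", i]];
    [task launch];
    // This call blocks the GCD worker thread until the subprocess exits,
    // so the thread occupies a pool slot while doing no work itself.
    [task waitUntilExit];

While a thread sits in waitUntilExit, it is GCD's (opaque) decision whether to spin up a replacement, which is consistent with the variable in-flight count you observed.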

You go on to describe three I/O-heavy outputs ("serializes it as a set of XML files, an SQLite database, and a Core Data database"). So now you have all these different threads and processes competing for what is presumably shared storage. (That is, unless you are writing to 24 different databases on 24 separate hard drives, one per core, access to the drive will ultimately serialize your processes.) And even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads will be pulled off the CPU they were running on (so that another waiting thread can run) for virtually any blocking disk write.

If you want to maximize the performance you get from GCD, you probably want to rewrite all the stuff you are doing in subtasks in C/C++/Objective-C, bringing it in-process, and then perform all the associated I/O using the dispatch_io primitives. For APIs where you don't control the reads and writes at a low level, you will just have to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of things to write to a single, shared SQLite database, there is no point in having more than one thread trying to write to that database at once. You would be better off making a single thread (or a serial GCD queue) to write to SQLite and submitting tasks to it after the preprocessing is done.
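A minimal sketch of that last suggestion (preprocessItem and writeResultToDatabase are hypothetical names, not part of the question's code): the workers do their CPU-bound work in parallel, but every database write is funneled through one serial queue, so SQLite never sees concurrent writers.

    dispatch_queue_t sqliteQueue =
        dispatch_queue_create("com.example.sqlite-writer", DISPATCH_QUEUE_SERIAL);

    dispatch_apply(numberOfItems,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
                   ^(size_t i) {
        @autoreleasepool {
            id result = preprocessItem(i);     // hypothetical CPU-bound work, runs in parallel
            dispatch_async(sqliteQueue, ^{
                writeResultToDatabase(result); // hypothetical; writes execute one at a time
            });
        }
    });

The same pattern generalizes: one serial queue per contended resource, with the parallel workers handing off to it instead of contending directly.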

I could go on at some length here, but the bottom line is that you have a fairly complex, seemingly I/O-bound workflow. At the highest level, CPU utilization or "number of threads running" is going to be a particularly poor measure of performance for a task like this. By using subprocesses (i.e., scripts), you put much of the control in the hands of the OS, which knows virtually nothing about your workload and therefore can do nothing but apportion resources with its general-purpose scheduler. GCD's opaque thread-pool management is actually the least of your problems.

On a practical level, if you want to speed things up, go buy several faster hard drives (i.e., SSDs) and rework your tasks/workflow to use them separately and in parallel. I suspect that would give the biggest bang for your buck (for some equivalence relation of time == money == hardware.)


Source: https://habr.com/ru/post/1497156/

