Differences in Linux thread scheduling on multi-core systems?

We have several latency-sensitive pipelined programs that show measurable performance degradation when run on one Linux kernel versus another. In particular, we see better performance with the 2.6.9 kernel from CentOS 4.x (RHEL4) and worse performance with the 2.6.18 kernel from CentOS 5.x (RHEL5).

By a pipelined program I mean one with multiple threads that cooperate on shared data, with a queue between each pair of adjacent threads. So thread A receives data and pushes it into Qab, thread B pops from Qab, does some processing, and pushes into Qbc, thread C pops from Qbc, and so on. The initial data comes off the network (generated by a third party).
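For illustration, here is a minimal, self-contained sketch of that pipeline shape. The queue here is a toy bounded ring buffer built on a mutex and condition variables, and the payload and workload are placeholders, not our real code:

```c
#include <pthread.h>
#include <stdio.h>

#define QCAP 1024

typedef struct {
    int buf[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void queue_push(queue_t *q, int v) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

static queue_t q_ab, q_bc;

static void *stage_a(void *arg) {            /* receives data (stubbed) */
    for (int i = 0; i < 100000; i++)
        queue_push(&q_ab, i);
    queue_push(&q_ab, -1);                   /* end-of-stream marker */
    return NULL;
}

static void *stage_b(void *arg) {            /* intermediate processing */
    int v;
    while ((v = queue_pop(&q_ab)) != -1)
        queue_push(&q_bc, v + 1);            /* stand-in for real work */
    queue_push(&q_bc, -1);
    return NULL;
}

static void *stage_c(void *arg) {            /* last stage in the chain */
    int v, n = 0;
    while ((v = queue_pop(&q_bc)) != -1)
        n++;
    printf("processed %d items\n", n);
    return NULL;
}

int main(void) {                             /* build with: gcc -pthread */
    pthread_t a, b, c;
    queue_init(&q_ab);
    queue_init(&q_bc);
    pthread_create(&a, NULL, stage_a, NULL);
    pthread_create(&b, NULL, stage_b, NULL);
    pthread_create(&c, NULL, stage_c, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_join(c, NULL);
    return 0;
}
```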

We mainly measure the time from the moment the data is received until the last thread completes its task. In our application, we see an increase of 20 to 50 microseconds when moving from CentOS 4 to CentOS 5.
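For reference, this is roughly how such a measurement can be taken (the item_t struct below is a placeholder, not our actual data type): stamp each item with CLOCK_MONOTONIC when it comes off the network and subtract at the last stage.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>      /* link with -lrt on older glibc */

typedef struct {
    struct timespec t_recv;   /* stamped by the receiving thread */
    /* ... payload would go here ... */
} item_t;

/* Called by the first thread as soon as the data arrives. */
static void stamp_receive(item_t *it) {
    clock_gettime(CLOCK_MONOTONIC, &it->t_recv);
}

/* Called by the last thread once its work on the item is done. */
static int64_t latency_us(const item_t *it) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - it->t_recv.tv_sec) * 1000000LL +
           (now.tv_nsec - it->t_recv.tv_nsec) / 1000;
}

int main(void) {
    item_t it;
    stamp_receive(&it);
    /* ... the item would travel through the pipeline here ... */
    printf("end-to-end latency: %lld us\n", (long long)latency_us(&it));
    return 0;
}
```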

I have used several profiling methods on our application and determined that the added latency on CentOS 5 comes from the queue operations (in particular, from pop).

However, I can bring performance on CentOS 5 back up to CentOS 4 levels by using taskset to bind the program to a subset of the available cores.

So it appears that something changed between CentOS 4 and 5 (presumably in the kernel) that causes threads to be scheduled differently, and this difference is suboptimal for our application.

While I can "solve" the problem with taskset (or in code via sched_setaffinity()), I would prefer not to. I am hoping there is some kind of kernel tunable (or maybe a collection of tunables) whose default changed between versions.
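For completeness, the in-code version of that workaround looks roughly like this; the core numbers are placeholders, and the shell equivalent is `taskset -c 0,1 ./app`:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cores(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* placeholder core ids */
    CPU_SET(1, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void) {
    return pin_to_cores() == 0 ? 0 : 1;
}
```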

Does anyone have any experience with this? Perhaps some more areas to investigate?

Update: In this particular case, the issue was resolved by a BIOS update from the server vendor (Dell). I pulled my hair out for quite a while before going back to basics and checking the vendor's BIOS updates. Suspiciously, one of the updates said something like "improve performance in maximum performance mode." Once I upgraded the BIOS, CentOS 5 was faster - generally speaking, but particularly in my queue tests and in actual production runs.

+6
2 answers

Hmm.. if the time taken to pop() from a producer-consumer queue is significantly affecting the overall performance of your app, I would suggest that the structure of your threads/workflow is not optimal somewhere. Unless there is a huge amount of contention on the queues, I would be surprised if a push/pop on any P-C queue on any modern OS took more than a µs or so, even if the queue uses kernel locks in the classic "Computer Science 117" manner - a bounded P-C queue built with three semaphores.
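For reference, by the three-semaphore construction I mean roughly this: one semaphore counting free slots, one counting filled slots, and one acting as a mutex around the buffer (capacity and payload here are arbitrary):

```c
#include <semaphore.h>

#define CAP 256

typedef struct {
    int buf[CAP];
    int head, tail;
    sem_t slots;   /* counts free slots            */
    sem_t items;   /* counts filled slots          */
    sem_t mutex;   /* binary, protects the buffer  */
} pc_queue_t;

static void pcq_init(pc_queue_t *q) {
    q->head = q->tail = 0;
    sem_init(&q->slots, 0, CAP);
    sem_init(&q->items, 0, 0);
    sem_init(&q->mutex, 0, 1);
}

static void pcq_push(pc_queue_t *q, int v) {
    sem_wait(&q->slots);               /* block while the queue is full  */
    sem_wait(&q->mutex);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % CAP;
    sem_post(&q->mutex);
    sem_post(&q->items);               /* one more item available        */
}

static int pcq_pop(pc_queue_t *q) {
    sem_wait(&q->items);               /* block while the queue is empty */
    sem_wait(&q->mutex);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % CAP;
    sem_post(&q->mutex);
    sem_post(&q->slots);               /* one more free slot             */
    return v;
}

int main(void) {                       /* build with: gcc -pthread */
    pc_queue_t q;
    pcq_init(&q);
    pcq_push(&q, 42);
    return pcq_pop(&q) == 42 ? 0 : 1;
}
```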

Can you just fold the functionality of the threads that do the least into those that do the most, thereby reducing the number of push/pops per overall work item that flows through your system?

+1

The Linux scheduler has been an area of intense change and controversy over the years. You might want to try a very recent kernel and give it a go. Yes, you may have to compile it yourself - it will be good for you. You might also (once you have the newer kernel) want to consider putting the different processes in different containers, with everything else in an extra one, and see if that helps.

As for other random things to try: you can raise the priority of your various processes, or add real-time semantics (caution: a buggy program with real-time privileges can starve the rest of the system).
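A rough sketch of the real-time route, if you want to try it; the priority value is a placeholder, the shell equivalent is `chrt -f 10 ./app`, and it needs root or CAP_SYS_NICE:

```c
#include <sched.h>
#include <stdio.h>

static int make_realtime(int prio) {
    struct sched_param sp = { .sched_priority = prio };
    /* pid 0 means "the calling process"; SCHED_FIFO runs until it blocks
     * or yields, which is exactly why a buggy loop can starve the box. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

int main(void) {
    return make_realtime(10) == 0 ? 0 : 1;   /* 10 is a placeholder */
}
```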

+1

Source: https://habr.com/ru/post/888927/

