Looks like you did your research! You have already identified the two main problems that can cause audio glitches. The question is how much of this was important 10 years ago and is only folklore and ritual today.
My two cents:
1. Heap allocations in the render loop:
These can have quite a bit of overhead, depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so every time you touch the heap, your performance depends on what the other threads in your process are doing. If, for example, a GUI thread is deleting thousands of objects while the audio rendering thread accesses the heap at the same time, you may experience a significant delay.
Writing your own memory management with pre-allocated buffers may sound messy, but in the end it is just two functions that you can hide somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is plenty of room to fine-tune and optimize it. For example, you can keep the free buffers in a simple linked list. Done right, this has the benefit that you hand out the most recently used buffer again, and that buffer has a very high probability of still being in the cache.
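To make the "two functions" idea concrete, here is a minimal sketch of such a pool. The names (`BufferPool`, `acquire`, `release`) are made up for illustration, and it assumes that only one thread touches the pool; if several threads share it, you still need a lock or a lock-free free list around it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size buffer pool. All memory is allocated once in the
// constructor; acquire() and release() never touch the heap.
class BufferPool {
public:
    BufferPool(std::size_t samplesPerBuffer, std::size_t bufferCount)
        : samplesPerBuffer_(samplesPerBuffer),
          storage_(samplesPerBuffer * bufferCount),
          nodes_(bufferCount) {
        assert(samplesPerBuffer > 0 && bufferCount > 0);
        // Chain every buffer into the free list, outside the render loop.
        for (std::size_t i = 0; i < bufferCount; ++i) {
            nodes_[i].data = &storage_[i * samplesPerBuffer];
            nodes_[i].next = freeList_;
            freeList_ = &nodes_[i];
        }
    }

    // Pops the most recently released buffer (LIFO), so it is likely to be
    // cache-warm. Returns nullptr when the pool is exhausted.
    float* acquire() {
        if (freeList_ == nullptr) return nullptr;
        Node* n = freeList_;
        freeList_ = n->next;
        return n->data;
    }

    // Pushes a buffer obtained from acquire() back onto the free list.
    void release(float* p) {
        assert(p >= storage_.data() && p < storage_.data() + storage_.size());
        std::size_t index =
            static_cast<std::size_t>(p - storage_.data()) / samplesPerBuffer_;
        nodes_[index].next = freeList_;
        freeList_ = &nodes_[index];
    }

private:
    struct Node {
        float* data = nullptr;
        Node* next = nullptr;
    };

    std::size_t samplesPerBuffer_;
    std::vector<float> storage_;  // one contiguous block for all buffers
    std::vector<Node> nodes_;     // one list node per buffer
    Node* freeList_ = nullptr;
};
```

Because the free list is a stack, a buffer you just released is the next one handed out, which is exactly the cache-friendly behaviour described above.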
If fixed-size allocators do not work for you, have a look at ring buffers. They fit streaming audio very well.
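As a rough illustration of the ring buffer idea, here is a sketch of a single-producer/single-consumer ring for float samples. The class name `SampleRing` and its interface are my own invention, and it assumes exactly one writer thread and one reader thread; it is not safe with more than that.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer/single-consumer ring buffer.
// One slot is always left empty so that "full" and "empty" differ.
class SampleRing {
public:
    explicit SampleRing(std::size_t capacity) : buf_(capacity + 1) {
        assert(capacity > 0);
    }

    // Producer side: copies up to n samples in, returns how many fit.
    std::size_t write(const float* src, std::size_t n) {
        const std::size_t size = buf_.size();
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        const std::size_t space = (tail + size - head - 1) % size;
        const std::size_t count = n < space ? n : space;
        for (std::size_t i = 0; i < count; ++i)
            buf_[(head + i) % size] = src[i];
        // Publish the new samples to the consumer.
        head_.store((head + count) % size, std::memory_order_release);
        return count;
    }

    // Consumer side: copies up to n samples out, returns how many were read.
    std::size_t read(float* dst, std::size_t n) {
        const std::size_t size = buf_.size();
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        const std::size_t avail = (head + size - tail) % size;
        const std::size_t count = n < avail ? n : avail;
        for (std::size_t i = 0; i < count; ++i)
            dst[i] = buf_[(tail + i) % size];
        // Hand the freed slots back to the producer.
        tail_.store((tail + count) % size, std::memory_order_release);
        return count;
    }

private:
    std::vector<float> buf_;
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

Neither call blocks or allocates, so the render thread can simply output silence when `read` returns fewer samples than it asked for.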
2. To lock or not to lock:
I would say that nowadays using mutex and semaphore locks is fine if you can estimate that you take fewer than 1000-5000 of them per second (on a PC; things look different on something like a Raspberry Pi). If you stay below this range, it is unlikely that the overhead will show up in a performance profile.
Translated to your use case: if, for example, you work with 48 kHz audio and 100-sample chunks, you generate approximately 960 lock/unlock operations per second in a simple two-thread producer/consumer setup, which is within that range. If you completely max out the render thread, the locking will not show up in a profile. If, on the other hand, you use only 5% of the available processing power, the locks may show up, but you will not have a performance problem either :-)
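If you want to plug in your own numbers, the estimate is just chunks per second times the number of threads. This tiny helper is hypothetical and assumes that each thread takes the lock once per chunk:

```cpp
#include <cassert>

// Rough estimate of lock/unlock pairs per second, assuming every thread
// takes the lock exactly once per audio chunk.
inline double lockOpsPerSecond(double sampleRateHz, double samplesPerChunk,
                               int threads) {
    assert(samplesPerChunk > 0.0 && threads > 0);
    const double chunksPerSecond = sampleRateHz / samplesPerChunk;
    return chunksPerSecond * threads;
}
```

With 48000 Hz, 100 samples per chunk and 2 threads this gives 480 chunks per second and therefore 960 lock operations per second.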
Going lock-free is also an option, and so are hybrid solutions that first make a few lock-free attempts and then fall back to a hard lock. That way you get the best of both worlds. There is a lot to read about this topic online.
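One possible shape for such a hybrid, sketched with `std::mutex`: try a bounded number of non-blocking `try_lock` calls, and only then block. The function name `lockHybrid` and the default of 100 attempts are arbitrary assumptions, not tuned values.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical hybrid lock: a few non-blocking attempts first, then a
// regular blocking lock. Returns true if the mutex was obtained without
// blocking. The caller must call m.unlock() in either case.
inline bool lockHybrid(std::mutex& m, int maxTries = 100) {
    assert(maxTries >= 0);
    for (int i = 0; i < maxTries; ++i) {
        if (m.try_lock()) return true;  // got it without blocking
        std::this_thread::yield();      // let the current holder make progress
    }
    m.lock();  // fall back to the hard lock
    return false;
}
```

The return value lets you count how often the fallback path is taken, which is a cheap way to check whether contention is a real problem in your setup.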
Anyway:
You should gently raise the priority of your non-GUI threads to make sure that, if they run into a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is and what you can do to avoid it:
https://en.wikipedia.org/wiki/Priority_inversion
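One mitigation described in that article is priority inheritance. On POSIX systems that support it, a mutex can be created with the `PTHREAD_PRIO_INHERIT` protocol. The sketch below assumes such a platform, and the helper name `initPriorityInheritMutex` is made up; other operating systems need a different mechanism.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical helper: initialises `m` so that a low-priority thread holding
// the mutex temporarily inherits the priority of the highest-priority waiter.
// Returns 0 on success, otherwise the error code from the failing call.
inline int initPriorityInheritMutex(pthread_mutex_t* m) {
    assert(m != nullptr);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

If the call fails, the platform probably does not support the protocol, and you can fall back to a default mutex and keep the critical sections as short as possible.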