Real-time multithreaded audio programming: to lock or not to lock

When writing audio software, many sources on the internet say it is of utmost importance to avoid memory allocation and locking code, i.e. to be lock-free, because both are non-deterministic and can cause the output buffer to underrun, making the audio glitch.

Real-time audio programming

When I write video software, I usually use both: I allocate video frames on the heap and pass them between threads using locks and condition variables (bounded buffers). I love the power this gives me, since a separate thread can be used for each operation, letting the software max out every core and deliver better performance.

With audio I would like to do something similar, passing frames of perhaps 100 samples between threads, but there are two problems.

  • How do I create the frames without using memory allocation? I suppose I could use a pool of pre-allocated frames, but that seems messy.

  • I know that lock-free queues exist, and Boost has a good library for them. This would be a great way to exchange data between threads, but constantly polling the queue to see whether data is available seems like a massive waste of processor time (see the sketch after this list).
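For reference, here is roughly the polling pattern I am worried about, sketched with boost::lockfree::spsc_queue (the Frame type and the capacity are placeholders of mine):

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <thread>

struct Frame { float samples[100]; };          // placeholder frame type

// Wait-free for one producer and one consumer, but the consumer has to
// spin/poll, burning CPU whenever the queue happens to be empty.
boost::lockfree::spsc_queue<Frame*, boost::lockfree::capacity<64>> queue;

void consumerLoop() {
    for (;;) {
        Frame* f = nullptr;
        while (!queue.pop(f)) {                // poll until data arrives
            std::this_thread::yield();         // or sleep, adding latency
        }
        // ... process *f ...
    }
}
```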

In my experience, locking a mutex does not actually cost much, provided the critical section it protects is short.

What is the best way to pass audio frames between threads while keeping latency to a minimum, not wasting resources, and introducing relatively little non-deterministic behavior?

+5
3 answers

Looks like you did your research! You have already identified the two main problems that can cause audio glitches. The question is: how much of this was important ten years ago, and how much of it today is just folklore and cargo-cult practice.

My two cents:

1. Heap allocation in the rendering loop:

Heap allocations can carry quite a bit of overhead, depending on how small your processing chunks are. The main culprit is that very few runtimes have a per-thread heap, so every time you touch the heap, your performance depends on what the other threads in your process are doing. If, for example, a GUI thread is deleting thousands of objects while you, at the same moment, access the heap from the audio playback thread, you may experience a significant delay.

Writing your own memory manager with pre-allocated buffers may sound messy, but in the end it is just two functions that you can hide away somewhere in a utility source file. Since you usually know your allocation sizes in advance, there is plenty of room for fine-tuning and optimizing the memory management. For example, you can store the free segments as a simple linked list. Done correctly, this has the advantage that you reuse the most recently freed buffer first, and that buffer has a very high probability of still being in the cache.
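As an illustration, a fixed-size pool with an intrusive free list can be this small (a sketch under my own naming; it assumes the block size is at least sizeof(void*) and suitably aligned):

```cpp
#include <cstddef>

// Fixed-size block pool with an intrusive free list (a sketch, not
// production code). All blocks are allocated once, up front; acquire()
// and release() just relink list nodes, so neither ever calls malloc.
class BlockPool {
public:
    BlockPool(std::size_t blockSize, std::size_t blockCount)
        : storage_(new char[blockSize * blockCount]) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < blockCount; ++i) {
            Node* n = reinterpret_cast<Node*>(storage_ + i * blockSize);
            n->next = head_;
            head_ = n;
        }
    }
    ~BlockPool() { delete[] storage_; }

    // Pops the most recently released block -- likely still cache-hot.
    void* acquire() {
        if (!head_) return nullptr;   // pool exhausted
        Node* n = head_;
        head_ = n->next;
        return n;
    }

    // Pushes the block back onto the front of the free list.
    void release(void* p) {
        Node* n = static_cast<Node*>(p);
        n->next = head_;
        head_ = n;
    }

private:
    struct Node { Node* next; };
    char* storage_;
    Node* head_ = nullptr;
};
```

Note that this version is single-threaded; to share it between threads, guard acquire/release with one of the locking schemes discussed below, or hand blocks through a queue.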

If fixed-size block allocators do not work for you, have a look at ring buffers. They are a very natural fit for streaming audio.
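A minimal single-producer/single-consumer ring buffer for samples could look like this sketch (names and the power-of-two capacity are my choices):

```cpp
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer for audio samples.
// Safe without locks because each index is written by only one thread;
// acquire/release ordering makes the data visible to the other side.
template <std::size_t N>  // N must be a power of two
class SampleRing {
public:
    bool push(float s) {                        // producer thread only
        std::size_t w = write_.load(std::memory_order_relaxed);
        std::size_t r = read_.load(std::memory_order_acquire);
        if (w - r == N) return false;           // full
        buf_[w & (N - 1)] = s;
        write_.store(w + 1, std::memory_order_release);
        return true;
    }
    bool pop(float& s) {                        // consumer thread only
        std::size_t r = read_.load(std::memory_order_relaxed);
        std::size_t w = write_.load(std::memory_order_acquire);
        if (r == w) return false;               // empty
        s = buf_[r & (N - 1)];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }
private:
    float buf_[N];
    std::atomic<std::size_t> write_{0}, read_{0};
};
```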

2. To block or not to block:

I would say that nowadays using locking mutexes and semaphores is fine if you can estimate that you perform fewer than 1000-5000 lock/unlock operations per second (things look different on, say, a Raspberry Pi than on a PC, etc.). If you stay below that range, it is unlikely the overhead will show up in a performance profile.

Translated to your use case: if, for example, you work with a 48 kHz audio signal and 100-sample chunks, you generate roughly 960 lock/unlock operations per second in a simple two-thread producer/consumer scheme (480 pushes plus 480 pops), which is well within that range. Even if you completely max out the rendering thread, the locking will not show up when profiling. And if, on the other hand, you only use 5% of the available processing power, the locks may show up, but then you do not have a performance problem either :-)
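For reference, a minimal sketch of such a two-thread producer/consumer channel built on a mutex and condition variable (the class and names are mine, not from the question):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Classic bounded producer/consumer channel. At 48 kHz with 100-sample
// frames each side locks ~480 times per second, i.e. ~960 lock/unlock
// operations per second in total -- well below the range where locking
// shows up in a profile.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(T item) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(item));
        notEmpty_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return item;
    }
private:
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
    std::deque<T> q_;
};
```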

Spinlocking is also an option, as is a hybrid solution that first makes a few attempts to take the lock by spinning and then falls back to a hard lock. That way you get the best of both worlds. There is a lot of good reading about this topic online.
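A hybrid lock can be as small as the following sketch (the spin count is an arbitrary assumption to tune):

```cpp
#include <mutex>

// Hybrid lock: optimistically spin on try_lock() for a few iterations
// (cheap if the critical section is short), then fall back to a real
// blocking lock so a thread never burns a whole time slice spinning.
class HybridLock {
public:
    void lock() {
        for (int i = 0; i < kSpins; ++i)
            if (m_.try_lock()) return;   // got it while spinning
        m_.lock();                       // hard lock as the fallback
    }
    void unlock() { m_.unlock(); }
private:
    static constexpr int kSpins = 100;   // tune for your platform
    std::mutex m_;
};
```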

Anyway:

You should carefully raise the priority of your non-GUI threads, to make sure that if they ever block on a lock, they get out of it quickly. It is also a good idea to read up on what priority inversion is and what you can do to avoid it:

https://en.wikipedia.org/wiki/Priority_inversion
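On POSIX systems, for example, bumping the current thread into a real-time scheduling class might look like this sketch (SCHED_FIFO and the priority value 80 are assumptions to tune and error-check for your platform):

```cpp
#include <pthread.h>

// Give the calling audio thread a real-time scheduling class so that,
// if it is ever holding a lock, it gets scheduled again quickly.
// Requires appropriate privileges on most systems.
void promoteCurrentThread() {
    sched_param sp{};
    sp.sched_priority = 80;  // mid-range real-time priority, tune as needed
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```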

+4

"I suppose I could use a pool of frames that were previously allocated, but it seems dirty" - not really. Either select an array of frames, or new top frames in a loop, and then drag the pointers / pointers to the lock queue. Now you have there is an auto-managed pool of frames. Perform one click when you need a frame, click on it when you are done with it. There is no continuous malloc / free / new / delete, no chance or memory, ease of debugging and control the flow of frames (if the pool ends, threads requesting frames will wait eye images will be released back to the pool), all built-in.

Using an array may seem simpler/safer/faster than new-ing in a loop, but new-ing individual frames has the advantage that you can easily change the number of frames in the pool at runtime.
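A sketch of that pattern, with the pool behind a small blocking queue (all names are placeholders of mine; error handling omitted):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Frame { float samples[100]; };   // placeholder frame type

// Pool of pre-allocated frames behind a lock-protected queue.
// acquire() blocks when the pool is empty, which gives you flow
// control for free: requesters wait until a frame is released.
class FramePool {
public:
    explicit FramePool(std::size_t count) : frames_(count) {
        for (Frame& f : frames_) free_.push(&f);   // seed the pool
    }
    Frame* acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !free_.empty(); });
        Frame* f = free_.front();
        free_.pop();
        return f;
    }
    void release(Frame* f) {
        { std::lock_guard<std::mutex> lk(m_); free_.push(f); }
        cv_.notify_one();
    }
private:
    std::vector<Frame> frames_;   // owns every frame for the program's life
    std::queue<Frame*> free_;
    std::mutex m_;
    std::condition_variable cv_;
};
```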

+1

Um, why are you transferring frames of 100 samples between threads?

Assuming you are operating at a nominal sample rate of 44.1 kHz and passing 100 samples at a time between threads, the threads have to switch at least once every 100 samples / (44100 samples/s * 2) - the factor of 2 accounting for both the producer and the consumer. That means a time slice of roughly 1.13 ms for every 100 samples you send. Almost all operating systems schedule in time slices longer than 10 ms, so it is impossible to build an audio engine that passes only 100 samples at a time between threads at 44.1 kHz on a modern OS.

The solution is to buffer more samples per time slice, either in the queue or by using larger frames. Most modern real-time audio APIs use 128 samples per channel (on dedicated audio hardware) or 256 samples per channel (on game consoles).

Ultimately, the answer to your question is pretty much what you would expect... pass pointers to buffers through the queues, not the buffers themselves; manage all audio buffers in a fixed pool allocated at program startup; and hold any queue lock for the shortest time possible.

Interestingly, this is one of the few situations in audio programming where there is a real performance advantage to dropping down to assembly code. You definitely do not want malloc and free happening on every queue lock and unlock. The atomic locking primitives supplied by the operating system can ALWAYS be improved upon if you know your CPU.

One last thing: there is no such thing as lock-free. All multithreaded "lockfree" queue implementations rely on CPU-internal atomic instructions or careful compare-and-swap sequences to guarantee each thread exclusive access to memory.
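To illustrate the point: even a minimal "lock-free" free-list pop is a compare-and-swap retry loop underneath (a sketch; it also ignores the ABA problem that real code must handle):

```cpp
#include <atomic>

struct Node { Node* next; };

std::atomic<Node*> freeListHead{nullptr};

// "Lock-free" pop: no mutex, but the CPU still arbitrates exclusive
// access through the compare-and-swap; contended threads simply retry.
Node* popNode() {
    Node* head = freeListHead.load(std::memory_order_acquire);
    while (head &&
           !freeListHead.compare_exchange_weak(
               head, head->next,
               std::memory_order_acquire, std::memory_order_relaxed)) {
        // CAS failed: another thread changed the head. 'head' was
        // reloaded by compare_exchange_weak, so just loop and retry.
    }
    return head;
}
```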

0
