Why do sliced threads affect real-time encoding with ffmpeg x264 so much?

I use ffmpeg with libx264 to encode a 720p screen capture from X11 in real time at a frame rate of 30. When I use the parameter -tune zerolatency , the average encoding time per frame can reach 12 ms with the baseline profile.

After examining the ffmpeg and x264 source code, I found that the key parameter leading to such a long encoding time is sliced-threads , which is activated by -tune zerolatency. After disabling it with -x264-params sliced-threads=0 , the encoding time can be as low as 2 ms.
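
For reference, a command along these lines reproduces the setup described above (the display name :0.0, the capture size, and the output file are placeholder assumptions, not taken from the question):

 # capture the X11 display at 720p/30 and encode with zerolatency tuning
 ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
     -c:v libx264 -tune zerolatency -profile:v baseline out.mkv
 
 # the same, but with slice-based threading explicitly disabled
 ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
     -c:v libx264 -tune zerolatency -profile:v baseline \
     -x264-params sliced-threads=0 out.mkv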

With sliced threads disabled, CPU usage is around 40%; with them enabled, only 20%.

Can someone explain the details of sliced threads? Especially in the real-time case (assume no frames are buffered for encoding; each frame is encoded as soon as it is captured).

1 answer

The x264 documentation shows that frame-based threading has better throughput than slice-based threading. It also notes that the latter does not scale well, because parts of the encoder are serial.

Speedup versus number of encoding threads for the veryfast preset (non-real-time):

 x264 --preset veryfast --tune psnr --crf 30
 
          speedup         psnr
 threads  slice   frame   slice    frame
 1:       1.00x   1.00x   +0.000   +0.000
 2:       1.41x   2.29x   -0.005   -0.002
 3:       1.70x   3.65x   -0.035   +0.000
 4:       1.96x   3.97x   -0.029   -0.001
 5:       2.10x   3.98x   -0.047   -0.002
 6:       2.29x   3.97x   -0.060   +0.001
 7:       2.36x   3.98x   -0.057   -0.001
 8:       2.43x   3.98x   -0.067   -0.001
 9:               3.96x            +0.000
 10:              3.99x            +0.000
 11:              4.00x            +0.001
 12:              4.00x            +0.001
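
A comparison of this kind can be reproduced with the x264 command-line tool; the sketch below assumes a raw YUV4MPEG input file named input.y4m (the filename and the thread count of 4 are placeholders), and --psnr makes x264 print the PSNR figures:

 # slice-based threading with 4 threads
 x264 --preset veryfast --tune psnr --crf 30 --psnr \
     --threads 4 --sliced-threads -o out_slice.264 input.y4m
 
 # frame-based threading (the default) with 4 threads
 x264 --preset veryfast --tune psnr --crf 30 --psnr \
     --threads 4 -o out_frame.264 input.y4m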

The main difference is that frame-based threading adds latency measured in frames, since the threads need several different frames to work on, whereas with slice-based threading all threads work on the same frame. In real-time encoding you would have to wait for future frames to arrive to fill the pipeline, unlike offline encoding where the frames are already available.

Normal threading, aka frame-based threading, uses a clever system of staggered frames for parallelism. But this comes at a cost: as mentioned earlier, each extra thread requires one more frame of latency. Slice-based threading has no such problem: each frame is split into slices, each slice is encoded on one core, and the results are stitched back together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without increasing latency.
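
As a rough worked example of that latency cost (assuming the one-extra-frame-per-extra-thread rule quoted above and the 30 fps source from the question; real x264 latency also depends on lookahead and other settings):

 frame interval at 30 fps     = 1000 ms / 30 ≈ 33.3 ms
 frame threading, 5 threads   ≈ 4 extra frames ≈ 4 × 33.3 ms ≈ 133 ms added latency
 slice threading, 5 threads   ≈ 0 extra frames ≈ 0 ms added latency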

From: Diary of an x264 Developer

Sliceless threading: an example with two threads. Start encoding frame #0. When it is half done, start encoding frame #1. Thread #1 only has access to the top half of its reference frame, since the rest has not been encoded yet, so it must restrict the motion search range. But that is probably fine (unless you use lots of threads on a small frame), since it is rare for motion vectors to be that long vertically. After a while, both threads have each encoded one row of macroblocks, so thread #1 still gets to use a motion range of +/- 1/2 frame height. Later, thread #0 finishes frame #0 and moves on to frame #2. Now thread #0 gets the motion restrictions, and thread #1 is unrestricted.

From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt

Therefore it makes sense that sliced-threads is enabled by -tune zerolatency , since you need to send each frame out as soon as possible rather than encode it more efficiently (performance- and quality-wise).

Using too many threads, on the other hand, can hurt performance, because the overhead of maintaining them can exceed the potential gains.
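
If the per-frame encoding time with sliced threads is too high on a given machine, one possible compromise (a sketch; the thread count of 4 is an assumed placeholder to be tuned per machine) is to keep the zerolatency tuning but cap the thread count explicitly, since ffmpeg's generic -threads option is passed through to libx264:

 ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
     -c:v libx264 -tune zerolatency -profile:v baseline -threads 4 out.mkv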


Source: https://habr.com/ru/post/1235596/

