Unaligned memory access

I am working on an embedded device that does not support unaligned memory accesses.

For a video decoder, I have to process pixels (one byte per pixel) in blocks of 8x8 pixels. The device has some SIMD processing capabilities that allow me to work with 4 bytes in parallel.

The problem is that the 8x8 pixel blocks are not guaranteed to start at an aligned address, and each function must read or write up to three of these 8x8 blocks.

How do you approach this if you want very good performance? After a little thought, I came up with the following three ideas:

  • Do all memory accesses as byte accesses. This is the easiest way, but it is slow and it does not work well with the SIMD capabilities (this is what I am doing in my reference C code now).

  • Write four copy functions (one for each alignment case) that load the pixel data through two 32-bit reads, shift the bits into the correct position, and write the data to an aligned scratch memory block. The video processing functions can then use 32-bit accesses and SIMD. Disadvantage: the CPU will not be able to hide the memory latency behind processing.

  • Same idea as above, but instead of writing the pixels to scratch memory, do the video processing in place. This may be the fastest way, but the number of functions that I would have to write for this approach is large (about 60, I think).
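To make the second idea concrete, here is a minimal sketch in C of loading four pixels from an arbitrary address with two aligned 32-bit reads and copying one 8x8 block into aligned scratch memory. The names `load_u32_unaligned` and `copy_block8x8_to_scratch` are hypothetical, and the sketch assumes a little-endian target, a frame buffer allocated as 32-bit words, and at least 3 readable bytes after the last pixel row. It uses a run-time shift for brevity; the four specialised copy functions described above would hard-code the shift for each alignment case instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 4 bytes starting at p using only aligned 32-bit loads.
 * Assumptions (not from the original post): little-endian byte order,
 * the underlying buffer was allocated as uint32_t words, and the
 * aligned word following p is readable. */
static uint32_t load_u32_unaligned(const uint8_t *p)
{
    uintptr_t addr = (uintptr_t)p;
    unsigned shift = (unsigned)(addr & 3u) * 8u;        /* 0, 8, 16 or 24 */
    const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3u);

    if (shift == 0)
        return w[0];                 /* aligned case: one read is enough */

    /* Low bytes come from the first word, high bytes from the second. */
    return (w[0] >> shift) | (w[1] << (32u - shift));
}

/* Copy one 8x8 block (1 byte per pixel, rows `stride` bytes apart)
 * into a word-aligned scratch buffer of 8 rows x 2 words. */
static void copy_block8x8_to_scratch(const uint8_t *src, size_t stride,
                                     uint32_t scratch[16])
{
    assert(src != NULL);
    for (int row = 0; row < 8; ++row) {
        const uint8_t *p = src + (size_t)row * stride;
        scratch[2 * row]     = load_u32_unaligned(p);
        scratch[2 * row + 1] = load_u32_unaligned(p + 4);
    }
}
```

If the row stride is a multiple of 4, every row of a block falls into the same alignment case, so the case can be selected once per block rather than once per load. This sketch also re-reads the middle word of each row; a tuned version would load three words per row and reuse the shared one.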

Btw: , , SIMD.

, , ?


/.

( SIMD) , ( , /dest , , ).

SIMD . +, , , SIMD .

, dest , , , , , . .

, . , , .


memcpy (, , , ) (, - malloc). .

, , . , (, 32- 8-? SIMD-?) , , .


, -SIMD.

, 3, 25% (.. ). , , .

, .


1), , (, , )


: - (, № 2), ? , .

Of course, writing 60-odd functions in assembler before measuring anything would count as "premature optimization." :)


Source: https://habr.com/ru/post/1538314/

