ARM Neon in C: How to combine different 128-bit data types when using intrinsics?

Question

With intrinsics, how do you pass a 128-bit variable of type uint8x16_t to a function that expects a uint16x8_t?

EXTENDED VERSION

Context: I have a grayscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input window, I want to take the minimum pixel. In plain C, the code would look like this:

 for (int y = 0; y < rows; y += 2) {
     uint8_t* p_out = outBuffer + (y / 2) * outStride;
     uint8_t* p_in = inBuffer + y * inStride;
     for (int x = 0; x < cols; x += 2) {
         *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
         p_out++;
         p_in += 2;
     }
 }

This assumes that both rows and cols are multiples of 2. I call "stride" the step in bytes that takes you from one pixel to the pixel immediately below it in the image.
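For reference, here is a self-contained sketch of the same scalar loop that compiles as plain C. Standard C has no integer min(), so the helper min_u8 and the function name downscale_min_2x2 are hypothetical names introduced only for this illustration; the strides are in bytes, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: standard C provides no min() for integers. */
static inline uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Scalar reference: each output pixel is the minimum of a 2x2 input window.
   rows and cols must be even; both strides are expressed in bytes. */
void downscale_min_2x2(const uint8_t *in, size_t in_stride,
                       uint8_t *out, size_t out_stride,
                       size_t rows, size_t cols)
{
    assert(rows % 2 == 0 && cols % 2 == 0);
    for (size_t y = 0; y < rows; y += 2) {
        const uint8_t *p_in = in + y * in_stride;
        uint8_t *p_out = out + (y / 2) * out_stride;
        for (size_t x = 0; x < cols; x += 2) {
            uint8_t top = min_u8(p_in[0], p_in[1]);
            uint8_t bottom = min_u8(p_in[in_stride], p_in[in_stride + 1]);
            *p_out++ = min_u8(top, bottom);
            p_in += 2;
        }
    }
}
```

A scalar version like this is also handy as the ground truth when checking a vectorized implementation.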

Now I want to vectorize this. The idea is this:

1. take two consecutive lines of pixels
2. load 16 bytes into a from the top line, and load the 16 bytes immediately below into b
3. compute the byte-by-byte minimum between a and b. Store it in a.
4. create a copy of a shifted to the right by 1 byte (8 bits). Store it in b.
5. compute the byte-by-byte minimum between a and b. Store it in a.
6. store every second byte of a in the output image (discarding half of the bytes)

I want to write this using Neon intrinsics. The good news is that for each step there is an intrinsic that matches it.

For example, at point 3 one can use (from here):

 uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);

And at point 4, one can use one of the following, with a shift of 8 bits (from here):

 uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
 uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
 uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);

That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13, 15, since they will be discarded from the final result anyway. (The correctness of this has been verified, and it is not the point of the question.)
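To make the arithmetic of the 8-bit shift concrete, here is a small scalar model of a single 16-bit lane, written in portable C. It is only a sketch: the function names are hypothetical, and it assumes the little-endian view in which the even byte is the low half of the lane. One detail worth checking against the intrinsics listed above: the vrshrq_n_* family is the rounding shift, which adds 2^(n-1) before shifting, whereas the plain shift is vshrq_n_*. The model shows both.

```c
#include <assert.h>
#include <stdint.h>

/* Build one 16-bit lane the way a little-endian target sees two
   neighbouring bytes: even byte = low half, odd byte = high half. */
static inline uint16_t make_lane(uint8_t even_byte, uint8_t odd_byte)
{
    return (uint16_t)(even_byte | ((unsigned)odd_byte << 8));
}

/* Plain logical shift right by 8 (what vshrq_n_u16(x, 8) does per lane):
   the odd byte lands unchanged in the even position. */
static inline uint16_t shift_plain(uint16_t lane)
{
    return (uint16_t)(lane >> 8);
}

/* Rounding shift right by 8 (what vrshrq_n_u16(x, 8) does per lane):
   2^(8-1) = 128 is added before the shift. */
static inline uint16_t shift_rounding(uint16_t lane)
{
    return (uint16_t)(((uint32_t)lane + 128u) >> 8);
}
```

For the byte pair (200, 100), shift_plain yields 100 while shift_rounding yields 101, so if the goal is to move the odd byte onto its even neighbour unchanged, the plain shift looks like the safer choice.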

HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I worked around the problem with this templated data structure, but I have been told that the problem cannot be reliably solved with a union (Edit: although that answer refers to C++, and in C type punning through a union IS in fact allowed), nor by casting pointers, because that violates the strict aliasing rule.

How can I combine different data types when using ARM Neon intrinsics?

+3

c vectorization arm neon

Antonio Apr 20 '17 at 12:13

1 answer

Antonio · Accepted Answer · 2017-04-20T12:13:09+0000

For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operators.

In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.

So, if a and b are declared as:

 uint8x16_t a, b;

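Your point 4 can be written as (*):

 b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));

Note that this uses the plain shift vshrq_n_u16. The rounding variant vrshrq_n_u16 adds 128 to each lane before shifting by 8, which can change the byte that ends up in the even position.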

However, note that unfortunately this does not work for data types that are arrays of vector types, see ARM Neon: how to convert from uint8x16_t to uint8x8x2_t?
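Putting the six points of the question together, a row kernel could look roughly like the sketch below. This is an illustration under assumptions, not a tuned implementation: the function name min_2x2_rows is hypothetical, a little-endian target is assumed, the narrowing store through vmovn_u16 is just one possible way to do point 6, and the plain shift vshrq_n_u16 is used so that the odd byte is moved without rounding. The Neon path is guarded by the compiler's Neon feature macro, so the same file falls back to a scalar loop elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_PATH 1
#endif

/* Hypothetical helper: writes cols/2 output pixels from two input rows.
   cols must be even. */
void min_2x2_rows(const uint8_t *top, const uint8_t *bottom,
                  uint8_t *out, size_t cols)
{
    size_t x = 0;
    assert(cols % 2 == 0);
#ifdef HAVE_NEON_PATH
    for (; x + 16 <= cols; x += 16) {
        /* Points 1-3: vertical minimum of the two rows. */
        uint8x16_t a = vminq_u8(vld1q_u8(top + x), vld1q_u8(bottom + x));
        /* Point 4: view the register as 8 x u16 and shift each lane right
           by 8 bits, so every odd byte moves onto its even neighbour. */
        uint16x8_t pairs = vreinterpretq_u16_u8(a);
        uint8x16_t b = vreinterpretq_u8_u16(vshrq_n_u16(pairs, 8));
        /* Point 5: horizontal minimum, meaningful in the even bytes only. */
        a = vminq_u8(a, b);
        /* Point 6: keep the low byte of each 16-bit lane (the even bytes). */
        vst1_u8(out + x / 2, vmovn_u16(vreinterpretq_u16_u8(a)));
    }
#endif
    /* Scalar tail; this is also the whole row on targets without Neon. */
    for (; x < cols; x += 2) {
        uint8_t m0 = top[x] < top[x + 1] ? top[x] : top[x + 1];
        uint8_t m1 = bottom[x] < bottom[x + 1] ? bottom[x] : bottom[x + 1];
        out[x / 2] = m0 < m1 ? m0 : m1;
    }
}
```

It would be called once per pair of input rows, for example with top = inBuffer + y * inStride, bottom = top + inStride and out = outBuffer + (y / 2) * outStride.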

(*) It should be said that this is much more cumbersome than the equivalent (in this particular context) SSE code, since SSE has only one 128-bit integer data type (namely __m128i):

 __m128i b = _mm_srli_si128(a,1);
