Regarding NEON intrinsics, how do you pass a 128-bit variable of type uint8x16_t to a function that expects uint16x8_t?
EXTENDED VERSION
Context: I have a grayscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input window, I want to take the minimum pixel. In plain C, the code would look like this:
    for (int y = 0; y < rows; y += 2) {
        uint8_t* p_out = outBuffer + (y / 2) * outStride;
        uint8_t* p_in = inBuffer + y * inStride;
        for (int x = 0; x < cols; x += 2) {
            *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
            p_out++;
            p_in += 2;
        }
    }
This assumes that both rows and cols are multiples of 2. I call "stride" the step in bytes that takes you from one pixel to the pixel directly below it in the image.
Now I want to vectorize this. The idea is this:
1. take two consecutive rows of pixels
2. load 16 bytes into a from the top row and load the 16 bytes immediately below into b
3. compute the byte-wise minimum of a and b. Store it in a.
4. create a copy of a shifted right by 1 byte (8 bits). Store it in b.
5. compute the byte-wise minimum of a and b. Store it in a.
6. store every second byte of a in the output image (discarding half of the bytes)
I want to write this using NEON intrinsics. The good news is that for each step there is an intrinsic that matches it.
For example, for step 3 you can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And for step 4, you can use one of the following, with a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
This works because I don't care what happens to bytes 1, 3, 5, 7, 9, 11, 13, 15: they will be discarded from the final result anyway. (The correctness of this has been verified, and it is not the point of the question.)
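The claim about the odd bytes can be checked without NEON hardware. Below is a minimal sketch in portable C that emulates steps 2 to 6 on one 16-byte block, treating each pair of bytes as one little-endian 16-bit lane. The helper name min2x2_block16 is hypothetical, and a plain (non-rounding) shift is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: emulates steps 2-6 for one 16-byte block.
   top and bottom each point at 16 pixels of two consecutive rows;
   out receives the 8 downscaled pixels. */
void min2x2_block16(const uint8_t *top, const uint8_t *bottom, uint8_t *out)
{
    uint8_t a[16];

    assert(top && bottom && out);

    /* Steps 2-3: byte-wise minimum of the two rows. */
    for (int i = 0; i < 16; i++)
        a[i] = top[i] < bottom[i] ? top[i] : bottom[i];

    for (int lane = 0; lane < 8; lane++) {
        /* View bytes 2*lane and 2*lane+1 as one little-endian 16-bit lane. */
        uint16_t v = (uint16_t)(a[2 * lane] | (a[2 * lane + 1] << 8));

        /* Step 4: plain shift right by 8 bits. The odd byte moves into the
           even position; the odd position becomes 0. */
        uint16_t s = (uint16_t)(v >> 8);

        /* Step 5, even byte only: min(even pixel, odd pixel).
           The odd byte would be min(odd pixel, 0) = 0, which step 6 drops. */
        uint8_t even_a = (uint8_t)(v & 0xFF);
        uint8_t even_b = (uint8_t)(s & 0xFF);

        /* Step 6: keep only the even byte of each lane. */
        out[lane] = even_a < even_b ? even_a : even_b;
    }
}
```

Calling it on 16 pixels from each of two consecutive rows fills out with the 8 minima of the 2x2 windows, the same values the scalar loop produces for those columns.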
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++, I tackled the problem with this templated data structure, and I was told that it cannot be reliably solved with a union (Edit: that answer refers to C++; in C, type punning through a union IS allowed), nor with pointer casts, because that violates the strict aliasing rule.
How can I combine different data types when using ARM NEON intrinsics?
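One approach that seems to fit is the vreinterpretq family of intrinsics from arm_neon.h (for example vreinterpretq_u16_u8 and vreinterpretq_u8_u16), which view the same 128 bits as a different vector type without converting any data. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a little-endian target, the names downscale_min2x2 and HAVE_NEON are hypothetical, and a scalar fallback is included only so that the code also builds where NEON is unavailable. It uses the plain shift vshrq_n_u16 rather than the rounding vrshrq_n_u16 listed above, because the rounding variant adds 128 before shifting and can therefore change the byte that lands in the even position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro name */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: 2x2 minimum downscale of an 8-bit grayscale image.
   Assumes rows and cols are even; strides are in bytes. */
void downscale_min2x2(const uint8_t *in, size_t in_stride,
                      uint8_t *out, size_t out_stride,
                      int rows, int cols)
{
    assert(rows % 2 == 0 && cols % 2 == 0);

    for (int y = 0; y < rows; y += 2) {
        const uint8_t *p0 = in + (size_t)y * in_stride;   /* top row */
        const uint8_t *p1 = p0 + in_stride;               /* row below */
        uint8_t *po = out + (size_t)(y / 2) * out_stride;
        int x = 0;

#if HAVE_NEON
        for (; x + 16 <= cols; x += 16) {
            /* Steps 2-3: load 16 bytes from each row, byte-wise minimum. */
            uint8x16_t a = vminq_u8(vld1q_u8(p0 + x), vld1q_u8(p1 + x));

            /* Step 4: reinterpret as eight 16-bit lanes (no conversion),
               shift each lane right by 8 bits, reinterpret back to bytes. */
            uint8x16_t b = vreinterpretq_u8_u16(
                vshrq_n_u16(vreinterpretq_u16_u8(a), 8));

            /* Step 5: byte-wise minimum of a and its shifted copy. */
            a = vminq_u8(a, b);

            /* Step 6: narrowing keeps the low byte of each 16-bit lane,
               i.e. every second byte of a. */
            vst1_u8(po + x / 2, vmovn_u16(vreinterpretq_u16_u8(a)));
        }
#endif
        /* Scalar tail; also the whole row when NEON is unavailable. */
        for (; x < cols; x += 2) {
            uint8_t m0 = p0[x] < p0[x + 1] ? p0[x] : p0[x + 1];
            uint8_t m1 = p1[x] < p1[x + 1] ? p1[x] : p1[x + 1];
            po[x / 2] = m0 < m1 ? m0 : m1;
        }
    }
}
```

Because the reinterpret intrinsics only change the static type of the vector, they avoid both the union and the pointer-cast approaches, so the strict aliasing rule does not come into play.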