Getting GCC to generate a PTEST instruction when using vector extensions

When using GCC's vector extensions for C, how can I check that all values in a vector are zero?

For instance:

    #include <stdint.h>

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));

    v8ui* foo(v8ui *mem)
    {
        v8ui v;
        for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
              v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
              mem++)
            v &= *(mem);

        return mem;
    }

SSE4.1 introduced the PTEST instruction (AVX adds the 256-bit VPTEST), which can perform a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the individual elements one by one:

    .L2:
        vandps  (%rax), %ymm1, %ymm1
        vmovdqa %xmm1, %xmm0
        addq    $32, %rax
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vextractf128 $0x1, %ymm1, %xmm0
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vzeroupper
        ret

Is there a way to get GCC to generate an efficient test for this without resorting to builtins?

Update: for reference, here is the code using the non-portable GCC builtin for (V)PTEST:

    #include <stdint.h>

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));
    typedef long long int v4si __attribute__ ((vector_size (32)));

    const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

    v8ui* foo(v8ui *mem)
    {
        v8ui v;
        /* ptestz returns 1 when (v & ones) == 0, which ends the loop */
        for ( v = ones;
              !__builtin_ia32_ptestz256((v4si)v, (v4si)ones);
              mem++)
            v &= *(mem);

        return mem;
    }
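
For comparison, the same kind of test can be written with the `_mm256_testz_si256` intrinsic from `<immintrin.h>` instead of the GCC-only builtin. This is only a sketch and still uses intrinsics, so it does not answer the question as asked. The function names `all_zero_vptest`, `all_zero_scalar` and `all_zero` are hypothetical, and the `target("avx")` attribute plus the `__builtin_cpu_supports` check are assumptions added so that the file builds without `-mavx` and falls back to scalar code on CPUs without AVX.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

static_assert(sizeof(v8ui) == 32, "v8ui is expected to be 256 bits wide");

/* Built for AVX regardless of command-line flags (target attribute). */
__attribute__((target("avx")))
static int all_zero_vptest(const v8ui *p)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)p);
    /* testz returns 1 when (x & x) == 0, i.e. every bit of x is clear */
    return _mm256_testz_si256(x, x);
}

/* Scalar fallback: OR all elements together and test once. */
static int all_zero_scalar(const v8ui *p)
{
    uint32_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc |= (*p)[i];
    return acc == 0;
}

/* Hypothetical wrapper: use vptest only when the CPU reports AVX. */
int all_zero(const v8ui *p)
{
    return __builtin_cpu_supports("avx") ? all_zero_vptest(p)
                                         : all_zero_scalar(p);
}
```

Testing `x` against itself checks every bit, whereas the builtin call above tests against `ones` and therefore only looks at bit 0 of each element. The two agree in the loop above because `v` starts as all ones and is only ever ANDed.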
+6
3 answers

gcc 4.9.2 -O3 -mavx2 (in 64-bit mode) did not realize that it could use ptest for this, with either || or |.

The | version extracts the vector elements with vmovd and vpextrd and combines them with 7 or instructions between 32-bit registers. So it is pretty bad, and it does not take advantage of any simplification that would still produce the same truth value.

The || version is just as bad: it does the same extract-one-element-at-a-time, but with a test / jne for each element.
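
For concreteness, the two variants might look like the sketch below. The function names are hypothetical, and the `|` form is an assumption about what was compiled, since only the `||` form appears in the question. Both take a pointer rather than a by-value vector so that the file builds without `-mavx`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

static_assert(sizeof(v8ui) == 32, "v8ui is expected to be 256 bits wide");

/* "||" form: short-circuit evaluation, so each element gets its own
   test and conditional branch. */
int any_nonzero_logical(const v8ui *p)
{
    return (*p)[0] || (*p)[1] || (*p)[2] || (*p)[3]
        || (*p)[4] || (*p)[5] || (*p)[6] || (*p)[7];
}

/* "|" form: no short-circuit; the eight elements are combined with
   seven bitwise ORs and the result is tested once. */
int any_nonzero_bitwise(const v8ui *p)
{
    return ((*p)[0] | (*p)[1] | (*p)[2] | (*p)[3]
          | (*p)[4] | (*p)[5] | (*p)[6] | (*p)[7]) != 0;
}
```

Both functions return the same truth value for every input; they differ only in whether evaluation may stop at the first nonzero element.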

So, at the moment, you cannot count on GCC to recognize tests like this and do anything remotely efficient. (pcmpeq / movmsk / test is another sequence that would not be bad, but gcc does not generate it either.)
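
As a rough illustration of the pcmpeq / movmsk / test idea, here is a hand-written sketch with SSE2 intrinsics. It assumes an x86-64 target, where SSE2 is always available, and it treats the 256-bit vector as two 128-bit halves; with AVX2 the same shape should work on the full register using `_mm256_cmpeq_epi32` and `_mm256_movemask_epi8`. The function name `all_zero_movmsk` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

static_assert(sizeof(v8ui) == 32, "v8ui is expected to be 256 bits wide");

/* Returns 1 if all eight 32-bit elements of *p are zero, else 0. */
int all_zero_movmsk(const v8ui *p)
{
    /* Load the two 128-bit halves (unaligned loads are always safe). */
    __m128i lo = _mm_loadu_si128((const __m128i *)p);
    __m128i hi = _mm_loadu_si128((const __m128i *)p + 1);

    /* A lane of the OR is zero only if it is zero in both halves. */
    __m128i any = _mm_or_si128(lo, hi);

    /* pcmpeqd: each 32-bit lane becomes all-ones where the lane is zero. */
    __m128i eq = _mm_cmpeq_epi32(any, _mm_setzero_si128());

    /* pmovmskb: gather the top bit of each of the 16 bytes, then test
       whether every byte reported "equal to zero". */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

The final comparison against `0xFFFF` is what becomes the test-and-branch in a loop condition.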

+2

Wouldn't memcmp() help? If you are after performance, you are sometimes surprised by what native types can offer. Here is code that uses vanilla memcmp() as well as the vptest instruction (used through the corresponding intrinsic). I have not benchmarked the functions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <immintrin.h>

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));

    /* Compare *mem against an all-ones vector with plain memcmp(). */
    v8ui* foo1(v8ui *mem)
    {
        v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };

        if (memcmp(mem, &v, sizeof (v8ui)) == 0) {
            printf("Ones\n");
        } else {
            printf("NOT Ones\n");
        }
        return mem;
    }

    /* Same comparison using vptest via the _mm256_testz_si256 intrinsic. */
    v8ui* foo2(v8ui *mem)
    {
        v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
        __m256i a, b, diff;

        a = _mm256_loadu_si256((__m256i *)(&v));
        b = _mm256_loadu_si256((__m256i *)mem);  /* load *mem, not the pointer variable */
        diff = _mm256_xor_si256(a, b);           /* all zero iff a == b */

        /* testz returns 1 when (diff & diff) == 0 */
        if (!_mm256_testz_si256(diff, diff)) {
            printf("NOT Ones\n");
        } else {
            printf("Ones\n");
        }
        return mem;
    }

    int main(void)
    {
        v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };

        foo1(&v);
        foo2(&v);
        return 0;
    }

Compile flags:

gcc -mavx2 foo.c

Doh! I only just realized that you want GCC to generate the vptest instruction without using intrinsics. I will leave the code here anyway.

+1

If the compiler does not perform an optimization automatically, you have three options:

  • Get a new compiler.
  • Do the optimization by hand (for example, using intrinsics, as in your update and in the other answer).
  • Modify the compiler so that it performs the optimization automatically.

You have all but ruled out the first option by using GCC extensions, although llvm / clang may support these extensions for you.

You ruled out the second option explicitly.

The third option seems the best to me. gcc is open source, so you can make (and contribute) your own changes. If you can modify gcc to perform the optimization automatically (ideally from 100% standard C), you will not only reach your goal of getting this optimization without writing it into your program, but you will also save yourself countless manual optimizations in the future (especially non-standard ones that lock you into a specific compiler).

0

Source: https://habr.com/ru/post/984743/

