How can I initialize __m128i without using any SSE instructions?

I have many functions that use the same constant __m128i values. For instance:

    const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
    const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);

Therefore, I want to keep all these constants in one place. But there is a problem: I check for CPU extensions at runtime. If the CPU does not support, for example, SSE (or AVX), then the program will crash during the initialization of the constants.

So, is it possible to initialize these constants without using SSE?

4 answers

Initializing the __m128i vector without using SSE instructions is possible, but depends on how the compiler defines __m128i.

For Microsoft Visual Studio, you can define the following macros (MSVC defines __m128i as a union whose first member is char[16]):

    template <class T> inline char GetChar(T value, size_t index)
    {
        return ((char*)&value)[index];
    }

    #define AS_CHAR(a) char(a)

    #define AS_2CHARS(a) \
        GetChar(int16_t(a), 0), GetChar(int16_t(a), 1)

    #define AS_4CHARS(a) \
        GetChar(int32_t(a), 0), GetChar(int32_t(a), 1), \
        GetChar(int32_t(a), 2), GetChar(int32_t(a), 3)

    #define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
        {AS_CHAR(a0), AS_CHAR(a1), AS_CHAR(a2), AS_CHAR(a3), \
         AS_CHAR(a4), AS_CHAR(a5), AS_CHAR(a6), AS_CHAR(a7), \
         AS_CHAR(a8), AS_CHAR(a9), AS_CHAR(aa), AS_CHAR(ab), \
         AS_CHAR(ac), AS_CHAR(ad), AS_CHAR(ae), AS_CHAR(af)}

    #define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
        {AS_2CHARS(a0), AS_2CHARS(a1), AS_2CHARS(a2), AS_2CHARS(a3), \
         AS_2CHARS(a4), AS_2CHARS(a5), AS_2CHARS(a6), AS_2CHARS(a7)}

    #define _MM_SETR_EPI32(a0, a1, a2, a3) \
        {AS_4CHARS(a0), AS_4CHARS(a1), AS_4CHARS(a2), AS_4CHARS(a3)}

For GCC, the macros look like this (GCC defines __m128i as a vector of two long long values):

    #define CHAR_AS_LONGLONG(a) (((long long)a) & 0xFF)

    #define SHORT_AS_LONGLONG(a) (((long long)a) & 0xFFFF)

    #define INT_AS_LONGLONG(a) (((long long)a) & 0xFFFFFFFF)

    #define LL_SETR_EPI8(a, b, c, d, e, f, g, h) \
        CHAR_AS_LONGLONG(a) | (CHAR_AS_LONGLONG(b) << 8) | \
        (CHAR_AS_LONGLONG(c) << 16) | (CHAR_AS_LONGLONG(d) << 24) | \
        (CHAR_AS_LONGLONG(e) << 32) | (CHAR_AS_LONGLONG(f) << 40) | \
        (CHAR_AS_LONGLONG(g) << 48) | (CHAR_AS_LONGLONG(h) << 56)

    #define LL_SETR_EPI16(a, b, c, d) \
        SHORT_AS_LONGLONG(a) | (SHORT_AS_LONGLONG(b) << 16) | \
        (SHORT_AS_LONGLONG(c) << 32) | (SHORT_AS_LONGLONG(d) << 48)

    #define LL_SETR_EPI32(a, b) \
        INT_AS_LONGLONG(a) | (INT_AS_LONGLONG(b) << 32)

    #define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
        {LL_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7), LL_SETR_EPI8(a8, a9, aa, ab, ac, ad, ae, af)}

    #define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
        {LL_SETR_EPI16(a0, a1, a2, a3), LL_SETR_EPI16(a4, a5, a6, a7)}

    #define _MM_SETR_EPI32(a0, a1, a2, a3) \
        {LL_SETR_EPI32(a0, a1), LL_SETR_EPI32(a2, a3)}

So, in your code, the initialization of the __m128i constant will look like this:

    const __m128i K8 = _MM_SETR_EPI8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i K16 = _MM_SETR_EPI16(1, 2, 3, 4, 5, 6, 7, 8);
    const __m128i K32 = _MM_SETR_EPI32(1, 2, 3, 4);
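
The packing that the GCC macros above perform can also be written as constexpr functions, which makes the lane order easy to check at compile time. This is only a minimal sketch, assuming a little-endian target (true for x86); `lane`, `pack8` and `pack16` are hypothetical helper names, not part of any compiler header.

```cpp
#include <cstdint>

// Hypothetical helper: masks a value and moves it to bit offset `shift`.
constexpr std::uint64_t lane(long long v, std::uint64_t mask, int shift)
{
    return (static_cast<std::uint64_t>(v) & mask) << shift;
}

// Packs eight 8-bit lanes into one 64-bit half of a vector.
// The first argument ends up in the least significant byte.
constexpr long long pack8(int a, int b, int c, int d,
                          int e, int f, int g, int h)
{
    return static_cast<long long>(
        lane(a, 0xFF, 0)  | lane(b, 0xFF, 8)  |
        lane(c, 0xFF, 16) | lane(d, 0xFF, 24) |
        lane(e, 0xFF, 32) | lane(f, 0xFF, 40) |
        lane(g, 0xFF, 48) | lane(h, 0xFF, 56));
}

// Packs four 16-bit lanes into one 64-bit half of a vector.
constexpr long long pack16(int a, int b, int c, int d)
{
    return static_cast<long long>(
        lane(a, 0xFFFF, 0)  | lane(b, 0xFFFF, 16) |
        lane(c, 0xFFFF, 32) | lane(d, 0xFFFF, 48));
}

// Compile-time checks of the lane order; no SSE instruction is involved.
static_assert(pack8(1, 2, 3, 4, 5, 6, 7, 8) == 0x0807060504030201LL,
              "lane 0 must land in the lowest byte");
static_assert(pack16(1, 2, 3, 4) == 0x0004000300020001LL,
              "lane 0 must land in the lowest 16 bits");
```

Whether such values can then be used in a brace initializer for __m128i still depends on how the compiler defines the type, as described above.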

I suggest defining the initialization data globally as scalar data, and then loading it locally into a const __m128i:

    static const uint8_t gK8[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

    static inline void foo()
    {
        // Unaligned load, because gK8 is not guaranteed to be 16-byte aligned.
        const __m128i K8 = _mm_loadu_si128((const __m128i *)gK8);
        // ...
    }

You can use a union.

    union M128
    {
        char i8[16];   // first member, so brace initialization fills these bytes
        __m128i i128;
    };

    const M128 k8 = {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}};

If the M128 union is defined locally, where you use the loop, this should have no performance overhead (it will be loaded from memory once at the beginning of the loop). Since it contains a member of type __m128i, M128 inherits the correct alignment.

    void foo()
    {
        M128 k8 = ...;
        // use k8.i128 in your for loop
    }

If it is defined somewhere else, you need to copy it into a local variable before you start the loop; otherwise the compiler may not be able to optimize it.

    void foo()
    {
        __m128i tmp = k8.i128;
        // for loop here
    }

This will load k8 into a CPU register and keep it there for the entire loop, provided there are enough free registers to execute the loop body.

Depending on which compiler you use, such unions may already be defined (VS does this), but the definitions provided by the compiler may not be portable.


You usually do not need this. Compilers are very good at using the same storage for several functions that use the same constant. Just like merging multiple instances of the same string literal into a single string constant, multiple instances of the same _mm_set* in different functions will all load from the same vector constant (or generate it on the fly, for _mm_setzero_si128() or _mm_set1_epi8(-1)).

Godbolt's binary output mode (disassembly) lets you find out whether different functions load from the same block of memory or not. Look at the added comment, which resolves RIP-relative addresses to absolute addresses.

  • gcc: all identical constants share the same storage, regardless of whether they come from auto-vectorization or from _mm_set. 32B constants cannot overlap with 16B constants, even if the 16B constant is a subset of the 32B one.

  • clang: identical constants share storage. 16B and 32B constants do not overlap, even if one is a subset of the other. Some functions that use repeating constants use an AVX2 vpbroadcastd broadcast load (which does not even need an ALU uop on Intel SnB-family CPUs). For some reason, it chooses this based on the element size of the operation, rather than on the repetitiveness of the constant. Note that clang's asm output repeats the constant for each use, but the final binary does not.

  • MSVC: identical constants share storage. Pretty much the same as gcc. (The full asm output is hard to wade through; use the search. I could only get the asm by having main find the path to its own .exe, using that to work out the path to the asm output made with cl.exe -O2 /FAs, and then running system("type .../foo.asm") ).

Compilers are good at this because it is not a new problem. It has existed with strings since the earliest days of compilers.
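
As an illustration of the pattern being described, here is a minimal sketch in which two functions each spell out the same constant locally. `AddK32` and `SubK32` are hypothetical names, and `SKETCH_HAS_SSE2` is a made-up guard macro. Whether the two uses really end up sharing one 16-byte constant depends on the compiler and flags, so it is worth checking the disassembly as described above.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1  // hypothetical guard macro used only in this sketch

// The constant lives inside the function, so no code runs at static
// initialization time; the compiler materializes it where it is used.
static inline void AddK32(const std::int32_t* src, std::int32_t* dst)
{
    const __m128i k32 = _mm_setr_epi32(1, 2, 3, 4);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_add_epi32(v, k32));
}

// Same constant written out again in a different function.
static inline void SubK32(const std::int32_t* src, std::int32_t* dst)
{
    const __m128i k32 = _mm_setr_epi32(1, 2, 3, 4);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_sub_epi32(v, k32));
}
#endif
```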

I did not check whether this works across source files (for example, for an inline vector function used in several compilation units). If you still want static / global vector constants, see below:


There seems to be no easy and portable way to statically initialize a static / global __m128. C compilers do not even accept _mm_set* as an initializer, because it works like a function. They do not take advantage of the fact that they could see through it to the 16B compile-time constant.

    const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);  // Illegal in C
    // C++: generates a constructor that copies from .rodata to the BSS

Although the constructor only requires SSE1 or SSE2, you still don't want this. It's horrible. DO NOT DO IT. You end up paying the memory cost of your constants twice.


Fabio's union answer looks like the best portable way to statically initialize a vector constant, but it means you have to access the __m128i member of the union. It can help with grouping related constants next to each other (hopefully in the same cache line), even if they are used by scattered functions. There are non-portable ways to do that too (for example, putting related constants in their own ELF section with GNU C __attribute__ ((section ("constants_for_task_A"))) ). Hopefully that groups them together within the .rodata section (which becomes part of the text segment).
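
A rough sketch of the section-attribute idea, assuming GNU C on an ELF target. It uses plain scalar arrays rather than the union so that it compiles without any intrinsics header; the macro `TASK_A_CONST` and the variable names are hypothetical, and on other toolchains the sketch falls back to plain alignment. How the linker finally places the section is not guaranteed by anything shown here.

```cpp
#include <cstdint>

#if defined(__GNUC__) && defined(__ELF__)
// GNU C extension: place the object in a named section, 16-byte aligned.
#define TASK_A_CONST __attribute__((section("constants_for_task_A"), aligned(16)))
#else
// Fallback for other toolchains: alignment only, no section control.
#define TASK_A_CONST alignas(16)
#endif

// Related constants declared together; with the attribute in effect they
// are emitted into the same section.
TASK_A_CONST static const std::uint8_t kTaskA_K8[16] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
TASK_A_CONST static const std::int32_t kTaskA_K32[4] = { 1, 2, 3, 4 };
```

Functions that need the vectors would then load them locally, for example with _mm_load_si128, after the runtime CPU check.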


Source: https://habr.com/ru/post/1242529/

