IDK why no one ever posted the correct answer, which appeared several times in the comments, but here it is:
Each byte is a mask for a whole double, so PMOVSXBQ is exactly what we need: it loads two bytes from an m16 pointer and sign-extends them into the two 64-bit (qword) halves of an xmm register.
# UNTESTED CODE (loop setup stuff)
Writing this with intrinsics should be easy. As others have pointed out, just use dereferenced pointers as arguments to the intrinsics.
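For example, here is a minimal, untested intrinsics sketch of the whole idea, assuming the mask bytes are 0x00 / 0xFF and the goal is a masked sum of doubles (the function and variable names are made up for illustration; they're not from the question):

```c
#include <immintrin.h>   /* SSE4.1 for _mm_cvtepi8_epi64 (PMOVSXBQ) */
#include <stdint.h>
#include <string.h>

/* Hypothetical masked-sum loop: mask[i] is 0x00 or 0xFF for each double. */
static double masked_sum(const double *vals, const int8_t *mask, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i + 2 <= n; i += 2) {
        uint16_t two;
        memcpy(&two, mask + i, sizeof two);        /* load exactly 2 mask bytes */
        __m128i m64 = _mm_cvtepi8_epi64(_mm_cvtsi32_si128(two)); /* PMOVSXBQ */
        __m128d v   = _mm_loadu_pd(vals + i);
        acc = _mm_add_pd(acc, _mm_and_pd(v, _mm_castsi128_pd(m64))); /* ANDPD + ADDPD */
    }
    __m128d hi = _mm_unpackhi_pd(acc, acc);        /* horizontal sum of the two halves */
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));     /* (scalar tail for odd n omitted) */
}
```

Compile with -msse4.1 or higher; a decent compiler may fold the 2-byte load into a PMOVSXBQ memory operand.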
To answer the other part of your question, about how to shift data around to line it up for PMOVSX:
On Sandybridge and later, using PMOVSXBQ with a memory operand is probably fine. On earlier CPUs that can't do two loads per cycle, load 16B of mask data at a time and shift it down 2 bytes per iteration with PSRLDQ xmm1, 2, which leaves the next 2 bytes of mask data in the low 2 bytes of the register. Or maybe use PUNPCKHQDQ or PSHUFD to get two dependency chains going, by moving the high 64 bits into the low 64 bits of another register. You'd have to check which port each of these instructions runs on (shift vs. shuffle/extract) and see which conflicts less with PMOVSX and ADDPD.
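A rough, untested sketch of that reuse-one-mask-load idea (names made up; same 0x00/0xFF mask assumption as above):

```c
#include <immintrin.h>
#include <stdint.h>

/* Load 16 mask bytes once, then PSRLDQ by 2 each iteration so PMOVSXBQ
 * always picks up the next 2 mask bytes from the bottom of the register.
 * Covers 8 vector steps = 16 doubles per 16-byte mask load. */
static __m128d sum_block_of_16(const double *vals, const int8_t *mask, __m128d acc)
{
    __m128i m = _mm_loadu_si128((const __m128i *)mask);
    for (int k = 0; k < 8; k++) {
        __m128i m64 = _mm_cvtepi8_epi64(m);            /* PMOVSXBQ: low 2 bytes -> 2 qwords */
        __m128d v   = _mm_loadu_pd(vals + 2 * k);
        acc = _mm_add_pd(acc, _mm_and_pd(v, _mm_castsi128_pd(m64)));
        m = _mm_srli_si128(m, 2);                      /* PSRLDQ m, 2 */
    }
    return acc;
}
```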
As for ports: punpck and PSHUFD both run on p1/p5 on SnB, and so does PMOVSX. ADDPD can only run on p1, and ANDPD only on p5. Hmm, maybe PAND would be better, since it can run on p0 (as well as p1/p5); otherwise nothing in the loop uses execution port 0. If there's a latency penalty for moving data from the integer to the FP domain, it's unavoidable if we use PMOVSX, since that produces the mask data in the int domain. Better to use more accumulators to make the loop longer than the longest dependency chain, but keep it under 28 uops or so so it fits in the loop buffer and can issue at 4 uops per clock.
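If the PAND idea turns out to be a win, the masking step would look something like this instead (a sketch; the helper name is made up):

```c
#include <immintrin.h>

/* Do the AND in the integer domain with PAND (p0/p1/p5 on SnB) instead of
 * ANDPD (p5 only). The mask is already in the int domain after PMOVSXBQ. */
static inline __m128d mask_and_add(__m128d acc, __m128d v, __m128i mask_qwords)
{
    __m128i bits = _mm_and_si128(mask_qwords, _mm_castpd_si128(v));   /* PAND */
    return _mm_add_pd(acc, _mm_castsi128_pd(bits));                   /* ADDPD */
}
```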
And one more thing about optimizing the whole thing: loop alignment isn't really needed, since on Nehalem and later it will fit in the loop buffer.
You should unroll the loop by 2 or 4, because pre-Haswell Intel CPUs don't have enough execution units to handle all 4 (fused) uops in a single cycle. (3 vector uops and one fused add/jl; the two loads fuse with the vector uops they feed.) Sandybridge and later can do both loads every cycle, so one iteration per clock is achievable, apart from loop overhead.
Oh, and ADDPD has 3-cycle latency. So you need to unroll and use multiple accumulators to keep the loop-carried dependency chain from being the bottleneck. Probably unroll by 4, and then sum the 4 accumulators at the end. You'll have to do that in the source code even when using intrinsics, because it changes the order of operations for FP math, so the compiler might not be willing to do it while unrolling.
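Something like this (untested sketch, same assumptions as above; n is assumed to be a multiple of 8, and _mm_cvtsi64_si128 assumes a 64-bit build):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Unroll by 4 with 4 independent accumulators so the 3-cycle ADDPD latency
 * chains overlap instead of forming one long loop-carried chain. */
static double masked_sum_unroll4(const double *vals, const int8_t *mask, size_t n)
{
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 8) {
        uint64_t m8;                                   /* 8 mask bytes per iteration */
        memcpy(&m8, mask + i, sizeof m8);
        __m128i m  = _mm_cvtsi64_si128((long long)m8);
        __m128i m0 = _mm_cvtepi8_epi64(m);
        __m128i m1 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 2));
        __m128i m2 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 4));
        __m128i m3 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 6));
        acc0 = _mm_add_pd(acc0, _mm_and_pd(_mm_loadu_pd(vals + i + 0), _mm_castsi128_pd(m0)));
        acc1 = _mm_add_pd(acc1, _mm_and_pd(_mm_loadu_pd(vals + i + 2), _mm_castsi128_pd(m1)));
        acc2 = _mm_add_pd(acc2, _mm_and_pd(_mm_loadu_pd(vals + i + 4), _mm_castsi128_pd(m2)));
        acc3 = _mm_add_pd(acc3, _mm_and_pd(_mm_loadu_pd(vals + i + 6), _mm_castsi128_pd(m3)));
    }
    __m128d s = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3)); /* combine at the end */
    return _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));
}
```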
So each unrolled-by-4 iteration of the loop would take 4 clock cycles, plus 1 uop of loop overhead. On Nehalem, where you have the tiny loop buffer but no uop cache, unrolling might mean you have to start caring about decoder throughput. On pre-Sandybridge, though, one load per clock is probably the bottleneck anyway.
For decoder throughput, you can probably use ANDPS instead of ANDPD, which takes one fewer byte to encode. IDK if that helps.
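With intrinsics that substitution is just a cast dance (whether the compiler actually honors the PS-vs-PD choice for a plain bitwise AND is up to the compiler):

```c
#include <immintrin.h>

/* Same bitwise AND, but expressed as ANDPS, whose encoding is one byte
 * shorter than ANDPD. */
static inline __m128d and_mask_ps(__m128d v, __m128i mask_qwords)
{
    return _mm_castps_pd(_mm_and_ps(_mm_castpd_ps(v), _mm_castsi128_ps(mask_qwords)));
}
```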
Extending this to 256b ymm registers would require AVX2 for the most straightforward implementation (VPMOVSXBQ ymm). You might get a speedup on AVX-only hardware by doing two VPMOVSXBQ xmm and combining them with VINSERTF128 or something.
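With AVX2 that would be roughly the following (untested sketch, made-up names, n assumed to be a multiple of 4; compile with -mavx2):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* VPMOVSXBQ ymm widens 4 mask bytes to 4 qwords, so each iteration
 * handles 4 doubles in one 256-bit accumulator. */
static double masked_sum_avx2(const double *vals, const int8_t *mask, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        uint32_t four;
        memcpy(&four, mask + i, sizeof four);                 /* 4 mask bytes */
        __m256i m64 = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128((int)four)); /* VPMOVSXBQ */
        __m256d v   = _mm256_loadu_pd(vals + i);
        acc = _mm256_add_pd(acc, _mm256_and_pd(v, _mm256_castsi256_pd(m64)));
    }
    __m128d s = _mm_add_pd(_mm256_castpd256_pd128(acc), _mm256_extractf128_pd(acc, 1));
    return _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));
}
```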