IDK why no one ever posted the correct answer, which appeared several times in the comments, but here it is:
Each byte is a mask for a whole double, so PMOVSXBQ is exactly what we need: it loads two bytes from an m16 pointer and sign-extends them into the two 64-bit (qword) halves of an xmm register.
# UNTESTED CODE (loop setup stuff)
Writing this with intrinsics should be easy. As others have pointed out, just use dereferenced pointers as arguments to the intrinsics.
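For example, here is a minimal, untested intrinsics sketch of the whole idea, assuming the mask bytes are 0x00 / 0xFF and the goal is a masked sum of doubles (the function and variable names are made up for illustration; they're not from the question):

```c
#include <immintrin.h>   /* SSE4.1 for _mm_cvtepi8_epi64 (PMOVSXBQ) */
#include <stdint.h>
#include <string.h>

/* Hypothetical masked-sum loop: mask[i] is 0x00 or 0xFF for each double. */
static double masked_sum(const double *vals, const int8_t *mask, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i + 2 <= n; i += 2) {
        uint16_t two;
        memcpy(&two, mask + i, sizeof two);        /* load exactly 2 mask bytes */
        __m128i m64 = _mm_cvtepi8_epi64(_mm_cvtsi32_si128(two)); /* PMOVSXBQ */
        __m128d v   = _mm_loadu_pd(vals + i);
        acc = _mm_add_pd(acc, _mm_and_pd(v, _mm_castsi128_pd(m64))); /* ANDPD + ADDPD */
    }
    __m128d hi = _mm_unpackhi_pd(acc, acc);        /* horizontal sum of the two halves */
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));     /* (scalar tail for odd n omitted) */
}
```

Compile with -msse4.1 or higher; a decent compiler may fold the 2-byte load into a PMOVSXBQ memory operand.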
To answer the other part of your question, about how to shift data around to line it up for PMOVSX:
On Sandybridge and later, using PMOVSXBQ with a memory operand is probably fine. On earlier CPUs that can't do two loads per cycle, load 16B of mask data at a time and shift it down 2 bytes per iteration with PSRLDQ xmm1, 2, which leaves the next 2 bytes of mask data in the low 2 bytes of the register. Or maybe use PUNPCKHQDQ or PSHUFD to get two dependency chains going, by moving the high 64 bits into the low 64 bits of another register. You'd have to check which port each of these instructions runs on (shift vs. shuffle/extract) and see which conflicts less with PMOVSX and ADDPD.
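A rough, untested sketch of that reuse-one-mask-load idea (names made up; same 0x00/0xFF mask assumption as above):

```c
#include <immintrin.h>
#include <stdint.h>

/* Load 16 mask bytes once, then PSRLDQ by 2 each iteration so PMOVSXBQ
 * always picks up the next 2 mask bytes from the bottom of the register.
 * Covers 8 vector steps = 16 doubles per 16-byte mask load. */
static __m128d sum_block_of_16(const double *vals, const int8_t *mask, __m128d acc)
{
    __m128i m = _mm_loadu_si128((const __m128i *)mask);
    for (int k = 0; k < 8; k++) {
        __m128i m64 = _mm_cvtepi8_epi64(m);            /* PMOVSXBQ: low 2 bytes -> 2 qwords */
        __m128d v   = _mm_loadu_pd(vals + 2 * k);
        acc = _mm_add_pd(acc, _mm_and_pd(v, _mm_castsi128_pd(m64)));
        m = _mm_srli_si128(m, 2);                      /* PSRLDQ m, 2 */
    }
    return acc;
}
```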
As for ports: punpck and PSHUFD both run on p1/p5 on SnB, and so does PMOVSX. ADDPD can only run on p1, and ANDPD only on p5. Hmm, maybe PAND would be better, since it can run on p0 (as well as p1/p5); otherwise nothing in the loop uses execution port 0. If there's a latency penalty for moving data from the integer to the FP domain, it's unavoidable if we use PMOVSX, since that produces the mask data in the int domain. Better to use more accumulators to make the loop longer than the longest dependency chain, but keep it under 28 uops or so so it fits in the loop buffer and can issue at 4 uops per clock.
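If the PAND idea turns out to be a win, the masking step would look something like this instead (a sketch; the helper name is made up):

```c
#include <immintrin.h>

/* Do the AND in the integer domain with PAND (p0/p1/p5 on SnB) instead of
 * ANDPD (p5 only). The mask is already in the int domain after PMOVSXBQ. */
static inline __m128d mask_and_add(__m128d acc, __m128d v, __m128i mask_qwords)
{
    __m128i bits = _mm_and_si128(mask_qwords, _mm_castpd_si128(v));   /* PAND */
    return _mm_add_pd(acc, _mm_castsi128_pd(bits));                   /* ADDPD */
}
```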
And one more thing about optimizing the whole thing: loop alignment isn't really needed, since on Nehalem and later it will fit in the loop buffer.
You should unroll the loop by 2 or 4, because pre-Haswell Intel CPUs don't have enough execution units to handle all 4 (fused) uops in a single cycle. (3 vector uops and one fused add/jl; the two loads fuse with the vector uops they feed.) Sandybridge and later can do both loads every cycle, so one iteration per clock is achievable, apart from loop overhead.
Oh, and ADDPD has 3-cycle latency. So you need to unroll and use multiple accumulators to keep the loop-carried dependency chain from being the bottleneck. Probably unroll by 4, and then sum the 4 accumulators at the end. You'll have to do that in the source code even when using intrinsics, because it changes the order of operations for FP math, so the compiler might not be willing to do it while unrolling.
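Something like this (untested sketch, same assumptions as above; n is assumed to be a multiple of 8, and _mm_cvtsi64_si128 assumes a 64-bit build):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Unroll by 4 with 4 independent accumulators so the 3-cycle ADDPD latency
 * chains overlap instead of forming one long loop-carried chain. */
static double masked_sum_unroll4(const double *vals, const int8_t *mask, size_t n)
{
    __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
    __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 8) {
        uint64_t m8;                                   /* 8 mask bytes per iteration */
        memcpy(&m8, mask + i, sizeof m8);
        __m128i m  = _mm_cvtsi64_si128((long long)m8);
        __m128i m0 = _mm_cvtepi8_epi64(m);
        __m128i m1 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 2));
        __m128i m2 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 4));
        __m128i m3 = _mm_cvtepi8_epi64(_mm_srli_si128(m, 6));
        acc0 = _mm_add_pd(acc0, _mm_and_pd(_mm_loadu_pd(vals + i + 0), _mm_castsi128_pd(m0)));
        acc1 = _mm_add_pd(acc1, _mm_and_pd(_mm_loadu_pd(vals + i + 2), _mm_castsi128_pd(m1)));
        acc2 = _mm_add_pd(acc2, _mm_and_pd(_mm_loadu_pd(vals + i + 4), _mm_castsi128_pd(m2)));
        acc3 = _mm_add_pd(acc3, _mm_and_pd(_mm_loadu_pd(vals + i + 6), _mm_castsi128_pd(m3)));
    }
    __m128d s = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3)); /* combine at the end */
    return _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));
}
```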
So each unrolled-by-4 iteration of the loop would take 4 clock cycles, plus 1 uop of loop overhead. On Nehalem, where you have the tiny loop buffer but no uop cache, unrolling might mean you have to start caring about decoder throughput. On pre-Sandybridge, though, one load per clock is probably the bottleneck anyway.
For decoder throughput, you can probably use ANDPS instead of ANDPD, which takes one fewer byte to encode. IDK if that helps.
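With intrinsics that substitution is just a cast dance (whether the compiler actually honors the PS-vs-PD choice for a plain bitwise AND is up to the compiler):

```c
#include <immintrin.h>

/* Same bitwise AND, but expressed as ANDPS, whose encoding is one byte
 * shorter than ANDPD. */
static inline __m128d and_mask_ps(__m128d v, __m128i mask_qwords)
{
    return _mm_castps_pd(_mm_and_ps(_mm_castpd_ps(v), _mm_castsi128_ps(mask_qwords)));
}
```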
Extending this to 256b ymm registers would require AVX2 for the most straightforward implementation (VPMOVSXBQ ymm). You might get a speedup on AVX-only hardware by doing two VPMOVSXBQ xmm and combining them with VINSERTF128 or something.
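With AVX2 that would be roughly the following (untested sketch, made-up names, n assumed to be a multiple of 4; compile with -mavx2):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* VPMOVSXBQ ymm widens 4 mask bytes to 4 qwords, so each iteration
 * handles 4 doubles in one 256-bit accumulator. */
static double masked_sum_avx2(const double *vals, const int8_t *mask, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        uint32_t four;
        memcpy(&four, mask + i, sizeof four);                 /* 4 mask bytes */
        __m256i m64 = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128((int)four)); /* VPMOVSXBQ */
        __m256d v   = _mm256_loadu_pd(vals + i);
        acc = _mm256_add_pd(acc, _mm256_and_pd(v, _mm256_castsi256_pd(m64)));
    }
    __m128d s = _mm_add_pd(_mm256_castpd256_pd128(acc), _mm256_extractf128_pd(acc, 1));
    return _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));
}
```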