I think you want:
    double i0[2];   // _mm_load_pd requires 16-byte-aligned data
    double i1[2];
    __m128d x1 = _mm_load_pd(i0);
    __m128d x2 = _mm_load_pd(i1);
    __m128d sum = _mm_add_pd(x1, x2);
    // do whatever you want with "sum" now
When you call _mm_load_pd, it puts the first double in the lower 64 bits of the register and the second in the upper 64 bits. Thus, after the loads above, x1 contains the two double values i0[0] and i0[1] (and similarly for x2). Calling _mm_add_pd adds the corresponding elements of x1 and x2 vertically, so after the addition sum contains i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.
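To make the lane layout concrete, here is a minimal sketch that pulls the two halves of the sum back out into ordinary doubles. The helper name add_pairs and its parameters are made up for illustration, and it assumes both input pointers are 16-byte aligned, as _mm_load_pd requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: adds two pairs of doubles lane by lane and
   returns each lane of the result separately.
   Assumes a and b are 16-byte aligned (required by _mm_load_pd). */
void add_pairs(const double *a, const double *b, double *lo, double *hi)
{
    __m128d x1 = _mm_load_pd(a);       /* x1 = { a[0] (low), a[1] (high) } */
    __m128d x2 = _mm_load_pd(b);       /* x2 = { b[0] (low), b[1] (high) } */
    __m128d sum = _mm_add_pd(x1, x2);  /* lane-wise (vertical) addition */

    /* _mm_cvtsd_f64 reads the lower 64 bits of the register. */
    *lo = _mm_cvtsd_f64(sum);
    /* _mm_unpackhi_pd(sum, sum) moves the upper lane into the lower one. */
    *hi = _mm_cvtsd_f64(_mm_unpackhi_pd(sum, sum));
}
```

Called with a = {1.0, 2.0} and b = {10.0, 20.0}, lo receives a[0] + b[0] and hi receives a[1] + b[1]. If alignment cannot be guaranteed, _mm_loadu_pd is the unaligned counterpart.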
Edit: I should point out that there is no benefit to using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variety explicitly loads two packed doubles, while the ps version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves, and both use the SSE floating-point unit, there is no penalty for using _mm_load_ps to load double data. And _mm_load_ps has an advantage: its instruction encoding is one byte shorter than that of _mm_load_pd, so it is more efficient from an instruction-cache standpoint (and potentially for instruction decoding; I'm not an expert on all the intricacies of modern x86 processors). The above code using _mm_load_ps would look like this:
    double i0[2];
    double i1[2];
    __m128d x1 = (__m128d) _mm_load_ps((float *) i0);
    __m128d x2 = (__m128d) _mm_load_ps((float *) i1);
    __m128d sum = _mm_add_pd(x1, x2);
There is no function implied by the casts; they simply force the compiler to reinterpret the contents of the SSE register as holding doubles instead of floats, so that it can be passed to the double-precision arithmetic function _mm_add_pd.
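A C-style cast between vector types is accepted by GCC and Clang but, as far as I know, not by every compiler. The SSE2 header also provides _mm_castps_pd, which does the same reinterpretation without generating an instruction. Here is a sketch of the same idea using it; the function name add_pairs_via_ps is made up for illustration, and all three pointers are assumed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE _ps intrinsics */

/* Hypothetical helper: loads double data with the shorter _ps load,
   reinterprets the registers as packed doubles, and adds them.
   Assumes a, b and out are all 16-byte aligned. */
void add_pairs_via_ps(const double *a, const double *b, double *out)
{
    /* The loads copy 128 bits verbatim; the element type is irrelevant. */
    __m128 raw1 = _mm_load_ps((const float *) a);
    __m128 raw2 = _mm_load_ps((const float *) b);

    /* _mm_castps_pd only changes the type seen by the compiler. */
    __m128d x1 = _mm_castps_pd(raw1);
    __m128d x2 = _mm_castps_pd(raw2);

    /* Double-precision add, then store both lanes to out[0] and out[1]. */
    _mm_store_pd(out, _mm_add_pd(x1, x2));
}
```

The result should match what the _mm_load_pd version produces, since only the load instruction differs.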