First attempt
Using scipy.weave and SSE2 intrinsics gives a slight improvement. The first call is a bit slower, since the code needs to be loaded from disk and cached; subsequent calls are faster:
    import numpy
    import time
    from os import urandom
    from scipy import weave

    SIZE = 2**20

    def faster_slow_xor(aa, bb):
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
        return b.tostring()

    code = """
    const __m128i* pa = (__m128i*)a;
    const __m128i* pend = (__m128i*)(a + arr_size);
    __m128i* pb = (__m128i*)b;
    __m128i xmm1, xmm2;
    while (pa < pend) {
        xmm1 = _mm_loadu_si128(pa); // must use unaligned access
        xmm2 = _mm_load_si128(pb);  // numpy will align at 16-byte boundaries
        _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
        ++pa;
        ++pb;
    }
    """

    def inline_xor(aa, bb):
        a = numpy.frombuffer(aa, dtype=numpy.uint64)
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        arr_size = a.shape[0]
        weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
        return b.tostring()
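As a quick sanity check (this snippet is mine, not part of the measured code), the weave version can be compared against the pure-numpy version on random input:

    # Illustrative check only: both variants should produce byte-identical output.
    aa = urandom(SIZE)
    bb = urandom(SIZE)
    assert inline_xor(aa, bb) == faster_slow_xor(aa, bb)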
Second attempt
Based on the comments, I revisited the code to see whether the copying could be avoided. It turns out I had read the documentation of the string object incorrectly, so here is my second attempt:
support = """ #define ALIGNMENT 16 static void memxor(const char* in1, const char* in2, char* out, ssize_t n) { const char* end = in1 + n; while (in1 < end) { *out = *in1 ^ *in2; ++in1; ++in2; ++out; } } """ code2 = """ PyObject* res = PyString_FromStringAndSize(NULL, real_size); const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT; const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT; memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head); const __m128i* pa = (__m128i*)((char*)a + head); const __m128i* pend = (__m128i*)((char*)a + real_size - tail); const __m128i* pb = (__m128i*)((char*)b + head); __m128i xmm1, xmm2; __m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head); while (pa < pend) { xmm1 = _mm_loadu_si128(pa); xmm2 = _mm_loadu_si128(pb); _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2)); ++pa; ++pb; ++pc; } memxor((const char*)pa, (const char*)pb, (char*)pc, tail); return_val = res; Py_DECREF(res); """ def inline_xor_nocopy(aa, bb): real_size = len(aa) a = numpy.frombuffer(aa, dtype=numpy.uint64) b = numpy.frombuffer(bb, dtype=numpy.uint64) return weave.inline(code2, ["a", "b", "real_size"], headers = ['"emmintrin.h"'], support_code = support)
The difference is that the output string is allocated inside the C code. There is no way to have it aligned at a 16-byte boundary, as the SSE2 instructions require, so the unaligned memory regions at the beginning and the end are handled with byte-wise access (memxor).
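To make the head/tail arithmetic concrete, here is the same computation written out in Python for a made-up buffer address (illustration only; the actual values depend on wherever the string happens to be allocated):

    ALIGNMENT = 16

    def head_tail(addr):
        # tail: misalignment of the output buffer; head: bytes needed to reach
        # the next 16-byte boundary. Because the buffer size is a multiple of
        # 16 here, the same number of tail bytes is left over after the loop.
        tail = addr % ALIGNMENT
        head = (ALIGNMENT - tail) % ALIGNMENT
        return head, tail

    print head_tail(0x95f0a9)  # address ending in 0x9 -> (7, 9)
    print head_tail(0x95f0a0)  # already 16-byte aligned -> (0, 0)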
In either case, the input is passed in as numpy arrays, because weave insists on copying Python str objects to std::strings. frombuffer does not copy, so this is fine, but the memory is not 16-byte aligned, so we have to use _mm_loadu_si128 instead of the faster _mm_load_si128.
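The copy behaviour is easy to verify (my illustration, not part of the timed code): frombuffer wraps the existing string memory, while fromstring allocates and owns a copy:

    s = urandom(32)
    shared = numpy.frombuffer(s, dtype=numpy.uint8)   # no copy, wraps s
    copied = numpy.fromstring(s, dtype=numpy.uint8)   # independent copy
    print shared.flags['OWNDATA'], copied.flags['OWNDATA']   # False True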
Instead of _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.
Timings
As for the timings: the slow_xor entry in the first edit referred to my improved version (numpy.bitwise_xor on uint64 arrays); I have removed that confusion. Here, slow_xor refers to the code from the original question. All timings are for 1000 runs.
    slow_xor:           1.85s  (1x)
    faster_slow_xor:    1.25s  (1.48x)
    inline_xor:         0.95s  (1.95x)
    inline_xor_nocopy:  0.32s  (5.78x)
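For reference, a driver along these lines can produce such numbers; this is a sketch of how the 1000-run timings might be taken, not the exact benchmark used above (Python 2 with a working scipy.weave install is assumed, and the first weave call includes compilation/cache loading):

    def bench(fn, aa, bb, runs=1000):
        # Time `runs` consecutive calls of fn on the same input.
        start = time.time()
        for _ in xrange(runs):
            fn(aa, bb)
        return time.time() - start

    aa = urandom(SIZE)
    bb = urandom(SIZE)
    for fn in (faster_slow_xor, inline_xor, inline_xor_nocopy):
        print fn.__name__, "%.2fs" % bench(fn, aa, bb)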
The code was compiled using gcc 4.4.3, and I made sure that the compiler really uses SSE instructions.