Simple Python challenge: fastest bitwise XOR on data buffers

Problem:

Perform a bitwise XOR on two equal-sized buffers. The buffers are required to be of the Python type str, since that is traditionally the type for data buffers in Python. Return the result as a str. Do this as fast as possible.

The inputs are two strings of 1 megabyte (2**20 bytes) each.

The challenge is to substantially beat my inefficient algorithm using Python or existing third-party Python modules (relaxed rules: you may also create your own module). Marginal increases are useless.

    from os import urandom
    from numpy import frombuffer, bitwise_xor, byte

    def slow_xor(aa, bb):
        a = frombuffer(aa, dtype=byte)
        b = frombuffer(bb, dtype=byte)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

    aa = urandom(2**20)
    bb = urandom(2**20)

    def test_it():
        for x in xrange(1000):
            slow_xor(aa, bb)
+48
performance python algorithm xor
Jan 22 '10 at 19:08
11 answers

First attempt

Using scipy.weave and SSE2 intrinsics gives a slight improvement. The first invocation is a bit slower, since the code needs to be loaded from disk and cached; subsequent invocations are faster:

    import numpy
    import time
    from os import urandom
    from scipy import weave

    SIZE = 2**20

    def faster_slow_xor(aa, bb):
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
        return b.tostring()

    code = """
    const __m128i* pa = (__m128i*)a;
    const __m128i* pend = (__m128i*)(a + arr_size);
    __m128i* pb = (__m128i*)b;
    __m128i xmm1, xmm2;
    while (pa < pend) {
        xmm1 = _mm_loadu_si128(pa); // must use unaligned access
        xmm2 = _mm_load_si128(pb);  // numpy will align at 16 byte boundaries
        _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
        ++pa;
        ++pb;
    }
    """

    def inline_xor(aa, bb):
        a = numpy.frombuffer(aa, dtype=numpy.uint64)
        b = numpy.fromstring(bb, dtype=numpy.uint64)
        arr_size = a.shape[0]
        weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
        return b.tostring()

Second attempt

Based on the comments, I revisited the code to see if the copying could be avoided. Turns out I had misread the documentation for the string object, so here is my second attempt:

 support = """ #define ALIGNMENT 16 static void memxor(const char* in1, const char* in2, char* out, ssize_t n) { const char* end = in1 + n; while (in1 < end) { *out = *in1 ^ *in2; ++in1; ++in2; ++out; } } """ code2 = """ PyObject* res = PyString_FromStringAndSize(NULL, real_size); const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT; const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT; memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head); const __m128i* pa = (__m128i*)((char*)a + head); const __m128i* pend = (__m128i*)((char*)a + real_size - tail); const __m128i* pb = (__m128i*)((char*)b + head); __m128i xmm1, xmm2; __m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head); while (pa < pend) { xmm1 = _mm_loadu_si128(pa); xmm2 = _mm_loadu_si128(pb); _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2)); ++pa; ++pb; ++pc; } memxor((const char*)pa, (const char*)pb, (char*)pc, tail); return_val = res; Py_DECREF(res); """ def inline_xor_nocopy(aa, bb): real_size = len(aa) a = numpy.frombuffer(aa, dtype=numpy.uint64) b = numpy.frombuffer(bb, dtype=numpy.uint64) return weave.inline(code2, ["a", "b", "real_size"], headers = ['"emmintrin.h"'], support_code = support) 

The difference is that the output string is allocated inside the C code. There is no way to have it aligned at a 16-byte boundary as required by the SSE2 instructions, so the unaligned memory regions at the beginning and the end are copied using byte-wise access.

In either case, the input is passed in as numpy arrays, because weave insists on copying Python str objects to std::strings. frombuffer does not copy, so this is fine, but the memory is not 16-byte aligned, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
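
As a quick aside, a minimal sketch (with throwaway names, not from the answer itself) that checks the no-copy claim: frombuffer really does share memory with its source rather than copying.

    import numpy as np

    buf = bytearray(b"\x00" * 8)              # mutable source buffer
    arr = np.frombuffer(buf, dtype=np.uint8)  # view, not a copy
    buf[0] = 0xFF
    assert arr[0] == 0xFF  # the array sees the change: same memory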

Instead of _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible. This way, the output array does not eat up valuable cache lines.

Timings

As for the timings: the slow_xor entry in the first edit referred to my improved version (inline bitwise xor, uint64); I removed that confusion. slow_xor refers to the code from the original question. All timings are done over 1000 runs.

  • slow_xor : 1.85s (1x)
  • faster_slow_xor : 1.25s (1.48x)
  • inline_xor : 0.95s (1.95x)
  • inline_xor_nocopy : 0.32s (5.78x)

The code was compiled using gcc 4.4.3, and I verified that the compiler actually uses the SSE instructions.

+32
Feb 01 '10 at 20:07

Performance comparison: numpy vs. Cython vs. C vs. Fortran vs. Boost.Python (pyublas)

    | function               | time, usec | ratio | type         |
    |------------------------+------------+-------+--------------|
    | slow_xor               |       2020 |   1.0 | numpy        |
    | xorf_int16             |       1570 |   1.3 | fortran      |
    | xorf_int32             |       1530 |   1.3 | fortran      |
    | xorf_int64             |       1420 |   1.4 | fortran      |
    | faster_slow_xor        |       1360 |   1.5 | numpy        |
    | inline_xor             |       1280 |   1.6 | C            |
    | cython_xor             |       1290 |   1.6 | cython       |
    | xorcpp_inplace (int32) |        440 |   4.6 | pyublas      |
    | cython_xor_vectorised  |        325 |   6.2 | cython       |
    | inline_xor_nocopy      |        172 |  11.7 | C            |
    | xorcpp                 |        144 |  14.0 | boost.python |
    | xorcpp_inplace         |        122 |  16.6 | boost.python |
    #+TBLFM: $3=@2$2/$2;%.1f

To reproduce the results, download http://gist.github.com/353005 and type make (to install the dependencies, type: sudo apt-get install build-essential python-numpy python-scipy cython gfortran; the dependencies for Boost.Python and pyublas are not included because they require manual intervention to work).

Where:

  • slow_xor() is from the OP's question
  • faster_slow_xor() , inline_xor() , inline_xor_nocopy() are from @Torsten Marek's answer
  • cython_xor() and cython_xor_vectorised() are from @gnibbler's answer

And the xor_$type_sig() routines are generated from this Fortran template:

    ! xorf.f90.template
    subroutine xor_$type_sig(a, b, n, out)
      implicit none
      integer, intent(in)              :: n
      $type, intent(in), dimension(n)  :: a
      $type, intent(in), dimension(n)  :: b
      $type, intent(out), dimension(n) :: out

      integer i
      forall(i=1:n) out(i) = ieor(a(i), b(i))
    end subroutine xor_$type_sig

Used from Python as follows:

    import xorf  # extension module generated from xorf.f90.template
    import numpy as np

    def xor_strings(a, b, type_sig='int64'):
        assert len(a) == len(b)
        a = np.frombuffer(a, dtype=np.dtype(type_sig))
        b = np.frombuffer(b, dtype=np.dtype(type_sig))
        return getattr(xorf, 'xor_' + type_sig)(a, b).tostring()

xorcpp_inplace() (Boost.Python, pyublas):

xor.cpp :

    #include <cassert>
    #include <inttypes.h>
    #include <algorithm>

    #include <boost/lambda/lambda.hpp>
    #include <boost/python.hpp>

    #include <pyublas/numpy.hpp>

    namespace {
      namespace py = boost::python;

      template<class InputIterator, class InputIterator2, class OutputIterator>
      void xor_(InputIterator first, InputIterator last,
                InputIterator2 first2, OutputIterator result) {
        // `result` might be `first`, but not any of the input iterators
        namespace ll = boost::lambda;
        (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
      }

      template<class T>
      py::str xorcpp_str_inplace(const py::str& a, py::str& b) {
        const size_t alignment = std::max(sizeof(T), 16ul);
        const size_t n  = py::len(b);
        const char* ai  = py::extract<const char*>(a);
        char* bi        = py::extract<char*>(b);
        char* end       = bi + n;

        if (n < 2*alignment)
          xor_(bi, end, ai, bi);
        else {
          assert(n >= 2*alignment);

          // applying Marek's algorithm to align
          const ptrdiff_t head = (alignment - ((size_t)bi % alignment)) % alignment;
          const ptrdiff_t tail = (size_t)end % alignment;
          xor_(bi, bi + head, ai, bi);
          xor_((const T*)(bi + head), (const T*)(end - tail),
               (const T*)(ai + head),
               (T*)(bi + head));
          if (tail > 0)
            xor_(end - tail, end, ai + (n - tail), end - tail);
        }
        return b;
      }

      template<class Int>
      pyublas::numpy_vector<Int> xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a,
                                                        pyublas::numpy_vector<Int> b) {
        xor_(b.begin(), b.end(), a.begin(), b.begin());
        return b;
      }
    }

    BOOST_PYTHON_MODULE(xorcpp) {
      py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
      py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
    }

Used from Python as follows:

    import os
    import xorcpp

    a = os.urandom(2**20)
    b = os.urandom(2**20)
    c = xorcpp.xorcpp_inplace(a, b)  # this calls xorcpp_str_inplace()
+33
Apr 02 '10 at 10:15

Here are my results for Cython:

    slow_xor               0.456888198853
    faster_xor             0.400228977203
    cython_xor             0.232881069183
    cython_xor_vectorised  0.171468019485

Vectorising in Cython shaves about 25% off the plain for loop on my computer. However, more than half of the time is spent building the Python string (the return statement) - I don't think the extra copy can be avoided (legally), since the array may contain null bytes.

The illegal way would be to take a Python string and mutate it in place, which would double the speed of the function.

xor.py

    from time import time
    from os import urandom
    from numpy import frombuffer, bitwise_xor, byte, uint64
    import pyximport; pyximport.install()
    import xor_

    def slow_xor(aa, bb):
        a = frombuffer(aa, dtype=byte)
        b = frombuffer(bb, dtype=byte)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

    def faster_xor(aa, bb):
        a = frombuffer(aa, dtype=uint64)
        b = frombuffer(bb, dtype=uint64)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

    aa = urandom(2**20)
    bb = urandom(2**20)

    def test_it():
        t = time()
        for x in xrange(100):
            slow_xor(aa, bb)
        print "slow_xor ", time() - t
        t = time()
        for x in xrange(100):
            faster_xor(aa, bb)
        print "faster_xor", time() - t
        t = time()
        for x in xrange(100):
            xor_.cython_xor(aa, bb)
        print "cython_xor", time() - t
        t = time()
        for x in xrange(100):
            xor_.cython_xor_vectorised(aa, bb)
        print "cython_xor_vectorised", time() - t

    if __name__ == "__main__":
        test_it()

xor_.pyx

    cdef char c[1048576]

    def cython_xor(char *a, char *b):
        cdef int i
        for i in range(1048576):
            c[i] = a[i] ^ b[i]
        return c[:1048576]

    def cython_xor_vectorised(char *a, char *b):
        cdef int i
        # 1048576 bytes / 8 bytes per unsigned long long = 131072 iterations
        for i in range(131072):
            (<unsigned long long *>c)[i] = (<unsigned long long *>a)[i] ^ (<unsigned long long *>b)[i]
        return c[:1048576]
+17
Feb 01 '10 at 0:22

An easy speedup is to use a larger "chunk" size:

    def faster_xor(aa, bb):
        a = frombuffer(aa, dtype=uint64)
        b = frombuffer(bb, dtype=uint64)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

with uint64 also imported from numpy, of course. I timeit this at 4 milliseconds, versus 6 milliseconds for the byte version.
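
A minimal timing sketch for reference, assuming faster_xor, aa and bb are defined at the top level of the script (as in the question); the numbers will of course vary by machine:

    from os import urandom
    from timeit import timeit

    aa = urandom(2**20)
    bb = urandom(2**20)
    # average seconds per call over 100 runs
    print timeit("faster_xor(aa, bb)",
                 setup="from __main__ import faster_xor, aa, bb",
                 number=100) / 100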

+10
Jan 22 '10 at 19:39

Your problem is not the speed of NumPy's xor method, but rather all the buffering and data conversions. Personally, I suspect the point of this post may really have been to brag about Python, because what you are doing here is processing three GIGABYTES of data in timeframes on par with non-interpreted languages, which are inherently faster.

The code below shows that even on my humble Python machine, it can xor "aa" (1MB) with "bb" (1MB) into "c" (1MB) one thousand times (3GB total) in under two seconds. Seriously, how much more improvement do you want? Especially from an interpreted language! 80% of the time was spent calling frombuffer and tostring. The actual xor-ing happens in the other 20%. At 3GB in 2 seconds, you would be hard pressed to improve on that even with plain memcpy in C.

In case this was a real question and not just covert bragging about Python, the answer is to minimize the number and frequency of your type conversions such as frombuffer and tostring. The actual xor-ing is already lightning fast.

    from os import urandom
    from numpy import frombuffer, bitwise_xor, byte, uint64

    def slow_xor(aa, bb):
        a = frombuffer(aa, dtype=byte)
        b = frombuffer(bb, dtype=byte)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

    bb = urandom(2**20)
    aa = urandom(2**20)

    def test_it():
        for x in xrange(1000):
            slow_xor(aa, bb)

    def test_it2():
        a = frombuffer(aa, dtype=uint64)
        b = frombuffer(bb, dtype=uint64)
        for x in xrange(1000):
            c = bitwise_xor(a, b)
            r = c.tostring()

    test_it()
    print 'Slow Complete.'
    # 6 seconds
    test_it2()
    print 'Fast Complete.'
    # under 2 seconds

In any case, test_it2 above does exactly the same amount of xor-ing as test_it, but in 1/5 the time. A 5x speedup should qualify as "substantial", no?

+6
Jan 31 '10 at 21:06

The fastest bitwise XOR is "^". I can type that much faster than bitwise_xor ;-)

+3
Feb 02 '10 at 23:44

Python 3 has int.from_bytes and int.to_bytes , thus:

    x = int.from_bytes(b"a" * (1024*1024), "big")
    y = int.from_bytes(b"b" * (1024*1024), "big")
    (x ^ y).to_bytes(1024*1024, "big")

It's faster than I/O, so it's hard to measure exactly how fast it is; it looks like 0.018..0.020s on my machine. Strangely, the "little"-endian conversion is a little faster.
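
A minimal sketch (Python 3) wrapping this approach into a reusable function; the name xor_bytes_int is just a choice for illustration, and "little" follows the note above about endianness:

    import os

    def xor_bytes_int(a, b):
        # convert to big ints, xor, convert back; len(a) pads leading zeros
        return (int.from_bytes(a, "little") ^
                int.from_bytes(b, "little")).to_bytes(len(a), "little")

    aa = os.urandom(2**20)
    bb = os.urandom(2**20)
    # sanity check against a naive byte-by-byte xor
    assert xor_bytes_int(aa, bb) == bytes(x ^ y for x, y in zip(aa, bb))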

CPython 2.x has the underlying function _PyLong_FromByteArray ; it is not exported, but is accessible through ctypes:

    In [1]: import ctypes
    In [2]: p = ctypes.CDLL(None)
    In [3]: p["_PyLong_FromByteArray"]
    Out[3]: <_FuncPtr object at 0x2cc6e20>

The details of Python 2 remain as an exercise for the reader.

+2
May 6 '14 at 12:55

If you want to perform fast operations on array data types, then you should try Cython (cython.org). If you give it the right declarations, it should be able to compile down to pure C code.

+1
Jan 23 '10 at

How badly do you need the answer to be a string? Note that the c.tostring() method has to copy the data in c to a new string, because Python strings are immutable (and c is mutable). Python 2.6 and 3.1 have a bytearray type, which acts like str ( bytes in Python 3.x) except that it is mutable.

Another optimization is using the out parameter of bitwise_xor to specify where to store the result.

On my machine I get

    slow_xor     (int8):   5.293521 (100.0%)
    outparam_xor (int8):   4.378633 (82.7%)
    slow_xor     (uint64): 2.192234 (41.4%)
    outparam_xor (uint64): 1.087392 (20.5%)

with the code at the end of this post. In particular, note that the method using a pre-allocated buffer is twice as fast as creating a new object (when operating on 8-byte ( uint64 ) chunks). This is consistent with the slower method doing two operations per chunk (xor + copy) versus the faster one doing just one (xor).

Also, FWIW, a ^ b is equivalent to bitwise_xor(a, b) , and a ^= b is equivalent to bitwise_xor(a, b, a) .
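
A throwaway self-check of those equivalences on tiny arrays (a sketch, not part of the timed code):

    import numpy as np

    a = np.frombuffer(b"\x01\x02\x03", dtype=np.uint8).copy()  # .copy() makes it writeable
    b = np.frombuffer(b"\xff\x00\x0f", dtype=np.uint8)

    assert ((a ^ b) == np.bitwise_xor(a, b)).all()

    expected = np.bitwise_xor(a, b)
    a ^= b  # in place, same result as bitwise_xor(a, b, a)
    assert (a == expected).all()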

So, a 5x speedup without writing any external modules :)

    from time import time
    from os import urandom
    from numpy import frombuffer, bitwise_xor, byte, uint64

    def slow_xor(aa, bb, ignore, dtype=byte):
        a = frombuffer(aa, dtype=dtype)
        b = frombuffer(bb, dtype=dtype)
        c = bitwise_xor(a, b)
        r = c.tostring()
        return r

    def outparam_xor(aa, bb, out, dtype=byte):
        a = frombuffer(aa, dtype=dtype)
        b = frombuffer(bb, dtype=dtype)
        c = frombuffer(out, dtype=dtype)
        assert c.flags.writeable
        return bitwise_xor(a, b, c)

    aa = urandom(2**20)
    bb = urandom(2**20)
    cc = bytearray(2**20)

    def time_routine(routine, dtype, base=None, ntimes=1000):
        t = time()
        for x in xrange(ntimes):
            routine(aa, bb, cc, dtype=dtype)
        et = time() - t
        if base is None:
            base = et
        print "%s (%s): %f (%.1f%%)" % (routine.__name__, dtype.__name__,
                                        et, (et/base)*100)
        return et

    def test_it(ntimes=1000):
        base = time_routine(slow_xor, byte, ntimes=ntimes)
        time_routine(outparam_xor, byte, base, ntimes=ntimes)
        time_routine(slow_xor, uint64, base, ntimes=ntimes)
        time_routine(outparam_xor, uint64, base, ntimes=ntimes)
+1
Apr 02 '10 at 20:55

You could try the symmetric difference of the bitsets of Sage.

http://www.sagemath.org/doc/reference/sage/misc/bitset.html

0
Jan 22 '10 at 19:51

The fastest way (speed-wise) would be to do what Max S. recommended: implement it in C.

The supporting code for this task should be fairly simple to write. It is just one function in a module that creates a new string and performs the xor. That's all. When you have implemented one such module, it's easy to take its code as a template. Or you could take a module implemented by somebody else as a simple Python extension module and just throw out everything not needed for your task.

The only really complicated part is getting the reference-counting right. But once you understand how it works, it's manageable; also, the task at hand is really simple (allocate some memory and return it; the parameters are not to be touched, ref-wise).

0
Jan 29 '10 at 22:01


