Slow performance when using a functor to pass a function or operator as a parameter to a C++ template?

I have a family of complex functions that perform very similar tasks, differing only in a single operator right in the middle of the function. A simplified version of my code might look something like this:

    #include <assert.h>

    static void memopXor(char * buffer1, char * buffer2, char * res, unsigned n){
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = buffer1[x] ^ buffer2[x];
        }
    }

    static void memopPlus(char * buffer1, char * buffer2, char * res, unsigned n){
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = buffer1[x] + buffer2[x];
        }
    }

    static void memopMul(char * buffer1, char * buffer2, char * res, unsigned n){
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = buffer1[x] * buffer2[x];
        }
    }

    int main(int argc, char ** argv){
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};

        char res1[5] = {};
        memopXor(b1, b2, res1, 5);
        assert(res1[0] == 0);
        assert(res1[1] == 0);
        assert(res1[2] == 0);
        assert(res1[3] == 0);
        assert(res1[4] == 0);

        char res2[5] = {};
        memopPlus(b1, b2, res2, 5);
        assert(res2[0] == 0);
        assert(res2[1] == 2);
        assert(res2[2] == 4);
        assert(res2[3] == 6);
        assert(res2[4] == 8);

        char res3[5] = {};
        memopMul(b1, b2, res3, 5);
        assert(res3[0] == 0);
        assert(res3[1] == 1);
        assert(res3[2] == 4);
        assert(res3[3] == 9);
        assert(res3[4] == 16);
    }

It seems that C++ templates could be used to avoid this code duplication, so I was looking for a way to change my code into something like the following (pseudocode):

    #include <assert.h>

    template <FUNCTION>
    void memop<FUNCTION>(char * buffer1, char * buffer2, char * res, size_t n){
        for (size_t x = 0 ; x < n ; x++){
            res[x] = FUNCTION(buffer1[x], buffer2[x]);
        }
    }

    int main(int argc, char ** argv){
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};

        char res1[5] = {};
        memop<operator^>(b1, b2, res1, 5);
        assert(res1[0] == 0);
        assert(res1[1] == 0);
        assert(res1[2] == 0);
        assert(res1[3] == 0);
        assert(res1[4] == 0);

        char res2[5] = {};
        memop<operator+>(b1, b2, res2, 5);
        assert(res2[0] == 0);
        assert(res2[1] == 2);
        assert(res2[2] == 4);
        assert(res2[3] == 6);
        assert(res2[4] == 8);

        char res3[5] = {};
        memop<operator*>(b1, b2, res3, 5);
        assert(res3[0] == 0);
        assert(res3[1] == 1);
        assert(res3[2] == 4);
        assert(res3[3] == 9);
        assert(res3[4] == 16);
    }

The difficulty is that I am not willing to accept any slowdown in the generated code. In particular, solutions involving indirect calls (through a vtable or a function pointer) are off the table.
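For example, the kind of solution I want to avoid would pass the operation as a plain function pointer, roughly like the sketch below (memopFn and xorOp are just illustrative names), where every element goes through an indirect call unless the compiler happens to inline it:

    static char xorOp(char a, char b) { return a ^ b; }

    // The operation arrives as an opaque function pointer, so each iteration
    // performs an indirect call unless the compiler can prove and inline the target.
    static void memopFn(char * buffer1, char * buffer2, char * res, unsigned n,
                        char (*op)(char, char)){
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = op(buffer1[x], buffer2[x]);
        }
    }

    // usage: memopFn(b1, b2, res1, 5, xorOp);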

A common C++ solution to this problem is to wrap the operator in the operator() method of a functor class, giving something like the code below:

    #include <assert.h>

    template <typename Op>
    void memop(char * buffer1, char * buffer2, char * res, unsigned n){
        Op o;
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = o(buffer1[x], buffer2[x]);
        }
    }

    struct Xor {
        char operator()(char a, char b){
            return a ^ b;
        }
    };

    struct Plus {
        char operator()(char a, char b){
            return a + b;
        }
    };

    struct Mul {
        char operator()(char a, char b){
            return a * b;
        }
    };

    int main(int argc, char ** argv){
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};

        char res1[5] = {};
        memop<Xor>(b1, b2, res1, 5);
        assert(res1[0] == 0);
        assert(res1[1] == 0);
        assert(res1[2] == 0);
        assert(res1[3] == 0);
        assert(res1[4] == 0);

        char res2[5] = {};
        memop<Plus>(b1, b2, res2, 5);
        assert(res2[0] == 0);
        assert(res2[1] == 2);
        assert(res2[2] == 4);
        assert(res2[3] == 6);
        assert(res2[4] == 8);

        char res3[5] = {};
        memop<Mul>(b1, b2, res3, 5);
        assert(res3[0] == 0);
        assert(res3[1] == 1);
        assert(res3[2] == 4);
        assert(res3[3] == 9);
        assert(res3[4] == 16);
    }

Is there any penalty for doing this?

+4
3 answers

The code you posted is pretty much useless as a benchmark. Wrapped up like this:

    char cversion() {
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};
        char res1[5] = {};
        memopXor(b1, b2, res1, 5);
        return res1[4];
    }

    char cppversion() {
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};
        char res1[5] = {};
        memop<Xor>(b1, b2, res1, 5);
        return res1[4];
    }

it compiles to the following LLVM IR:

    define signext i8 @cversion()() nounwind uwtable readnone {
      ret i8 0
    }

    define signext i8 @cppversion()() nounwind uwtable readnone {
      ret i8 0
    }

That is, the compiler performs the entire calculation at compile time.

So I took the liberty of defining a new function:

    void cppmemopXor(char * buffer1, char * buffer2, char * res, unsigned n)
    {
        memop<Xor>(buffer1, buffer2, res, n);
    }

and removed the static qualifier from memopXor, then repeated the experiment. The hand-written memopXor now compiles to:

    define void @memopXor(char*, char*, char*, unsigned int)(i8* nocapture %buffer1, i8* nocapture %buffer2, i8* nocapture %res, i32 %n) nounwind uwtable {
      %1 = icmp eq i32 %n, 0
      br i1 %1, label %._crit_edge, label %.lr.ph

    .lr.ph:                                           ; preds = %.lr.ph, %0
      %indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
      %2 = getelementptr inbounds i8* %buffer1, i64 %indvars.iv
      %3 = load i8* %2, align 1, !tbaa !0
      %4 = getelementptr inbounds i8* %buffer2, i64 %indvars.iv
      %5 = load i8* %4, align 1, !tbaa !0
      %6 = xor i8 %5, %3
      %7 = getelementptr inbounds i8* %res, i64 %indvars.iv
      store i8 %6, i8* %7, align 1, !tbaa !0
      %indvars.iv.next = add i64 %indvars.iv, 1
      %lftr.wideiv = trunc i64 %indvars.iv.next to i32
      %exitcond = icmp eq i32 %lftr.wideiv, %n
      br i1 %exitcond, label %._crit_edge, label %.lr.ph

    ._crit_edge:                                      ; preds = %.lr.ph, %0
      ret void
    }

And the C++ template version:

    define void @cppmemopXor(char*, char*, char*, unsigned int)(i8* nocapture %buffer1, i8* nocapture %buffer2, i8* nocapture %res, i32 %n) nounwind uwtable {
      %1 = icmp eq i32 %n, 0
      br i1 %1, label %_ZL5memopI3XorEvPcS1_S1_j.exit, label %.lr.ph.i

    .lr.ph.i:                                         ; preds = %.lr.ph.i, %0
      %indvars.iv.i = phi i64 [ %indvars.iv.next.i, %.lr.ph.i ], [ 0, %0 ]
      %2 = getelementptr inbounds i8* %buffer1, i64 %indvars.iv.i
      %3 = load i8* %2, align 1, !tbaa !0
      %4 = getelementptr inbounds i8* %buffer2, i64 %indvars.iv.i
      %5 = load i8* %4, align 1, !tbaa !0
      %6 = xor i8 %5, %3
      %7 = getelementptr inbounds i8* %res, i64 %indvars.iv.i
      store i8 %6, i8* %7, align 1, !tbaa !0
      %indvars.iv.next.i = add i64 %indvars.iv.i, 1
      %lftr.wideiv = trunc i64 %indvars.iv.next.i to i32
      %exitcond = icmp eq i32 %lftr.wideiv, %n
      br i1 %exitcond, label %_ZL5memopI3XorEvPcS1_S1_j.exit, label %.lr.ph.i

    _ZL5memopI3XorEvPcS1_S1_j.exit:                   ; preds = %.lr.ph.i, %0
      ret void
    }

As expected, the two are structurally identical, since the functor code was completely inlined (which can be seen even without understanding the IR).

Note that this is not an isolated result. For example, std::sort runs two to three times faster than qsort because it uses a functor instead of an indirect function call. Of course, using a function template and a functor means that every distinct instantiation generates new code, just as if you had written the function by hand, but that is exactly what you were doing manually anyway.
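To see what that comparator difference looks like in code, here is a rough sketch (not the benchmark behind the two-to-three-times figure, and assuming C++11 for the lambda; a named functor works the same way): qsort only ever sees an opaque function pointer, while std::sort receives the comparison as a template argument that can be inlined.

    #include <algorithm>
    #include <cstdlib>

    // qsort only sees an opaque function pointer, so every comparison is an
    // indirect call through compareInt.
    static int compareInt(const void * a, const void * b){
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    void sortBoth(int * data1, int * data2, std::size_t n){
        std::qsort(data1, n, sizeof(int), compareInt);

        // std::sort gets the comparison as part of its template arguments, so the
        // lambda body can be inlined into the generated sorting code.
        std::sort(data2, data2 + n, [](int a, int b){ return a < b; });
    }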

+5

Only you can tell whether something is fast enough for your needs. But I went ahead and ran your code on my own box to see what would happen:

    #include <iostream> // for std::cout

    int main(int argc, char ** argv){
        char b1[5] = {0, 1, 2, 3, 4};
        char b2[5] = {0, 1, 2, 3, 4};
        int ans = 0;
        for (int i = 0; i < 100000000; i++) {
            char res1[5] = {};
            memopXor(b1, b2, res1, 5);
            // memop<Xor>(b1, b2, res1, 5);

            char res2[5] = {};
            memopPlus(b1, b2, res2, 5);
            // memop<Plus>(b1, b2, res2, 5);

            char res3[5] = {};
            memopMul(b1, b2, res3, 5);
            // memop<Mul>(b1, b2, res3, 5);

            ans += res1[0] + res2[1] + res3[2]; // prevents optimization
        }
        std::cout << ans << std::endl;
        return 0;
    }

I compiled both versions with -O3 in g++. time reports 2.40 s for the hand-written version and 2.58 s for the template version.

(By the way, I had to fix your memopMul() to actually do the multiplication.)
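If you prefer timing each variant inside the program instead of with time, a small std::chrono helper along these lines works (a sketch assuming C++11; runManualVersion and runTemplateVersion are hypothetical wrappers around the loop above):

    #include <chrono>
    #include <iostream>

    // Runs the callable once and returns the elapsed wall-clock time in seconds.
    template <typename F>
    double timeIt(F f){
        auto start = std::chrono::steady_clock::now();
        f();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(stop - start).count();
    }

    // Hypothetical usage, with each benchmark loop wrapped in a callable:
    // std::cout << "manual:   " << timeIt(runManualVersion)   << " s\n";
    // std::cout << "template: " << timeIt(runTemplateVersion) << " s\n";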

+1

The only problem I could see with the code above is that the compiler may have trouble proving that the memory operands in a memop call do not alias each other; see the C++ aliasing rules.
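If aliasing does turn out to be an issue, one option, assuming GCC or Clang, is the non-standard __restrict__ qualifier; a minimal sketch:

    // Non-standard but widely supported (GCC/Clang spell it __restrict__, MSVC
    // __restrict): promises the compiler that the three buffers never overlap,
    // so stores to res cannot invalidate loads from buffer1 or buffer2.
    template <typename Op>
    void memop(char * __restrict__ buffer1,
               char * __restrict__ buffer2,
               char * __restrict__ res,
               unsigned n){
        Op o;
        for (unsigned x = 0 ; x < n ; x++){
            res[x] = o(buffer1[x], buffer2[x]);
        }
    }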

Remember also that in the template version the compiler generates a separate instantiation for each distinct template argument, which means that three memop calls with three different operations give you three compiled implementations. That should make the generated code almost identical to your original hand-written version.
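For illustration, the three instantiations can even be forced explicitly; each line below emits its own separately optimized function, mirroring memopXor, memopPlus and memopMul:

    // Explicit instantiations of the memop template from the question; each one
    // produces its own function in the object file.
    template void memop<Xor>(char *, char *, char *, unsigned);
    template void memop<Plus>(char *, char *, char *, unsigned);
    template void memop<Mul>(char *, char *, char *, unsigned);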

I agree with the other answer that the functor calls should get inlined. But if performance is critical, benchmark your code before and after the proposed changes, just to be safe; all it takes is one wrong compiler flag to blow things up.

+1

Source: https://habr.com/ru/post/1400140/

