The code shown above, however, is pretty much useless as a benchmark:
char cversion() {
    char b1[5] = {0, 1, 2, 3, 4};
    char b2[5] = {0, 1, 2, 3, 4};
    char res1[5] = {};
    memopXor(b1, b2, res1, 5);
    return res1[4];
}

char cppversion() {
    char b1[5] = {0, 1, 2, 3, 4};
    char b2[5] = {0, 1, 2, 3, 4};
    char res1[5] = {};
    memop<Xor>(b1, b2, res1, 5);
    return res1[4];
}
It compiles to the following LLVM IR:
define signext i8 @cversion()() nounwind uwtable readnone {
  ret i8 0
}

define signext i8 @cppversion()() nounwind uwtable readnone {
  ret i8 0
}
That is, the compiler performs the entire calculation at compile time.
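A common way to defeat this kind of compile-time evaluation in microbenchmarks, besides hiding the loop behind a function boundary as done next, is an optimization barrier. The helper below is our own illustration (similar in spirit to benchmark::DoNotOptimize from Google Benchmark), not something from the original experiment, and it relies on GCC/Clang extended inline assembly:

```cpp
// Optimization barrier: an empty asm statement that claims to read the
// value and to clobber all of memory. The compiler must therefore
// materialize the value and cannot fold away computations that depend
// on memory the barrier might have touched. GCC/Clang only.
template <typename T>
inline void doNotOptimize(T const &value) {
    asm volatile("" : : "r,m"(value) : "memory");
}
```

Passing the input buffers through such a barrier before the call keeps the XOR loop from being folded into a constant, much like moving it behind a non-inlinable function boundary does.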
So I took the liberty of defining a new function:
void cppmemopXor(char *buffer1, char *buffer2, char *res, unsigned n) {
    memop<Xor>(buffer1, buffer2, res, n);
}
and removing the static qualifier from memopXor, then repeated the experiment:
define void @memopXor(char*, char*, char*, unsigned int)(i8* nocapture %buffer1, i8* nocapture %buffer2, i8* nocapture %res, i32 %n) nounwind uwtable {
  %1 = icmp eq i32 %n, 0
  br i1 %1, label %._crit_edge, label %.lr.ph

.lr.ph:                                           ; preds = %.lr.ph, %0
  %indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
  %2 = getelementptr inbounds i8* %buffer1, i64 %indvars.iv
  %3 = load i8* %2, align 1, !tbaa !0
  %4 = getelementptr inbounds i8* %buffer2, i64 %indvars.iv
  %5 = load i8* %4, align 1, !tbaa !0
  %6 = xor i8 %5, %3
  %7 = getelementptr inbounds i8* %res, i64 %indvars.iv
  store i8 %6, i8* %7, align 1, !tbaa !0
  %indvars.iv.next = add i64 %indvars.iv, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %n
  br i1 %exitcond, label %._crit_edge, label %.lr.ph

._crit_edge:                                      ; preds = %.lr.ph, %0
  ret void
}
And the C++ version with templates:
define void @cppmemopXor(char*, char*, char*, unsigned int)(i8* nocapture %buffer1, i8* nocapture %buffer2, i8* nocapture %res, i32 %n) nounwind uwtable {
  %1 = icmp eq i32 %n, 0
  br i1 %1, label %_ZL5memopI3XorEvPcS1_S1_j.exit, label %.lr.ph.i

.lr.ph.i:                                         ; preds = %.lr.ph.i, %0
  %indvars.iv.i = phi i64 [ %indvars.iv.next.i, %.lr.ph.i ], [ 0, %0 ]
  %2 = getelementptr inbounds i8* %buffer1, i64 %indvars.iv.i
  %3 = load i8* %2, align 1, !tbaa !0
  %4 = getelementptr inbounds i8* %buffer2, i64 %indvars.iv.i
  %5 = load i8* %4, align 1, !tbaa !0
  %6 = xor i8 %5, %3
  %7 = getelementptr inbounds i8* %res, i64 %indvars.iv.i
  store i8 %6, i8* %7, align 1, !tbaa !0
  %indvars.iv.next.i = add i64 %indvars.iv.i, 1
  %lftr.wideiv = trunc i64 %indvars.iv.next.i to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %n
  br i1 %exitcond, label %_ZL5memopI3XorEvPcS1_S1_j.exit, label %.lr.ph.i

_ZL5memopI3XorEvPcS1_S1_j.exit:                   ; preds = %.lr.ph.i, %0
  ret void
}
As expected, the two are structurally identical: the functor code has been completely inlined (which is apparent even without a deep understanding of IR).
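For context, here is a plausible reconstruction of what memopXor and memop<Xor> themselves look like. The real definitions appear earlier in the article; the names, signatures, and bodies below are inferred from the call sites and from the mangled name _ZL5memopI3XorEvPcS1_S1_j in the IR:

```cpp
// C version (reconstruction): the XOR operation is hard-coded into the loop.
static void memopXor(char *buffer1, char *buffer2, char *res, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        res[i] = buffer1[i] ^ buffer2[i];
}

// C++ version (reconstruction): the operation is a functor type, so
// memop<Xor> produces a dedicated instantiation with the call inlined.
struct Xor {
    char operator()(char a, char b) const { return a ^ b; }
};

template <typename Op>
static void memop(char *buffer1, char *buffer2, char *res, unsigned n) {
    Op op;
    for (unsigned i = 0; i < n; ++i)
        res[i] = op(buffer1[i], buffer2[i]);
}
```

Because Xor is a distinct type rather than a function pointer, the compiler sees the operation at the instantiation point and inlines it, which is exactly why the two IR dumps above coincide.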
Note that this is not an isolated result. For example, std::sort runs two to three times faster than qsort, because the comparison is a functor call rather than an indirect call through a function pointer. Of course, using a template function with a functor means that every distinct instantiation generates new code, just as if you had written the function out by hand; but that is exactly what you would have done manually anyway.