I'll write about this, because there is some decent programming advice behind it that should matter to any C# programmer who cares about writing fast code. In general I caution against relying on micro-benchmarks: differences of 15% or less are usually not statistically significant, due to the unpredictability of code execution speed on a modern CPU core. A good way to reduce the odds of measuring something that isn't there is to repeat a test at least 10 times to remove caching effects, and to swap the order of the tests so that code alignment effects can be eliminated.
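A minimal sketch of that measuring discipline, assuming nothing about the original test code: the `MicroBench` class and its `TimeOnce`, `Median` and `Compare` helpers are hypothetical names made up for illustration. It repeats each candidate 10 times and alternates which one runs first. Note that alternating the run order only approximates swapping the tests in the source; it does not change where the jitter places the code.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class MicroBench
{
    // Times `iterations` calls of `body` and returns the elapsed milliseconds.
    public static double TimeOnce(Action body, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) body();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    // Median of the samples; less sensitive to outliers than the mean.
    public static double Median(double[] samples)
    {
        var sorted = samples.OrderBy(x => x).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Runs both candidates `rounds` times, alternating which one goes first,
    // and returns { median of first, median of second } in milliseconds.
    public static double[] Compare(Action first, Action second,
                                   int rounds = 10, int iterations = 1000000)
    {
        var a = new double[rounds];
        var b = new double[rounds];
        for (int r = 0; r < rounds; r++)
        {
            if (r % 2 == 0)
            {
                a[r] = TimeOnce(first, iterations);
                b[r] = TimeOnce(second, iterations);
            }
            else
            {
                b[r] = TimeOnce(second, iterations);
                a[r] = TimeOnce(first, iterations);
            }
        }
        return new[] { Median(a), Median(b) };
    }

    public static void Main()
    {
        // Two arbitrary placeholder workloads; substitute the code under test.
        double sink = 0;
        double[] medians = Compare(() => sink += Math.Sqrt(2.0),
                                   () => sink += Math.Sin(2.0));
        Console.WriteLine("first:  {0:F2} ms", medians[0]);
        Console.WriteLine("second: {0:F2} ms", medians[1]);
    }
}
```

Even with a harness like this, treat a difference of 15% or less between the two medians as noise rather than as a result.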
But what you saw is real: delegates that invoke a static method are in fact slower. The effect is quite small in x86 code but significantly worse in x64 code. Be sure to tinker with the "Prefer 32-bit" and "Platform target" settings on the Project > Properties > Build tab to try both.
Knowing why it is slower requires looking at the machine code that the jitter generates. In the case of delegates, that code is very well hidden. You will not see it when you look at the code with Debug > Windows > Disassembly. And you can't even single-step through it; the managed debugger was written to hide it and completely refuses to display it. I'll have to describe a technique to put the "visual" back into Visual Studio.
I need to talk a bit about "stubs". A stub is a small sliver of machine code that the CLR creates dynamically, in addition to the code that the jitter generates. Stubs are used to implement interfaces; they provide the flexibility that the order of the methods in the method table for a class does not have to match the order of the interface methods. And they matter for delegates, the subject of this question. Stubs also matter for just-in-time compilation: the initial code in a stub points to an entry point into the jitter, so that the method gets compiled when it is first invoked. After that the stub is replaced and now calls the jitted target method. It is the stub that makes the static method call slower; the stub for a static method target is more elaborate than the stub for an instance method.
To see the stubs, you have to wrangle the debugger into showing their code. Some setup is required. First use Tools > Options > Debugging > General. Untick the "Just My Code" checkbox and untick the "Suppress JIT optimization" checkbox. If you use VS2015 then tick "Use Managed Compatibility Mode"; the VS2015 debugger is very buggy and gets seriously in the way of this kind of debugging, and this option provides a workaround by forcing the VS2010 managed debugging engine to be used. Switch to the Release configuration. Then in Project > Properties > Debug, tick the "Enable native code debugging" checkbox. And in Project > Properties > Build, untick the "Prefer 32-bit" checkbox; "Platform target" should be AnyCPU.
Set a breakpoint on the Run() method, and beware that breakpoints are not very accurate in optimized code. Setting it on the method header is best. Once it hits, use Debug > Windows > Disassembly to see the machine code that the jitter generated. The delegate invoke call looks like this on a Haswell core; it might not match what you see if you have an older processor that doesn't support AVX yet:
                funcResult += _func.Invoke(1d, 2d);
    0000001a  mov         rax,qword ptr [rsi+8]               ; rax = _func
    0000001e  mov         rcx,qword ptr [rax+8]               ; rcx = _func._methodBase (?)
    00000022  vmovsd      xmm2,qword ptr [0000000000000070h]  ; arg3 = 2d
    0000002b  vmovsd      xmm1,qword ptr [0000000000000078h]  ; arg2 = 1d
    00000034  call        qword ptr [rax+18h]                 ; call stub
A 64-bit method call passes the first 4 arguments in registers; any additional arguments are passed on the stack (not the case here). The XMM registers are used here because the arguments are floating point. At this point the jitter cannot yet know whether the method is static or instance; that can't be found out until this code actually executes. It is the job of the stub to hide the difference. The code assumes it will be an instance method, which is why I annotated the arguments as arg2 and arg3.
Set a breakpoint on the CALL instruction. The second time it hits (so after the stub no longer points into the jitter) you can have a look at it. This has to be done by hand: use Debug > Windows > Registers and copy the value of the RAX register. Then use Debug > Windows > Memory > Memory 1 and paste the value, put "0x" in front of it and add 0x18. Right-click that window and select "8-byte Integer", then copy the first displayed value. That is the address of the stub code.
Now the trick: at this point the managed debugging engine is still being used and will not let you look at the stub code. You have to force a mode switch so that the unmanaged debugging engine is in control. Use Debug > Windows > Call Stack and double-click a method call at the bottom, such as RtlUserThreadStart. That forces the debugger to switch engines. Now you are good to go and can paste the address into the Address box of the Disassembly window, with "0x" in front of it. Out pops the stub code:
    00007FFCE66D0100  jmp         00007FFCE66D0E40
A very simple one, a straight jump to the delegate target method. This will be fast code. The jitter guessed correctly at an instance method, and the delegate object already provided the this argument in the RCX register, so nothing special needs to be done.
Proceed to the second test and do exactly the same thing to look at the stub for the static method call. Now the stub is very different:
    000001FE559F0850  mov         rax,rsp               ; ?
    000001FE559F0853  mov         r11,rcx               ; r11 = _func (?)
    000001FE559F0856  movaps      xmm0,xmm1             ; shuffle arg3 into right register
    000001FE559F0859  movaps      xmm1,xmm2             ; shuffle arg2 into right register
    000001FE559F085C  mov         r10,qword ptr [r11+20h]  ; r10 = _func.Method
    000001FE559F0860  add         r11,20h               ; ?
    000001FE559F0864  jmp         r10                   ; jump to _func.Method
The code is a bit wonky and not optimal. Microsoft could probably do a better job here, and I'm not 100% sure that I annotated it correctly. I'm guessing that the unnecessary mov rax,rsp instruction is only relevant for stubs to methods with more than 4 arguments. I have no idea why the add instruction is necessary. The detail that matters most is the XMM register moves: the stub has to reshuffle the arguments because the static method does not take a this argument. It is this reshuffling requirement that makes the code slower.
You can do the same exercise with the x86 jitter. The static method stub now looks like this:
    04F905B4  mov         eax,ecx
    04F905B6  add         eax,10h
    04F905B9  jmp         dword ptr [eax]      ; jump to _func.Method
Much simpler than the 64-bit stub, which is why 32-bit code doesn't suffer from the slowdown nearly as much. One reason it is so different is that 32-bit code passes floating point values on the FPU stack, so they don't have to be reshuffled. This won't necessarily be faster when you use integral or object arguments.
Very arcane, I hope I didn't put everybody to sleep yet. Beware that I might have gotten some annotations wrong; I don't fully understand stubs and the way the CLR cooks delegate objects to make code as fast as possible. But there is certainly decent programming advice here. You really should favor instance methods as delegate targets; making them static is not an optimization.
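To make that advice concrete, here is a small sketch, not the code from the question: the `Adder` class and its `Add` and `AddStatic` methods are hypothetical names invented for illustration. It creates one delegate over an instance method and one over a static method with the same signature, then uses the documented `Delegate.Target` and `MethodInfo.IsStatic` properties to show which kind of target each delegate has.

```csharp
using System;

public sealed class Adder
{
    // Instance method target: the delegate object supplies `this`.
    public double Add(double a, double b) { return a + b; }

    // Static method target with the same signature: there is no `this`.
    public static double AddStatic(double a, double b) { return a + b; }
}

public static class Program
{
    public static void Main()
    {
        Func<double, double, double> viaInstance = new Adder().Add;
        Func<double, double, double> viaStatic = Adder.AddStatic;

        // Target is the bound object for an instance method, null for a static one.
        Console.WriteLine("instance target bound: {0}", viaInstance.Target != null);
        Console.WriteLine("static target bound:   {0}", viaStatic.Target != null);
        Console.WriteLine("static method flag:    {0}", viaStatic.Method.IsStatic);

        // Both delegates compute the same result; only the call path differs.
        Console.WriteLine(viaInstance(1d, 2d) == viaStatic(1d, 2d));
    }
}
```

If the method you want to call through a delegate needs no instance state, this suggests leaving it as an instance method on an object you already have rather than converting it to static purely for speed. Whether the difference is measurable in your own code is something to verify with a careful benchmark.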