Why, if there is no significant overload of the CIL stack for the second example, does it happen βfasterβ than the first?
Please note that the number of CIL instructions does not accurately reflect the amount of work or memory that will be used. A single instruction can be very low or very high, so counting CIL instructions is not an accurate way to measure "performance".
Also understand that CIL is not what is being executed. JIT compiles CIL into valid machine instructions with an optimization phase, so CIL can be very different from the actual instructions executed.
In the second case, since you are using a non-shared collection, each Push call requires an integer to be boxed, as you defined in CIL.
Boxing the whole effectively creates an object that wraps Int32 for you. Instead of just loading a 32-bit integer onto the stack, it now needs to load a 32-bit integer onto the stack and then pop it, which also effectively loads the object reference onto the stack.
If you check this in the Disassembly window, you will see that the difference between the generic and non-generic versions is dramatic and much more significant than the proposed CIL.
The general version compiles efficiently as a series of calls like:
0000022c nop S.Push(25); 0000022d mov ecx,dword ptr ds:[03834978h] 00000233 mov edx,19h 00000238 cmp dword ptr [ecx],ecx 0000023a call 71618DD0 0000023f nop S.Push(26); 00000240 mov ecx,dword ptr ds:[03834978h] 00000246 mov edx,1Ah 0000024b cmp dword ptr [ecx],ecx 0000024d call 71618DD0 00000252 nop S.Push(27);
Non-scam, on the other hand, should create boxed objects and instead compiles to:
00000645 nop S.Push(25); 00000646 mov ecx,7326560Ch 0000064b call FAAC20B0 00000650 mov dword ptr [ebp-48h],eax 00000653 mov eax,dword ptr ds:[03AF4978h] 00000658 mov dword ptr [ebp+FFFFFEE8h],eax 0000065e mov eax,dword ptr [ebp-48h] 00000661 mov dword ptr [eax+4],19h 00000668 mov eax,dword ptr [ebp-48h] 0000066b mov dword ptr [ebp+FFFFFEE4h],eax 00000671 mov ecx,dword ptr [ebp+FFFFFEE8h] 00000677 mov edx,dword ptr [ebp+FFFFFEE4h] 0000067d mov eax,dword ptr [ecx] 0000067f mov eax,dword ptr [eax+2Ch] 00000682 call dword ptr [eax+18h] 00000685 nop S.Push(26); 00000686 mov ecx,7326560Ch 0000068b call FAAC20B0 00000690 mov dword ptr [ebp-48h],eax 00000693 mov eax,dword ptr ds:[03AF4978h] 00000698 mov dword ptr [ebp+FFFFFEE0h],eax 0000069e mov eax,dword ptr [ebp-48h] 000006a1 mov dword ptr [eax+4],1Ah 000006a8 mov eax,dword ptr [ebp-48h] 000006ab mov dword ptr [ebp+FFFFFEDCh],eax 000006b1 mov ecx,dword ptr [ebp+FFFFFEE0h] 000006b7 mov edx,dword ptr [ebp+FFFFFEDCh] 000006bd mov eax,dword ptr [ecx] 000006bf mov eax,dword ptr [eax+2Ch] 000006c2 call dword ptr [eax+18h] 000006c5 nop
Here you can see the meaning of boxing.
In your case, when boxing, an integer causes the linking of box objects to the stack. On my system, this causes stackoverflow for any calls that exceed Foo(127) (32 bits), which means that integers and object references in the box (4 bytes each) are all stored on the stack, since 127 * 1000 * 8 == 1016000, which is dangerously close to the stack size by default thread size 1 for .NET applications.
When using the universal version, since there is no boxed object, integers must not be stored on the stack, and the same register is reused. This allows you to significantly increase (> 40,000 on my system) before using the stack.
Please note that this will be a CLR version and platform dependent, as there is another xIT / x64 JIT.