The key to getting the two closer together is to ensure that the comparison is fair.
First of all, make sure that the costs associated with running a debug build (such as loading the pdb symbols, as you did) are out of the picture.
Then you need to make sure that the initialization costs are not counted. Obviously, these are real costs and may matter to some people, but in this case we are interested in the loop itself.
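Keeping one-time costs out of the measurement can be sketched roughly like this. It is a Python illustration only, not the code from the question: `time_loop` and `sample_loop` are hypothetical names, and the loop body is a stand-in assumed for the example. The idea is simply to run the work once untimed, then time repeated runs and keep the best.

```python
import time


def time_loop(loop_body, warmup_runs=1, timed_runs=5):
    """Return the best wall-clock time in seconds, excluding warm-up runs."""
    # Untimed warm-up: one-time costs (JIT compilation on .NET, cache
    # warming, lazy initialization) are paid here, before the clock starts.
    for _ in range(warmup_runs):
        loop_body()

    best = float("inf")
    for _ in range(timed_runs):
        start = time.perf_counter()
        loop_body()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def sample_loop():
    # Hypothetical stand-in for the loop under test: count values that
    # pass two modulo tests.
    total = 0
    for i in range(1, 100_000):
        if i % 16 == 0 and i % 3 == 0:
            total += 1
    return total


best_seconds = time_loop(sample_loop)
print(f"best of 5 timed runs: {best_seconds:.6f} s")
```

Taking the minimum rather than the mean is a design choice: it discards runs disturbed by other activity on the machine, which is usually what you want when comparing two loops.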
Next, you need to deal with platform-specific behavior. If you are on a 64-bit Windows machine, you may be running in either 32-bit or 64-bit mode. In 64-bit mode the JIT differs in many respects, often changing the resulting code considerably. In particular, and I would guess pertinently, you get access to twice as many general-purpose registers.
In this case, the inner part of the loop, naively translated into machine code, has to load into registers the constants used in the modulo tests. If there are not enough registers to hold everything needed in the loop, then it must pull them in from memory. Even from the level 1 cache, this would be a significant hit compared to keeping everything in registers.
In VS 2010, MS changed the default target from anycpu to x86. I have nothing like the resources or customer-facing knowledge of MSFT, so I won't try to second-guess that. However, anyone looking at something like the performance analysis you are doing should definitely try both.
Once these differences are eliminated, the numbers look much more reasonable. Any further differences are likely to require more than educated guesses; instead, they would need an investigation of the actual differences in the generated machine code.
There are a few things about this that I think would be interesting for an optimizing compiler.
- Those already mentioned:
  - The lcm option is interesting, but I can't see a compiler writer bothering.
  - Reducing division to multiplication and masking.
    - I don't know enough about this, but other people have tried it, noting that the divider on more recent Intel chips is significantly better.
    - Perhaps you could even arrange something complex using SSE2.
  - Certainly, the modulo 16 operation is ripe for conversion to a mask or shift.
- A compiler could notice that none of the tests have side effects.
  - It could speculatively try to evaluate several of them at once; on a superscalar processor this could pump things along quite a bit faster, but it would depend heavily on how well the compiler's code layout interacts with the out-of-order execution engine.
- If register pressure was tight, you could implement the constants as a single variable, set at the start of each loop iteration and then incremented as you go along.
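The mask and multiply-instead-of-divide ideas from the list can be checked with a few lines. This is a sketch in Python for illustration only: Python itself gains nothing from these rewrites, the function names are hypothetical, the multiply trick is shown only for divisor 3 on unsigned 32-bit values, and the mask form assumes non-negative inputs. Whether a given JIT actually emits code like this is exactly what you would have to confirm by disassembly.

```python
# ceil(2**33 / 3): scaled reciprocal of 3 for unsigned 32-bit inputs.
MAGIC_DIV3 = 0xAAAAAAAB


def mod16_mask(n):
    # For non-negative n, n % 16 is just the low four bits of n.
    return n & 15


def div3_multiply(n):
    # Valid for 0 <= n < 2**32: multiply by the scaled reciprocal, then
    # shift right by 33 to get floor(n / 3) without a divide instruction.
    return (n * MAGIC_DIV3) >> 33


def divisible_by_3(n):
    # A remainder test built from the multiply-and-shift quotient.
    return n - 3 * div3_multiply(n) == 0


# Spot checks against the ordinary operators.
assert all(mod16_mask(n) == n % 16 for n in range(1000))
assert all(div3_multiply(n) == n // 3 for n in (0, 1, 2, 3, 2**31, 2**32 - 1))
```

The constant works because the rounding error introduced by the scaled reciprocal stays below 1/3 for every 32-bit input, so the shift still lands on the correct quotient.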
These are all guesses and should be treated as idle speculation. If you want to know, disassemble it.