Well, that won't be easy. I think you could instrument your assemblies after compilation with performance counter code that runs after blocks of IL. For example, if a section of a method loaded an int onto the stack and then called a static method using that int in optimized code, you could record a count for the load and for the call.
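To make the load-then-call example concrete, here is a rough sketch of the idea in Python. It is only a toy model: the `(opcode, operand)` tuples, the `count` pseudo-instruction, and the names `BODY`, `STATIC_METHODS`, `instrument` and `run` are all made up for illustration. It does not read or rewrite a real assembly.

```python
from collections import Counter

# Toy stand-in for a method body: (opcode, operand) pairs loosely modelled on IL.
BODY = [
    ("ldc.i4", 7),       # load an int onto the stack
    ("call", "square"),  # call a static method that consumes that int
    ("ldc.i4", 3),
    ("add", None),
    ("ret", None),
]

# Hypothetical static methods that the toy "call" opcode can reach.
STATIC_METHODS = {"square": lambda x: x * x}


def instrument(body, watched):
    """Return a copy of body with a counter instruction after each watched opcode."""
    out = []
    for opcode, operand in body:
        out.append((opcode, operand))
        if opcode in watched:
            # "count" is an invented pseudo-instruction, not a real IL opcode.
            out.append(("count", opcode))
    return out


def run(body, counters):
    """Interpret the toy body on a stack machine, updating counters as it goes."""
    stack = []
    for opcode, operand in body:
        if opcode == "ldc.i4":
            stack.append(operand)
        elif opcode == "call":
            stack.append(STATIC_METHODS[operand](stack.pop()))
        elif opcode == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "count":
            counters[operand] += 1
        elif opcode == "ret":
            return stack.pop()
    raise ValueError("method body ended without ret")


counters = Counter()
result = run(instrument(BODY, {"ldc.i4", "call"}), counters)
print(result, dict(counters))
```

In a real tool the `instrument` step would be done by an IL rewriting library on the compiled assembly, and the counts would only tell you what the IL asked for, not what the machine code ended up doing.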
Even with the existing projects for reading and writing IL in managed assemblies, this would be a pretty difficult task to take on.
Of course, some of the instructions your counter recorded may be optimized away when the IL is compiled to x86 / ia64 / x64, but that is a risk you have to accept if you want to profile against an abstract language such as IL.