Thinking out loud: it is funny how often people do 90% of the tedious work and then leave the 10% where the fun begins to someone else. Oh well, more fun for me!
Let me first reproduce the experiment on my i7-4790K with JDK 8u40 EA:
Benchmark                               Mode  Samples    Score    Error  Units
UnsafeCounter_Benchmark.atomicCount    thrpt        5   47.669 ± 18.440  ops/us
UnsafeCounter_Benchmark.lockCount      thrpt        5   14.497 ±  7.815  ops/us
UnsafeCounter_Benchmark.syncNoVCount   thrpt        5   11.618 ±  2.130  ops/us
UnsafeCounter_Benchmark.syncVCount     thrpt        5   11.337 ±  4.532  ops/us
UnsafeCounter_Benchmark.unsafeCount    thrpt        5    7.452 ±  1.042  ops/us
UnsafeCounter_Benchmark.unsafeGACount  thrpt        5   43.332 ±  3.435  ops/us
UnsafeCounter_Benchmark.unsyncCount    thrpt        5  102.773 ± 11.943  ops/us
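The benchmark class itself is not shown in this thread. For anyone who wants to replicate the numbers, a minimal JMH harness along these lines should do; the Counter interface and UnsafeCASCounter come from the question, while the increment() method name, the field name, and the @Threads value are my assumptions (only unsafeCount is shown here, the other tests follow the same pattern):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)                   // one shared instance, so all worker threads contend on it
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)    // matches the ops/us in the tables
@Threads(4)                               // keep this at or below the hardware thread count, see the P.S. below
public class UnsafeCounter_Benchmark {

    private final Counter unsafeCounter = new UnsafeCASCounter();

    @Benchmark
    public void unsafeCount() {
        unsafeCounter.increment();
    }
}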
Indeed, the unsafeCount result looks suspicious. In fact, you should treat all data as suspicious until you have verified it. For nano-benchmarks like these, you have to inspect the generated code to make sure you are really measuring what you think you are measuring. In JMH, this is easy to do with -prof perfasm. If you look at the hottest region of unsafeCount, you will notice some fun things:
  0.12%    0.04%  0x00007fb45518e7d1: mov    0x10(%r10),%rax
 17.03%   23.44%  0x00007fb45518e7d5: test   %eax,0x17318825(%rip)
  0.21%    0.07%  0x00007fb45518e7db: mov    0x18(%r10),%r11    ; getfield offset
 30.33%   10.77%  0x00007fb45518e7df: mov    %rax,%r8
  0.00%           0x00007fb45518e7e2: add    $0x1,%r8
  0.01%           0x00007fb45518e7e6: cmp    0xc(%r10),%r12d    ; typecheck
                  0x00007fb45518e7ea: je     0x00007fb45518e80b ; bail to v-call
  0.83%    0.48%  0x00007fb45518e7ec: lock cmpxchg %r8,(%r10,%r11,1)
 33.27%   25.52%  0x00007fb45518e7f2: sete   %r8b
  0.12%    0.01%  0x00007fb45518e7f6: movzbl %r8b,%r8d
  0.03%    0.04%  0x00007fb45518e7fa: test   %r8d,%r8d
                  0x00007fb45518e7fd: je     0x00007fb45518e7d1 ; back branch
What this tells us: a) the offset field is re-read on every iteration, because the CAS memory effects imply a volatile read, and so the field has to be pessimistically re-read; b) the fun part is that the unsafe field is also re-read for the typecheck, for the same reason.
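To make (a) and (b) concrete, here is roughly what the current counter looks like, reconstructed from the diffs below; the increment() method name is my assumption. With unsafe and offset as instance fields, both loads sit inside the CAS retry loop and the JIT cannot hoist them out:

import sun.misc.Unsafe;

public class UnsafeCASCounter implements Counter {
    private volatile long counter = 0;

    // Instance fields: the JIT has to reload both on every loop iteration,
    // because the CAS carries volatile-read semantics.
    private final Unsafe unsafe = UnsafeHelper.unsafe;
    private long offset;

    {
        try {
            offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void increment() {   // method name assumed
        long before = counter;
        while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
            before = counter;   // re-read and retry on CAS failure
        }
    }
}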
This is why high-performance code should look like this:
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
@@ -5,13 +5,13 @@
 import sun.misc.Unsafe;
 
 public class UnsafeCASCounter implements Counter {
     private volatile long counter = 0;
-    private final Unsafe unsafe = UnsafeHelper.unsafe;
-    private long offset;
-    {
+    private static final Unsafe unsafe = UnsafeHelper.unsafe;
+    private static final long offset;
+    static {
         try {
             offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
         } catch (NoSuchFieldException e) {
-            e.printStackTrace();
+            throw new IllegalStateException("Whoops!");
         }
     }
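The static final fields are the point here: the JIT treats them as compile-time constants, so neither the Unsafe reference nor the offset has to be re-read inside the loop, and the typecheck on the unsafe receiver goes away with them, as the next listing confirms. The UnsafeHelper referenced above is not shown in the thread; presumably it is the usual reflective accessor, something along these lines:

import java.lang.reflect.Field;

import sun.misc.Unsafe;

// A sketch of what UnsafeHelper presumably does: pull out the private
// Unsafe.theUnsafe singleton via reflection, because Unsafe.getUnsafe()
// refuses to work for non-bootstrap callers.
public class UnsafeHelper {
    public static final Unsafe unsafe;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}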
If you do this, unsafeCount performance will go up:
Benchmark                               Mode  Samples   Score   Error  Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5   9.733 ± 0.673  ops/us
... which brings it reasonably close to the synchronized tests, within the error bounds. If you look at the -prof perfasm output now, this is the hot loop of unsafeCount:
  0.08%    0.02%  0x00007f7575191900: mov    0x10(%r10),%rax
 28.09%   28.64%  0x00007f7575191904: test   %eax,0x161286f6(%rip)
  0.23%    0.08%  0x00007f757519190a: mov    %rax,%r11
                  0x00007f757519190d: add    $0x1,%r11
                  0x00007f7575191911: lock cmpxchg %r11,0x10(%r10)
 47.27%   23.48%  0x00007f7575191917: sete   %r8b
  0.10%           0x00007f757519191b: movzbl %r8b,%r8d
  0.02%           0x00007f757519191f: test   %r8d,%r8d
                  0x00007f7575191922: je     0x00007f7575191900
This loop is very tight, and it is hard to see how to make it go any faster as-is: we spend most of the time loading the "current" value and doing the CAS itself. But we are also contending heavily! To see whether contention is the dominant cost, add a backoff on CAS failure:
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
@@ -20,6 +21,7 @@ public class UnsafeCASCounter implements Counter {
         long before = counter;
         while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
             before = counter;
+            Blackhole.consumeCPU(1000);
         }
     }
... works:
Benchmark                               Mode  Samples    Score     Error  Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5   99.869 ± 107.933  ops/us
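Blackhole.consumeCPU comes from JMH itself (org.openjdk.jmh.infra.Blackhole); it burns the requested number of CPU "tokens" in a way the JIT cannot eliminate, which is what makes it a convenient backoff here. If you want to see how long a 1000-token backoff actually takes on your machine, a quick standalone check (my own snippet, not from the thread) would be:

import org.openjdk.jmh.infra.Blackhole;

public class BackoffCostCheck {
    public static void main(String[] args) {
        long start = System.nanoTime();
        // Burn roughly 1000 "tokens" of CPU work; unlike an empty spin loop,
        // this cannot be optimized away.
        Blackhole.consumeCPU(1000);
        System.out.println("consumeCPU(1000) took ~" + (System.nanoTime() - start) + " ns");
    }
}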
Voila. We now do much more work per loop iteration, but it buys us a lot by easing the contention. I tried to explain this before in the "Nanotrusting the Nanotime" post; it is worth going back there to read up on the benchmarking methodology, especially where heavyweight operations are being measured. This pitfall taints the entire experiment, not just unsafeCount.
An exercise for the OP and other interested readers: explain why unsafeGACount and atomicCount are so much faster than the other tests. You now have the tools to find out.
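As a starting point for that exercise: judging by the name, unsafeGACount presumably uses Unsafe.getAndAddLong instead of a hand-rolled CAS retry loop, and on JDK 8 AtomicLong.getAndIncrement() delegates to that same call. A sketch of the two shapes side by side (class and method names are mine, UnsafeHelper as sketched above):

import sun.misc.Unsafe;

public class CasVsGetAndAdd {
    private volatile long counter = 0;

    private static final Unsafe unsafe = UnsafeHelper.unsafe;
    private static final long offset;
    static {
        try {
            offset = unsafe.objectFieldOffset(CasVsGetAndAdd.class.getDeclaredField("counter"));
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    // CAS retry loop: may spin and retry repeatedly while other threads keep winning.
    public void incrementWithCas() {
        long before = counter;
        while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
            before = counter;
        }
    }

    // Fetch-and-add: a single atomic update (lock xadd on x86), no retry loop at all.
    public void incrementWithGetAndAdd() {
        unsafe.getAndAddLong(this, offset, 1L);
    }
}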
P.S. Starting N threads on a machine with C hardware threads (C < N) is asking for trouble: you may think you are measuring contention among N threads, but you only ever contend among the C threads that can actually run at a time. It is especially funny when people spin up 1000 threads on 4-core machines...
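If in doubt, size the thread count from the hardware rather than guessing; a trivial check (my own snippet) is below, and JMH can do the same for you with -t max.

public class ThreadSizing {
    public static void main(String[] args) {
        // The number of hardware threads the JVM sees; cap benchmark thread
        // counts at this value instead of picking an arbitrarily large number.
        int hardwareThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("Run contended benchmarks with at most " + hardwareThreads + " threads");
    }
}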
P.P.S. Time check: it took me about 10 minutes to do the profiling and the additional experiments, and about 20 minutes to write this up. How much time did you spend replicating the result by hand? ;)