Thinking out loud: it is funny how often people do 90% of the tedious work and then leave the 10% where the fun begins to someone else. Oh well, more fun for me!
Let me first reproduce the experiment on my i7-4790K with JDK 8u40 EA:
Benchmark                               Mode  Samples    Score    Error  Units
UnsafeCounter_Benchmark.atomicCount    thrpt        5   47.669 ± 18.440  ops/us
UnsafeCounter_Benchmark.lockCount      thrpt        5   14.497 ±  7.815  ops/us
UnsafeCounter_Benchmark.syncNoVCount   thrpt        5   11.618 ±  2.130  ops/us
UnsafeCounter_Benchmark.syncVCount     thrpt        5   11.337 ±  4.532  ops/us
UnsafeCounter_Benchmark.unsafeCount    thrpt        5    7.452 ±  1.042  ops/us
UnsafeCounter_Benchmark.unsafeGACount  thrpt        5   43.332 ±  3.435  ops/us
UnsafeCounter_Benchmark.unsyncCount    thrpt        5  102.773 ± 11.943  ops/us
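The benchmark class itself is not shown in this thread. For anyone who wants to replicate the numbers, a minimal JMH harness along these lines should do; the Counter interface and UnsafeCASCounter come from the question, while the increment() method name, the field name, and the @Threads value are my assumptions (only unsafeCount is shown here, the other tests follow the same pattern):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)                   // one shared instance, so all worker threads contend on it
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)    // matches the ops/us in the tables
@Threads(4)                               // keep this at or below the hardware thread count, see the P.S. below
public class UnsafeCounter_Benchmark {

    private final Counter unsafeCounter = new UnsafeCASCounter();

    @Benchmark
    public void unsafeCount() {
        unsafeCounter.increment();
    }
}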
Indeed, the unsafeCount result looks suspicious. In fact, you should treat all data as suspicious until you have verified it. For nano-benchmarks like these, you have to inspect the generated code to make sure you are really measuring what you think you are measuring. In JMH, this is easy to do with -prof perfasm. If you look at the hottest region of unsafeCount, you will notice some fun things:
  0.12%    0.04%  0x00007fb45518e7d1: mov    0x10(%r10),%rax
 17.03%   23.44%  0x00007fb45518e7d5: test   %eax,0x17318825(%rip)
  0.21%    0.07%  0x00007fb45518e7db: mov    0x18(%r10),%r11    ; getfield offset
 30.33%   10.77%  0x00007fb45518e7df: mov    %rax,%r8
  0.00%           0x00007fb45518e7e2: add    $0x1,%r8
  0.01%           0x00007fb45518e7e6: cmp    0xc(%r10),%r12d    ; typecheck
                  0x00007fb45518e7ea: je     0x00007fb45518e80b ; bail to v-call
  0.83%    0.48%  0x00007fb45518e7ec: lock cmpxchg %r8,(%r10,%r11,1)
 33.27%   25.52%  0x00007fb45518e7f2: sete   %r8b
  0.12%    0.01%  0x00007fb45518e7f6: movzbl %r8b,%r8d
  0.03%    0.04%  0x00007fb45518e7fa: test   %r8d,%r8d
                  0x00007fb45518e7fd: je     0x00007fb45518e7d1 ; back branch
What this tells us: a) the offset field is re-read on every iteration, because the CAS memory effects imply a volatile read, and so the field has to be pessimistically re-read; b) the fun part is that the unsafe field is also re-read for the typecheck, for the same reason.
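To make (a) and (b) concrete, here is roughly what the current counter looks like, reconstructed from the diffs below; the increment() method name is my assumption. With unsafe and offset as instance fields, both loads sit inside the CAS retry loop and the JIT cannot hoist them out:

import sun.misc.Unsafe;

public class UnsafeCASCounter implements Counter {
    private volatile long counter = 0;

    // Instance fields: the JIT has to reload both on every loop iteration,
    // because the CAS carries volatile-read semantics.
    private final Unsafe unsafe = UnsafeHelper.unsafe;
    private long offset;

    {
        try {
            offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void increment() {   // method name assumed
        long before = counter;
        while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
            before = counter;   // re-read and retry on CAS failure
        }
    }
}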
This is why high-performance code should look like this:
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
@@ -5,13 +5,13 @@
 import sun.misc.Unsafe;
 
 public class UnsafeCASCounter implements Counter {
     private volatile long counter = 0;
-    private final Unsafe unsafe = UnsafeHelper.unsafe;
-    private long offset;
-    {
+    private static final Unsafe unsafe = UnsafeHelper.unsafe;
+    private static final long offset;
+    static {
         try {
             offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
         } catch (NoSuchFieldException e) {
-            e.printStackTrace();
+            throw new IllegalStateException("Whoops!");
         }
     }
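The static final fields are the point here: the JIT treats them as compile-time constants, so neither the Unsafe reference nor the offset has to be re-read inside the loop, and the typecheck on the unsafe receiver goes away with them, as the next listing confirms. The UnsafeHelper referenced above is not shown in the thread; presumably it is the usual reflective accessor, something along these lines:

import java.lang.reflect.Field;

import sun.misc.Unsafe;

// A sketch of what UnsafeHelper presumably does: pull out the private
// Unsafe.theUnsafe singleton via reflection, because Unsafe.getUnsafe()
// refuses to work for non-bootstrap callers.
public class UnsafeHelper {
    public static final Unsafe unsafe;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}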
If you do this, unsafeCount performance will go up:
Benchmark                               Mode  Samples   Score   Error  Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5   9.733 ± 0.673  ops/us
... which brings it reasonably close to the synchronized tests, within the error bounds. If you look at the -prof perfasm output now, this is the hot loop of unsafeCount:
  0.08%    0.02%  0x00007f7575191900: mov    0x10(%r10),%rax
 28.09%   28.64%  0x00007f7575191904: test   %eax,0x161286f6(%rip)
  0.23%    0.08%  0x00007f757519190a: mov    %rax,%r11
                  0x00007f757519190d: add    $0x1,%r11
                  0x00007f7575191911: lock cmpxchg %r11,0x10(%r10)
 47.27%   23.48%  0x00007f7575191917: sete   %r8b
  0.10%           0x00007f757519191b: movzbl %r8b,%r8d
  0.02%           0x00007f757519191f: test   %r8d,%r8d
                  0x00007f7575191922: je     0x00007f7575191900
This loop is very tight, and it is hard to see how to make it go any faster as-is: we spend most of the time loading the "current" value and doing the CAS itself. But we are also contending heavily! To see whether contention is the dominant cost, add a backoff on CAS failure:
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java
@@ -20,6 +21,7 @@ public class UnsafeCASCounter implements Counter {
         long before = counter;
         while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
             before = counter;
+            Blackhole.consumeCPU(1000);
         }
     }
... works:
Benchmark                               Mode  Samples    Score     Error  Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5   99.869 ± 107.933  ops/us
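Blackhole.consumeCPU comes from JMH itself (org.openjdk.jmh.infra.Blackhole); it burns the requested number of CPU "tokens" in a way the JIT cannot eliminate, which is what makes it a convenient backoff here. If you want to see how long a 1000-token backoff actually takes on your machine, a quick standalone check (my own snippet, not from the thread) would be:

import org.openjdk.jmh.infra.Blackhole;

public class BackoffCostCheck {
    public static void main(String[] args) {
        long start = System.nanoTime();
        // Burn roughly 1000 "tokens" of CPU work; unlike an empty spin loop,
        // this cannot be optimized away.
        Blackhole.consumeCPU(1000);
        System.out.println("consumeCPU(1000) took ~" + (System.nanoTime() - start) + " ns");
    }
}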
Voila. We now do much more work per loop iteration, but it buys us a lot by easing the contention. I tried to explain this before in the "Nanotrusting the Nanotime" post; it is worth going back there to read up on the benchmarking methodology, especially where heavyweight operations are being measured. This pitfall taints the entire experiment, not just unsafeCount.
An exercise for the OP and other interested readers: explain why unsafeGACount and atomicCount are so much faster than the other tests. You now have the tools to find out.
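As a starting point for that exercise: judging by the name, unsafeGACount presumably uses Unsafe.getAndAddLong instead of a hand-rolled CAS retry loop, and on JDK 8 AtomicLong.getAndIncrement() delegates to that same call. A sketch of the two shapes side by side (class and method names are mine, UnsafeHelper as sketched above):

import sun.misc.Unsafe;

public class CasVsGetAndAdd {
    private volatile long counter = 0;

    private static final Unsafe unsafe = UnsafeHelper.unsafe;
    private static final long offset;
    static {
        try {
            offset = unsafe.objectFieldOffset(CasVsGetAndAdd.class.getDeclaredField("counter"));
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    // CAS retry loop: may spin and retry repeatedly while other threads keep winning.
    public void incrementWithCas() {
        long before = counter;
        while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
            before = counter;
        }
    }

    // Fetch-and-add: a single atomic update (lock xadd on x86), no retry loop at all.
    public void incrementWithGetAndAdd() {
        unsafe.getAndAddLong(this, offset, 1L);
    }
}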
P.S. Starting N threads on a machine with C hardware threads (C < N) is asking for trouble: you may think you are measuring contention among N threads, but you only ever contend among the C threads that can actually run at a time. It is especially funny when people spin up 1000 threads on 4-core machines...
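If in doubt, size the thread count from the hardware rather than guessing; a trivial check (my own snippet) is below, and JMH can do the same for you with -t max.

public class ThreadSizing {
    public static void main(String[] args) {
        // The number of hardware threads the JVM sees; cap benchmark thread
        // counts at this value instead of picking an arbitrarily large number.
        int hardwareThreads = Runtime.getRuntime().availableProcessors();
        System.out.println("Run contended benchmarks with at most " + hardwareThreads + " threads");
    }
}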
P.P.S. Time check: it took me about 10 minutes to do the profiling and the additional experiments, and about 20 minutes to write this up. How much time did you spend replicating the result by hand? ;)