As far as I know, HotSpot cannot optimize redundant calls to pure methods (i.e. calls to pure methods with the same arguments), except indirectly through inlining.
That is, if the redundant calls to a pure method are all inlined into the call site, the redundancy is detected in the inlined code by the usual optimizations such as CSE and GVN, and so the cost of the extra calls usually disappears. If the methods are not inlined, however, I don't believe the JVM marks them as "pure", and therefore it cannot eliminate them (unlike, for example, many native compilers, which can).
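To make this concrete, here is a minimal sketch (the class and method names are mine, not from the question) of the kind of redundancy that inlining exposes:

```java
public class PureCallDemo {
    // A "pure" method: its result depends only on its argument,
    // and it has no side effects.
    static double square(double x) {
        return x * x;
    }

    static double redundant(double a) {
        // Two calls with the same argument. Once square() is inlined,
        // the JIT sees two identical x*x expressions and can CSE them
        // into a single multiply. If square() were NOT inlined, the JIT
        // has no notion that the call is pure and must execute it twice.
        return square(a) + square(a);
    }

    public static void main(String[] args) {
        System.out.println(redundant(3.0)); // 9.0 + 9.0
    }
}
```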
However, given that inlining can remove redundant calls, the question remains: why aren't the redundant Math.sin and Math.cos calls inlined and ultimately optimized away?
As it turns out, Math.sin and Math.cos, like many other Math methods in the JDK, are treated specially as intrinsics. Below you will find a detailed overview of what happens in Java 8 and early versions of Java 9. The disassembly you showed is from a later version of Java 9, which handles this differently; that is described at the end.
The way the JVM handles the trig functions is... complicated. Basically, Math.sin and Math.cos are implemented as intrinsics using native x86 FP instructions, but there are caveats.
There are many extraneous factors in your test that make it difficult to analyze: array allocation, the call to Blackhole.consume, the use of both Math.sin and Math.cos, passing a constant (which can cause some of the trig calls to be optimized away entirely), the use of interface A and an implementation of that interface, and so on.
Instead, let's strip all that away and reduce it to a much simpler version, which simply calls Math.sin(x) three times with the same argument and returns the sum:
```java
private double i = Math.PI / 4 - 0.01;

@Benchmark
public double testMethod() {
    double res0 = Math.sin(i);
    double res1 = Math.sin(i);
    double res2 = Math.sin(i);
    return res0 + res1 + res2;
}
```
Running this with the JMH arguments -bm avgt -tu ns -wi 5 -f 1 -i 5, I get about 40 ns/op, which is at the low end of the range for a single fsin call on modern x86 hardware. Let's take a look at the assembly:
```
[Constants]
  0x00007ff2e4dbbd20 (offset: 0):  0x54442d18   0x3fe921fb54442d18
  0x00007ff2e4dbbd24 (offset: 4):  0x3fe921fb
  0x00007ff2e4dbbd28 (offset: 8):  0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd2c (offset: 12): 0xf4f4f4f4
  0x00007ff2e4dbbd30 (offset: 16): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd34 (offset: 20): 0xf4f4f4f4
  0x00007ff2e4dbbd38 (offset: 24): 0xf4f4f4f4   0xf4f4f4f4f4f4f4f4
  0x00007ff2e4dbbd3c (offset: 28): 0xf4f4f4f4
  (snip)

[Verified Entry Point]
  0x00007ff2e4dbbd50: sub     $0x28,%rsp
  0x00007ff2e4dbbd57: mov     %rbp,0x20(%rsp)    ;*synchronization entry
                                                 ; - stackoverflow.TrigBench::testMethod@-1 (line 38)
  0x00007ff2e4dbbd5c: vmovsd  0x10(%rsi),%xmm2   ;*getfield i
                                                 ; - stackoverflow.TrigBench::testMethod@1 (line 38)
  0x00007ff2e4dbbd61: vmovapd %xmm2,%xmm1
  0x00007ff2e4dbbd65: sub     $0x8,%rsp
  0x00007ff2e4dbbd69: vmovsd  %xmm1,(%rsp)
  0x00007ff2e4dbbd6e: fldl    (%rsp)
  0x00007ff2e4dbbd71: fsin
  0x00007ff2e4dbbd73: fstpl   (%rsp)
  0x00007ff2e4dbbd76: vmovsd  (%rsp),%xmm1
  0x00007ff2e4dbbd7b: add     $0x8,%rsp          ;*invokestatic sin
                                                 ; - stackoverflow.TrigBench::testMethod@20 (line 40)
  0x00007ff2e4dbbd7f: vmovsd  0xffffff99(%rip),%xmm3        ; {section_word}
  0x00007ff2e4dbbd87: vandpd  0xffe68411(%rip),%xmm2,%xmm0  ; {external_word}
  0x00007ff2e4dbbd8f: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbd93: jnb     0x7ff2e4dbbe4c
  0x00007ff2e4dbbd99: vmovq   %xmm3,%r13
  0x00007ff2e4dbbd9e: vmovq   %xmm1,%rbp
  0x00007ff2e4dbbda3: vmovq   %xmm2,%rbx
  0x00007ff2e4dbbda8: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbdac: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbdb6: callq   %r10
  0x00007ff2e4dbbdb9: vmovq   %xmm0,%r14
  0x00007ff2e4dbbdbe: vmovq   %rbx,%xmm2
  0x00007ff2e4dbbdc3: vmovq   %rbp,%xmm1
  0x00007ff2e4dbbdc8: vmovq   %r13,%xmm3
  0x00007ff2e4dbbdcd: vandpd  0xffe683cb(%rip),%xmm2,%xmm0  ;*invokestatic sin
                                                 ; - stackoverflow.TrigBench::testMethod@4 (line 38)
                                                 ; {external_word}
  0x00007ff2e4dbbdd5: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbdd9: jnb     0x7ff2e4dbbe56
  0x00007ff2e4dbbddb: vmovq   %xmm3,%r13
  0x00007ff2e4dbbde0: vmovq   %xmm1,%rbp
  0x00007ff2e4dbbde5: vmovq   %xmm2,%rbx
  0x00007ff2e4dbbdea: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbdee: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbdf8: callq   %r10
  0x00007ff2e4dbbdfb: vmovsd  %xmm0,(%rsp)
  0x00007ff2e4dbbe00: vmovq   %rbx,%xmm2
  0x00007ff2e4dbbe05: vmovq   %rbp,%xmm1
  0x00007ff2e4dbbe0a: vmovq   %r13,%xmm3         ;*invokestatic sin
                                                 ; - stackoverflow.TrigBench::testMethod@12 (line 39)
  0x00007ff2e4dbbe0f: vandpd  0xffe68389(%rip),%xmm2,%xmm0  ;*invokestatic sin
                                                 ; - stackoverflow.TrigBench::testMethod@4 (line 38)
                                                 ; {external_word}
  0x00007ff2e4dbbe17: vucomisd %xmm0,%xmm3
  0x00007ff2e4dbbe1b: jnb     0x7ff2e4dbbe32
  0x00007ff2e4dbbe1d: vmovapd %xmm2,%xmm0
  0x00007ff2e4dbbe21: movabs  $0x7ff2f9abaeec,%r10
  0x00007ff2e4dbbe2b: callq   %r10
  0x00007ff2e4dbbe2e: vmovapd %xmm0,%xmm1        ;*invokestatic sin
                                                 ; - stackoverflow.TrigBench::testMethod@20 (line 40)
  0x00007ff2e4dbbe32: vmovq   %r14,%xmm0
  0x00007ff2e4dbbe37: vaddsd  (%rsp),%xmm0,%xmm0
  0x00007ff2e4dbbe3c: vaddsd  %xmm0,%xmm1,%xmm0  ;*dadd
                                                 ; - stackoverflow.TrigBench::testMethod@30 (line 41)
  0x00007ff2e4dbbe40: add     $0x20,%rsp
  0x00007ff2e4dbbe44: pop     %rbp
  0x00007ff2e4dbbe45: test    %eax,0x15f461b5(%rip)  ; {poll_return}
  0x00007ff2e4dbbe4b: retq
  0x00007ff2e4dbbe4c: vmovq   %xmm1,%r14
  0x00007ff2e4dbbe51: jmpq    0x7ff2e4dbbdcd
  0x00007ff2e4dbbe56: vmovsd  %xmm1,(%rsp)
  0x00007ff2e4dbbe5b: jmp     0x7ff2e4dbbe0f
```
Right away, we see that the generated code loads field i, pushes it onto the x87 FP stack 1, and uses the fsin instruction to compute Math.sin(i).
The next part is also interesting:
```
0x00007ff2e4dbbd7f: vmovsd  0xffffff99(%rip),%xmm3        ; {section_word}
0x00007ff2e4dbbd87: vandpd  0xffe68411(%rip),%xmm2,%xmm0  ; {external_word}
0x00007ff2e4dbbd8f: vucomisd %xmm0,%xmm3
0x00007ff2e4dbbd93: jnb     0x7ff2e4dbbe4c
```
The first instruction loads the constant 0x3fe921fb54442d18, which is 0.785398..., also known as pi / 4. The second ANDs the value of i with some other constant (the vandpd). Then pi / 4 is compared against the result of the AND, and we jump somewhere if the latter is less than or equal to the former.
Huh? If you follow the jump, there is a series of (redundant) vandpd and vucomisd instructions against the same values (and using the same constant for the vandpd), which quickly leads to this sequence:
```
0x00007ff2e4dbbe32: vmovq   %r14,%xmm0
0x00007ff2e4dbbe37: vaddsd  (%rsp),%xmm0,%xmm0
0x00007ff2e4dbbe3c: vaddsd  %xmm0,%xmm1,%xmm0  ;*dadd
...
0x00007ff2e4dbbe4b: retq
```
This simply triples the value returned by the single fsin call (which was stashed in r14 and in [rsp] across the various jumps) and returns.
So we see that the two redundant calls to Math.sin(i) are eliminated in the case where the jumps are taken, although the code still explicitly adds all the values together as if they were distinct, and executes a bunch of redundant move and compare instructions.
If the jump is not taken, we end up with the callq %r10 behavior that you show in your disassembly.
What's going on here?
We find enlightenment if we look at the inline_trig method in library_call.cpp in the JVM hotspot source. At the beginning of this method, we see this (some code omitted for brevity):
```cpp
// Rounding required?  Check for argument reduction!
if (Matcher::strict_fp_requires_explicit_rounding) {
  // (snip)
  // Pseudocode for sin:
  // if (x <= Math.PI / 4.0) {
  //   if (x >= -Math.PI / 4.0) return  fsin(x);
  //   if (x >= -Math.PI / 2.0) return -fcos(x + Math.PI / 2.0);
  // } else {
  //   if (x <=  Math.PI / 2.0) return fcos(x - Math.PI / 2.0);
  // }
  // return StrictMath.sin(x);
  // (snip)
  // Actually, sticking in an 80-bit Intel value into C2 will be tough; it
  // requires a special machine instruction to load it.  Instead we'll try
  // the 'easy' case.  If we really need the extra range +/- PI/2 we'll
  // probably do the math inside the SIN encoding.

  // Make the merge point
  RegionNode* r = new RegionNode(3);
  Node* phi = new PhiNode(r, Type::DOUBLE);

  // Flatten arg so we need only 1 test
  Node *abs = _gvn.transform(new AbsDNode(arg));
  // Node for PI/4 constant
  Node *pi4 = makecon(TypeD::make(pi_4));
  // Check PI/4 : abs(arg)
  Node *cmp = _gvn.transform(new CmpDNode(pi4,abs));
  // Check: If PI/4 < abs(arg) then go slow
  Node *bol = _gvn.transform(new BoolNode( cmp, BoolTest::lt ));
  // Branch either way
  IfNode *iff = create_and_xform_if(control(),bol, PROB_STATIC_FREQUENT, COUNT_UNKNOWN);
  set_control(opt_iff(r,iff));

  // Set fast path result
  phi->init_req(2, n);

  // Slow path - non-blocking leaf call
  Node* call = NULL;
  switch (id) {
  case vmIntrinsics::_dsin:
    call = make_runtime_call(RC_LEAF, OptoRuntime::Math_D_D_Type(),
                             CAST_FROM_FN_PTR(address, SharedRuntime::dsin),
                             "Sin", NULL, arg, top());
    break;
  }
```
Basically, there is a fast path and a slow path for the trig methods: if the argument to sin is greater in magnitude than Math.PI / 4, we take the slow path. The test involves a call to Math.abs, which is what the mysterious vandpd 0xffe68411(%rip),%xmm2,%xmm0 was doing: it clears the top (sign) bit, which is a quick way to compute abs for floating point values in SSE or AVX registers.
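The sign-bit trick the compiler uses here can be reproduced in plain Java (a small sketch; the class and method names are mine):

```java
public class AbsBitTrick {
    // Equivalent of the vandpd-with-a-mask in the generated code:
    // clearing the top (sign) bit of the IEEE-754 encoding yields abs(x).
    static final long SIGN_MASK = 0x7FFFFFFFFFFFFFFFL;

    static double absViaMask(double x) {
        return Double.longBitsToDouble(Double.doubleToRawLongBits(x) & SIGN_MASK);
    }

    public static void main(String[] args) {
        System.out.println(absViaMask(-0.5) == 0.5);
        System.out.println(absViaMask(2.25) == Math.abs(2.25));
        // The fast/slow dispatch then boils down to this comparison:
        double x = Math.PI / 4 - 0.01;
        System.out.println(absViaMask(x) <= Math.PI / 4); // fast path taken
    }
}
```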
Now the rest of the code makes sense: most of what we see is the three fast paths after optimization: the two redundant fsin calls have been eliminated, but the surrounding checks remain. This is probably just a limitation of the optimizer: either it is simply not strong enough to eliminate everything, or the expansion of these intrinsics happens after the optimization pass that would have combined them 2.
On the slow path, we hit the make_runtime_call, which shows up as callq %r10. This is a call to a stub method that implements sin in full, including the argument-reduction issue mentioned in the comments. On my system, the slow path is not necessarily much slower than the fast path: if you change the - to a + in the initialization of i:
```java
private double i = Math.PI / 4 - 0.01;
```
you trigger the slow path, which for a single call to Math.sin(i) takes ~50 ns versus ~40 ns for the fast path 3. The problem arises when optimizing the three redundant calls to Math.sin(i). As the disassembly shows, callq %r10 occurs three times (and, tracing the execution path, we see that all three are executed once the first branch is taken). That means a runtime of about 150 ns for the three calls, or almost 4 times the fast-path case.
Evidently, the JIT cannot combine the runtime_call nodes in this case, even though they take identical arguments. Most likely, runtime_call nodes are relatively opaque in the intermediate representation and are not subject to CSE and the other optimizations that could help here. These calls are used mainly for intrinsic expansion and a few internal JVM methods, which are not really key targets for this type of optimization, so this approach seems reasonable.
Later Java 9
All of this changed in Java 9 with this change.
The "fast path" where fsin was directly embedded was removed. My use of quotes around the “fast track” is intentional here: there is every reason to believe that the methods supported by SSE or AVX, sin can be faster than x87 fsin , which has not received much love for a decade. Indeed, this change replaces fsin calls "using the Intel LIBM implementation" ( this is the algorithm in its full glory for those who are interested ).
Great, so it may well be faster (the OP didn't provide a number, even after being asked, so we don't know), but the side effect is that, without inlining, we now make an explicit call for every Math.sin and Math.cos that appears in the source: no CSE occurs.
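In that situation you can always do the CSE the JIT won't do for you by hand. A minimal sketch (names are mine; since Math.sin is pure, hoisting it is safe):

```java
public class ManualCse {
    static double i = Math.PI / 4 + 0.01; // a slow-path argument

    // Three explicit calls: on later Java 9, each one is a real stub call.
    static double naive() {
        return Math.sin(i) + Math.sin(i) + Math.sin(i);
    }

    // Hand-hoisted version: Math.sin is deterministic for a given argument,
    // so we can cache the result ourselves and pay for only one call.
    static double hoisted() {
        double s = Math.sin(i);
        return s + s + s;
    }

    public static void main(String[] args) {
        // Both versions compute bit-identical results.
        System.out.println(naive() == hoisted());
    }
}
```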
1 Amusingly, it takes a roundabout route to get there: the value is loaded from [rsi + 0x10] into xmm2, copied reg-reg into xmm1, spilled to the stack (vmovsd %xmm1,(%rsp)), and only then pushed onto the x87 FP stack with fldl (%rsp). It could have done the fld from [rsi + 0x10] directly! That might have saved a cycle or 5.
2 This is consistent with the fact that a version making a single fsin call at runtime, i.e. return Math.sin(i);, also takes about 40 ns — the redundant fast-path calls are essentially free.
3 At least for arguments close to Math.PI / 4. Outside that range, the timing varies: it is very fast for values close to pi / 2 (about 40 ns — as fast as the "fast path") and usually about 65 ns for very large values, which presumably require the full argument-reduction machinery.