Probable and unlikely causes of Heisenbugs in Java?

Question

I have a classic Heisenbug triggered by a condition I have not seen before. My legacy application (about 100K SLOC of old code) misbehaves in one specific instance, and merely enabling remote debugging (JPDA) changes the behavior enough to make the application work correctly: doing nothing except adding "-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=6666" to the VM command line hides the bug (with or without an actual debugger connection). Given that I have a fully repeatable test case, I am reluctant to disturb it much with code changes in case the bug goes into hiding again. And, of course, this only happens in production.

Normally I would immediately suspect a threading problem, but: a) the behavior is 100% failure versus 100% success; b) there is no explicit use of threads in the code in question. Our team then tried to come up with a list of other possible causes of this behavior, so I thought the Qaru community might be able to add a few more.

Heisenbugs in Java:

Threads: bad synchronization, race conditions, implicit ordering assumptions.
Explicit debugging / logging code: changes to the code path cause or prevent the problem. Less often, changes in log level can alter timing (threads again) and change the use of I/O resources.
Native libraries can drag in non-Java Heisenbug problems.
Relying on finalizers being run.
Incorrect assumptions about weak references.
Assuming that a fixed-size cache never fills up.
Expecting unique hash codes.
Assuming that == works on strings (or that it does not work on strings, which may be interned in some cases).
VM bug (no, that never happens ;).
Test methodology error, especially when there are hidden variables that depend on the success of the test. (This looks like our current problem. The success of one test led the client to run the next test, which failed because of a policy issue. Per policy, the failure led to a run with debugging enabled, which succeeded. Sigh.)
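Two of the items above (== on strings and non-unique hash codes) are easy to demonstrate in isolation. The following is only a minimal sketch; the class and method names are made up for illustration and have nothing to do with the application in question.

```java
// Hypothetical demo class: shows why identity comparison and hash-code
// uniqueness are unsafe assumptions for java.lang.String.
final class StringIdentityDemo {

    // True only if both references point to the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // True if two different strings share one hash code.
    static boolean hashCollision(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String literal = "heisenbug";              // literals are interned
        String built = new String("heisenbug");    // distinct object, same content
        String interned = built.intern();          // canonical (interned) instance

        System.out.println(sameObject(literal, built));     // false
        System.out.println(sameObject(literal, interned));  // true
        System.out.println(literal.equals(built));          // true
        System.out.println(hashCollision("Aa", "BB"));      // true
    }
}
```

Code that compares strings with == can therefore pass or fail depending on where the string came from (a literal, a parser, a database driver), which is exactly the kind of environment-dependent behavior that looks like a Heisenbug.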

Any other cases worth exploring?

Edits:

Yes, the JPDA-enabling options use the old syntax. I have not tested whether using the modern syntax affects the behavior.
This particular machine uses 1.8.0_45-b14 for the JRE and the HotSpot 64-Bit Server VM (build 25.45-b02).
While the question is meant to be general, the motivating problem is real and current. Since the problem shows up in a deployed system, I am torn between leaving it running with -Xdebug as a workaround so that it stays operational, and wanting to track down the underlying bug and kill it.
The failing program is part of a multi-stage data processing pipeline. The details should not matter, but it is best understood as a stand-alone application that reads some information from a database and then uses it to modify some files. The part of the system that breaks appears to be that the information from the database is not interpreted properly: anything from a broken ORM object to a bad cache. When it is "broken", the application logic that decides whether it should do its work (based on the contents of the db) makes the wrong choice for all iterations (thousands of iterations, across multiple invocations of the program). When it "works" (the only difference being whether the VM runs with -Xdebug or not), the application makes the right choice for all iterations. In this configuration it is fully consistent. The same code running against other databases is not affected. There is some evidence (predating my involvement with this code) that similar behavior was noticed in the past and mysteriously went away after seemingly minor code changes ... see "Heisenbug".

+5

java debugging heisenbug jpda

m.thome Jan 08 '16 at 2:51


2 answers

Paul rubel · Answer 1 · 2016-01-08T15:02:20+0000

-Xdebug looks like a behavior-changing switch. The question "What are the Java command line options for remotely debugging the JVM?" claims that adding it switches you from the JIT to fully interpreted execution. Other Oracle Java docs (admittedly for JRockit) seem to indicate that it is slower for some unspecified reason and not suitable for deployed systems.

I can imagine that various GC schemes can make a difference.
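One way to check rather than guess would be to log what the JVM itself reports in the failing and in the working configuration and compare the two. This is only a sketch using the standard java.lang.management API; the class name is made up, and the "java.vm.info" property (for example "mixed mode") is HotSpot-specific, so treat its presence as an assumption.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical diagnostic: prints how this JVM was started and which
// JIT compiler and garbage collectors it is actually using.
final class JvmModeReport {

    // True if the arguments ask for a JDWP debug agent (old or new syntax).
    static boolean debugAgentRequested(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.equals("-Xdebug")
                    || arg.startsWith("-Xrunjdwp")
                    || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM args: " + jvmArgs);
        System.out.println("Debug agent requested: " + debugAgentRequested(jvmArgs));

        // HotSpot-specific property; may be null on other VMs.
        System.out.println("java.vm.info: " + System.getProperty("java.vm.info"));

        // getCompilationMXBean() returns null when the VM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        System.out.println("JIT compiler: " + (jit == null ? "none" : jit.getName()));
        if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("JIT time so far (ms): " + jit.getTotalCompilationTime());
        }

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("GC: " + gc.getName() + ", collections: " + gc.getCollectionCount());
        }
    }
}
```

Running this once with and once without the debug options, on the affected machine, would show whether the compilation mode or the collector really differs between the two runs.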

Andres · Answer 2 · 2016-01-08T14:54:29+0000

I had a case where the failure was caused by the power-saving feature on the hardware, which never got activated when the error was studied.
