Saturday, December 17, 2011

A tale about a JVM crash


It happened again :) 
After a long time, it happened again...
I was called in to investigate a JVM crash. Usually that is the start of a long adventure alongside the Java platform, thrilling and frustrating at the same time. This time I was lucky: it took me only four hours to reach a valid solution.


Chapter I : First Encounter

First of all I looked at the directory from which the process was started, and began going through the hs_err_pidXXXXX.log files that had accumulated there: about 50 of them, since a watchdog process relaunches the application immediately after each crash. In this case, just for the sake of crashing again.
I was happy to discover they accumulate very fast: the process needs merely 45 seconds to crash. Someone unfamiliar with this kind of problem might be startled that the process crashes so fast, but for me that was heaven! 45 seconds to reproduction.
The crash files themselves were disappointing: each one crashed at a different address, the crashing thread was always an applicative one, and the stack trace was missing. I snooped around to make sure no native code was running in this process (no native code, good news).
All I had was a pale description of a JVM frustration:  
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000000000000, pid=4384, tid=8364
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x0000000000000000
#
# If you would like to submit a bug report, please visit:
#

---------------  T H R E A D  ---------------

Current thread (0x000000001029e000):  JavaThread "XXXXXXXXXXXXX" daemon [_thread_in_Java, id=8364, stack(0x000000001a840000,0x000000001a940000)]

siginfo: ExceptionCode=0xc0000005, ExceptionInformation=0x0000000000000008 0x0000000000000000

The fact that the crashing thread was an applicative thread indicated that the crash was not related to GC. I tried switching the GC mechanism anyhow, with no change in behavior. I also disabled compressed oops (-XX:-UseCompressedOops) and the crash remained, standing still.
My next guess was the JIT compiler, so as a first step I added the command line option -Djava.compiler=NONE, which disables JIT compilation entirely. That guess worked like a charm: the JVM stopped crashing. Still no reason for euphoria, though; running in interpreted mode is not a very good choice, so the next step was divide and conquer.
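The sequence of elimination experiments can be sketched as launch variants; app.jar here is only a stand-in for the real application launcher:

```
java -XX:+UseConcMarkSweepGC  -jar app.jar   # different collector: still crashes
java -XX:-UseCompressedOops   -jar app.jar   # compressed oops off: still crashes
java -Djava.compiler=NONE     -jar app.jar   # interpreter only: no crash, JIT suspected
```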

Chapter II: Lion hunting

I added the following flags to the JVM: -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
When a HotSpot JVM is running with the LogCompilation flag it creates a hotspot.log file which includes detailed info regarding all of its JIT activity. For a detailed document regarding this log file take a look here.
Truth be told, I still don't fully understand the generated XML, but I was looking for the <task_queued> lines that mark every method handed to the JIT compiler.


Compilation of a method is performed asynchronously by dedicated JVM threads, and every request for JIT compilation is queued first. Grabbing all of these lines gave me a full list of all compilation attempts made by the JVM. I took this list to a text editor and massaged it into a form that could be fed to the .hotspot_compiler file.
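The manipulation can be sketched as a small shell pipeline, assuming the method='...' attribute of the <task_queued> entries holds the "class name signature" triple, which is already the format exclude expects:

```shell
# Convert <task_queued ...> entries from a LogCompilation file into
# .hotspot_compiler 'exclude' lines. The method='...' attribute holds
# "class name signature", exactly what 'exclude' expects.
queued_to_excludes() {
    grep '<task_queued' "$1" \
      | sed "s/.*method='\([^']*\)'.*/exclude \1/" \
      | sort -u
}
# Typical use: queued_to_excludes hotspot.log > .hotspot_compiler
```

Feeding the result straight into .hotspot_compiler reproduces the "exclude everything" starting point for the divide and conquer step below.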

.hotspot_compiler is a JIT exclusion file which should be placed in the working directory from which the JVM is launched. It contains lines of the following format:
exclude java/util/Properties$LineReader readLine ()I

Telling the JVM:
"Dear JVM,
 Please do not compile the method readLine, which resides in the inner class LineReader of the class java.util.Properties.
 Thanks in advance,
      A desperate programmer"

I started with approx. 2000 such lines, fed them to the .hotspot_compiler file (after removing the java.compiler=NONE flag, of course), and the JVM did not crash. After the divide and conquer process I was left with a single method whose compilation made the JVM freak out.
Everybody is happy: crash eliminated, with minimal effect on performance.
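The divide and conquer loop itself can be sketched roughly like this; run_app is a hypothetical hook standing in for whatever relaunches the application against the current .hotspot_compiler and reports whether the JVM survived:

```shell
# Bisect a .hotspot_compiler exclusion list down to the single method
# whose compilation crashes the JVM. "run_app" is a placeholder: it must
# relaunch the application with the current .hotspot_compiler in place
# and exit 0 only if the JVM survives (remember: ~45 seconds per round).
bisect_excludes() {
    excl=$1                          # exclusions known to keep the JVM alive
    n=$(wc -l < "$excl")
    while [ "$n" -gt 1 ]; do
        half=$(( (n + 1) / 2 ))
        head -n "$half" "$excl" > top.tmp
        tail -n +"$(( half + 1 ))" "$excl" > bottom.tmp
        cp top.tmp .hotspot_compiler
        if run_app; then             # survived: the culprit is in the top half
            cp top.tmp "$excl"
        else                         # crashed: the culprit is in the bottom half
            cp bottom.tmp "$excl"
        fi
        n=$(wc -l < "$excl")
    done
    cat "$excl"                      # the one method whose JIT compile kills the JVM
}
```

With ~2000 candidate lines this converges in about 11 runs, which at 45 seconds per crash is well under the four hours the whole investigation took.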

Chapter III: Aftermath

Everybody is happy, except for me... a lot of questions remain unanswered.
The method that triggered the crash looks like this: 
protected boolean isXXXXXXXXX(double stdDev,double mean){
        //|stddev/mean| > 10^-5
        if(stdDev==0.0){
            return mean==0.0;
        }
        if(Double.isNaN(stdDev) || Double.isNaN(mean)){
            return true;
        }
        return Math.abs(mean/stdDev) < 100000.0;
    } 
What? That's a method that can cause the JVM to crash? I don't think so... 
My guess was that it is related to some kind of interaction.
I tried to reproduce the crash in an isolated test and, as expected, nothing.
I could not find anything on the internet, since I don't even know how to ask the correct question.

What am I going to do next?
  • Try to upgrade the JDK update version, though I could not find a fixed bug related to this in the release notes.
  • Try JDK 1.7
  • Try 32 bit JVM 
  • Try to disable certain optimizations such as Escape analysis 
  • Use -XX:+PrintOptoAssembly
But the most important of all: hoping for some input from the community...

2 comments:

Unknown said...

Sometimes rewriting the code can avoid a bug. Whether it fixes the problem or not, the outcome may give you some idea as to the cause.

I would try

//|stddev/mean| > 10^-5
return !(Math.abs(stdDev / mean) <= 1e-5);

The use of !(a <= b) instead of a > b is so that NaN returns true rather than false.

vi said...

I would suggest you the following:
- upgrade HotSpot
  - the bug may be already fixed
  - try 6u29 - it has a newer HotSpot (20.4)
  - or even pick HS21/HS22 from JDK7u
- try a debug build of HotSpot
  - the error can change from an arbitrary crash to a meaningful assertion
- use both -XX:+PrintOptoAssembly and -XX:+PrintAssembly and compare the results with those from a successful run