Saturday, December 17, 2011

A tale about a JVM crash


It happened again :) 
After a long time it happened again....
I was called to investigate a JVM crash. Usually this is the sign of a long adventure alongside the Java platform, a thrilling and frustrating event at the same time. This time I was lucky: it took me only 4 hours to reach a valid solution....


Chapter I : First Encounter

First of all I looked at the directory from which the process was launched and started going over the hs_err_pidXXXXX.log files that had piled up there: about 50 such files, since a watchdog process relaunches the application immediately after each crash. In this case, just for the sake of crashing again.
I was happy to discover that they accumulate very fast: the process needs merely 45 seconds to crash. Someone who is not familiar with this kind of problem may be startled by the fact that the process crashes so fast, but for me that was heaven! 45 seconds to reproduction.
I started looking at the crash files. They were very disappointing: each one crashed at a different address, and the crashing thread was always an applicative thread. In addition, the stack trace was missing. I snooped around to make sure no native code was running in this process (no native code, good news).
All I had was a pale description of a JVM frustration:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000000000000, pid=4384, tid=8364
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C  0x0000000000000000
#
# If you would like to submit a bug report, please visit:
#

---------------  T H R E A D  ---------------

Current thread (0x000000001029e000):  JavaThread "XXXXXXXXXXXXX" daemon [_thread_in_Java, id=8364, stack(0x000000001a840000,0x000000001a940000)]

siginfo: ExceptionCode=0xc0000005, ExceptionInformation=0x0000000000000008 0x0000000000000000

The fact that the crashing thread was an applicative thread indicated that the crash was not related to GC. I tried switching the GC mechanism anyhow, with no change in behavior. I also disabled compressed oops (-XX:-UseCompressedOops) and the crash remained there, standing still.
My next guess was the JIT compiler, so as a first step I added the following command line option: -Djava.compiler=NONE. That guess worked like a charm: the JVM stopped crashing. That was still no reason for euphoria, since running in interpreted mode is not a very good choice, so the next step was divide and conquer.

Chapter II: Lion hunting

I added the following flags to the JVM: -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
When a HotSpot JVM runs with the LogCompilation flag it creates a hotspot.log file which includes detailed information about all of its JIT activity. For a detailed document regarding this log file take a look here.
Truth be told, I still don't fully understand the generated XML, but I was looking for the following lines:
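The lines in question are the <task_queued> entries, one per compilation request. A representative entry looks roughly like this (the values here are illustrative, and the exact set of attributes varies between JVM builds):

<task_queued compile_id='652' method='java.util.Properties$LineReader readLine ()I' bytes='452' count='10000' iicount='10000' stamp='42.170' comment='count' hot_count='10000'/>

The method attribute already carries the class, the method name and the signature, which is what the .hotspot_compiler entries described below are built from.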


Compilation of a method is performed asynchronously by dedicated JVM threads, and every JIT request is queued before compilation. Grabbing all these lines gave me a full list of all compilation attempts made by the JVM. This list was then taken to a text editor and massaged so it could be fed to the .hotspot_compiler file.

.hotspot_compiler is a JIT exclusion file which should be placed in the working directory from which the JVM is launched. It contains lines of the following format:
exclude java/util/Properties$LineReader readLine ()I

This tells the JVM:
"Dear JVM,
 Please do not compile the method readLine which resides in the inner class LineReader of the class java.util.Properties. Thanks in advance,
      A desperate programmer"

I started with approx. 2000 such lines, fed them to the .hotspot_compiler file (after removing the -Djava.compiler=NONE flag, of course), and the JVM did not crash. After the divide and conquer process I was left with a single method whose compilation made the JVM freak out.
Everybody is happy: crash eliminated, with minimal effect on performance.

Chapter III: Aftermath

Everybody is happy, except for me.... a lot of questions remain unanswered.
The method which triggered the crash looks like this:
protected boolean isXXXXXXXXX(double stdDev,double mean){
        //|stddev/mean| > 10^-5
        if(stdDev==0.0){
            return mean==0.0;
        }
        if(Double.isNaN(stdDev) || Double.isNaN(mean)){
            return true;
        }
        return Math.abs(mean/stdDev) < 100000.0;
    } 
What? That's a method which can cause the JVM to crash? I don't think so...
My guess was that it is related to some kind of interaction.....
I tried to reproduce the crash in an isolated test and, as expected, nothing.
I could not find anything on the internet, since I don't even know how to ask the correct question.

What am I going to do next?
  • Try to upgrade the JDK minor version, though I could not find a fixed bug related to this in the release notes.
  • Try JDK 1.7
  • Try a 32-bit JVM
  • Try to disable certain optimizations such as escape analysis
  • Use -XX:+PrintOptoAssembly
But most important of all: hoping for some input from the community......

Monday, November 14, 2011

Java 7 and String.intern()

For a long time, interning strings in the HotSpot VM would store the result in the permanent generation. Since Java 7 this is no longer true: interned strings are stored in the heap. As a result, if you have an application that relies on intern() you may need to re-size your perm gen and your heap: the heap needs to be larger and the perm gen can be smaller.
I understand that the permanent generation is going to be removed from the HotSpot VM in future releases of Java.
Maybe this is the first step in that direction :)
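As a rough illustration (a minimal sketch; the class name and loop count are made up), code like this that interns a large number of distinct strings would typically blow up PermGen on Java 6, while on Java 7 the same pressure lands on the regular heap:

import java.util.ArrayList;
import java.util.List;

public class InternPressure {
    public static void main(String[] args) {
        List<String> keep = new ArrayList<String>();
        for (int i = 0; i < 10000000; i++) {
            // intern() returns the canonical copy from the string pool;
            // before Java 7 that pool lived in the permanent generation,
            // from Java 7 on it lives in the regular Java heap.
            keep.add(String.valueOf(i).intern());
        }
        System.out.println("Interned " + keep.size() + " strings");
    }
}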
Reference
http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html

Wednesday, April 27, 2011

Java 6u25 released

Two new operations on the Threading MXBean (com.sun.management.ThreadMXBean):
 - getThreadCpuTime - bulk version that accepts an array of thread IDs
 - getThreadAllocatedBytes - heap bytes allocated, on a per-thread basis
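A minimal usage sketch for these operations (assuming a HotSpot JVM, since com.sun.management.ThreadMXBean is a HotSpot-specific extension of the standard ThreadMXBean):

import java.lang.management.ManagementFactory;
import com.sun.management.ThreadMXBean;

public class ThreadStats {
    public static void main(String[] args) {
        // On HotSpot the platform ThreadMXBean also implements the com.sun.management extension
        ThreadMXBean mx = (ThreadMXBean) ManagementFactory.getThreadMXBean();
        long[] ids = mx.getAllThreadIds();
        long[] cpu = mx.getThreadCpuTime(ids);          // bulk CPU time, in nanoseconds
        long[] alloc = mx.getThreadAllocatedBytes(ids); // cumulative heap bytes allocated per thread
        for (int i = 0; i < ids.length; i++) {
            System.out.println(ids[i] + ": cpu=" + cpu[i] + "ns allocated=" + alloc[i] + "b");
        }
    }
}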

http://www.oracle.com/technetwork/java/javase/6u25releasenotes-356444.html

- Tiered compilation is in...


HotSpot diagnostic information has been expanded in several ways:
  • Tracking of cumulative Java heap bytes allocated on a per-thread basis (the Threading MXBean gained a getThreadAllocatedBytes operation)
  • On OutOfMemoryError, indication of the faulting thread in the heap dump
  • Improved handling of unexpected exceptions in application native code
  • Better indication of native heap exhaustion
  • More details in hs_err files


Tuesday, March 22, 2011

Thread Top

Hi,
I have finally found time to write a small tool that can help with profiling Java applications.
Its name is threadTop and it connects to your application via JMX.
It can measure

  •   thread contention 
  •   thread CPU usage
I think it can be very useful when combined with jstack.
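For context, here is a minimal sketch of the kind of JMX calls a tool like this relies on (the JMX URL, port and class name are made up for the example; this is not threadTop's actual code):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadSampleSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target started with -Dcom.sun.management.jmxremote.port=9999
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection conn = connector.getMBeanServerConnection();
        ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

        // Needed so that blocked/waited times are reported instead of -1
        threads.setThreadContentionMonitoringEnabled(true);
        for (long id : threads.getAllThreadIds()) {
            ThreadInfo info = threads.getThreadInfo(id);
            if (info == null) continue; // the thread may have died in the meantime
            System.out.println(info.getThreadName()
                    + " cpu(ns)=" + threads.getThreadCpuTime(id)
                    + " blockedCount=" + info.getBlockedCount()
                    + " blockedTime(ms)=" + info.getBlockedTime());
        }
        connector.close();
    }
}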
Waiting for your feedback :

Tuesday, March 15, 2011

Controlling GC logs at runtime

GC logging can be controlled using the following command line options:
  • -XX:+PrintGC (same as -verbose:gc) - Displays basic GC details 
  • -XX:+PrintGCDetails  - Displays advanced GC details 
  • -XX:+PrintGCTimeStamps - Displays the time stamp for every GC event  reported
  • -XX:+PrintGCDateStamps -  Displays the date for every GC event  reported  
  • -Xloggc:c:\gc.log     - Writes GC logging to the specified file.
It is usually recommended to run with -XX:+PrintGC -XX:+PrintGCTimeStamps -Xloggc:c:\gc.log even in production. The amount of data written to these files is not huge, and it can help to analyze GC throughput, pause times, and even to expose bugs in the GC mechanism.
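With that combination every collection shows up as a single timestamped line in gc.log, roughly like this (numbers are illustrative):

7.456: [GC 72703K->6341K(816832K), 0.0067245 secs]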
On the other hand, these files are not managed by a "log4j style" rolling appender and will not be trimmed by default. So what should we choose?
The good news is that GC logging can be configured at runtime: it can be enabled and disabled at will via JMX.
The JMX object com.sun.management:type=HotSpotDiagnostic has an attribute named DiagnosticOptions. This attribute contains 12 boolean sub-attributes (as of Java 1.6.0_24). All of them are very interesting, but for GC logging we need to look at 4 of them:
  • PrintGC
  • PrintGCDetails
  • PrintGCTimeStamps
  • PrintGCDateStamps

The bad news is that -Xloggc cannot be configured via JMX, so you will need to set it at startup.


All of these can be modified from jconsole or VisualVM using the setVMOption operation.
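The same thing can also be done programmatically. A minimal sketch for the in-process case, assuming a HotSpot JVM:

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class GcLoggingSwitch {
    public static void main(String[] args) throws Exception {
        // Proxy for the com.sun.management:type=HotSpotDiagnostic MBean
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // Turn GC logging on at runtime (output goes to the file given by -Xloggc at startup)
        diag.setVMOption("PrintGC", "true");
        diag.setVMOption("PrintGCDetails", "true");
        // ...and back off again when done
        diag.setVMOption("PrintGCDetails", "false");
        diag.setVMOption("PrintGC", "false");
    }
}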

My recommendation: even if you do not want to log GC by default, add the following params to your command line: -Xloggc:c:\gc.log -XX:-PrintGC -XX:+PrintGCTimeStamps
(Notice that PrintGC has a - sign before it.)
When you need to analyze GC, just turn on PrintGC and possibly PrintGCDetails.


Thursday, March 3, 2011

-XX:PrintFLSStatistics=1

While trying to understand what is going on with GC logging on G1GC, I found an undocumented JVM param that can really help when using CMS: -XX:PrintFLSStatistics=1 will report extra data for every GC:

8.790: [ParNew: 72703K->6341K(76672K), 0.0067245 secs] 72703K->6341K(816832K)After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 94724096
Max   Chunk Size: 94724096
Number of Blocks: 1
Av.  Block  Size: 94724096
Tree      Height: 1
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree      Height: 0
, 0.0068238 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]  

This can really help when trying to understand a CMS concurrent mode failure due to fragmentation.

Pity it does not report the data in a single line though :-)

Friday, February 25, 2011

JDK1.6.0_24 released

Nothing interesting, though.... JDK 1.6.0_23 included the latest version of VisualVM, but this is old news by now.

First impressions from G1GC

I was trying to see how G1GC would cope with an application that I am investigating.
Here are my disappointing results:
1. With JDK 1.6.0_20, G1GC does not report memory correctly via JMX. The problem is solved in later JDKs, but I am not able to upgrade the JVM at this stage.
2. VisualVM does not report memory correctly with G1GC; I need to open a bug for this issue.
3. I cannot find thorough documentation regarding what the GC logs mean (when using G1GC with -Xloggc).
4. The application crashed every few minutes. Again, this was with 1.6.0_20 and I need to recheck with 1.6.0_24.