I was called in to investigate a JVM crash. Usually that is the sign of a long adventure alongside the Java platform, thrilling and frustrating at the same time. This time I was lucky: it took me only 4 hours to reach a valid solution...
Chapter I: First Encounter
First of all I looked at the directory from which the process was running. I started looking at the hs_err_pidXXXXX.log files that had accumulated there... about 50 of them, since a watchdog process relaunches the application immediately after each crash. In this case, just for the sake of crashing again.
I was happy to discover they accumulate very fast: the process needs merely 45 seconds to crash. Someone not familiar with these kinds of problems may be startled by the fact that the process crashes so fast, but for me that was heaven! 45 seconds to reproduction.
I started looking at the crash files. They were very disappointing: each one crashed on a different address, and the crashing thread was an applicative thread. In addition, the stack trace was missing. I snooped around to make sure no native code was running in this process (no native code, which was good news).
All I had was a pale description of a JVM frustration:
# A fatal error has been detected by the Java Runtime Environment:
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000000000000, pid=4384, tid=8364
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000000000
# If you would like to submit a bug report, please visit:
The fact that the crashing thread was an applicative thread indicated that the crash was not related to GC. I tried switching the GC mechanism anyhow, with no change in behavior. I also removed compressed oops, and the crash remained, standing still.
My next guess was the JIT compiler, so as a first step I added the following command line option: -Djava.compiler=NONE. That guess worked like a charm: the JVM stopped crashing. Still, that was no cause for euphoria; running in interpreted mode is not a very good choice, so the next step was divide and conquer.
Chapter II: Lion hunting
I added the following flags to the JVM: -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
When a HotSpot JVM runs with the LogCompilation flag, it creates a hotspot.log file which includes detailed info about all of its JIT activity. For a detailed document regarding this log file, take a look here.
Truth be told, I still don't fully understand the generated XML, but I was looking for the following lines:
Compilation of a method is performed asynchronously by dedicated JVM threads, and every request for JIT compilation is queued before compilation. Grabbing all these lines gave me a full list of all compilation attempts made by the JVM. This list was then taken to a text editor and manipulated so it could be fed to the .hotspot_compiler file.
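The entries to grep for are the compilation-queue records. The snippet below is illustrative only (the method name and attribute values are mine, not from the actual log, and the exact attribute set varies between JVM versions); each queued compilation shows up as a task_queued element whose method attribute names the class and method being compiled:

```xml
<task_queued compile_id='42'
             method='java/util/Properties$LineReader readLine ()I'
             bytes='452' count='10000' iicount='10000'
             stamp='1.234' comment='count' hot_count='10000'/>
```

Each such element can be turned into one entry of the exclusion file.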
.hotspot_compiler is a JIT exclusion file which should be placed in the working directory from which the JVM is launched. It contains lines of the following format:
"Do not compile the method readLine which resides in the inner class LineReader of the class java.util.Properties. Thanks in advance,
A desperate programmer"
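In actual .hotspot_compiler syntax, that polite request would be a single exclude directive (reconstructed from the description above; HotSpot expects slash-separated package names and $ for inner classes):

```
exclude java/util/Properties$LineReader readLine
```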
I started with approx. 2000 such lines and fed them to the .hotspot_compiler file (after removing the java.compiler=NONE flag, of course), and the JVM did not crash. After the divide and conquer process I was left with a single method whose compilation made the JVM freak out.
Everybody is happy: crash eliminated, with minimal effect on performance.
Chapter III: Aftermath
Everybody is happy, except me... a lot of questions remain unanswered.
The method that triggered the crash looks like this:
For a long time, interning strings in the HotSpot VM would store the result in the permanent generation. Since Java 7 this is no longer true: interned strings are stored in the heap. As a result, if you have an application that relies on intern(), you may need to resize your perm gen and your heap; the heap needs to be larger and the perm gen smaller.
I understand that the permanent generation is going to be removed from the HotSpot VM in future releases of Java.
Maybe this is the first step in this direction :)
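The intern() semantics themselves are unchanged by where the pool lives; only the memory region that pays for it moved. A minimal sketch (the class name InternDemo is mine):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("crash-hunt");  // fresh object on the heap
        String b = "crash-hunt";              // literal: the canonical pooled copy

        System.out.println(a == b);           // false: two distinct objects
        System.out.println(a.intern() == b);  // true: intern() returns the pooled copy
    }
}
```

Before Java 7 that pooled copy lived in the permanent generation; since Java 7 it lives in the ordinary heap, which is why heavy intern() use now pressures the heap instead of perm.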
-XX:+PrintGCTimeStamps - Displays the time stamp for every GC event reported
-XX:+PrintGCDateStamps - Displays the date for every GC event reported
-Xloggc:c:\gc.log - Writes GC logging to the specified file.
It is usually recommended to run with -XX:+PrintGC -XX:+PrintGCTimeStamps -Xloggc:c:\gc.log even in production. The amount of data written to those files is not huge, and it can help to analyze GC throughput, pause times and even to expose bugs in the GC mechanism.
On the other hand, these files are not managed by a "log4j style" rolling appender and will not be trimmed by default. So what should we choose?
The good news is that logging GC can be configured at runtime. It can be enabled and disabled at will via JMX.
The JMX object com.sun.management:type=HotSpotDiagnostic has an attribute named DiagnosticOptions. This attribute contains 12 boolean sub-attributes (as of Java 1.6.0_24). All of them are very interesting, but for GC logging we need to look at four of them: PrintGC, PrintGCDetails, PrintGCTimeStamps and PrintGCDateStamps.
The bad news is that -Xloggc cannot be configured via JMX so you will need to configure it on startup.
All these can be modified using jconsole or VisualVM via the setVMOption operation.
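The same toggle can also be done programmatically from inside the process, through the com.sun.management.HotSpotDiagnosticMXBean API. A minimal sketch, assuming a HotSpot JDK (the set of manageable flags returned by getDiagnosticOptions varies between JDK releases, and on recent JDKs PrintGC is deprecated or gone, so the setVMOption call is guarded):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcLogToggle {

    // All flags marked "manageable", i.e. the ones that may be flipped at runtime.
    static List<VMOption> manageableOptions() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getDiagnosticOptions();
    }

    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        for (VMOption opt : manageableOptions()) {
            System.out.println(opt.getName() + " = " + opt.getValue());
        }

        // Try to turn GC logging on at runtime; if this JDK no longer exposes
        // PrintGC as a writeable flag, report that instead of failing.
        try {
            bean.setVMOption("PrintGC", "true");
            System.out.println("PrintGC is now " + bean.getVMOption("PrintGC").getValue());
        } catch (IllegalArgumentException e) {
            System.out.println("PrintGC is not settable on this JDK: " + e.getMessage());
        }
    }
}
```

This is exactly what jconsole does under the hood when you invoke setVMOption on the HotSpotDiagnostic MBean.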
My recommendation: even if you do not want to log GC by default, add the following params to your command line: -Xloggc:c:\gc.log -XX:-PrintGC -XX:+PrintGCTimeStamps
(Notice the - sign before PrintGC: the flag is present but switched off.)
When you need to analyze GC, just turn on PrintGC and possibly PrintGCDetails.
While trying to understand what is going on with GC logging on G1GC, I found an undocumented JVM param that can really help when using CMS: -XX:PrintFLSStatistics=1 reports extra data for every GC:
8.790: [ParNew: 72703K->6341K(76672K), 0.0067245 secs] 72703K->6341K(816832K)
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 94724096
Max Chunk Size: 94724096
Number of Blocks: 1
Av. Block Size: 94724096
Tree Height: 1
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
, 0.0068238 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
This can really help when trying to understand CMS concurrent mode failures due to fragmentation.
A pity it does not report the data in a single line, though :-)
I tried to see how G1GC would cope with an application I am investigating.
Here are my disappointing results:
1. With JDK 1.6.0_20, G1GC does not report memory correctly via JMX. The problem is solved in later JDKs, but I am not able to upgrade the JVM at this stage.
2. VisualVM does not report memory correctly with G1GC; I need to open a bug for this issue.
3. I cannot find thorough documentation on what the GC logs mean (when using G1GC with -Xloggc).
4. The application crashed every few minutes. Again, this was with 1.6.0_20 and I need to recheck with 1.6.0_24.