Investigate JVM Crashes
Get to the root cause when no core dump is available
by Rao Akella
February 17, 2005
A JVM is an ordinary process like any other, and it can sometimes terminate unexpectedly. Java has built-in support for handling exceptions, and JVMs can tolerate run-of-the-mill problems better than most. The very exceptional nature of a JVM crash makes it both interesting and important to determine its root cause, since it can be indicative of a serious problem.
A core dump is usually created when a process crashes. The core dump file is a memory map of the running process, and it saves the state of the application at the time of termination. It is therefore important evidence, probably the most important evidence, in determining why the JVM crashed.
Sometimes, however, a core dump is not created, which is like a missing body in a murder mystery, and we are forced to fall back on circumstantial evidence (and sometimes reenact the murder) to determine just what killed the JVM. This scenario is what we'll address here.
Let's look at the symptoms. The application terminates unexpectedly, and no core dump related to the crash can be found. A core dump file is usually created in the directory where the JVM was started. On Solaris, it is typically called core, whereas on Linux it is called core.<pid>. BEA's JRockit JVM creates "minidumps" on Windows with the name jrockit.<pid>.mdmp.
Why does this happen? The JVM can usually create a core dump when it handles the fatal error itself, which is what happens when the problem occurs in the JVM's own code. However, if the error occurs in a JNI or JDBC call, or in a third-party library where the exception goes unhandled, the operating system terminates the JVM and it is not possible to create a core dump.
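To make this concrete, here is a minimal, hypothetical sketch of the kind of JNI call site involved (the class, method, and library names are invented for illustration). A fault inside the native code, such as a bad pointer dereference, happens outside the JVM's control, and the operating system may terminate the whole process:

// Hypothetical example: a Java class that delegates work to a native library.
// A fault inside nativeProcess() (for example, a bad pointer dereference in
// the C code) is outside the JVM's control; the OS kills the whole process.
public class NativeBridge {
    static {
        System.loadLibrary("legacycodec");  // hypothetical third-party native library
    }

    // Implemented in native code via JNI
    private static native byte[] nativeProcess(byte[] input);

    public static byte[] process(byte[] input) {
        return nativeProcess(input);
    }
}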
A Troubleshooting Checklist
Your checklist for troubleshooting this scenario includes several key items.
Are you sure the JVM crashed? The JVM may simply have exited upon encountering a problem it couldn't handle. However, the JVM will almost always print or log a descriptive error message before doing so. For example, when JRockit encounters a fatal JNI error, it prints this message, followed by the error description, before it exits:
ERROR: JNI Panic -- Fatal error: ...
Similarly, when JRockit runs out of native (that is, non-heap) memory, it prints this message followed by a description, and then exits:
Fatal error in JRockit ...
JRockit uses native memory for code generation, generated code, optimization, and so on. Being unable to acquire more native memory for internal operations is a serious error that cannot be rolled back or ignored easily. Terminating the JVM may be the best option under exceptional circumstances such as these.
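As a simple illustration of the difference between heap and native memory (this is only a sketch, not how JRockit's internal allocations work), direct NIO buffers are allocated outside the Java heap, so a loop like the following consumes native memory while the heap itself stays almost empty:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustration only: each direct buffer lives in native memory, outside the
// Java heap, so this loop drains the process's native memory rather than heap.
public class NativeMemoryDemo {
    public static void main(String[] args) {
        List buffers = new ArrayList();
        while (true) {
            buffers.add(ByteBuffer.allocateDirect(16 * 1024 * 1024)); // 16 MB each, off-heap
        }
    }
}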
Checking the application log (or the console) for exit-time messages is a quick way to determine if the JVM terminated voluntarily or involuntarily.
Can a core dump be created at all? Sometimes the reason a core dump is not created is simple: no disk space or quota left to write the file, missing permissions to create or write a file in the directory, an existing core dump of the same name that is read-only or write-protected, or core dumps being disabled (Unix/Linux specific). You can use the limit or ulimit commands to determine whether core dumps are disabled and enable them if necessary. For example, on Linux the command:
ulimit -c unlimited
enables core dumps to be written regardless of their size. Core dump sizes can also be restricted if disk space is a concern (see the limit or ulimit man pages for more information).
If any of these reasons is a factor, correct the problem to enable core dumps to be written. Then, if the crash is reproducible, the best way to proceed is to replicate the problem to create a core dump.
Do you have a dump file (JRockit specific)? If the crash is a JRockit crash, a dump file may have been created in the directory where JRockit was started (the file name has the format jrockit.<pid>.dump). This is a simple text file that typically contains the following sections:
- Header: run-time and system information such as the timestamp, JVM version, garbage-collection strategy, thread system (native or thin threads), number of CPUs, total physical memory, OS version, and command-line parameters (JVM options such as heap size and GC strategy)
- Registers
- Stack segment
- Code segment
- Loaded modules, with an indication of which library the crash occurred in
- Thread stack trace: a stack trace of the thread in which the crash occurred
Once a relevant dump file has been found, the thread stack trace (at the end of the file) provides extremely useful information. Using keywords from the stack trace, you can search knowledge repositories such as AskBEA (BEA eSupport), BEA's JRockit newsgroup, and Google for prior community experience (see Resources). Chances are good that the problem has been seen before, and may even have been fixed in a JRockit patch or later release.
Any Luck?
Is the crash reproducible? We are in luck if the problem happens consistently or frequently (or even infrequently, as long as the crash is somewhat reproducible). In the absence of any core or dump files, one remaining option is to try to narrow down the location of the problem code that causes the JVM crash:
What do we know? The less information we have about the reason for the crash, the more questions we need to ask to gather additional information about the environment and application. With experience and knowledge-base searches, it may be possible to identify the problem from similar cases seen in the past.
Some key questions to ask are:
- What is the nature of the application? What does it do? What are its key components?
- What was the application doing when it crashed (JNI, JDBC, JSP, JMS, network communication)?
- How does the crash manifest itself (the process disappears, a pop-up error dialog appears)?
- What component of the application does the crash occur in, and can the location be narrowed down to a more specific area of the system? For example, if it happens in a JNI call, what was the last location in the application's code from which the library call was made?
- What are the hardware specifications (computer make and model, number of CPUs, physical memory, swap space, and so forth)?
- What are the software specifications (OS version, kernel version, JVM version, and so forth)?
- What are the environment specifications (garbage-collection strategy, heap-size settings, thread system in use, process size at the time of the crash, and so forth)?
- What application and/or platform logs are available?
Quick Fixes and Workarounds
Here are some strategies that can help you investigate and possibly solve the problem:
- Upgrade: If possible, upgrade to the latest version of the JVM supported for the application. The upgrade may or may not solve the problem, but in any case, it's always a good strategy to see what freebies you can pick up with this approach. However, some customers may not be able to take advantage of this option, especially if the application is a production system where modifications must be carefully controlled.
- Switch JVMs: If another vendor's JVM is supported for the application, switching JVMs may help sidestep the problem. Sun's HotSpot and BEA's JRockit differ in their implementation of several key components like JNI and memory management. Problems are frequently encountered in these areas; therefore, switching JVMs can be very helpful.
- Disable JIT and force interpretation (HotSpot specific): The command-line options -Djava.compiler=none -Xint force HotSpot to turn off compilation and interpret all bytecode, which may help if the problem was in hotspotting.
- Disable optimization (JRockit specific): The command-line option -Xnoopt forces JRockit to turn off all code optimization (there is no way to disable compilation entirely in JRockit). Since optimization has been a problem area in the past, this workaround may be worth trying.
- Switch to a Type 4 JDBC driver: If the crash is happening in JDBC, changing from a Type 2 (native code) to a Type 4 (100% pure Java) JDBC driver may help.
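As a rough sketch of the JDBC driver switch mentioned in the last item, here is a hypothetical example using Oracle's drivers of that era (the driver class name and URL formats are assumptions; check your own driver's documentation). The Type 2 OCI connection routes calls through native client libraries, so a fault there can take down the JVM; the Type 4 thin connection keeps the driver path in pure Java:

import java.sql.Connection;
import java.sql.DriverManager;

public class DriverSwitchExample {
    public static Connection connect(String user, String password) throws Exception {
        // Register the Oracle driver class (assumed name; serves both driver types)
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // Type 2 (native) connection: goes through the Oracle client libraries,
        // so a bug in that native code can crash the whole JVM.
        // return DriverManager.getConnection("jdbc:oracle:oci8:@MYDB", user, password);

        // Type 4 (pure Java) connection: no native code on the driver path.
        return DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:MYDB", user, password);
    }
}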
JVM crashes are relatively infrequent and may signal a serious problem, which makes finding their cause both challenging and important, especially when a core dump file is not available. Perhaps these troubleshooting suggestions will help you determine where your problem lies.
About the Author
Rao Akella is a senior software engineer at BEA Systems Inc. and has worked with the JRockit backline support team and WebLogic Server QA for over a year. He has an MS degree in aerospace engineering, a B.Tech. degree in mechanical engineering, and has been part of the Silicon Valley software scene for more than a decade. Contact Rao at .