An Adventure into Java Out of Memory Error: Part 1
Background
This journey begins with a complex Java backend project I've been working on for about a year and a half. Despite this extended period, I still feel there is much more to learn. The project has a history spanning over ten years, which means it contains many convoluted design decisions, historical trade-offs, undefined behaviors, elusive bugs, and low-quality legacy code scattered throughout the codebase.
What makes the situation worse is that the original designers and most of the senior engineers left the company several years ago, leaving behind no documentation or meaningful comments. Consequently, when something goes wrong, we are on our own with no guidance from the previous team. Recently, we encountered persistent HTTP 503 Service Unavailable errors after an infrastructure update. The problem-solving process was quite challenging and full of mysteries and exceptions, but we ultimately resolved it. I believe it's worthwhile to record this journey, so here's the story.
The First Investigation
Our first step in tackling the issue was to investigate the logs and analyze monitoring data. What caused the 503 errors? By examining the access logs and Tomcat's catalina.log, we noticed that the Tomcat server had restarted several times. Comparing the server restart times with the requests that received 503 responses, we found that all the 503 errors occurred while the server was restarting. This led us to the next question: Why was the Tomcat server restarting?
AWS CloudWatch Metrics indicated abnormal memory usage: the average of Memory Used Percent was around 93%. We also found several hs_err_pid fatal error log files generated by the Java Virtual Machine (JVM) in the /usr/share/tomcat directory. It became clear that the Tomcat server was restarting because it was terminated after a Java out-of-memory error (OutOfMemoryError) occurred. This termination made the service unavailable, which was detected by our health monitoring script, triggering a restart of Tomcat. The challenging part was to determine one thing: Why did the OutOfMemoryError happen? At this stage, the fatal error log was our only available information. Therefore, we proceeded with our investigation by delving into its contents.
Analyzing the Fatal Error Log hs_err_pid
The hs_err_pid file is a fatal error log generated by the Java Virtual Machine when it encounters a fatal error. It provides detailed diagnostic information that can help developers diagnose and fix the problem. We can find more details about it from the official documentation.
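By default, the JVM writes this file into the working directory of the process, which is why ours ended up under /usr/share/tomcat. If you would rather collect these logs in a dedicated location, the path can be overridden with the -XX:ErrorFile option (an optional tweak we did not apply; the path below is only an example, and %p expands to the process ID):
-XX:ErrorFile=/var/log/tomcat/hs_err_pid%p.log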
The first part of the fatal error log showed:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# JVM is running with Zero Based Compressed Oops mode in which the Java heap is
# placed in the first 32GB address space. The Java Heap base address is the
# maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
# to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:3213), pid=4000, tid=15996
#
# JRE version: OpenJDK Runtime Environment Corretto-11.0.24.8.1 (11.0.24+8) (build 11.0.24+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.24.8.1 (11.0.24+8-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
From this, we learned that we are using OpenJDK 11 (Amazon Corretto), and that the JVM failed during a native memory allocation (mmap) of 16 KB. The log also provided possible reasons and solutions for the OutOfMemoryError. But which one applied to our situation?
JVM Option -Xmx
The -Xmx option in Java specifies the maximum heap size that the JVM can allocate for the application.
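As a quick sanity check (a minimal sketch, not part of the original investigation), the effective limit can be read back at runtime; Runtime.maxMemory() roughly corresponds to the value configured with -Xmx:
public class MaxHeapCheck {
    public static void main(String[] args) {
        // Roughly the -Xmx value; some collectors reserve a small slice for themselves.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxBytes / (1024 * 1024));
    }
}
Running it with java -Xmx2847m MaxHeapCheck.java prints a value close to 2847 MB.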
From the fatal error log, we saw that the maximum size of the Java heap was set by -Xmx2847m, and there were no other JVM tuning options set, meaning the JVM was started with default configurations. Both the SUMMARY and PROCESS sections in the fatal error log provided detailed arguments:
--------------- S U M M A R Y ------------
Command Line: -Dfile.encoding=UTF-8 -Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true --module-path=/usr/share/tomcat/modules -Xmx2847m -Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat/temp -Djava.util.logging.config.file=/usr/share/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
--------------- P R O C E S S ---------------
VM Arguments:
jvm_args: -Dfile.encoding=UTF-8 -Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true --module-path=/usr/share/tomcat/modules -Xmx2847m -Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat/temp -Djava.util.logging.config.file=/usr/share/tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
java_command: org.apache.catalina.startup.Bootstrap start
java_class_path (initial): /usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
Launcher Type: SUN_STANDARD
But why the specific value 2847? It turns out that the application is configured by an initialization script, and the number is calculated as follows:
"-Xmx$(awk '/MemTotal/ { printf "%d", $2/1024/1024 < 1 ? $2/1024/2 : $2/1024 - 1024 }' /proc/meminfo)m"
This AWK command checks whether the total physical memory is less than 1 GB. If so, it halves the memory size in megabytes; otherwise it subtracts 1024 MB (1 GB) from the total. Since we are using AWS EC2 t3.medium instances, which have about 4 GB of memory each, the result is roughly 3 GB for the JVM heap. The exact total physical memory is listed in the hs_err_pid log file, as shown later: MemTotal is 3964656 kB. The calculation is 3964656 / 1024 - 1024 = 2847.734375, and the %d format specifier truncates it to an integer, resulting in 2847.
According to this article from Red Hat, the total memory usage of a JVM process can be calculated as:
JVM memory = Heap memory + Metaspace + CodeCache + (ThreadStackSize * Number of Threads) + DirectByteBuffers + Jvm-native
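Each of the non-heap components in this formula can be measured with the JVM's Native Memory Tracking, at the cost of a small runtime overhead. We did not have it enabled at the time, but as a hint for similar investigations, it can be switched on and queried like this (replace <pid> with the Tomcat process ID):
-XX:NativeMemoryTracking=summary
jcmd <pid> VM.native_memory summary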
Given the OutOfMemoryError and the abnormal memory usage, we realized that leaving just 1 GB for the JVM's native memory needs and for other system processes is not enough. In other words, the current -Xmx setting was too high, leaving insufficient memory for native allocations. We found more evidence of this in the memory usage status detailed in /proc/meminfo.
Memory Usage Status in /proc/meminfo
The /proc/meminfo file provides detailed information about the system's memory usage, including how much memory is free, available, used, or cached. This article provides an excellent explanation of how to understand and use it to analyze memory issues. For more authoritative and comprehensive information, refer to The Linux Kernel documentation.
In the SYSTEM section of the fatal error log, we saw:
--------------- S Y S T E M ---------------
OS:Amazon Linux release 2 (Karoo)
uname:Linux 4.14.336-257.566.amzn2.x86_64 #1 SMP Sat Mar 9 09:49:51 UTC 2024 x86_64
OS uptime: 5 days 18:08 hours
libc:glibc 2.26 NPTL 2.26
rlimit (soft/hard): STACK 10240k/10240k , CORE 0k/infinity , NPROC infinity/infinity , NOFILE 65535/65535 , AS infinity/infinity , CPU infinity/infinity , DATA infinity/infinity , FSIZE infinity/infinity , MEMLOCK infinity/infinity
load average:6.34 2.54 1.55
/proc/meminfo:
MemTotal: 3964656 kB
MemFree: 114400 kB
MemAvailable: 3092 kB
Buffers: 96 kB
Cached: 30356 kB
SwapCached: 0 kB
Active: 3758676 kB
Inactive: 16664 kB
Active(anon): 3756488 kB
Inactive(anon): 14232 kB
Active(file): 2188 kB
Inactive(file): 2432 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
We observed that the available memory (MemAvailable) was just 3092 kB, meaning that the physical memory was exhausted. This confirmed our suspicion that the JVM's heap size was too large, leaving insufficient memory for native allocations.
The Solution
Knowing that the issue was due to physical memory exhaustion, we needed to lower the -Xmx value and choose an appropriate replacement. As a general rule of thumb, 25%-50% of physical memory should be left for the rest of the system.
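Applying that rule of thumb to our instances gives a rough target range (a back-of-envelope estimate using the MemTotal value from the log):
Total physical memory:   3964656 kB ≈ 3872 MB
Leave 25% for the rest:  heap ≤ 3872 MB * 0.75 ≈ 2904 MB
Leave 50% for the rest:  heap ≤ 3872 MB * 0.50 ≈ 1936 MB
So a heap somewhere between roughly 1.9 GB and 2.9 GB would be reasonable.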
According to the PROCESS section:
--------------- P R O C E S S ---------------
Heap address: 0x000000074e000000, size: 2848 MB, Compressed Oops mode: Zero based, Oop shift amount: 3
Narrow klass base: 0x0000000800000000, Narrow klass shift: 0
Compressed class space size: 1073741824 Address: 0x0000000800000000
Heap:
garbage-first heap total 2916352K, used 1955145K [0x000000074e000000, 0x0000000800000000) region size 1024K, 141 young (144384K), 18 survivors (18432K)
Metaspace used 80907K, capacity 82397K, committed 83740K, reserved 1124352K
When the OutOfMemoryError happened, the used heap size was around 1909 MB (1955145K), significantly less than the total heap size of 2848 MB. Therefore, it is reasonable to set the Java heap size to 2048 MB (-Xmx2048m), which would leave more memory available for native allocations and other system processes.
Additionally, we noticed that the only JVM option set was -Xmx. From the Global flags part of the fatal error log, the InitialHeapSize was only about 62 MB:
[Global flags]
intx CICompilerCount = 2 {product} {ergonomic}
uint ConcGCThreads = 1 {product} {ergonomic}
uint G1ConcRefinementThreads = 2 {product} {ergonomic}
size_t G1HeapRegionSize = 1048576 {product} {ergonomic}
uintx GCDrainStackTargetSize = 64 {product} {ergonomic}
size_t InitialHeapSize = 65011712 {product} {ergonomic}
size_t MarkStackSize = 4194304 {product} {ergonomic}
size_t MaxHeapSize = 2986344448 {product} {command line, ergonomic}
size_t MaxNewSize = 1790967808 {product} {ergonomic}
size_t MinHeapDeltaBytes = 1048576 {product} {ergonomic}
uintx NonNMethodCodeHeapSize = 5825164 {pd product} {ergonomic}
uintx NonProfiledCodeHeapSize = 122916538 {pd product} {ergonomic}
uintx ProfiledCodeHeapSize = 122916538 {pd product} {ergonomic}
uintx ReservedCodeCacheSize = 251658240 {pd product} {ergonomic}
bool SegmentedCodeCache = true {product} {ergonomic}
bool UseCompressedClassPointers = true {lp64_product} {ergonomic}
bool UseCompressedOops = true {lp64_product} {ergonomic}
bool UseG1GC = true {product} {ergonomic}
As described in the official Java documentation:
For server deployments, -Xms and -Xmx are often set to the same value.
We considered setting the initial heap size (-Xms) to match the maximum heap size (-Xmx) to avoid heap resizing during runtime, which can cause performance overhead.
Therefore, we updated our JVM options to:
-Xms2048m -Xmx2048m
This adjustment reduces the maximum heap size to 2 GB and sets the initial heap size to the same value, thereby leaving approximately 2 GB of memory for the operating system and the JVM's native memory needs.
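A quick way to double-check that the new options take effect (not part of the original write-up, just a convenient verification; the values are printed in bytes):
java -Xms2048m -Xmx2048m -XX:+PrintFlagsFinal -version | grep -E 'InitialHeapSize|MaxHeapSize'
Both InitialHeapSize and MaxHeapSize should show 2147483648, i.e. 2048 MB.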
An Unexpected Outcome
Despite applying this hotfix, the issue was not completely resolved. The 503 errors continued to occur, which puzzled us. Upon further investigation, we discovered that OutOfMemoryErrors were still happening, but this time for a different reason. In the next part of this series, I'll explain what caused the issue and how we addressed it.