Lucee / Java Memory leak & GC issues Battle

For any developers out there migrating Adobe ColdFusion projects to Lucee who run into memory issues like I did, here are my 2 cents.

There isn’t much about this online, so I figured either no one else is having these issues and this post is of no use to anybody, or there are a few developers with the same issues who would be glad to find something about this topic.

Project information
Migrating a stable ColdFusion web application from a ColdFusion 10 / Oracle environment to a Lucee 5.2.8.50 / MySQL 8.1 environment.

Blocking issues
The Lucee instance kept crashing with out-of-memory errors after a couple of hours. Extensive tweaking of the JRE params and heap size had little to no effect. Our Old Gen memory space kept filling up, and triggering a full GC manually even seemed to fill it up faster. This issue had all the signs of a memory leak.
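
For anyone who wants to watch the Old Gen pool without a commercial monitor, here is a minimal sketch using the standard java.lang.management API. The class and method names (HeapPoolReport, describeHeapPools) are made up for illustration, and the pool names it prints depend on the garbage collector in use (for example "PS Old Gen", "G1 Old Gen" or "Tenured Gen"). It only reports on the JVM it runs in, so to inspect Lucee it would have to run inside the same JVM or be adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the heap memory pools of the current JVM.
final class HeapPoolReport {

    private static final long MB = 1024L * 1024L;

    // Returns one human-readable line per heap pool (Eden, Survivor, Old Gen, ...).
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, code cache and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // getUsage() returns null when the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means the maximum is undefined
            String maxText = max < 0 ? "undefined" : (max / MB) + " MB";
            lines.add(String.format("%s: used %d MB, max %s",
                    pool.getName(), usage.getUsed() / MB, maxText));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeHeapPools()) {
            System.out.println(line);
        }
    }
}
```

Logging these lines at a fixed interval gives a rough picture of whether the old generation keeps growing between collections, which is the pattern described above.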

Solution

  • Using FusionReactor to monitor memory and heap usage, and taking heap snapshots and comparing them, we found some third-party Java classes that were instantiated in our code at runtime. These instances appeared to be left behind in memory. Explicitly setting the variables holding them to null increased the time it took for Old Gen to fill up from 3 to 6 hours. Still not enough for a stable production application, but it’s something at least.

  • The heap snapshots also showed a lot of lucee.runtime.type.StructImpl instances. We learned about Lucee’s structNew type argument, which lets a struct hold its contents through weak references so they can be garbage-collected. We altered the application codebase to convert most structs to structNew("weak"). This didn’t change anything about the memory leak, but it made GC runs have a slight positive effect. Do not use this for structs in your application scope, however, or GC will clear those out as well.

  • Some part of the code was using a JavaLoader at runtime to load an additional jar for some specific image processing. Removing the JavaLoader altogether and loading this jar at startup as a Lucee lib lowered the initial heap usage after startup and made our heap last longer.

  • The big bull’s-eye was tracking down all java.util.TreeMap and java.util.LinkedHashMap instances, which were sometimes used in code to make a struct-like object that preserved the order of its keys. It appears that, in our setup, these native Java objects were not being garbage-collected. Refactoring our code to replace them with normal arrays and structs made our entire heap run smooth and stable. This had a huge impact!
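
To illustrate the weak-reference behaviour mentioned in the second point, here is a small plain-Java analogy using java.util.WeakHashMap. This is not how Lucee implements structNew("weak") (I have not checked its internals), and the class and method names are invented for the example. It shows the two sides of the trade-off: an entry stays put while something else still holds its key, and it becomes eligible for collection once nothing does, which is also why weak structs are a bad fit for the application scope.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Illustration only: WeakHashMap holds its keys through weak references.
final class WeakEntryDemo {

    // While the caller keeps a strong reference to the key, the entry survives a GC.
    static boolean entrySurvivesWhileKeyIsHeld() {
        Map<Object, String> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, "payload");
        System.gc(); // only a hint to the JVM
        return cache.containsKey(key); // key is still strongly reachable here
    }

    // With no strong reference to the key, the entry may disappear after a GC.
    // System.gc() is only a request, so this is best effort and not guaranteed.
    static boolean entryClearedAfterKeyIsDropped() throws InterruptedException {
        Map<Object, String> cache = new WeakHashMap<>();
        cache.put(new Object(), "payload"); // nothing else references the key
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();
            Thread.sleep(20);
        }
        return cache.isEmpty();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("held key survives: " + entrySurvivesWhileKeyIsHeld());
        System.out.println("dropped key cleared: " + entryClearedAfterKeyIsDropped());
    }
}
```

The second result depends on the collector actually running, so treat it as an observation rather than a guarantee.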

Our application still can’t run indefinitely without eventually running out of memory, but at least our efforts increased the stability to a workable state. The top class occupying the heap now is lucee.commons.collection.concurrent.ConcurrentHashMapPro, but issues with this class have already been picked up in the Lucee issue tracker.

I don’t know why these native Java objects weren’t a problem in Adobe ColdFusion. It was also running on Tomcat, but a slightly older version.
This has been a stressful witch-hunt, so I hope I can save someone else some trouble by sharing what helped in our situation.


Very interesting. We just ran into an issue where some of our pods shoot from around 7 GB of RAM up to 17 GB and then start attempting massive GCs in the Old Generation until they lock up and die (are killed by Kubernetes). I’ve been pulling my hair out all week trying to figure it out. I actually just reverted a month of code changes trying to narrow it down. I’ll definitely look through the code for ordered structs.


Try the latest snapshot first?

We’re in the process of seeing if we can upgrade.

I would suggest looking at your JDK, but here is what usually works for most instances.

-XX:PermSize=1024m
-XX:MaxPermSize=1024m
-XX:ParallelGCThreads=30

You can safely increase the number of threads by 5 or 10 per 2 GB of memory you allocate, dependent upon processor type and processor count.

I have found that while Lucee is very memory-conservative out of the box, many legacy applications I have to deal with need far more memory than its default settings provide.

I believe the permgen arguments are ignored in Java 1.8 and above, so I wouldn’t bother with that.
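
For context: PermGen was removed in Java 8 and replaced by Metaspace, so -XX:PermSize and -XX:MaxPermSize are ignored there with a warning, and -XX:MaxMetaspaceSize is the closest equivalent. If you are not sure which flags your JVM was actually started with, a sketch like the following can list leftover PermGen flags. The class and method names (JvmFlagCheck, findPermGenFlags) are hypothetical, and it has to run inside the JVM you want to inspect.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: spots PermGen flags that no longer have any effect on Java 8+.
final class JvmFlagCheck {

    // Returns the arguments that set the (removed) permanent generation size.
    static List<String> findPermGenFlags(List<String> jvmArgs) {
        List<String> hits = new ArrayList<>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-XX:PermSize=") || arg.startsWith("-XX:MaxPermSize=")) {
                hits.add(arg);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> stale = findPermGenFlags(jvmArgs);
        if (stale.isEmpty()) {
            System.out.println("No PermGen flags found among " + jvmArgs.size() + " JVM arguments.");
        } else {
            for (String flag : stale) {
                System.out.println("PermGen flag with no effect on Java 8+: " + flag);
            }
        }
    }
}
```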

@bennadel are you on a version after 5.3.35?

We (Pixl8) have been analysing heap dumps and believe a bug was introduced then