Memory issue with 5.3.6.61

Julian_Halliwell · June 2, 2020, 3:40pm

[Stack: OS: Windows; Java:OpenJDK 8u252-b09; Tomcat: 9.0.31]

Has anyone else had memory problems with the latest stable release 5.3.6.61?

We upgraded one of our prod machines to it from 5.3.4.85 ten days ago and it’s been fine.

But on our other one it seemed to result in a memory leak of some kind. This machine is quite busy but only during local business hours. We installed it on Sunday and all seemed well when quiet. But at 8am on Monday memory use started climbing abnormally.

Here’s what it normally looks like at that time on a weekday (running 5.3.4.85). There is 8GB allocated which is more than adequate. You can see Garbage Collection running normally, and this is the general pattern for the rest of a typical day.

5.3.4.85-0800-0859

But this is what we saw with 5.3.6.61. GC doesn’t seem to be recovering very much space:

5.3.6.61-0800-0859

I ran GC manually (using FR) but it didn’t have much effect either.

By mid afternoon it had climbed further:

5.3.6.61-1500-1559

…until by the end of the day the heap was full, the server ground to a halt and had to be restarted:

5.3.6.61-2100-2150

The same pattern played out again today more quickly so we’ve gone back to 5.3.4.85 which has been solid for over 2 months (despite being a “snapshot”).

Perhaps it’s related to this issue, although this is definitely heap memory, not metaspace. Or maybe this one?

ggriffit · August 24, 2020, 1:15am

Doesn’t look like you have seen any response. We see something similar on our side on very busy servers. Used to GC every couple hours and regain 60% or memory. Now we see it very often and only gaining maybe 20% each time.

We are also running latest java8 and tomcat 9.0.31… Our Lucee version just jumped to 5.3.6.61 from a 5.2 version that we don’t really care to bounce back to.

Julian_Halliwell · August 24, 2020, 8:44am

Thanks for chiming in with your experience, which confirms it only affects certain “busy” conditions. Unfortunately that makes it hard to identify the issue.

If you want to move on from 5.2, we’ve found 5.3.4.85 to be very solid. Don’t be put off by the SNAPSHOT label.

andreas · August 24, 2020, 2:44pm

Thanks for sharing this info Julian.

ggriffit · August 27, 2020, 3:30pm

Anyone have input if anything is being addressed in a future release around this? Is bug created? I am trying to decide next steps and backing up a release may be something we pass to QA next week for review. We always just deploy released versions in production.

ggriffit · September 14, 2020, 9:02pm

Julian, are you still running snapshot 5.3.4.85? Or have you found a newer snapshot or release candidate?

Julian_Halliwell · September 15, 2020, 6:47am

Yes, seems like we’re stuck on this version for now. Can’t really keep testing newer releases as the only way of doing so is by applying them to this particular production server at busy times, which is obviously very disruptive if they fail.

Julian_Halliwell · January 5, 2021, 2:40pm

Had another go at upgrading the troublesome server to the current stable release 5.3.4.47 and it seems the issue is still there.

But we have identified the specific area of the application which seems to be triggering it.

As before, the patch was applied on a Sunday and there were no issues until Monday morning when we started to see memory use rising and garbage collection recovering less and less.

We noticed though that jumps in memory were happening on the hour, which led our attention to a fairly complex scheduled task which, as it happens, runs on the hour during business hours only.

So we disabled the task temporarily and sure enough, memory usage stabilized around its current level. You can see all of this in the folowing FR memory graph:

As I say, the task is complex and we’re not sure which specific element is responsible, but it runs in a separate thread for one thing and performs various operations using parallel closures, so it seems likely it’s related to threading.

By accident I also discovered something strange. The task needed to be run again so we decided to do it manually so we could observe the expected rise in memory usage. Although not sure I’d have the expertise to understand it, I thought I’d try the Heap Dump extension Zac mentioned a while ago. I installed it but couldn’t see any way of accessing it or find any docs, so uninstalled it.

But when I returned to my FR graphs I saw that memory usage had dropped significantly at the exact moment of installing the extension. Note I didn’t run the heap dump tool, just installed it. I’ve replicated it since: just installing the extension clears out a bunch of memory. Perhaps installing any extension has this effect?

Not only that, but with the memory cleared we ran the hourly task again manually and this time GC seemed to be more effective afterwards as illustrated by this graph:

I’m not sure yet what to conclude. There’s clearly something wrong with Lucee versions 5.3.5+ in certain conditions, and our experience suggests it may be related to threads/parallelism.

But at least we know enough now to be able to control it, which makes testing easier.

@ggriffit does any of this make sense in your context?

Zackster · January 5, 2021, 3:14pm

the head dump is found here in the admin ( the latest snapshot has an xml bug and admin plugins don’t load, see [LDEV-3194] - Lucee) 5.3.8.119 is ok

I think installing a extension does a bit of a reset, hence the behaviour you’re seeing

Julian_Halliwell · January 5, 2021, 3:35pm

Thanks Zac. I poked around the extension and saw that the UI should be under Debugging but for some reason it wasn’t showing up there for me. Just tried it locally and yes I can see it now.

The reset is useful because it means we can avoid restarting Lucee which is disruptive.

But it doesn’t solve the issue unfortunately. We need to run that task and having re-enabled it, memory is now slowly filling up again and is not far off the ceiling. We can’t run a prod server like that so will have to revert to our known stable version 5.3.4.85 until we can refactor the task code to avoid this condition.

Zackster · January 5, 2021, 3:43pm

hmmm, the scheduled task logic was refactored to fix the DST bug (but that’s in only 5.3.8.82)

https://luceeserver.atlassian.net/browse/LDEV-1724

each scheduled task runs in a persistent background thread which sleeps most of the time,

you could try updating the scheduled task after it runs, which creates a new background thread, just bump the scheduler.log level up to trace to confirm it’s being recreated

once you have a heap dump, you can load it with Eclipse Memory Analyzer Open Source Project | The Eclipse Foundation and explore it

Julian_Halliwell · January 5, 2021, 4:05pm

I don’t think it’s the scheduling logic because we can trigger the issue by running the task manually.

Honestly I don’t know anything about heap dumps and wouldn’t know what to look for. Even Brad admits it’s not for the faint-hearted!

Julian_Halliwell · January 5, 2021, 4:09pm

I should add that, even though I’ve pointed the finger at threading I’m not sure of that either. The other server we run also does quite a bit of parallel processing (not as complex), and works fine with the latest release.

Zackster · January 5, 2021, 4:18pm

the servers are identical, same code, same versions n all?

Julian_Halliwell · January 5, 2021, 4:19pm

No, identical stack, but different applications. As I say we’ve identified that it’s a specific bit of application code that’s creating the condition, which involves a lot of parallel threads.

Julian_Halliwell · January 5, 2021, 8:53pm

I created a heap dump using the extension at a point where the heap was about to reach the max. A warning to others: doing this took over the server and prevented it from handling anything else for over half a minute!

The result was a 12GB file which I loaded into Memory Analyzer. That then completely took over my laptop for probably more than an hour as it parsed the data and spit out a “Leak Suspects” report. I’m afraid it doesn’t mean a great deal to me but the summary is as follows:

Problem Suspect 1

One instance of "lucee.runtime.CFMLFactoryImpl" loaded by "org.apache.felix.framework.BundleWiringImpl$BundleClassLoader @ 0x6005bbf08" occupies 4,094,718,480 (56.41%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentLinkedDeque", loaded by "<system class loader>", which occupies 4,094,717,824 (56.41%) bytes.

There are no other “suspects”, but there is more detail which I can share with anyone who might understand it.

bdw429s · January 5, 2021, 10:22pm

@Julian_Halliwell you can edit the init file for MAT and increase it’s RAM. I try to give it AT LEAST as much RAM as the heap dump is, if not more or you’ll have tons of GC church (MAT is written on Eclipse, and therefore Java underneath). If you can reproduce memory issues, it’s actually better to do so with a smaller heap just to make it more manageable.

The problem suspect above may be an issue, but much more info would be needed. If you want to zip your heap dump and share via Google drive, I can take a look. I don’t recommend sharing it publicly since sensitive info can be found in them.

Zackster · January 5, 2021, 10:40pm

what do you see in the application.log for entries with Reset xxx Unused PageContexts?

seems like possibly the pageContext cache is getting out of control ? It’s hardwired to be limited to only 100

github.com

lucee/Lucee/blob/5.3/core/src/main/java/lucee/runtime/CFMLFactoryImpl.java#L86


      
          /**
           * implements a JSP Factory, this class produces JSP compatible PageContext objects, as well as the
           * required ColdFusion specified interfaces
           */
          public final class CFMLFactoryImpl extends CFMLFactory {
          
          
	private static final long MAX_AGE = 5 * 60000; // 5 minutes
          	private static final int MAX_SIZE = 10000;
          	private static final int MAX_NORMAL_PRIORITY = 0;
          	private static final int MAX_NO_SLEEP = 0;
          	private static final int SLEEP_TIME = 100;
          	private static JspEngineInfo info = new JspEngineInfoImpl("1.0");
          	private ConfigWebPro config;
          	ConcurrentLinkedDeque<PageContextImpl> pcs = new ConcurrentLinkedDeque<PageContextImpl>();
          	private final Map<Integer, PageContextImpl> runningPcs = new ConcurrentHashMap<Integer, PageContextImpl>();
          	private final Map<Integer, PageContextImpl> runningChildPcs = new ConcurrentHashMap<Integer, PageContextImpl>();
          
          
	private ScopeContext scopeContext = new ScopeContext(this);
          	private HttpServlet _servlet;
          	private URL url = null;
          	private final CFMLEngineImpl engine;

perhaps this commit is the culprit?

https://luceeserver.atlassian.net/browse/LDEV-2559

Julian_Halliwell · January 6, 2021, 2:08pm

Thanks @bdw429s. I had increased the MAT’s -Xmx to 12GB, as it wouldn’t load the 11GB+ dump file with any less. But my laptop only has 8GB of RAM so it was forced to use swap space, hence the painful performance.

I generated another server heap dump today at a lower point in the memory increase cycle, but with the hourly task still having run a few times, and that’s yielded a dump file of 7.5GB. That zipped down to 1.75GB, but with 7-zip compression it’s 700MB. Really appreciate your offer and I’ve made that 7z file available to you on G-Drive privately.

Julian_Halliwell · January 6, 2021, 2:10pm

There’s nothing the application.log, even at INFO level.

I think you might be onto something with LDEV-2559 though: I actually ran the leak check again on today’s heap dump and in addition to the problem reported above it also flagged up 2 more suspects: java.lang.Thread and lucee.runtime.type.ArrayImpl

Thanks for the detective work.