Good to hear the confirmation of the 1g. Bummer to hear that even 2g doesn’t help. But since downgrading to the same Lucee version that worked didn’t help either, that surely suggests that what’s changed is not Lucee, per se. (Can you confirm that’s indeed the case: that downgrading to what you ran before did not help?)
As for my presentation not helping, note that it’s built upon others, which I referred to as it went. I can’t know if you’re saying you already knew what those discussed. I also can’t know what you do and don’t know about FR. I get the sense you may feel generally that it’s not able to help, but I’m confident that it can, and usually does.
For instance, have you confirmed which space in the heap is rising? And are you getting an OutOfMemoryError in any Lucee or Tomcat log? Or do you see it in your app or FR?
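If it helps as a start (and this is just a rough sketch using the JVM’s own memory pool MXBeans, not an FR feature), you could drop something like this into a test page to watch which pool is growing:

```cfml
<cfscript>
// List each JVM memory pool and its current usage, to spot which one is rising.
// Pool names vary by collector (e.g., "G1 Eden Space", "G1 Old Gen", "Metaspace").
mf = createObject("java", "java.lang.management.ManagementFactory");
for (pool in mf.getMemoryPoolMXBeans()) {
    usage = pool.getUsage();
    // note: getMax() can return -1 if the pool has no defined maximum
    writeOutput(pool.getName() & ": "
        & numberFormat(usage.getUsed() / 1024 / 1024) & " MB used (max "
        & numberFormat(usage.getMax() / 1024 / 1024) & " MB)<br>");
}
</cfscript>
```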
And are you saying the heap goes up as you run requests repeatedly? And have you perhaps tried forcing a GC from within FR, between requests, to see if the heap does not go down then? In that case, something is holding memory.
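If you’d rather check that in code rather than from within FR, here’s a minimal sketch (keeping in mind that a GC call is only ever a request, which the JVM may ignore):

```cfml
<cfscript>
// Ask for a GC, then report heap in use. Run this between test requests:
// if "in use" stays high even after the GC, something is still holding
// references to that memory.
rt = createObject("java", "java.lang.Runtime").getRuntime();
rt.gc();
sleep(1000); // give the collector a moment to finish
usedMB = (rt.totalMemory() - rt.freeMemory()) / 1024 / 1024;
writeOutput("Heap in use after GC: " & numberFormat(usedMB) & " MB");
</cfscript>
```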
And did you confirm whether it’s something like sessions piling up, if your app has them enabled? If you’re running a load test tool, note that if that tool doesn’t send back the session cookies, then each request creates a new session. That’s an example of where a change in your load test (then versus now) may be more to blame than Lucee itself.
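As a crude way to watch for that, assuming your app uses an Application.cfc, something like this (the app name and timeout here are hypothetical) would log each new session during a test run:

```cfml
// Application.cfc -- a crude way to see if your load tool is creating
// a new session on every request (because it's not returning cookies).
component {
    this.name = "myApp";
    this.sessionManagement = true;
    this.sessionTimeout = createTimespan(0, 0, 30, 0);

    function onSessionStart() {
        // If this logs once per test request, each request is getting a
        // brand new session, and they'll pile up until they time out.
        writeLog(file="sessiontrack",
            text="new session from #cgi.remote_addr#, agent: #cgi.http_user_agent#");
    }
}
```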
I’ll add also that the talk was, as I said, from 2017. And while I said that what it offered is essentially unchanged, there ARE some new things in FR that may help more on this. First, I’d mentioned in that video that memory profiling was then coming soon. That may help (especially its ability to compare a memory profile at one point in time to another), though it’s available only in the FR Ultimate edition (or, like any such Ultimate feature, it’s also in the trial and Developer editions).
Then again, you say that ‘looking at the “profile” of the error in FR is unhelpful for me’. To be clear, the profile you refer to there would not be the memory profile I just spoke of. Indeed, there’s no “profile of errors” feature in FR. So I wonder first if you may mean the “event snapshot”. That too is new, but it helps more with app errors than with memory errors.
Or perhaps you refer here to the available “profile” of a request, offered (in the Ultimate edition) for any request that exceeds 3 seconds. To be clear, I did not suggest, in my talk or above, that the request profiling feature would help with memory problems (though I’m not saying it can’t help, indirectly).
FR also added (some releases back) a tracking of the amount of “allocated memory” (heap) per request. That can seem fabulous, and it may help you. It’s tracked both within each request’s details AND as an available Requests > By Memory list. One thing I’d warn about: it shows the total heap ALLOCATED over the life of that request, not the MAX memory used at any one time during the request. As such, you may see a request shown with a value here of a few gigs, even when your heap max was only 1g! Do I mean to suggest that new memory tracking feature is not valuable? No, but it tends to confuse people, who think it must be mistaken.
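To make that distinction concrete, here’s a contrived sketch of a request that would show a huge “allocated” figure while never holding much heap at once:

```cfml
<cfscript>
// Contrived: this request allocates a couple of GB over its life, yet only
// about 1-2 MB of that is live at any instant, since each iteration's string
// becomes garbage as soon as the variable is reassigned. A per-request
// "allocated memory" figure would be huge even with a small heap max.
for (i = 1; i <= 2000; i++) {
    s = repeatString("x", 1024 * 1024); // roughly 1-2 MB per iteration
}
writeOutput("done");
</cfscript>
```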
I could also argue that an app allocating such very large memory (even exceeding the heap) without FAILING could be seen as a sign of the JVM doing its job: it IS doing GCs, clearly, which should keep it from running out of memory. Still, someone may argue (rightly) that one may want to question what about the request, their code, or perhaps the nature of requests leads to such large memory allocation.
Again, that feature was new, so I did not cover it in my talk. I’ll add a comment on the video at YouTube to list a few such other things, like the above, as additional tools in FR to help with memory troubleshooting. And it’s indeed time for me to reprise/expand/revise that talk.
Finally, I just commented on how, if requests are COMPLETING and heap is being GC’ed, that’s good news. But you may well find that some request or requests are running AT THE TIME you run out of heap, and Lucee (or ACF) crashes. Here, @Phillyun was lamenting that FR dying with the CFML engine diminishes the value of FR. But that’s not as clear-cut as it may seem. Besides the post-crash troubleshooting video I referred to in my reply to him, where I highlight the value of the FR CP email alerts (showing what requests were running at the time of trouble), I’ll also note that I did yet another talk on processing the FR logs with Excel (or any spreadsheet tool). I referred to it in that memory talk also.
In that log processing video, I show a technique for finding what request/s WERE running at the crash, because the FR request log tracks BOTH the start AND the end of a request (whereas most web servers track only the END of a request). My technique uses that info to find what requests were running at the crash, which has helped many (and I show it about monthly in my consulting). It’s at 47:30 for a few mins, here:
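(And if you’d rather script that technique than use a spreadsheet, here’s the same idea as a rough CFML sketch. To be clear, the comma-separated layout I’m parsing here is hypothetical, just to show the start-without-end matching; you’d map it to the actual FR request log columns, which the talk covers.)

```cfml
<cfscript>
// Find requests that STARTED but never ENDED: i.e., in flight at the crash.
// HYPOTHETICAL log layout: one "requestId,phase,url" line per event, with
// phase being START or END. The real FR request log columns differ, so
// adjust the parsing to its actual format.
inFlight = {};
for (line in listToArray(fileRead(expandPath("request.log")), chr(13) & chr(10))) {
    parts = listToArray(line, ",");
    if (arrayLen(parts) lt 3) continue;
    if (parts[2] == "START") inFlight[parts[1]] = parts[3];
    else if (parts[2] == "END") structDelete(inFlight, parts[1]);
}
// Whatever remains started but never finished: running when logging stopped.
for (id in inFlight) writeOutput(id & ": " & inFlight[id] & "<br>");
</cfscript>
```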
Hope that or something here may help. I’m sure it’s frustrating. You just want the problem solved. You may even want the Lucee folks to own that there’s a bug, but even then you have to give them more to go on. Zack’s trying to get that with you. I just come at things from a diagnostic angle.
It may seem tedious, especially for those not familiar with the tools and techniques. I give away all the info to help folks wanting to do that for themselves, or again I can help folks via direct remote consulting. And I’ll repeat that if you don’t find the time valuable, you won’t pay for it, and we could of course stop as soon as you feel we’re not going to get anywhere. It’s rare that I can’t find and solve such issues.
I just find that trying to solve a knotty problem like yours via a forum like this is like playing Battleship, for those old enough to get the board game reference.