Lucee 5.2.x Java Heap Issues


Are there any other Mura users out there who have been experiencing issues with your Java heap slowly incrementing (heap and non-heap)? Since upgrading from 5.1.1.65 I have been having constant issues. And it’s not necessarily that I get heap errors, but that the CPU rails at 99% while the heap sits at 85% and the non-heap at 15% (or some similar ratio).

My logs aren’t showing any errors aside from some occasional issues connecting to a Twitter feed; nothing that I would think would lead to this.

On dev servers that don’t get any traffic, everything is stable and fine, but live servers will eventually require a restart. I’m having this issue on three servers; all running:

Win Server 2012 R2
IIS 8.5
Lucee 5.2.5.20 RC
(upgraded from 5.2.4.37 as I was needing to restart almost daily; haven’t gone to 5.2.5.20-Final yet)
Mura 7 / MySQL

Any thoughts would be greatly appreciated.

T.

I was able to find some similar issues that were caused by threads that never timed out. I found three different issues that caused this condition:

  1. MSSQL queries that never returned
  2. a regex infinite loop
  3. problems when Ehcache hung

I used FusionReactor to see what was going on… you can get a trial license for free. Now that I’ve used it, I think it’s an invaluable tool.
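For the first cause on that list, a per-query timeout keeps a hung database call from pinning its thread forever. A minimal sketch, assuming a hypothetical datasource name:

```cfml
// Hedged sketch: a timeout makes a hung MSSQL call throw an error
// instead of blocking its thread indefinitely. "myDSN" is a placeholder.
result = queryExecute(
    "SELECT TOP 10 id, title FROM content",
    [],
    { datasource: "myDSN", timeout: 30 } // seconds
);
```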


I’ve done some investigation and filed a bug report about this issue. It’s not specific to Mura, as I have the same problem with my totally custom application. Please vote for, comment on, and watch the issue so that we can get some traction and get it fixed!

https://luceeserver.atlassian.net/browse/LDEV-1640

I added a note to the ticket to see if we can get a JVM heap dump when the issue has occurred. The same goes for anyone else experiencing the problem. DM me if you can get one, but don’t post it publicly since a heap dump might contain sensitive info.
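For anyone who wants to capture a dump without installing extra tooling, here is a sketch of triggering one from CFML via the HotSpot diagnostic MBean (the output path is a placeholder, and the target file must not already exist):

```cfml
// Hedged sketch: ask the JVM to write a heap dump of the running
// Lucee instance. Requires a HotSpot-based JVM.
mbs  = createObject("java", "java.lang.management.ManagementFactory").getPlatformMBeanServer();
name = createObject("java", "javax.management.ObjectName").init("com.sun.management:type=HotSpotDiagnostic");
mbs.invoke(
    name,
    "dumpHeap",
    [ "C:\dumps\lucee-heap.hprof", javaCast("boolean", true) ], // true = live objects only
    [ "java.lang.String", "boolean" ]
);
```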


@kuc_its @rrhodescf @al3xnull Can you see my comment here:
https://lucee.daemonite.io/t/announcing-lucee-5-2-5-final/3233/20?u=bdw429s

I’m curious if that usage pattern applies to your app and if you can provide a heap dump for us to inspect.

I did see your comment and I’ve posted back to the Mura forum to see if I can get any feedback from there. It does appear that the core coding does utilize cachedwithin extensively.
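For context, the cachedwithin pattern under discussion looks roughly like the sketch below (names and values are placeholders, not Mura’s actual code). Each distinct SQL-plus-parameter combination is held in the query cache for the full timespan, so heavy use with many parameter variations can accumulate on the heap:

```cfml
// Hedged sketch of a cachedwithin query: the result is cached for
// one hour, keyed by the SQL text and its parameter values.
q = queryExecute(
    "SELECT id, title FROM content WHERE siteid = ?",
    [ { value: url.siteID, sqltype: "varchar" } ],
    { datasource: "myDSN", cachedWithin: createTimespan(0, 1, 0, 0) }
);
```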

T.

I’ve been doing some load testing on my server using Webserver Stress Tool 8 with simulations of 45 users and a random number of clicks per user over a 20 minute period.

The webserver is running:

Win 2012 R2
4 X6560 Xeon processors / 8 GB RAM
IIS 8.5
Lucee 5.2.5.20
Max heap set to 2 GB for testing purposes
Mura 7 connecting to a MySQL database

The site I’m testing ran at around a 13% average heap and 8-10% non-heap on version 5.1.1.65. It also ran at about 13% heap on 5.2.1.9 but I saw a gradual increase of the non-heap (about 1 - 2% over a 24 hr period). There haven’t been any code changes to the site.

When the simulation first starts, the Java heap begins to climb; Lucee appears to reclaim much of the heap at first, but as the test continues the heap keeps climbing and Lucee doesn’t reclaim more than about 20% of it (sometimes as little as 2%). At the beginning of each test, Lucee reclaims about 20% of the heap. During each test, the CPU fluctuated between 66% and 99%.

On average, at the end of each 20-minute test the heap was at 75-80%. When I started the initial test the heap was at 3% and the non-heap at 3% (the non-heap climbs to about 10% and stays steady around 10% throughout the duration of the tests). If I re-initiated the test, there was an initial reclaiming of about 20% of the heap; however, if I increased the duration of the test to 40 minutes the server eventually became unresponsive with a 95% heap, 10% non-heap, and the CPU railing at 99%.
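For tests like this, HotSpot GC logging makes reclamation (or the lack of it) visible without a profiler. A sketch of Java 8-era JVM arguments that could be added to Lucee’s JVM config (the log path is a placeholder):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:C:\lucee\logs\gc.log
```

If full GCs appear back-to-back while heap usage barely drops, that points at retained objects rather than normal churn.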

I ran the test on the same server against a simple static site with no database connections, and there was nominal heap growth and the heap was fully reclaimed.


Simulating a use case of approximately 4200 clicks per hour (the same test as above, but closer to a real-world example) on Lucee 5.2.6.35-SNAPSHOT, this issue seems to be resolved… the heap does climb, but it is actively reclaimed and the non-heap doesn’t get above 12%.

Unfortunately it appears I was premature in assuming this issue had been resolved. The same use-case scenario running over a 6-hour period resulted in a heap size of 76% and a non-heap size of 8%… The heap issue persists in 5.2.6.35, though the non-heap growth issue appears resolved.

Thanks for this update. We’ve flagged this for taking a closer look again, and comparing your results with others. Stay tuned. @21Solutions @micstriit

76% of how much? That is not necessarily bad or “too much”.

What happens after a day or two or three? Does it crash, or does it keep running at the 70% level?

Server eventually becomes unresponsive with the CPU railing at 99%

Thanks to the help of FusionReactor and a whole lot of testing, I finally determined the root cause of my issues (thanks in part to this post sending me down the cfhttp path: https://lucee.daemonite.io/t/upgrade-to-lucee-5-2-3-35-results-in-gc-overhead-limit-crashes/2766).

Essentially I had two components (Twitter and an Emergency Broadcast System) that were retrieving JSON files via cfhttp. These connections were never closing, and every new session opened another connection, which grew the heap and eventually resulted in the server becoming unresponsive.
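For anyone hitting the same wall, a sketch of the kind of fix this suggests: give every cfhttp call an explicit timeout so a stalled remote endpoint cannot hold its connection (and the request thread) open indefinitely. The URL is a placeholder for the Twitter/alert feed:

```cfml
// Hedged sketch: an explicit timeout (in seconds) bounds how long
// the cfhttp call can wait on a slow or dead feed endpoint.
http url="https://example.com/feed.json" method="get" timeout="10" result="feedResult";
if (find("200", feedResult.statusCode)) {
    feedData = deserializeJSON(feedResult.fileContent);
}
```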

@kuc_its So not a Lucee issue? Please confirm. Thanks.

These components hadn’t changed, and I ran into the issue when I upgraded to Lucee 5.2. So it’s not necessarily an issue with Lucee, but something changed in how cfhttp connections are handled between 5.1 and 5.2.

T.

Hi there. Just thought I’d add to this discussion… No real new data other than to say that, like everyone else here, my experience migrating up from Lucee 5.0.0.252 to 5.2.5.20 on Linux CentOS 7 was really horrible. Things initially looked stable after the install, but after about 6 hours, memory problems emerged and then the servers stopped entirely. Luckily, I only ever update 2 of our 4 web servers in the cluster, so I could fail these boxes out of production in our load balancer quickly and avoid downtime.

We’ve been on 5.0.0.252 since 2016, amazingly stable. After moving to 5.2.5.20, two different web servers crashed within hours. Sure, we might have messy cfhttp sessions (as suggested above), but shouldn’t those be handled similarly between 5.0 and 5.2? Or at least noted and warned? In my view, 5.2 is NOT production ready. This is a major, major issue.

Anyone running 5.2.6 seeing any issues with cfthread? We are still running 5.2.5.20 and want to make sure it is rock solid before we update Lucee.

Thanks in advance!
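Not an answer on 5.2.6 specifically, but given the hung-thread theme in this topic, one defensive pattern worth checking in your own cfthread usage is joining with a timeout, so a stuck thread cannot keep a request (and its memory) alive indefinitely. A minimal sketch:

```cfml
// Hedged sketch: spawn a worker thread, then join with a timeout
// (milliseconds) so the parent request never waits forever on it.
thread name="feedFetch" action="run" {
    // long-running work, e.g. a remote HTTP call
}
thread name="feedFetch" action="join" timeout="5000";
```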

We are migrating our e-commerce platform to Lucee 5.2 too.
I tested 5.2.6.60 and the latest snapshot (5.2.7.56) on Windows Server 2008 R2 (just used the Vivio installer with JRE 8).
Both versions crashed within the hour, whereas the same code on Railo 4.2 (currently used in production) has been stable for ages, and it even runs faster on Railo 4.2. How is that possible after 2-3 years of Lucee development? I had hoped Lucee 5 would give me a performance boost instead.

I hope the Lucee team is eager to fix this problem ASAP; it’s very important for many great Lucee users out there. This problem has been open for several months (at least) now. I think it’s time to tackle this thing.

If you guys tell me exactly what you need (logs, a heap dump from FusionReactor?), then I will make sure you get it soon.

The first step to solving any problem is understanding what the problem is.

It’s your code base, so have you identified which specific bits (i.e., requests, etc.) of your code run slower under 5? And what do you mean by slower: 5%? 50%?