Lucee 5.2.x Java Heap Issues

I should just add we did a clean install of 5.2.5 having removed our previous 4.5 installation.

5.2.5.0 ran stably for 3 days with only moderate heap elevation. I decided to go to the 5.2.5.20-RC, which resulted in two server restarts being required within 48 hours. The issue didn’t appear to be heap, but excessive CPU: the heap showed it had climbed to 83%, yet actual server memory usage was less than 2 GB while the CPU was running at 90%.

Nothing in the error logs regarding heap issues, only a couple of errors referencing a failed feed connection (Twitter).

Any ideas? At this point I think I’m going back to 5.1.1.65 (the last stable version I was running).

I’ve done some investigation and filed a bug report about this issue. Please vote for, comment on, and watch the issue so that we can get some traction and get it fixed!

https://luceeserver.atlassian.net/browse/LDEV-1640

About 3 weeks ago I updated my servers from 5.1.1.65 to 5.2.4.37; within 2 days I was restarting the server because the Java heap was being consumed - it appears that threads were not being removed. After reviewing the forums, it appeared that going back to a previous 5.2 version wasn’t really an option, as many users reported similar issues.

I decided to give the 5.2.5.0 snapshot a go, which appeared to be stable over a 72-hour period, and decided it might be better to move right up to the 5.2.5.20-RC; this resulted in two server restarts within 48 hours. The issue wasn’t the heap but excessive CPU consumption, with nothing in the error logs pointing to a cause.

I have decided to downgrade back to 5.1.1.65 (which ran rock solid for months). No changes to the code base on the server.

Server specs:

Windows Server 2012
IIS 8.5 / Lucee 5
Running Mura 7 websites (remote MySQL server)
4 GB of RAM

Anyone running into similar issues? Any thoughts?

T.

I’m running the latest version (5.2.4.37) and it’s incredibly unstable. I have to restart almost every day. In my case, what I’m seeing is that db queries are not timing out (using mssql driver, not jTDS), and when that happens, the calling request is not timed out either.

I use FusionReactor to help figure out what is going on, and I filed a bug here: https://luceeserver.atlassian.net/browse/LDEV-1622

As mentioned elsewhere, after experiencing serious memory issues with both 4.5 and 5.2.35, we have found 5.2.5.20-SNAPSHOT to be very stable indeed for the past 2-3 months.

I hope I’m right in expecting the 5.2.5.20 RC and eventual stable releases to be identical to the SNAPSHOT, although I notice the .lco files for the first two differ slightly in size.

Unfortunately this hasn’t resolved my issues. 5.2.5.20 is better, but the heap still eventually climbs to a point where the CPU is railing at 99%. I’m not getting any GC errors (or any other errors, for that matter), just a full heap and an unresponsive server due to excessive CPU.

I should note that I am running Mura 7 with MySQL on Win Server 2012 R2 with IIS 8.5.

I’ve done some investigation and filed a bug report about this issue. It’s not specific to Mura, MySQL, or Windows, as I have the same problem with my totally custom application on a Linux server with an MSSQL datasource.

Please vote for, comment on, and watch the issue so that we can get some traction and get it fixed!
https://luceeserver.atlassian.net/browse/LDEV-1640

Anyone else noticing the release candidate (5.2.2.70-RC) causing issues with heap size which eventually hangs Tomcat? Was running Mura 6.2 on 5.2.2.70-RC with OpenJDK 1.8.0-131 and it seemed that every hour (sometimes sooner) the service would go down and review of the logs would show heap size issues.

I don’t have much to go on as I have no idea what is causing it, but I seem to get it across all of our sites with Tomcat / Lucee / Mura. I have usually had Lucee set to auto-update on the release channel without issue, until this latest release, which seems to be causing the majority of our sites to go down every hour or so.

Yes! We experienced the same thing on several Windows servers we had running 5.2.2.70-RC. Tomcat appeared to hang, slowing the servers to a crawl, and we struggled just trying to stop Lucee.

Ultimately, we had to reboot every server just to get control of the situation. Downgrading to Lucee 5.2.1.9 solved the problem for us - at least so far.

It’s been about 30 hours since the downgrades, and all is working fine.

<phew!>

This appears to be an ongoing issue (with Mura sites at least). I’ve tried several releases up to the 5.2.5.20-RC, and the server eventually becomes unresponsive (due to the heap), but no GC or heap errors show in the logs.

5.2.1.9 was the last stable version in our situation as well.

Have you had any success with upgrades since posting?

T.

I’ve done some investigation and filed a bug report about this issue. It’s not specific to Mura, as I have the same problem with my totally custom application.

Please vote for, comment on, and watch the issue so that we can get some traction and get it fixed!
https://luceeserver.atlassian.net/browse/LDEV-1640

Are any other Mura users out there experiencing issues with your Java heap slowly climbing (heap and non-heap)? Since upgrading from 5.1.1.65 I have been having constant issues. It’s not necessarily that I get heap errors, but that the CPU rails at 99% while the heap is at 85% and the non-heap at 15% (or some similar ratio).
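
For anyone who wants to log those heap/non-heap ratios themselves, the JVM exposes them through the standard `java.lang.management` API. A minimal sketch (my own code, not Lucee’s internals; the class and method names here are made up for illustration):

```java
// Sketch: print heap and non-heap usage as a percentage of their limits,
// using the standard java.lang.management API. Generic JVM code, not
// Lucee-specific.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapRatio {
    static double percentUsed(MemoryUsage u) {
        // getMax() can be -1 (undefined), common for non-heap;
        // fall back to the committed size in that case.
        long limit = u.getMax() > 0 ? u.getMax() : u.getCommitted();
        return 100.0 * u.getUsed() / limit;
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        double heap = percentUsed(mem.getHeapMemoryUsage());
        double nonHeap = percentUsed(mem.getNonHeapMemoryUsage());
        System.out.printf("heap: %.1f%%  non-heap: %.1f%%%n", heap, nonHeap);
    }
}
```

These should be the same figures the Lucee admin and FusionReactor graphs are reporting, so a scheduled task that logs them can give you a growth curve without extra tooling.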

My logs aren’t showing any errors aside from some occasional issues connecting to a Twitter feed - nothing that I would think would lead to this.

On dev servers that don’t get any traffic, everything is stable and fine, but live servers will eventually require a restart. I’m having this issue on three servers; all running:

Win Server 2012 R2
IIS 8.5
Lucee 5.2.5.20 RC
(upgraded from 5.2.4.37 as I was needing to restart almost daily - haven’t gone to 5.2.5.20-Final yet)
Mura 7 / MySQL

Any thoughts would be greatly appreciated.

T.

I was able to find some similar issues that were caused by threads that never timed out. There were three different causes that I found:

  1. MSSQL queries that never returned
  2. regex infinite loops
  3. problems when ehcache hung
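
Since the common thread in all three is requests that never finish, a stack snapshot of every live thread is a quick way to spot them. A generic JVM sketch (my own code, not part of Lucee or FusionReactor; `jstack <pid>` or FusionReactor’s thread view gives the same information with far more detail):

```java
// Sketch: dump every live thread's name, state, and top stack frame, to
// spot request threads that never finish (hung queries, regex loops,
// stuck cache calls). Generic JVM code, nothing Lucee-specific.
import java.util.Map;

public class StuckThreads {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : all.entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] stack = e.getValue();
            String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
            System.out.printf("%-30s %-15s %s%n", t.getName(), t.getState(), top);
        }
    }
}
```

Run against a wedged server, the tell-tale sign is a growing number of request threads all parked in the same frame (a JDBC socket read, a regex matcher, an ehcache lock).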

I used FusionReactor to see what was going on… you can get a trial license for free. Now that I’ve used it, I think it’s an invaluable tool.

I’ve done some investigation and filed a bug report about this issue. It’s not specific to Mura, as I have the same problem with my totally custom application. Please vote for, comment on, and watch the issue so that we can get some traction and get it fixed!

https://luceeserver.atlassian.net/browse/LDEV-1640

I added a note to the ticket to see if we can get a JVM heap dump from when the issue has occurred. The same goes for anyone else experiencing the problem. DM me if you can get one, but don’t post it publicly, since a heap dump might contain sensitive info.
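
If it helps anyone produce one: on a HotSpot JVM you can trigger a heap dump either with `jcmd <pid> GC.heap_dump <file>.hprof` from the command line, or programmatically via the diagnostic MXBean. A minimal sketch (generic JDK API, nothing Lucee-specific; the temp-file handling is just for illustration):

```java
// Sketch: trigger a heap dump of the current JVM via the HotSpot
// diagnostic MXBean. Equivalent to `jcmd <pid> GC.heap_dump`.
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server,
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // Path must end in .hprof, and dumpHeap refuses to overwrite an
        // existing file, so create a fresh temp name and delete it first.
        File f = File.createTempFile("lucee-heap", ".hprof");
        f.delete();

        bean.dumpHeap(f.getAbsolutePath(), true); // true = live objects only
        System.out.println("heap dump written: " + f.length() + " bytes");
        f.delete(); // clean up the demo dump
    }
}
```

Note that dumping with live-objects-only forces a full GC first, so on a busy production server expect a pause; the resulting `.hprof` file can be opened in Eclipse MAT or VisualVM.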

@kuc_its @rrhodescf @al3xnull Can you see my comment here:
https://lucee.daemonite.io/t/announcing-lucee-5-2-5-final/3233/20?u=bdw429s

I’m curious if that usage pattern applies to your app and if you can provide a heap dump for us to inspect.

I did see your comment and I’ve posted back to the Mura forum to see if I can get any feedback from there. It does appear that the core coding does utilize cachedwithin extensively.

T.

I’ve been doing some load testing on my server using Webserver Stress Tool 8 with simulations of 45 users and a random number of clicks per user over a 20 minute period.

The webserver is running:

Win 2012 R2

4 × X6560 Xeon processors / 8 GB RAM

IIS 8.5

Lucee 5.2.5.20

For testing purposes, max heap was set to 2 GB

Running Mura 7 connecting to a MySQL database

The site I’m testing ran at around a 13% average heap and 8-10% non-heap on version 5.1.1.65. It also ran at about 13% heap on 5.2.1.9 but I saw a gradual increase of the non-heap (about 1 - 2% over a 24 hr period). There haven’t been any code changes to the site.

When the simulation first starts, the Java heap begins to climb and Lucee appears to reclaim much of it, but as the test continues the heap keeps climbing and Lucee doesn’t reclaim more than about 20% of it (sometimes as little as 2%). At the beginning of each test, Lucee reclaims about 20% of the heap. During each test, CPU fluctuated between 66% and 99%.

On average, at the end of each 20-minute test the heap was at 75-80%. When I started the initial test I had 3% heap and 3% non-heap (the non-heap climbs to about 10% and stays steady there throughout the tests). If I re-initiated the test, there was an initial reclaiming of about 20% of the heap; however, if I increased the duration of the test to 40 minutes, the server eventually became unresponsive with a 95% heap, 10% non-heap, and the CPU railing at 99%.

I ran the test on the same server against a simple static site with no database connections, and there was only nominal heap growth and the heap was fully reclaimed.

Simulating a use case of approximately 4200 clicks per hour (the same test as above, but more of a real-world example) on Lucee 5.2.6.35-SNAPSHOT, this issue seems to be resolved… the heap does climb, but it is actively reclaimed, and the non-heap doesn’t get above 12%.