GC-related server crashes in the last couple of days

Hi there, I've just started experiencing server overloads, and ultimately crashes, on a working production server on Amazon EC2 Linux.
The error is GC-related and it appears the server has run out of memory.
In the last couple of days I ran a couple of server updates with the recommended security fixes. I upgraded to the latest stable Lucee, but server load was still high with very slow page loads, so I tried going back down to Lucee 5.2.1.9, but after a couple of hours I started getting GC errors and the site went down.

Apache Tomcat/8.0.35
Java 1.8.0_92 (Oracle Corporation) 64bit
Architecture 64bit
Don't really know what to try next.

Thanks

What are your GC settings? Or better yet, what LUCEE_JAVA_OPTS settings are you using in general?

First thing to try is updating your Java to 1.8.0_181.
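On Amazon Linux that's normally just a package install/update; something along these lines (the exact package name depends on your repos, so treat this as a sketch):

```
# Install/upgrade the 1.8 JDK from the Amazon repos
# (package name varies by repo -- check with: yum list available | grep java-1.8)
sudo yum install java-1.8.0-openjdk

# Confirm the version the server is actually running afterwards
java -version
```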

Thanks, I just did a yum install of Java 1.8 (the latest version I could get from Amazon), which seems to have fixed the issue. CPU load is much lower and the server has been stable for several hours now.
It's now been OK all day; server load average went from 1.7 to 0.03.


The problem came back, so I increased the Java memory allocation from 256 MB/512 MB to 1 GB/2 GB.
That seems to have sorted it, though I'm not sure why it would suddenly require more resources; the server had been running smoothly for 18 months.
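For anyone following along, that change is just the min/max heap flags; in my setup it ends up looking roughly like this (file location depends on how Lucee/Tomcat was installed):

```
# tomcat/bin/setenv.sh (location assumed -- adjust for your install)
# Previously: -Xms256m -Xmx512m
LUCEE_JAVA_OPTS="-Xms1g -Xmx2g"
export LUCEE_JAVA_OPTS
```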
Would an Amazon security update make that much of an impact?

You could look at jClarity's Censum (https://www.jclarity.com/) to gain insight into GC tuning. I've downloaded their free trial in the past and found it very helpful in pointing me in the right direction.

You'll also find FusionReactor helpful for pinpointing what code you could refactor to keep memory usage down. In my experience, out-of-the-box GC settings do not handle the creation of large numbers of CFCs in a loop well at all (or at least they didn't in past versions of Java). The problem I experienced was that the GC process, with default settings, didn't clean up objects quickly enough and they got promoted to the PermGen memory space. Java memory allocation increased in a stair-step fashion day by day until an out-of-memory error occurred.

Your issue is almost certainly something else; the above is simply illustrative of how to go about debugging a memory issue. My solution was to use a struct instead of an object in a long-running query operating on a growing dataset, to change the JAVA_OPTS settings so GC ran more frequently/aggressively (which takes more server resources but is sometimes the right tradeoff), and to increase memory so that long-term total memory usage sat at about 50% of capacity, if I remember correctly (which I may not). In any case, with memory I was shooting for optimal reliability, so I wanted plenty of reserve.
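I don't have my exact settings from back then, but as a rough sketch of the "collect earlier and more often" idea (these particular CMS flags are illustrative only and should be validated against your own GC logs):

```
# Illustrative only -- verify against your own GC logs before adopting.
# Fixed heap with headroom (the ~50% long-term usage target mentioned above).
JAVA_OPTS="$JAVA_OPTS -Xms2g -Xmx2g"
# Start old-gen collections earlier than the default so garbage is reclaimed
# before the heap fills up (costs more CPU, but avoids the stair-step growth).
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly"
export JAVA_OPTS
```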

Hi @dnando, do you have any specific examples of changing the LUCEE_JAVA_OPTS? We usually just change the min and max allocated memory and switch the GC to ParallelOld. I'd love to hear other viable tuning options!
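For context, what we typically set today amounts to something like this (values are just examples):

```
# Heap min/max plus the parallel old-gen collector (example values only)
LUCEE_JAVA_OPTS="-Xms1g -Xmx2g -XX:+UseParallelOldGC"
export LUCEE_JAVA_OPTS
```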

The process looks like this: enable GC logging on the production machine, analyze the logs, attempt GC modifications, rinse and repeat. In the past, Mike Brunt posted about his work in this space on his cfwhisperer blog. After digging into this area in depth, my sincere advice as a do-it-yourself sort of guy is to use Censum and take advantage of that team's current expertise. JVM GC has changed significantly over the years, and optimized settings depend on your application and the loads it encounters. This is one case where generic, most likely out-of-date advice is not applicable.
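As a concrete starting point for the logging step, the standard Java 8 flags look roughly like this (log path and rotation sizes are just examples):

```
# Java 8 GC logging -- feed the resulting log to Censum or a similar analyzer
# (log path and rotation settings are examples; adjust for your server)
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/tomcat/gc.log"
JAVA_OPTS="$JAVA_OPTS -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M"
export JAVA_OPTS
```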
