Felix causing memory leak on one server

kuc_its · February 23, 2018, 12:45pm

I have one server that eventually crashes with a GC Overhead error. It appears to be related to the FelixDispatchQueue, FelixFrameworkWiring, and FelixStartLevel threads. Catalina reports that each of these threads has been started but can’t be stopped. The web server requires a daily restart. This is only happening on this one server.

Win Server 2012 R2 / IIS 8.5 / Lucee 5.2.5.20

Pertinent Error details

23-Feb-2018 07:22:15.758 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixDispatchQueue] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Object.wait(Native Method)
 java.lang.Object.wait(Unknown Source)
 org.apache.felix.framework.util.EventDispatcher.run(EventDispatcher.java:1118)
 org.apache.felix.framework.util.EventDispatcher.access$000(EventDispatcher.java:55)
 org.apache.felix.framework.util.EventDispatcher$1.run(EventDispatcher.java:102)
 java.lang.Thread.run(Unknown Source)
23-Feb-2018 07:22:15.762 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixFrameworkWiring] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Object.wait(Native Method)
 java.lang.Object.wait(Unknown Source)
 org.apache.felix.framework.FrameworkWiringImpl.run(FrameworkWiringImpl.java:172)
 java.lang.Thread.run(Unknown Source)
23-Feb-2018 07:22:15.763 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixStartLevel] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 java.lang.Object.wait(Native Method)
 java.lang.Object.wait(Unknown Source)
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:283)
 java.lang.Thread.run(Unknown Source)

Any ideas, thoughts, suggestions?

T.

Terry_Whitney · February 23, 2018, 5:50pm

It’s not just a Lucee issue; it appears to be a Windows issue. Multiple ACF servers on Windows, from version 7 on up, all have memory leaks. The most common solution is to just reboot the servers once a day.

tsucaet · March 1, 2018, 8:57am

I have had exactly the same issues since I upgraded to Lucee 5.2.5.20 on my CentOS 7 server, so it’s definitely not a Windows issue (see also https://lucee.daemonite.io/t/lucee-5-2-x-java-heap-issues/3240).
I even created a new Droplet (NGiNX/1.12.2, Apache Tomcat/8.5.24, Java 1.8.0_152) on Digital Ocean and moved some Mura websites to the new machine; after a few days the memory leak appeared on the new machine as well. Even a daily restart is not a solution, because sometimes it crashes after a few hours. I spent hours investigating the problem but couldn’t find anything useful. I’m not a JVM expert, but this problem started happening after a Lucee upgrade (5.2.5.20). I can’t accept a daily restart for a long period, so I hope this can be fixed soon.

Zackster · March 1, 2018, 10:10am

Have you tried the latest snapshot release, 5.2.7.23-SNAPSHOT?

Change the update provider to snapshots in the server admin under services / update.

tsucaet · March 1, 2018, 10:47am

Thanks for the suggestion. I upgraded one of my servers to 5.2.7.23-SNAPSHOT and disabled the daily restart. I will post feedback in a few days.

Zackster · March 1, 2018, 10:49am

Hope it works!

Out of interest, which version were you using prior to 5.2.5.20?

tsucaet · March 1, 2018, 12:18pm

I don’t know exactly at which version the problems started, but I did some rollbacks to different 5.2.x versions and, if I’m not mistaken, also to 5.1.x versions, without any luck. At first I thought it had to do with upgrading some Mura websites from v6.2 to v7.0, which is why I deployed a new VM from scratch.

kuc_its · March 5, 2018, 8:05pm

Doing some testing… I noticed this same error on another server, but no GC issues.

I came across a Mura post regarding the MySQL connector: after upgrading it, the user found the connector no longer worked, and after downgrading back to the version he was originally running he started having GC Overhead issues. He reinstalled Lucee and the GC issues went away.

I did upgrade and then downgrade the MySQL connector on the server I’m having issues with.

tsucaet · March 14, 2018, 12:15pm

Hmm, the problem still occurs. After about 5 days my server again throws a “502 Bad Gateway”. No errors found in the log files (catalina.out). So it seems that the bug still exists. Any other suggestions? Thanks

kuc_its · March 21, 2018, 11:57am

Apparently the “bug” has been fixed in 5.2.7 - so for me, it’s wait until there is a release of 5.2.7; I guess we’re almost there. Hopefully this resolves my heap issues once and for all.

Just want to note that there have been a lot of issues with 5.2.x (5.1.x ran for months and months without so much as a hiccup); so much so that my higher ups suggested looking at ColdFusion again… just saying that when the budget handlers are suggesting we spend money…

bdw429s · March 21, 2018, 1:48pm

Do you have a link to the bug in question that you think is affecting you?

Also, you mentioned a GC overhead memory error in the OP and said you thought it was related to the Felix threads. Do you have some specific research (like heap dumps) to support that connection, or did you simply see some logs in the console about Felix and figure perhaps it was related? I ask since I’ve seen all manner of similar “this thread didn’t stop” logs in Tomcat, but I’ve never tied them to a specific memory leak before.
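
One way to gather evidence for or against the Felix-thread theory is to count the live threads with those names over time: if the counts grow after each application reload, the threads really are accumulating. A minimal sketch, assuming it can be run inside the same JVM as Lucee (the class and method names here are hypothetical, not part of Lucee, Tomcat, or Felix):

```java
import java.util.Map;
import java.util.TreeMap;

final class ThreadNameCensus {
    /** Counts live threads in this JVM whose name starts with the given prefix, grouped by name. */
    static Map<String, Integer> countByPrefix(String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        // getAllStackTraces() returns a snapshot of all live threads
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            String name = t.getName();
            if (name.startsWith(prefix)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `ThreadNameCensus.countByPrefix("Felix")` at intervals and comparing the results would show whether more than one copy of each thread is alive. A stable count would suggest the warnings are noise rather than the cause of the heap growth.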

You also mentioned a “502 Bad Gateway” error in your next to last message. Can you confirm whether that was accompanied by the GC overhead errors, or was that another issue? I ask because a bad gateway error is a super duper generic error that means nothing more than an upstream server isn’t responding in a timely fashion. That bad gateway error doesn’t necessarily point to any particular cause unless you’re also diagnosing the server on the back end at the same time to see why it’s stopped responding.

tsucaet · March 22, 2018, 8:13am

I can’t confirm the ‘502 Bad Gateway’ message was related to the GC overhead. I just restarted Lucee and didn’t investigate the issue, although there were no error messages in the log files.

My server has been running on 5.2.7.42-SNAPSHOT for about 9 days now without any issues. So it seems that the bug has been solved in one of the 5.2.7.x snapshots!

kuc_its · March 22, 2018, 12:01pm

I believe the heap issues I was running into were a result of upgrading the MySQL connector in Lucee, which killed all of my database connections. I downgraded to a previous version and got my connections back, but started having heap issues that required daily restarts. At the time I had just resolved a similar heap issue on another server that was the result of cfhttp calls. Combing the logs I saw a large number of Felix errors and warnings and thought this might be the culprit. Having since compared this to my other server logs, I realize that Felix was most likely not the issue (same errors/warnings in both logs).

I’ve just gotten a clean install of Lucee (5.2.5.60) on the server I was having issues with and will report back if this has resolved my heap issues. I had come across some posts on the Mura forum where people had reported similar issues with the MySQL connector after upgrading and downgrading, and a clean install was the only solution that worked for them.

bdw429s · March 22, 2018, 1:28pm

Thanks for the updates. It sounds like a lot of guessing and blind testing has been going on which makes it difficult to actually pinpoint the issues. If you keep getting “GC Overhead” errors on the latest snapshot of Lucee, then you’ll want to pull a heap dump of the JVM to analyze and see what is filling up your memory.
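
For reference, a heap dump is commonly pulled from outside the process with `jmap -dump:live,format=b,file=heap.hprof <pid>`, or produced automatically by starting the JVM with `-XX:+HeapDumpOnOutOfMemoryError`. It can also be triggered from code running inside the JVM. A minimal sketch of the in-process route, assuming a HotSpot-based JVM (the `HeapDumper` class name and the target path are hypothetical):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {
    /** Writes a heap dump of the current JVM to the given path and returns that path. */
    static Path dump(Path target, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the file already exists, so remove any stale dump first
        Files.deleteIfExists(target);
        // Newer JDKs expect the file name to end in ".hprof"
        bean.dumpHeap(target.toAbsolutePath().toString(), liveObjectsOnly);
        return target;
    }
}
```

Passing `true` for `liveObjectsOnly` dumps only reachable objects, which usually gives a smaller file that is easier to open in a heap analyzer. The dump can be as large as the heap itself, so the target disk needs that much free space.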

IamSigmund · March 27, 2018, 3:11pm

Hi all. Your humble Product Manager here. I wanted to chime in on this one. @kuc_its - I’ll echo Brad’s latest comment: if you could get a heap dump at some point, that would be most helpful. For the record, I’ve been memory-tuning CFML systems for decades, and the vast majority of memory problems are application-specific (or system-specific, which may be more relevant to your case, given that it may be DB driver-related). That is, they’re very hard to track down in general testing, so again, something more specific like a JVM dump would be great, if you can get it. But regardless, please do keep us posted on what you’re seeing.

Regarding your comment about the budget handlers, I’d be more than happy to speak with someone at your company about “sponsored” (paid) fixes. That should end up being immensely more valuable than the substantially higher costs that would have to be paid to Adobe in licensing fees. Let me know if you have questions/comments about this.

kuc_its · March 27, 2018, 6:17pm

I’ve upgraded to 5.2.6.60 - clean install and thus far things seem solid - There was a server restart in that time (Windows updates), but it appears the memory is being reclaimed and things are running smoothly. I’ll give it a few more days and see if anything goes awry.

In regards to the budget handlers - my response was exactly that; “If we’re going to spend money, then let’s get paid support from Lucee”. My comment may have come off more glib than I had intended; I had just come out of a meeting with the budget handlers.

The MySQL connector issue seemed to coincide with the CFHTTP issue; we had a bit of a time resolving issues around that and connections being left open; still determining where that’s going south (code, Lucee, Mura or a combination).

Also running Mura 7 and discovering some gotchas like using non-asynchronously loaded components (essentially calling the component from within the theme) which seems to load a new version of the component every time without releasing the previous connection, which of course leads to the heap issue.

It’s been a fun few months ;o)

If I find I’m having GC issues again, I’ll post a JVM dump.

T.

kuc_its · March 31, 2018, 12:22am

So it is much better, but the heap still climbs, about 10-15% per day, until the server needs a restart. Attached is a thread dump for Lucee for the server in question. Wondering if it’s associated with the item at line 169. Any direction you could give me would be greatly appreciated.
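
A thread dump shows what the threads are doing rather than what is occupying the heap, so it helps to put a number on the growth alongside it. A rough sketch of sampling heap usage from inside the JVM and converting two samples into a per-day rate (the `HeapSampler` class is hypothetical; readings taken between garbage collections are noisy, so samples taken shortly after a full GC are more comparable):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapSampler {
    /** Returns used heap as a percentage of the maximum heap, or -1 if no maximum is defined. */
    static double usedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * heap.getUsed() / max;
    }

    /** Growth in percentage points per day between two samples taken hoursApart hours apart. */
    static double growthPerDay(double earlierPercent, double laterPercent, double hoursApart) {
        return (laterPercent - earlierPercent) * 24.0 / hoursApart;
    }
}
```

For example, samples of 40% and 46% taken 12 hours apart work out to 12 percentage points per day, which is in the range described above.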

Thanks,

T.

threaddump.log (61.2 KB)

Terry_Whitney · April 11, 2018, 3:27pm

This may be a redundant question, but are you sure that you have 64-bit Java installed?

kuc_its · April 11, 2018, 3:43pm

Yep - definitely have 64-bit installed.

kuc_its · April 26, 2018, 11:12am

Updated to the 5.2.7.61 RC and so far so good. Would have usually needed a restart by now.