ControllerThread high CPU / Lucee 6.2.1.122 clean install stability issues

Guys, I really need help with a couple of issues we have been seeing since moving from Lucee 6.0.3.1 to 6.2.1.122.

Just for context: as part of upgrading to 6.2.1.122 I also upgraded Java from 11 to 21 and Tomcat from 9 to 11. Everything worked fine on dev, but on our prod servers, under production load, Lucee started showing what appear to be random issues:

  • Lucee starts OK but falls over within the next hour. No errors are posted to catalina.out, just normal entries; it simply starts throwing 500 errors on every site.
  • Lucee starts partially OK. Roughly 100 sites will load fine, but another 100 will throw 500 errors. Again, no errors are written to catalina.out.

To try to fix this I opted for a clean install, i.e. uninstalled Lucee, deleted /opt/tomcat, then reinstalled Lucee keeping the supplied Java and Tomcat.

Sadly that didn’t help and we are still seeing the exact same stability issues. Again, catalina.out is clean, only normal entries. No errors posted when the 500 errors start.

One other thing that looks odd is very high CPU bursts lasting 30 minutes on average, on the hour, every hour, for roughly 48 hours. Looking at the threads, it is the Lucee Controller thread. Once it settles after 48 hours, the server falls over shortly afterwards and we have to restart it again.

Sometimes we have to restart Lucee 2 or 3 times within an hour before it comes up properly and stays up.

Before this upgrade/clean install, our prod servers were perfectly stable.

The only thing I haven’t tried is deleting the .CFConfig.json file and manually entering all the settings again. No idea whether that will help.

Don’t forget to tell us about your stack!

OS: Ubuntu 22.04.5 LTS
Java Version: 21.0.7 (Eclipse Adoptium) 64bit
Tomcat Version: Apache Tomcat/11.0.6
Lucee Version: 6.2.1.122


Here is what I’d try first, because it can be done quickly:

  • try switching to the latest Lucee release candidate, 6.2.2.66
  • try switching/downgrading Java to a JRE 17

Here are the steps (a shell sketch follows this list):

  • Do a full backup of your server
  • download a Java 17 JRE and place it directly beside the working jre directory (e.g. jre_17)
  • copy the lib directory to lib_6.2.2.66 and remove the single Lucee Jar file in it
  • download the Lucee 6.2.2.66 jar file to the created directory
  • make sure the above new directories (and files recursively in those) have the correct file permissions (compare it to your working directories)
  • stop Lucee
  • swap the working directories with the new directories by renaming them
  • restart Lucee
  • Roll back if needed by stopping Lucee and undoing the directory renaming described above.
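
If it helps, here is a rough shell sketch of that lib swap. It assumes the default installer layout under /opt/lucee/tomcat, a systemd service named lucee, and an example download location for the jar (all assumptions, so adjust paths, owner and service name to your install):

cd /opt/lucee/tomcat                               # assumption: default installer layout
cp -a lib lib_6.2.2.66                             # copy the working lib directory
rm lib_6.2.2.66/lucee-*.jar                        # remove the old Lucee jar from the copy
cp /tmp/lucee-6.2.2.66.jar lib_6.2.2.66/           # example path for the downloaded 6.2.2.66 jar
chown -R --reference=lib lib_6.2.2.66              # match the owner of the working directory
sudo systemctl stop lucee                          # assumption: service name; yours may differ
mv lib lib_6.2.1.122 && mv lib_6.2.2.66 lib        # swap the directories by renaming
sudo systemctl start lucee
# the Java 17 swap follows the same pattern: unpack the JRE beside the working jre directory, then rename
# rollback: stop Lucee, rename the directories back, start again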

This is just a quick shot in the dark to see whether your setup gets more stable and to gather some further information.

@andreas curious, why Java 17?

Trying the 6.2.2.66 RC is a good idea though

There must be some clues in the Lucee logs…

If you haven’t already, I’d be bumping the application.log level to INFO

Do you have a stacktrace for the controller thread? Would be good to know what it’s doing during the high cpu
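
If you don’t have one handy, a quick way to grab it on Linux is jstack from the JDK; a rough sketch (the PID lookup and grep pattern are assumptions):

PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)    # assumes the standard Tomcat bootstrap class
jstack "$PID" > /tmp/lucee-threads.txt                    # full thread dump of the Lucee/Tomcat JVM
grep -i -A 25 controler /tmp/lucee-threads.txt            # Lucee spells the class "Controler", so match loosely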

We have some additional logging in 6.2.2.62 for when the controller thread is taking longer than 20s

https://luceeserver.atlassian.net/browse/LDEV-5670

Thanks so much, mate.

Here’s a stacktrace just now during high CPU:

java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
java.io.UnixFileSystem.hasBooleanAttributes(Unknown Source)
java.io.File.isFile(Unknown Source)
lucee.commons.io.res.type.file.FileResource.isFile(FileResource.java:657)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1397)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.commons.io.res.util.ResourceUtil.deleteFileOlderThan(ResourceUtil.java:1407)
lucee.transformer.dynamic.DynamicClassLoader.cleanup(DynamicClassLoader.java:311)
lucee.transformer.dynamic.DynamicInvoker.cleanup(DynamicInvoker.java:383)
lucee.runtime.engine.Controler.control(Controler.java:496)
lucee.runtime.engine.Controler.control(Controler.java:281)
lucee.runtime.engine.Controler$ControlerThread.run(Controler.java:124)

Here’s what it looks like in our monitoring. The hourly peaks from 8am are where it started after a Lucee restart:


@Zackster Nothing for sure, just a shot in the dark to isolate and rule things out. Honestly, I don’t have much further insight into Lucee these days (unfortunately I haven’t been as involved with Lucee development lately as I wanted to be).


OK, it’s this bug:

https://luceeserver.atlassian.net/browse/LDEV-5668

This only occurs when your server has been running more than 30 days

I believe you can work around it by stopping the server and purging your dynclasses folder (lucee-server\context\dynclasses), which should buy you another 30 days without the problem; the same applies per /WEB-INF/content/dynclasses. I’m assuming you are running in multi-context mode?
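
A minimal shell sketch of that workaround, assuming the default Linux install path and a lucee service name (adjust both to your setup):

sudo systemctl stop lucee                                      # or your equivalent stop script
rm -rf /opt/lucee/tomcat/lucee-server/context/dynclasses/*     # Lucee regenerates these classes on demand
sudo systemctl start lucee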

How many sites are you running under that single install?

Out of interest, how many files are under that folder?


Yeah nice. That certainly seems to fit, except for the “has been running 30 days” part. We see this happening 2hrs after Lucee startup.

We are running 200 sites on each server (give or take a few) under fair load. Each server instance is 8 CPUs with 64 GB RAM. We are seeing this across 3 servers.

The only files are under /opt/lucee/tomcat/lucee-server/context/dynclasses/lucee/invoc/wrap/v4 (188 files in total).
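
For reference, a quick way to get that count (using the same path as above):

find /opt/lucee/tomcat/lucee-server/context/dynclasses -type f | wc -l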

We were in multi-mode in Lucee 5 but then once single-mode was first supported on Lucee 6.x we changed to single and have been running single since.

I checked a few sites for WEB-INF dynclasses but that folder wasn’t present. Under WEB-INF all we have is WEB-INF/lucee/:

drwxr-xr-x 20 coldfusion coldfusion 4096 Jul 4 2024 .
drwxr-xr-x 3 coldfusion coldfusion 4096 Feb 27 2018 ..
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 bin
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 cache
drwxr-xr-x 10 coldfusion coldfusion 4096 Jul 3 2024 cfclasses
-rw-r--r-- 1 coldfusion coldfusion 2485 Jun 14 2024 .CFConfig.json
drwxr-xr-x 4 coldfusion coldfusion 4096 Feb 27 2018 components
drwxr-xr-x 4 coldfusion coldfusion 4096 Jun 14 2024 context
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 customtags
drwxr-xr-x 2 coldfusion coldfusion 4096 Mar 6 2018 debugger
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 deploy
-rw-r--r-- 1 coldfusion coldfusion 5 Feb 27 2018 exe-log
-rw-r--r-- 1 coldfusion coldfusion 5 Jun 14 2024 exeLog
-rw-r--r-- 1 coldfusion coldfusion 96 Feb 27 2018 .htaccess
-rw-r--r-- 1 coldfusion coldfusion 36 Feb 27 2018 id
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 lib
-rw-r--r-- 1 coldfusion coldfusion 13 Jun 14 2024 lib-hash
drwxr-xr-x 6 coldfusion coldfusion 4096 Feb 27 2018 library
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 locales
drwxr-xr-x 2 coldfusion coldfusion 4096 Sep 19 2023 logs
-rw-r--r-- 1 coldfusion coldfusion 8359 Feb 27 2018 lucee-web.xml.cfm
drwxr-xr-x 3 coldfusion coldfusion 4096 Jun 3 2018 remote-client
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 scheduler
drwxr-xr-x 2 coldfusion coldfusion 4096 Jun 14 2024 security
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 storage
drwxr-xr-x 2 coldfusion coldfusion 4096 Jul 4 2024 temp
-rw-r--r-- 1 coldfusion coldfusion 22 Jun 14 2024 version
drwxr-xr-x 2 coldfusion coldfusion 4096 Feb 27 2018 video

To make sure, I ran a 'locate' and the only path that turned up was /opt/lucee/tomcat/lucee-server/context/dynclasses

Interestingly, last week when I uninstalled Lucee, deleted /opt/lucee and did a clean install, this same bug started within 2hrs of starting up.

Thanks so much again for the help.

Thanks for the useful feedback; reckon I’ve figured out the root problem.

That’s awesome. You’re very welcome! I just read your updated notes on that bug ticket.

So what are the next steps as far as we’re concerned? Is there a way we can hotfix this issue once that bug ticket is resolved?

It will be in the next RC, which we are actively working on


Awesome. Sounds good.

We will apply it as soon as it’s good to go.

This has been fixed in 6.2.2.90-SNAPSHOT.

I know it’s urgent, but please test locally rather than going straight to prod

Thanks, mate. This is now on our dev servers. Will report back in a couple days.


Guys, just confirming this patch has sorted out the high CPU issue on our busy prod servers. Things are looking much better.


@Pondrey I very much appreciate your update, because this was the main issue I was waiting to see resolved before upgrading to 6.2 on my own production server.
