CPU near zero, Memory to the sky

Hi all,
I’m running a pair of Google Compute Engine VMs with the following configuration:

Intel(R) Xeon(R) CPU @ 2.20GHz / Installed RAM 16.0 GB
Windows Server 2022 Datacenter Version 21H2
IIS 10.0.20348.1
Lucee 5.2.9.31
Servlet container : Apache Tomcat/8.5.35
Java : 1.8.0_192 (Oracle Corporation) 64bit

Request timeout is set to 50 seconds and no concurrent requests are queued.
Application Listener is set to “Mixed”.
SQL target is a remote SQL Server 15.0.2095.3 (Web edition).
Lucee SQL 6.2.2.jre8 Application is used for datasources.

Lucee-server.xml settings

<default-resource-provider arguments="lock-timeout:1000;" class="lucee.commons.io.res.type.file.FileResourceProvider"/>
<resource-provider arguments="lock-timeout:10000;case-sensitive:false;" class="lucee.commons.io.res.type.http.HTTPSResourceProvider" scheme="https"/>

Servers are dedicated to API calls, meaning a lot of requests (say a million) per day.
Most of the calls require SQL requests and return a limited amount of data.
Those machines are behind a GCE load balancer with a simple probe (a cfm template) called every 5 seconds for health status.
The typical usage of the application should not be CPU-intensive; still, anything between 10% and 80% would be “normal”. On the Overview page (when all is OK) I see JVM heap size at 46% and non-heap at 10%.
(When serving OK) the Commons Daemon Service Runner consumes up to 5 GB and the IIS worker around 200 MB. Overall CPU is around 10% and memory around 50%.

I noticed a particular behavior that I cannot explain.
Looking at the Task Manager, I can see the “Commons Daemon Service Runner” going high on memory while its CPU usage stays flat at zero. There is also an IIS worker process that seems to follow the same behaviour (no CPU, high memory).

When this behaviour occurs, the probe can either respond “normally” or run into a timeout, several times within a short window (5-10 seconds), causing the LB to “misdirect” the API requests to the failing host.

As you’ve already guessed, I’m not a sys/dev-op [at all], so any simple advice, leads, or configuration checks would be greatly appreciated. I suspect we face a “too many threads” issue, as described here: https://lucee.daemonite.io/t/server-possibly-crashing-due-to-too-many-threads/5129/20 but this alone may not be relevant.
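One way to check the thread suspicion is to count JVM threads by state. The sketch below uses the standard `java.lang.management` API; the class name `ThreadStates` is made up for illustration. It reports on the JVM that runs it, so to inspect Tomcat the same calls would have to run inside that JVM or over a JMX connection; alternatively, a thread dump can be taken with the JDK's `jstack <pid>`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts the live threads of the current JVM by state.
class ThreadStates {
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount()
            + ", peak: " + threads.getPeakThreadCount());
        for (Map.Entry<Thread.State, Integer> entry : countByState().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

A large number of threads sitting in BLOCKED or WAITING while memory climbs would fit the flat-CPU picture, but that reading stays a guess until a dump shows what they are waiting on.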

(and please excuse my English & bleeding eyes)

Hey!

Given you’re using rather old versions of Lucee, Tomcat and Java, I’d suggest firstly testing and then upgrading to 5.3.9

We’ve done a lot of work on Lucee since 5.2.9, your problems have probably already been solved.

z


Zackster, thanks for the super fast response.
We surely want to do that, but that’s a thousand templates/components/functions we have to test.
In the meantime, do you believe there is something in the configuration/settings we may consider to improve the situation?

P.S.: I’ve tried to install your Analysis application on my local machine, but I’m not sure how/if it works on an IIS setup … Debug is on but the only results I get are from Lucee’s admin panel (accessing via Tomcat port 8888). (Please redirect me to the proper help/section if needed.)

not really, all the improvements are inside the Lucee engine as opposed to settings

did you enable debugging for the web context?


Yes I did (also a template), but I realise I probably missed a web context setup, sorry for that, will do & inform.

this is all easier in 5.3.9 too :slight_smile:


Hey Zackster,
we’ve done a bunch of tests and 5.3.9.166 seems to be deployable w/o noticeable side-effects.
We did run into an install/config problem due to an ‘old’ lucee-[w.x.y.z].jar version (see here) but mostly all seems fine.

We do have some questions regarding tomcat/java upgrade.
The targeted servers are running Tomcat 8 (8.5.11 & 8.5.35) and pretty ancient JREs (1.8.0_121 / 1.8.0_192), not to mention the SQL JDBC driver …
We want to upgrade “step by step” so that we can monitor anything going wrong.
Still, do you see something here that MUST be updated before we go live?

Finally, I wonder if the upgrade process for such an ‘ancient’ version should not be done by uninstalling Lucee as a whole and installing the latest stable packaged (lucee/tomcat9/java) version.

Your guidance is gold …
Thanks in advance.
A.

I’d go with a fresh install, fewer moving parts


Thanks Zackster, you rock.
Will settings remain (like mappings, datasources, scheduled tasks) or is there a way to save/restore them afterwards? [I saw that the “upgrade from install” wasn’t recommended] …

you can either copy/diff the lucee-server.xml and lucee-web.xml.cfm config files into the new installs, or use CFconfig to export / import


Well, things didn’t go exactly as expected …
We noticed a drastic rise in response times.
Something like 2 seconds more (*) per page (vs. 5.2.9.31), as if those 2 seconds were systematic, even for our probe call (the simplest CFML template you can find: it just sets a variable and displays “OK”).
Does that ring a bell?

(*) I mean: 2080 ms vs. 80 ms … yes, that far … :cold_sweat:

We are starting to believe that the problem may come from “outside” Lucee. Boncode, perhaps?

I’m not clear if you’re saying Lucee itself is using memory or if you’re saying IIS is. I don’t know what the commons daemon service is, but looking at your operating system’s task manager will NOT show you a useful number in regards to how much memory Java is using.

You need to get ahold of FusionReactor here. It is going to give you much better information on memory and what is actually in use. Keep in mind “allocated” heap is not the same as “used” heap!
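To make the allocated-versus-used distinction concrete, the JVM exposes both numbers through the standard `MemoryMXBean`. The sketch below is only an illustration (the class name `HeapReport` is hypothetical) and it reports on the JVM that runs it, not on Tomcat; monitoring tools, or the JDK's `jstat -gc <pid>`, read the same kind of counters from a running server.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: shows used vs. committed ("allocated") vs. max heap.
class HeapReport {
    /** Returns {used, committed, max} in bytes; max is -1 when undefined. */
    static long[] heapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return new long[] { heap.getUsed(), heap.getCommitted(), heap.getMax() };
    }

    static String format(long[] heap) {
        long mb = 1024L * 1024L;
        String max = heap[2] < 0 ? "undefined" : (heap[2] / mb) + " MB";
        return "heap used=" + (heap[0] / mb) + " MB, committed="
            + (heap[1] / mb) + " MB, max=" + max;
    }

    public static void main(String[] args) {
        System.out.println(format(heapBytes()));
    }
}
```

The committed figure is what the JVM has reserved from the operating system for the heap, so it is roughly the part an OS-level tool would count; the used figure is what live and not-yet-collected objects occupy inside it.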

It is common for reverse proxies such as Boncode to have a 1 to 2 second delay due to networking issues. The fix is very simple however.


You’re a wizard … We continue to dig but you definitely nailed a large part of it (Tomcat address ::1) ! Thanks !
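For anyone hitting the same delay, a quick check is to see which addresses `localhost` resolves to and whether Tomcat accepts connections on each of them. The thread does not spell out the exact change made, so the following is only a diagnostic sketch: the class name `LocalhostProbe` is hypothetical, and port 8009 is assumed because it is the usual AJP connector port (pass another port as the first argument).

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Hypothetical diagnostic: times a TCP connect to every "localhost" address.
class LocalhostProbe {
    /** Milliseconds needed to connect, or -1 if refused or timed out. */
    static long connectMillis(InetAddress address, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (IOException e) {
            return -1L;
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Assumption: 8009 is the AJP port; adjust to the local setup.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8009;
        for (InetAddress address : InetAddress.getAllByName("localhost")) {
            long ms = connectMillis(address, port, 3000);
            System.out.println(address.getHostAddress() + " port " + port + " -> "
                + (ms < 0 ? "no connection" : ms + " ms"));
        }
    }
}
```

If one of the two loopback addresses (`::1` or `127.0.0.1`) fails or is slow while the other answers immediately, a client that tries the failing one first would pay that cost on every new connection.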


I will not close this right now, as we will be monitoring tonight and tomorrow before raising the victory flag. But I can’t wait to say thanks to both of you for the incredibly fast and efficient help!
I owe you a (bunch of) beer(s) if you ever travel to the Paris area!


Haven’t been to Paris since the last CFCamp… which speaking of, was just announced today for the 2023 season! (June 22/23rd) I hope to make it to Munich and perhaps so will you :slight_smile:


… might be an option (Munich) :slight_smile:
About our config: we went live this morning on a single server.
Things are better, we gained 1 second, but we still have hundreds of milliseconds of extra delay.
We now wonder if the SQL driver (the latest one embedded with 5.3.9.166) might be the bottleneck.
You mentioned FusionReactor, but I wonder how much it will weigh on a production server’s performance; is that “acceptable”?
BR

that’s easy to test, call a page which doesn’t make a db request?


Without a DB request we can still see a delta:
15 ms vs. 30 ms (thanks to bdw429s’ Tomcat fix; before, it was 15-30 ms vs. 2 s+).

Edit:
Remarkably, the response time is way more variable on 5.3.9.166 [from an identical ~15 ms up to double, ~30 ms]. It is hard to tell with such tiny values … so many factors interfere with the response time …

Also, we notice that CPU usage seems noticeably higher (30-50% range vs. 15-30%).
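With values this small and this variable, comparing a median and a high percentile over many requests is usually more telling than eyeballing single timings. Below is a minimal sketch of such a measurement; the class name `ProbeTimer`, the default URL and the run counts are all placeholders, not anything taken from the setup above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;

// Hypothetical timing harness for a probe-style page.
class ProbeTimer {
    /** Nearest-rank percentile, with p in (0, 100]. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** Times one GET in milliseconds, including reading the whole body. */
    static long timeGet(String url) throws IOException {
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // drain the response so the full round trip is measured
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass the real probe address as the first argument.
        String url = args.length > 0 ? args[0] : "http://127.0.0.1/probe.cfm";
        for (int i = 0; i < 20; i++) {
            timeGet(url); // warm-up requests, results discarded
        }
        long[] samples = new long[200];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = timeGet(url);
        }
        System.out.println("median=" + percentile(samples, 50) + " ms, p95="
            + percentile(samples, 95) + " ms, max=" + percentile(samples, 100) + " ms");
    }
}
```

Running it against the same template on both versions, from the same client, gives two sets of figures that can be compared directly.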

it all depends on your code I guess?

how does a simple file doing just <cfoutput>#now()#</cfoutput> compare?
