CPU near zero, Memory to the sky

Antoine_BAPST · October 18, 2022, 3:14pm

Hi all,
I’m running a pair of Google Compute Engine VMs with the following configuration:

Intel(R) Xeon(R) CPU @ 2.20GHz / Installed RAM 16.0 GB
Windows Server 2022 Datacenter Version 21H2
IIS 10.0.20348.1
Lucee 5.2.9.31
Servlet container : Apache Tomcat/8.5.35
Java : 1.8.0_192 (Oracle Corporation) 64bit

Request timeout is set to 50 seconds and no concurrent requests are queued.
Application Listener is set to “Mixed”.
SQL target is a remote SQL Server 15.0.2095.3 (Web edition).
Lucee SQL 6.2.2.jre8 Application is used for datasources.

Lucee-server.xml settings

<default-resource-provider arguments="lock-timeout:1000;" class="lucee.commons.io.res.type.file.FileResourceProvider"/>
<resource-provider arguments="lock-timeout:10000;case-sensitive:false;" class="lucee.commons.io.res.type.http.HTTPSResourceProvider" scheme="https"/>

Servers are dedicated to API calls, meaning a lot of requests (say, a million) per day.
Most of the calls require SQL requests and return a limited amount of data.
Those machines are behind a GCE load balancer with a simple probe (a cfm template) called every 5 seconds for health status.
Typical usage of the application should not be CPU-intensive; still, anything between 10% and 80% would be “normal”. On the Overview page (when all is OK) I see JVM heap size at 46% and non-heap at 10%.
When serving normally, the Commons Daemon Service Runner consumes up to 5 GB and the IIS worker around 200 MB. Overall CPU is around 10% and memory around 50%.

I noticed a particular behavior that I cannot explain.
Looking at Task Manager, I can see the “Commons Daemon Service Runner” going high on memory while its CPU usage stays flat at zero. An IIS worker process also seems to follow the same behaviour (no CPU, high memory).

When this behaviour occurs, the probe can either respond “normally” or run into a timeout, several times within a short period (5-10 seconds), causing the LB to “misdirect” the API requests to the failing host.

As you’ve already guessed, I’m not a sys/dev-op [at all], so simple advice / leads / configuration checks would be greatly appreciated. I suspect we face a “too many threads” issue, as described here: https://lucee.daemonite.io/t/server-possibly-crashing-due-to-too-many-threads/5129/20 but this alone may not be relevant.
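One way to check the “too many threads” suspicion is to count JVM threads by state. The following is only a minimal sketch: `ThreadStates` is a hypothetical helper name, not part of Lucee or Tomcat, and it reports on the JVM it runs in. To say anything about Lucee, the same `ThreadMXBean` calls would have to run inside the Tomcat JVM or be read through a JMX tool attached to it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Lucee or Tomcat.
final class ThreadStates {

    // Tally threads by state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // the thread exited between listing and lookup
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds());
        System.out.println("live threads: " + bean.getThreadCount()
                + ", peak: " + bean.getPeakThreadCount());
        System.out.println("by state: " + countByState(infos));
    }
}
```

A large and growing number of BLOCKED or WAITING threads alongside flat CPU would be consistent with requests piling up behind a lock or a slow resource, but that reading is an assumption to verify, not a diagnosis.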

(and please excuse my English & bleeding eyes)

Zackster · October 18, 2022, 3:28pm

Hey!

Given you’re using rather old versions of Lucee / Tomcat and Java, I’d suggest first testing and then upgrading to 5.3.9

We’ve done a lot of work on Lucee since 5.2.9, your problems have probably already been solved.

z

Antoine_BAPST · October 18, 2022, 3:39pm

Zackster, thanks for the super fast response.
We certainly want to do that, but that’s a thousand templates/components/functions we have to test.
In the meantime, do you believe there is something in the configuration/settings we may consider to improve the situation?

P.S.: I’ve tried to install your Analysis application on my local machine, but I’m not sure how/if it works on an IIS setup … Debug is on but the only results I get are from Lucee’s admin panel (accessed via Tomcat port 8888). (Please redirect me to the proper help/section if needed.)

Zackster · October 18, 2022, 3:44pm

not really, all the improvements are inside the Lucee engine as opposed to settings

did you enable debugging for the web context?

Antoine_BAPST · October 18, 2022, 3:50pm

Yes I did (also a template), but I realise I probably missed a web context setup, sorry for that, will do & inform.

Zackster · October 18, 2022, 3:53pm

this is all easier in 5.3.9 too

Antoine_BAPST · November 15, 2022, 8:41am

Hey Zackster,
we’ve done a bunch of tests and 5.3.9.166 seems to be deployable without noticeable side effects.
We did run into an install/config problem due to an ‘old’ lucee-[w.x.y.z].jar version (see here) but mostly all seems fine.

We do have some questions regarding the Tomcat/Java upgrade.
The targeted servers are running Tomcat 8 generation (8.5.11 & 8.5.35) and pretty ancient JREs (1.8.0_121 / 1.8.0_192), not to mention the SQL JDBC driver …
We want to upgrade “step by step” so that we can monitor anything going wrong.
Still, do you see anything here that MUST be updated before we go live?

Finally, I wonder if the upgrade process for such an ‘ancient’ version should not be done by uninstalling Lucee as a whole and installing the latest stable packaged (Lucee/Tomcat 9/Java) version.

Your guidance is gold …
Thanks in advance.
A.

Zackster · November 15, 2022, 8:53am

I’d go with a fresh install, fewer moving parts

Antoine_BAPST · November 15, 2022, 8:56am

Thanks Zackster, you rock.
Will settings remain (like mappings, datasources, scheduled tasks) or is there a way to save/restore them afterwards? [I saw that the “upgrade from install” wasn’t recommended] …

Zackster · November 15, 2022, 9:00am

you can either copy/diff the lucee-server.xml and lucee-web.xml.cfm config files into the new installs, or use CFconfig to export / import

Antoine_BAPST · November 15, 2022, 6:44pm

Well, things didn’t go exactly as expected …
We noticed a drastic rise in response times.
Something like 2 seconds more (*) per page (vs 5.2.9.31), as if those 2 seconds were systematic, even for our probe call (the simplest CFML template you can find: just setting a variable and displaying “OK”).
Does that ring a bell?

(*) I mean: 2080 ms vs 80 ms … yes, that far …

Antoine_BAPST · November 15, 2022, 6:56pm

We’re starting to believe that the problem may come from “outside” Lucee. Boncode, perhaps?

bdw429s · November 15, 2022, 6:59pm

I’m not clear if you’re saying Lucee itself is using memory or if you’re saying IIS is. I don’t know what the commons daemon service is, but looking at your operating system’s task manager will NOT show you a useful number in regards to how much memory Java is using.

You need to get ahold of FusionReactor here. It is going to give you much better information on memory and what is actually in use. Keep in mind “allocated” heap is not the same as “used” heap!
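To make the “allocated” versus “used” distinction concrete, here is a minimal sketch using the standard `java.lang.management` API. `HeapReport` is a hypothetical class name, and run standalone it only reports the heap of its own JVM. For Lucee, the same calls would have to run inside the Tomcat JVM, or the numbers be read through a monitoring tool such as FusionReactor. In JMX terms, “committed” corresponds to what is called “allocated” above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.Locale;

// Hypothetical helper, not part of Lucee or Tomcat.
final class HeapReport {
    private static final long MB = 1024L * 1024L;

    // "used" is memory holding objects right now; "committed" is what the
    // JVM has reserved from the OS; "max" is the configured ceiling.
    static String describe(MemoryUsage u) {
        // getMax() returns -1 when no maximum is defined.
        String max = u.getMax() < 0 ? "undefined" : (u.getMax() / MB) + " MB";
        return String.format(Locale.ROOT, "used=%d MB, committed=%d MB, max=%s",
                u.getUsed() / MB, u.getCommitted() / MB, max);
    }

    public static void main(String[] args) {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        System.out.println("heap:     " + describe(bean.getHeapMemoryUsage()));
        System.out.println("non-heap: " + describe(bean.getNonHeapMemoryUsage()));
    }
}
```

Task Manager reports memory for the whole process, which includes the committed heap, so a large figure there does not by itself mean the heap is full.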

It is common for reverse proxies such as Boncode to have a 1 to 2 second delay due to networking issues. The fix is very simple however.

github.com/Bilal-S/iis2tomcat

Requests much slower than expected through IIS/AJP connector

opened 05:30PM - 07 May 20 UTC

closed 07:19PM - 07 May 20 UTC

bhartsfield

I have a Windows Server 2016 VM with IIS and a couple of instances of Lucee 5.3.…5+92 (using Commandbox). I am using the 1.0.41 connector to connect two IIS sites to their respective commandbox/Lucee instances.

What I am seeing with that setup is that going directly to the underlying commandbox instances is much faster than sending the request through IIS and the AJP connector. On average, a request directly to the underlying server is about 300ms while, on average through IIS/AJP is about 1.4 seconds.

The boncode settings are minimal.

```
<Settings>
  <Server>localhost</Server>
  <Port>8009</Port>
  <EnableRemoteAdmin>False</EnableRemoteAdmin>
  <EnableHeaderDataSupport>False</EnableHeaderDataSupport>
  <ForceSecureSession>False</ForceSecureSession>
  <AllowEmptyHeaders>False</AllowEmptyHeaders>
  <ResolveRemoteAddrFrom>HTTP_X_FORWARDED_FOR</ResolveRemoteAddrFrom>
  <PacketSize>65531</PacketSize>
</Settings>
```

As are the web.config settings.

```
<handlers>
  <add name="BonCode-Tomcat-CFC-Handler" path="*.cfc" verb="*" type="BonCodeIIS.BonCodeCallHandler" preCondition="integratedMode" />
  <add name="BonCode-Tomcat-CFM-Handler" path="*.cfm" verb="*" type="BonCodeIIS.BonCodeCallHandler" preCondition="integratedMode" />
</handlers>
```

I have zero proof that the slowness is being caused by the connector or that it is being caused by IIS but it seems to be one of them (or a combination of both). Any thoughts on how to speed this up or track down the actual cause?

Antoine_BAPST · November 15, 2022, 7:12pm

You’re a wizard … We continue to dig but you definitely nailed a large part of it (Tomcat address ::1) ! Thanks !
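The “Tomcat address ::1” remark suggests the delay was tied to which loopback address the connector and Tomcat each use, IPv6 `::1` or IPv4 `127.0.0.1`; the linked issue has the actual fix. As a rough sketch for checking how `localhost` resolves on a host, the hypothetical `LoopbackCheck` class below lists the addresses the JVM sees. The order Java returns may differ from what the .NET-based connector gets from Windows, so treat the output as a hint only.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Lucee, Tomcat or Boncode.
final class LoopbackCheck {

    static String family(InetAddress a) {
        return (a instanceof Inet6Address) ? "IPv6" : "IPv4";
    }

    // Convenience for literal addresses such as "::1" or "127.0.0.1";
    // parsing a literal does not trigger a DNS lookup.
    static String familyOf(String literal) {
        try {
            return family(InetAddress.getByName(literal));
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a resolvable address: " + literal, e);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        // Lists every address the JVM gets for "localhost" on this machine.
        for (InetAddress a : InetAddress.getAllByName("localhost")) {
            System.out.println(family(a) + "  " + a.getHostAddress()
                    + "  loopback=" + a.isLoopbackAddress());
        }
    }
}
```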

Antoine_BAPST · November 15, 2022, 7:56pm

I will not close right now as we will be monitoring tonight and tomorrow before raising the victory flag. But I can’t wait to say thanks to both of you for the incredibly fast and efficient help!
I owe you a (bunch of) beer(s) if you travel to the Paris area!

bdw429s · November 15, 2022, 10:01pm

Haven’t been to Paris since the last CFCamp… which speaking of, was just announced today for the 2023 season! (June 22/23rd) I hope to make it to Munich and perhaps so will you

Antoine_BAPST · November 16, 2022, 8:41am

… might be an option (Munich)
About our config: we went live this morning on a single server.
Things are better, we gained 1 sec. but still have hundreds of milliseconds of extra delay.
I’m now wondering if the SQL driver (the latest one embedded with 5.3.9.166) might be the bottleneck.
You mentioned FusionReactor, but I wonder how much it will weigh on a production server’s performance; is that “acceptable”?
BR

Zackster · November 16, 2022, 8:43am

that’s easy to test, call a page which doesn’t make a db request?

Antoine_BAPST · November 16, 2022, 11:46am

Without a db request we can see there’s a delta.
15 ms vs 30 ms (thanks to bdw429s’ Tomcat fix: it was 15-30 ms vs 2 sec+).

edit
Remarkably, the response time is far more variable on 9.166 [from an identical ~15 ms up to double, ~30 ms]; hard to tell with such tiny values … so many factors interfere with the response time …
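With values this small, single requests are hard to compare; timing many requests and looking at percentiles may give a steadier picture. The following is a minimal sketch, assuming a hypothetical `ProbeTimer` class and a URL passed on the command line (for example the probe template, once directly on the Tomcat port and once through IIS). It measures from the client side only, so network and client overhead are included.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper, not part of Lucee or Tomcat.
final class ProbeTimer {

    // Nearest-rank percentile, p in (0, 100]; the input array is not modified.
    static long percentile(long[] samples, int p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Time one GET, reading the whole response body, in milliseconds.
    static long timeOnceMillis(String url) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection c = (HttpURLConnection) URI.create(url).toURL().openConnection();
        c.setConnectTimeout(5000);
        c.setReadTimeout(5000);
        try (InputStream in = c.getInputStream()) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // drain the body so the full response is included in the timing
            }
        } finally {
            c.disconnect();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ProbeTimer <url> [count]");
            return;
        }
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 50;
        long[] ms = new long[n];
        for (int i = 0; i < n; i++) {
            ms[i] = timeOnceMillis(args[0]);
        }
        System.out.println("p50=" + percentile(ms, 50) + " ms, p95="
                + percentile(ms, 95) + " ms, max=" + percentile(ms, 100) + " ms");
    }
}
```

Comparing the p50 and p95 of both versions under the same conditions should show whether the extra ~15 ms is systematic or only an occasional outlier.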

Also, we notice that CPU usage seems noticeably higher (30-50% range vs 15-30%)

Zackster · November 16, 2022, 12:05pm

it all depends on your code I guess?

how does a simple file doing just <cfoutput>#now()#</cfoutput> compare?