onError() and long requests - fails to get stack

We’re got Lucee 5.3.6, on Tomcat 9 and Open JDK 11 all working on Linux.

Sometimes we get a request timeout - the front end web server (Nginx) logs a 500, and our Application.cfc’s onError() is called.

This does two things - logs the message too application.log and sends an email with the full details.

Recently, we’re seeing 500s in the access log, and an entry in application.log, but no email is sent.

I was able to catch it in the act today and exception.log contains a stack like this :

request /api/index.cfm (/......../api/index.cfm) has run into a timeout (timeout: 50 seconds) and has been stopped. The thread did start 50240ms ago.
        at java.base/java.lang.Thread.getStackTrace(Thread.java:1606)
        at lucee.runtime.CFMLFactoryImpl.createRequestTimeoutException(CFMLFactoryImpl.java:610)
        at lucee.runtime.util.PageContextUtil.checkRequestTimeout(PageContextUtil.java:210)
        at lucee.runtime.PageContextImpl._doInclude(PageContextImpl.java:835)
        at lucee.runtime.PageContextImpl.doInclude(PageContextImpl.java:818)
        at application_cfc$cf.udfCall1(/Application.cfc:344)
...

That line in Application.cfc is where it does include error.cfm

function onError(required any Exception, required string EventName) {
    writelog(text='Error : #arguments.exception.message#',type="error");
    include 'website/error.cfm';

We used to get these emails e.g. onError’s run time continued after a request timeout and sent an email. That was on a very similar stack but Lucee 5.2.9

Could something have changed ? How can I be sure to log all the details of a timeout error ? Are there some things you shouldn’t do in onError ?

there was a change to the timeout handling for onError in 5.3.6.15, but you should have that in 5.3.6

https://luceeserver.atlassian.net/browse/LDEV-2756

Yup, we’re on a later 5.2.6 than .15

Does it happen with 5.3.8-RC2?

I’ll try and find out, but because our application is Dockerised, and there are a few intermediate (internal) layers, it’s time consuming.

I’ve only had medium luck with a local test case inside our app - it seems this.requestTimeout only effects the Taffy reset/start up, not subsequent randompage.cfm files; but that might have been Friday afternoon human error …

I was not able to get a reliable test.

I could make hundreds of CFHTTP calls rapidly. I could run multiple of those (test pages) at once. No issues.
Wait “some few minutes” and reload one single test page, one of the calls would fail.

The average call time over ~175 CFHTTP calls is only ~15ms which is about as quick as you can do a meaningful database select and return JSON !

To fix the error reporting we did this trick to extend the timeout during onError : dans.blog: Forcing your Application.onError event to run to completion in ColdFusion

We spent ages faffing, but what seems (no errors for a week vs several per day) to cure the underlying issue (as opposed to the error reporting) was either

  • disable keep-alive on the destination host (‘keepalive_timeout 0’ in Nginix)
  • remove DNS from the CFHTTP call. As it happens, we can just send all our calls to “http://localhost” and put in a “Host” header for the site we want.