Understanding why Lucee stopped responding

We run a pair of load-balanced AWS servers in separate data centres, running
Windows and Lucee 5.0.0.254, with our session data stored in an RDS MySQL
database. This morning both servers stopped responding at approximately the
same time. They were still responding on port 8888 and we could use the
Lucee server admin pages while our applications were not responding;
applications that don't use session management also kept working. Restarting
Lucee returned services to normal.

The only related logging we could find was Lucee’s timeout log, attached.

We noticed the same issue later this morning on a test server running
Windows 8.1 and the same version of Lucee. We have no idea what triggered
the issue in any of these cases.

I don’t think we’re doing anything special… has anyone else noticed this
behaviour? Any suggestions what may have caused the issue and what we can
do to prevent this impact if it happens again?

Thanks,
Simon

Hi Andrew,

No restart, and the availability zone doesn't appear to have changed for
ages. The connection timeout is set to one minute. The best simulation I've
managed is to switch off MySQL, but that returns the expected "Communications
link failure: The last packet sent successfully to the server was 0
milliseconds ago. The driver has not received any packets from the server."
error, and everything comes back to life when MySQL is restarted. Perhaps
it's something to do with the maximum connections (10 for us) at the time? I
notice an "auto reconnect" option for data sources, but that option doesn't
seem to be encouraged.
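For what it's worth, the MySQL Connector/J driver exposes this as connection-string properties rather than a checkbox; a sketch of what the datasource connection string could look like (the hostname is a placeholder, and note the driver's own documentation discourages autoReconnect in favour of validating connections at the pool level):

```
jdbc:mysql://sessions-db.example.rds.amazonaws.com:3306/sessions?autoReconnect=true&connectTimeout=60000
```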

Also, the timeout.log file shows records starting immediately after the
issue began, but the servers were unresponsive for a good 15 minutes until
we rebooted them. Would the fact that port 8888 responded normally suggest
that the issue started with a session hiccup, but manifested as a problem
with the BonCode connector (v1.0.28)?

Simon

Hi Simon,

Did MySQL get restarted, either by you or by the RDS maintenance window? If
you're running a multi-AZ RDS instance, it would have rebooted with a
failover to the standby instance, which lives somewhere else, so perhaps
Lucee was using a connection from the pool that could no longer reach the
RDS instance. What connection timeout do you have set against the datasource
you use for session management? If it is set to, say, 20 minutes, the pool
could simply be holding on to a dead connection.
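The dead-connection scenario above can be sketched generically: before handing out a pooled connection, run a cheap validation query, and replace the connection if it fails. A minimal illustration using Python's stdlib sqlite3 as a stand-in for MySQL (`check_out` and the DSN are hypothetical names for illustration, not Lucee's actual pool API):

```python
import sqlite3

def check_out(pooled, dsn):
    """Validate a pooled connection before use; replace it if dead.

    After a failover, a pool can hand out connections that can no
    longer reach the server. A cheap validation query ("SELECT 1")
    detects the dead connection so it can be swapped for a fresh one
    instead of hanging the request.
    """
    try:
        if pooled is not None:
            pooled.execute("SELECT 1")  # validation query
            return pooled  # still alive: reuse it
    except sqlite3.Error:
        pass  # connection is dead: fall through and reconnect
    return sqlite3.connect(dsn)  # replace the dead connection
```

A validation query costs one round trip per checkout, which is why some pools instead validate only connections that have sat idle longer than a threshold.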

Kind regards,

Andrew
about.me: http://about.me/andrew_dixon | mso: http://www.mso.net | Lucee
Association Member: http://lucee.org

(In reply to Simon Goldschmidt's message of 5 October 2016 at 02:45.)


My best guess at the cause is an invalid Connector configuration in the
server.xml file. We had specified MaxThreads="1000" with the incorrect
capitalisation (it should have been maxThreads), so the setting would not
have taken effect. If the issue was triggered by the number of concurrent
threads crossing a threshold, it stands to reason that the load would have
been shifted to the other server, which then failed the same way soon after
the first. Having corrected the configuration, we haven't seen a repeat of
this issue.
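For anyone hitting the same thing: Tomcat attribute names are case-sensitive, and a misspelled attribute such as MaxThreads has no effect, leaving the default limit (200 threads) in place. A corrected AJP connector might look like this (the port and timeout values here are illustrative, not our exact config):

```xml
<!-- server.xml: attribute names are case-sensitive, so "MaxThreads" would not apply -->
<Connector port="8009" protocol="AJP/1.3"
           maxThreads="1000"
           connectionTimeout="60000" />
```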
Simon