Our setup is a Lucee cluster with a multi-AZ Aurora (MySQL) database. Whenever the Aurora cluster fails over, which has happened 3 times over the last few months, our datasource connections go stale but do not fail or close. As a result, Lucee requests time out and our systems become unresponsive until we reboot the servers.
Our datasources use a “Connection timeout” of 1 minute and “Auto reconnect” set to true.
Does anyone have any tips for a successful strategy to reconnect stale database connections in this situation?
OS: Windows Server 2016
Java Version: 1.8
Tomcat Version: 8.5
Lucee Version: 5.2.5
Brad Wood talked just yesterday on the MOD podcast about having found this, and he raised the issue here: https://luceeserver.atlassian.net/browse/LDEV-3124 (at least one fix is being tested). While he raised it regarding MSSQL, it seems part of the issue is generic to any JDBC call. I’ll leave you to look into that, or perhaps he or others may offer more here.
Ah, and there was another related issue, and news of them being rolled into a 5.3.8 release. See Brad’s tweet from Monday:
Thanks Charlie… looks like Lucee 5.3.8 will address this issue. In the meantime, we have added a CloudWatch alarm that calls a page with the following line if it detects a number of unhealthy hosts:
It may bring services back if we run into this issue again and shouldn’t do any harm.
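For anyone wanting to detect the stale connections themselves, here is a minimal Java sketch of a time-bounded validation check. It is not the line used on our alarm page and not how Lucee validates connections internally. The class and method names (`StaleConnectionProbe`, `isHealthy`, `isConnectionHealthy`) are made up for illustration; the only real API it relies on is the standard JDBC `Connection.isValid(int)`. The outer timeout is there on the assumption that a check against a stale connection might itself hang rather than return quickly.

```java
import java.sql.Connection;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: all names here are illustrative, not part of Lucee or any driver.
final class StaleConnectionProbe {

    private StaleConnectionProbe() {}

    // Runs the check on a worker thread and gives up after timeoutMillis,
    // so a check that blocks on a dead socket cannot block the caller.
    static boolean isHealthy(Callable<Boolean> check, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "connection-probe");
            t.setDaemon(true); // never keep the JVM alive for a stuck probe
            return t;
        });
        Future<Boolean> result = worker.submit(check);
        try {
            return Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Exception e) {
            // Timed out, or the check itself threw: treat the connection as unhealthy.
            return false;
        } finally {
            result.cancel(true);
            worker.shutdownNow();
        }
    }

    // Validates a JDBC connection using the standard Connection.isValid(int seconds),
    // wrapped in the outer timeout above as a second line of defence.
    static boolean isConnectionHealthy(Connection connection, int timeoutSeconds) {
        return isHealthy(() -> connection.isValid(timeoutSeconds),
                timeoutSeconds * 1000L + 500L);
    }
}
```

A caller that gets `false` back would then discard that connection and open a new one. How to do that against Lucee's own pool depends on the Lucee version, so that part is left out here.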