Resource leak in DatasourceConnectionPool 5.3.10

We recently updated some apps from 5.3.8.201 to 5.3.10.97. After updating, our apps no longer recover from interrupted database connection. A restart is required to get the app to reconnect to the (MSSQL) database.

Stack traces show many threads waiting for on a lock in lucee.runtime.db.DatasourceConnectionPool.getDatasourceConnection(DatasourceConnectionPool.java:95) . Logs may also show [java.lang.ThreadDeath] but not always.

Looking into the lucee db pool code, it seems when getDatasourceConnection() decides to create a new connection, it first calls _inc() to increment the connection count. If loadDatasourceConnection() throws a page error, _dec() is called. But if the thread is killed by a timeout it seems that _dec() never gets called and so we lose a connection slot. If enough connection slots are lost lucee will not be able to recover when the db comes back.

There is some cleanup code that runs periodically lucee.runtime.db.DCStack.clear() but it can only clean up connections that were previously opened successfully and added to the pool. Also, stack traces often show clear() blocked, it uses the same lock as getDatasourceConnection() .

I was able to fix this issue by patching getDatasourceConnection() so that it only calls _inc() after the connection is successfully opened. This is not a perfect solution since it could allow the ConnectionLimit to be temporarily exceeded.

If lucee 6 is going to use a new connection pool? If so this issue may be moot. Otherwise I can submit a PR if this is found to be a valid bug.

Here is a test script I used to replicate the issue.

<cfcontent type="text/plain" reset="true">
<!--- 
This is a test script that checks if lucee can recover from the db going down.
to test:

1. Set #Application.DSN# to a working MSSQL data source name.  
2. open this page and check that "result = db connection is ok" for all threads
3. block port 1433 with a firewall to simulate loss of db 
4. refresh the page and all threads should show a timeout or db error after 15 seconds or so
5. remove the firewall port block, refresh the page and all threads should show ok again. 

You may need to disable any queries in your Application.cfc for this test to work
--->
<cffunction name="checkDatabase" access="private" output="false" returntype = "boolean">
    <cfargument name="dsnToUse" type="string">
    <cfset dbWorking = false >
    <cftry>
        <cfquery name="databaseCheck" datasource="#arguments.dsnToUse#" timeout="2">
            SELECT 1
        </cfquery>
        <cfset dbWorking = databaseCheck.RecordCount EQ 1>
        <cfcatch>
            <!--- failure of some sort, don't attempt to log here --->
        </cfcatch>
    </cftry>
    <cfreturn dbWorking>
</cffunction>

<cfscript>
startTime2 = Now().getTime()
results = []
//num threads should be set to  2x the max connections of the datasource
variables.numThreads=20

for (i=1; i <= variables.numThreads; i++) {
     results[i] = 'timed out'
}

//start the threads
for (i=1; i <= variables.numThreads; i++) {
    writeoutput("testThread#i# starting" & Chr(13) & Chr(10));
    thread action="run" name="testThread#i#" threadNum=i results=results{
       /* this should stop the thread after 5 seconds... */
       /* requestTimeout=5 will prevent the threadDeath error but lucee still will not reconect... */
        setting requestTimeout="5";
        sleep(100);
        if( checkDatabase("#Application.DSN#"))
        {
            arguments.results[arguments.threadNum] = "db connection is ok"
        }
        else
        {
            arguments.results[arguments.threadNum] = "db connection error"
        }
    }
}
sleep(1000);
cfflush
for (i=1; i <= variables.numThreads; i++) {
    //this will kill kill the threads after 15 seconds, if they are not done
    thread action="join " name="testThread#i#" timeout="15000"{}
}
for (i=1; i <= variables.numThreads; i++) {
    writeoutput("testThread#i# result =  #results[i]#"& Chr(13) & Chr(10))  
}
elapsed = Now().getTime()-startTime2
writeoutput("Thread tests done in #elapsed#ms"& Chr(13) & Chr(10))
</cfscript>

3 Likes

nice bug report and investigation!

yes, lucee 6.0 uses apache commons pool2, but 5.3 is still being maintained

I created a ticket linking back here [LDEV-4358] - Lucee