Active datasource connections spiking

Hi all, I’ve started using Lucee in place of Adobe ColdFusion recently, and it’s been working well, except that every so often one or more datasources’ active connections starts to skyrocket. This happens on both an MSSQL and an Oracle database connection.

The default settings allow unlimited connections, which resulted in over 1700 active connections to one datasource. That number never decreased, and it eventually caused an out-of-memory error that killed Lucee.

I have since tried setting the connection limit to 100, which was respected, but because the connections still never dropped, new requests would hang until one became available or a page timeout occurred. None ever became available, and it took a restart of Tomcat to recover.

No other database connection properties were changed before this issue started occurring. I’ve been experimenting with setting limits and timeouts but to no avail.

The cause of this active connection ballooning is unknown; I haven’t found anything obvious yet.

There are many variables, so in case it helps: I’m running Tomcat 9.0.107 and Lucee 6.2.1.122.

I hope someone can suggest likely settings to stop this from occurring, but obviously that depends on the cause!

EDIT: Just some additional info - there were approximately 700 hits in the 20 minutes before the datasource’s active connection count maxed out at 100. These queries are basic and should release their connections quickly enough that this much traffic over 20 minutes should not cause issues.

Cheers, Dan

An unexpectedly large amount of traffic would be my first guess. So how are you measuring that count of “approximately 700 hits in the 20 minutes before the datasource’s active connection count maxed out at 100”? That may be under-counting.

If you’re relying on Google Analytics or the like, beware that those don’t track visits by agents/clients that don’t execute JavaScript.

And if you’re adding code to track visits, beware of adding it only to the app(s) that you presume “get any significant traffic”. Any or all of your sites could be being pounded by spiders, bots, hackers, and more, especially AI bots, which have become both pernicious and numerous.

You can measure traffic many other ways, from web server access logs to Tomcat access logs to FusionReactor and more. If you compared even just the SIZE of such log files from before the trouble to now, you might spot a clear increase. But even if there’s none, it could be a spike specifically in calls to CFML pages WITHOUT the static files that normal browsers also request when loading such pages.
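If you’d rather have an actual count than file sizes, even a few lines of CFML against the log would do. Here’s a rough sketch; the log path is hypothetical, and rotated or compressed logs would need more handling:

```cfml
// Rough sketch: count requests for CFML templates in an nginx access log.
// The path is a placeholder; adjust to your setup.
logLines = fileRead( "/var/log/nginx/access.log" ).listToArray( chr(10) );
cfmlHits = logLines.filter( function( line ) {
    return line.findNoCase( ".cfm" ) > 0;
} ).len();
writeOutput( cfmlHits & " CFML requests in this log file" );
```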

I appreciate that you instead think this is some Lucee bug, or that a config tweak is needed. Or you may contend “this didn’t happen with CF”, but a burst of automated traffic could simply be new since your cutover. I’ve seen that before.

Or perhaps someone else will offer you a better, more targeted solution.

In my case, we’ve got nginx logs. I suspect the actual load is minimal (35 requests per minute is a few users clicking around forms, which is fairly typical for us). My concern is that connections aren’t being closed when they otherwise would be. In the migration from Adobe ColdFusion to Lucee, the hosts also changed, but as far as I know the rest of the infrastructure remains the same.

I would love to narrow down why this happens and why connections aren’t closed. Out of curiosity, I checked the database itself to determine the number of connections. At the time Lucee reported 100 active connections, the database behind that datasource showed no more than two! This makes me think something on the Lucee side is misbehaving.
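For anyone wanting to run the same spot-check on the MSSQL side, a query along these lines works. The program_name filter is an assumption about what your JDBC driver reports, so adjust it to what you actually see:

```cfml
// Spot-check SQL Server's view of open sessions. Note this query itself
// borrows one pooled connection, so expect it to be included in the count.
q = queryExecute( "
    SELECT COUNT(*) AS sessionCount
    FROM sys.dm_exec_sessions
    WHERE is_user_process = 1
      AND program_name LIKE :prog
", { prog: "%Microsoft JDBC Driver%" }, { datasource: "myDS" } );
writeOutput( "DB-side sessions: " & q.sessionCount );
```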

I’ll keep an eye on my tweaked config - today I’ve enabled “validate before use” on the datasource and set a socket timeout of 300000ms (5 minutes). I’ll report back if the issue still occurs. Cheers.
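For reference, if you define datasources in Application.cfc rather than in the admin, those settings look roughly like this. This is a sketch, not my actual config: the driver class, URL, and credentials are placeholders, and the key names are worth double-checking against the Lucee docs for your version:

```cfml
// Application.cfc -- sketch of a capped, validated MSSQL datasource in Lucee.
// All names and values below are placeholders.
component {
    this.name = "myApp";
    this.datasources[ "myDS" ] = {
        class: "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        // socketTimeout (ms) is a Microsoft JDBC driver URL property
        connectionString: "jdbc:sqlserver://dbhost:1433;databaseName=mydb;socketTimeout=300000",
        username: "appUser",
        password: "secret",
        connectionLimit: 100,  // cap the pool instead of the unlimited default
        connectionTimeout: 1,  // minutes idle before Lucee closes a pooled connection
        validate: true         // "validate before use": test each connection on checkout
    };
}
```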

I get that you think it’s a Lucee issue (as I noted). While you wait for others to affirm or better diagnose that, I don’t think you’re out of diagnostics to consider yet.

First, you’ve now said you “suspect the actual load is minimal”. Will you be checking those nginx logs? If so, do make sure that ALL sites are being logged. I’ve seen that catch people out.

And even your confirmation of “no more than two connections!” in the db may not be the proof it seems; again, it depends on how you’re determining that.

Moving on, note that FusionReactor has a free 14-day trial. There’s also the now open-source SeeFusion, which should suffice for the simple count of requests over time we’re after (and more). Either should take only minutes to implement, and should add virtually no overhead.

BTW, it could also be cfthreads that are making the db connections. If you use those in your code, perhaps something about HOW your code runs on Lucee (for the same incoming load) is triggering them to behave unexpectedly. I don’t think SeeFusion tracks cfthreads distinctly, but FusionReactor does (as “transactions” rather than requests).
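To illustrate the cfthread point with purely hypothetical code: a thread keeps holding its pooled connection even after the parent request has returned, so a hung thread can quietly pin a connection:

```cfml
// Hypothetical example: if the query in this thread hangs (say on a DB-side
// lock), the parent request still completes, but the thread keeps holding a
// pooled connection until it finishes -- and it won't appear as a "request"
// in most monitors.
thread name="bgWork" {
    queryExecute( "SELECT something FROM someBigTable", {}, { datasource: "myDS" } );
}
// the parent request returns here while bgWork may still be running
```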

Finally, there’s also Zac’s Lucee Performance Analyzer (which analyzes Lucee debug logs). I’ve not seen statements on its potential overhead, but since you think you have little load, it may well prove beneficial for you beyond this one challenge. (I don’t see that it reports on cfthreads.) Perhaps @Zackster can clarify.

Again, I realize that because things worked on CF but now suffer, it would seem this “must be a Lucee bug or config issue”. We’ll see if someone else spots that for you, with a solution or workaround.

Until then, I just don’t think your only option is to start fiddling with knobs, though I appreciate that temptation. 🙂


Which database driver(s) are you using? Are you using threads, ORM, and/or cftransactions?

We have had similar issues over several versions. See PostgreSQL DB connections not being released with 42.7.5 JDBC driver. We found some of it was due to database-related code in long-running requests or in threads that ended up with hung/unclosed connections, some due to a specific JDBC driver for PostgreSQL, and some due to external long queries / materialized view updates creating long locks / timeouts in Lucee code.

Between upgrading to the latest release candidate, updating our PostgreSQL JDBC driver to the latest version, and some refactoring of our code, we’ve eliminated 99.99% of the issue. In addition to what Charlie said, looking at these areas might point you to the root cause. Unless tens or hundreds of thousands of visitors are hitting your site in the exact same second or minute, traffic is unlikely to be the source. Most likely some code is causing abandoned or hung connections.
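As one generic illustration (made-up SQL and datasource name, not our actual code): giving queries an explicit timeout keeps an external lock, like one from a materialized view refresh, from pinning a pooled connection indefinitely:

```cfml
// Generic sketch: cap how long a query may run so an external lock can't
// hold the pooled connection forever. The timeout option is in seconds.
try {
    result = queryExecute(
        "SELECT * FROM reporting_view",
        {},
        { datasource: "myDS", timeout: 30 }
    );
}
catch ( database e ) {
    // log and handle the timeout instead of letting the request hang
    writeLog( text = "query timed out: " & e.message, type = "error" );
}
```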


You may well be onto something for Dan to check into, @psarin.

But as for your concluding statement, it would not be so unusual for someone to find they’ve been blasted by tens of thousands of requests per minute. And I mean just Lucee (or CF) requests, not necessarily also the static content served by the web server.

To that last point, Dan never replied to the suggestion that his nginx logs could prove or disprove my contention. Even better would be a tool tracking Lucee requests specifically. It seems at least worth ruling out.

That should at least be easier than all the other diagnostics and workarounds, though those will be valuable if traffic turns out not to be the cause, of course.

A critical point: running an older version is always a battle with tech debt. 6.2 is in maintenance mode and only getting bug fixes.

If you are running an older version and are having problems, upgrade; you can review the changelog too:

https://download.lucee.org/changelog/?version=6.2