Intermittent long delays between running cfquery when switching datasource mid-request

OS: Windows
Java Version: 11
Tomcat Version: 9
Lucee Version: 6.2.6.19

Dear fellow Devs,

We are experiencing regular (~twice a week) occasions when our Lucee servers will see a sudden rise in blocked threads and an associated degradation of performance. The blocked threads accumulate until usually a restart of Tomcat is required to bring it back to good performance.

This began after updating from Lucee 5 up to 6.2 a few months ago, and recent updates have possibly reduced the impact of these issues in that the servers seem better able to cope with the large number of blocked threads but it still impacts the application enough to require Tomcat restarts.

In Fusion Reactor we can see that the blocked threads are all stuck at a similar point - they are blocked between having finished one cfquery against a datasource, and then running another query against a different datasource. Once the second query is run then the request will go off and complete. This shows up like the below in Fusion Reactor profiling:

100.0% - 59.848s lucee.runtime.tag.Query._doEndTag(Query.java)
100.0% - 59.848s lucee.runtime.tag.Query.executeDatasoure(Query.java)
100.0% - 59.848s lucee.runtime.db.DatasourceManagerImpl.getConnection(DatasourceManagerImpl.java)
100.0% - 59.848s lucee.runtime.config.ConfigWebImpl.getDatasourceConnectionPool(ConfigWebImpl.java)
100.0% - 59.848s lucee.runtime.config.ConfigImpl.getDatasourceConnectionPool(ConfigImpl.java)
100.0% - 50.444s Waiting on lock <0x599bf1cb> (a java.lang.String)

These are the typical settings on the datasources for connection limits:

Connection Limit: inf
Connection idle timeout (in minutes): 20
Connection timeout: inf
Max idle connections: default
Min idle connections: default

Any advice on troubleshooting or possible causes would be very gratefully received, thank you.

While someone else may readily recognize the issue and offer a solution, until then it may help to hear what kind of db (driver) you’re using. And are the two dsn’s using the same driver? Are they on the same server as each other? (That last question may be irrelevant, but worth asking.)

(And are you saying the request where it happens always hangs up or is it sporadic? If it does always happen, have you confirmed what happens if you ONLY have a query to the “second” dsn?)

Finally, while you’ve shared the fr profile, what would be more valuable is the stack trace, taken while the request is hung. It’s one of the buttons on the right of a running request in FR.

You may see it indicate that the current request is blocked by another. It will show the jvm thread id. You can then go back to the running requests list to find it: the thread ID is listed a couple columns left of the url for the request, and stack trace that.

It might help if you’d share the first few lines of those stack traces.

Hi @carehart, thanks a lot for taking time to respond.

We are using the SQL Server JDBC driver (v12.6.3). The datasources do use the same driver and they will be on the same server, though it will sometimes be a different sql server instance on that server.

The issue does happen regularly, occurring every 3 to 5 days. Once the degradation begins to show then it typically deteriorates over the next day until we need to restart Tomcat.

The stack trace on a stuck thread is below.

"ajp-nio-0:0:0:0:0:0:0:1-8009-exec-33" Id=553246 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3dabf4b5 
   java.lang.Thread.State: TIMED_WAITING
        at java.base@11.0.17/jdk.internal.misc.Unsafe.park(Native Method)
        - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3dabf4b5
        at java.base@11.0.17/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
        at java.base@11.0.17/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at java.base@11.0.17/java.util.concurrent.LinkedBlockingQueue.poll(Unknown Source)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:99)
        at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:31)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:913)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:960)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:491)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
        at java.base@11.0.17/java.lang.Thread.run(Unknown Source)

It shows as TIMED_WAITING on @3dabf4b5, when I click on the “@3dabf4b5” reference then FR show me a page titled “Filtering threads which have acquired or are waiting to acquire lock <0x3dabf4b5>” which showed around 50 threads in the same “TIMED_WAITING on @3dabf4b5” state. Those threads have recorded waiting times that go as far back as 3 days (which is when Tomcat was last restarted), with the most recent having a waited time of 30 minutes.

Perhaps this accumulation of blocked threads over time leads to the critical situation where the server becomes degraded?

Well, yes, an accumulation of blocked threads would be bad…and yet not always in an immediately apparent way, as some requests would not conflict with whatever is leading to such blocking, which is why some requests may run but not others, at least for some time.

Anyway, that @ object identifier alone (that it reports the thread being blocked on) does not itself tell us enough to know what it is or what other thread holds it.

In that case, what’s needed next is instead a thread dump, which is the point in time listing of stack traces FOR ALL threads. Then we may see what thread IS HOLDING that lock on that object. That’s usually the root cause problem. The thread holding the lock could be another request, or it may be a a thread of some other type.

And FR can help you either of two ways here. First, it offers either a means to get a complete thread dump of all threads (under resources>threads and choose “stack trace all”), or you can get a subset of that being ONLY request threads, from the “stack trace all” button offered at the top of the requests>activity page where you found the running/hanging requests.

Then, too, that output (either stack trace) can be switched between “rich” and “plain” using buttons at top-left of the stack trace. One is easier to look at but may be more challenging to share. See first if you might find the answer in the default rich view.

If not, then before you may share either here, note that it would be hundreds if not thousands of lines of text. Please attach either version as a file instead.

And the Lucee devs may even propose you should open a support ticket instead of expecting them to participate in this back and forth here. I’ll let them speak for themselves.

I’m like a member of a search and rescue team, trying to get you out of the canyon, to the paramedics who may be waiting at the ambulance or perhaps can be reached at the hospital. I might be skilled at diagnosing/treating many ailments, and can help “stop the bleeding”, but some injuries may prove to require another level of care. :slight_smile:

diagnosed, working on robust solution

https://luceeserver.atlassian.net/browse/LDEV-5967

1 Like

There you go. :slight_smile: So it seems, from the very robust diagnosis in that ticket, that you’ve been working on this matter previously, right Zac?

Is it that you’re confident you don’t need Phil’s thread dump for confirmation that it’s the same issue? Just trying to save him (and you) from needless work.

I have a bit more to say, but it’s to everyone reading along rather than to Zac, so I’ll offer it as a new reply.

1 Like

And to my previous note’s concluding analogy, the details in that ticket show the critical need sometimes (and the great value) of a specialist surgeon.

And more than just surgery, that ticket indeed reads like heart surgery research…with the difference being that it’s a surgical procedure which would then affect everyone who might update to “that version of the heart”, to stretch the analogy. :slight_smile:

<PSA>
As such it gives a glimpse into the very tough and important role performed by Zac and others on the team.

Those who use Lucee (especially in production) should really take this to heart and give financial support to the Lucee project. This stuff doesn’t “just happen”, and there are not a thousand magical elves keeping things running!

The same ia true of most projects/products we use, of course. With open source projects, at least you can have clear insight into the process–if you do look. Like in a hospital, some scenes are not for the squeamish.

Unlike a hospital, there’s no national health care service nor insurance companies nor even societies providing financial support “for the public good”.

</PSA>

1 Like

Thank you very much @carehart - I’ll get the full thread stack trace the next time we experience the issue, even if it is just for my own learning considering Zac’s post.

I couldn’t agree more with your PSA.

And thank you @Zackster, I really appreciate the update and the link to the issue.

1 Like

Can you try out 6.2.8.9-SNAPSHOT and let me know how you go?

@carehart thanks for the kind words mate!

Hi @Zackster, unfortunately we wouldn’t be able to deploy a snapshot to Production, and we don’t see the issue in our non-prod environments which we assume is just due to the very low volume of traffic there.