Server possibly crashing due to too many threads?

Okay, we can definitely test it and see if it makes a difference.

Some questions:

  • What would be the best practice for correctly stopping threads that have run for too long though?
  • How long will a thread run (if it has a timeout of 6000 seconds) before Lucee says it needs to be killed?
  • How will that look to the user?
  • Good question. Micha’s response in the past has been that you should tune the slow parts of your app so they’re not slow. Obviously! :slight_smile: I would recommend putting a timeout on any external operations such as HTTP or cfquery calls so they don’t get hung, because hung external calls are very hard to kill (see the sketch right after this list). If a thread is processing CFML code but is just slow or stuck in a big loop, you can kill it manually via FR, which tries the interrupt method first and only falls back to the sledgehammer approach behind several warning confirmations. I would start by finding out what is running long and why. FusionReactor makes this VERY easy with the “longest running requests” and “slow requests” pages combined with its amazing request profiling.
  • It would take 6,000 seconds (a little over an hour and a half), or am I misunderstanding what seems like a really simple question? I would hope you have no pages in your app taking 1.5 hours to complete, though!
  • Just like any page that takes a while to complete. Again, not sure I understand the question. A page that takes 1 second, 10 seconds, 30 seconds, or 5 days works exactly the same way: the browser sits there until it hears back from the server! Well, actually your proxy server or browser will likely time out before 5 days, but you get the picture.
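
To make the timeout point above concrete, here is a minimal CFML sketch; the URL, datasource name, and timeout values are placeholders, so adapt them to your app:

<!--- HTTP call that gives up after 10 seconds instead of hanging the request thread --->
<cfhttp url="https://example.com/api/status" method="get" timeout="10" result="httpResult" />

<!--- Query the driver cancels if it runs longer than 30 seconds (support depends on the JDBC driver) --->
<cfquery name="qOrders" datasource="myDSN" timeout="30">
    SELECT id, total FROM orders
</cfquery>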

Okay, I see. I just wasn’t sure if there was some behind the scenes handling of request death for Lucee if the request timeout was so high.

We do use timeouts on our queries and http calls, so that should be fine. Generally the app is pretty tuned, so we shouldn’t have any issues if this works.

If it does work, we can just keep an eye on FusionReactor for stuff that is taking too long and adjust it accordingly.

We’re going to update two servers and set a request timeout of 1 hour, and see if it makes any difference in the total number of threads being created on those servers. I’ll let you know how it goes after we observe it for a day or so! Thanks!

The way Adobe at least used to do it was that each thread was responsible for asking whether it had timed out, which of course does no good if a thread is busy waiting on an external IO task. Lucee has a Controller thread that runs 24/7 and monitors all running threads like big brother. If it sees a thread that has outlived its timeout, Lucee murders it, I believe using Thread.stop() or similar. There is no more or less overhead depending on your request timeout; it’s just a matter of how long big brother lets you live before the hammer falls.
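
If you want to watch that happen on a dev box, a throwaway page like this (the values are arbitrary) should get whacked by the Controller shortly after its timeout elapses and surface as a request timeout error; the Controller only checks periodically, so the kill can lag the timeout a little:

<!--- Illustrative only: a deliberately slow page with a short timeout --->
<cfsetting requestTimeout="5">
<!--- 30 seconds of "work"; big brother should step in soon after 5 seconds --->
<cfset sleep( 30000 )>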

Also, get all your servers on the latest stable. I think there were some stability issues in earlier versions of 5.2 which may not be affecting you, but it’s best not to have too many factors at play.

Your suggestion seems to have worked. We rolled the change out to a couple of production servers last week and watched it, then slowly rolled it out to others. The updated servers haven’t had any more crashing issues, so it seems to be working! Thanks for the tip!

We went ahead and updated to the latest release as well.

@Jason.Weible Ok, so now that you’ve found that Lucee’s request timeout does seem to be creating zombie threads from the web connector, here is another step you can try. This is a sort-of hidden feature in Lucee that you can enable via an environment variable or Java system prop.

lucee.async.request.handle=true

It will run every HTTP request in a separate async thread that is fired off from the original HTTP thread. The benefit is that the request timeout can hammer threads all day long, but it won’t ever hammer the web connector’s threads directly and won’t turn them into zombies. The downside is that it really screws up monitoring tools like FusionReactor, since they only look at the main HTTP thread: you’ll stop getting stats on JDBC connections, and your stack traces will all just show threads simply waiting for the “real” thread to complete. If you really want to have your request timeouts back and you’re willing to live with how this setting affects monitoring tools, you can use it.
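
If you do try it, one quick way to sanity-check that the property actually reached the JVM is to read it back from a scratch CFML page (purely illustrative):

<cfset sys = createObject( "java", "java.lang.System" )>
<cfoutput>lucee.async.request.handle = #sys.getProperty( "lucee.async.request.handle", "not set" )#</cfoutput>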

I would personally recommend keeping your request timeouts at bay and tuning any slow pages, but I wanted to make sure you were aware of this option. Plus, I was waiting to talk about it until we knew whether it would apply to you.

Okay, I’ll keep it in mind. We rely pretty heavily on FusionReactor to tune stuff, so that might actually do more harm than good. :slight_smile:

Okay, so we’re not completely out of the woods just yet. We’ve had one server crash on each of the past two days. Same symptoms as before: a high number of ajp-nio-8009-exec-XXX threads showing up in FusionReactor after Tomcat becomes unresponsive. Things are generally better overall, though!

We previously set the request timeout to something ridiculous (like an hour) in the Lucee Server Administrator. We do have a few places in the code where we were overriding the previous (20 second) timeout via cfsetting. Should we eliminate those as well? I’d imagine they function the exact same way as the admin setting.
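
For reference, the inline overrides I mean look something like this (the value here is illustrative):

<!--- Replaces the admin request timeout for this request only (value in seconds) --->
<cfsetting requestTimeout="3600">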

We also have places where we utilize cflock. Could that be contributing? I wasn’t sure whether it behaves in the same manner or not.
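
To be concrete about the kind of usage I mean (the lock name and timeout here are made up), it’s the standard pattern of a named lock with an explicit timeout:

<!--- Threads waiting on this lock give up and throw after 10 seconds rather than queuing indefinitely --->
<cflock name="cacheRefresh" type="exclusive" timeout="10" throwOnTimeout="true">
    <!--- critical section --->
</cflock>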

Just for fun, and in case it helps, here’s what our thread state graph looks like from FusionReactor. The crash occurred right at the spike ~8:15am. The graph from the day before on the other server looks very similar.

Also, how should our Boncode connector definition in the Lucee/Tomcat/conf/server.xml file look? Right now it’s pretty simple, but I see there are some parameters you can add around request timeouts there as well, which I don’t think we’ve ever had to do in the past with Railo.

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" packetSize="32768" />

Still seeing some intermittent crashes which seem to be thread related. This one happened today.

Same symptoms as the one from last week.

  • Good for a while
  • Threads slowly rise normally with load
  • Something happens to spike TIMED_WAITING threads
  • Tomcat becomes unresponsive, all threads time out and get stuck in WAITING
  • No new threads can be created without a restart of Lucee/Tomcat

Did you ever determine the cause? I have a new Linux instance running on Amazon with Lucee 5 and MySQL. Lucee seems to stop responding daily. Other non-CFM pages respond. The process appears to still be running but responds to web requests with “service unavailable”. I execute a restart via the command line and it restarts quickly.

No, we still haven’t been able to solve it. It definitely seems to be load related. Summer is a slow season for us, and we’re not having the issue now.

In addition, it doesn’t seem to happen on Windows Server 2008, and I think Windows Server 2012 was also okay. It started happening with Windows Server 2016.

I have the same issue as @awebster right now. Any updates on this issue?

Not completely. I increased the memory and it stopped happening daily. I have some WordPress sites running MySQL on the same server, and every so often Lucee crashes and records an out-of-memory error. I believe the issue is those WordPress sites getting hammered by bots and running up memory usage so Lucee cannot get what it needs at a certain point. I think the added memory helped but is not the end solution. Lucee now stops 1-2 times a month.

If you install my performance analyzer extension, there’s a thread report which will let you know what state the threads are in.

FusionReactor monitoring is paid. Are there any free alternatives?

I have a t3.medium with 4GB of RAM, with Lucee set to min. 1GB, max. 2GB. Is that sensible?

Thanks, @awebster and @Zackster. :pray:t3:

(Tomcat crashed yesterday afternoon with around 400 users on the site. Same today around lunchtime. Would be useful to be able to see what is actually happening as there are no ‘CFERROR’s generated. Apache and all other services seem stable.)

I am running on a t2.large which has 8 GB of ram.

Which version of Lucee are you both running?

TBH I wouldn’t be so concerned about how much RAM; the question is why you’re seeing what I assume are stuck threads.

I’d recommend trying out 5.3.8.170, as 5.3.8 is about to go stable and it’s super solid.

There have been some additional fixes since 5.3.8 RC1 which address some memory issues with sessions.

Graphs are pretty but don’t tell you much. This is what the thread report looks like in my Free Performance Analyzer plugin; you can sort the table by clicking on the headers, and the stack traces give useful clues about what’s causing problems.

That is why we moved all of our WP sites behind Cloudflare. It helps stop a lot of junk traffic.

Just a big thank you to Zac for the performance tool.

Facing memory issues again so going to try to use this to explore why contexts can’t handle Lucee using up to only 1GB of memory off-peak.

Does anyone know if this still works in Lucee?
If so, where do I set this variable? I am running a default Lucee install (lucee-5.4.3.15) on an Ubuntu server.

Thanks for any help.