503 Service Unavailable Request Errors

Hello there,

I installed and tested Lucee on an AlmaLinux VPS back in January, and migrated my customer websites over to this new production server. Everything was working great until last week. The Apache web server started throwing 503 errors for all website page requests… so I rebooted and all was fine for a day or so, then same issue. Nothing of any substance in the error logs (that I could find). I’m no Lucee or Tomcat expert, but could there be an issue with the two communicating? i.e., Tomcat times out or doesn’t respond, and Apache throws a 503?

I’m at a loss, since everything worked so well for months… and my customers are less than happy :slight_smile:

I’m running AlmaLinux 8.1 (Linux 4.18.0), Apache 2.4.37 and Lucee 5.4.4.38 (Tomcat 9.0.80, Java 11.0.20.1).

For what it’s worth, this issue seemed to appear after I did an Apache security update… not sure if that is coincidence :slight_smile:

Any guidance is greatly appreciated.

Cheers… Dino

Digging a bit deeper, I did find these in the apache error logs for several customer websites:

[Wed Jul 24 14:48:04.837683 2024] [proxy:error] [pid 102486:tid 139624112465664] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:8888 (*) failed
[Wed Jul 24 14:48:04.837728 2024] [proxy_http:error] [pid 102486:tid 139624112465664] [client XX.X.XXX.XX:53686] AH01114: HTTP: failed to make connection to backend: 127.0.0.1

Looks like Tomcat was not responding. Unfortunately I was out of town when these issues happened, and was forced to reboot remotely from my phone. But I’m back home today, and will try to troubleshoot a bit more when it happens again.
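Next time it happens, rather than rebooting blind, I’ll try to confirm whether anything is actually listening on the Tomcat port. Something like this (commands from memory, so adjust for your own setup):

# is anything listening on the Tomcat HTTP connector port (8888)?
ss -ltnp | grep 8888

# does the backend answer at all, or is it "connection refused"?
curl -sv --max-time 5 -o /dev/null http://127.0.0.1:8888/

# is the Tomcat/Lucee java process even alive?
ps -ef | grep [j]ava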

Any thoughts are welcome.

Cheers… Dino

Now that I’m back in the office, I’ve been digging through the Catalina logs, and found these entries:

24-Jul-2024 10:50:38.386 INFO [Thread-4] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8888"]
24-Jul-2024 10:50:38.389 INFO [Thread-4] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["ajp-nio-127.0.0.1-8009"]
24-Jul-2024 10:50:38.390 INFO [Thread-4] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
24-Jul-2024 10:50:43.638 WARNING [Thread-4] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [Thread-39957] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
java.base@11.0.20.1/java.lang.Object.wait(Native Method)
lucee.commons.io.retirement.RetireOutputStreamFactory$RetireThread.run(RetireOutputStreamFactory.java:73)
24-Jul-2024 10:50:43.680 INFO [Thread-4] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["http-nio-8888"]
24-Jul-2024 10:50:43.688 INFO [Thread-4] org.apache.coyote.AbstractProtocol.stop Stopping ProtocolHandler ["ajp-nio-127.0.0.1-8009"]
24-Jul-2024 10:50:43.704 INFO [Thread-4] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["http-nio-8888"]
24-Jul-2024 10:50:43.705 INFO [Thread-4] org.apache.coyote.AbstractProtocol.destroy Destroying ProtocolHandler ["ajp-nio-127.0.0.1-8009"]

Looks like Catalina is shutting down due to a runaway thread. So I need to figure out what this means, and how to prevent it: “The web application [ROOT] appears to have started a thread named [Thread-39957] but has failed to stop it.”

Any ideas?

Cheers… Dino

Dino, I’m afraid I don’t have a ready solution for your problem, though we should be able to get to one–and I have one guess for you, based on what you’ve shared. More on that in a moment.

1] First, I do want to correct what seems a misconception from your last comment (as you struggle to find SOME explanation for the problem). Those log lines don’t reflect “a runaway thread”, nor do they imply Lucee is necessarily “shutting down” on its own.

What I’m saying is that those lines (about possible “memory leaks”) appear pretty much WHENEVER one shuts down Lucee…and even CF. To prove the point, I literally just started and ended a Lucee service I have installed, and got those very lines:

25-Jul-2024 11:50:59.903 INFO [Thread-33] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["http-nio-8888"]
25-Jul-2024 11:51:00.703 INFO [Thread-33] org.apache.coyote.AbstractProtocol.pause Pausing ProtocolHandler ["ajp-nio-127.0.0.1-8029"]
25-Jul-2024 11:51:00.710 INFO [Thread-33] org.apache.catalina.core.StandardService.stopInternal Stopping service [Catalina]
25-Jul-2024 11:51:05.887 WARNING [Thread-33] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixResolver-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:

Note that was ME manually stopping the service…one which I had just started. And not a single request had been run against it. (I realize someone else may chime in and say, “no, I NEVER see such lines in my catalina logs”.) I’m simply saying I find them to be VERY common (and on CF just as readily as Lucee).

And FWIW they’re triggered by a listener that Tomcat enables by default in the server.xml, called JreMemoryLeakPreventionListener. It seems that when a shutdown starts and the listener immediately detects a thread it couldn’t stop, it reports this…but since Tomcat is shutting down anyway, it’s a red herring.
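(If you’re curious, you can see that listener in the stock server.xml. A quick check, assuming the default installer layout under /opt/lucee; your path may differ:)

# the listener ships enabled in Tomcat's default server.xml
grep -n "JreMemoryLeakPreventionListener" /opt/lucee/tomcat/conf/server.xml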

If it’s of any value, you can find log lines in other files reflecting that the service is being stopped (and I’d think manually, rather than by some automatic “shutdown”). For example, my commons-daemon log for today (in the same folder as the catalina logs), commons-daemon.2024-07-25.log, reports:

[2024-07-25 11:50:59] [info]  [10732] Stopping service...
[2024-07-25 11:51:06] [info]  [10732] Service stop thread completed.
[2024-07-25 11:51:07] [info]  [27424] Run service finished.
[2024-07-25 11:51:08] [info]  [27424] Apache Commons Daemon procrun finished.

2] BTW, if you feel confident that YOU are not (and no one is) manually bringing Lucee down, here’s one possibility that may not be reflected in the Lucee/Tomcat logs: since you’re running on Linux, it COULD be the Linux “out of memory killer” mechanism detecting your box running low on memory…in which case it (the OS) chooses to terminate the process using the most memory (whether or not that process is the true CAUSE of the OS memory limit being reached).

You can search to find more on that. But as for finding the Linux logs that will help you know if that IS happening, check out this SO discussion.
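For instance (a rough sketch; exact log locations vary by distro), on an RHEL-family box like yours you could look for OOM-killer activity with:

# kernel messages: look for oom-killer kills
dmesg -T | grep -i "killed process"
journalctl -k | grep -i -e oom -e "killed process"

# or the traditional syslog file on RHEL-family systems
sudo grep -i "out of memory" /var/log/messages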

Maybe that’s what’s happening. And in that case of course the next question would be “why is memory on the box reaching the limit”. And it could be Lucee contributing to it, or something else (some people run a db on the same machine as Lucee or CF, and that could be the main contributor).

And this OOM killer possibility raises further questions: how much memory is on the machine? Might it be a VM? Or a container? Or a CommandBox server? And how much heap is Lucee/Tomcat configured to use, etc.? As I’ll discuss next, some monitoring tools can help you get to the bottom of the USE of memory within Lucee/CF.
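In the meantime, a quick sanity check (a sketch; the grep pattern is an assumption about how your install names things) to see what heap the running JVM actually got, and what the box has overall:

# show the Tomcat java process: look for -Xms/-Xmx in its command line,
# and note its resident memory (RSS, in KB)
ps -eo pid,rss,args | grep [c]atalina

# total, used, and available memory on the box
free -h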

3] Otherwise, perhaps none of the above answers your main question: why is Apache reporting 503 errors, or why do you see that Apache can’t reach Lucee via the attempt to proxy to 8888? (which suggests you’ve opted not to use the AJP connector)

In that case, I’ll say first that there can always be any of many reasons that Lucee (or CF) is not responsive to a request passed to it from a web server like Apache (or IIS). And sadly the logs (in Lucee or CF and tomcat) often fail to explain the reason. It could even be that they are indeed “running” at the time…but just not being “responsive” to the request passed in from the web server.

And in most such cases, your best next step will usually be to implement some Lucee- (or CF-) specific monitoring to better understand WHY it may be that CFML requests are not responding at a given moment. That could be FusionReactor or JVM tools (or the PMT in the case of CF).

Again, if the CFML engine is UP but not processing new CF requests, don’t presume such a monitor “can’t help” understand what’s going on in a moment. And some of the tools can be tracking things over time and could be useful even AFTER the CFML engine is restarted, to know what was going on BEFORE that. That may be most important in your case, whether Lucee is going down on its own or someone/something is restarting it.
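Even without a commercial monitor, here’s one free trick while things are hung (a sketch, assuming you can identify the Tomcat java PID): send the JVM signal 3 (SIGQUIT), which makes it write a full thread dump to its stdout (typically catalina.out) WITHOUT killing it. That snapshot shows what every request thread was doing at that moment:

# find the Tomcat/Lucee java PID (adjust the pattern to your install)
pgrep -f catalina

# signal 3 asks the JVM to dump all thread stacks to its stdout
# (usually catalina.out); it does NOT stop the process
kill -3 <PID>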

Such monitoring tools vary in how easily they are implemented and how they work (as well as cost, with many being free and even FR having a 14-day free trial). I realize you didn’t want to hear “you may need to enable monitoring”, and you may not want to “dig through such stuff”.

Maybe someone else will offer some magic pill that happens to be the right antidote for you. My experience is that since there are so many things that can go amiss, the most valuable use of time is using such tools to find the problem and solve it. I appreciate that when one is new to any such tools (even free JVM tools), it can be a hassle figuring how to set them up, use them, and connect the dots (of what they may or may not say) to your actual problem and its resolution.

4] That’s where having someone experienced with the tools help you implement and interpret them can pay off. I do that, but sadly I can’t do it for free (as I’m busy enough doing it on a consulting basis for folks). I am leaving all the above for you to consider, and I hope it may help (or at least point you in a better direction than your last post was headed). And of course I’m happy to answer questions or offer more as may seem appropriate here.

But if you remain stuck and “just need this solved”, I offer such remote screen-share consulting help, and we may find and resolve things in less than an hour or two. More on my rates, approach, satisfaction guarantee, and online calendar at Charlie Arehart's Server Troubleshooting Consulting Services. There’s also support from others here, including Rasia (which includes the team behind Lucee).

Sorry for the long-winded answer, but since you’re stuck and no one else has replied, I thought it better to offer what I could, and well, wordy is “how I roll”. :slight_smile: If such things were easy, smart folks like most here wouldn’t need help, or would get ready answers in places like this. Some challenges just aren’t that easy…though once we find the cause it may well be simple to resolve.


Hey Charlie,

Your name is very familiar… I think we’ve both been lurking around since the early days of Allaire. ColdFusion has certainly gone through its growing pains since then.

With regards to your first point, I figured that out not long after my post. Those “memory leak” messages appeared when I rebooted the server (so as it was shutting things down). I also found a lot of references to that on the InterWeb, i.e., Tomcat throwing the same message on shutdown.

Other than that, nothing that jumps out at me in the logs. I’m running 16GB RAM on the alma VPS, and that should be plenty with only a dozen or so sites, most fairly simple custom CMS sites using a postgres backend.

I’m much more handy building Linux servers and such. I’d heard about Tomcat over the years, but never really knew much about it until this year… let alone such strange terms as Catalina and Coyote. But I like a good challenge, and I’m never too old to learn, heh.

I don’t like to change too many variables at a time, scientific method and all that. I started by upgrading my Lucee version to the latest 5.4.6.9 last night. Then I upgraded the Postgres driver. Now I’m just waiting for it all to hang again (or hopefully not). Then maybe I’ll upgrade to 6 over the weekend if needed. And maybe throw another 6GB of RAM at it, based on your message… can’t hurt, right?

What would you recommend for monitoring tools?

What’s ultimately puzzling is that it all ran just fine for a good 6 months, and I haven’t changed anything on the web side in a good while… other than to upgrade Apache. But that’s the nature of the Linux beast, or so I’ve learned over the years.

If I tire of beating my head against the wall, I will indeed come knocking on your door, and thanks for that.

And btw… I like long winded answers :slight_smile:

Cheers and thanks… Dino

Good to hear back, and there’s more I could say in reply, but let me at least address “why the recent hangups”, when “it all ran for a good 6 months just fine”.

That happens to many folks. And I’d assert that if truly “nothing changed” on the server (or nothing to impact this), the next most likely root cause is a change in the nature and/or volume of traffic to the server (your dozen sites).

It’s easy to fall into a trap of assuming that traffic is steady, but all it takes is some spider (or hacker) to trawl/crawl one or more sites for perhaps all pages, or all variant pages (perhaps driven by a query string changing values), to generate 1000x traffic in a day, or an hour, or heck even a minute. And they may do it repeatedly, day after day, out of the blue. Then another could, and so on, likely at different times.

And that can stress Lucee (and Tomcat and your OS) but also things your app talks to (databases, APIs, and other network calls)…all of which could be like the toilet overflowing down into the kitchen of Lucee, which gets the first blame. (Of course the same can happen with CF or any web app tech.) These are problems I help solve almost daily, with hundreds of folks each year.

As for what best to monitor with, I’m partial to FusionReactor. It would make quick work of answering most if not all of these questions. But other tools have their place.


Indeed, makes sense. I guess that we all hope for things to purr along nicely, once the work of setting up and streamlining is done. I do block a metric ass-ton of abusive IPs on my firewall, almost daily, but that’s a cat and mouse game that nobody wins.

I could definitely use a CF monitoring tool, something I’ve been putting off for far too long. Back in the day I wrote a simple tool that would display CF connections per app, but that is long obsolete now, ha.

Cheers and thanks… Dino

Yes, blocking by IPs generally proves to be a tedious game of whack-a-mole. But you may find blocking by abusive user-agent values to be far more productive. Many have, including recently with spates of huge traffic from a ChatGPT bot. You can block by user agent easily in your web server, whether IIS or nginx or, in your case, Apache.
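For instance (a sketch only; the conf path and the bot names are placeholders you’d tailor from your own logs), on an Apache 2.4 / RHEL-family box you could drop in something like:

# write a drop-in conf (path assumes the RHEL-family /etc/httpd layout)
sudo tee /etc/httpd/conf.d/block-bad-bots.conf >/dev/null <<'EOF'
# flag requests whose User-Agent matches any of these (placeholder) patterns
SetEnvIfNoCase User-Agent "(mj12bot|ahrefsbot|semrushbot)" bad_bot
<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
EOF

# validate and reload without dropping connections
sudo apachectl configtest && sudo systemctl reload httpd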

Should you just block anything that includes “bot” in the user agent? No, because you may want visits from googlebot, so as to serve your content when people search for it. Same with “spider” or “crawler”.

That said, you can analyze your web server logs (such as per day) to find what unique user agents make the most calls, and decide based on that info.
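As a concrete example (assuming Apache’s common “combined” log format, where the user agent is the sixth quote-delimited field), a classic one-liner:

# top 20 user agents by request count (log path varies by distro/vhost)
awk -F'"' '{print $6}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head -20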

Do some clients lie about their user agent? Sure, they may say they’re just a normal browser, though they’re really an automated spider. Or they may lie and say they’re googlebot, to avoid being blocked.

But the point is that MANY DO find in such analysis of web server logs (or indeed FusionReactor request logs) that there are clear excesses from some user agents that DO identify themselves distinctly and can be blocked, which has made a world of difference for some. So this is not just some theoretical possibility. It’s even been discussed here in recent months (about the ChatGPT bot, whether it’s legitimately them or not).

Finally, any such automated requests will also trigger Lucee (or CF) to create a new SESSION (and run session start code) ON EVERY REQUEST, which can exacerbate heap/memory and other problems. Same with client variable processing, if enabled in an app, which could lead to excessive db traffic if they’re stored there. This is something I discussed further in a presentation from several years ago, still at my site, Presentations by Charlie Arehart. Do a browser find there for the word spiders, though other topics may also interest readers.

I’ve always been a bit nervous about blocking user agents, but maybe something to look into, thanks.

I went poking around the InterWeb for ColdFusion monitoring tools… not really much available, other than FusionReactor and SeeFusion. I installed SeeFusion, which gives me a memory snapshot that looks fine right now. I’ve been keeping an eye on DB activity with pgAdmin (with the system stats extension)… no load of any significance there. I’ll also keep an eye on the relevant logs. So now I wait and watch.

Thanks for all your good information, much appreciated.

Cheers… Dino

Yes, SeeFusion is a helpful option, and now open source. It differs from FR in many ways, not least of which is installation and configuration. Still, for many, it’s a good option.

You say you searched for more and found not much besides these. I’d said in my first post that there are also JVM tools (free and built in, or easily added). I cover them in other talks I’ve done, including one in May. Search the presentations page I’d offered, for “monitoring”.

Finally, yes, db monitoring can be very important. You’ll find on my page that I did a talk on that specifically earlier this year. To be clear, I always post my slides (which are detailed), and I link to any video recording that may exist.

Hope your own tuning efforts bear fruit.

@Dino, I successfully fought the bots by enabling Cloudflare with a custom user-agent filter.

Also, are you by any chance an InMotion Hosting customer like me? I ask because you mentioned an AlmaLinux VPS, and that’s exactly what I have. Because those are Virtuozzo systems, the swap file is disabled with no option to enable it. That sometimes causes problems when WHM Backup (if you have that enabled) runs every night. I don’t want to pay for more memory when it’s only an occasional issue, so my solution for now is a shell script that runs every minute via cron and restarts Lucee if it has stopped.

#!/bin/bash
# cron watchdog: if lucee_ctl doesn't report "is running", log it and restart Lucee

if ! /opt/lucee/lucee_ctl status | grep -q "is running"; then
	echo "$(date)"
	echo "Lucee is not running!"
	echo
	/opt/lucee/lucee_ctl start
	echo
fi
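And for completeness, a (hypothetical) crontab entry to drive it, assuming the script is saved as /root/lucee-watchdog.sh and made executable:

# run the watchdog every minute; append output to a log for later review
* * * * * /root/lucee-watchdog.sh >> /var/log/lucee-watchdog.log 2>&1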

That’s a handy script, for sure!

I host in Canada, so likely not the same company.

Cheers and thanks… Dino

Bots gone wild indeed, ha!

I’ve been using fail2ban (apache-badbots) to manage the nasty bots. Not ideal, since the bots have already consumed system resources by the time they’re logged. But it’s better than nothing for now, until I look into something better… such as blocking them before they get to the machine.
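For anyone curious, here’s roughly how I keep an eye on it (from memory, so double-check the jail name on your box):

# is the badbots jail active, and what has it banned?
sudo fail2ban-client status apache-badbots

# on newer fail2ban versions, list banned IPs across all jails
sudo fail2ban-client banned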

The bad bot list is getting so long… where do you find your updated bad bot list?

Thanks for the feedback.

Cheers… Dino

I answered that question in my post. :wink:

And Cloudflare is the something better you need, also covered in my post.

Thanks for the information and suggestions… think I’m on the right track now.

Cheers and thanks… Dino
