Periodic issue leading to timeout errors

Simon_Goldschmidt · August 2, 2018, 10:34pm

We had an incident where our Lucee server started returning a bunch of timeout errors that are not related to any page request and for which we can’t find an explanation. There were 100 or so similar errors over 50 mins, then the rate reduced for a few hours before we restarted the server. We last saw this a month ago and when we didn’t restart the server, the issue progressed to IIS returning 403 responses (dynamic IP filtering restrictions that should not have been triggered) and then 503 responses. Restarting the server makes the issue disappear for a month or two. Stack trace is below… was hoping someone might make sense of this and have some suggestions for preventing a recurrence. We’re using Lucee 5.2.5 on Windows 2016.
Thanks… Simon

“ERROR”,“Thread-1610239”,“08/01/2018”,“00:51:53”,“controler”,"stop thread (11548) because run into a timeout (no path available).;java.lang.Throwable;java.lang.Throwable
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitLatch(NioEndpoint.java:1114)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitReadLatch(NioEndpoint.java:1116)
at org.apache.tomcat.util.net.NioBlockingSelector.read(NioBlockingSelector.java:184)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:235)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:216)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.fillReadBuffer(NioEndpoint.java:1241)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.fillReadBuffer(NioEndpoint.java:1221)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.read(NioEndpoint.java:1158)
at org.apache.coyote.ajp.AjpProcessor.read(AjpProcessor.java:1309)
at org.apache.coyote.ajp.AjpProcessor.readMessage(AjpProcessor.java:625)
at org.apache.coyote.ajp.AjpProcessor.receive(AjpProcessor.java:587)
at org.apache.coyote.ajp.AjpProcessor.refillReadBuffer(AjpProcessor.java:695)
at org.apache.coyote.ajp.AjpProcessor$SocketInputBuffer.doRead(AjpProcessor.java:1433)
at org.apache.coyote.Request.doRead(Request.java:581)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:326)
at org.apache.catalina.connector.InputBuffer.checkByteBufferEof(InputBuffer.java:642)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:349)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:152)
at lucee.commons.io.IOUtil.copy(IOUtil.java:283)
at lucee.commons.io.IOUtil.copy(IOUtil.java:72)
at lucee.commons.io.IOUtil.toBytes(IOUtil.java:943)
at lucee.commons.io.IOUtil.toBytes(IOUtil.java:937)
at lucee.runtime.net.http.HTTPServletRequestWrap.getInputStream(HTTPServletRequestWrap.java:256)
at lucee.runtime.net.http.HTTPServletRequestWrap.getReader(HTTPServletRequestWrap.java:355)
at lucee.runtime.type.scope.FormImpl.initializeUrlEncodedOrTextPlain(FormImpl.java:251)
at lucee.runtime.type.scope.FormImpl.initialize(FormImpl.java:131)
at lucee.runtime.PageContextImpl.formScope(PageContextImpl.java:1175)
at lucee.runtime.type.scope.UndefinedImpl.reinitialize(UndefinedImpl.java:611)
at lucee.runtime.type.scope.UndefinedImpl.initialize(UndefinedImpl.java:589)
at lucee.runtime.PageContextImpl.initialize(PageContextImpl.java:522)
at lucee.runtime.CFMLFactoryImpl.getPageContextImpl(CFMLFactoryImpl.java:171)
at lucee.runtime.engine.CFMLEngineImpl._service(CFMLEngineImpl.java:1061)
at lucee.runtime.engine.CFMLEngineImpl.serviceCFML(CFMLEngineImpl.java:1039)
at lucee.loader.engine.CFMLEngineWrapper.serviceCFML(CFMLEngineWrapper.java:102)
at lucee.loader.servlet.CFMLServlet.service(CFMLServlet.java:51)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:504)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342)
at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:486)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:790)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1459)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
"

carehart · March 15, 2021, 4:25pm

Hey, @Simon_Goldschmidt, I know this was from a couple of years ago, and no one ever replied. But did you ever resolve this? I am finding others experiencing this (and not even in Lucee or ACF, but in pure Tomcat), and I’ve not yet found anyone with an answer.

If you don’t have one/can’t recall, I’d understand. If I do find one, I will try to remember to come back here and share what I learned.

Simon_Goldschmidt · March 15, 2021, 9:19pm

Hi @carehart. Our solution at the time was to restart the servers regularly. I think this occurs as a result of the server running out of memory.

Next, we tried running the pagepoolclear() function each night, but this didn’t seem to take memory requirements back to a base line. We also tried reducing Tomcat’s Java “Maximum memory pool” but the memory used by the Apache Commons Deamon Service Runner application would grow to over double this size and keep growing until the server would run short on memory. Eventually, we just threw more memory resources at our servers to keep the issue at bay.

At the moment, we have two identical servers behind a load balancer, one running Lucee 5.3.7 and one 5.3.8. The 5.3.7 server uses a steady 1.5Gb of memory but the 5.3.8 server is up to 2.2Gb and growing daily.

Simon

carehart · March 15, 2021, 10:09pm

Interesting. I would not have connected this problem to a memory one at all, but I realize that’s your recollection, at least of how you all regarded it. FWIW, I am helping someone with the problem whose heap is reported at the time to be only 20% used, which is why I wouldn’t think it was related–but I suppose somehow it could be.

And sure, restarting is always “an answer”, but not one that many want to hear if it can be avoided. I’ll keep digging, and if I learn more, again I’ll try to share it here for folks who may find this in the future.

Zackster · March 15, 2021, 10:27pm

Hmmm, which 5.3.8 SNAPSHOT is this happening with? Is it heap or non-heap memory?

If it’s not the latest 5.3.8.159, which is pretty much 5.3.8-RC2, aside from a few minor outstanding regressions from much earlier in the 5.3.8 cycle, could you give it a try?

If it’s still happening with the latest SNAPSHOT, can you file a bug, as this is obviously a 5.3.8 cycle regression.

@carehart which specific version of Lucee are you battling with?

carehart · March 15, 2021, 11:05pm

Hi, Zac. In my case, it’s not someone using Lucee but ACF. But FWIW, I’d said in my first note that as I went looking for other info on the problem, “I am finding others experiencing this (and not even in Lucee or ACF, but in pure Tomcat), and I’ve not yet found anyone with an answer.”

That’s the oddity to me, that those who report the problem as being due to Tomcat (always, at its root) also never have had any answer to the problem raised. I have found at least 3 SO topics of this sort, and still others.

So when I found this one from Simon, which never had a reply, I wondered if he may have found a solution. To be clear, I’m not inclined to think the problem is a Lucee one (or an ACF one), but a Tomcat one, though I could be wrong. Or it could be influenced by Tomcat configuration (in the nio connector).

And it may not even be so much a “bug” but instead an instance of requests having some characteristic, leading to this state, which Tomcat is “throttling” intentionally. Of course, some will want to solve whatever is that root cause problem, while others may be happy to “raise the throttle” to allow more in.

In the ACF case I am working on, what happens is that a large number of requests (posts to CFC’s, FWIW) end up in this state, so that FR CP “request quantity” alert fires–and the thread dump in the emails shows clearly all those requests in the awaitlatch on the nio endpoint (the same “org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.awaitLatch” shown in Simon’s stack trace).

And I’m saying it’s my searching for other references to THAT which keeps ending up with posts (in SO) where people raise it and no one has a solution.

Simon_Goldschmidt · March 15, 2021, 11:09pm

Hi @Zackster, we’re using 5.3.8.139RC and non-heap seems to be behaving differently:

We’ll update to 5.3.8.159 and monitor. I expect it’ll take at least a week before we know if this behaviour is different.

Terry_Whitney · March 16, 2021, 1:48pm

Windows Servers have multiple triggers for scheduled tasks that effect memory, disk io and performance. You need to go through and disable all the scheduled tasks on production servers and set aside a time once every few days to reboot them and run the tasks on boot - delayed.

carehart · March 25, 2021, 1:56pm

Thanks Terry, but that seems a bit out of left field. No one mentioned scheduled tasks. And while Simon mentioned being on Windows I had not. But then you’re referring to memory so I gather you’re really responding to what he (and then Zac) wrote, about possible connections to memory.

I’ll say again that I think memory issues are a red herring: or at least I can say that my situation does not at all a memory issue, in that heap use is only 30% at the time of this burst of hanging requests. The key point was that my issue shows the same stack trace Simon shared originally: requests specifically hanging on that same CountDownLatch.await which is preceded by a line showing org.apache.tomcat.util.net.NioBlockingSelector.read, which is preceded ultimately by a line doing a org.apache.coyote.ajp.AjpProcessor$SocketInputBuffer.doRead.

These seem to clearly point to the AJP connector as a key factor. (And again in my case the stack trace further shows the requests doing a post to a CFC: coldfusion.xml.rpc.CFCServlet.doPost. Simon’s does not show that, but it may not be significant.)

As perhaps another reason I can say it’s not seemingly a memory issue: in the case I am working with at least, other requests DO run. It’s just that about 40 requests do hang this way. (And FWIW in this case all the requests are from one ip address/useragent address, all starting within about 30 seconds of each other. I have pointed out to my client that that could suggest a problem in their client code, perhaps causing an excessive burst of requests automatically, but of course it could be the end user hitting refresh of clicking a link over and over.)

And in that it does effect only some requests and this one user, it’s also not a crippling problem. But it does lead to FR CP email alerts when it happens, so the client and I do notice it. As I said in my first response, I’ve been looking for anyone else discussing a problem with the same stack trace. Simon’s post above was one I found. The only others I found were about Tomcat (not CF), and none of them had any explanation, either.

I am wondering if this could be some Tomcat or AJP limit that might get reached, in some way where it then queues requests (again in the AJP connector…not Tomcat or Lucee or ACF, as that’s so clearly indicated in the stack traces, Simon’s above and my client’s).

Hope that helps someone in the future, even if it may not get us any closer in the near term.

Zackster · March 26, 2021, 12:07pm

how is 5.3.8.159 going?

Terry_Whitney · March 26, 2021, 12:55pm

Charie - First off - I admire your work on ColdFusion, books, and contribution to our “DEAD LANGUAGE”

That being said, the guy is running Windows Server running out of resources every X days for N minutes. Its well worth looking at the OS as a MONTH is a weird number for JVM to suddenly go into garbage collection mode and error with “no path available”

Running "schtasks /query /s localhost > c:\list.txt " would show what ready and what is running and what is scheduled to run. Even if its not the cause of the issue, it would be well worth disabling unused services as well as looking at eventviewer to see if something is triggering during the crash.

Simon_Goldschmidt · March 26, 2021, 9:07pm

I haven’t observed any major difference with 5.3.8.159. It seems to find its balance using more memory than 5.3.7. Currently, they are using 2G and 1.8G respectively. I notice that 5.3.7 is using 100% of its non-heap memory, even with less memory overall, while 5.3.8 is using 83%. The good news is that we haven’t seen a recurrence of the errors since increasing the resources in the servers.