Blocked requests each minute on 5.3.8.x under load

This is related to LDEV-3710 https://luceeserver.atlassian.net/browse/LDEV-3710

I’m trying to determine if another ticket should be raised for the issue, beyond that ticket

Don’t forget to tell us about your stack!

OS: Oracle Linux
Java Version: OpenJDK 11.0.7
Tomcat Version: Tomcat 9.0.52
Lucee Version: Lucee 5.3.8.x (both perform same), and 5.3.7.47 doesn’t show this consistent issue

application environment
webroot serves a few thousand domains through a couple of applications.
webroot contains several applications, the majority of the traffics sessions are from the root, standard sessions (essentially holding one session variable), in memory with a 2hr 1min timeout.
The admin application uses db storage for sessions, with a 2hr 1min timeout, 4,000 or fewer records at any time.

problem
Getting requests blocked every minute. this appears after running under load for several hours to several days.

current state
running ok under 5.3.7.47, cannot keep 5.3.8.x under load for an extended period of time

To help diagonose, after items are built up, I adjust load:
Load Balancer is turned on (to send health checks, of which Cloudflare sends quite a few)
Load Balancer is set to 0 (so normal traffic isn’t routed to me)

Getting requests blocked every minute. (the minute does appear to shift, depending on how long the block lasts or if a thread dump occurs)

These requests should run sub 20ms, but during this time, they can regularly exceed 1s.

Originally I thought was somehow related to too many sessions being cleaned up, as fusion reactor showing destroying sessions was a consistent indicator of when it would occur, however I was able to repro, while only showing 20 active sessions in fusion reactor, and still seeing the issue (session count had been high during testing the night before).

following a restart, while under load for 3 hours, all servers are showing 119 - 120,000 active sessions.

If I set a 5.3.7.x server to 0 traffic and a 5.3.8.x to 0, so both are just receiving frequent health checks; the health checks on 7.x stay below 20, while the 8.x stays below 20 until it has a spike each minute where they exceed 100ms, and this spike remains consistent with the timing of session cleanup.

If I switch scope logging from error to info, or vise versa, it appears to clear up for a period of time (for a shorter amount of time than the original buildup).

If I catch it in profile

thread cpu time is around 2ms
total query time under 10ms
relations has a WEbRequest is a full blue line with a very small jdbcrequest block at the far right.

100.0% - 347ms java.lang.Thread.run(Thread.java)
	51 Collapsed Methods (show)
		100.0% - 347ms Waiting on lock <0x613b4d0b> (a lucee.runtime.PageSourcePool)

The longer the server runs, the larger the spikes get, while active sessions stays fairly steady. If we leave the 5.3.8.x in load, we would have to restart within 2 days as the spikes will get into several seconds delay. While we only have had to restart 5.3.7.x for scheduled maintenance.

If I turn runtime protection on and down to where I can get a full thread dump, they consistently contain the items below.
(quantity of blocked requests varies each time, but always same place)

"ajp-nio-0:0:0:0:0:0:0:1-8009-exec-235" Id=1713944 BLOCKED on lucee.runtime.PageSourcePool@2aa69129 owned by "Thread-1650386" Id=1720538 
State:	BLOCKED
Deamon:	true
Blocked:	8/00:00:06.492
Waiting:	60/00:23:52.592
CPU:	00:00:00.129
User:	00:00:00.100
Memory:	(22,959,656 B)
lucee.runtime.MappingImpl.getPageSource(MappingImpl.java:399)
- waiting on <0x2aa69129> (a lucee.runtime.PageSourcePool held by thread 1720538, Thread-1650386)
lucee.runtime.MappingImpl.getPageSource(MappingImpl.java:394)
lucee.runtime.config.ConfigImpl.getPageSources(ConfigImpl.java:984)
lucee.runtime.config.ConfigImpl.getPageSources(ConfigImpl.java:891)
lucee.runtime.PageContextImpl._execute(PageContextImpl.java:2448)
lucee.runtime.PageContextImpl.executeCFML(PageContextImpl.java:2421)
lucee.runtime.engine.Request.exe(Request.java:45)
lucee.runtime.engine.CFMLEngineImpl._service(CFMLEngineImpl.java:1179)
lucee.runtime.engine.CFMLEngineImpl.serviceCFML(CFMLEngineImpl.java:1125)
lucee.loader.engine.CFMLEngineWrapper.serviceCFML(CFMLEngineWrapper.java:97)
lucee.loader.servlet.CFMLServlet.service(CFMLServlet.java:51)
javax.servlet.http.HttpServlet.service(HttpServlet.java:764)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
jdk.internal.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
com.intergral.fusionreactor.j2ee.filterchain.WrappedFilterChain.doFilter(WrappedFilterChain.java:134)
com.intergral.fusionreactor.j2ee.filter.FusionReactorRequestHandler.doNext(FusionReactorRequestHandler.java:772)
com.intergral.fusionreactor.j2ee.filter.FusionReactorRequestHandler.doHttpServletRequest(FusionReactorRequestHandler.java:344)
com.intergral.fusionreactor.j2ee.filter.FusionReactorRequestHandler.doFusionRequest(FusionReactorRequestHandler.java:207)
com.intergral.fusionreactor.j2ee.filter.FusionReactorRequestHandler.handle(FusionReactorRequestHandler.java:809)
com.intergral.fusionreactor.j2ee.filter.FusionReactorCoreFilter.doFilter(FusionReactorCoreFilter.java:36)
jdk.internal.reflect.GeneratedMethodAccessor81.invoke(Unknown Source)
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
com.intergral.fusionreactor.j2ee.filterchain.WrappedFilterChain.doFilter(WrappedFilterChain.java:71)
jdk.internal.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
com.intergral.fusionreactor.agent.filter.FusionReactorStaticFilter.doFilter(FusionReactorStaticFilter.java:54)
com.intergral.fusionreactor.agent.pointcuts.NewFilterChainPointCut$1.invoke(NewFilterChainPointCut.java:42)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:542)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357)
org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:433)
org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893)
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1726)
org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x149fad43> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
java.lang.Thread.run(Thread.java:834)
"Thread-1650386" Id=1720538 RUNNABLE 
State:	RUNNABLE
Deamon:	true
Blocked:	0/00:00:00.000
Waiting:	0/00:00:00.000
CPU:	00:00:01.211
User:	00:00:01.200
Memory:	(89,112,272 B)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:52)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:53)
lucee.commons.collection.LongKeyList.add(LongKeyList.java:42)
lucee.runtime.PageSourcePool.clearUnused(PageSourcePool.java:169)
lucee.runtime.MappingImpl.clearUnused(MappingImpl.java:300)
- locked <0x2aa69129> (a lucee.runtime.PageSourcePool)
lucee.runtime.functions.system.PagePoolClear.clear(PagePoolClear.java:93)
lucee.runtime.functions.system.PagePoolClear.clear(PagePoolClear.java:85)
lucee.runtime.functions.system.PagePoolClear.clear(PagePoolClear.java:65)
lucee.runtime.engine.Controler.doClearPagePools(Controler.java:447)
lucee.runtime.engine.Controler.control(Controler.java:362)
lucee.runtime.engine.Controler.control(Controler.java:244)
lucee.runtime.engine.Controler.access$000(Controler.java:58)
lucee.runtime.engine.Controler$ControlerThread.run(Controler.java:113)

Hi, because I’m seeing “lucee.runtime.PageSourcePool.clearUnused” in your stacktrace, your issue seems to have to do with [LDEV-2904] - Lucee
Where a half-baked solution was done, which makes sure the pageSourcePool size now keeps growing, and no actual cleaning up is done. While the cleaning threads are still running, they just loop to find items to throw away, and then don’t throw them away. That happens every minute.
Regards, Paul

I discussed this with @micstriit, he has a fix in mind.

He’s on a well deserved holiday for a few weeks and we will address this when he gets back

2 Likes

Yes, I noticed the exact same thing when I was helping Matt with this via E-mail. All of the logic in that method is pointless every minute when it runs because the line that actually removes the page source objects is commented out but I don’t understand why the entire method wasn’t just commented out.
What I couldn’t figure out however is why the LongKeyList class recursed so many times but I don’t quite understand the relationship of the page source files. They must be in some sort of hierarchy.

I think it maybe related to the approach which splits generated classes across classloaders due to inherent limitations with how java works

A proposed mr that should fix the issue by avoiding synchronized section. This from my understanding of the code should work correctly, using atomic operations.

From what i read longKeyList is an ordered hashmap, we can probably be way more efficient on this part. I’ve pondered with the usage of an LRU for the PageSourcePool which will avoid having to reinvent the wheel here.

just wanted to see if there was any further updates on this? I saw that it appeared to be tagged for 6, but am little concerned about being stuck on 5.3.7. until then.

I’ve created a new ticket to track this…

https://luceeserver.atlassian.net/browse/LDEV-3854

adding comments / PRs / whatever under deployed tickets gets lost in the noise

1 Like

there have been some additional 5.3.8 SNAPSHOTs since the stable release

plz try 5.3.8.213-SNAPSHOT