Cfthread

wwilliams · September 14, 2017, 5:27pm

Hey Guys, love the new dev forums … nice work! Having not used cfthread in years ( and since moving to Lucee ), I have a few questions below:

Lucee Version 5.2.3.35

QUESTIONS

1. Other than physical server limitations, are there any limits to the number of concurrent threads?
2. In the admin under Tasks there is a limit to the number of threads. Is this just for mail / scheduler / background tasks?
3. In the 5.2.4 RC I see a few thread-related memory leak fixes involving cfcontext ( I think ) … could these leaks also cause problems with a straight use of cfthread running cfhttp GET requests?

REASON FOR QUESTIONS

We have been running Lucee for well over a year with few or no issues, until last week. We recently started using cfthread to grab a ton of XML data concurrently. Our Lucee server just stopped responding, with no errors anywhere other than a long running process that had its thread stopped ( Normal behavior ). The only change we have made to this set of scripts / backend server is adding the cfthread tag around a cfhttp. LAST BIT OF INFO: We always use cfthread action="join" to ensure everything gets finalized and closed out. We have never seen any issues or errors.

Thanks in advance and keep up the great work!

bdw429s · September 14, 2017, 6:21pm

I know this is an easy-out answer, but can you do some testing on the latest snapshot of Lucee? Like you mentioned, we have found and fixed a couple of memory-related issues with threads in the last couple of sprints. It’s possible they could be affecting you.

Secondly, grab a copy of FusionReactor, which will give you a ton of information about what’s going on with your server: memory levels, number of threads, what each thread is doing, etc. You can get a dev license of FR for $199.

1. There’s not a limit that I know of. Threads themselves don’t really take any resources unless they’re doing something.
2. I think so, but don’t quote me. @Igal or @Gert probably knows for sure.
3. I’m not super familiar with the specifics, but I know it had to do with resources that were being used inside the thread that weren’t garbage collected properly when the thread quit running and was placed back in the pool. It would eventually fill up the heap memory and crash the server.

bdw429s · September 14, 2017, 7:13pm

@wwilliams Another quick question: are you having a number of long running requests that are timing out and getting killed by Lucee?

wwilliams · September 14, 2017, 8:16pm

Hey Brad, Thanks for the response!

A1 - I have been messing with the setting in question 2 like a crazy person ( Now I feel a bit dumb ). The server in question didn’t have much RAM as it is just a backend processing server that does X, Y and Z little jobs. No users touch that box, just scheduled tasks. Last week the box totally ran out of RAM ( There was an out of memory error in the exception / application log ). We doubled it that day and we thought we had solved the issue. Yesterday morning, the Lucee process just froze. Plenty of disk space, RAM and everything else looked good. Even weirder, the service was showing as running.

A2 - Nope, there was literally only 1 error in the log showing 1 long running process ( 50 whole seconds ) and then the server appears to have stopped responding. With that said, it did appear that the server would start working for a bit ( long enough to schedule a batch ). Here is how it works and what we saw:

SCENARIO: The server looks for things to do every 1 minute. If there is a gap ( let’s say we turn the machine off for 5 minutes ) the script sends a mail saying it has started the scheduling process again. So the machine normally never sends this message unless there is a problem or a restart. We got the following emails yesterday:

Schedule Failure (Auto Reset) - 4:44 AM
Schedule Failure (Auto Reset) - 4:51 AM
Schedule Failure (Auto Reset) - 5:01 AM
Schedule Failure (Auto Reset) - 5:15 AM ( After we restarted Lucee )

The timeout in the log occurred around 4:40 AM and was on a CFHTTP request and we see this is when the box hung. Our HTTP heartbeat software shows the box down from 4:41 AM - 5:15 AM. We had someone looking into this at 4:05 AM. We never saw the server respond.

I know this is way more than you wanted to hear, but I would love to help clear up this issue in any way that I can. I usually don’t put SNAPSHOTS on our production machines. Let me know if you think that is our best option in this case.

Thanks again!

bdw429s · September 14, 2017, 8:40pm

Can you reproduce this on a non production machine to test?

Also, FusionReactor. Get it. You need to dig in and look under the hood when the server isn’t responding. What threads are active? What does a thread dump show? What is memory usage at? etc.
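For readers without a monitoring tool at hand, the questions above ( which threads are active, what a thread dump shows ) can be approximated with the JVM's standard management API. The following is only a rough Java sketch: `ThreadProbe` and its method names are made up for illustration and are not part of Lucee or FusionReactor, and the code only sees the JVM it runs in, so it would have to be loaded inside the Lucee process to say anything about Lucee's threads. For a process that has already stopped responding, the JDK's `jstack <pid>` command-line tool is the usual way to take the dump from outside.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucee or FusionReactor.
class ThreadProbe {

    // Count live threads in this JVM by state (RUNNABLE, WAITING, BLOCKED, ...).
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip monitor and synchronizer details to keep the call cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // thread ended while the dump was being taken
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    // Build a plain-text dump: one header line per thread, then its stack frames.
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=")
               .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        System.out.println(dump());
    }
}
```

A large and growing number of threads stuck in the same state with the same top stack frame is usually the first thing to look for in the output.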

psarin · September 14, 2017, 9:34pm

Maybe this is related to my issue as well. We use cfthread a fair amount as well.

https://lucee.daemonite.io/t/upgrade-to-lucee-5-2-3-35-results-in-gc-overhead-limit-crashes/2766

wwilliams · September 15, 2017, 1:17am

Brad - The problem in reproducing it is volume. I have upgraded one of our production backend servers to run the 5.2.5.12 SNAP. I rebooted both boxes ( 1 is running 5.2.3.35 ) and I’ll watch the heap usage on both machines. I’ll update you once I have more info.

psarin · September 19, 2017, 1:58pm

Anyone have any updates on this? I have been sequentially downgrading (now at Lucee 5.2.1.9) to try to get to a stable build. The builds above that one all seem to crash after about 2-5 days, requiring a restart.

wwilliams · September 20, 2017, 3:33pm

Just as an update, we are seeing positive results with Lucee 5.2.5.12 SNAP. We did have a hardware issue that required a reboot on our server running 5.2.5.12 SNAP ( Not related to Lucee obviously ), but the new thread HEAP memory clean up appears to be doing a better job than 5.2.3.35.

NOTE: There still is a bit of a memory leak somewhere as the HEAP size does increase, yet at a much lower rate.

Julian_Halliwell · September 20, 2017, 4:38pm

This is good to hear. Thanks for letting us know.

wwilliams · September 20, 2017, 4:51pm

One thing I thought of based on your error logs is the 600 second timeout. If you look at the most recent code in the thread fix by @micstriit you will see that a thread timeout is set to 5*60000 ( Or 5 minutes ). The issue here might be with the code executing correctly within Lucee 5, or with the difference between Lucee 5’s and Lucee 4’s handling of long running threads. You will also see that the priority is downgraded for any thread running longer than 10 seconds ( Just some food for thought ). We implemented a 10 second default timeout on all HTTP requests just in case.

Side note for @micstriit: do we need to be concerned about your comment on line 199 of the LDEV-1473 fix on the following:

Hope this helps!
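The idea of capping how long concurrent requests may run can be sketched outside CFML as well. The Java example below is an illustration only, not the LDEV-1473 code and not how Lucee implements cfthread: `BoundedFetch`, `runAll` and `fakeFetch` are hypothetical names, and the "fetch" is a sleep standing in for an HTTP GET. It runs the tasks on a fixed-size pool and waits for a bounded time in total; anything still running when the deadline passes is cancelled and reported as `null`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; names are not from Lucee.
class BoundedFetch {

    // Run tasks on at most maxThreads workers and wait at most timeoutMs overall.
    // A task that fails or is still running at the deadline yields null.
    static List<String> runAll(List<Callable<String>> tasks, int maxThreads, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        List<String> results = new ArrayList<>();
        try {
            // invokeAll returns once every task is done or the timeout expires;
            // unfinished tasks are cancelled (their threads are interrupted).
            for (Future<String> f : pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS)) {
                if (f.isCancelled()) {
                    results.add(null);
                    continue;
                }
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null); // the task itself threw
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // caller was interrupted while waiting
        } finally {
            pool.shutdownNow(); // never leave worker threads behind
        }
        return results;
    }

    // Stand-in for an HTTP GET: sleeps for delayMs, then returns its name.
    static Callable<String> fakeFetch(String name, long delayMs) {
        return () -> {
            Thread.sleep(delayMs);
            return name;
        };
    }

    public static void main(String[] args) {
        List<Callable<String>> tasks = Arrays.asList(
            fakeFetch("fast", 10),
            fakeFetch("slow", 3000));
        System.out.println(runAll(tasks, 2, 300));
    }
}
```

The fixed pool size caps how many requests run at once, and the overall deadline keeps one slow remote host from holding the batch open indefinitely.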

Julian_Halliwell · September 27, 2017, 1:00pm

Hi @wwilliams, how are things after a week (assuming you’ve kept 5.2.5.12 running)?

wwilliams · September 27, 2017, 1:33pm

All good here. Have you tried it? We had that hardware failure last week and rebooted the machines after we applied patches on Monday. I’m running 5.2.5.12-SNAP on 2 production backend machines ( Using CFTHREAD ) and 5.2.3.35 on 3 other front ends ( no CFTHREADS ) with no issues. We are still seeing those weird errors:

https://lucee.daemonite.io/t/application-log-error-http-nio-8888-exec-xx-ajp-nio-8009-exec-xx/2768

@bdw429s or anyone else know of a way to get to the memory / heap info? Would love to have a way to monitor this without having to log in to the admin. Also, are there any other things I can watch to see the health of the server / Lucee ( to see issues before there is a larger problem )?
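Since Lucee runs on the JVM, one possible route to the heap numbers is the standard `java.lang.management` API. The Java sketch below is an assumption-laden illustration rather than a Lucee feature: `HeapProbe` and its methods are hypothetical names, and the values describe only the JVM the code runs in, so it would need to run inside the Lucee process ( for example loaded as a Java class from CFML ) or be replaced by a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not part of Lucee.
class HeapProbe {

    // Current heap usage of the JVM this code runs in.
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    // Bytes currently used on the heap.
    static long usedBytes() {
        return heap().getUsed();
    }

    // Used heap as a percentage of the maximum heap size.
    // getMax() returns -1 when no maximum is defined; fall back to committed memory.
    static double usedPercent() {
        MemoryUsage u = heap();
        long limit = u.getMax() > 0 ? u.getMax() : u.getCommitted();
        return 100.0 * u.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %d bytes (%.1f%%)%n", usedBytes(), usedPercent());
    }
}
```

Logging these two values on a schedule and watching the trend after garbage collection is one simple way to spot a slow leak like the one described earlier in the thread.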

Julian_Halliwell · September 27, 2017, 2:22pm

Thanks. Haven’t tried it yet, but encouraged to do so despite it being a snapshot.

wwilliams · September 28, 2017, 4:02pm

Thanks Julian!