Hey Guys, love the new dev forums … nice work! Having not used cfthread in years ( and since moving to Lucee ), I have a few questions below:
Lucee Version 18.104.22.168
Other than physical server limitations, are there any limits to the number of concurrent threads?
In the admin under Tasks there is a limit to the number of threads, is this just for mail / scheduler / background tasks?
In the 5.2.4 RC I see a few thread related memory leak fixes involving cfcontext ( I think ) … could these leaks also cause problems with just a straight use of cfthread running cfhttp GET requests?
REASON FOR QUESTIONS
We have been running Lucee for well over a year with little or no issues, until last week. We recently started using cfthread to grab a ton of XML data concurrently. Our Lucee server just stopped responding, with no errors anywhere other than a long-running process that had its thread stopped ( Normal behavior ). The only change we have made to this set of scripts / backend server is adding the cfthread tag around a cfhttp. LAST BIT OF INFO: We always use cfthread action="join" to ensure everything gets finalized and closed out. Never seen any issues or errors.
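For anyone following along, here is a rough Java analogue of the pattern described above (run a batch of fetch jobs concurrently, then join with an overall timeout). It is only a sketch, not how Lucee implements cfthread. The class name `BoundedFetch`, the method `runAll`, and its parameters are hypothetical, and the jobs are plain `Callable`s standing in for the cfhttp calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs jobs on a fixed-size pool and joins with a timeout.
final class BoundedFetch {

    // Returns one entry per job, in order; null marks a job that failed
    // or was cancelled because the join timeout expired.
    static List<String> runAll(List<Callable<String>> jobs, int maxThreads, long joinTimeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        List<String> results = new ArrayList<>();
        try {
            // invokeAll blocks until all jobs finish or the timeout expires;
            // unfinished jobs are cancelled.
            for (Future<String> f : pool.invokeAll(jobs, joinTimeoutMs, TimeUnit.MILLISECONDS)) {
                try {
                    results.add(f.isCancelled() ? null : f.get());
                } catch (ExecutionException e) {
                    results.add(null); // the job itself threw
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        } finally {
            pool.shutdownNow(); // always release the worker threads
        }
        return results;
    }
}
```

The fixed pool size caps how many jobs run at once, and the `finally` block makes sure worker threads are released even if a job hangs past the join timeout.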
I know this is an easy-out answer, but can you do some testing on the latest snapshot of Lucee? Like you mentioned, we have found and fixed a couple of memory-related issues with threads in the last couple of sprints. It’s possible they could be affecting you.
Secondly, grab a copy of FusionReactor, which will give you a ton of information about what’s going on with your server: memory levels, number of threads, what each thread is doing, etc. You can get a dev license of FR for $199.
There’s not a limit that I know of. Threads themselves don’t really take any resources unless they’re doing something.
I think so, but don’t quote me. @Igal or @Gert probably knows for sure.
I’m not super familiar with the specifics, but I know it had to do with resources that were being used inside the thread that weren’t garbage collected properly when the thread quit running and was placed back in the pool. It would eventually fill up the heap memory and crash the server.
A1 - I have been messing with the setting in question 2 like a crazy person ( Now I feel a bit dumb ). The server in question didn’t have much RAM as it is just a backend processing server that does X, Y and Z little jobs. No users touch that box, just scheduled tasks. Last week the box totally ran out of RAM ( There was an out of memory error in the exception / application log ). We doubled it that day and we thought we had solved the issue. Yesterday morning, the Lucee process just froze. Plenty of disk space, RAM and everything else looked good. Even weirder, the service was showing as running.
A2 - Nope, there was literally only 1 error in the log showing 1 long running process ( 50 whole seconds ) and then the server appears to have stopped responding. With that said, it did appear that the server would start working for a bit ( long enough to schedule a batch ). Here is how it works and what we saw:
SCENARIO: The server looks for things to do every 1 minute. If there is a gap ( let's say we turn the machine off for 5 minutes ) the script sends a mail saying it has started the scheduling process again. So usually the machine never sends this message unless there is a problem / restart. We got the following emails yesterday:
Schedule Failure (Auto Reset) - 4:44 AM
Schedule Failure (Auto Reset) - 4:51 AM
Schedule Failure (Auto Reset) - 5:01 AM
Schedule Failure (Auto Reset) - 5:15 AM ( After we restarted Lucee )
The timeout in the log occurred around 4:40 AM and was on a CFHTTP request and we see this is when the box hung. Our HTTP heartbeat software shows the box down from 4:41 AM - 5:15 AM. We had someone looking into this at 4:05 AM. We never saw the server respond.
I know this is way more than you wanted to hear, but I would love to help clear up this issue in any way that I can. I usually don’t put SNAPSHOTS on our production machines. Let me know if you think that is our best option in this case.
Brad - The problem in reproducing it is volume. I have upgraded one of our production backend servers to run the 22.214.171.124 SNAP. I rebooted both boxes ( 1 is running 126.96.36.199 ) and I’ll watch the heap usage on both machines. I’ll update you once I have more info.
Anyone have any updates on this? Have been sequentially downgrading (now at Lucee 188.8.131.52) to try to get to a stable build. The builds above that one all seem to crash after about 2-5 days, requiring a restart.
Just as an update, we are seeing positive results with Lucee 184.108.40.206 SNAP. We did have a hardware issue that required a reboot on our server running 220.127.116.11 SNAP ( Not related to Lucee obviously ), but the new thread HEAP memory clean up appears to be doing a better job than 18.104.22.168.
NOTE: There still is a bit of a memory leak somewhere as the HEAP size does increase, yet at a much lower rate.
One thing I thought of based on your error logs is the 600 second timeout. If you look at the most recent code in the thread fix by @micstriit you will see that a thread timeout is set to 5*60000 ( Or 5 minutes ). The issue here might be with the code executing correctly within Lucee 5 or the difference between Lucee 5's and Lucee 4's handling of long running threads. Also you will see that the priority is downgraded for any thread running longer than 10 seconds ( Just some food for thought ). We implemented a 10 second default timeout on all HTTP requests just in case.
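To illustrate the "10 second default timeout on all HTTP requests" idea outside of CFML, here is a minimal sketch using the JDK's own HTTP client (Java 11+). It is not the code we run; in CFML the equivalent is the timeout attribute on cfhttp. The class name `TimeoutDefaults` and its helper methods are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper that applies one default timeout everywhere.
final class TimeoutDefaults {

    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(10);

    // Client that gives up if a connection cannot be established in time.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(DEFAULT_TIMEOUT)
                .build();
    }

    // GET request that fails with a timeout exception if no response
    // arrives within the default window.
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(DEFAULT_TIMEOUT)
                .GET()
                .build();
    }
}
```

Building every client and request through one helper means no call can be left without a timeout by accident, which is the point of having a default.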
Side Note for @micstriit Do we need to be concerned about your comment on line 199 of the LDEV-1473 fix on the following:
All good here. Have you tried it? We had that hardware failure last week and rebooted the machines after we applied patches on Monday. I’m running 22.214.171.124-SNAP on 2 production backend machines ( Using CFTHREAD ) and 126.96.36.199 on 3 other front ends ( no CFTHREADS ) with no issues. We are still seeing those weird errors:
@bdw429s or anyone else know of a way to get to the memory / heap info? Would love to have a way to monitor this without having to log in to the admin. Also, are there any other things I can watch to see the health of the server / Lucee ( to spot issues before there is a larger problem )?
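One possible starting point, since Lucee runs on the JVM: the standard `java.lang.management` beans expose heap usage and live thread counts without going through the admin. Below is a minimal Java sketch. The class name `HeapProbe` and its methods are hypothetical, and how you expose the numbers (a log line, a small endpoint, a scheduled task) is up to you.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical probe built only on the JDK's standard management beans.
final class HeapProbe {

    // Heap currently in use, in bytes.
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Maximum heap in bytes; the JVM reports -1 when no maximum is defined.
    static long maxHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getMax();
    }

    // Number of live threads (daemon and non-daemon) in this JVM.
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // One-line summary suitable for a log entry or a health-check response.
    static String summary() {
        return "heap used=" + usedHeapBytes()
                + " max=" + maxHeapBytes()
                + " threads=" + liveThreadCount();
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Logging the summary on a schedule and watching for heap that keeps climbing after garbage collection, or a thread count that never drops, should give an early warning of the kind of slow leak described earlier in this thread.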