Hey Brad, thanks for the response!
A1 - I have been messing with the setting in question 2 like a crazy person (now I feel a bit dumb). The server in question didn't have much RAM, as it is just a backend processing server that runs small jobs X, Y and Z. No users touch that box, just scheduled tasks. Last week the box ran completely out of RAM (there was an out-of-memory error in the exception / application log). We doubled the RAM that day and thought we had solved the issue. Yesterday morning, the Lucee process just froze. Disk space, RAM and everything else looked good. Even weirder, the service was still showing as running.
A2 - Nope, there was literally only 1 error in the log, showing 1 long-running process (50 whole seconds), and then the server appears to have stopped responding. That said, the server did appear to start working again for short stretches (long enough to schedule a batch). Here is how it works and what we saw:
SCENARIO: The server looks for things to do every minute. If there is a gap (let's say we turn the machine off for 5 minutes), the script sends an email saying it has started the scheduling process again. So the machine normally never sends this message unless there is a problem or a restart. We got the following emails yesterday:
Schedule Failure (Auto Reset) - 4:44 AM
Schedule Failure (Auto Reset) - 4:51 AM
Schedule Failure (Auto Reset) - 5:01 AM
Schedule Failure (Auto Reset) - 5:15 AM ( After we restarted Lucee )
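To make the auto-reset logic concrete, here is a rough sketch of the kind of gap check described above. It is not our actual script (that runs on Lucee); the names `GAP_THRESHOLD`, `check_gap` and `replay`, and the 2-minute tolerance, are made up purely for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical tolerance: the real script's threshold is not stated in this post.
GAP_THRESHOLD = timedelta(minutes=2)

def check_gap(last_run, now, threshold=GAP_THRESHOLD):
    """Return True if the scheduler appears to have missed runs since last_run."""
    if last_run is None:
        # Assumption: the first run after a (re)start counts as a reset.
        return True
    return (now - last_run) > threshold

def replay(run_times):
    """Return the run times at which an 'Auto Reset' email would be sent."""
    alerts = []
    last_run = None
    for t in run_times:
        if check_gap(last_run, t):
            alerts.append(t)
        last_run = t
    return alerts
```

Under these assumptions, each email in the list above would correspond to one run that happened after a gap, which fits the server waking up briefly and then stalling again.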
The timeout in the log occurred around 4:40 AM on a CFHTTP request, and that is when the box hung. Our HTTP heartbeat software shows the box down from 4:41 AM - 5:15 AM. We had someone looking into this at 4:05 AM. We never saw the server respond.
I know this is way more than you wanted to hear, but I would love to help clear up this issue in any way that I can. I usually don't put SNAPSHOTS on our production machines. Let me know if you think that is our best option in this case.