I’ve had several instances of servers becoming unresponsive, timing out all requests. Restarting Lucee usually doesn’t help.
While trying to figure it out by adding log statements, I did something that seemed to fix it, but I’m not sure how that could be…possibly something in the way cffile works?
Here’s what I did:
I was trying to grep for messages in my log file, but grep only said “binary file matches”…after investigating, I found that there were some NULL characters in the file (bad error message returned from an external http request, somehow). I disabled that logging message to make sure my logfile was clean text. Then I stopped the server and renamed the log file with nulls to a backup file and restarted the server. After that, the timeout errors stopped.
Looking at the logging function (a custom tag), I found code like this:
<cfparam name="OutputPath" default="#ExpandPath( '/' )#">
<cfparam name="OutputFile" default="trace.txt">
<cflock type="exclusive" timeout="120">
<cfif Not(FileExists(fullFilePath))>
<cffile action="APPEND" file="#OutputPath#\#OutputFile#" output="Creating new trace file..." addNewLine="Yes">
</cfif>
... code to set currentTime, remoteAddr, thisScript, etc...
<cfdirectory action="list" directory="#OutputPath#" name="MyDirectory" filter="#OutputFile#" >
<!--- START LOG RESET --->
<cfif MyDirectory.Size GTE "30000000">
<cffile action="APPEND" file="#fullFilePath#" output="Rolling over trace file..." addNewLine="yes">
<cffile action="delete" file="#fullFilePath#">
<cffile action="APPEND" file="#fullFilePath#" output="========== #currentTime# ==========" addNewLine="Yes">
</cfif>
<!--- END LOG RESET --->
<!--- START LOG RECORD --->
<cfif Trim(attributes.OutputLine) NEQ "">
<cffile action="APPEND" file="#fullFilePath#" output="#currentTime# [ #remoteAddr# | #thisScript# | #threadName# | #thisSessionID#] - #attributes.OutputLine#" addNewLine="Yes">
</cfif>
<!--- END LOG RECORD --->
</cflock>
The first thing that pops out at me is the 120 second lock. That could explain the page timeouts since every page calls this logging function in onRequestStart, at minimum.
But what could be causing the lock not to release?
The “LOG RESET” section is suspect, for one… I doubt it’s doing the right thing, but I don’t know exactly what it is doing…
cfdirectory
is being called on the root directory of the site. I don’t know what MyDirectory.Size
is supposed to be…when I inspect it, it’s always increasing (around 14 million in this case), but it doesn’t match the actual size of the directory (du -hs
shows 1.9G). It doesn’t match the amount of file entries in the directory either (330). Maybe it’s the total of the files in the directory but not subdirectories?
Whatever it’s doing, it doesn’t seem like the best idea, performance-wise, to be querying the directory contents/size on every single log message (many times per request)…but it’s been that way forever, so that’s probably not the culprit of the timeouts. Plus, I never deleted the old log file …it’s still in the same directory with a new name.
…which leaves the cffile
tag… is it possible that when cffile
opens a file for “append” that it actually needs to read the file in text mode and the embedded null characters could be screwing it up? I see a “readBinary” mode, but no “writeBinary” or “appendBinary”
As far as I can tell, this is the only code in the system that accesses that file. Nothing explicitly reads the file, unless it is happening in cffile action="append"
behind the scenes.
Is this a possibility? Any other ideas?
Thanks,
-Partap
OS: Windows Server 2019
Java Version: 11.0.3 (AdoptOpenJDK) 64bit
Tomcat Version: 9.0.19
Lucee Version: 5.3.7.47