Is this a bug from Railo 4.2.1.008 and is it fixed in Lucee 4.5?

Shane_Curless · May 19, 2016, 1:48pm

Hi,

My team currently has Railo 4.2.1.008 running on a production server. We
are experiencing an issue with the server locking up and requiring a
restart of Jetty.

Right now it happens roughly every 9 hours or so, but this interval has
been getting shorter and shorter as the service gains more users.

I believe I have tracked the issue down to logging. Our stderr log file
shows this:

Wed May 18 19:49:23 EDT 2016-107 timeout after 10006 ms (10000 ms) occured while accessing file [/var/www/html/v2.ims-login.com/WEB-INF/railo/logs/login.log]
Wed May 18 19:49:23 EDT 2016-107 conflict in same thread: on /var/www/html/v2.ims-login.com/WEB-INF/railo/logs/login.log
java.lang.NullPointerException
        at railo.commons.io.retirement.RetireOutputStreamFactory$RetireThread.run(RetireOutputStreamFactory.java:43)
Wed May 18 19:49:23 EDT 2016-108 conflict in same thread: on /var/www/html/v2.ims-login.com/WEB-INF/railo/logs/login.log
Wed May 18 19:49:23 EDT 2016-108 conflict in same thread: on /var/www/html/v2.ims-login.com/WEB-INF/railo/logs/login.log

I’m not sure if the timeout and conflict messages are related in any way to
the RetireOutputStreamFactory messages, but they do happen around the same
time.

When this happens, the server is unresponsive. It will accept a connection,
but then sends no response, and eventually the connection times out. Only
restarting Jetty gets it going again, so we have a monitor set up to restart
it automatically whenever a connection times out. That is far from an ideal
solution for a production system.
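
For anyone wanting to reproduce the kind of monitor described above, here is a
minimal sketch in Java. It is not the monitor we actually run; the class name
HangProbe, the method name and the timeout values are made up for
illustration. It treats "accepts the connection but never answers" the same
as "cannot connect", because both end in an IOException once the timeouts
are set.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical health probe: names and timeouts are assumptions, not part of Railo/Lucee or Jetty.
final class HangProbe {

    // Returns true only if the server sends back an HTTP status line in time.
    static boolean isResponsive(String url, int connectTimeoutMs, int readTimeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // fail fast if nothing is listening
            conn.setReadTimeout(readTimeoutMs);       // fail if the server accepts but stays silent
            conn.setRequestMethod("GET");
            return conn.getResponseCode() > 0;        // any valid HTTP status counts as "alive"
        } catch (IOException e) {
            // Covers connection refused, connect timeout and read timeout alike.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

A scheduler or cron job could call isResponsive every minute or so and
trigger the Jetty restart when it returns false. The read timeout is the
important part for this particular hang, since the connect step still
succeeds.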

Based on a post I found on Google Groups, it seems someone else had the same
issue some time ago, but it was never resolved for him.

Is this a bug that existed in Railo 4.2.1.008, and has it been fixed in
Lucee 4.5?

P.S. I am also a Java developer, so any answers related to the Railo/Lucee
source won’t be a problem for me.

andrew · May 19, 2016, 2:08pm

Hi Shane,

First port of call would be to check the Lucee JIRA to see if you can find
if the bug has been logged and if so, what the status of the ticket is:

https://luceeserver.atlassian.net/secure/Dashboard.jspa

If not, then please raise a ticket for it and it can be progressed from
there. If you are able to find and fix the bug yourself then a pull request
would also be appreciated.

Kind regards,

Andrew

Shane_Curless · May 19, 2016, 2:18pm

I found LDEV-750, a ticket about a very similar problem in Lucee 5. The
error messages are the same, except that for the ticket creator it happens
on startup, and the ticket also references another “failed to flush writer”
error.

It does seem from both my issue and his that the issue is stemming from an
NPE in RetireOutputStream/RetireOutputStreamFactory.

Shane_Curless · May 19, 2016, 4:29pm

What is the purpose of the following check in ResourceLockImpl?

if(t==Thread.currentThread()) {
    //aprint.err(path);
    Config config = ThreadLocalPageContext.getConfig();
    if(config!=null)
        SystemOut.printDate(config.getErrWriter(),"conflict in same thread: on "+path);
    //SystemOut.printDate(config.getErrWriter(),"conflict in same thread: on "+path+"\nStacktrace:\n"+StringUtil.replace(ExceptionUtil.getStacktrace(new Throwable(), false),"java.lang.Throwable\n","",true));
    return;
}

The problem may be related to this check, because it runs in the process of
getting a RetireOutputStream for logging.
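
To make the question concrete, here is a minimal sketch of what a check like
this usually guards against. It is not the Railo source: the class
PathLockSketch, the Result enum and the polling loop are all assumptions made
for illustration. The idea is a non-reentrant lock keyed by file path. If the
thread that already owns a path asks for it again, waiting would only mean
waiting on itself until the timeout, so the lock reports the conflict and
returns instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical per-path lock; a sketch of the pattern, not Railo's ResourceLockImpl.
final class PathLockSketch {

    enum Result { ACQUIRED, SAME_THREAD, TIMEOUT }

    // Maps each locked path to the thread that currently owns it.
    private final ConcurrentMap<String, Thread> owners = new ConcurrentHashMap<>();
    private final long timeoutMs;

    PathLockSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    Result lock(String path) {
        Thread me = Thread.currentThread();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            Thread owner = owners.putIfAbsent(path, me);
            if (owner == null) {
                return Result.ACQUIRED;      // nobody held the path; we own it now
            }
            if (owner == me) {
                return Result.SAME_THREAD;   // re-entry: waiting would block on ourselves
            }
            if (System.currentTimeMillis() >= deadline) {
                return Result.TIMEOUT;       // another thread held the path for too long
            }
            try {
                Thread.sleep(5);             // simple polling; a real lock would wait/notify
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Result.TIMEOUT;       // give up if interrupted while waiting
            }
        }
    }

    void unlock(String path) {
        // Only the owning thread can release its own entry.
        owners.remove(path, Thread.currentThread());
    }
}
```

In this sketch, a second lock call on the same path from the same thread
returns SAME_THREAD right away, which would correspond to the "conflict in
same thread" message, and a different thread that cannot get the path within
the limit gets TIMEOUT, which would correspond to the "timeout after ... ms"
message. Whether the real code releases the lock correctly after a same-thread
conflict is exactly what I am unsure about.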

Grant_Griffith · September 19, 2016, 12:34pm

Any luck fixing this issue? I am seeing this most mornings when the load
picks up. Usually a Lucee restart resolves it for that day, but sometimes
it does happen multiple times a day here.

Lucee Version: 4.5.3.020

Grant

Shane_Curless · September 19, 2016, 2:32pm

As far as I am aware this issue hasn’t been fixed. My team had to resort to
commenting out all cflog tags and script calls.

Zackster · August 8, 2019, 5:01pm

just seen this with 5.3.3.60-RC…