We have been hosting a large number of sites since 2015, but over the last year the servers have developed two issues.
Firstly, we are finding that the Lucee service sometimes (often) dies completely if we change anything in the server admin panel (for example, editing a datasource or changing the performance/caching config).
The symptoms are:
- the Lucee admin panel hangs on the save
- requests to all of the sites result in timeouts
- whenever it happens, we look in the logs and find the following:
“Apache Native” warnings in catalina.out
04-Jul-2018 11:09:46.186 INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[mod_cfml] ERROR (sent to client): 503: Time Between Contexts has not been fulfilled. Please wait a few moments and try again.
04-Jul-2018 11:09:38.218 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixDispatchQueue] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
04-Jul-2018 11:09:38.218 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixFrameworkWiring] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
04-Jul-2018 11:09:38.219 WARNING [127.0.0.1-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [ROOT] appears to have started a thread named [FelixStartLevel] but has failed to stop it. This is very likely to create a memory leak.
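For anyone who wants to check their own catalina.out for the same pattern, here is a rough Python sketch that counts the "failed to stop it" thread warnings and pulls out the mod_cfml 503 lines. It only assumes the line format shown in the excerpt above; the function names and the log path passed on the command line are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the Tomcat warning format seen in the excerpt above, e.g.
# "04-Jul-2018 11:09:38.218 WARNING [...] ...clearReferencesThreads ... thread named [FelixDispatchQueue] ..."
LEAK_RE = re.compile(
    r"^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) WARNING .*"
    r"clearReferencesThreads .*started a thread named \[(?P<thread>[^\]]+)\]"
)


def leaked_threads(lines):
    """Count how often each thread name appears in a 'failed to stop it' warning."""
    counts = Counter()
    for line in lines:
        match = LEAK_RE.match(line)
        if match:
            counts[match.group("thread")] += 1
    return counts


def mod_cfml_503s(lines):
    """Return the mod_cfml 503 error lines, stripped of trailing newlines."""
    return [
        line.rstrip("\n")
        for line in lines
        if "[mod_cfml] ERROR" in line and "503" in line
    ]


if __name__ == "__main__":
    # Hypothetical usage: python scan_log.py /path/to/catalina.out
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        log_lines = handle.readlines()
    for name, count in leaked_threads(log_lines).most_common():
        print(f"{name}: {count}")
    for line in mod_cfml_503s(log_lines):
        print(line)
```

Running it after each admin-panel change would at least show whether the same Felix threads are reported every time the context restarts.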
At this point we have to stop and start the lucee_ctl service and then the second problem happens.
LUCEE ERRORS ON APPLICATIONS STARTING
The applications all error on the first few responses. It is as if Lucee is trying to serve pages before it has finished starting up the application.
After a refresh or two, the servers run very smoothly until the next time we touch the config.
A common error is the one below, where Lucee cannot even process an include without erroring:
The Error Occurred in
/var/www/html/some-installation/some-website/Application.cfc: line 4
3: // pull in boilerplate
4: include “…/application-base.cfm”;
6: // called by the application-base code
called from /var/www/html/some-installation/some-website/Application.cfc: line 1
Java Stacktrace lucee.runtime.exp.NativeException: String index out of range: -18
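Since a refresh or two clears these errors, one possible stopgap (not a fix) would be to request each site a few times after a restart before letting traffic back onto the server. A rough Python sketch of that idea follows; the URL list, retry count, delay and helper names are all hypothetical, and it assumes a plain 200 response means the application has finished starting.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Error responses (e.g. 500, 503) still carry a status code.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def warm_up(urls, fetch=fetch_status, attempts=5, delay=2.0, sleep=time.sleep):
    """Request each URL until it answers 200 or the attempts run out.

    Returns a dict mapping each URL to True (warmed) or False (still failing).
    """
    results = {}
    for url in urls:
        warmed = False
        for _ in range(attempts):
            if fetch(url) == 200:
                warmed = True
                break
            sleep(delay)
        results[url] = warmed
    return results


if __name__ == "__main__":
    # Hypothetical site list; replace with the real hostnames.
    sites = ["http://localhost:8888/"]
    for site, ok in warm_up(sites).items():
        print(f"{site}: {'ready' if ok else 'still failing'}")
```

With three load-balanced clones, this could be run against one server at a time while it is out of rotation, although that would only hide the startup errors rather than explain them.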
Some notes on the servers
The servers in question are:
3 servers, all clones of each other, load balanced through HAProxy
5GB+ of free disk space
45+ medium-traffic websites (HAProxy reports each server handling about 50K sessions an hour)
45+ low-traffic CMS applications
Lucee 18.104.22.168 (upgrade being tested, before you mention it)
Remote MySQL 5.5 DB cluster
Has anyone seen this before?
Could it be that we have too many applications on the same host? When we are not changing any config the servers tend to be pretty stable; the issues only seem to occur once we make a change in the admin panels.
Many thanks for any input