5.4.1.8 stops logging while in a Docker container

I’ve calved this off the similar-sounding thread “5.3.9 stops logging while in a docker container” at @Zackster’s behest. I’m also copying and pasting from the same topic on Slack (which got no further input):

Is there a chance that work done on https://luceeserver.atlassian.net/browse/LDEV-4035 might result in the logger’s config being reloaded without being asked?

Since upgrading to 5.4.1.8, we are experiencing our logs stopping logging INFO records, after successfully logging some for a while.

All these logs are configured via org.lucee.cfml.Administrator’s updateLogSettings method, in onApplicationStart. This was to work around the issue that unless one registers the log (either via the above or in Lucee Admin), the writeLog(file=xxx) will not log anything lower than ERROR due to [I do not recall, but it came out in an earlier discussion on LDEV-4035 or similar].
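For context, the registration in onApplicationStart looks roughly like this. This is a minimal sketch only: the log name and admin password are placeholders, and the updateLogSettings parameter names and the appender/layout aliases are from memory, so check them against your Lucee version’s Administrator.cfc before relying on them.

```cfml
// Application.cfc (sketch only; parameter names approximate)
component {
    this.name = "ourApp";

    function onApplicationStart() {
        var admin = new org.lucee.cfml.Administrator("web", "adminPassword");
        // registering the log is what lets writeLog(file="telemetry")
        // record entries below ERROR
        admin.updateLogSettings(
            name          = "telemetry",
            level         = "info",
            appenderClass = "resource", // file-based appender alias
            appenderArgs  = "path:{lucee-config}/logs/telemetry.log",
            layoutClass   = "classic",
            layoutArgs    = ""
        );
        return true;
    }
}
```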

When we start the app it’s fine for a few minutes, but then it ceases logging anything lower than ERROR again.

This is new behaviour since we upgraded from 5.3.9.166 last week. It started exactly then. And when we rolled back to 5.3.9.166, the problem went away again.

My colleague has pointed me to [that 5.3.9 thread, above].

So it looks like it had been an issue, was “fixed” at least for a while, but has regressed? Or the fix did not address some edge cases.

Another consideration here. We have three identical containers; the issue only occurred on one of them. It’s not the heaviest-traffic one, but it’s the one that is doing the most logging, given the nature of its usage. And it happened reproducibly on that one.

I could not replicate in our dev environment either.

I am not in the position to troubleshoot further as we’ve rolled the container back. And I won’t be rolling it forward again.

A couple of questions:

One of your team asked on Slack about updating settings on every single request (it’s also mentioned above): are you still doing this?

Do the lucee-server.xml or lucee-web.xml still reflect your desired settings once the problem starts?

Using the admin API for logging, do the returned values also match your expectations?

Log files are a shared, I/O-intensive resource, which is why we recommend configuring them once in the web or server context settings; reconfiguring them (blowing them away) on every request is asking for problems.

Define the logs in the admin and reference them via the log attribute, which then uses the shared, configured, pooled log file adapters.

Circling back to docker, is this a case of works on my machine but fails in prod / with load?

Cheers for coming back to me, @Zackster. I can answer most of these.

One of your team asked on slack about updating settings on every single request, and that’s mentioned above, are you still doing this?

We’ve given you a bum steer here. It’s done once at application start. This is a workaround for the Lucee bug introduced between 5.3.9.133 and 5.3.9.166 (I think) wherein writeLog(file="x", level="y", ...) stopped working for any level below ERROR, because Lucee was resetting the default log level to ERROR under the hood. This was your recommended way of addressing that.

NB: one idiosyncratic point in our code here is that we’re using file="fileName", not log="logName". However, this has always been the case.
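To illustrate the distinction, here are the two writeLog variants being discussed. The log names are invented for illustration; note Lucee’s writeLog documents the level argument as type, though this thread writes it as level.

```cfml
// file= addresses a log file directly by file name;
// log= references a log defined in config (admin / XML / application settings)
writeLog(file="telemetry", type="information", text="logged via file=");
writeLog(log="application", type="information", text="logged via log=");
```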

Do the lucee-server.xml or lucee-web.xml still reflect your desired settings once the problem starts?

I can’t answer that as I did not check at the time (and like I said, the containers have been rolled back now). But I did check the UI in Lucee admin and that was reflecting the correct settings, yes.

Using the admin API for logging, do the returned values also match your expectations?

I cannot test this. We have rolled the containers back, as I said. I cannot try anything new here.

Define the logs in the admin and reference them via log, which then uses the shared, configured pooled log file adapters.

We’re dealing with containers mate. Cattle vs pets. There is no “configuring stuff via the admin”. That is an archaic notion.

Circling back to docker, is this a case of works on my machine but fails in prod / with load?

I’ll repeat what I said in the post you are replying to:

Another consideration here. We have three identical containers; the issue only occurred on one of them. It’s not the heaviest-traffic one, but it’s the one that is doing the most logging, given the nature of its usage. And it happened reproducibly on that one.

I could not replicate in our dev environment either.

None of this is very high traffic: for all intents and purposes it’s an internal application, not a public-facing web application.

We have one container that is servicing humans using the UI… basically a CRUD interface to a DB. Looking at yesterday’s traffic, it’s peaking at 4 req/sec. It largely only logs things when something goes wrong (intrinsically at a higher level than INFO), but still logs a bit of telemetry at INFO level too. This container had no problems.

The other container basically does batch processing via scheduled tasks. It does a lot of heavy data processing (and logging), but only runs one request (task) at a time. It logs a lot of telemetry at INFO level, and some stuff at higher levels than that. This is the container that was bung.

It is not the case that only some log files reset their log level to ERROR; it happens to all of them at once. One moment we’re getting INFO log entries constantly; the next moment it stops. It seemed to be maybe five minutes after the container & application started, but as it was a mission-critical server I had to sort it out immediately. Log entries at ERROR continued to be written.

Restarting the app (so hitting the log config code in onApplicationStart) did not fix it. Restarting the container did. For a while.

It doesn’t seem to be to do with updateLogSettings. But on the other hand re-running that also doesn’t actually help.

In your position, I would be looking at any code you have that monkeys with logging settings and is not triggered by human user interaction (such as in response to updates via the Lucee admin UI, or active calls to updateLogSettings, etc), and ask “why are we running this code? What runs this code?”. If you have code that is “trying to be a big help”, or second-guessing what might be going on and “fixing” stuff, or addressing this issue automagically instead of throwing an error… I’d question its validity. Just a guess though.

Hope this lot helps. Sorry I’m not in the position to run any testing; but business continuity is more important than upgrading Lucee or troubleshooting logging issues whilst we lose critical log information. I can’t roll Lucee forward again to test stuff.

Cheers pal.


To clarify: by configuring in the admin, I simply meant via lucee-server.xml or lucee-web.xml (which the admin is basically a GUI for), as opposed to application-level settings.

Doing a bit of a code review, I did notice this bug with logLevel and application log configuration:

fixed [LDEV-4669] - Lucee

That bug has been there since the feature was first added, way back in 5.1.0.28!!!


Excellent work, but it’s nothing to do with this issue. This issue started between 5.3.9.166 and 5.4.1.8.

Yeah, just narrowing it down!

Cfadmin action=UpdateLogSettings doesn’t affect application settings, btw.

application action=update logs=.... does, but only for the current request.

Configuration is hierarchical, like CSS: server, web, application, request.
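For completeness, the application level of that hierarchy can be set in Application.cfc. This is a rough sketch only: the log name is invented, and the exact struct key names for the appender/layout settings are an assumption, so verify them against your version’s documentation.

```cfml
// Application.cfc (sketch; struct key names approximate)
component {
    this.name = "ourApp";
    // application-level log definition, overriding web/server context config
    this.logs = {
        "telemetry": {
            appender          : "resource",
            appenderArguments : { path: "{lucee-config}/logs/telemetry.log" },
            layout            : "classic",
            level             : "info"
        }
    };
}
```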

I’m just looking at a diff of Lucee/core/src/main/java/lucee/commons/io/log/LogUtil.java between 5.3.9.166 and 5.4.1.8, and some of the stuff that’s changed dicks around with log levels and falls back to LEVEL_ERROR.

OR. There’s a change in Lucee/core/src/main/java/lucee/runtime/tag/Log.java (find the comment // if we do have a log with the same name, we use the log) about how a log is init’ed when… by the looks… one is using a file, not a log…? Could something be erroneously calling that with an invalid context or something, causing a log level reset?

That’s just from looking at files that mention logLevel and have changed (and where the code seems to be touching something that could end up going wrong). It’s not a very well-informed search, as I’m not gonna take the time to understand how the code is called.

Still looking…

Cfadmin action=UpdateLogSettings doesn’t affect application settings, btw.

Oh yeah, my reference to restarting the app is because that is where the code to configure the logs is, remember? Restarting the app => the log config is applied again.

Another thought - as I look at some code that’s messing around with threads and contexts and stuff - the code in the container that is having the issue uses cfthread for a bit of its stuff.

What if there’s been some change to the context the code in a cfthread block runs in, and when some config for its logging is made, it’s a) inadvertently doing it wrong, so the log’s level is not handled properly; or b) inadvertently changing the main app’s context as well as the thread’s one?

I could be talking utter shite - worse than usual, I mean - but this is a difference between the nature of the code being run on this container vs the other containers. PLUS it could explain why the logging resets after a while… it would only happen once one of the tasks using cfthread was executed and needed to log something.
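The shape of the suspect code is roughly this (all names invented; it’s just to show the pattern of a scheduled task logging at INFO from inside a cfthread block):

```cfml
// run by a scheduled task; the worker thread logs telemetry at INFO.
// question being raised: does the thread's page context pick up (or
// clobber) the application's log configuration correctly?
thread name="batchWorker" action="run" {
    writeLog(file="telemetry", type="information", text="chunk processed");
}
threadJoin("batchWorker");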

Wild stab.

I’ve looked through all the code mentioning logs and stuff now, and the diffs between the two versions, and the stuff above is all I could find that, if not quite a smoking gun, at least has a slight waft of cordite about it.

Try dumping out getapplicationsettings().logs?
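i.e. something like this, assuming an application-level log config exists to dump:

```cfml
// inspect what log settings the application context actually holds
writeDump(getApplicationSettings().logs);
```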

I’ll say it again in case it sticks this time.

The container is gone man. We have rolled back. We will not be rolling fwd again.

After this experience of trying to upgrade to 5.4 we’ve given up, and are expediting our move away from Lucee; hoping we don’t get bitten by any security issues that crop up on 5.3.9.166 in the interim.

This info is just for your benefit; the ship has sailed for it to be any benefit to us.

I tried to create a repro case based on tasks / threads / etc, but couldn’t reproduce :expressionless:

Still thinking about it.