Lucee 6.1.0.243 - all contexts/settings being wiped

Hi all, I wonder if you can help me with a bit of a mysterious issue.

My Lucee install is being randomly wiped clean: I have 4 contexts (sites) working absolutely fine, and then all of a sudden every one of them loses its context and all settings - e.g. all DSNs are wiped and all settings are lost for the whole Lucee server. I can't access the Lucee admin through any of the sites afterwards, and I have to restore a backup of the whole /opt/lucee folder to get it back online.

This is, obviously, a bit of a disaster.

I’m on Debian 12, using apache2. The server has 8 GB RAM, 8 GB swap, a 500 GB disk and 8 vCPUs. Tomcat is starting up with the following in setenv.sh:
export JAVA_OPTS="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=70 -XX:G1ReservePercent=15 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 -XX:+AggressiveOpts -Xms5000m -Xmx5000m -Djava.awt.headless=true";

Does anyone know what would cause this catastrophic wipe of everything? We had taken a last-known-good copy of the folder before it happened, then ran a diff afterwards and found that all jar bundles in tomcat/lucee-server/bundles had been removed, all files relating to installed extensions in tomcat/lucee-server/context had been removed, and new conf files for each of the sites had been created under tomcat/work/Catalina in the corrupted version of the lucee folder that weren’t in the last-known-good version.

I wondered if anyone had any pointers for where we can look. Currently we don’t know if it’s something wrong with Lucee itself, or if it’s some sort of hack script running (there are plenty of automated hack attempts showing up in the logs, but none we’ve been able to identify as causing this issue - the vast majority just bounce off harmlessly).

Any clues?

TIA,
Rich

OK, so further investigation reveals a number of lucee-server.x.buggy files in the /opt/lucee/tomcat/lucee-server/context/ folder; the timestamp of each one matches the date the server went down and lost all of its config.

Running a diff of the most recent .buggy file against the current last-known-good .CFConfig.json file shows differences only in the start dates and intervals of some scheduled tasks, nothing else. Some scheduled tasks on this server are updated dynamically, so that’s kind of expected.

However, that got me thinking, and I now have a working hypothesis: if two or more tasks have their trigger dates altered by concurrent threads, each trying to write to .CFConfig.json at the same time, the file ends up corrupted and Lucee then creates a blank version.

We’re testing that hypothesis by placing named locks around our task scheduling calls (roughly as in the sketch below), and I’ll report back in a few days if that seems to have helped.
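For anyone wanting to do the same, this is only a minimal sketch of what we mean - the lock name, task name, URL and schedule values are placeholders, not our real config:

// Hedged sketch: wrap every dynamic schedule update in the same named,
// exclusive lock so only one thread rewrites the task (and hence
// triggers a .CFConfig.json write) at a time.
lock name="scheduledTaskConfigUpdate" type="exclusive" timeout="30" {
    schedule
        action    = "update"
        task      = "nightlyImport"
        operation = "HTTPRequest"
        url       = "https://example.com/tasks/import.cfm"
        startDate = dateFormat( now(), "yyyy-mm-dd" )
        startTime = "03:00"
        interval  = "daily";
}

The important part is simply that every code path which updates a scheduled task goes through the same lock name, so the config file is never rewritten by two threads at once.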

OK, so it’s now been 5 days since we placed named locks around all of our calls and the problem has not recurred, so we believe the hypothesis is correct: concurrent calls to update scheduled tasks can corrupt the .CFConfig.json file. This has the effect of removing all server settings - e.g. DSNs, scheduled tasks, mappings etc. Granted, they’re all saved in a lucee-server.x.buggy file, but still.

If this hasn’t already been reported or dealt with in a newer snapshot of Lucee, should it be reported as a new bug? Surely the .CFConfig.json file should be better protected against this?
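If it would help whoever picks this up, here’s a rough, untested sketch of the kind of reproduction I have in mind, based on the hypothesis above - the task name and URL are made up, and obviously don’t run this on a production server:

// Hedged repro sketch: fire several concurrent, unlocked updates of the
// same scheduled task, then check whether .CFConfig.json survived intact.
// "raceTestTask" and the URL are placeholders.
taskThreads = [];
for ( i = 1; i <= 5; i++ ) {
    arrayAppend( taskThreads, "taskRace#i#" );
    thread name="taskRace#i#" idx=i {
        schedule
            action    = "update"
            task      = "raceTestTask"
            operation = "HTTPRequest"
            url       = "https://example.com/noop.cfm"
            startDate = dateFormat( now(), "yyyy-mm-dd" )
            startTime = timeFormat( dateAdd( "n", attributes.idx, now() ), "HH:mm" )
            interval  = "daily";
    }
}
thread action="join" name=arrayToList( taskThreads );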

Many thanks,
Rich