Corruption and ghosts from scheduled tasks

I have been hitting very serious issues with scheduled tasks and the cfschedule tag. When my app starts, I clear out all tasks, then re-register them in onRequestEnd via cfthreads that call the different modules of my app, each of which registers its own tasks. The first issue manifests here: if I join my threads, my entire Lucee install becomes corrupted. Random components get deleted, I get syntax and language parsing errors that make no sense, and I have to rebuild the whole image.

<cfschedule action="list" result="v.q_schedules" />

<cfloop query="v.q_schedules">
	<cfschedule action="delete" task="#v.q_schedules.task#" />
</cfloop>

If I don’t join the threads and just let them finish after the request, the Lucee install remains intact until I restart the container, but I randomly get ghost copies of tasks that are slightly different from what I registered, and not all of the tasks get a ghost copy.
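For reference, the registration pattern looks roughly like this. This is a minimal sketch, not my actual code: the thread name, task name, URL, and interval are placeholders, and each module's real registration logic is more involved.

<cfthread name="registerModuleA" action="run">
	<!--- Each module registers its own tasks inside its thread. --->
	<cfschedule action="update"
		task="moduleA-task"
		operation="HTTPRequest"
		url="https://localhost/moduleA/job.cfm"
		startDate="#dateFormat(now(), 'yyyy-mm-dd')#"
		startTime="00:00"
		interval="3600" />
</cfthread>

<!--- With this join in place, the install eventually corrupts;
      without it, I get the ghost tasks instead. --->
<cfthread action="join" name="registerModuleA" />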

Example lucee errors:

The OSGi Bundle with name [esapi.extension] in version [2.2.4.15] for is not available locally [ (/opt/lucee/server/lucee-server/bundles)] or from the update provider [ (https://update.lucee.org)].

ERROR: Failed to download the bundle for [findbugsAnnotations] in version [3.0.1] from [https://update.lucee.org/rest/update/provider/download/findbugsAnnotations/3.0.1/?serverId=a940094252c4a7488ae793c8599315b2&serverSecurityKey=6a3f632d-dff4-497f-b901-541de33dc26b&allowRedirect=true&jv=11.0.22], please download manually and copy to [/opt/lucee/server/lucee-server/bundles]

com/mysql/cj/protocol/a/SqlDateValueEncoder$1

Once Lucee is corrupted, I have to add --force-recreate to docker-compose to repair it.

In my dev environment, all tasks register as paused, and that is reflected in the admin UI. But these ghost ones are running, and they can’t be stopped. Deleting all tasks from the admin or using cfschedule doesn’t work. All tasks do disappear from the admin, but these ghost ones persist.

When my app registers its tasks, the .CFConfig.json updates to show them, though they are all missing the paused attribute, but in the admin they do show as paused. If I pause them in the admin, the attribute gets added in the json. If I delete them, they are gone from the json. But regardless of being paused or deleted, if there’s a ghost one it’ll keep trying to execute.
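For illustration, a registered entry in .CFConfig.json ends up looking something like the following. This is a hand-written sketch, not copied from my file; the exact key names and values are assumptions and may differ, but the point is that no paused key is present even though the admin shows the task as paused:

"scheduledTasks": [
	{
		"name": "moduleA-task",
		"url": "https://localhost/moduleA/job.cfm",
		"startDate": "2024-01-01",
		"startTime": "00:00:00",
		"interval": "3600"
	}
]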

Sometimes I get this error in both the admin and using cfschedule: lucee.runtime.exp.ExpressionException:can't delete schedule task [taskName], task doesn't exist. In the admin I’ve selected clearly visible tasks to delete, and with cfschedule I’m just looping through the query that the list action produced.
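A defensive variant of the delete loop, sketched below, at least survives that error so one phantom entry doesn’t abort the rest of the loop, though it does nothing about the ghosts themselves:

<cfschedule action="list" result="v.q_schedules" />

<cfloop query="v.q_schedules">
	<cftry>
		<cfschedule action="delete" task="#v.q_schedules.task#" />
		<cfcatch type="expression">
			<!--- The task was listed but is already gone; log and continue. --->
			<cflog text="Could not delete task #v.q_schedules.task#: #cfcatch.message#" />
		</cfcatch>
	</cftry>
</cfloop>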

Since this is more of a mystery than a bug report, where can I look for more clues? There has to be a cache somewhere, or maybe a quirk of the request lifecycle’s interaction with cfschedule. The ability to destroy a Lucee install is particularly troubling, since I don’t think I’m doing anything very interesting.

I’m running the official Docker build, lucee/lucee:6.0.0.585-SNAPSHOT-light, upgraded to 6.0.1.83, in Single Mode. Here is the relevant part of my Dockerfile:

ENV LUCEE_ADMIN_ENABLED true
ENV LUCEE_SERVER /opt/lucee/server/lucee-server

ADD https://cdn.lucee.org/6.0.1.83.lco "${LUCEE_SERVER}/deploy/6.0.1.83.lco"
ADD https://ext.lucee.org/lucee.admin.extension-1.0.0.5.lex "${LUCEE_SERVER}/deploy/lucee.admin.extension-1.0.0.5.lex"
ADD https://ext.lucee.org/ehcache-extension-2.10.0.37-SNAPSHOT.lex "${LUCEE_SERVER}/deploy/ehcache-extension-2.10.0.37-SNAPSHOT.lex"
ADD https://ext.lucee.org/esapi-extension-2.2.4.15.lex "${LUCEE_SERVER}/deploy/esapi-extension-2.2.4.15.lex"
ADD https://ext.lucee.org/lucee.image.extension-2.0.0.26-RC.lex "${LUCEE_SERVER}/deploy/lucee.image.extension-2.0.0.26-RC.lex"
ADD https://ext.lucee.org/com.mysql.cj-8.1.0.lex "${LUCEE_SERVER}/deploy/com.mysql.cj-8.1.0.lex"
ADD https://ext.lucee.org/org.postgresql.jdbc-42.6.0.lex "${LUCEE_SERVER}/deploy/org.postgresql.jdbc-42.6.0.lex"

And my dev docker-compose:

version: "3.8"

services:
  lucee:
    build:
      context: .
      dockerfile: Dockerfile-local
    volumes:
      - ./:/var/www
      - ./password.txt:/opt/lucee/server/lucee-server/context/password.txt
      - ./LuceeSettings/.CFConfig.json:/opt/lucee/server/lucee-server/context/.CFConfig.json
      - ./LuceeSettings/server.xml:/usr/local/tomcat/conf/server.xml
    restart: always
    ports:
    - "80:80"

Maybe I can focus the question to keep troubleshooting: what does cfschedule action="update" actually produce, besides an entry in /opt/lucee/server/lucee-server/.CFConfig.json (single mode) or /WEB-INF/lucee/.CFConfig.json (multi mode)? I’ve tried both single and multi mode and get the same issue. Something is trying to execute a task that either doesn’t exist in those files, or was created in a paused state (the admin UI shows it as paused, though the entry in .CFConfig.json is always missing the "paused" attribute). Maybe those are separate issues, so let me stick with the first question: where does the actual scheduled task execution happen, and what is its source of tasks?