Request does not time out whilst in a query

PeterJackson · October 24, 2022, 3:48pm

We had an interesting problem on our production servers that was caused by a single query that suddenly started misbehaving and taking an extraordinary amount of time.

Whilst a query is running, pages don’t time out due to a request timeout or stop processing when the user disconnects. So the net effect is a large number of headless queries running. This can (and did) cause havoc on the database server.

It looks like a known issue: [LDEV-1067] - Lucee

I’d like to find a generic way to stop an unexpected slow query causing havoc again.

I could use timeout on the cfquery but in the discussion on LDEV-1067, one person recommends against it. Is there a better way?

I’d prefer not to add a timeout on the datasource as some of the queries we run when everyone’s in bed are expected to take a long time.

I know that a properly designed schema with correct indexing and well-formulated queries is essential. We do that already! The execution plan for a query can change over time and mysql does not always make the best decisions. In my recent example, with two identical servers with identical database config and identical data: on one server the query completed in milliseconds, and on another server it did not complete in minutes - due to mysql choosing a different execution plan.

Thanks, Peter

Gavin_Baumanis · October 26, 2022, 1:08am

@PeterJackson
Just shooting from the hip, here…

Could you change it / split it?
To use a “base” / simple query
And then a query of queries for any specific filtering / grouping etc?

I have done this a few times in our code base - specifically (for us) where there is a dynamic nature to the end-result that is needed.

PeterJackson · October 26, 2022, 8:09am

@Gavin_Baumanis Thanks for your reply.

When we encounter problems, we use those methods too, and I’m already confident that I can find an SQL or SQL/CF combined solution to any problem I encounter. I was able to fix the issue I had a few days ago by reformulating the query using common table expressions. We monitor the slow query log and proactively fix any SQL.

I want to prevent future problems from any unexpected rogue SQL. A rogue query actually crashed MySQL: because users try slow pages over and over again and queries continue to run after the page has timed out, many copies of the query were running headless and it filled up the partition used for MySQL temporary files.

I want to abort such queries in a timely manner so that can investigate a few aborted page requests than fix more serious damage.

We have made changes at the operating system and database server level, but I feel there should be a CF answer too.

Gavin_Baumanis · October 26, 2022, 3:07pm

I love CTEs!

PeterJackson · November 3, 2022, 4:07pm

I just discovered this:

On our Windows 10 dev servers, a request DOES timeout during a slow query.
On our LINUX dev servers, a request DOES timeout during a slow query.
On our LINUX live servers, a request DOES NOT timeout during a slow query.

As far as I can tell, all servers are using the same version of Lucee and connecting to the same version of MariaDB using the same version of the JDBC driver, and the configuration is identical.

Confused.

PeterJackson · November 4, 2022, 9:01am

I can work round this by prefacing my SELECTs with SET STATEMENT max_statement_time=n FOR

But I would dearly love to understand why the behaviour is different between my servers.

carehart · November 4, 2022, 1:05pm

Peter, about your discovery yesterday, what means of setting a timeout are you using (which worked differently for you in the different servers)?

It’s not unusual for ACF or lucee to not timeout a request while it’s running a query. If you’d found that to work, many would want to know how and when it can work.

Besides the cfquery timeout, there’s an option in ACF (with their built-in drivers licensed from datadirect) to set a timeout. That’s not always killed all longer-running db requests. And some jdbc drivers offer their own args that could help here. Then there’s the request timeout (at the page, app, or admin level), which has tended not to stop requests waiting on a db call, or a cfhttp, etc.

But in fact, you want to also timeout the query itself (in the db), not just the cf request . Are you confirming that in your “working” example, the query on the db times out also? If not, your “rogue queries could still wreak havoc in the db”. Two problems to solve, yes

But I realize your main surprise was that it worked on one and not another. In that case, was it the same kind of query to the same db? If not, some aspect of the query itself, or configuration for the db itself, could be where the “difference” is.

Finally, for when a dsn-level setting DOES work, one idea that some use is to create two fans that are identical in all respects but that timeout, and they change the code that should NOT need to run long to use the one with the timeout. Not as elegant, but functional…as long as it does timeout the query, or at lest the cf request’s waiting for it.

Looking forward to what you or others may have to say.

PeterJackson · November 7, 2022, 10:00am

Charlie - thanks for looking. We have not knowingly set any different config between the servers. The linux dev and live servers are very deliberately kept on the same version of everything: o/s, Java, tomcat, apache, Lucee, jdbc drivers, database, config, etc

Here is some simple test code I have been using to compare the behaviour on different servers. When I run this, I also monitor the running queries on the server.

<cfflush interval="1">

<cfset requesttimeout	= url.requesttimeout	?: 20>
<cfset querytimeout		= url.querytimeout		?: 20>
<cfset runQuery			= url.runQuery			?: "Y">

<cfif requesttimeout neq 0>
	<cfsetting requesttimeout="#requesttimeout#">
</cfif>

<cfoutput>
<html>
<body>
	<p>This will run a query or CF code that takes 200s.</p>
	<p>Run with <a href="?requesttimeout=60&amp;querytimeout=10">?requesttimeout=60&amp;querytimeout=10</a></p>
	<p>Run with <a href="?requesttimeout=10&amp;querytimeout=60">?requesttimeout=10&amp;querytimeout=60</a></p>


	<p>url.requesttimeout = the CF request timeout.  Default 20.  0 means no timeout.</p>
	<p>url.querytimeout = the cfquery timeout.  Default 20.  0 means no timeout.</p>
	<p>url.runQuery = Y|N.  Default Y.  Y means run a database query.  N means run CF code.</p>

	<p><strong>Current Params</strong></p>
	<p>requesttimeout=<strong>#HTMLEditFormat(requesttimeout)#</strong>, querytimeout=<strong>#HTMLEditFormat(querytimeout)#</strong>, runQuery=<strong>#HTMLEditFormat(runQuery)#</strong></p>

	<cftry>
		<cfif runQuery eq "Y">
			<h2>Running a slow query on <strong>#Application.svr_installConfig.databaseReadonly#</strong>...</h2>
			<cfquery name="q" datasource="#Application.svr_installConfig.databaseReadonly#">
				<cfif querytimeout gt 0>
					SET STATEMENT max_statement_time=<CFQueryParam value="#querytimeout#" sqlType="INTEGER" /> FOR
				</cfif>
				SELECT SLEEP(200)
			</cfquery>
			<p>Query has completed</p>
		<cfelse>
			<h2>Running slow CF code...</h2>
			<cfloop from="1" to="200" index="i">
				<p>Sleep #i#</p>
				<cfsleep time="1000">
			</cfloop>
		</cfif>
	<cfcatch>
		<cfdump var=#cfcatch#>
	</cfcatch>
	</cftry>

	<p>Finished</p>
</body>
</html>
</cfoutput>

On my dev servers:

When request timeout=10 and query timeout=60, the request aborts after 10s. The in-page exception is “Statement cancelled due to timeout or client request”

When request timeout=60 and query timeout=10, the request aborts after 10s. The in-page exception is “Query execution was interrupted (max_statement_time exceeded)”

On my live servers:

When request timeout=10 and query timeout=60, the request continues for 60s. The page silently stops after 60 and the exception in exception.log is “java.io.IOException: The current thread was interrupted;lucee.runtime.exp.NativeException: java.io.IOException: The current thread was interrupted”

When request timeout=60 and query timeout=10, the request aborts after 10s. The in-page exception is “Query execution was interrupted (max_statement_time exceeded)”

BK_BK · November 7, 2022, 10:53am

The following (applying the url scope) might or might not matter, but I missed it in your code:

	<cfparam name="url.requesttimeout" default="20">
	<cfparam name="url.querytimeout" default="20">
	<cfparam name="url.runQuery" default="Y">
     ...
       <p>requesttimeout=<strong>#HTMLEditFormat(url.requesttimeout)#</strong>, querytimeout= 
       <strong>#HTMLEditFormat(url.querytimeout)#</strong>, runQuery=<strong>#HTMLEditFormat(url.runQuery)#</strong></p>
      ...
       SET STATEMENT max_statement_time=<CFQueryParam value="#url.querytimeout#" cfsqlType="INTEGER" /> FOR

PeterJackson · November 7, 2022, 4:51pm

I don’t think that matters because variables scope is used in preference to url scope.
We stopped using <cfparam> years ago because it was astoundingly slow. It’s much much better now.

BK_BK · November 7, 2022, 6:47pm

Ah, OK.
My next questions: Could cfflush have something to do with it? Is interval=“1” perhaps too small? What happens when you comment out the cfflush line?

Gavin_Baumanis · November 8, 2022, 1:30am

Hi @PeterJackson ,
Just as a "am I sure I they’re “exactly” the same configs…
You could use commandbox to export the configs from the two servers in question and do a compare.

I am not suggesting you’re wrong… But you just never know…
You could have missed a “checkbox” difference in the datasources… / etc.
And a DIFF of the two might result in a surprise.

(Of course - I have “never / ever” sworn that something was EXACTLY the same only to find out, embarassingly, in front of others - that they actually weren’t exactly the same at all - and that the very obvious difference - was indeed the culprit… Never happened… )

Similarly, you mention that you keep your dev / prod infrastructure the same.
Have you “actually” checked the version of MariaDB / Java / OS / etc for minor (patch) level differences?

Are the DBs on separate servers to Lucee?
How are the dev / prod database servers - WRT OS / Java / etc version matching?

Lastly - just came to me.
Have you considered “environment variable” differences - they’re not normally in a repository WITH the source code - so perhaps there is some environment setup / config at play here - that is “hidden”.

Good luck!