Stack: Windows (10/2012R2), OpenJDK 8u232, Lucee 188.8.131.52, Tomcat 9.0.31, IIS+Boncode.
We’ve had problems with 184.108.40.206/80, which at first glance seem to be connected with the Tomcat AJP Ghostcat patch.
We ran 220.127.116.11 locally and in prod before Ghostcat and didn’t observe any issues, but after patching noticed application restarts were taking much longer than usual. Same with 18.104.22.168, which we applied to another prod machine and which after a few hours started throwing timeouts/thread deaths and eventually ground to a halt. At that point we reverted to 22.214.171.124 and things returned to normal.
Running 126.96.36.199 locally, I’m seeing test suites that normally complete in a couple of seconds or less taking 40 seconds plus, or just timing out (at 50 seconds). ORM apps seem worst affected. Can’t see anything in the logs other than timeouts.
But if I spin up 188.8.131.52 in CommandBox (no Tomcat) performance is fine.
This suggests a problem with Tomcat, but I’ve tried Zac’s advice to modify the AJP connector definition in server.xml (see also LDEV-2741) and it doesn’t seem to make any difference.
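For context, this is the sort of connector change being discussed: Tomcat 9.0.31 hardened the AJP connector as part of the Ghostcat fix (it now binds to loopback only and expects a shared secret by default), so the usual workaround is to set those attributes explicitly. A sketch of what that looks like in `server.xml` — the attribute values here are illustrative, not a copy of Zac’s exact config:

```xml
<!-- Post-Ghostcat Tomcat 9.0.31+: the AJP connector binds to loopback and
     requires a secret by default. Setting address and secretRequired
     explicitly restores pre-9.0.31 behaviour for a localhost-only setup. -->
<Connector protocol="AJP/1.3"
           port="8009"
           address="::1"
           secretRequired="false"
           redirectPort="8443" />
```

If the AJP port is only reachable from localhost, `secretRequired="false"` is generally considered acceptable; otherwise you’d want a matching secret configured on both the Tomcat and connector sides.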
And as I say, rolling back to 184.108.40.206 with the same Tomcat config immediately solves the problem, suggesting there is something going on at the Lucee end.
@Julian_Halliwell That’s curious that it doesn’t happen on CommandBox/Undertow. I’m a little familiar with the compilation issues I think Alex was referring to, but I would have expected it to affect all servers alike. Have you been able to use FusionReactor or some other tool to get stack traces and see what the slow requests are spending their time doing?
Sorry, Brad, I haven’t got any detailed analysis. The timeout exception stack traces just list normal activity at the moment the timeout hit. But as I say the trigger was consistently app re-initialisation and running Testbox suites. Normal requests seemed to run ok. Guess that would fit with “compilation issues”, as there’s a lot of “warm up” activity going on in those cases.
I may be wrong about CommandBox. I can’t run the same apps on both stacks (without going to a lot of effort), so I wasn’t testing like for like, but app re-initialisation and test running seem to be normal with CmdBox.
@Julian_Halliwell Something else to test is to time the total execution of the CF code and see if the delay is part of the CF code or the AJP connector. That may help narrow things down. For example, if the entire request takes 1 second, but the CF part is only 100ms then Lucee is probably not even part of the problem.
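One low-effort way to get that timing split, assuming you can edit `server.xml`: enable Tomcat’s access log with the `%D` pattern code (total request processing time in milliseconds as seen by Tomcat) and compare it against the execution time Lucee’s debug output reports for the same request. The valve below is a standard Tomcat `AccessLogValve`; the `prefix`/`suffix` names are just placeholders:

```xml
<!-- Inside <Host> in server.xml: %D logs the time Tomcat spent processing
     each request, in ms. The gap between this figure and Lucee's reported
     execution time for the same request is time spent outside the CF code
     (connector, Tomcat, etc.). -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="timing_log" suffix=".txt"
       pattern="%h %t &quot;%r&quot; %s %b %D" />
```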
I might try that tomorrow, Brad, but the fact that the problem goes away when upgrading to 220.127.116.11-SNAPSHOT suggests the problem is with Lucee rather than Tomcat, although perhaps it could be a combination(?)
It’s more difficult to attribute cause/effect when changes to more than one moving part are involved as you’ll know better than most! Having to keep on top of Lucee, Tomcat and Java, all of which have been pushing significant updates recently, does make maintenance a challenge.
There was an issue with compilation in 18.104.22.168 which causes very slow app startup, so 22.214.171.124 was released to resolve that (bug is marked as fixed in 126.96.36.199 but 188.8.131.52 is the Release version to use).
Note that 5.3.5.x is still in Release Candidate so technically isn’t recommended for production; however, this same issue is fixed in 184.108.40.206 according to the ticket. The similar version numbers here can make it easy to get confused between the two builds, and the last RC release was 220.127.116.11-RC, which doesn’t include the fix, so the 18.104.22.168-SNAPSHOT would need to be used if you are testing 5.3.5.x.
If possible, try to isolate the Lucee version change from the Tomcat AJP Ghostcat change. There is potentially a separate Ghostcat mitigation / performance issue which has been discussed here, something to do with localhost / IPv4 vs IPv6 lookups on Windows.
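The Windows angle usually comes down to `localhost` resolving to `::1` first while the AJP listener is bound only to an IPv4 address (or vice versa), so each connection attempt eats a lookup/fallback delay. Making both sides agree on an explicit address avoids that. A sketch, assuming the BonCode settings file layout (element names from memory, so check against your own `BonCodeAJP13.settings`):

```xml
<!-- BonCode side: connect to an explicit IPv4 address instead of
     "localhost" so Windows never tries ::1 first. -->
<Settings>
  <Server>127.0.0.1</Server>
  <Port>8009</Port>
</Settings>
```

Correspondingly, `address="127.0.0.1"` on the Tomcat AJP `<Connector>` keeps both ends in the same address family.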
Wow, lots of versions! And to complicate that, I saw that 22.214.171.124-snapshot hit the 5.3.4 branch today, which makes me wonder if there will be a third 5.3.4 release. If it’s any consolation, ForgeBox is now auto-publishing every single Lucee (and Lucee light) build including snapshots. It’s all automated; in fact there were 14 new Lucee versions pushed to ForgeBox just today. That may help with testing all these versions.
@justincarter The issue is with 126.96.36.199, so I don’t think the compilation bug in 188.8.131.52 can be the problem. Startup was fine, it was app re-inits and test running that showed the most marked effects.
I was really hoping that using the IPv6 address for AJP would sort it as it has for others. But it didn’t in my case.
What did immediately solve it was upgrading to 184.108.40.206-SNAPSHOT. For now in prod we’ve had to downgrade to 220.127.116.11, but a bug with that version is playing havoc with our logs so we may have to take a risk on deploying a snapshot. We were forced to do that a couple years ago and actually it proved very stable.
A bit more info…
I’ve just tried downgrading my dev server from 18.104.22.168-SNAPSHOT to 22.214.171.124 to re-check what I was seeing with debugging on this time.
For the app re-initialisation (i.e. onApplicationStart()) the slowness is entirely down to instantiation of non-ORM singleton service layer objects (around 3 seconds in one case for no more than a couple of dozen objects).
The test suites are being slowed by OrmReload() calls in the beforeAll() method of each spec. These are taking around 4-5 seconds each, which easily adds up to a 50 second timeout.
Given what I’m seeing I think we can rule out AJP as a factor.
This is really odd. What I really can’t understand is that @Zackster’s solution worked on our production server. Because the devs at Apache closed the issue quite quickly without digging deeper into it, I wanted to add some more information to the bug filed, so I set up a new local dev environment for switching between various Tomcat and Lucee versions, but I wasn’t able to reproduce it and eventually gave up. What I can clearly say is that everything, such as startup execution time and also the on-page execution time taken by Lucee, was much the same as ever. It felt more like a network thing, like when your ISP/colocator is having network problems, and that was the first thing we tried to sort out.
@Julian_Halliwell Nice work on the additional testing. I actually pinged Micha this morning and asked him why there were snapshot builds on the 5.3.4 branch and if there was a release possible in the future. His response was simply this:
i don’t know about that yet, we still see an issue and i try to figure out what is going on …
Hopefully that means he’s been looking into whatever this issue was. I assume a review of the GitHub commits will help pinpoint what has been changed.
Here are the most recent commits on that branch
Technically the commit titled “add logging” is what bumped the version to 126.96.36.199-snapshot, which seems benign, but if you look closely you’ll see the removal of the synchronized keyword from the getInstrumentation() method too, so there’s more than logging in that commit.
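For anyone wondering why removing `synchronized` there could matter either way, here’s a hypothetical illustration of the pattern in question — not Lucee’s actual code, just the general shape of a lazily-initialised getter:

```java
// Hypothetical sketch -- the class and field names are made up.
// A lazily-initialised getter guarded by "synchronized": every caller
// takes the same lock, so under heavy warm-up traffic (lots of
// compilation) threads queue here even after init has completed.
public class InstrumentationHolder {
    // stand-in for the real java.lang.instrument.Instrumentation
    private static Object instrumentation;

    public static synchronized Object getInstrumentation() {
        if (instrumentation == null) {
            instrumentation = new Object(); // expensive one-time setup in real life
        }
        return instrumentation;
    }
}
```

Removing `synchronized` removes the contention but also the visibility/initialisation guarantee; a common middle ground is a static initialiser or a `volatile` field with double-checked locking. So the change could plausibly be performance-motivated, which would fit the slow warm-up symptoms discussed above.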