Stack: Windows (10/2012R2), OpenJDK 8u232, Lucee 188.8.131.52, Tomcat 9.0.31, IIS+Boncode.
We’ve had problems with 184.108.40.206/80, which at first glance seem to be connected with the Tomcat AJP Ghostcat patch.
We ran 220.127.116.11 locally and in prod before Ghostcat and didn’t observe any issues, but after patching noticed application restarts were taking much longer than usual. Same with 18.104.22.168, which we applied to another prod machine and which after a few hours started throwing timeouts/thread deaths and eventually ground to a halt. At that point we reverted to 22.214.171.124 and things returned to normal.
Running 126.96.36.199 locally, I’m seeing test suites that normally complete in a couple of seconds or less taking 40 seconds plus, or just timing out (at 50 seconds). ORM apps seem worst affected. Can’t see anything in the logs other than timeouts.
But if I spin up 188.8.131.52 in CommandBox (no Tomcat) performance is fine.
This suggests a problem with Tomcat, but I’ve tried Zac’s advice to modify the AJP connector definition in server.xml (see also LDEV-2741) and it doesn’t seem to make any difference.
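For context, this is the sort of connector change being discussed: Tomcat 9.0.31 hardened the AJP connector as part of the Ghostcat fix (it now binds to loopback only and expects a shared secret by default), so the usual workaround is to set those attributes explicitly. A sketch of what that looks like in `server.xml` — the attribute values here are illustrative, not a copy of Zac’s exact config:

```xml
<!-- Post-Ghostcat Tomcat 9.0.31+: the AJP connector binds to loopback and
     requires a secret by default. Setting address and secretRequired
     explicitly restores pre-9.0.31 behaviour for a localhost-only setup. -->
<Connector protocol="AJP/1.3"
           port="8009"
           address="::1"
           secretRequired="false"
           redirectPort="8443" />
```

If the AJP port is only reachable from localhost, `secretRequired="false"` is generally considered acceptable; otherwise you’d want a matching secret configured on both the Tomcat and connector sides.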
And as I say, rolling back to 184.108.40.206 with the same Tomcat config immediately solves the problem, suggesting there is something going on at the Lucee end.
@Julian_Halliwell That’s curious that it doesn’t happen on CommandBox/Undertow. I’m a little familiar with the compilation issues I think Alex was referring to, but I would have expected it to affect all servers alike. Have you been able to use FusionReactor or some other tool to get stack traces and see what the slow requests are spending their time doing?
Sorry, Brad, I haven’t got any detailed analysis. The timeout exception stack traces just list normal activity at the moment the timeout hit. But as I say the trigger was consistently app re-initialisation and running Testbox suites. Normal requests seemed to run ok. Guess that would fit with “compilation issues”, as there’s a lot of “warm up” activity going on in those cases.
I may be wrong about CommandBox. I can’t run the same apps on both stacks (without going to a lot of effort), so I wasn’t testing like for like, but app re-initialisation and test running seem to be normal with CmdBox.
@Julian_Halliwell Something else to test is to time the total execution of the CF code and see if the delay is part of the CF code or the AJP connector. That may help narrow things down. For example, if the entire request takes 1 second, but the CF part is only 100ms then Lucee is probably not even part of the problem.
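One low-effort way to get that timing split, assuming you can edit `server.xml`: enable Tomcat’s access log with the `%D` pattern code (total request processing time in milliseconds as seen by Tomcat) and compare it against the execution time Lucee’s debug output reports for the same request. The valve below is a standard Tomcat `AccessLogValve`; the `prefix`/`suffix` names are just placeholders:

```xml
<!-- Inside <Host> in server.xml: %D logs the time Tomcat spent processing
     each request, in ms. The gap between this figure and Lucee's reported
     execution time for the same request is time spent outside the CF code
     (connector, Tomcat, etc.). -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="timing_log" suffix=".txt"
       pattern="%h %t &quot;%r&quot; %s %b %D" />
```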
I might try that tomorrow, Brad, but the fact that the problem goes away when upgrading to 220.127.116.11-SNAPSHOT suggests the problem is with Lucee rather than Tomcat, although perhaps it could be a combination(?)
It’s more difficult to attribute cause/effect when changes to more than one moving part are involved as you’ll know better than most! Having to keep on top of Lucee, Tomcat and Java, all of which have been pushing significant updates recently, does make maintenance a challenge.
There was an issue with compilation in 18.104.22.168 which causes very slow app startup, so 22.214.171.124 was released to resolve that (bug is marked as fixed in 126.96.36.199 but 188.8.131.52 is the Release version to use).
Note that 5.3.5.x is still in Release Candidate so technically isn’t recommended for production; however, this same issue is fixed in 184.108.40.206 according to the ticket. The similar version numbers here can make it easy to get confused between the two builds, and the last RC release was 220.127.116.11-RC, which doesn’t include the fix, so the 18.104.22.168-SNAPSHOT would need to be used if you are testing 5.3.5.x.
If possible, try to isolate the Lucee version change from the Tomcat AJP Ghostcat change. There is potentially a separate Ghostcat mitigation / performance issue which has been discussed here, something to do with localhost / IPv4 vs IPv6 lookups on Windows.
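The Windows angle usually comes down to `localhost` resolving to `::1` first while the AJP listener is bound only to an IPv4 address (or vice versa), so each connection attempt eats a lookup/fallback delay. Making both sides agree on an explicit address avoids that. A sketch, assuming the BonCode settings file layout (element names from memory, so check against your own `BonCodeAJP13.settings`):

```xml
<!-- BonCode side: connect to an explicit IPv4 address instead of
     "localhost" so Windows never tries ::1 first. -->
<Settings>
  <Server>127.0.0.1</Server>
  <Port>8009</Port>
</Settings>
```

Correspondingly, `address="127.0.0.1"` on the Tomcat AJP `<Connector>` keeps both ends in the same address family.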
Wow, lots of versions! And to complicate that, I saw that 22.214.171.124-snapshot hit the 5.3.4 branch today, which makes me wonder if there will be a third 5.3.4 release. If it’s any consolation, ForgeBox is now auto-publishing every single Lucee (and Lucee light) build including snapshots. It’s all automated; in fact there were 14 new Lucee versions pushed to ForgeBox just today. That may help with testing all these versions.
@justincarter The issue is with 126.96.36.199, so I don’t think the compilation bug in 188.8.131.52 can be the problem. Startup was fine, it was app re-inits and test running that showed the most marked effects.
I was really hoping that using the IPv6 address for AJP would sort it as it has for others. But it didn’t in my case.
What did immediately solve it was upgrading to 184.108.40.206-SNAPSHOT. For now in prod we’ve had to downgrade to 220.127.116.11, but a bug with that version is playing havoc with our logs so we may have to take a risk on deploying a snapshot. We were forced to do that a couple years ago and actually it proved very stable.
A bit more info…
I’ve just tried downgrading my dev server from 18.104.22.168-SNAPSHOT to 22.214.171.124 to re-check what I was seeing with debugging on this time.
For the app re-initialisation (i.e. onApplicationStart()) the slowness is entirely down to instantiation of non-ORM singleton service layer objects (around 3 seconds in one case for no more than a couple of dozen objects).
The test suites are being slowed by OrmReload() calls in the beforeAll() method of each spec. These are taking around 4-5 seconds each, which easily adds up to a 50 second timeout.
Given what I’m seeing I think we can rule out AJP as a factor.
This is really odd. What I really can’t understand is that @Zackster’s solution worked on our production server. Because the devs at Apache closed the issue quite quickly without digging deeper into it, I wanted to add some more information to the bug filed, so I set up a new local dev environment for switching between various Tomcat and Lucee versions, but I wasn’t able to reproduce it and eventually gave up. What I can clearly say is that everything, such as startup execution time and also the on-page execution time taken by Lucee, was much the same as ever. It felt more like a network thing, like when your ISP/colocator is having network problems, and that was the first thing we tried to sort out.
@Julian_Halliwell Nice work on the additional testing. I actually pinged Micha this morning and asked him why there were snapshot builds on the 5.3.4 branch and if there was a release possible in the future. His response was simply this:
i don’t know about that yet, we still see an issue and i try to figure out what is going on …
Hopefully that means he’s been looking into whatever this issue was. I assume a review of the GitHub commits will help pinpoint what has been changed.
Here are the most recent commits on that branch
Technically the commit titled “add logging” is what bumped the version to 126.96.36.199-snapshot, which seems benign, but if you look closely you’ll see the removal of the synchronized keyword from the getInstrumentation() method too, so there’s more than logging in that commit.
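For anyone wondering why removing `synchronized` there could matter either way, here’s a hypothetical illustration of the pattern in question — not Lucee’s actual code, just the general shape of a lazily-initialised getter:

```java
// Hypothetical sketch -- the class and field names are made up.
// A lazily-initialised getter guarded by "synchronized": every caller
// takes the same lock, so under heavy warm-up traffic (lots of
// compilation) threads queue here even after init has completed.
public class InstrumentationHolder {
    // stand-in for the real java.lang.instrument.Instrumentation
    private static Object instrumentation;

    public static synchronized Object getInstrumentation() {
        if (instrumentation == null) {
            instrumentation = new Object(); // expensive one-time setup in real life
        }
        return instrumentation;
    }
}
```

Removing `synchronized` removes the contention but also the visibility/initialisation guarantee; a common middle ground is a static initialiser or a `volatile` field with double-checked locking. So the change could plausibly be performance-motivated, which would fit the slow warm-up symptoms discussed above.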