Thanks Alex. Is there a JIRA ticket for it?
I’ll give 22.214.171.124 a go.
188.8.131.52-SNAPSHOT does seem to solve the problem, thanks.
Unfortunately it’s also introduced another bug affecting our Tika library wrapper… hopefully I can work around that.
@alexskinner Did you by chance mean “compilation”?
@Julian_Halliwell That’s curious that it doesn’t happen on CommandBox/Undertow. I’m a little familiar with the compilation issues I think Alex was referring to, but I would have expected it to affect all servers alike. Have you been able to use FusionReactor or some other tool to get stack traces and see what the slow requests are spending their time doing?
Sorry, Brad, I haven’t got any detailed analysis. The timeout exception stack traces just list normal activity at the moment the timeout hit. But as I say the trigger was consistently app re-initialisation and running Testbox suites. Normal requests seemed to run ok. Guess that would fit with “compilation issues”, as there’s a lot of “warm up” activity going on in those cases.
I may be wrong about CommandBox. I can’t run the same apps on both stacks (without going to a lot of effort), so I wasn’t testing like for like, but app re-initialisation and test running seem to be normal with CommandBox.
@Julian_Halliwell Something else to test is to time the total execution of the CF code and see if the delay is part of the CF code or the AJP connector. That may help narrow things down. For example, if the entire request takes 1 second, but the CF part is only 100ms then Lucee is probably not even part of the problem.
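Brad’s client-vs-server timing idea can be sketched in plain Java (the JVM both Lucee and Tomcat run on). This is purely an illustration against a throwaway local server, not Lucee itself; in practice you’d compare the wall-clock total against the execution time shown in Lucee’s own debug output:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestTimer {

    // Times one full round trip against a throwaway local server that
    // stands in for the app. Returns {totalMillis, httpStatus}.
    static long[] timeRequest() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://127.0.0.1:"
                            + server.getAddress().getPort() + "/"))
                    .build();
            long start = System.nanoTime();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            long totalMs = (System.nanoTime() - start) / 1_000_000;
            return new long[] { totalMs, response.statusCode() };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        long[] result = timeRequest();
        // Compare this client-side total against the execution time the app
        // itself reports: a big gap implicates the connector/network layer
        // rather than the CF code.
        System.out.println("total=" + result[0] + "ms status=" + result[1]);
    }
}
```

The same comparison done by hand (browser dev tools for the total, Lucee debug output for the CF portion) is what narrows the problem down to Lucee or to the AJP connector.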
I might try that tomorrow, Brad, but the fact that the problem goes away when upgrading to 184.108.40.206-SNAPSHOT suggests the problem is with Lucee rather than Tomcat, although perhaps it could be a combination(?)
It’s more difficult to attribute cause/effect when changes to more than one moving part are involved as you’ll know better than most! Having to keep on top of Lucee, Tomcat and Java, all of which have been pushing significant updates recently, does make maintenance a challenge.
There was an issue with compilation in 220.127.116.11 which causes very slow app startup, so 18.104.22.168 was released to resolve that (the bug is marked as fixed in 22.214.171.124, which is the Release version to use). 5.3.5.x is still in Release Candidate so technically isn’t recommended for production; however, this same issue is fixed in 126.96.36.199 according to the ticket. The similar version numbers here can make it easy to get confused between the two builds, and the last RC release was 188.8.131.52-RC, which doesn’t include the fix, so the 184.108.40.206-SNAPSHOT would need to be used if you are testing.
If possible try to isolate the Lucee version change from the Tomcat AJP Ghostcat change. There is potentially a separate Ghostcat mitigation / performance issue which has been discussed here, something to do with localhost / IPv4 vs IPv6 lookups on Windows;
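For what it’s worth, the localhost resolution behaviour mentioned above is easy to inspect from the JVM side. A small sketch showing what `localhost` resolves to on the current machine (on a dual-stack Windows box it typically yields both ::1 and 127.0.0.1, and which one a connector tries first can add fail-over latency per request):

```java
import java.net.InetAddress;

public class LocalhostLookup {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        InetAddress[] addresses = InetAddress.getAllByName("localhost");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // On dual-stack machines this usually prints both the IPv6 (::1)
        // and IPv4 (127.0.0.1) loopback addresses.
        for (InetAddress addr : addresses) {
            System.out.println(addr.getHostAddress());
        }
        System.out.println("lookup took " + elapsedMs + "ms");
    }
}
```

Binding the AJP connector to an explicit numeric address (the `address` attribute on the Connector in Tomcat’s server.xml, e.g. `address="127.0.0.1"`) sidesteps the name lookup ambiguity entirely.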
Wow, lots of versions! And to complicate that, I saw that 220.127.116.11-snapshot hit the 5.3.4 branch today, which makes me wonder if there will be a third 5.3.4 release. If it’s any consolation, ForgeBox is now auto-publishing every single Lucee (and Lucee light) build, including snapshots. It’s all automated; in fact there were 14 new Lucee versions pushed to ForgeBox just today. That may help with testing all these versions.
server start email@example.com-SNAPSHOT+13
@justincarter The issue is with 18.104.22.168, so I don’t think the compilation bug in 22.214.171.124 can be the problem. Startup was fine; it was app re-inits and test running that showed the most marked effects.
I was really hoping that using the IPv6 address for AJP would sort it as it has for others. But it didn’t in my case.
What did immediately solve it was upgrading to 126.96.36.199-SNAPSHOT. For now in prod we’ve had to downgrade to 188.8.131.52, but a bug with that version is playing havoc with our logs so we may have to take a risk on deploying a snapshot. We were forced to do that a couple years ago and actually it proved very stable.
A bit more info…
I’ve just tried downgrading my dev server from 184.108.40.206-SNAPSHOT to 220.127.116.11 to re-check what I was seeing with debugging on this time.
For the app re-initialisation (i.e. onApplicationStart()), the slowness is entirely down to instantiation of non-ORM singleton service layer objects (around 3 seconds in one case for no more than a couple of dozen objects).
The test suites are being slowed by OrmReload() calls in the beforeAll() method of each spec. These are taking around 4-5 seconds each, which easily adds up to a 50 second timeout.
Given what I’m seeing I think we can rule out AJP as a factor.
@Gavin_Pickin Given our conversation last night about our client ORM site timing out due to very slow ORM inits, please see the thread above.
This is really odd. What I really can’t understand is that @Zac_Spitzer’s solution worked on our production server, but: because the devs at Apache closed the issue quite quickly without digging deeper into it, I wanted to add some more information to the bug filed. So I set up a new local dev environment for switching between various Tomcat and Lucee versions, but I wasn’t able to reproduce it. Finally I gave up. What I can clearly say is that everything, like starting execution time and also the on-page execution time done by Lucee, was quite the same as ever. It felt more like a network thing, like when your ISP/colocator is having network problems, and that was the first thing we tried to sort out.
I’ve done some incremental upgrade testing - moving up through the snapshots from 18.104.22.168 - and have identified the build which appears to fix the issue: which is 22.214.171.124-SNAPSHOT.
Unfortunately there is no changelog for it and nothing I can find in Jira.
But it means that the current RC 126.96.36.199 will fix this (tested and it does), so we can avoid deploying a snapshot.
@Julian_Halliwell Nice work on the additional testing. I actually pinged Micha this morning and asked him why there were snapshot builds on the 5.3.4 branch and if there was a release possible in the future. His response was simply this:
i don’t know about that yet, we still see an issue and i try to figure out what is going on …
Hopefully that means he’s been looking into whatever this issue was. I assume a review of the Github commits will help pinpoint what has been changed.
Here are the most recent commits on that branch
Technically, the commit titled “add logging” is what bumped to the 188.8.131.52-snapshot, which seems benign, but if you look closely at that commit you’ll see the removal of the synchronized keyword on the getinstrumentation() method too, so it’s more than logging in that commit.
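For context on why removing synchronized from a frequently-called method can matter: a synchronized method serializes every caller on the same monitor, so it becomes a choke point under concurrent load even when its body is cheap. A toy demonstration of that serialization (purely illustrative; it does not reflect what Lucee’s getinstrumentation() actually does internally):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class MonitorContention {
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    // Synchronized: callers serialize on this object's monitor.
    synchronized void guarded() throws InterruptedException { body(); }

    // Unsynchronized: callers may overlap freely.
    void unguarded() throws InterruptedException { body(); }

    private void body() throws InterruptedException {
        // Track how many threads are inside the method at once.
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
        Thread.sleep(50); // stand-in for cheap-but-nonzero work
        inside.decrementAndGet();
    }

    // Runs 4 threads through one method variant; returns the peak overlap.
    static int maxConcurrency(boolean sync) throws InterruptedException {
        MonitorContention m = new MonitorContention();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                try {
                    if (sync) m.guarded(); else m.unguarded();
                } catch (InterruptedException ignored) {}
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return m.maxInside.get();
    }

    public static void main(String[] args) throws Exception {
        // The synchronized variant is guaranteed to report 1: threads queue
        // up one at a time, which is exactly the bottleneck under load.
        System.out.println("synchronized peak overlap:   " + maxConcurrency(true));
        System.out.println("unsynchronized peak overlap: " + maxConcurrency(false));
    }
}
```

If a method like that sits on a hot path (called per request, or per class load during compilation), dropping the lock, or narrowing it, can remove a lot of queuing time without changing the method’s result.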
Try out SourceTree (https://www.sourcetreeapp.com/); the history view with all branches is really informative.
… although there seem to be bugs in RC 184.108.40.206 which might make going for 220.127.116.11-SNAPSHOT a safer bet. I expect that’s why Alex suggested that version.
This might be a little off topic from my side, but for VSCode users who like SourceTree, there is a similar extension called GitGraph: https://marketplace.visualstudio.com/items?itemName=mhutchie.git-graph
I’m also using Sourcetree and really like it.
But it doesn’t have a Linux client… If anyone has a proper alternative to Sourcetree on Linux, please tell me.
GitKraken is what I use, much prefer it over Sourcetree