Lucee's BIFs (Built-In Functions) load during every cache function call, causing a drop in performance

Lucee’s BIF (Built-in Function) bundles seem to load during every cache function call (cacheGet, cachePut, etc.). As some of the bundles are single-threaded, this causes a bottleneck when the number of requests in our application begins to increase.

For example, a cacheGet of a simple value hangs for over 2 minutes, until the request timeout. See the attached stack trace. The relevant line of code:

/*** /nl/dvnt/stdmtr/service/ExamCoaches.cfc:10 ***/
// within a function
var oResult=cacheGet("ExamCoach#arguments.examCoachId#", false, "methodinfo");

This is similar to the OSGI bundle-loading problem for createObject().

We had this problem alongside the OSGI issue on Lucee 5.3.9.141, so we moved to Lucee 5.3.9.163-SNAPSHOT, which seems to have helped with the OSGI issue so far.

I should add that we use CommandBox, with:

Hibernate extension version 3.5.5.84;
Ortus RedisCache extension 2.0.0+6428.

stacktrace_BIFs_loading_every_time.txt (12.6 KB)

Another cacheGet call elsewhere in the application has resulted in a severe degradation of performance. It seems cacheGet is unnecessarily loading Lucee’s system BIFs (Built-in Functions) every time.

I have attached FusionReactor’s Profiler result.

Lucee 5.3.9.163-SNAPSHOT on CommandBox
Hibernate extension version 3.5.5.84;
Ortus RedisCache extension 2.0.0+6428

FusionReactor_Profiler_stacktrace_BIFs_loading_every_time_during_cacheget.txt (26.8 KB)

I don’t see anything related to BIF loading in there. The Redis extension is deserializing the cached value, which includes a CFC, and the OSGI loading appears to be related to that. I’m curious what it’s doing.

Thanks, @bdw429s .
What I am wondering is: might the BIF / OSGI loading involve a lock or single-threadedness of some sort? That would explain why requests would queue up.

I think the BIF is a red herring. You have to know how to read these profiles, and just because you see references to something in there doesn’t mean it’s the slow part. The trick is to follow the percentages down and find the lowest stuff in the “tree” that’s slow without anything also slow under it. I could get a better feel if you could screenshot the actual profile, since the txt version loses all the indenting (I’ve asked FusionReactor to add a feature to copy indented text), but what I see is all about loading CFCs back into memory. The fact that this is happening inside of the evaluate() function, which is a BIF, is more or less irrelevant. Re-creating the CFCs is the slow part. The “BIF” is just part of the plumbing that calls the code that is instantiating the CFCs.

So a few notes here:

  • The profiler doesn’t tell you how MANY times any of those lines of code ran. It’s entirely possible that rebuilding a CFC is actually fairly fast, but your code has simply pulled thousands of them from the cache and you’re just seeing that little bit of overhead adding up.
  • The “breaking points” in the profile are things like org.apache.felix.framework.BundleImpl.loadClass, which means that if there was a regression in speed, it’s likely an underlying OSGI change and not directly related to the caching, the BIF, or the extension.
  • The general advice is to never store CFCs in external caches because things can get sticky: they don’t rehydrate correctly (mixins are lost, as ticketed), and Lucee is prone to throwing odd errors in the logs when the serialized CFCs in the cache no longer match the bytecode in that version of Lucee (ticket). Caching a CFC with a reference to an entire framework can also create a HUGE hierarchy of serialized objects, which is unnecessary and performs poorly. When possible, store just the raw data in the cache and then re-inflate the CFCs yourself.
  • Analyze /nl/dvnt/stdmtr/service/Webshop.cfc and see how many calls to cacheGet() it has, and maybe log how many times it is getting called and how many CFCs are being returned each time.
  • Since you’re using the Ortus Redis cache, any ticket you put in is likely to get pushback and attempts to blame the third-party code. Since I have access to that source code, I can tell you the cache follows the lead of the other Lucee-maintained caches: it uses the serialize() BIF to serialize the incoming data, and the cached value is deserialized with the evaluate() BIF. I tell you this so you can try to create a stand-alone repro case that doesn’t involve any cache bits, which you can run in a loop and time across different Lucee versions to see if you can reproduce the CFC-loading slowness in a much simpler test case (see the timing sketch after the snippet below). Something along these lines is basically what happens behind the scenes. All Redis ever sees is the string in the middle dump. Lucee does the work of serializing your CFC instances and bringing them back to life.
// replace this in your test with whatever actual value you're storing in the cache
var valueToSerialize = new Query();

var serializedData = serialize( valueToSerialize );

dump( valueToSerialize );
dump( serializedData );

// Run this as many times as you're getting items back out of the cache with cacheGet()
var deserializedData = evaluate( serializedData );

dump( deserializedData );
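
If it helps, here’s a rough timing harness built on the same idea. Treat it as a sketch: the placeholder value and the iteration count are assumptions you’d swap for whatever you actually cache and for roughly how many cacheGet() hits a real request makes.

// a rough timing harness: same placeholder value as above; swap in your real cached value
var valueToSerialize = new Query();
var serializedData   = serialize( valueToSerialize );

// hypothetical count: set this to roughly how many cache hits one request makes
var iterations = 1000;
var startTime  = getTickCount();

for ( var i = 1; i <= iterations; i++ ) {
    // the same work the cache does when it hands your value back
    var deserializedData = evaluate( serializedData );
}

var elapsed = getTickCount() - startTime;
dump( "#iterations# evaluate() calls took #elapsed# ms" );

If that loop is also slow, and gets slower across Lucee versions, that points at the class loading rather than anything the Redis extension itself is doing.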

Anything is possible, but blind guessing is usually wrong. The profiler doesn’t show any time spent waiting on locks to be released, so I doubt it. OSGI loading can already be quite slow on its own.

There are a lot of reasons requests can queue up, and high CPU usage is already a known symptom, so it’s likely just that the server is really busy. The real question is why the class-loading stuff is taking so long, and really only Micha can answer that. But if you can get a nice stand-alone repro case, that will help him.

Hi @bdw429s ,
Thank you for the suggestions. Very helpful.
I shall follow each one and report the results here.

Like you, I initially thought the BIF was a red herring. Then I saw durations of 10, 20 or 30 seconds between BIF calls in FusionReactor. Or perhaps that’s an incorrect interpretation of the FusionReactor results? That’s possible.

In any case, I am attaching the FusionReactor profile of one such result. It is for a request that lasted about 9 minutes (8 minutes too long). No cache function involved this time.

FusionReactor_Profiler_stacktrace_BIFs_loading.txt (27.0 KB)


Unless you pasted the wrong profile, that stack is calling cacheGet() :slight_smile:

100,0% - 59,558s dvnt.stdmtr.service.examcoaches_cfc$cf.udfCall1(/nl/dvnt/stdmtr/service/ExamCoaches.cfc)
100,0% - 59,558s lucee.runtime.functions.cache.CacheGet.call(CacheGet.java)

Oops. That was the intended profile. I simply missed the cacheGet calls. Long day at the office. :slight_smile:

Our research on this is still ongoing. Just for the sake of completeness, I am attaching a similar FusionReactor stack that doesn’t have a cacheGet call.

Here the bottleneck seems to have occurred at the following line:

/*** /nl/dvnt/stdrdr/vo/BasicUserVO.cfc ***/
isInstanceOf(arguments.data, "nl.dvnt.stdmtr.orm.UsersRO")

Remarks:

  • The request eventually succeeds. But it lasts almost 3 minutes instead of a second or two.
  • nl.dvnt.stdmtr.orm.UsersRO is a CFC.


FusionReactor_Profiler_stacktrace_longlasting_request.txt (16.7 KB)

This may or may not be related to the other requests’ slowness. It does seem class loading is involved here as well. The questions to answer are similar to before.

  • How often did that code run (once, or thousands of times in a loop)?
  • What is “upstream” in the stack trace that may help answer that?
  • I recommend you do the same thing here that I recommended last time: answer the question in bullet 1, then create a repro case that runs the same logic the same number of times in a test file and see how it performs (a minimal counting sketch follows this list).
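
To make the first bullet concrete, here’s a minimal counting sketch. The counter name and log file name are hypothetical and the exact placement depends on your code; the idea is just a request-scoped counter next to the suspect line, plus one log entry per request.

// next to the suspect line (e.g. in BasicUserVO.cfc); names here are illustrative
request.isInstanceOfCalls = ( request.isInstanceOfCalls ?: 0 ) + 1;
isInstanceOf( arguments.data, "nl.dvnt.stdmtr.orm.UsersRO" );

// in Application.cfc, so every request reports its own total
function onRequestEnd() {
    var calls = request.isInstanceOfCalls ?: 0;
    // "perf-counters" is a hypothetical log name; the entry lands in perf-counters.log
    writeLog( file="perf-counters", text="isInstanceOf ran #calls# times for #cgi.script_name#" );
}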

A couple of other things to look at:

  • Do you have debugging enabled that may be slowing down the requests?
  • Do you have any Lucee logging enabled or set to debug or trace that may be slowing stuff down?

Thanks, @bdw429s .
Our team also thinks that class loading is likely to be involved.

Re bullets 1 and 3, I have been following your earlier suggestions; it’s just that I have yet to come up with a repro case to present here. The challenge is in simulating the critical combination of load and types of user requests that triggers the sharp drop in performance. This research is ongoing.

Nevertheless, on the question of how often that line of code runs, FusionReactor’s profiler provides a useful indication. As you can see, it took 230 samples on that occasion.

  • Debugging is not enabled.
  • We’re using Lucee’s default logging (no debug or trace levels enabled).

Pardon my ignorance, but that sample number has nothing to do with the number of executions of any piece of code… correct? The sample count is merely the sample rate times the profile time. In this example, 230 samples over 60 seconds is roughly one sample every quarter of a second.

Correct. The sample count is simply how many stack traces FR took, which by default happens every 200 ms. That doesn’t tell us how many times a given line of CF code executed. You’d have to put some logging of your own in the code base to answer that question.

If you haven’t made it to the point where you can take a given page and know how many times a line of code ran, I’m confused about what you’ve been doing in your attempt to reproduce the performance issue, as that would have been the very first step in my mind! And just to clarify, I hadn’t suggested that you try to reproduce the entire server slowness, just that you take what appears to be a slow bit of code and run it in a loop the same number of times and see how it performs.

@michaelborn_me :

Pardon my ignorance, but that sample number has nothing to do with the number of executions of any piece of code… correct? The sample count is merely the sample rate times the profile time. In this example, 230 samples over 60 seconds is roughly one sample every quarter of a second.

@bdw429s

Correct. The sample count is simply how many stack traces FR took, which by default happens every 200 ms. That doesn’t tell us how many times a given line of CF code executed.

I did not say that FusionReactor’s sample count is the number of executions of the relevant piece of code. I said the stack trace resulting from 230 samples is a useful indication.

@bdw429s

If you haven’t made it to the point where you can take a given page and know how many times a line of code ran, I’m confused about what you’ve been doing in your attempt to reproduce the performance issue, as that would have been the very first step in my mind!

I am, and have been, doing that hundreds of times daily, from day one, with positive results.

For example, performance improved when we replaced isInstanceOf, which, incidentally, appears in the FusionReactor stack trace. Using name-value syntax for the arguments in certain calls has also helped.
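
By way of illustration only (this is a sketch, not necessarily exactly what we ended up with), a shallow check along these lines avoids the inheritance and interface walking that isInstanceOf() does, at the cost of not matching subclasses:

// illustrative only: compare the component's metadata name instead of calling isInstanceOf()
// (unlike isInstanceOf(), this does not match subclasses of UsersRO)
if ( isObject( arguments.data )
     && getMetaData( arguments.data ).name == "nl.dvnt.stdmtr.orm.UsersRO" ) {
    // treat arguments.data as a UsersRO instance
}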

However, it has so far been a matter of putting out fires. I have yet to pinpoint a common thread (pun intended :slight_smile:) underlying the drop in performance. In the beginning I only suspected there might be one; now I do believe there is at least one, and it has to do with class loading in Lucee.

See: https://lucee.daemonite.io/t/are-lucees-custom-class-loaders-single-threaded-or-multithreaded/11123
The challenge is to come up with a repro case which I can share in this forum.

Alas, the repro case has been a secondary aim. Our primary aim has been to resolve show-stoppers and bottlenecks, enabling us to open shop every day.

@bdw429s

And just to clarify, I hadn’t suggested that you try to reproduce the entire server slowness, just that you take what appears to be a slow bit of code and run it in a loop the same number of times and see how it performs.

No need to clarify. I have enough experience, and know enough about analysis, not to attempt a Herculean effort on the entire server. :slight_smile: I understood what you meant right from the start.