I've made revolutionary performance changes to Lucee 5.3.2.16

brucekirkpatrick · January 1, 2019, 9:14pm

I’ve been exploring optimizations to Lucee internals that affect both new and existing CFML code, and I’ve made some real breakthroughs by exposing more Java concepts to CFML while keeping it relatively simple. My fork of Lucee doesn’t attempt to maintain ACF compatibility, and I had already updated all of my CFML code to the modern style Lucee has recommended over the years since I switched from ColdFusion to Lucee 7 years ago. As a result of that major rewrite effort, I’ve been able to switch to developing a stricter, purer version of CFML based on Lucee’s open source project.

I’m moving the language in my own personal direction. It is still very compatible, since I have around 1 million lines of CFML code already written that I can’t break. However, I’ve found ways to deviate that make CFML more memory efficient, more stable, and faster than ever. I’ve been using Lucee since 3.2, and before that I started with ColdFusion 5 in 2001 and stayed through ColdFusion 9. There have really only been 2 big features added to the Lucee language that I’ve cared to adopt. The fundamental shift came when we were taught to turn off scope cascading in the Railo/Lucee admin, and that wasn’t really practical without first having the localmode=“modern” attribute on functions, since it was too much work to switch everything at once. So essentially the most important feature of Lucee was the ability to get rid of scope cascading in lots of granular ways. That amounts to saying scope cascading is a bad feature for CFML, so I’ve made its removal a permanent change to the language rather than an option, which lets me squeeze out more performance since the extra code paths and operations have been deleted.

The second biggest change was the lazy attribute of cfquery, but Railo/Lucee left it incomplete since it can cause you to run out of memory. I’ve fixed that and extended this query result type to do even more, giving us data access that is closer to JDBC.

I’ve taken the features that Lucee said were important and pushed those ideas to their absolute limit to make Lucee several times faster, and then I micro-optimized the internals and learned how to generate new bytecode. In the last 2 months, I’ve spent around 200 hours of my personal time working on some of the most radical improvements to CFML performance in any release of the language.

If your CFML code touches absolutely anything outside the Java process, like a database or a file, the performance of that access becomes 99% of the bottleneck of your code instantly. So while I’m making the language much faster relative to the previous way it worked, the difference often becomes trivial or unnoticeable if your application is not CPU bound on Lucee CFML internals. My Jetendo CMS application is written with so much CFC caching and struct lookups in application scope, that some of my requests never have to leave the Lucee process, which means that some of my requests are CPU-bound and I also like minimizing the overhead of my framework since that is the minimum amount of time it takes for the fastest request. I’m finding ways to do this optimization work in Lucee internals that I could never achieve with just CFML code.

Below I’ve described many of the changes I’ve got working on my open source Lucee fork on github. There are other changes I’ve made too to fix bugs and features that aren’t described here, but can be found in Jira and my commit history. My fork is also broken on some things right now, since I intend to remove quite a lot of features from Lucee so that it builds and executes differently and is simpler. I only build the Lucee core .lco patches file right now in an unofficial way in order to develop it faster, and ignore the Jar build because the lucee admin is too legacy-oriented and needs a rewrite to continue to function. I’m going to drop most of the admin features though instead. If anyone is curious on how to build my version of Lucee, let me know.

I’ve made it possible to create and access CFML variables as true Java local variables without the hash map overhead. The Java locals get compiled into the bytecode, so CFML variables can be accessed at the same speed as in the Java language, which is up to 9 times faster than regular CFML variable access, but often more like 3-4 times faster due to JVM hotspot magic. This is a very exciting feature, and it only requires scoping a CFML variable in a new scope for it to work. It also requires being stricter about Java type casting, since the bytecode does less casting for you in this new code to make it even more efficient. FYI: All CFML variables are stored as java.lang.Object and have to be cast repeatedly on both read and write to be accessed. Casting is typically less than 5% overhead, so it’s not a problem, but I’ve reduced the amount of casting anyway to get more raw performance.
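As a rough illustration of the difference described above, here is a plain Java sketch (not code from the fork; `LocalVsScopeDemo` and its method names are made up for this example). One method keeps its running total in a hash map of `Object` values, roughly the way a dynamic scope does, and the other keeps it in a Java local.

```java
import java.util.HashMap;
import java.util.Map;

final class LocalVsScopeDemo {
    // Dynamic style: every read and write goes through a hash map and boxes the number.
    static double sumViaScope(int n) {
        Map<String, Object> scope = new HashMap<>();
        scope.put("total", 0.0d);
        for (int i = 1; i <= n; i++) {
            double current = (Double) scope.get("total"); // lookup + cast + unbox
            scope.put("total", current + i);              // box + store
        }
        return (Double) scope.get("total");
    }

    // Local style: the value lives in a JVM local slot, with no lookup, cast, or boxing.
    static double sumViaLocal(int n) {
        double total = 0.0d;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumViaScope(1000));
        System.out.println(sumViaLocal(1000));
    }
}
```

Both methods return the same value; only the amount of work per iteration differs. How large the gap is in practice depends on the JVM and on whether hotspot has warmed up, so treat this as a shape to benchmark rather than a measurement.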

I’ve reduced the overhead of CFML function calls by 60% by eliminating the extra OOP calls, making it possible for most of the internal implementation to read/write the Java fields directly, and removing some features and redundancy in the operations. FYI: Calling a CFML function on a CFC is very complex under the hood, over 30 times slower than one Java call, but I got it closer to 10x with a host of changes in many places.

I rewrote how CFML variable access works inside the CFML bytecode to use direct field operations against the pagecontextimpl and scopes instead of numerous function calls. This has a cumulative effect on performance: none of those calls individually measures up to anything, but CFML in a tight loop ends up doing substantially fewer Java operations and doesn’t have to rely on Java hotspot optimizations as much to be faster. You can usually see hotspot kick in on the 2nd or 3rd request, which makes a lot of benchmarks look weird. Even just adding an extra call can seemingly disable hotspot optimization sometimes. For example, function inlining in a loop seems to work when you only call the function once, but if you repeat the call, I think it can’t inline it anymore. I’ve run into little details like that which make it seem like Java can’t optimize code around hash maps as well as it can local variables/fields/methods. As a result, I think the performance gain is occasionally 100x over regular CFML when you use this Java locals feature and run the code enough to hit the “hot” threshold that hotspot looks for. The OpenJDK wiki has more information:

https://wiki.openjdk.java.net/display/HotSpot/Inlining

Because I sometimes see this performance gap appear between hotspot running plainer Java code and Lucee CFML code, I don’t think it would ever be accurate to assume Lucee is a thin layer on top of Java. Lucee is really 30 times slower than Java on many benchmarks, and it kind of prevents hotspot optimizations from working as well in the CFML bytecode. ACF 2018 measured around 3 times slower than Lucee 5.3.2.16 on various language tests I did recently, so they aren’t doing it any better.

In some areas much more speed is now possible on my version, especially where the work is more primitive, like complex math calculations, because variable and function overhead shows itself more there than it does on strings or more complex objects. Lucee CFML can’t touch Java primitives without boxing them first. Also, all CFML numbers are of type Double to keep numbers simple for CFML, but Integer would be up to twice as fast because it uses half the memory. Java primitives are up to 10 times faster than objects in Java itself, so we are definitely carrying a lot of Object overhead by using a dynamic language where everything is an “Object” under the hood.

I merged the arguments scope with local scope so that lucee doesn’t have the overhead of managing 2 scopes for every function call or variable access anymore. I also got rid of the ScopeFactory for new localImpl and let garbage collection do the job to simplify it more. For the most part, I’ve permanently eliminated scope cascading at the Java and bytecode level. Variable access was made up to 100% faster.

I created a new way to call functions in external CFCs that is 25% to 100% faster. Because of how Lucee works on the inside, the code ends up doing an extra hash map lookup to get access to a function in another scope/object, so the call overhead for external CFC methods is 2 to 3 times that of calls to methods in the same CFC. My new feature lets the CFML developer cache the methods as Java locals, which are a new object that is a ComponentImpl and UDFImpl, so that the actual call no longer needs the hashMap getCollection to find the UDF repeatedly. I also bypass several of the intermediate Java calls internally, since a lot of the calls in the existing code just manipulate arguments. I could get the majority of my shared objects to operate at this increased speed with minimal code rewriting, or I could generate new bytecode for existing variable names to make them work this way automatically. This change will make a real impact on overall performance in a large CFML application, since it is very common to make many calls between cached CFCs.
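The general idea of resolving a function once and reusing the reference can be sketched in plain Java. This is only an analogy, not the fork’s implementation: `CachedLookupDemo`, the `"addOne"` function, and the map standing in for a component are all invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

final class CachedLookupDemo {
    // Stand-in for a component: its functions are stored by name in a hash map.
    static Map<String, IntUnaryOperator> component() {
        Map<String, IntUnaryOperator> methods = new HashMap<>();
        methods.put("addOne", x -> x + 1);
        return methods;
    }

    // Looks the function up by name on every single call.
    static int lookupEveryCall(Map<String, IntUnaryOperator> cfc, int n) {
        int value = 0;
        for (int i = 0; i < n; i++) {
            value = cfc.get("addOne").applyAsInt(value);
        }
        return value;
    }

    // Resolves the function once, then reuses the reference inside the loop.
    static int lookupOnce(Map<String, IntUnaryOperator> cfc, int n) {
        IntUnaryOperator addOne = cfc.get("addOne");
        int value = 0;
        for (int i = 0; i < n; i++) {
            value = addOne.applyAsInt(value);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, IntUnaryOperator> cfc = component();
        System.out.println(lookupEveryCall(cfc, 1000));
        System.out.println(lookupOnce(cfc, 1000));
    }
}
```

The second loop does the same work without a hash lookup per iteration, which is the part of the call overhead the cached-method approach is meant to remove.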

I’ve always liked making code generation systems, and Java bytecode is especially fun and challenging to work on when I can change how CFML works. I’ve learned a lot about bytecode by trial and error over the last 2 months, having not known it at all before that. This work is helping me understand how to connect patterns in CFML code to new concepts in the Java bytecode, which is actually more interesting than any of the other work, since I’ve already made excellent systems in CFML. I could now translate my CFML framework’s nested shared scope accesses into new, more direct bytecode without rewriting the CFML, so that it runs at the speed of a static Java class. This will give my application a huge boost compared to anything else written in CFML.

I also made a way to statically compile my CFC components and function names into Lucee core, which can get rid of dynamic call overhead and other security checks that Lucee does. I also added a feature for calling a CFML function in an “unsafe” way that is twice as fast, for cases where you don’t need some of the normal debugging/variable scoping features.

I changed string.concat() to Java stringbuilder.append() internally in the bytecode generation, so that the CFML & operator is at least 2 to 3 times faster on average because it eliminates making many redundant copies of the string. The advantage grows as your string gets larger and you concatenate more times. I did this because this is how the Java compiler optimizes string concatenation in the Java language to be as fast as possible. Lucee CFML didn’t do this even though it has its own compiler for CFML code.
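The two strategies look like this in plain Java. This is a minimal sketch of the general technique, not the bytecode Lucee generates; `ConcatDemo` and its methods are names made up for the example.

```java
final class ConcatDemo {
    // Repeated concat(): each call allocates a new String and copies everything so far.
    static String joinWithConcat(String[] parts) {
        String out = "";
        for (String part : parts) {
            out = out.concat(part);
        }
        return out;
    }

    // StringBuilder: appends into one growable buffer and builds the final String once.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithConcat(parts));
        System.out.println(joinWithBuilder(parts));
    }
}
```

With `concat()` the amount of copying grows with the length of the string built so far, which is why the gap widens for long strings and many concatenations.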

I modified lazy queries, which return Lucee’s SimpleQuery.java object, to let me control the resultSet cursor directly so that I can call next(), reset() and close() on the result myself. I also made it close automatically after a normal for/cfloop, so you don’t have to remember to close it if you don’t want to think about it. This prevents lazy queries from running out of memory, which allowed me to turn lazy on for all queries by default, which massively reduces the amount of simultaneous memory usage across an application. For other code I want to optimize further, I can now manage looping through the resultset myself while only consuming the memory of the row and columns I am currently using, instead of the entire query result being looped multiple times and fully processed like legacy CFML requires. I also made it possible to access the query records directly by column or columnIndex through new function calls, to further eliminate the internal hash map overhead of CFML variable access. When I combine these features with my new Java locals feature, looping over query data is nearly 100% faster. It’s like JDBC now with a lot less typing. This is a very exciting set of changes.
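The access pattern described here (advance a forward-only cursor, read columns by index, close when the loop ends) can be sketched in plain Java. The `Cursor` class below is a hypothetical in-memory stand-in written for this example; it is neither Lucee’s SimpleQuery nor a JDBC ResultSet, and it only borrows their `next()` / `getString()` / `close()` shape.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class LazyCursorDemo {
    // Hypothetical stand-in for a forward-only result: one row is visible at a time.
    static final class Cursor implements AutoCloseable {
        private final Iterator<String[]> rows;
        private String[] current;
        boolean closed = false;

        Cursor(List<String[]> source) {
            this.rows = source.iterator();
        }

        // Advances to the next row; returns false once the rows are exhausted.
        boolean next() {
            if (closed || !rows.hasNext()) {
                return false;
            }
            current = rows.next();
            return true;
        }

        // Reads a column of the current row by 1-based index, as JDBC does.
        String getString(int columnIndex) {
            return current[columnIndex - 1];
        }

        @Override
        public void close() {
            closed = true;
            current = null;
        }
    }

    // Sums the lengths of column 1; try-with-resources closes the cursor after the loop.
    static int totalNameLength(Cursor cursor) {
        int total = 0;
        try (Cursor c = cursor) {
            while (c.next()) {
                total += c.getString(1).length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Cursor cursor = new Cursor(Arrays.asList(
                new String[] {"ann", "1"},
                new String[] {"bob", "2"}));
        System.out.println(totalNameLength(cursor));
        System.out.println(cursor.closed);
    }
}
```

The loop never holds more than the current row, and the cursor is released as soon as iteration ends, which is the property that keeps peak memory low with a real streaming result.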

I have quite a few other major changes planned, since I’m trying to migrate Lucee away from servlet technology and integrate it with my application and my Java web server. I’m going to make Lucee start much faster as a CLI application like it used to, so that eventually I don’t have to use Tomcat or any particular server technology. This change will be very invasive, since JSP, Javax, and Servlet appear in several thousand places in Lucee core. However, I’ve already proven that various things don’t actually require these dependencies and that Lucee CFML code doesn’t require separate threads to work right. So the hard part is just the time it takes to refactor everything and keep it working. The benefit later will be that we could serve a CFML request on the same thread as the socket or original CLI thread, which saves around 0.3 milliseconds on most CPUs. My benchmarks of Java code I wrote suggest that just reducing the number of threads being created can let Java complete HTTP requests 4 times faster for CPU-bound code, in addition to the other improvements I’ve made to speed up Lucee. They announced Lucee 6 would start faster, but they must only be talking about Lucee itself, since 90% of the start-up time is actually Tomcat. Perhaps the scripting engine startup is what they are talking about, but Lucee is very integrated with servlet, so it carries whatever overhead there is there. I actually experimented with tweaking OSGi startup speed, but it wasn’t the bottleneck, so I’m not sure there is really much that is “slow” about Lucee start-up in their code.

More frustratingly: Tomcat often fails to stop itself in production because there are still running threads. So even if Lucee took 0 seconds to start, we’d still be waiting 5 to 30 seconds for Tomcat to do its thing. I know my Java web server could run Lucee just fine at some point, and it is just a few thousand lines of code with only the JDK as a dependency. I don’t expect Lucee to restart in under a second until I have replaced the servlet features with my custom server. More importantly though, it will handle more simultaneous connections and use less memory than ever.

It makes me really excited to think of where I’ll be with Java and Lucee in the future, since I’ve only been working on Lucee core since July this year.

Brad_Wood · January 1, 2019, 10:47pm

Awesome work, Bruce. I’m curious if there is a way we can incorporate some of your work into the Lucee core. Obviously not all of it is compatible, but there have to be some solid improvements we could get from you. I’m very interested in your work to remove the servlet and improve startup time since that is one of my biggest issues in CommandBox. That, and CFC compilation/metadata generation, are the biggest parts of CommandBox startup.

brucekirkpatrick · January 2, 2019, 2:02am

Brad,

It would probably be easier to work with me on making an alternative packaging of CommandBox, rather than wait for something to come to Lucee itself, if you want to try the features. I can’t claim to have made Lucee start faster yet since I didn’t work on that, but I’ll certainly let you and everyone know if I reach that point. I assume you have Lucee configured to reuse CFML class files after restarts, since that makes a big difference to start-up time. If you don’t know, there are function calls to programmatically tell it to clear that cache. I also made a new function that can reload just 1 CFC class instead of dropping the whole cache, which might be interesting to you if you need it to be somewhat dynamic for a specific file that was created via the command line. Compilation also goes faster on solid state drives without using virtualization.

The goal of sharing is to inspire others and bring awareness of what is possible. Maybe a few people would try my version, or comment on it, and want something better/different.

I did share some of my query changes a while ago on Jira, but I haven’t had a quick response, and it takes time to share each individual change. I’ve changed many of my Jira submissions several times after sending them too.

I shared the string concat change on Jira yesterday.

I also share all of the changes on the open source fork in a certain branch, which they can browse if they are interested or perhaps other Lucee association members could encourage some of these things.

I think the query changes and Java local features are super exciting ideas for improving CFML.

I’m just getting started with Lucee compilation ideas.

I’d also like to switch to doing more challenging backend work during the day in Java or something that isn’t typical web-dev clones in a statically typed language, but I don’t want to move away from Daytona Beach, Florida. Being really good at making difficult changes to various Java projects can help me get such a job if I want it later. I’m gearing up to incorporate my Java work into my current job more, but the CFML work is so much further ahead since I’ve built it as a single application for the last 17 years, and I’m just 37. I’m pretty serious about improving things where I am, instead of starting over.

brucekirkpatrick · January 2, 2019, 2:41am

Brad,

Another idea. I’m not sure if you have a bigger CFML application inside CommandBox, but since you mentioned component creation and metadata, it sounds like you have something loading. You could also use ObjectLoad and ObjectSave to serialize your entire application scope and reload it on start-up faster all at once, rather than making individual calls to createobject from scratch. I used to do that, but there were issues switching between versions of Lucee since they might not be binary compatible, so you have to check the Lucee version in server scope to make sure it is the same version on the next start. Plus some bugs in Railo made it break once years ago. Now my application scope is 6gb, which is a bit too much to save and reload all at once, so I switched to loading sites in sequence to keep it more stable, and I make all other requests fast-fail since Lucee is designed to self-destruct otherwise. I suppose I can now redesign how Lucee starts on the inside to be safer for everyone, but that was going to be done in my web server instead. I definitely like to optimize start-up and reliability issues, since downtime just gets worse if Lucee and the application can’t auto-restart correctly.
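The save/reload-with-version-check idea can be sketched with plain Java serialization. This is an analogy only: it does not call CFML’s ObjectSave/ObjectLoad, and `ScopeSnapshotDemo`, its method names, and the second version string in `main` are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

final class ScopeSnapshotDemo {
    // Writes the engine version first, then the whole scope as one serialized object.
    static byte[] save(String engineVersion, HashMap<String, String> scope) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeUTF(engineVersion);
            out.writeObject(scope);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the saved scope, or null when the snapshot came from another engine version.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(String engineVersion, byte[] snapshot) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            if (!engineVersion.equals(in.readUTF())) {
                return null; // caller rebuilds the cache from scratch instead
            }
            return (HashMap<String, String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("snapshot could not be read", e);
        }
    }

    public static void main(String[] args) {
        HashMap<String, String> scope = new HashMap<>();
        scope.put("site", "example");
        byte[] snapshot = save("5.3.2.16", scope);
        System.out.println(load("5.3.2.16", snapshot)); // same version: scope comes back
        System.out.println(load("5.3.3.0", snapshot));  // different version: null
    }
}
```

Checking the version before deserializing the payload is what guards against loading a snapshot written by a build that may not be binary compatible.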

My CFML application takes about 60 seconds to reload that 6gb cache, which isn’t that bad considering how much data that is in millions of little CFML values.

I also made my application able to index and getMetaData on all the CFC in a multi-threaded startup process so that I can bind urls to the cfcs for automatic routing. So then all future calls have none of the overhead for CFC creation or metadata. When you combine that with the objectload/objectsave work, it gets pretty quick to reach a stable and consistent performance for the whole application.

There is also a feature in mysql that can do a similar thing when enabled, storing its cache to disk so it is already warmed up when you first start it.

It is also possible to restart just the Lucee engine programmatically with the cfadmin tag, which is useful if you are working on the core like me. This is much faster than a Tomcat restart, but still takes several seconds.

bdw429s · January 2, 2019, 5:30pm

I assume you have lucee configured to reuse cfml class files after restarts since that makes a big difference to start-up time.

Can you clarify what setting you are referring to? Adobe CF has a “save class files” setting but Lucee just does that without an option to disable it. So far as I know, there’s no way to NOT have Lucee reuse class files from disk on startup.

If you don’t know, there are function calls to programmatically tell it to clear that cache.

If you’re referring to pagePoolClear(), then yes I use it in CommandBox.

I also made a new function that can reload just 1 CFC class instead of dropping the whole cache, which might be interesting to you if you need it to be somewhat dynamic for a specific file that was created via the command line.

YES, I would actually love that. Adobe CF gives a way to clear class files for a single file, but the Lucee method is heavy handed. For instance, I have to call pagePoolClear() every time CommandBox executes a Task Runner to pick up changes in the CFC, since Lucee is set to only check for changes in .cf? files once per request, but the entire CLI is one “request” so it’s basically as though I have it set to “never”. I hate having to nuke the entire cache just to reload a single file. I suppose I could put in a ticket for it. Your changes might make a good pull request. Just add a parameter to pagePoolClear() that takes an array of template paths.

Compilation also goes faster on solid state drives without using virtualization.

I have SSDs, but I can hardly require that of my users

I did share some of my query changes a while ago on Jira,

Thank you for doing that.

I shared the string concat change on Jira yesterday.

I saw that too.

perhaps other Lucee association members could encourage some of these things.

I am a Lucee member and I’ve advocated internally for the core dev team to work with you on getting ideas from some of your work on at least two occasions.

I think the query changes … are super exciting ideas for improving CFML.

I’m also interested in the query changes. Firstly because I’ve heard Sean Corfield say on several occasions how direct JDBC access in Clojure is so much faster than CFML query objects, and secondly because Luis’s recent cbstreams project could really benefit from creating a Java stream provider out of a JDBC result. You could literally start processing records right away, before the entire result had even been transferred from the DB! Of course, CFML currently forces us to gather the entire result into memory in a query object, THEN iterate over it and build a stream, so it’s a little self-defeating in some ways.

I’m not sure if you have a bigger cfml application inside commandbox,

I’m not sure what constitutes “bigger”, but CommandBox has grown into a framework of sorts that borrows a lot of things from ColdBox such as modules, interceptors, core settings, WireBox, CacheBox, etc.

since you mentioned component creation and metadata, it sounds like you have something loading.

All the commands are implemented as CFCs, but each of those CFCs must be discovered on disk and their metadata read and cached in memory before CommandBox can begin. WireBox also must register any model CFCs in the installed modules on startup, which again requires metadata to be generated for each CFC. The more modules you install, the slower startup can be due to the loading of all the interceptors, commands, and models that are in each module.

You could also use ObjectLoad and ObjectSave to serialize your entire application scope and reload it on start up faster all at once, rather than individual calls to createobject from scratch.

Hmm, that’s a very interesting idea. I’ve wanted for a long time the ability to “store” the entire heap on disk and just load it up where I left off, but I’ve not found any way to do that at the JVM level. I’d never considered attempting to store the entire application scope, but what I did change a while back is that I added a disk-based cache that caches all of the CFC metadata, because it’s so darn slow to regenerate it on every start, even if nothing needs to recompile. Another issue I have is that the metadata is VERY struct heavy and I burn a lot of memory storing the stupid stuff due to all the structs they create in memory. Hundreds of thousands of structs for just a few hundred CFCs.

brucekirkpatrick · January 2, 2019, 6:56pm

Brad,

It’s not called the same thing as in ACF, but it’s under Performance/Caching, and you can override it on web and also on individual mappings, which I guess you use since you just said “never”. If it is set to never, it will be the fastest at startup, since the date check isn’t even done, and if the class file is already there, it won’t attempt to compile it again even if the CFML changed.

Inspect Templates (CFM/CFC): Never

That ReloadComponent feature is something I put in my version of Lucee already and it works fine, but I also suggested it on JIRA here: https://luceeserver.atlassian.net/browse/LDEV-2103
So maybe you could get some folks to encourage it being added on the jira issue.

My ReloadComponent feature is actually cooler than just forcing recompilation: it also bypasses the overhead of searching the component path, since it lets you pass the existing CFC instance in as an argument instead of the string name. I was trying to come up with a faster version of object duplication too, but this is currently hard because the internal structure has many individual field/hashmap operations that need to be re-assigned. This is a minor performance issue since Lucee can cache those anyway, but it’s something I wanted to try in order to retain cache in more situations where I can predict what needs to change.

Worth noting that the overhead of running the query outside of Lucee is often still most of the time even with my new code, and that a streaming approach to processing the query is not guaranteed to have the highest throughput under load, but it is guaranteed to massively reduce peak memory usage, which can increase the stability of Lucee if you are running dangerously close to out of memory in the current setup. So on this one, performance is more about not crashing Lucee. Java streams are more about code style; they are not faster. I also avoid ORM for similar reasons. I’m not sure I’d use functional style programming that often because of the overhead it currently has, but obviously Sean Corfield is on the other side of that debate using Clojure. I wish it were cheaper to create Java threads, and then parallel streams would look smart, but unfortunately most real code is faster without being parallel, and precomputing intermediate objects is often best for slow things. To me, it seems like JDBC and socket performance is more of the bottleneck on queries, since everything has to be transferred and type cast from strings instead of binary or something that was already the same as Java native data. The code in the driver is doing plenty of work. Making it possible to skip casting is one of the things I did, since the driver does the least work if you use getString(), at least for MariaDB, and that wasn’t as buggy as it could have been for most of my app. When the query is fast / small, the language optimizations show themselves better.

I was going to try some work-stealing parallel programming stuff at a higher level than requests someday, to see if I could get parallel code to run faster via continuous message passing and thread.sleep(0). You kind of need a message passing system that is running all the time for threads to make more code run in parallel. There are systems that claim this works better. It’s harder to set up than a simple cfthread loop. This is why I’m trying to replace servlet, since I need direct control of the thread task queue. I think Scala / Akka / Play are built around some of these ideas.

I use a combination of a binary format (objectload is basically native Java serialization, relying on Lucee’s implementation for some of the Lucee objects and plain Java for the rest) and simple JSON files for caching complex structures to disk so that I can start up my sites much faster. Rebuilding all these caches is a very heavy task in comparison, since I’d have to run and loop over thousands of queries instead of just loading the files. Sounds like you’ve done some of this, which certainly helps.

I also find CFML’s external cache feature useless and much slower most of the time, so I end up making my own cache implementations. I don’t think “cachedwithin” or cfcache are worth it most of the time.

I also build immutable objects all at once in CFML before replacing caches, to keep everything thread safe and to have zero locks in my CFML code. Internally, variable assignment is synchronized, so any single assignment in CFML is safe without adding a lock. But if you do multiple assignments, it is no longer safe. I’m mentioning this because the performance of multi-threaded activity depends a great deal on not using cflock and on reducing write contention. I also do similar things to avoid locking when making bulk database changes. I go as far as rebuilding and swapping a whole table with rename table sometimes so that readers never have to be blocked, since some big tables can’t be deleted from or altered quickly enough.
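The build-then-swap pattern can be sketched in plain Java. This is an illustration of the general technique rather than the CFML code or Lucee’s internals: `SnapshotCacheDemo` is a made-up class, and a `volatile` field stands in for the single safe assignment mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SnapshotCacheDemo {
    // Readers only ever see a fully built, read-only map through this reference.
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    // Lock-free read: take the current snapshot and look the key up in it.
    String get(String key) {
        return snapshot.get(key);
    }

    // Builds the replacement privately, then publishes it with one assignment.
    void rebuild(Map<String, String> freshData) {
        Map<String, String> next = new HashMap<>(freshData);
        snapshot = Collections.unmodifiableMap(next);
    }

    public static void main(String[] args) {
        SnapshotCacheDemo cache = new SnapshotCacheDemo();
        Map<String, String> data = new HashMap<>();
        data.put("a", "1");
        cache.rebuild(data);
        System.out.println(cache.get("a"));
    }
}
```

Because the new map is complete before the reference changes, a reader sees either the old snapshot or the new one, never a half-updated mix, and no reader ever waits on a lock.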

bdw429s · January 2, 2019, 7:33pm

It’s not called the same thing as ACF, but it’s under Performance/Caching

Yes, of course I’m familiar with that setting. Adobe CF actually has 3 settings that relate to class caching:

1. Trusted cache (yes/no)
2. Cache in request (yes/no)
3. Cache on disk (yes/no)

Lucee combines #1 and #2 into a single setting (inspect templates) with three options (never, always, once per request). Lucee doesn’t have an equivalent for Adobe setting #3 and always caches on disk regardless. It’s worth noting there are 2 levels of class caching even though many people think it’s just a single “cache”. There’s the in-memory cache that is rebuilt on every start of the JVM regardless of your settings, and then there is the disk cache which persists across restarts. On Lucee, since the disk cache is turned on with no way to disable it, you always get that benefit regardless of your settings. This is why I was a little confused originally, because there is literally no way I CAN’T get the benefit of the disk cache in Lucee. It’s guaranteed.

since you just said “never”

To clarify again. CommandBox uses the “once per request” setting, but since the entire CLI shell is viewed as a single “request” in an event loop it’s effectively “never”.

but I also suggested it on JIRA here: https://luceeserver.atlassian.net/browse/LDEV-2103

Thanks for the link, I voted on the ticket.

My ReloadComponent feature is actually cooler than just forcing recompilation, it also bypasses the overhead of searching component path, since it lets you pass the existing CFC instance in as an argument instead of the string name.

That’s interesting, but surely creating a throw away instance of the CFC is slower than simply letting Lucee lookup its path. In my use cases, I wouldn’t have any instances of the old CFC laying around as Task Runner CFCs are transients so creating an instance of the CFC just to clear the cache to create another instance doesn’t seem like it would be of any help to me.

Java streams are more about code style, they are not faster performance.

I disagree. Streams are fundamentally different in at least two ways. The first is that the operations against a stream happen simultaneously, not procedurally. Take this example:

myCollection
  .filter( ... )
  .map( ... )
  .reduce( ... );

If myCollection is just a standard struct or array and we’re calling the CFML built-in member functions, the filter is applied to the entire collection first. Then and only then does the map start processing, and it is applied to the entire filtered collection. Then and only then does the reduce kick in and start processing. However, if myCollection is a Java stream, then the filter, map, and reduction happen as soon as elements become available on the stream. That means the reduce is crunching the first items it receives before the filter is even done with the original stream. The second way Java streams are fundamentally different is in their super powerful ability to run in parallel. Take the example above and simply change it to

myCollection
  .parallel()
  .filter( ... )
  .map( ... )
  .reduce( ... );

And now a fork/join pool of threads is spun up to process the elements in the stream all at the same time. The overhead of the thread management is basically nothing from what I’ve seen, and this works very well in Lucee EXCEPT for the fact that your fork/join threads will lose their reference to the original application context. Micha has claimed to have a fix for this that will be part of Lucee 6.
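To make the two points above concrete, here is a minimal plain-Java sketch (not CFML and not cbstreams; the class and method names are made up for illustration). `trace()` logs the order in which a sequential stream pushes each element through filter, map and reduce, and `parallelSum()` runs the same pipeline shape on the common fork/join pool. Note that `parallel()` has to be applied before the terminal operation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

class StreamOrderDemo {
    // Logs which stage touches which element, in order of execution.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Arrays.asList(1, 2, 3).stream()
            .filter(n -> { log.add("filter " + n); return n != 2; })
            .map(n -> { log.add("map " + n); return n * 10; })
            .reduce(0, (a, b) -> { log.add("reduce " + b); return a + b; });
        return log;
    }

    // Same pipeline shape, spread over the common fork/join pool.
    // sum() is a reduction; parallel() must come before it.
    static int parallelSum(int upTo) {
        return IntStream.rangeClosed(1, upTo)
            .parallel()
            .filter(n -> n % 2 == 0)
            .map(n -> n * 10)
            .sum();
    }

    public static void main(String[] args) {
        System.out.println(trace());
        System.out.println(parallelSum(100));
    }
}
```

In the sequential trace, element 1 is filtered, mapped and reduced before element 2 is even looked at, which is the "as soon as elements become available" behaviour described above. Whether the parallel version is actually faster depends on how much work each element carries.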

and to have zero locks in my CFML code

I’m curious how you prevent more than one thread from trying to rebuild cached data at the same time. I can think of some ways to do it, but that is a common use of locks in cache refresh-- to only allow a single thread to replace the missing data while the other threads wait until it’s ready instead of dogpiling on.

brucekirkpatrick · January 2, 2019, 9:19pm

Brad,

That is a good description of the difference between how Lucee is today and how it will be with true Java stream integration. But many people seem to say Java streams are the same or slower than java primitives/loops (Java vs other Java code). In Lucee, there is 30x+ overhead for functional programming style compared to Java which implemented lambdas on top of invokedynamic (added in Java 7) when it released the Java 8 streams feature. You’d be exaggerating this by adopting heavy usage of closures and streams someday compared to simple code. You can also do tens of thousands of things in Java in the same amount of time it takes to coordinate a new thread, so you always have to evaluate if the parallelism has any value. It usually doesn’t until you are running external I/O, so I don’t get excited about doing structs and arrays in parallel until Java releases a much faster “fiber” feature that is supposed to match Go’s goroutines. I think Lucee needs to rethink objects and functions some more, and use invokedynamic or a new type system so it can compile cfml into faster code. Abstractions on top of that foundation would perform like Java, but then look simple like cfml.

You don’t actually have to stop multiple threads from assigning the same variable at the same time. The consistency problem doesn’t come up very much, but in the rare case that 2 people did tell it to do the big cache reset at the same time, only the one that finished last would be the new version - there would be no java exception. If that broke the application to have the wrong version published, I’d have a bug which gets fixed by just publishing the cache one more time. I can guarantee they are synchronized with a cflock on just the cache builder which i’ve done a few times, but I don’t like to make the public internet ever wait on a lock condition. These are usually password protected users causing that change. The biggest caches are only accessible by the developer, which protects them more. The ones clients have access to are very granular, like one record in one object on one domain. This also makes my admin features very fast, since most of the system is running from memory, and knows how to update the least data possible.

It’s as simple as writing code like this in your cache rebuild process:

ts={};
// build ts nested objects fully here.
application.mycache=ts; // assign once at the end.

It’s important, though, not to accidentally write to nested keys of the live cache in any place where other unprotected requests read the cache without validation:

ts={}; // build cache on ts object
application.mycache={}; // you just temporarily broke other requests here
application.mycache.mycache2=ts; // now it works again.

You need to think of how the code is used everywhere to feel comfortable deleting locks. I’m very careful with cached objects and there is no encapsulation on them to keep them fast as possible. After redesigning everything like that, all concurrency problems were gone and locks could be deleted.
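For readers who think in Java terms, the "build fully, then assign once" pattern above can be sketched roughly like this. This is only an illustration under assumptions: `CachePublisher`, the `volatile` field and the sample keys are hypothetical and are not how Lucee’s application scope is implemented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class CachePublisher {
    // Readers only ever see a fully built map; volatile makes the
    // newly assigned reference visible to other threads.
    private static volatile Map<String, Object> cache = Collections.emptyMap();

    static Map<String, Object> current() {
        return cache;
    }

    static void rebuild() {
        Map<String, Object> ts = new HashMap<>();
        // Build ts fully here (sample values are placeholders).
        ts.put("siteName", "example");
        ts.put("recordCount", 3);
        // Assign once at the end; never mutate the published map afterwards.
        cache = Collections.unmodifiableMap(ts);
    }
}
```

A request that grabbed `current()` before the rebuild keeps working against the old snapshot, and the next read picks up the new one. If two rebuilds race, the last assignment wins, which matches the behaviour described above.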

You can also use duplicate() to fix almost any ConcurrentModificationException quickly, which has helped me knock out a fix on a few rush jobs that needed a little custom sorting on cached objects where I didn’t think of the concurrency problem ahead of time. arraySort(), for example, is not safe on an object shared between threads because it operates on the cached data directly. structSort() is safe though, since it creates new data.
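The same "copy before you sort" idea, sketched in Java with a hypothetical helper (`SortSnapshot.sortedCopy` is not a Lucee function). It assumes the shared list is not being structurally modified while the copy is taken, which holds if cached data is only ever replaced wholesale.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortSnapshot {
    // Sorts a private copy so the shared, cached list is never reordered
    // underneath other threads that are iterating over it.
    static List<String> sortedCopy(List<String> shared) {
        List<String> copy = new ArrayList<>(shared);
        Collections.sort(copy);
        return copy;
    }
}
```

Sorting the shared list in place is the analogue of calling arraySort() directly on the cached array. The copy costs some memory and time per request, which is the trade-off for not needing a lock.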

bdw429s · January 2, 2019, 9:47pm

In Lucee, there is 30x+ overhead for functional programming style compared to Java which implemented lambdas

This is troubling to hear if FP is slower than non-FP methods in CFML, but I’m not quite sure if that’s what you are stating, since you compared the speed of doing something in CFML to the speed of doing the same thing in Java. Now, that is also something I’m curious about, but it’s not an argument against FP, but an argument against using CFML in general (as opposed to just writing your code in Java). I think it’s fair to say most everything will be somewhat slower in CFML than Java, but so long as it’s negligible I’m ok with that. If you’re saying FP in CFML is slower than procedural coding in CFML, then that’s certainly more concerning.

You’d be exaggerating this by adopting heavy usage of closures and streams someday

Again, you appear to be drawing a conclusion about CFML’s FP specifically on a basis of comparing CFML to Java. I’m not sure that’s logically sound.

I don’t get excited about doing structs and arrays in parallel

I guess our experience here is anecdotal. I don’t believe anything is a silver bullet, but I’ve personally seen massive improvements in given chunks of code by simply adding parallelism. A recent example was the CodeChecker CLI command I built. It was both disk IO bound (reading files) and CPU bound (running regex). I saw huge performance gains by simply adding the parallel flag to the arrayEach() function to utilize more cores.

You don’t actually have to stop multiple threads from assigning the same variable at the same time. The consistency problem doesn’t come up very much, but in the rare case that 2 people did tell it to do the big cache reset at the same time,

I think you completely missed my point there. I understand that setting a variable is an atomic operation. Two users both kicking off an expensive cache rebuild is something very common in, say, the ColdBox framework where an applicationStop() under load could immediately have hundreds of threads triggering a reinit. The point was, in your case where two users kicked off the process, you might not have an error, but you still had one user putting unnecessary load on your server by also rebuilding the cache at the same time. Under more load, this would trigger a “dogpile” as I said in my previous post which could take down the server. So it’s not all just about how you set the variable at the end, it’s keeping threads from all trying to do the same thing at the same time. Coldbox 5 actually offers “fail fast” reinits that don’t use locks but just give back a maintenance page immediately which is another way of handling that.
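As a rough Java sketch of the lock-free "fail fast" idea mentioned above: only one caller runs the expensive rebuild and everyone else returns immediately instead of dogpiling. `SingleFlightRebuild` and `tryRebuild` are hypothetical names for illustration and this is not ColdBox’s actual implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class SingleFlightRebuild {
    private final AtomicBoolean rebuilding = new AtomicBoolean(false);
    private final AtomicInteger rebuildCount = new AtomicInteger();

    // Returns true if this caller ran the rebuild, false if another
    // rebuild was already in flight (the caller can then serve stale
    // data or a maintenance page instead of waiting).
    boolean tryRebuild(Runnable expensiveRebuild) {
        if (!rebuilding.compareAndSet(false, true)) {
            return false; // fail fast, no waiting on a lock
        }
        try {
            expensiveRebuild.run();
            rebuildCount.incrementAndGet();
            return true;
        } finally {
            rebuilding.set(false); // always release the flag
        }
    }

    int rebuildCount() {
        return rebuildCount.get();
    }
}
```

Compared with a cflock, nobody queues up behind the rebuild, so a burst of requests costs one rebuild rather than hundreds. The price is that the callers who were turned away have to cope with the cache not being ready yet.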

brucekirkpatrick · January 3, 2019, 3:17am

Brad,

You seem like you are a really good programmer who cares to develop the whole solution fully, so it’s hard to surprise you with anything. We are doing similar code on the things we talked about.

The functional programming approach in every language carries an overhead compared to the other ways of writing code in the same language. Primitive int loops will always be faster than nested function calls and closures. The fluent api of functional programming was the main attraction.

The CPU is 40 to 1000 times faster than our cfml and java abstractions allow it to be. There is a lot of choice in deciding how fast something should be at each layer.

joe.gooch · January 3, 2019, 2:39pm

brucekirkpatrick:

It’s as simple as writing code like this in your cache rebuild process:
ts={};
// build ts nested objects fully here.
application.mycache=ts; // assign once at the end.
it’s important to not accidentally write to nested keys though anywhere other unprotected requests access the cache without validation.
ts={}; // build cache on ts object
application.mycache={}; // you just temporarily broke other requests here
application.mycache.mycache2=ts; // now it works again.

This is exactly the approach/strategy I wish Wirebox leveraged. Maybe create a unique guid when clearing singletons and reinitting, instead of using i.e. application:wirebox as the key, use a unique one each time it reloads, and keep the items separate, preventing having to lock all requests to reinit. Save the guid in the binder and I’d set Request.Wirebox = the binder in onRequestStart - so literally the old objects are used until all requests lose reference to the old guid… And maybe I’d have a reaper that would clear things out after an hour or so.

brucekirkpatrick · January 6, 2019, 4:20pm

The way Lucee works inside createObject is much more complex than ReloadComponent because of the search for the cfc and the creation of the cfc. I had to make changes in 30 places to add a forceReload 3rd argument to createObject to get it how you wanted it, since I agree it would be useful to have it there too. So now it can force re-compilation on createObject without an instance.

ReloadComponent is more efficient than this if you have the object, though, because it bypasses searching for the cfc by using the internal path that was already found.

I didn’t make a feature for forcing .cfm to re-compile, so if that is used inside the CFC, it would not recompile the cfm, only the cfc. It doesn’t try to recompile any cfcs created by the recompiled cfc either, you’d have to handle that in cfml code instead.

My lucee work already deviates in other ways, so it would have to be re-implemented on the original code to be part of the official release someday.

My changes for this are here:

brucekirkpatrick · January 8, 2019, 1:45pm

Before I got up today, I was thinking about Lucee bytecode, and I remembered that the CFML key names are stored in this.keys[1]=KeyImpl.intern(“cfmlVarName”), etc, which takes about 4 operations in bytecode to be loaded (load this, getfield, push integer on stack, load array index), plus the overhead of creating the Key object.

Plus when a component is created with createObject() lucee runs initKeys which makes a copy in memory of those keys for each instance of the component.

Because these keys are meant to be static caches, it seems like it was a performance oversight to not make keys a static field, and only create these values once.

Additionally, with minor changes to the Lucee compiler, it is possible to create 1 static field for every key by changing it to look like this: private static final Key key1=KeyImpl.intern(“myCFMLVar”); which can be loaded with just one bytecode operation, which is getstatic.

I’m not sure how this measures in performance yet, but doing 1 operation instead of 4 sounds like it would be faster, plus createObject would become faster than it is now, especially on bigger CFCs, which can have hundreds of these hash map Key objects to initialize since almost every cfml variable name becomes a key object. I had also optimized Lucee’s Key objects already to not calculate one of the hashes that is rarely used, and I reduced the number of places it is used as well. So my key creation overhead is less than the original version, but it could become zero overhead in more places. I’d set up a benchmark creating a big CFC before/after and post results, and also benchmark some slower functions that run many variable operations in a loop.

I should also note that this feature is a good thing, since it caches KeyImpl object creation for repeat uses of cfml variables to make them faster. Unfortunately, this caching feature currently doesn’t achieve as much of a benefit if the CFC instance is a transient object. As a static field loaded only once, CFC transients would see a greater benefit. It would even help cfm files since the bytecode for cfc and cfm are nearly identical.
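To illustrate the two layouts being compared, here is a small self-contained Java sketch. `Key` and `intern` below are simplified stand-ins written for this example, not Lucee’s `KeyImpl`, and the bytecode notes in the comments are the usual javac output for this kind of access rather than a measurement.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;

class KeyDemo {
    // Stand-in for a CFML key: case-insensitive name with a cached hash.
    static final class Key {
        final String name;
        final int hash;
        Key(String name) {
            this.name = name;
            this.hash = name.hashCode();
        }
    }

    private static final ConcurrentHashMap<String, Key> INTERNED = new ConcurrentHashMap<>();

    // Returns the same Key instance for the same name, ignoring case.
    static Key intern(String name) {
        return INTERNED.computeIfAbsent(name.toLowerCase(Locale.ROOT), Key::new);
    }

    // Layout A: keys held in an instance array.
    // Reading keys[0] is roughly: aload_0, getfield, iconst_0, aaload.
    static final class ArrayKeys {
        private final Key[] keys = { intern("myCFMLVar") };
        Key first() { return keys[0]; }
    }

    // Layout B: one static final field per key.
    // Reading KEY1 is a single getstatic.
    static final class StaticKeys {
        private static final Key KEY1 = intern("myCFMLVar");
        Key first() { return KEY1; }
    }
}
```

Both layouts hand back the same interned object, so the difference is only in how many instructions it takes to reach it and in whether the holder is created once per class or once per object.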

micstriit · January 8, 2019, 2:58pm

as far as I remember these keys are stored in the static scope of the class, so only loaded once per component type, not for every instance. they are loaded at the time the class instance is created. in addition the compiler checks if the key in question exists in the KeyConstants class, if so they are referenced. the Key class was introduced to improve performance, a key is loaded only once but compared all the time. every time you get a variable lucee has to check one or more scopes for that variable key. adding keys to lucee as they are has given us a big boost actually. the idea is (for a long time) to create a dynamic KeyConstants table, instead of storing the keys in pages, to avoid duplications. problem is to avoid having keys in memory that are no longer used.

micstriit · January 8, 2019, 3:17pm

i also checked various applications back then, creating the hash in advance was always the fastest way. keep in mind that the hash of a key in a scope is always checked even if the application is not looking for that key. you have many more false positives than matches. so they are not really rarely used.

brucekirkpatrick · January 8, 2019, 3:42pm

Greetings Micha,

I think I forgot that the pagesource object can be reused for each component instance since it is ComponentPageImpl, not ComponentImpl. So the improvement would be reduced to comparing bytecode this scope array access vs getstatic, which is very minor, but I think I’ll do it anyway to find out what happens. It is the combination of all the changes I’ve done that makes it faster. I want to continue reducing any extra operations in the bytecode.

The hash map features are very good for the dynamic features, and it makes it easier to split the functions due to length limits and have everything still work. I’m trying to explore a more static version of the language since it was harder to make it faster as a dynamic language.

The decompiled bytecode looks like this at the top. If keys were static, it would take fewer operations and do the same thing. The key creation is probably already optimal.

public final class jetendo_cfc$cf extends ComponentPageImpl {
    public ComponentImpl[] componentVariables;
    private final ImportDefintion[] imports;
    private Key[] keys;
    private final CIPage[] subs;

brucekirkpatrick · January 10, 2019, 5:00pm

I learned that Java can’t redefine a class with more or fewer fields under the same class name during the life of the JVM, which was blocking me from finishing this idea. The static initializer (clinit) was giving me NullPointerExceptions, despite my being able to see the code was working and running, so I had to switch to the class constructor instead to get the fields to be visible when read. There could be a versioning system for bytecode classes to get around the JVM limits, but I chose to apply the static field concept to the first 100 names as a workaround, and apply the old array concept to any keys made after 100. Benchmarks reveal this took around 9% off the variable overhead, since most of that overhead is the map operation. In a very artificial test, you can show array access for a primitive is up to twice as slow as static access, but it’s almost impossible to measure that in real code.

Also, I really like the KeyConstants feature Lucee has, where it translates common keys to already created static fields instead of putting them in the classes, which makes variables like “i” already static. This feature also shows you already knew this helps cut down the class size and improve performance. Because I use many of the same names throughout my application, I’m going to bake most of these names as constants into that file so that the bytecode becomes even smaller for most of my application. I also measured internal and external GETSTATIC calls and found they are the same speed. This may improve startup time for my CFML slightly, since more of the Key Java objects would only get created once and there would be less memory usage.

Also, unrelated, I found local scope access speed is not optimal because of other undesirable CFML behavior that is allowed for legacy reasons, from when typing “local” made a struct instead of being the local scope. In my version of Lucee, I might make it impossible to assign or create “local” as a variable name at the root level in order to protect the local scope better. If we can guarantee local scope is never a root level variable, we can guarantee simultaneous access would not be possible on purpose or by mistake. Also worth noting that a struct in local scope is a separate object from the scope and does its own synchronization, so this wouldn’t break other objects, since they are ConcurrentHashMap or something else like that. The main issue here is that Lucee uses SyncMap, which is a Lucee class that wraps a custom implementation of a HashMap in order to do a mutex lock on every read/write to the local scope. Those locks are pointless when local scope is used correctly. In a single thread (the only way local scope should ever be accessed in modern cfml), the SyncMap benchmarks about twice as slow as using the JDK HashMap directly with no synchronization. So on my version, I’ve just switched LocalImpl to a new type of struct that is a plain JDK HashMap for local scope. Everything in my application still works, since I don’t use local as a variable anywhere that would cause trouble. This could be a performance option in the Lucee project someday, though I’m not pursuing the creation of options on my version since I’m going only for one optimized version of things.
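A rough way to try the synchronized vs unsynchronized comparison yourself in plain Java is sketched below. `Collections.synchronizedMap` is used here only as a stand-in for Lucee’s SyncMap (an assumption, they are different classes), and this is a naive timing loop, not a proper benchmark harness, so treat any numbers it prints with caution.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class LocalScopeMaps {
    // Simulates a function body doing n variable writes and reads
    // against a small "local scope" map; returns the sum of values read.
    static long fill(Map<String, Integer> scope, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            String name = "v" + (i % 16);
            scope.put(name, i);
            sum += scope.get(name);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> plain = new HashMap<>();
        Map<String, Integer> synced = Collections.synchronizedMap(new HashMap<>());
        int n = 5_000_000;

        long t0 = System.nanoTime();
        fill(plain, n);
        long t1 = System.nanoTime();
        fill(synced, n);
        long t2 = System.nanoTime();

        System.out.println("plain HashMap:   " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("synchronizedMap: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both maps produce identical results on a single thread; only the per-call locking differs. The JIT can make uncontended locks quite cheap, so the gap you see may be smaller or larger than the 2x reported above depending on the JVM and workload.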

I am also very interested in making query loops generate bytecode that is more aware of the nested CFML variable accesses. It is possible to do the most basic Java calls directly instead of CFML variable access with some new bytecode generation. This should be able to be done safely on cfloop query type loops without breaking compatibility, and maybe others if we are willing to explicitly define that the variable is a query. There is already code that understands how to treat cfloop query in a special way, so the challenge would be to track that and be able to do it in nested loops too, which makes it a fairly complex change, but doable. Instead of it having to push things into the scopes, the lazy query could just do read operations at runtime. A for…in can’t really be improved since it has to access all the columns to make the struct, so this change would encourage using cfloop for best performance instead of for…in.

brucekirkpatrick · February 4, 2019, 4:49am

I worked on Lucee again this weekend. This time removing servlet from lucee was my goal, but unfortunately I found it would take dozens more hours and many thousands of individual changes, so today I switched to integrating my custom java web server with Lucee.

I modified lucee so I could internally call CFML code from my java web server. Not a reverse proxy, but direct java code.

A lot of detail work went into getting this working the same as other Lucee osgi bundles. My integration was done in such a way that I can extend my server project outside of the lucee project with a regular java development process that is simpler and faster than lucee development allows.

I setup some hello world benchmarks to compare lucee with servlet vs lucee with custom java async web server.

my laptop was used - it has a 3.2ghz core i7 4core/8thread
requests per second (rps)
On windows lucee tomcat servlet: 3000rps with 8 concurrency
On windows lucee custom java server: 6000rps with 8 concurrency

On windows, it seems impossible to get more than 6000rps with ApacheBench, so I think it’s actually faster than that, but there is some limit on the platform with ports or something.

On ubuntu 18 lucee tomcat servlet: 1600rps with 8 concurrency
On ubuntu 18 lucee java server: 20000rps with 8 concurrency

I have successfully created a new entry point for executing Lucee CFML that, on ubuntu, handles more than 10 times the requests per second of the servlet path (20000rps vs 1600rps).

My code definitely skips a lot of steps in the Lucee servlet flow. It’s not a false benchmark though, since my java web server can do a full http 1.0 parse and dynamic response while also running a cfml function call. My web server doesn’t need to spawn threads for each cfml request but it could.

My java web server is as fast as undertow’s static request speed, yet my response is dynamically evaluated.

One of the things I measured in Lucee was thread start/join speed, and I found threads get 10 times slower under load on average in CFML. So when benchmarking, my java web server is able to finish about 4 requests in the amount of time it takes for the servlet to manage one new thread, and the rest of the overhead is servlet/lucee code doing other things.

When a system is closer to idle, tomcat threads and lucee performance is very good, but it just doesn’t scale as well under higher concurrency access.

I also did other things with lucee this weekend that are exciting too. I made a new way to store cfml variables in java class fields with bytecode instead of hashmaps, which will be able to work in certain places depending on the rest of the changes I do to the compiler later. Basically, I switched it so the compiler translates the name lookups instead of the runtime dynamically evaluating names. This means I have to make the compiler able to replace all the names correctly or it will break. I intend to make this work for all the functions in the same CFC and all the local scope variables, which will make code inside CFCs run at least 3 times faster. I have bytecode generation for both java locals and java fields now. Using fields gives more predictable Object conversion, which makes it more compatible with how the rest of lucee works. The way I did this was to generate java template classes with sequential field names, so I don’t have to redefine the class. It will just leave the unused fields null until they are first accessed. This will also limit which cfml language features can be used, but again the compiler could figure this out and translate the code to other structures. For example, I also modified all the var1&=“test”, var1+=1 and --var1 operator code to work with this feature, since they were not compatible at first.
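The idea of resolving names at compile time instead of at runtime can be sketched in a few lines of Java. This is a simplified illustration under assumptions: `SlotResolver` is a made-up class, and an `Object[]` frame stands in for the generated template classes with sequential fields described above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SlotResolver {
    // Used only while "compiling": maps each variable name to a slot number.
    private final Map<String, Integer> slots = new HashMap<>();

    // Compile-time step: the first use of a name claims the next sequential
    // slot, later uses of the same name (in any case) reuse it.
    int slotFor(String name) {
        String key = name.toLowerCase(Locale.ROOT);
        Integer existing = slots.get(key);
        if (existing != null) {
            return existing;
        }
        int slot = slots.size();
        slots.put(key, slot);
        return slot;
    }

    // Runtime step: storage is indexed by slot, no name hashing involved.
    // Unused slots stay null until first assigned.
    Object[] newFrame() {
        return new Object[slots.size()];
    }
}
```

Usage would look like `int count = resolver.slotFor("count");` once per function at compile time, then `frame[count] = 1;` at runtime. The catch, as noted above, is that anything which builds variable names dynamically can no longer be resolved this way and needs a fallback.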

carehart · April 3, 2020, 1:49am

Curious to hear whatever came of all this, whether with respect to Lucee, or with the fork you pursued, Bruce.

skyflare · November 17, 2020, 6:46am

Charlie,

I had trouble logging in here the way I used to, and I couldn’t answer you before in April, but I created a new account today to share something I ran into on another thread. I’ve been using my fork of Lucee in production for over 6 months. I haven’t had any problems with it. I might upgrade to the newer Java versions and occasionally find a reason to go tweak the language, but I lost interest in doing larger changes for the time being. I like having the language stable and locked down, since I have so many other things to worry about. I wanted to keep up to date with lucee all the time in case some rogue bad things happened with security, but I learned how it works, and I think I have secured my fork so much that it isn’t possible to hack lucee. I only have to worry about catching up to future JDK versions if a security or feature problem prevents something from working, like if CFHTTP stopped working. I also have a smaller Lucee project with fewer things holding it to older versions of the JDK.

Having the code faster is not as important to me as having the strict syntax and security changes I did. Though I know Lucee runs faster for the things I changed. I really wanted to bypass the servlet engine and spawn cheaper CFML threads in my custom web server, and I had some of that working, but I didn’t finish it to the point of using it. I just wanted to be able to control the way requests are queued so that abuse could be filtered out better and having an admin thread that could always get in. And also sometimes you can do work without creating a thread and go even faster, so I thought parts of the application could be single threaded for multiple requests to go faster.

I also gave up on the lucee start-up code I did that was using multiple threads to start it faster and switched it back to a single thread, since I think some of the lucee features are able to break if not loaded in the right sequence, and having the server randomly unable to start was bad. It’s hard to make lucee startup faster without causing problems, and it was a really minor improvement.