I’ve been exploring optimizations to Lucee internals that affect new and existing CFML code, and I’ve made some real breakthroughs by exposing more Java concepts to CFML while still keeping it relatively simple. My fork of Lucee doesn’t attempt to maintain ACF compatibility, and I have already updated all of my CFML code to the modern style Lucee has recommended over the years since I switched from ColdFusion to Lucee 7 years ago. As a result of this major rewrite effort, I’m now developing a stricter, purer version of CFML based on Lucee’s open source project.
I’m moving the language in my own personal direction. It is still very compatible, since I have around 1 million lines of CFML code already written that I can’t break. However, I’ve found ways to deviate that make CFML more memory efficient, more stable, and faster than ever. I’ve been using Lucee since 3.2, and before that I started with ColdFusion 5 in 2001 and stayed through ColdFusion 9. There have really only been 2 big features added to the Lucee language that I’ve cared to adopt. The fundamental shift was when we were taught to turn off scope cascading in the Railo/Lucee admin, and that wasn’t really practical until the localmode="modern" attribute existed on functions, since it was too much work to switch everything at once. So essentially the most important feature of Lucee was the ability to get rid of scope cascading in lots of granular ways. To me, that amounts to saying that scope cascading is a bad feature for CFML, so I’ve made its removal a permanent change to the language rather than an option, which lets me squeeze out more performance since the extra code paths and operations have been deleted.
The second biggest change was the lazy attribute of cfquery, but Railo/Lucee left it incomplete since it can cause you to run out of memory. I’ve fixed that and extended this query result type so that it gives us data access that is closer to JDBC.
I’ve taken the features that Lucee said were important and pushed those ideas to their absolute limit in order to make Lucee several times faster, and then I micro-optimized the internals and learned how to generate new bytecode. In the last 2 months, I’ve spent around 200 hours of my personal time working on some of the most radical improvements to CFML performance of any release of the language.
If your CFML code touches absolutely anything outside the Java process, like a database or a file, that access instantly becomes 99% of the bottleneck of your code. So while I’m making the language much faster relative to the way it worked before, the difference often becomes trivial or unnoticeable if your application is not CPU-bound on Lucee CFML internals. My Jetendo CMS application is written with so much CFC caching and so many struct lookups in application scope that some of my requests never have to leave the Lucee process, which means those requests are CPU-bound. I also like minimizing the overhead of my framework, since that is the minimum amount of time the fastest request can take. I’m finding ways to do this optimization work in Lucee internals that I could never achieve with just CFML code.
Below I’ve described many of the changes I’ve got working on my open source Lucee fork on GitHub. I’ve also made other changes to fix bugs and features that aren’t described here, but they can be found in Jira and my commit history. My fork is also broken in some places right now, since I intend to remove quite a lot of features from Lucee so that it builds and executes differently and is simpler. Right now I only build the Lucee core .lco patch file, in an unofficial way, in order to develop faster, and I ignore the Jar build because the Lucee admin is too legacy-oriented and would need a rewrite to continue to function. I’m going to drop most of the admin features instead. If anyone is curious about how to build my version of Lucee, let me know.
I’ve made it possible to create and access CFML variables as true Java local variables without the hash map overhead. The Java locals get compiled into the bytecode, so these CFML variables can be accessed at the same speed as in the Java language, which is up to 9 times faster than regular CFML variable access, but often more like 3-4 times faster due to JVM HotSpot magic. This is a very exciting feature, and all it requires is scoping the CFML variable in a new scope. It also requires being stricter about Java type casting, since the bytecode does less casting for you in this new code to make it even more efficient. FYI: All CFML variables are stored as java.lang.Object and have to be cast repeatedly on both read and write to be accessed. Casting is typically less than 5% overhead, so it’s not a problem, but I’ve reduced the amount of casting anyway to get more raw performance.
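To make the difference concrete, here is a minimal plain-Java sketch, not code from the fork. It assumes a regular CFML scope can be modeled as a `HashMap<String, Object>`, and the class and method names are made up for illustration. The first method pays for a hash lookup, a cast and a boxing allocation on every iteration, while the second keeps the value in a true Java local.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names are not from Lucee.
class ScopeAccessSketch {

    // Models scope-style access: the variable lives in a hash map as an Object.
    static double sumViaScope(int n) {
        Map<String, Object> local = new HashMap<>();
        local.put("total", 0.0);
        for (int i = 1; i <= n; i++) {
            // Read: hash lookup plus a cast from Object to Double, then unboxing.
            double total = (Double) local.get("total");
            // Write: boxing to Double plus a hash map put.
            local.put("total", total + i);
        }
        return (Double) local.get("total");
    }

    // The same loop with a true Java local: no lookup, no cast, no boxing.
    static double sumViaLocal(int n) {
        double total = 0.0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumViaScope(1000));
        System.out.println(sumViaLocal(1000));
    }
}
```

Both methods compute the same sum; only the cost of reaching the variable differs. Any speed-up you measure depends on the JVM and on how warm the code is.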
I’ve reduced the overhead of CFML function calls by 60% by eliminating the extra OOP calls, making it possible for most of the internal implementation to read and write the Java fields directly, and removing some features and redundancy in the operations. FYI: Calling a CFML function on a CFC is very complex under the hood. It is over 30 times slower than one Java call, but I got it closer to 10x with a host of changes in many places.
I rewrote how CFML variable access works inside the CFML bytecode so that it uses direct field operations against the PageContextImpl and scopes instead of numerous function calls. This has a cumulative effect on performance: none of those calls amounts to much individually, but CFML in a tight loop ends up doing substantially fewer Java operations and doesn’t have to rely on HotSpot optimizations as much to be fast. You can usually see HotSpot kick in on the 2nd or 3rd request, which makes a lot of benchmarks look weird. Even just adding an extra call can seemingly disable HotSpot optimization sometimes. For example, function inlining in a loop seems to work when you only call the function once, but if you repeat the call, I think it can’t inline it anymore. I’ve run into little details like that which make it seem like Java can’t optimize code around hash maps as well as it can code around local variables, fields and methods. As a result, I think the performance gain is occasionally 100x over regular CFML when you use this Java locals feature and run the code enough to hit the “hot” threshold that HotSpot looks for. The OpenJDK wiki has more information on HotSpot.
Because I see this performance gap appear sometimes between HotSpot running plainer Java code and Lucee CFML code, I don’t think it would ever be accurate to assume Lucee is a thin layer on top of Java. Lucee is really 30 times slower than Java on many benchmarks, and it keeps HotSpot optimizations from working as well in the CFML bytecode. ACF 2018 measured around 3 times slower than Lucee 126.96.36.199 on various language tests I did recently, so they aren’t doing it any better.
In some areas much more speed is now possible on my version, especially where the work is more primitive, like complex math calculations, because variable and function overhead shows up more there than it does with strings or more complex objects. Lucee CFML can’t touch Java primitives without boxing them first. Also, all CFML numbers are of type Double to keep numbers simple for CFML, but Integer would be up to twice as fast because it uses half the memory. Java primitives are up to 10 times faster than objects in Java itself, so we are definitely carrying a lot of Object overhead by using a dynamic language where everything is an “Object” under the hood.
I merged the arguments scope with the local scope so that Lucee no longer has the overhead of managing 2 scopes for every function call or variable access. I also got rid of the ScopeFactory for new LocalImpl objects and let garbage collection do the job, to simplify it further. For the most part, I’ve permanently eliminated scope cascading at the Java and bytecode level. Variable access became up to 100% faster.
I created a new way to call functions in external CFCs that is 25% to 100% faster. Because of how Lucee works on the inside, the code ends up doing an extra hash map lookup to get access to a function in another scope or object, which is why call overhead for external CFC methods is 2 to 3 times higher than for calls to methods in the same CFC. My new feature lets the CFML developer cache the methods as Java locals, which are a new object holding a ComponentImpl and a UDFImpl, so that the actual call no longer needs the hash map getCollection to find the UDF repeatedly. I also bypass several of the intermediate Java calls internally, since a lot of the calls in the existing code just manipulate arguments. I could get the majority of my shared objects to operate at this increased speed with minimal code rewriting, or I could generate new bytecode for existing variable names to make them work this way automatically. This change will have a real impact on overall performance in a large CFML application, since it is very common to make many calls between cached CFCs.
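The idea of resolving a method once and reusing the reference can be sketched in plain Java. This is only an illustration under assumptions: the component is modeled as a `Map` from function name to `IntUnaryOperator`, and none of these names come from Lucee.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; a real CFC call involves far more machinery.
class CachedCallSketch {

    // Stand-in for a component whose functions are stored by name.
    static Map<String, IntUnaryOperator> component() {
        Map<String, IntUnaryOperator> cfc = new HashMap<>();
        cfc.put("increment", x -> x + 1);
        return cfc;
    }

    // Looks the function up by name on every call, as a dynamic call would.
    static int callWithLookup(Map<String, IntUnaryOperator> cfc, int times) {
        int value = 0;
        for (int i = 0; i < times; i++) {
            value = cfc.get("increment").applyAsInt(value);
        }
        return value;
    }

    // Resolves the function once, keeps it in a local, then calls it directly.
    static int callWithCachedRef(Map<String, IntUnaryOperator> cfc, int times) {
        IntUnaryOperator increment = cfc.get("increment");
        int value = 0;
        for (int i = 0; i < times; i++) {
            value = increment.applyAsInt(value);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, IntUnaryOperator> cfc = component();
        System.out.println(callWithLookup(cfc, 100));
        System.out.println(callWithCachedRef(cfc, 100));
    }
}
```

The cached version trades a per-call lookup for one lookup up front. That is only safe while the function stored under that name does not change.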
I have always liked making code generation systems, and Java bytecode is especially fun and challenging to work on when I can change how CFML works. I’ve learned a lot about bytecode by trial and error over the last 2 months, and I didn’t know it at all before that. This work is helping me understand how to connect patterns in CFML code to new concepts in the Java bytecode, which is actually more interesting than any of the other work, since I’ve already made excellent systems in CFML. I could now translate my CFML framework’s nested shared scope accesses into new, more direct bytecode without needing to rewrite the CFML, so that it runs at the speed of a static Java class. This will give my application a huge boost compared to anything else written in CFML.
I also made a way to statically compile my CFC components and function names into Lucee core, which can get rid of the dynamic call overhead and other security checks that Lucee does. I also added a feature for calling a CFML function in an “unsafe” way that is twice as fast, for cases where you don’t need some of the normal debugging and variable scoping features.
I changed String.concat() to Java’s StringBuilder.append() internally in the bytecode generation, so that the CFML & operator is at least 2 to 3 times faster on average because it eliminates many redundant copies of the string. The advantage grows as your string gets larger and you concatenate more times. I did this because this is how the Java compiler optimizes string concatenation in the Java language to be as fast as possible. Lucee CFML didn’t do this, even though it has its own compiler for CFML code.
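As a rough plain-Java illustration of the two strategies (this is not the generated bytecode from the fork, and the class name is made up), compare repeated `String.concat()` with one `StringBuilder`:

```java
// Hypothetical illustration only.
class ConcatSketch {

    // Each concat() allocates a new String and copies everything built so far.
    static String joinWithConcat(String[] parts) {
        String out = "";
        for (String part : parts) {
            out = out.concat(part);
        }
        return out;
    }

    // A single StringBuilder appends into one growing buffer and copies once at the end.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithConcat(parts));  // prints "abc"
        System.out.println(joinWithBuilder(parts)); // prints "abc"
    }
}
```

The results are identical. The builder version avoids re-copying the accumulated string on each step, which matters more as the string grows.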
I modified lazy queries, which return Lucee’s SimpleQuery.java object, so that I can manipulate the resultSet cursor directly and call next(), reset() and close() on the result myself. I also made the query close automatically after a normal for/cfloop, so you don’t have to remember to do it if you don’t want to think about it. This prevents lazy queries from running out of memory, which allowed me to turn lazy on for all queries by default, and that massively reduces simultaneous memory usage across an application. For other code that I want to optimize further, I can now manage looping through the resultset myself while only consuming the memory of the row and columns I am currently using, instead of having the entire query result looped over multiple times and fully processed like legacy CFML requires. I also made it possible to access the query records directly by column or columnIndex through new function calls, to further eliminate the internal hash map overhead of CFML variable access. When I combine these features with my new Java locals feature, looping over query data is nearly 100% faster. It’s like JDBC now with a lot less typing. This is a very exciting set of changes.
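The cursor behavior described above can be sketched without a database. This is a minimal sketch under assumptions: a plain `Iterator<String[]>` stands in for the JDBC ResultSet, and the class and its methods are hypothetical, not the SimpleQuery API. It keeps only the current row in memory, allows access by column index, and closes itself when the rows run out.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not Lucee's SimpleQuery.
class LazyRowsSketch implements AutoCloseable {
    private final Iterator<String[]> source; // stands in for a forward-only ResultSet
    private String[] current;                // the only row held in memory
    private boolean closed;

    LazyRowsSketch(Iterator<String[]> source) {
        this.source = source;
    }

    // Advances the cursor; closes automatically once the rows are exhausted.
    boolean next() {
        if (closed) {
            return false;
        }
        if (source.hasNext()) {
            current = source.next();
            return true;
        }
        close();
        return false;
    }

    // Direct access by zero-based column index, with no name lookup.
    String get(int columnIndex) {
        if (closed || current == null) {
            throw new IllegalStateException("no current row");
        }
        return current[columnIndex];
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
        current = null;
    }

    // Walks two sample rows and reports the first column plus the closed state.
    static String demo() {
        List<String[]> data = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"b", "2"});
        LazyRowsSketch rows = new LazyRowsSketch(data.iterator());
        StringBuilder sb = new StringBuilder();
        while (rows.next()) {
            sb.append(rows.get(0));
        }
        return sb + ":" + rows.isClosed();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ab:true"
    }
}
```

With a real ResultSet, `close()` would also release the underlying statement and connection resources. That part is omitted here.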
I do have quite a few other major changes planned, since I’m trying to migrate Lucee away from servlet technology and integrate my application and my Java web server with it. I’m going to make Lucee start much faster as a CLI application like it used to, so that eventually I don’t have to use Tomcat or any particular server technology. This change will be very invasive, since JSP, Javax, and Servlet appear in several thousand places in Lucee core. However, I’ve already proven that various things don’t actually require these dependencies and that Lucee CFML code doesn’t require separate threads to work right. So the hard part is just the time it takes to refactor everything and keep it working. The benefit later will be that we could serve a CFML request on the same thread as the socket or the original CLI thread, which saves around 0.3 milliseconds on most CPUs. My benchmarks of Java code I wrote suggest that just reducing the number of threads being created can let Java complete HTTP requests 4 times faster for CPU-bound code, in addition to the other improvements I’ve made to speed up Lucee. They announced that Lucee 6 would start faster, but they must only be talking about Lucee itself, since 90% of the start-up time is actually Tomcat. Perhaps the scripting engine start-up is what they are talking about, but Lucee is very integrated with servlet, so it carries whatever overhead there is there. I actually experimented with tweaking OSGi start-up speed, but it wasn’t the bottleneck, so I’m not sure there is really much that is “slow” about Lucee start-up in their code.
More frustratingly, Tomcat often fails to stop itself in production because there are still running threads. So even if Lucee took 0 seconds to start, we’d still be waiting 5 to 30 seconds for Tomcat to do its thing. I know my Java web server could run Lucee just fine at some point, and it is just a few thousand lines of code with the JDK as its only dependency. I don’t expect Lucee to restart in under a second until I have replaced the servlet features with my custom server. More importantly though, it will handle more simultaneous connections and use less memory than ever.
It makes me really excited to think of where I’ll be with Java and Lucee in the future, since I’ve only been working on Lucee core since July of this year.