Could there be a significant performance impact of using InvokeDynamic in the Java bytecode in a future version of Lucee?


#1

Currently, UDFs and similar constructs are stored in hash maps because of the dynamic nature of the language.

But when I looked at the bytecode, you actually create a fixed-length array when compiling a component, and then use a switch on an integer to find the right method when the user calls it, which looks pretty efficient:

public final Object udfCall(PageContext var1, UDF var2, int var3) throws Throwable {
	switch(var3) {
	case 0:
		// do first method body
	case 1:
		// do second method body
	} 
	return null;
}

But then I looked at how Lucee comes up with the right number to use there, and I saw it has to call PageContextImpl's getFunction, which does a fair amount of searching and casting work before it reaches a much heavier operation: accessing the hash maps and various objects to do everything needed before finally running the method.

I believe invokedynamic was invented precisely to let a language built on the JVM, like Lucee, leverage the JVM's call-site caching for the method-lookup operations that dynamic languages need constantly. It may only pay off under certain conditions once implemented, but knowing what those conditions are could give us a fast lane in our CFML code.
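As I understand the caching, it works roughly like this sketch using the plain MethodHandle API (not Lucee code; the greet and bootstrap names are made up for illustration): with invokedynamic, the JVM calls a bootstrap method once per call site, and the returned CallSite is linked in place so later calls skip the lookup entirely.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class IndySketch {
    // A target method a dynamic language might dispatch to.
    public static String greet(String name) {
        return "Hello, " + name;
    }

    // Bootstrap-style lookup: resolve the method once and cache it in a CallSite.
    // With real invokedynamic, the JVM calls a method like this the first time the
    // instruction runs, then links the call site until it is invalidated.
    static CallSite bootstrap(MethodHandles.Lookup lookup, String name, MethodType type) throws Exception {
        MethodHandle target = lookup.findStatic(IndySketch.class, name, type);
        return new ConstantCallSite(target); // the JIT can inline straight through this
    }

    public static void main(String[] args) throws Throwable {
        CallSite site = bootstrap(MethodHandles.lookup(), "greet",
                MethodType.methodType(String.class, String.class));
        // After the one-time lookup, every call goes directly to the target method.
        MethodHandle invoker = site.dynamicInvoker();
        System.out.println((String) invoker.invokeExact("Lucee"));
    }
}
```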

I assume this would work for all CFML data types, even CFML primitives, because everything is a method call underneath.

What I don't know is how you'd invalidate that cache at the right times while making sure everything that makes CFML what it is still works.

It would be interesting to discover whether there are certain conditions where this cache can be used effectively, and then migrate code toward that higher-performance path that can use the faster bytecode. For example, callWithNamedValues is probably slower than callWithoutNamedValues.

It also looks like direct Java access uses reflection to find the method to call, which carries higher overhead. Perhaps both CFML and Java calls would get closer to native Java speed if invokedynamic caching were in place.

To further support the case, notice the invokedynamic performance-tuning section in the JRuby documentation:

“This will make more heavy use of invokedynamic and you may see a substantial increase in eventual performance in your application. We recommend testing your application thoroughly with and without compile.invokedynamic and seeing how your application performs.”

Perhaps we could read the JRuby source and decompile its classes to understand more about how it works. I don't use Ruby, but since JRuby can run both ways, it gives you a way to compare. Everyone seems to agree that enabling invokedynamic makes it run faster over time. One person measured a 26% speedup in a real-world queue application; another ran a Mandelbrot benchmark 95% faster, which suggests tight math loops benefit enormously from invokedynamic. I think it may make an even bigger difference in CFML, which is too dynamic by default. Perhaps less if you've already optimized away scope cascading, but I'd expect the gains to be on par with, or better than, those from using localscope modern.

Someone on a Clojure forum also said that many (all?) Clojure classes have only a single method, which reduces how much searching the bytecode needs to do; that's interesting. CFML, unfortunately, has to search like crazy.

Currently, there is no invokedynamic in the Lucee 5.3 codebase.

Having this feature could provide another path to modernize and improve CFML performance.

I might try to change how UDF_CALL works in Lucee's Page.java, but I should probably do a simpler hello-world first; the things I've tried so far haven't worked because I haven't learned it yet.

FYI: I’m learning JVM and bytecode stuff for fun.


#2

I was able to verify in benchmarks I made today that Lucee function calls would be much faster with invokeDynamic.

It's hard to say exactly how fast it would be until there is working code in Lucee running CFML. Lucee bytecode runs about twice as fast as the Java code I wrote to simulate Lucee's approach, so based on that and the data, I'd guess invokedynamic would be 3 to 50 times faster depending on the type of call.

I was testing with Java's CallSite and MethodHandle invoke APIs.
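A cut-down version of the kind of comparison I mean (not my exact benchmark; the square method and loop count are arbitrary) contrasts a cached reflective Method with a resolved MethodHandle:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

public class LookupBench {
    public static double square(double x) {
        return x * x;
    }

    public static void main(String[] args) throws Throwable {
        int n = 1_000_000;

        // Reflection: the lookup is cached, but every invoke still boxes the
        // argument and return value and re-checks access.
        Method m = LookupBench.class.getMethod("square", double.class);
        long t0 = System.nanoTime();
        double acc = 0;
        for (int i = 0; i < n; i++) acc += (double) m.invoke(null, (double) i);
        long reflectNs = System.nanoTime() - t0;

        // MethodHandle: resolved once; invokeExact avoids boxing entirely.
        MethodHandle mh = MethodHandles.lookup().findStatic(
                LookupBench.class, "square", MethodType.methodType(double.class, double.class));
        t0 = System.nanoTime();
        double acc2 = 0;
        for (int i = 0; i < n; i++) acc2 += (double) mh.invokeExact((double) i);
        long handleNs = System.nanoTime() - t0;

        // Print the sums too so the JIT cannot eliminate the loops.
        System.out.printf("reflection %d ms, method handle %d ms (sums %.0f / %.0f)%n",
                reflectNs / 1_000_000, handleNs / 1_000_000, acc, acc2);
    }
}
```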

I also compared Lucee 5.3 to CF2018 on the same thing. I was surprised how similar CF2018 is in structure, yet completely different in performance: CF2018 is 5 to 10 times slower than Lucee 5.3 on Windows. They use the same hash-map lookup approach for functions, and there is no invokedynamic code. Invokedynamic would further widen the gap.

I want to work on this more soon. Even if it breaks something, it would be nice to have as an option, or just for my own use.


#3

I spent the weekend making a few small changes to Lucee's CFML compilation so I can call some of my Java code much faster and more directly, without reflection or Lucee extensions. I was able to circumvent the need to wrap everything in an fld/tld and bypass the overhead of the Lucee bytecode with my own direct path to Java. I can also guarantee this code is safe, since I wrote it, so I don't have to let Lucee keep checking whether direct access is allowed, as it does for other Java integration.

Essentially, I needed a better bridge to Java so that more of my Java code doesn't have to depend on Lucee. I want my Java to depend on a loosely coupled, interface-based module instead, so I don't have to look at the thousands of Lucee files in IntelliJ all the time. If I had only used a Lucee extension, I couldn't have made it as fast or as simple to extend, because Lucee 5.3 doesn't generate direct-access bytecode for Java.

I figured out how to upgrade the ASM jars from 4.2 to 7.0, changed the ClassWriter to use COMPUTE_FRAMES, and updated all the V1_6 references to V1_8 so that Lucee generates valid Java 8 class files for compiled CFML code. I needed that to be able to use invokedynamic or other newer features later. For the specific feature I wanted, though, I don't even need invokedynamic or any kind of reflection at runtime.

I was able to add a new global scope to Lucee called the Jetendo scope, which is going to be my bridge between Java and CFML code.

Instead of letting the scope behave like a hash map, as all the others do, I modified the bytecode to translate the CFML code directly into Java field and method calls with the least overhead possible: no reflection and minimal casting at runtime.

I know it is working because the decompiled, ASMified bytecode is greatly simplified compared to what Lucee usually generates, and I still get the right output in CFML.

Then I benchmarked the new direct bytecode compared to the normal CFML code for reading fields and calling functions.

Fields can be read twice as fast with the new bytecode.

Java methods can now be called from CFML 9 times faster than normal CFML UDFs.

I haven't tried function arguments, more complex objects, or write operations yet, so it's just one-level-deep read tests so far, like jetendo.callFunction() or jetendo.myField.

I don't know why Lucee always does reflection on Java objects at runtime instead of at compile time, but there is a huge performance benefit to moving that reflection to compile time. I think I could eventually patch or replace how reflection is done for all scopes with code similar to what I wrote for the Jetendo scope. Perhaps the reason it's done at runtime is that if a Java class changes at runtime and exposes a new interface, all the CFML code would need to be recompiled to stay correct. I'd rather handle that with a special call to invalidate the cfclasses folder than make all CFML-to-Java access 2 to 9 times slower all the time.

Also, the Java compiler typically converts String concatenation into StringBuilder calls when you write Java. I noticed Lucee bytecode doesn't do this yet, so every time we use the & operator in CFML, we make another copy of the string, which gets progressively worse the longer the string grows. If you break these into separate echo() calls, you avoid the memory waste. I wanted to determine how much faster a StringBuilder approach would be compared to concatenation to see if there is anything to gain from changing this. I did 4 concatenations per loop iteration each way: the normal Java style was up to 5 times faster than Lucee's & operator bytecode. In my Jetendo scope, I'm going to create a way to use StringBuilder objects for output and named buffers to make it much more memory efficient. We can of course use arrays in CFML to make it faster too. I might eventually try to apply this back to all Lucee concatenation bytecode to speed everything up; then we wouldn't need to mess with arrays or other tricks, because the Lucee compiler would use StringBuilder automatically.
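A minimal sketch of the comparison (not the exact benchmark I ran; the piece and count values are arbitrary):

```java
public class ConcatBench {
    // Naive concatenation: each + copies the whole string so far, O(n^2) characters total.
    static String repeatConcat(String piece, int n) {
        String s = "";
        for (int i = 0; i < n; i++) s = s + piece;
        return s;
    }

    // StringBuilder: amortized O(n) appends, one final copy in toString().
    static String repeatBuilder(String piece, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(piece);
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 20_000;

        long t0 = System.nanoTime();
        String a = repeatConcat("x", n);
        long concatNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        String b = repeatBuilder("x", n);
        long builderNs = System.nanoTime() - t0;

        System.out.printf("concat %d us, builder %d us, same result: %b%n",
                concatNs / 1_000, builderNs / 1_000, a.equals(b));
    }
}
```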

Another thing I want to improve in the Lucee bytecode is storing the results of all those PageContext function calls in local variables. In a loop, Lucee makes 3+ extra Java function calls on average for every one thing happening in the CFML. If it could reuse a local Java variable, the bytecode would be much more efficient. It would be especially useful to apply this optimization before loops, since that's where it would benefit the most. HotSpot might optimize some of these away, but I think the bytecode should read more like a simple Java class instead of being so heavy on function calls. It needs to refer to the PageContext at least once for many things, but not thousands of times in a loop.

We could track these new local variables in a lookup table so we have the right int offset for ALOAD, etc. I can see how not tracking locals makes the bytecode easier to write, because you can just chain everything together, but local tracking could be abstracted in a way similar to how the string key names are already tracked at the bottom: helpers like loadScope(scope), loadPageContext(), loadFunction(name, args), loadPageContextImpl(), and so on, instead of raw ASM calls. These helpers could check whether they had already run in the current udfCall(). The first time one runs, it would append its localVar = functionCall() to the top of udfCall and load the local onto the stack at the actual line number we're on, guaranteeing the call doesn't happen multiple times in a loop. Each additional call would just find the local variable's int offset and load that instead of running the function again.
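The idea, in plain Java rather than bytecode (pageContext() and Context are hypothetical stand-ins, not Lucee classes):

```java
public class HoistSketch {
    static class Context {
        int value() { return 7; }
    }

    static final Context CTX = new Context();
    static int lookups = 0;

    // Hypothetical stand-in for the repeated PageContext lookup in Lucee bytecode.
    static Context pageContext() {
        lookups++;
        return CTX;
    }

    // Unoptimized: re-fetches the context on every iteration, like the current bytecode.
    static int sumNaive(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += pageContext().value();
        return sum;
    }

    // Optimized: the lookup runs once and lands in a local (a single ALOAD afterwards).
    static int sumHoisted(int n) {
        Context ctx = pageContext();
        int sum = 0;
        for (int i = 0; i < n; i++) sum += ctx.value();
        return sum;
    }

    public static void main(String[] args) {
        lookups = 0;
        sumNaive(1000);
        System.out.println("naive lookups:   " + lookups);  // 1000
        lookups = 0;
        sumHoisted(1000);
        System.out.println("hoisted lookups: " + lookups);  // 1
    }
}
```

Both versions compute the same result; only the number of lookup calls changes.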

If a CFML user is willing to sacrifice the ability to add or delete methods in a CFC dynamically (perhaps via a compiler option), you could also cache all the getFunction calls in local variables to speed them up quite a bit, especially if the argument structure is the same on each call. getFunction is a complicated thing to replace, so it was easier to just bypass it for now.

It was super hard to figure out ASM that wouldn't crash with InternalError, ClassFormatError, ArrayIndexOutOfBoundsException, or VerifyError. At one point I wasn't boxing boolean to Boolean where it expected to cast to Object, but the error message said nothing of the sort, which made it super confusing. If I worked on the JVM or ASM, the first thing I would do is rename these errors and provide human-readable output instead of the nonsense they give. Fortunately, once the bytecode works, you unlock a serious performance boost by doing the fewest operations possible.

It is really cool to be able to generate my own version of what CFML code does.

I got so stuck on some of this that I haven't had time to set up a good invokedynamic use case in Lucee core yet. Invokedynamic really only seems useful when you have to evaluate types via reflection or other lookups at runtime, because you don't know what something is until runtime, since everything in CFML can be passed around as the "any" type. I'll have to do that for CFC objects, since we don't know what they are most of the time; invokedynamic should be up to 9 times faster for CFML method calls. Invokedynamic has slightly more overhead, though, and the first call always runs at the normal slow speed, which might make it less attractive for code with lots of transient CFML objects compared to a compiler option that forces static/virtual/interface calls until the class cache is flushed.

I also want to merge the CFML arguments and local scopes and eliminate the concept of arguments as a separate object throughout Lucee core. This would reduce function-call overhead and make CFML code simpler and more natural to write. I'm always typing myvar = arguments.myvar just to avoid repeating the prefix, because I treat implicit scope access as an error. I was going to make arguments an alias for the local scope so the compiler could avoid breaking existing code; it would have zero runtime overhead, since the bytecode would be written as var1.localScope().get(key) either way.

If we could tell the Lucee compiler the exact CFC a variable references (like com:path.to.com), we could further optimize runtime performance with direct Java method calls on those objects. This would be like making a TypeScript version of CFML.

I learned how to build the IntelliJ CFML Support plugin the other day so I could explore new syntax and code-completion concepts in both the plugin and Lucee core, so that I don't frustrate anyone with code the editor can't understand. I thought I could even create a fake type system that IntelliJ would understand but the Lucee compiler would simply ignore. This would give us code completion with CFC accuracy even for cached CFCs, and that would make IntelliJ an awesome tool for CFML developers, since my #1 complaint about continuing to write CFML/Lucee is the lack of tools that match Java tooling. Currently, only Sublime Text is suitable for Lucee, because it does a decent job with the "all autocomplete" plugin on a big project, but IntelliJ would increase the accuracy dramatically, save real time, and make CFML feel as productive as Java and TypeScript in the JetBrains IDEs.

If I took that approach further and generated CFC-type-aware bytecode, CFC method calls would be 9 times faster without needing invokedynamic. Having CFC-type-aware code completion in the IDE plus Java-speed object calls would be the most amazing upgrade for CFML. I figure we'd need a way to import CFCs somewhere in the current file so the IDE can understand abbreviated names, and the IDE would also have to understand the absolute paths of all mappings, with support for editing mappings per project so the more dynamic ones work too. The current plugin has none of these features: it only understands a component if you create it in the same file. As a hack, you can make it aware of your CFCs by naming a bunch of them in a function you never call.
It would be better if we could get alt+enter code completion for importing CFC types, with search to find them the way Java can.

Additionally, it would be cool if CFML structs could be typed somehow, to avoid making a CFC for everything, since CFCs are heavy and forced into separate files. I could get this working at the IDE level at least, just to make it easier to call functions that take more complex arguments: something like a type-definition file behind your application that extends what the IDE considers when analyzing the code you write. Again, like TypeScript for CFML.


#4

Very interesting research indeed. I've wished for better performance in the past. When trying to do real-time hardware automation on my Raspberry Pi, for instance, I ran into the overhead of calling methods on Java libraries thousands of times a second. Reflection is a killer under that kind of load.


#5

I've put another 15 hours into this direct-to-Java bytecode since last week. I got super stuck on endless bytecode errors under certain conditions, and then finally learned today that four variations of bytecode are needed to cover all the direct Java write operations, because Lucee has both "Variable" CFML expression types and native Java types, and fields and methods can be public (instance) or static. I only had to deal with static and public on read operations.

Now I have bidirectional read/write of public and static fields, public and static methods, and support for functions with zero or more arguments, so I think that covers everything from the CFML point of view. Currently it requires manually casting to the Java types before using them, and using boxed Java types on the Java side (e.g. Double). I might leave it like that, since giving it the exact types avoids redundant casting. This new bytecode allows direct Java access from CFML code without any runtime reflection or function lookup, because it compiles the operations into direct Java calls.

I did additional benchmarks and made sure the decompiled bytecode is still simple and optimal. Write operations on the Java object were 7 to 8 times faster than the equivalent CFML code. I made sure no CFML hash maps were touched in those benchmarks by using loop times="1000000" and direct Java calls inside the loop. All reflection and type operations are handled at compile time instead of runtime, so there really isn't anything "extra" happening at runtime for this code now. I'm also still using the newer ASM 7.0 with COMPUTE_FRAMES successfully instead of ASM 4.2; I thought I'd have to revert, but I just had to fix my bytecode stack errors to get the frame errors to stop. There are close to zero answers on the web for bytecode errors; you just have to struggle until you find the one or two things missing in your code.

Read operations are still 7 to 9 times faster than the equivalent CFML code.

Because Java field operations are much faster than method calls, you get even greater performance if you operate mostly on Java fields. Code accessing Java fields is up to 50 times faster than calling CFML UDFs or Java functions via reflection. I may design some things to take advantage of this in CFML.
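For illustration, a rough sketch of the gap between reflective and direct field access (FieldBench and its counter field are made up for this example, not Lucee code):

```java
import java.lang.reflect.Field;

public class FieldBench {
    public long counter = 42L;

    public static void main(String[] args) throws Exception {
        FieldBench obj = new FieldBench();
        int n = 1_000_000;

        // Reflective field read: boxing plus access checks on every call.
        Field f = FieldBench.class.getField("counter");
        long t0 = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < n; i++) acc += (Long) f.get(obj);
        long reflectNs = System.nanoTime() - t0;

        // Direct field read: a single GETFIELD instruction after JIT compilation.
        t0 = System.nanoTime();
        long acc2 = 0;
        for (int i = 0; i < n; i++) acc2 += obj.counter;
        long directNs = System.nanoTime() - t0;

        // Print the sums too so the JIT cannot eliminate the loops.
        System.out.printf("reflective %d us, direct %d us (sums %d / %d)%n",
                reflectNs / 1_000, directNs / 1_000, acc, acc2);
    }
}
```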

It seems like a simple Java program with the same functionality would be 10 to 50 times faster than a Lucee CFML program on simple language operations, though the gap quickly closes once both rely on hash maps and external I/O. Most of the performance difference comes from the additional memory-access time and the processing of hash-map and hash-code algorithms.

I might redesign this bytecode later to support compiling calls on other Java objects, to extend what I can do from CFML. Perhaps I'd add a way to define, externally for security, a list of class names allowed to be compiled this way, so I don't have to compile them all directly into Lucee core; then I could load and unload them via OSGi without restarting Lucee.

I may also still look at implementing invokedynamic, since I understand much more of the bytecode now, and that would speed up existing CFML code at least 3 to 9 times.


#6

I’m curious about your comments on hashmaps. Are you saying they are slow in Lucee? Is there a way they can be sped up?


#7

Brad,

I have made hash maps (i.e. all CFML variables) in Lucee faster in several ways, both today and a while ago. I wrote about it in the ticket and pushed working changes to Lucee core on my fork.
https://luceeserver.atlassian.net/browse/LDEV-1998?filter=-2

You can measure that all of the Lucee scopes are slower at the root level than new structs you create, because scopes use LinkedHashMap instead of ConcurrentHashMap, presumably an ACF-compatibility choice to retain key insertion order. I switched it to ConcurrentHashMap almost everywhere and then measured that it was in fact faster.
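The trade-off is easy to demonstrate (a small sketch, not Lucee code): LinkedHashMap iterates keys in insertion order, while ConcurrentHashMap iterates in hash order but offers lock-free reads.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapOrderDemo {
    // Inserts keys and reports the iteration order the map actually produces.
    static String keyOrder(Map<String, Integer> map, String... keys) {
        for (String k : keys) map.put(k, 1);
        return String.join(",", map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order (the ACF-compatible behavior)...
        System.out.println(keyOrder(new LinkedHashMap<>(), "zebra", "apple", "mango"));
        // ...ConcurrentHashMap iterates in hash order instead, but gets you
        // lock-free reads and better concurrent throughput.
        System.out.println(keyOrder(new ConcurrentHashMap<>(), "zebra", "apple", "mango"));
    }
}
```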

The Request scope was using a servlet object for its map plus custom synchronization instead of the JDK's, which made it up to 3 times slower than other scopes. Today I reduced it to the simplest, fastest scope, since it really doesn't need any of that extra behavior to retain its CFML functionality.

The casting and scope-lookup overhead inside Lucee is about 20% of the total time on variable operations, which is fairly minor and hard to eliminate. The new typed arrays and structs feature could remove that 20% if the bytecode stored and retrieved values directly as the right type, but the language is dynamic, which forces Lucee to treat almost everything as Object for get/put variable operations. I checked the bytecode for typed arrays, and it still treats them as Object and casts them the same way.

If we knew the CFML variable types at compile time, Lucee could do the most direct casting (possibly none) instead of the logic and function calls it does now, because every hash map is Map<Key, Object> rather than Map<Key, String>, Map<Key, Double>, Map<Key, Component>, etc. That would be about 20% faster on average. The other 80% of the overhead is the hash-map search algorithm, which really can't be made faster in a general way; however, the JDK and JVM keep improving, so just upgrading Lucee to Java 11 makes Lucee CFML 15% to 30% faster. Beyond that, you have to find ways to avoid using maps at all.
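A small sketch of the difference (TypedVars and doubleFromMap are hypothetical; Lucee has no such holder today): a typed field compiles to a single GETFIELD with no cast or boxing, while the Map<String, Object> path needs a checkcast and unboxing on every read.

```java
import java.util.HashMap;
import java.util.Map;

public class TypedVarsSketch {
    // Hypothetical typed holder: reading price is one GETFIELD, no cast, no boxing.
    static class TypedVars {
        double price;
    }

    // How scopes work today: everything is Object, so every read needs a cast
    // (in Lucee, usually routed through a Caster-style helper) before use.
    static double doubleFromMap(Map<String, Object> vars, String key) {
        return (Double) vars.get(key);   // checkcast + unbox on every access
    }

    public static void main(String[] args) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("price", 9.99);

        TypedVars typed = new TypedVars();
        typed.price = 9.99;

        System.out.println(doubleFromMap(vars, "price") == typed.price); // true
    }
}
```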

I'm going to post a separate topic about making a TypeScript-style version of CFML. That would enable the compiler to optimize and validate additional things, aiding performance and improving tooling for the code.