Updated: > 400% threading boost - Testing Virtual Threads on my custom Lucee on Project Loom / Java 17 beta

I ran a new experiment tonight. I upgraded my Lucee build to the Project Loom build of JDK 17 to try the Virtual Threads feature (previously called fibers), which may be released someday. It isn't scheduled for JDK 17, even though it appears to work very well.

I made my Lucee CFML request threads run as virtual threads, and I set up my benchmarks so I can turn this on and off and test different levels of concurrency.

I found that virtual threads can finish up to 4 times faster than regular threads. They handle concurrent Lucee CFML requests very well, but it's still better to keep concurrency lower, since performance degrades as you increase it. These are the numbers I was getting on my quad-core CPU, running my trivial hello-world script for 1 million requests:

10,000 requests per second with 10,000 concurrent Lucee requests.
39,000 requests per second with 1,000 concurrent Lucee requests.
79,000 requests per second with 100 concurrent Lucee requests.
83,000 requests per second with 10 concurrent Lucee requests.

Virtual threads also benefit individual user requests, since the performance boost is there even when there is zero load on the system.

I might make new CFML functions like the following, which bypass how Lucee does cfthread to take advantage of Project Loom. I believe this is the best approach given how the bytecode is set up.

thread=startThread(component cfcObject, string methodName, struct args, boolean virtual);

resultArray=joinThreads([thread]);

// optional
stopThread(thread);

Currently, cfthread depends on the tag body being the code that gets executed, but virtual threads rely on Java lambda expressions and InvokeDynamic in the bytecode, which is a different structure that I don't know how to generate. This is why I'm making functions (not tags) that receive a callback function, with the native Java thread reference passed around. I think I can make this work very efficiently without having to generate any unusual CFML bytecode if I just make it possible to copy the PageContext and directly call the method on the object, passing the arguments struct to it. I'd probably skip the cfthread scope and operate directly on the struct that was passed in, to minimize the Java operations under the hood, since initializing a regular cfthread is pretty heavy in Lucee. There are hundreds of places where faster virtual threads would be useful, and it would be fun to write more things with parallelism.

The cost of cfthread and Java threads is so high in CFML that it is rarely worth using them; once you test, the thread overhead usually outweighs the gain. Virtual threads, though, should show measurable improvements most of the time.

The new virtual-thread executor is pretty odd: it supplies an unbounded number of virtual threads and doesn't let you limit how many. You can write your own pooling logic to control this, which is what I did, since performance degrades badly if you start thousands of new virtual threads without joining them back to the main thread.
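The limiting logic I describe is essentially a counting-semaphore pattern. Here is a minimal plain-Java sketch of the idea (the names are mine, not Lucee's or the JDK's): a Semaphore caps how many spawned threads can be in flight at once. I use platform threads here so it runs on any JDK; on a Loom build you would swap the `new Thread(...)` + `start()` pair for `Thread.startVirtualThread(...)` and the gating logic stays identical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Gate thread creation behind a Semaphore so at most `limit` tasks run at
// once, even if the caller submits thousands of them.
class LimitedSpawner {
    private final Semaphore permits;

    LimitedSpawner(int limit) { this.permits = new Semaphore(limit); }

    // Blocks until a slot frees up, then runs the task on a new thread.
    Thread spawn(Runnable task) {
        permits.acquireUninterruptibly();
        Thread t = new Thread(() -> {
            try { task.run(); } finally { permits.release(); }
        });
        t.start();
        return t;
    }

    // Demo helper: submit `tasks` jobs and report the highest number of
    // jobs that were ever running at the same moment.
    static int maxInFlight(int tasks, int limit) {
        LimitedSpawner spawner = new LimitedSpawner(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            threads.add(spawner.spawn(() -> {
                max.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                try { Thread.sleep(2); } catch (InterruptedException ignored) {}
                inFlight.decrementAndGet();
            }));
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException ignored) {}
        }
        return max.get();
    }
}
```

The permit is released in a finally block after the task finishes, so the cap holds even when a task throws.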

I think the JDK 17 beta might not be as optimized as the final JDK 16, because I'm getting slightly lower numbers. I find it more exciting to have a single request finish faster and to make the system more resilient to spikes by keeping lots of unused capacity; my production system is always nearly idle because I've made it so efficient already. JDK 17 is scheduled for September this year and will be the next LTS, so it could still be years before these preview features are officially finalized into a popular LTS release. It's a shame Project Valhalla and Project Loom are taking so long, but they will be amazing someday. Project Valhalla would bring C++-level performance to Java classes without making them any harder to write. The features added to Java in the last few years are pretty cool, though. I'm using the new syntax, and it's nice how they keep reducing how much you have to type.


Today, I went ahead and implemented virtual threads as a new CFML language feature and ran some benchmarks.

Though in my quest to make the fastest PageContext object possible, I ran into BUFFER_SIZE=100000; in the CFMLWriterImpl.java class, which wastes all that memory per thread even if you don't actually produce any output. While this wouldn't be very noticeable on a lightly loaded server, it actually makes my optimized custom Lucee build 50% slower under load testing. I submitted an enhancement issue suggesting it be made zero:

https://luceeserver.atlassian.net/browse/LDEV-3578
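To put that buffer in perspective, here is the back-of-the-envelope math. My assumptions (not measured from Lucee's heap): the buffer is a char[] at 2 bytes per char, and one buffer exists per concurrent request.

```java
// Rough cost of pre-allocating a 100,000-char output buffer per request
// thread. Assumes a char[] (Java chars are 2-byte UTF-16 units) and one
// buffer per concurrent request; figures are illustrative.
class BufferWaste {
    static long wastedBytes(long bufferChars, long concurrentRequests) {
        long bytesPerChar = 2;
        return bufferChars * bytesPerChar * concurrentRequests;
    }
}
```

At 10,000 concurrent requests, that pre-allocation alone is on the order of 2 GB of heap before any output is written.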

My benchmark numbers for CFML requests using virtual threads increased by 50% thanks to less memory waste in CFMLWriter:
15,000 requests per second at 10,000 concurrency.
66,000 requests per second at 1,000 concurrency.
125,000 requests per second at 100 concurrency.
175,000 requests per second at 10 concurrency.

That’s incredible.

But like I said, I took it further and implemented new CFML functions that let you choose a regular thread or a virtual thread with a single boolean argument. Here is my demo code. Also notice that I run two extra loops before outputting benchmark information, to make sure my CPU is running at full speed.

<cfcomponent output="yes">
<cfoutput>  
<cffunction name="start" localmode="modern" output="yes" access="remote">
	<cfcontent type="text/html; charset=UTF-8"> 
	<cfscript>
	// warm up CPU
	total=0;
	for(i=1;i<10000000;i++){
		total++;
	}
	threads=[];
	
	threadLimit=10000;
	start=getTickCount();
	useVirtualThread=true;
	for(i=1;i<=threadLimit;i++){
		data={
			id:i,
			start:start,
			doAnything:"I want"
		};
		arrayAppend(threads, { data:data, thread:startThread(this, "runThread", data, useVirtualThread)});
	}
	savecontent variable="out"{
		for(i=1;i<=threadLimit;i++){
			currentThread=threads[i];
			if(joinThread(currentThread.thread)){
				echo("Thread "&currentThread.data.id&" finished with value: "&currentThread.data.doAnything&chr(10));
			}else{
				echo("Thread "&currentThread.data.id&" failed"&chr(10));
			}
		}
	}
	//echo((getTickCount()-start)&"ms to start and join "&threadLimit&" virtual threads.");
	
	start=getTickCount();
	threads=[];
	useVirtualThread=false;
	for(i=1;i<=threadLimit;i++){
		data={
			id:i,
			start:start,
			doAnything:"I want"
		};
		arrayAppend(threads, { data:data, thread:startThread(this, "runThread", data, useVirtualThread)});
	}
	savecontent variable="out"{
		for(i=1;i<=threadLimit;i++){
			currentThread=threads[i];
			if(joinThread(currentThread.thread)){
				echo("Thread "&currentThread.data.id&" finished with value: "&currentThread.data.doAnything&chr(10));
			}else{
				echo("Thread "&currentThread.data.id&" failed"&chr(10));
			}
		}
	}
	echo((getTickCount()-start)&"ms to start and join "&threadLimit&" threads.");
	
	
	start=getTickCount();
	threads=[];
	useVirtualThread=true;
	for(i=1;i<=threadLimit;i++){
		data={
			id:i,
			start:start,
			doAnything:"I want"
		};
		arrayAppend(threads, { data:data, thread:startThread(this, "runThread", data, useVirtualThread)});
	}
	savecontent variable="out"{
		for(i=1;i<=threadLimit;i++){
			currentThread=threads[i];
			if(joinThread(currentThread.thread)){
				echo("Thread "&currentThread.data.id&" finished with value: "&currentThread.data.doAnything&chr(10));
			}else{
				echo("Thread "&currentThread.data.id&" failed"&chr(10));
			}
		}
	}
	echo((getTickCount()-start)&"ms to start and join "&threadLimit&" virtual threads.");
	//echo(out);
	</cfscript>
</cffunction>

<cffunction name="runThread" localmode="modern" access="public">
	<cfargument name="struct" type="struct" required="yes">
	<cfscript>
	total=0;
	/*for(i=1;i<=1000;i++){
		total++;
	}*/
	savecontent variable="out"{
	echo("output"&struct.id);
	}
	struct=arguments.struct;
	// generate a unique result for each thread
	struct.doAnything&=" after some work.  Did : "&total&" loops after "&(getTickCount()-struct.start)&"ms"&out;
	</cfscript>
</cffunction>
</cfoutput>
</cfcomponent>

Which produces this output:
1166ms to start and join 10000 threads.
36ms to start and join 10000 virtual threads.

When I run it more times, I get results showing that virtual threads, combined with all my other optimizations, have made Lucee threading anywhere from 400% to 3500% faster. If you do significant work inside the thread, it of course slows down a lot; my goal is to benchmark the implementation, not the work in this demo.

I also implemented the Java side the same way it works in pure Java, so the bytecode ends up generating an InvokeDynamic lambda expression, and the lambda calls my component's method and passes the struct to it. It's very simple inside and out. I also built a stripped-down version of PageContextImpl, which avoids as many Java function calls as possible and copies the existing PageContext fields as much as possible, skipping a lot of the redundant initialization overhead that's in the current implementation of cfthread. It's still dozens of operations, so it's nowhere near as simple as the bytecode you'd have in plain Java, but it's a lot better. It has to stay somewhat complex because, without a PageContext, all the variables and body output in your functions would bleed together across threads, which is horribly broken. So there is no alternative to keeping those features, but I found that I really only need to initialize a new CFMLWriter and a new local scope in my implementation of Lucee. In the official version, a lot more is required because of Adobe compatibility, but I deleted all of that…
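To illustrate the "copy fields, don't re-initialize" idea, here is a hypothetical sketch; these names are mine, not Lucee's real classes. The forked context shares the parent's state by reference and only creates the two per-thread pieces fresh: an output buffer (standing in for the CFMLWriter) and a local scope.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a cheap per-thread context fork: shared state is
// passed by reference, and only the per-thread output buffer and local
// scope are freshly allocated. Not Lucee's actual API.
class SketchContext {
    final Map<String, Object> application;  // shared with the parent context
    final StringBuilder out;                // per-thread output, starts empty
    final Map<String, Object> localScope;   // per-thread variables

    SketchContext(Map<String, Object> application) {
        this.application = application;
        this.out = new StringBuilder();     // no large pre-allocation
        this.localScope = new HashMap<>();
    }

    // Fork for a new thread: share, don't copy or re-initialize.
    static SketchContext forkFrom(SketchContext parent) {
        return new SketchContext(parent.application);
    }
}
```

The point of the sketch is what forkFrom does *not* do: no configuration lookups, no scope re-initialization, just two small allocations per thread.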

StartThread.java

public class StartThread extends BIF {

    @Override
    public Object invoke(PageContext pc, Object[] args) throws PageException {
        switch(args.length) {
            case 4:
                return call(pc, Caster.toComponent(args[0]), Caster.toString(args[1]), Caster.toStruct(args[2]), Caster.toBooleanValue(args[3]));
            case 3:
                return call(pc, Caster.toComponent(args[0]), Caster.toString(args[1]), Caster.toStruct(args[2]), false);
            case 2:
                return call(pc, Caster.toComponent(args[0]), Caster.toString(args[1]), new StructImpl(), false);
            default:
                throw new FunctionException(pc, "StartThread", 2, 4, args.length);
        }
    }

    public static Object call(PageContext pc, Component component, String methodName, Struct struct, boolean useVirtualThread) throws PageException {
        var pc2=((PageContextImpl)pc).getVirtualThreadPageContext();
        Collection.Key methodKey=new KeyImpl(methodName);
        var args=new Object[]{struct};
        if(useVirtualThread) {
            return Thread.startVirtualThread(() -> {
                try {
                    component.call(pc2, methodKey, args);
                } catch (PageException e) {
                    throw new RuntimeException(e);
                }
            });
        }else{
            var thread=new Thread(() -> {
                try {
                    component.call(pc2, methodKey, args);
                } catch (PageException e) {
                    throw new RuntimeException(e);
                }
            });
            thread.start();
            return thread;
        }
    }

}

And JoinThread.java

public class JoinThread extends BIF {

    @Override
    public Object invoke(PageContext pc, Object[] args) throws PageException {
        switch(args.length) {
            case 2:
                return call(pc, (Thread) args[0], Caster.toLong(args[1]));
            case 1:
                return call(pc, (Thread) args[0], 0);
            default:
                throw new FunctionException(pc, "JoinThread", 1, 2, args.length);
        }
    }
    public static boolean call(PageContext pc, Object threadObject, double timeoutInMilliseconds) throws PageException {
        var thread=(Thread) threadObject;
        try {
            if(timeoutInMilliseconds>0) {
                thread.join((long) timeoutInMilliseconds);
            }else {
                thread.join();
            }
        } catch (InterruptedException e) {
            return false;
        }
        return true;
    }
}
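One subtlety worth noting about the join semantics above: Thread.join(timeout) also returns when the timeout elapses, not only when the thread finishes, so the BIF reports true even for a thread that is still running after the timeout. A caller that cares can check isAlive() afterward. Here is a small plain-Java sketch of that stricter variant (a hypothetical helper, not part of my build):

```java
// join(timeout) returns when the thread finishes OR the timeout elapses;
// isAlive() afterward is what distinguishes the two cases.
class JoinDemo {
    static boolean joinWithTimeout(Thread t, long timeoutMs) {
        try {
            if (timeoutMs > 0) t.join(timeoutMs); else t.join();
        } catch (InterruptedException e) {
            return false;
        }
        return !t.isAlive();  // false when the timeout fired first
    }
}
```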

I downloaded the current release of the Lucee/Tomcat installer for Windows, turned on "never inspect templates", and ran the same CFML code for my request benchmarking.

I was able to get about 22,000 requests per second across the different concurrency levels. The server is limited by the Tomcat thread pool, of course, and the manual says 200 is the default. I went into server.xml and changed maxThreads to test other concurrency levels on my system. I also increased the heap to 4096 MB, since high concurrency needs a lot more memory, and I disabled the Tomcat access logging. I adjusted maxThreads and restarted the Tomcat service for each test.
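For reference, that tuning is a one-attribute change on the HTTP connector in Tomcat's conf/server.xml. The snippet below shows where maxThreads lives; the port and timeout are Tomcat's stock values, and the maxThreads figure is just one of the levels I tested, not a recommendation.

```xml
<!-- conf/server.xml: raise the request-thread cap on the HTTP connector.
     maxThreads defaults to 200 when not specified. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="2000" />
```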

10,000 concurrency failed to even start.
5,000 concurrency failed to even start (heap was only 1.3 GB, so not a memory problem).
1,880 requests per second at 2,000 concurrency.
6,800 requests per second at 1,000 concurrency.
23,600 requests per second at 100 concurrency.
30,000 requests per second at 10 concurrency.

It looks like I made Lucee about 6 times faster under load and able to support more than 5 times the concurrency without crashing (more of a servlet/Tomcat problem).

I also tested deleting Application.cfc, since that is one of the bottlenecks, and I only got a 10% increase in performance.

I've made thousands of changes to Lucee, so it's really the combination of reduced memory, reduced CPU, reduced I/O, and virtual threads that let me speed it up, not just one thing.

My changes impact every line of CFML bytecode as well, and these benchmarks were mostly just testing the servlet / engine startup. If you benchmarked a full application, my version would be closer to 10 times faster.

I'm really impressed with the performance of the current Tomcat 9 / Lucee 5.3, though. It's certainly not slow; I can see that many optimizations were made over the years I've been using it.

I've been able to achieve what I set out to do: prove that Lucee could operate on lighter threads without Tomcat and be more stable under higher concurrency, by trimming off all the optional stuff and lowering memory requirements. I was very interested in Node.js several years ago, but I think it's dumb to write everything in JavaScript (or TypeScript) when you already have something highly optimized running at the speed of Java. I think unexpected high concurrency is the biggest threat to a CFML application, since it will either crash or queue up and hang depending on your configuration. There are other ways to handle this outside the Lucee engine, but I've been exploring how much can be done from inside it.

I used to think you could get rid of the servlet layer and make it faster, but I'm no longer sure the servlet API is a cause for concern, since it doesn't force how you handle threads or anything else heavy. I can't find anything in the servlet API that looks slow; it's mostly interfaces, and you can do whatever you want with it, including disabling the parts you don't need. A lot of the Lucee code just overrides it with something reasonable that I would keep the same.

I didn't like any of the configuration system or the PageContext code. I know that stuff has to be the way it is to work for others, but if I could rewrite Lucee from scratch, I'd want a compiler whose structure was easier to manipulate. It would be nice if you could analyze the code like an XML tree, so you could transform it before generating the bytecode. If that were possible, I think you could determine that most variables can be treated like local or field variables.

I think a lot of tiny resources are being wasted in all the whitespace / output-buffering code, and the language would probably be better optimized and more consistent if all of the tags were deleted and everything was explicitly echoed when you want output. It would be better to have fewer temporary strings and build just one big buffer, but the response handling is hidden from the CFML user, so it's not optimal. A lot of the bytecode wouldn't even be there if the tag language didn't exist, and the bytecode generated for functions is much more efficient as it is. Nested tag features also make the compiler and the tag-processing logic more complicated. I've never wanted to make my own custom tag since first using the language back with ColdFusion 5; I only like functions and components/classes. I've already removed a bunch of things I don't use, including custom tag support.

Threads and external process communication are most of the problems, but when I think about CFML now, I mostly miss the features of IntelliJ IDEA and WebStorm. I can't get the same level of code navigation and code completion, and that's where CFML sucks. I'll probably become more of a Java programmer over time, since I don't think I want to turn CFML into Java. I have other people who help with simple work in CFML, so it doesn't really make sense to force them into classes and such, but I really hate how messy it gets for anything more complex. By making sure I can hack away at the Lucee core, I'm more confident that I can write better-organized Java, integrate it with our existing work, and still let CFML work like a view language for designer-type programmers.
