I successfully embedded my custom Lucee build in Nginx to go even faster with Java JNI and native optimizations

While HTTP on Java is pretty fast, the SSLEngine appears to be much slower than C / C++ servers like Nginx.

I just spent the last month learning how to build a C++ application and then I integrated it directly with Nginx as a module to allow it to have the fastest possible TLS performance.

I also spent time learning Java JNI (Java Native Interface) and building an integration of the Java version of the program to test its performance against the C++ version. I had excellent results on Java with HTTP, and I could even beat some Nginx benchmarks. But once I enable HTTPS on Java, even with the native OpenSSL provider for SSLEngine and Netty, the results are always 50 times slower. Even Tomcat has the option of native OpenSSL connectors to try to go faster, because there isn’t a pure Java way to be as fast.

So in summary, Java HTTPS TLS is very slow compared to Nginx TLS.

I’ve simplified my custom build of Lucee, which runs on JDK 17 with Project Loom virtual threads, and gave it a simple CLI-style start-up and functions for running CFML requests so I would have some entry points. More about my customizations is available here:
https://dev.lucee.org/t/updated-400-threading-boost-testing-virtual-threads-on-my-custom-lucee-on-project-loom-java-17-beta/8445/4

With JNI you can call C++ from Java, or you can create a Java virtual machine in C++ and call Java from C++. I just learned how to create the Java VM inside the Nginx module, so that I could load the custom Lucee build and call it. It worked after I struggled for a day with the details.

Both my C++ program and my custom Lucee build are able to fully saturate Nginx’s maximum TLS performance in my benchmarks, i.e. TLS always takes longer than the time it takes for the Lucee or C++ application to finish.

So in my virtual machine tests, I get about 20,000 to 25,000 TLS requests per second on localhost with my custom Lucee CFML function call.

I think it is very exciting to have embedded Lucee inside Nginx through the use of the Java JNI API and the Nginx module API. Almost 2 years ago, I thought it would be possible to eliminate the need for Tomcat and go faster, and I have finally done it. This means I no longer need a socket or reverse proxy at all for the Java process. I am able to pass data to Nginx at the speed of JNI, which means there is no inter-process communication like sockets to slow it down and I can avoid making copies of memory. This will also allow my Lucee application to have direct access to the C++ memory cache in my Nginx module, which makes it massively faster compared to sockets and queries.

I wasn’t sure how much faster C++ would be when I started the project. My C++ application is a recreation of my database-backed real estate listing search application as a pure C++ in-memory database. I was able to beat the performance of my existing Lucee / Tomcat / MySQL application by over 1000 times through TLS, and over 3000 times on HTTP.

I also took advantage of having native access to integrate libdeflate instead of the zlib that Nginx requires for its own gzip feature, and I compress the HTTP response myself instead of letting Nginx do it. libdeflate has SIMD optimizations for vectorization, which makes it able to go about twice as fast. In my benchmarks, the gzip response is the same speed as without it, which is impressive since the output is still 4 times smaller. I was also careful to make no extra copies of the memory when running it. By having direct control, I can be sure it is as efficient as it can be.

Also in my C++ search application, I learned how to use SIMD to do 8 of the search operations simultaneously, which made the overall performance 3 times faster. While Java HotSpot tries to apply vectorization optimizations on its own, it is no match for what you can do yourself directly in C++. The results of my C++ vs Java comparison in the fully optimized application without a network connection were massively different. Java could do about 60,000 searches per second, but C++ could do 2 million searches per second, again with no network. My C++ application is about 80 times faster than the time it takes for TLS to do a zero byte HTTP response.
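To illustrate what "8 search operations at once" can look like, here is a minimal sketch, not the code from this project. It assumes the search is a range filter over 32-bit integer values (for example prices) and that the 8-wide SIMD is AVX2; the post does not say which instruction set or which comparison was used. The function name `count_in_range` is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: counts how many values fall inside [lo, hi].
// When the compiler targets AVX2 (for example GCC/Clang with -mavx2),
// 8 values are compared per loop iteration; otherwise only the scalar
// loop at the bottom runs.
std::size_t count_in_range(const std::int32_t* values, std::size_t n,
                           std::int32_t lo, std::int32_t hi) {
    assert(lo <= hi);
    std::size_t count = 0;
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i vlo = _mm256_set1_epi32(lo);  // lo copied into all 8 lanes
    const __m256i vhi = _mm256_set1_epi32(hi);  // hi copied into all 8 lanes
    for (; i + 8 <= n; i += 8) {
        // Unaligned load of 8 consecutive 32-bit values.
        const __m256i x = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(values + i));
        const __m256i below = _mm256_cmpgt_epi32(vlo, x);  // lanes where lo > x
        const __m256i above = _mm256_cmpgt_epi32(x, vhi);  // lanes where x > hi
        const __m256i outside = _mm256_or_si256(below, above);
        // One bit per lane that is outside the range.
        const int mask = _mm256_movemask_ps(_mm256_castsi256_ps(outside));
        count += 8 - std::bitset<8>(static_cast<unsigned>(mask)).count();
    }
#endif
    // Scalar tail (and full fallback when AVX2 is not enabled).
    for (; i < n; ++i) {
        count += (values[i] >= lo && values[i] <= hi) ? 1 : 0;
    }
    return count;
}
```

Calling `count_in_range(prices, n, 250000, 300000)` returns the number of entries between the two bounds, inclusive. Both code paths give the same answer, so the scalar loop doubles as a reference when checking the vectorized one.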

Also, service nginx reload is still very fast even with Lucee embedded, because I used the ngx_link_function module to integrate my application. It is set up to let you load your C++ application as a dynamic library, so it lazy loads everything after Nginx is fully running. The first Lucee request takes less than half a second to load and fully respond with the first hello world CFML request. This is massively faster than the current Tomcat/Lucee startup time. My custom Lucee startup is multi-threaded in the hotspots, and I removed OSGi, which made it much faster as well. I have custom serialized caches for the tag/function library to avoid the XML overhead, and the application.cfc cache avoids the overhead of application.cfc, which doubled performance as I wrote about before. I have deleted 30% of the Lucee Java code, deleted all of its CFML code and deleted most of the Lucee extensions, so there is a lot less to load, and the security risk in my version is much lower because there is virtually no surface area to attack.

The reason my C++ application is so much faster is not just the language change. It is that I worked very hard to avoid all memory allocations after start-up and I built a custom index engine from scratch. There were times when my application was doing dynamic allocations, and once I got rid of them it went about 10 times faster, so the cost of constantly allocating memory on the heap is very high compared to a design that doesn’t need to do this. I did it the hard way with raw C++ pointers, which is a nightmare to debug with segmentation faults and such. Many times I was working 12+ hours a day just to get a few lines of code done. It is brutal. I also did this project for fun; no one is paying me to do all this. I store presorted indexes in C++, and I wrote optimization algorithms to pick the fewest records and search through the least data. Because I have full language features in C++, I am able to do operations against the complex database structures that I need in the application instead of running separate queries. As a result, I found the C++ application can actually do all of the queries in fewer loops, and I added numerous features that allow it to skip doing millions of operations in certain areas.
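The combination described above (presorted indexes, picking the smallest candidate set, and no allocation after start-up) can be sketched roughly as follows. This is an illustration under assumptions, not the actual index engine: the `Listing` fields, the `ListingIndex` class and its methods are all hypothetical, and it uses standard containers rather than the raw pointers mentioned in the post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical record; the real application's fields are not shown in the post.
struct Listing {
    std::int32_t price;
    std::int32_t bedrooms;
};

class ListingIndex {
public:
    // All heap allocation and sorting happens here, once, at start-up.
    explicit ListingIndex(std::vector<Listing> listings)
        : listings_(std::move(listings)),
          by_price_(listings_.size()),
          by_bedrooms_(listings_.size()),
          results_(listings_.size()) {
        assert(listings_.size() <= std::numeric_limits<std::uint32_t>::max());
        build(by_price_, &Listing::price);
        build(by_bedrooms_, &Listing::bedrooms);
    }

    // Writes matching listing ids into the preallocated results buffer and
    // returns how many were written. Performs no heap allocation.
    std::size_t search(std::int32_t lo_price, std::int32_t hi_price,
                       std::int32_t lo_beds, std::int32_t hi_beds) {
        const Range p = range(by_price_, &Listing::price, lo_price, hi_price);
        const Range b = range(by_bedrooms_, &Listing::bedrooms, lo_beds, hi_beds);
        // Scan whichever presorted index yields fewer candidates.
        const Range& r = (p.second - p.first <= b.second - b.first) ? p : b;
        std::size_t n = 0;
        for (Iter it = r.first; it != r.second; ++it) {
            const Listing& l = listings_[*it];
            if (l.price >= lo_price && l.price <= hi_price &&
                l.bedrooms >= lo_beds && l.bedrooms <= hi_beds) {
                results_[n++] = *it;
            }
        }
        return n;
    }

    const std::vector<std::uint32_t>& results() const { return results_; }

private:
    using Field = std::int32_t Listing::*;
    using Iter = std::vector<std::uint32_t>::const_iterator;
    using Range = std::pair<Iter, Iter>;

    // Fills the index with ids 0..n-1 and sorts them by the given field.
    void build(std::vector<std::uint32_t>& index, Field f) {
        std::iota(index.begin(), index.end(), std::uint32_t{0});
        std::sort(index.begin(), index.end(),
                  [this, f](std::uint32_t a, std::uint32_t b) {
                      return listings_[a].*f < listings_[b].*f;
                  });
    }

    // Binary search for the slice of ids whose field value lies in [lo, hi].
    Range range(const std::vector<std::uint32_t>& index, Field f,
                std::int32_t lo, std::int32_t hi) const {
        const Iter first = std::lower_bound(
            index.begin(), index.end(), lo,
            [this, f](std::uint32_t id, std::int32_t v) {
                return listings_[id].*f < v;
            });
        const Iter last = std::upper_bound(
            first, index.end(), hi,
            [this, f](std::int32_t v, std::uint32_t id) {
                return v < listings_[id].*f;
            });
        return Range(first, last);
    }

    std::vector<Listing> listings_;
    std::vector<std::uint32_t> by_price_;
    std::vector<std::uint32_t> by_bedrooms_;
    std::vector<std::uint32_t> results_;  // reused by every search
};
```

After `n = index.search(250000, 300000, 3, 4)`, the first `n` entries of `index.results()` hold the matching listing ids. Because the result buffer is reused, each search overwrites the previous one, which is the trade-off for not touching the heap per request.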

While integrating MySQL in C++ to load my data, I even found that the C connector has a feature that allows much faster loading of the data, because you can predefine all the memory locations with a MYSQL_BIND structure. Compared to JDBC this is a huge difference in performance. It is impossible for JDBC to match C connector performance on MySQL because of how it works on the inside, but with the C connector you also have to do pretty much every step in code. It takes about 200 times more code compared to writing a query in CFML, so obviously this is why we like CFML. Write less, and forget about performance.

But my personal efforts to take the slowest part of our commercial websites and make it 1000 times faster have been realized, and I used the same techniques to squeeze out the maximum performance possible with my custom Lucee and Java JNI Nginx module.

Everyone should have a deep respect for the fact they don’t have to write C++ code. Trust me that managing memory yourself makes the work take about 50 times longer than the equivalent features in a language like CFML: I have spent over 300 hours on this C++ program and it’s only 4000 or 5000 lines of code. I think I spent about 80% of that time rewriting it and bug hunting. I love the suffering of doing it the hard way and learning more about the machine, but no one else should do this.

Java is incredible compared to the C++ language, but the constant churn on memory / heap is unfortunately mostly unavoidable. By integrating these technologies at native, direct speed, I don’t have to compromise on performance when passing data between them, which lets me optimize in specific areas and get the full benefits without abandoning CFML entirely.


Following on from my DM on the CFML Slack, I’d love to hear more about this, Bruce. I think the Nginx module is really interesting.

Alex,

Cool, I replied to your DM on the CFML Slack. I am available to discuss.