Almost 130,000 requests per second for a CFML engine on one quad-core CPU - Discovered some Java-internal Application.cfc context / PageContext caching and page-source I/O optimization possibilities

Regarding further progress on my Lucee fork discussed here:

I had theorized that Application.cfc could be optimized internally, and now I have proven it. Regardless of whether this behavior change breaks the CFML language standard, I was willing to tinker to see how to remove the overhead of Application.cfc and the Lucee internals surrounding it, since I could tell something was wrong here in my benchmarking: Lucee is at least twice as fast when you delete Application.cfc. This overhead is very significant because the configuration is reloaded on every request instead of once during start-up, and there is also disk I/O.

Internally in Lucee's Java code, I found two ways to optimize this and achieve a massive performance increase while still allowing concurrent access to the database and overall thread safety.

#1: I found that PageSourceImpl.java constantly performs disk I/O to verify the existence of Application.cfc.

To fix this, I created a static ConcurrentHashMap cache of “exists” results in physcalExists (the misspelling is in Lucee's source) in PageSourceImpl.java. I also added a cache clear to the call function of the PagePoolClear tag, so that existence is only tested again when the application developer wants it.
When a developer sets their application to “never” inspect templates, they would expect that Application.cfc also doesn't get inspected or cause disk I/O overhead, but unfortunately Lucee still performs this disk I/O just for Application.cfc. I believe at one point there was caching here, but it was removed, because there is a HashMap remnant in the code. Perhaps there are compatibility reasons not to do this, but it's something very good for those willing to optimize for best performance!
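The idea behind the exists-cache can be sketched in a few lines. This is only an illustration, not the actual Lucee code: `ExistsCache`, `diskCheck`, and the method shapes are mine; the real change lives inside `PageSourceImpl.physcalExists()` and the PagePoolClear code path.

```java
import java.io.File;
import java.util.concurrent.ConcurrentHashMap;

// Sketch (hypothetical names): cache the result of the expensive exists()
// disk check per page-source path, and let a PagePoolClear-style call wipe
// the cache so file changes are picked up again on demand.
public class ExistsCache {
    private static final ConcurrentHashMap<String, Boolean> EXISTS =
            new ConcurrentHashMap<>();

    // Stand-in for the real disk check performed by PageSourceImpl.
    static boolean diskCheck(String path) {
        return new File(path).exists();
    }

    public static boolean physcalExists(String path) {
        // computeIfAbsent hits the disk only the first time per path
        return EXISTS.computeIfAbsent(path, ExistsCache::diskCheck);
    }

    public static int size() {
        return EXISTS.size();
    }

    // Called from the PagePoolClear() code path in this sketch.
    public static void clear() {
        EXISTS.clear();
    }
}
```

The design point is that `ConcurrentHashMap.computeIfAbsent` is thread-safe without any explicit locking, so concurrent requests never race on the disk check.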

#2: I found that the configuration from Application.cfc is reloaded on every request. While this behavior might ensure maximum compatibility, it would make sense to create a mode, and document it, that loads Application.cfc values once and only once during the lifetime of the application scope. This is a massive issue for performance because Application.cfc processing is not trivial.

To fix this, I updated initApplicationContext in PageContextImpl.java and ModernAppListener.java to use a ConcurrentHashMap for both the component instance and the ApplicationContext instance. I recognize other changes would be required for the other application listener modes like classic and mixed, but in my Lucee fork I have removed those to avoid any additional code execution.
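The shape of that fix is the same pattern as the exists-cache, keyed per application. The sketch below is illustrative only: `AppContext`, `buildContext`, and the key type are invented stand-ins for what Lucee actually constructs inside `initApplicationContext`.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: build the Application.cfc context object once per
// application key and reuse it, instead of rebuilding it on every request.
public class AppContextCache {
    public static final class AppContext {       // stand-in for Lucee's context
        final String name;
        AppContext(String name) { this.name = name; }
    }

    private static final ConcurrentHashMap<String, AppContext> CONTEXTS =
            new ConcurrentHashMap<>();
    static int builds = 0; // counts how often the expensive path actually ran

    static AppContext buildContext(String key) {
        builds++; // in Lucee, this is where Application.cfc would be processed
        return new AppContext(key);
    }

    public static AppContext get(String key) {
        return CONTEXTS.computeIfAbsent(key, AppContextCache::buildContext);
    }

    public static void clear() { CONTEXTS.clear(); } // e.g. on PagePoolClear()
}
```

Every request after the first gets the cached instance back, which is exactly why thread safety of the cached component (discussed further down this thread) becomes the developer's responsibility.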

These two optimizations allow a massive performance increase in Lucee for simple requests, in simple load testing using a Java future thread pool run from the IntelliJ IDE, not on a production server. A production system could do even better.
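For anyone wanting to reproduce this kind of measurement, a minimal version of such a future-thread-pool harness looks roughly like this. Everything here is my sketch: `handleRequest()` is a placeholder for whatever request path you are benchmarking, and the numbers are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal load-test harness: a fixed thread pool submits futures, each
// future runs N simulated "requests", and wall-clock time gives req/sec.
public class MiniBench {
    static long handleRequest() {
        long x = 0;                       // placeholder work standing in for
        for (int i = 0; i < 100; i++) x += i; // a real CFML request
        return x;
    }

    public static double run(int threads, int tasks, int requestsPerTask) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<>();
        long start = System.nanoTime();
        for (int t = 0; t < tasks; t++) {
            futures.add(pool.submit(() -> {
                long sum = 0;
                for (int r = 0; r < requestsPerTask; r++) sum += handleRequest();
                return sum;
            }));
        }
        try {
            for (Future<Long> f : futures) f.get(); // wait for completion
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        pool.shutdown();
        return (tasks * (double) requestsPerTask) / seconds; // requests/sec
    }

    public static void main(String[] args) {
        System.out.printf("%.0f requests/sec%n", run(4, 1000, 10));
    }
}
```

Batching several requests per submitted task, as the post mentions later, amortizes the task-submission overhead so you measure the request path rather than the pool.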

My Lucee fork was already two or three times faster before these changes, but now it is extremely fast, close to raw Java server-socket limits. Original Lucee would struggle to reach 4,000 or 5,000 requests per second on my system, but now I can hit 25,000 to 48,000 requests per second on my Intel 4790K quad-core CPU on Windows, running the same “hello world” kind of CFML code as before. Amazing! These CFML requests are being run internally by a custom CLI script, not over the network, since I'm just trying to benchmark and optimize the CFML engine request flow. I'm sure TCP network overhead would slow it down some, but I just removed a bottleneck in Lucee that lets it handle the normal CFML request flow up to 10 times faster. I have also made many other optimizations and removed features that add to the performance gains, but these two items are very significant and, I believe, easy to address in the original Lucee code that everyone uses.

I also verified these optimizations are not in the current master branch of Lucee on GitHub, so the opportunity still exists.


Is there any reason you never file pull requests?

There was some work done in 5.3.8 which might not have been merged into 6.0 just yet:

https://luceeserver.atlassian.net/browse/LDEV-3290
https://luceeserver.atlassian.net/browse/LDEV-3293
https://luceeserver.atlassian.net/browse/LDEV-3288
https://luceeserver.atlassian.net/browse/LDEV-3287

I have posted some things on Jira now and before, but keeping up with that over the long haul is not really my goal, since I'm just treating Lucee like my own personal project now. I'm just trying to bring awareness to what is possible, or what is wrong, where I see a way to help the community. Lucee 5.3 was really solid, so I mostly just work on performance and on integrating my application with it. Michael Offner usually has more comprehensive knowledge of all the compatibility and source concerns, and I can't pretend to be able to do what he is doing for the community. So he will usually implement it another way in the end, if it is worthy of being looked at.

Thanks!


I clarified my JIRA post based on the links you sent @Zackster

https://luceeserver.atlassian.net/browse/LDEV-3564

I want to note that the main difference in my approach that hasn’t been attempted yet is to make sure that Application.cfc is not created or executed. Only the event functions should fire.

The main reason for the Application.cfc bottleneck is not only the loading of the page and component, but the repeated creation and reading of the entire application configuration on every request.

I don’t think anyone has questioned that yet.

In my application, I rely on cffunction localmode=“modern” behavior as the only behavior the language supports, so I can guarantee that all variables in my components are local by default. That is important for this Application.cfc performance suggestion, since others might feel they are forced to create a new Application.cfc every request because there is no guarantee all their functions are localmode=“modern”. The developer would need an option to say that they have done this correctly; otherwise they might have variables-scope behavior that is not thread-safe and could break under load. In terms of thread safety, Application.cfc currently behaves like a new instance every request, rather than a CFC coming from shared memory. The majority of my application's CFCs are written to be thread-safe, or duplicated correctly, so that the code outside of functions is not a factor.

In my fork, implicit-scope warnings are exceptions, because the other Java code paths aren't there.

Maybe the option would be called “Cache Application.cfc Instances”, with a note: you must use localmode=“modern” on your Application.cfc functions and make sure your Application.cfc code is thread-safe. To clear the cache, run PagePoolClear(). The original behavior, where each request creates a new Application.cfc instance, applies when this box is unchecked. If you make admin settings changes, the cache will also be cleared automatically so the new settings get reloaded.

Whoops, I forgot my loop was running two CFML requests per iteration instead of one, in order to verify that two different CFML templates actually run without breaking, since handling the PageContext wrong can cause errors. I've also been removing and tweaking more things today, so I might have made it a little faster than it was. I originally posted 50,000 requests per second, but I'm updating the title to 130,000 requests per second.

If I tweak the benchmark a little, I can get as high as 130,000 internal CFML requests per second, about twice what I was saying. I ran it with 1 million requests at 10 requests per thread, and it finished in about 7.7 seconds, multiple times. Without the Application.cfc cache it takes 21 seconds, which is roughly 48,000 requests per second. This is on a highly modified, stripped-down Lucee.
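The arithmetic behind those figures is just total requests divided by wall-clock seconds; a trivial helper makes the rounding explicit:

```java
// Sanity-checking the throughput math from the numbers above:
// 1,000,000 requests in 7.7 s with the cache, 21 s without.
public class Throughput {
    public static long perSecond(long requests, double seconds) {
        return Math.round(requests / seconds);
    }

    public static void main(String[] args) {
        System.out.println(perSecond(1_000_000, 7.7));  // ~129,870 with cache
        System.out.println(perSecond(1_000_000, 21.0)); // ~47,619 without
    }
}
```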

The interesting thing about the Application.cfc cache mode is that my CPU actually hit 100% briefly on all cores; it couldn't get over 60% without it, so this really is a bottleneck.

Also, if I just loop without creating any threads, I get 86,000 requests per second with the Application.cfc cache on, and 22,000 requests per second with the cache off. There appeared to be a lot of garbage-collection overhead, since my CPU usage dropped a lot more without the cache.

Doing more requests per thread hides the cost of a thread, which is about 6% of the total time.

I also checked whether other JVM garbage collectors or heap sizes would change performance. The defaults for Java 16 were the fastest, though.


This all sounds very awesome!


In my “utility app”, instead of trying to load Application.cfc with everything, I load it with a single query using the cachedwithin or cachedafter attribute, leaving the query in memory with all the application variables from a table; you can then use QoQ to further serve out config data as needed.

Your mileage may vary greatly depending upon your hardware, etc., but you may be able to move some of your I/O overhead to memory.


Even with an empty Application.cfc, the overhead is still there, because the Lucee configuration has to be processed internally, which is heavier than the CFML code since everything in the Lucee admin is considered too.

Tonight, I figured out how to make my Java web server, with my own custom HTTP 1.0 parser using non-blocking async socket I/O, handle HTTP 1.1 keep-alive requests. I was able to get up to 82,000 requests per second through the localhost network on Windows, which is super fast. The default install of the newest version of Nginx on Windows is only 6,000 requests per second for static HTML for some reason, and I can get it to 20,000 per second with open-file caching enabled. It should be faster than that, but maybe only on Linux.

It runs about 10,000 requests per second without keep-alive. I tweaked OS network settings to get better results. This is very exciting to me, because I have direct control of the parsing and response logic in my Java code with no third-party libraries, plus direct control of the Lucee entry point and bytecode, so it's super efficient. I wrote a “byte by byte” streaming parser for HTTP to use the least memory and CPU possible. It can handle form fields, file uploads, and UTF-8 as well. When I map this to Lucee, I should be able to do close to 50,000 requests per second once you factor in the network overhead. I still need a proxy server in front of Java, but not for performance reasons. In more realistic benchmarks the numbers will get a lot smaller; I'm just measuring relative improvements in the areas I'm working on.
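To make the “byte by byte streaming parser” idea concrete, here is a toy state machine in the same spirit. It consumes one byte at a time, as a non-blocking socket would deliver them, and only buffers the current token. This is my illustration of the technique, not the parser from the post: it handles only the request line and headers, with none of the form-field, upload, or UTF-8 handling mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy byte-by-byte streaming HTTP header parser: a small state machine
// that never buffers more than the token currently being read.
public class ByteHttpParser {
    enum State { METHOD, PATH, VERSION, HEADER_NAME, HEADER_VALUE, DONE }

    State state = State.METHOD;
    StringBuilder token = new StringBuilder();
    String method, path, version, headerName;
    Map<String, String> headers = new LinkedHashMap<>();

    // Feed one byte; returns true once the header block is complete.
    public boolean feed(byte b) {
        char c = (char) (b & 0xFF);
        switch (state) {
            case METHOD:
                if (c == ' ') { method = take(); state = State.PATH; }
                else token.append(c);
                break;
            case PATH:
                if (c == ' ') { path = take(); state = State.VERSION; }
                else token.append(c);
                break;
            case VERSION:
                if (c == '\n') { version = take(); state = State.HEADER_NAME; }
                else if (c != '\r') token.append(c);
                break;
            case HEADER_NAME:
                if (c == ':') { headerName = take(); state = State.HEADER_VALUE; }
                else if (c == '\n') state = State.DONE; // blank line ends headers
                else if (c != '\r') token.append(c);
                break;
            case HEADER_VALUE:
                if (c == '\n') {
                    headers.put(headerName, take().trim());
                    state = State.HEADER_NAME;
                } else if (c != '\r') token.append(c);
                break;
            case DONE:
                break;
        }
        return state == State.DONE;
    }

    private String take() {
        String s = token.toString();
        token.setLength(0);
        return s;
    }
}
```

Because `feed` keeps all progress in the parser's fields, it works identically whether the bytes arrive in one read or one at a time, which is what makes this style fit non-blocking I/O.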

I benchmarked queries yesterday and found that I can do around 110,000 requests per second looping over an application-scope cached query, and 20,000 requests per second with a live query. Those are best-case numbers. I don't think it is possible to optimize the database access further than I have: I removed all the extra features in cfquery, and the bytecode is still very close to the same. The changes I made to the CFML bytecode for queries do make them more efficient when coming from shared memory, though it's hard to measure that without going back to the original version.

I tested HandlerSocket for MySQL, but the performance wasn't any better through the Java client. The most important thing is to select only the columns you need, since unused data has a high cost in JDBC / MySQL. I don't worry about this stuff generally, and store a lot of things in simple arrays/structs in my application, so query types are avoided when I want it faster. I just wanted to see what is possible for the database internally, but it looks like not much.


Yes, Application.cf* does get called per request; that is a hallmark of CFML and has been that way as far back as I can remember.

As for MySQL, depending upon the version, there are a number of things you can do to improve performance, be it for a table, a database, or a cluster.

First, host on *NIX whenever possible. It's literally 40 percent faster on any Linux version and 62 percent faster on any BSD version.

My personal checklist for MySQL / MariaDB tweaking:
Performance schema = off
Logging = off
Always connect via IP address to the local host or local adapter (127.0.0.1, not localhost or 192.x.x.x)
Understand the storage engine:
myisam for fast locking read access, with no great repair options for write corruption
innodb for general write and read access (it's slower than myisam)

Memory:
Make sure you have 1.5x swap on the host.

Server config:

query_cache_size = For performance testing, start out at something like 64 MB and move up accordingly, to about 200-500 MB if you have a really heavily used cluster churning out the same non-changing data.

max_connections = This may be the wall you are hitting in your test; it usually is. Set this value to at least 500 if you are on a server with over 2 GB of memory. For performance testing I start out at 300 and adjust accordingly in increments of 100. This is max threads / global buffers (and memory).

innodb_buffer_pool_size = This setting allocates system memory as a data cache for your database. By default I set this to 16 MB, which is usually overkill for most smaller applications, and lower it accordingly. If you are using blob storage in MySQL, then you will want to raise this value.

innodb_io_capacity = This is the raw read/write rate your disks can handle (IOPS). The default value is 200, which is close to a consumer-level 7200 RPM drive. My suggestion is to set it to 500 minimum.

innodb_adaptive_flushing_lwm = 0 ; Disable the pre-flushing behavior on a benchmark system under heavy load, to get better actual numbers for performance tuning.

innodb_flush_neighbors = off ; If you are on a virtual machine or an SSD box, this value should be off, as it is for spinning-disk operations. Why it's set to ON by default is beyond me.


I have my MySQL setup very optimized and hand-wrote all my queries and indexes. I want to beat the overhead of the socket connection by using native language memory tricks. I just built a version of my real estate search with Java primitives, and it is 1,000 times faster than my optimized MySQL version of the same thing. I was able to create int array lookup tables that are very fast for most of it. Real estate search can't rely on a single primary-key index, and the data keeps changing all day, so these are my slowest public queries. They take 10 ms to 100 ms and have a lot of LIKE '%%' statements, since it isn't possible to do things another way in MySQL, and joins are always slower. I do a lot of precomputed lookup tables, but the main search is a denormalized MEMORY or InnoDB engine table. It's very fast, but the server could be overwhelmed by a mild denial-of-service attack, so I'd like to do more optimizations on it; plus it's fun for me. In my Java version, I can do 20,000 to 200,000 native memory searches per second instead of something like 20 to 100 queries per second. If I could precompute all the searches, I'd do that, but it's not possible.
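The flavor of those int-array lookup tables can be shown with a small sketch. The field layout, names, and data here are all invented for illustration; the idea is simply that listings are stored column-wise in parallel primitive arrays and one tight loop applies every filter at once, with no SQL parsing or joins.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "primitive arrays instead of SQL": one slot per listing,
// one column per field, one pass over hot memory per search.
public class ListingSearch {
    final int[] price, beds, baths, cityId;

    ListingSearch(int[] price, int[] beds, int[] baths, int[] cityId) {
        this.price = price;
        this.beds = beds;
        this.baths = baths;
        this.cityId = cityId;
    }

    // Returns the indexes of listings matching every filter in a single
    // pass; each check is just an integer comparison.
    List<Integer> search(int city, int minBeds, int minBaths, int maxPrice) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < price.length; i++) {
            if (cityId[i] == city && beds[i] >= minBeds
                    && baths[i] >= minBaths && price[i] <= maxPrice) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

A scan like this stays cache-friendly because each column is a contiguous int array, which is a big part of why it can beat a socket round-trip to a database for this kind of multi-filter query.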

Today, I learned how to do native integration using Java JNI and set up a hello-world project that can be part of my Lucee build. I'm going to translate my real estate search algorithm to C/C++ and then call it from Java and/or CFML. I think I will be able to go from the limitations of JDBC / sockets to something that can do up to 1 million queries per second while also doing all the logic I need in fewer operations. These searches have to compare map coordinates, city, property type, beds, baths, price, etc. at the same time, plus sort on date or price. I'm doing some presorted indexing to make it faster, and I have some code that reduces the data I loop over so the worst-case performance is very good. It's pretty cool to use the algorithms (quicksort, selection sort) directly instead of just relying on an ORDER BY statement. Most of my performance optimizations have been for this real estate application. Most of our other work is already served from simple cached CFML arrays/structs, and I have something like a 20 GB heap to fit it all now. The majority of our projects serve data from an in-memory CFML database that is updated whenever the data changes, or at start-up. I'm pretty close to the limits of what is possible with CFML, and that's why I tinker with the Lucee project.

I do client-side optimizations too.