A review of Lucee start and restart times on Tomcat 8


#1

I created a bunch of time measurements throughout the Lucee loader and core in order to measure what is taking all the time and found that nearly all of the time occurs outside lucee’s code or is doing I/O operations. I know lucee records some of the measurements in its own logs, but that wasn’t recording the full amount of time Lucee code is running during startup. Now that I have isolated the hot spots, I may work on those areas or eliminate some of these features.

When we click restart engine in the Lucee admin, the code tries to stop and remove the OSGi bundles before running a reduced version of the startup again. In addition to that looking unnecessary to me, the sleep(5000) looks wrong considering it was only taking 150ms of time to finish that code and it could use a loop with logic instead if it really needs this code to run. I don’t think Felix even requires you to keep reloading them like this, but maybe you have an edge case to worry about. I think Felix bundle installation is more of a one time occasion, and then you only have to do it again on upgrades for specific bundles. Lucee shutdown becomes like 150ms instead of 5150ms now with no consequence that I can see yet, which is pretty awesome, since this saves me significant time iterating on changes to Lucee now.

The full Tomcat/Lucee in windows takes about 10 seconds to shutdown, and about 4 seconds to start again. Sometimes more then that. Also, anyone using multiple contexts would experiencing much worse startup performance probably. I switched to developing everything as one context over 7 years ago to avoid that problem, but they could at least be made to load in parallel if others are using this feature.

1500ms of the start up time is happening after lucee CFMLServlet starts, so it seems like the majority of the wait to start is happening in Tomcat and servlet jars and we’d wait at least 2 seconds even if Lucee did nothing. This startup time also includes running and finishing the first CFML request, not just starting the server. I’m also using a fast SSD with a core i7 cpu, so it could be worse then this for others.

Felix is starting and loading all the OSGi bundles in about 500ms. 200ms of that could probably done in parallel saving up to 150ms. Most of it occurs in Felix outside of Lucee’s code unfortunately. Even if you delete all of Lucee’s extensions in a custom build, I think it would only start up to 100ms faster.

I think some of these startup and shutdown processes could verify if they are needed more, instead of always doing them for OSGi loading and the XMLConfigWebFactory and deploying server/web context files. At a glance, I think it could be made up to 4 times faster with the same feature set. With some reorganization, perhaps it is possible to get the OSGi, XML and Context loading systems all running on separate threads in parallel since they are each around 500ms+. Additionally, I think some of the code could be deleted or made only to happen during upgrade/install events and even then, it is rare for all the files to need to change.

On my version, I might not even have most of these features that slow down start-up eventually. I think there only needs to force a cfclasses and temp directory to exist for the language to work, and I don’t want the Lucee admin, the cfml functions, or any of the xml code later. I think the majority of the files should be trusted and already exist on repeat startups. It could wait until a user explicitly requests an upgrade before trying to repeat some of these steps, or upgrades could be handled outside the project with a command line tool instead. Sometimes tracking md5 hashes or date/size of the data is all you need to eliminate lots of processing time.

After the CFML engine is loaded, it seems like the first CFML request takes another 200ms to 300ms startup time for a hello world kind of page that usually takes less then a millisecond. Some of that is probably servlet and initialization work that might be able to sped up or eliminated.

A clean start would probably take under a second to be able to run CFML code if there was no servlet, no XML config system, no dependencies and no OSGi for the simplified core language, which is would be 3 to 5 times faster then the full Lucee feature set allows.

Most of these startup features are there in order to make Lucee work in multiple language modes, the ability to extend it externally easier and to make the Lucee admin able to work. The core language itself is much faster/simpler to build/deploy/start if you forego these features and look at developing your own fork or looking into what I’m doing with Lucee.

Here is the raw log of measurements I took for a clean start of my modified version of Lucee 5.3.2.16. I made up some of the names based on the surrounding code. These don’t include the time for tomcat to start. Each line is measuring the time spent since the previous log item in sequence. I don’t think threads are used anywhere in startup yet.

1ms for CFMLServlet init start
0ms for CFMLServlet init Servlet init end
5ms for CFMLEngineFactory getInstance begin
3ms for CFMLEngineFactory getInstance after new factory
0ms for CFMLEngineFactory getInstance after readInitParam
0ms for CFMLEngineFactory initEngine begin
1ms for CFMLEngineFactory after patch check
1ms for CFMLEngineFactory initEngine before loadBundles
0ms for BundleLoader loadBundles begin
3ms for CFMLEngineFactory before new Felix
70ms for CFMLEngineFactory after new Felix
182ms for CFMLEngineFactory after start Felix
0ms for BundleLoader loadBundles after close all bundles
54ms for BundleLoader loadBundles after addBundle set 1
1ms for BundleLoader loadBundles after addBundle set 1
6ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
1ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
4ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
1ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
1ms for BundleLoader loadBundles after addBundle set 1
3ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 1
1ms for BundleLoader loadBundles after addBundle set 1
2ms for BundleLoader loadBundles after addBundle set 2
10ms for BundleLoader loadBundles after addBundle core
1ms for BundleUtil.start begin
84ms for Bundle.start before bundle load wait
0ms for Bundle.start bundles loaded
12ms for CFMLEngineFactory initEngine after loadBundles
127ms for CFMLEngineImpl getConfigServerImpl begin
1ms for XMLConfigServerFactory begin
14ms for XMLConfigServerFactory after loadDocumentCreateIfFails
1ms for XMLConfigServerFactory after getDocumentElement
152ms for XMLConfigServerFactory after new ConfigServerImpl
18ms for XMLConfigWebFactory load begin
19ms for XMLConfigWebFactory load after deployWebContext
0ms for XMLConfigWebFactory load after deployWeb
1ms for XMLConfigWebFactory load after loadConfig
78ms for XMLConfigWebFactory load after many load operations
7ms for XMLConfigWebFactory load after loadLib
110ms for XMLConfigWebFactory load after many load operations
47ms for XMLConfigWebFactory load after many load operations
41ms for XMLConfigWebFactory load after many load operations
9ms for XMLConfigWebFactory load after many load operations
0ms for XMLConfigWebFactory load end
0ms for XMLConfigServerFactory after load
3ms for XMLConfigServerFactory after createContextFiles
0ms for XMLConfigServerFactory end
0ms for CFMLEngineImpl getConfigServerImpl after newInstance for the xml configServer
2ms for CFMLEngineFactory initEngine before getUpdateType
0ms for CFMLEngineImpl getConfigServerImpl begin
0ms for CFMLEngineFactory initEngine after getUpdateType
0ms for CFMLEngineFactory initEngine end
0ms for CFMLEngineFactory getInstance after initEngineIfNecessary
0ms for CFMLServlet addServletConfig start
28ms for CFMLServlet loadJSPFactory start
0ms for CFMLEngineImpl getConfigServerImpl begin
1ms for CFMLEngineImpl getConfigDirectory end
8ms for CFMLFactoryImpl created
37ms for XMLConfigWebFactory load begin
2ms for XMLConfigWebFactory load after deployWebContext
0ms for XMLConfigWebFactory load after deployWeb
0ms for XMLConfigWebFactory load after loadConfig
13ms for XMLConfigWebFactory load after many load operations
1ms for XMLConfigWebFactory load after loadLib
27ms for XMLConfigWebFactory load after many load operations
11ms for XMLConfigWebFactory load after many load operations
19ms for XMLConfigWebFactory load after many load operations
4ms for XMLConfigWebFactory load after many load operations
1ms for XMLConfigWebFactory load end
4ms for CFMLServlet loadJSPFactory end
0ms for CFMLServlet init context(s) end
0ms for CFMLEngineFactory getInstance after addServletConfig
0ms for CFMLEngineFactory getInstance(config) end
0ms for CFMLEngineFactory readInitParam end
0ms for CFMLEngineFactory initEngine end
0ms for CFMLServlet addServletConfig start
0ms for CFMLEngineFactory addServletConfig end
3ms for CFMLServlet init CFML factory instance end
90ms for CFMLServlet service start
0ms for CFMLEngineImpl getConfigServerImpl begin
0ms for CFMLEngineImpl getCFMLFactory end
0ms for CFMLEngineImpl _service begin
36ms for PageContextImpl created
40ms for PageContextImpl initialize start
5ms for PageContextImpl initialized
0ms for CFMLEngineImpl _service after create PageContextImpl
0ms for CFMLEngineImpl _service before Request.exe
1ms for PageContextImpl execute begin


#2

So far, I found there was very little that can be done to improve Felix startup.

The hidden feature made for this ticket makes it so that Felix loads the bundles itself on startup:
https://luceeserver.atlassian.net/browse/LDEV-1757?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

All the individual addBundle calls become 0ms after setting those options to “none”, which means the option just changes which project does the initial Jar loading, instead of improving performance. I also think felix might not be loading them in parallel, but even if they did, it’s less then 30ms faster when I tried to do it. I have a feeling Felix might prevent parallel loading of bundles internally. There doesn’t seem to be anything better then Felix.

Then for the loadAvailableBundles, I made it read the jar manifests in parallel, which only saves 30ms, but that’s several times faster at least.

I can’t get the OSGi stuff any faster then around 300ms without deleting features and nearly all of that is happening in the Felix project, so I’d have to look at their code to do anything more.

There are about 50 bundle jars needed for Lucee to start. However, the lucee loader is already running before those 50 jars have to load, so it seems that startup has do more parallel work in other ways in the loader project to go faster. I’m going to try moving some of the slower I/O startup features from the core to the loader and make them run in parallel. Then the core would only need to access the data already loaded which may cut out a good chunk of time. These might be new simple data containers in order to keep the implementation mostly in the core still.


#3

I don’t think that’s correct. I made that change for CommandBox because otherwise, Felix was deleting the entire bundle cache off of disk every time Lucee started and then re-creating it from scratch which took a great deal of time. (This is not the Felix default, but rather how Lucee configures it by default) I’m not aware of the “storage clean” settings changing who loads the bundles.


#4

I understand what you are trying to say, but what I said is still true that felix.start() loads the bundles when the cache is already there.

The measurements are showing felix-cache creation time is almost undetectable, under 30ms for me. If the commandbox stuff runs differently, I wouldn’t know. If all the bundles had to reload, it takes a few hundred ms extra for me.


#5

I’ve been testing all these lucee startup changes in Windows, and I’ve continued to work on this all of today.

I was able to split a lot of XMLConfigServerFactory and XMLConfigWebFactory into separate future threads using multiple threadpools. Many of those steps take many milliseconds so it should make it a little faster. I’ve got over 30temporary threads in an executor service now. It’s so random slow/fast, but sometimes it looks about 300ms faster. I don’t think it can be made any faster anymore without rewriting the code itself to do less, or to synchronize less.

Felix cache has to be wiped whenever the patch version changes and there isn’t a feature to detect that yet, so you have to leave felix-cache deleting on in order to be able to update lucee at the moment. I think cfadmin restart should be able to override the setting temporarily at least, so we have fast start-up when there isn’t an upgrade. When developing lucee, it has to clear felix-cache or you can’t see new code, so this isn’t really possible to turn it off without rewriting how it works some more.

Restarting Lucee engine takes just over 1 second, which is super fast compared to a tomcat stop/start.

Maybe if all these objects could serialize and reload themselves, it could load faster, but I don’t think they are setup to be able to be serialized yet and that’s more complex since there are a lot of resources and file system checks in this code, there is a mixture of read/write operations. The worst ones are reading those 6 huge TLD and FLD files. Perhaps 30% of startup is just xml to object translation stuff to get the language mappings setup. If those things were serialized and reloaded as java objects, perhaps that would be better on repeat starts then starting from XML which is never changing.

I’m surprised that adding over 30 threads to these slow processes only makes it about 20 to 30% faster. It should have been at least 2-3 times faster on a quad core with SSD. There must be synchronization or something blocking some of the parallelism from working. I’d have to look at more of the details instead of just wrapping these calls in futures to figure that out. However, I can see lucee use all 8 of my cpus are used briefly when it starts now, so it must be working to some extent.

I deployed the latest version of my Lucee fork to my linux test machine which is a virtual machine on the same computer to see if it works any different there. The tomcat full stop/restart is taking 4.13 seconds according to “time”. But it’s actually about 10 seconds for CFML to be able to first run without the threads. I can see I’ve cut 2 seconds off of that with all these threads and have it running in about 8 seconds now with the threads, which is a 30% reduction in startup time.


#6

I took a quick look for synchronized keyword throughout the project.

I found the XMLUtilImpl and CFMLEngineFactory are synchronizing in startup. I was able to safely minimize the need for that on CFMLEngineFactory. For the xml one, I just don’t know yet, but I removed it for now. I’m pretty sure that if we want synchronized on the xml document, it can be done in other ways, like what if synchronized is only done after startup and you do a boolean flag to toggle that mode. XMLUtilImpl functions are called I think hundreds of times, so this is probably the main thing that was stopping some of the parallelism.

After restarting it a few times, my windows box was 300ms faster with the synchronized changes, and my linux box was 700ms faster to startup to working cfml even with the felix cache being rebuilt. So now my new threads are working better.


#7

I took a break from programming last week, but now I’m motivated again.

I’ve optimized and removed steps for nearly everything now for lucee startup on my version.

I’ve removed many more things that are not essential to the cfml language working.

I made tomcat’s script sleep about 1 second less so that lucee actually starts loading about 1 second faster then the default script. I removed some optional things in tomcat jars and config too. Might not even need tomcat whenever I get rid of the servlet technology though.

Felix still takes like 300ms to 500ms depending on the cache setting. It’s a lot harder to trim that further, though at least 100ms is wasted by felix itself, so not using felix is one possible way to go. If felix was actually optional and only used for loading non-core features, that might be interesting, since Lucee core could be running much sooner. I’m not actually using any lucee extensions when I’m testing this, so all this felix loading time is jars required by Lucee core, which is around 50 jars.

While writing this post, I realized that the servlet is also needing to do processing outside of the lucee project between init and service calls, but the creation of the CFMLEngineFactory in init actually delays that work which causes up to 100ms wasted time after lucee finished creating the engine. I put CFMLEngineFactory creation in a separate thread and it didn’t break anything. This lets the tomcat servlet run init and then service as fast as it can while lucee is doing other things. I made service loop with Thread.yield until engine is not null. Now I’ve hidden even more of the Lucee startup time with threads. This optimization is right at the entry point to lucee so it will always be helpful to speeding it up even when other things get faster, the main lucee project should do this.

The code that is actually written in lucee’s project takes about 450ms to startup now.

The actual time to see the first CFML request from first line of lucee to output is measuring as low as 860ms total now, and I started with it measuring 1500ms.

I did a major rewrite of entire configuration today to no longer need a server.xml or web.xml context file anymore. I replaced them with a static java class that has a lot less variables and not using any hashmaps. I made a lot of personal choices to remove many features that I’ll never use while rewriting the contexts & configuration stuff. This eliminates over 100ms during startup. In some places, it wasn’t the xml that was the bottleneck, but the other kinds of I/O. I subdivided the work more carefully, and found dozens of ways to optimize little things while simplifying it.

I also rewrote when the cfml functions and tags that still load from xml to be much sooner in the process, and it runs in parallel to the other lucee startup operations. I also used java serialization to convert the final objects to binary on disk, and the next time it loads, it loads without any xml overhead. This made this part load about twice as fast. ~50 to 100ms saved

I found that initializing the hundreds of key constants could be put in a thread earlier so that the first use of them where they are needed is instant. Over 20ms saved. I also plan on making this file bigger later in order to speedup first time cfml object creation, and this will help keep it fast.

Also a lot of disk based logging makes startup slowdown a bit, so I removed most of that and do memory based logging and use regular exceptions instead for debugging startup issues.

I also learned how to kill and restart lucee via command line in windows so I could skip the crazy shutdown timeout tomcat service has. Command is:

taskkill /F /FI “SERVICES eq Lucee” && timeout /T 1 && sc start lucee

I also further optimized the lucee build process, and can make and deploy new versions of the Lucee jar in about 10 seconds and it does more of the steps for me, and avoids creating copies of files as much as it used to. The pack200 jar compression stuff is deprecated and seems to have zero benefit to startup times, so I got rid of it and just build regular jars faster now.

I also rewrote throw and writedump to be pure java functions instead of cfml functions for cfscript. I had to do this because I’ve completely eliminated all the CFML code from the Lucee project (hundreds of files), and there is no longer an admin area or docs or anything like that now in the project. I also removed all the features of CFADMIN tag that are unnecessary now since my configuration is static. I’ve removed most of the dynamic configuration features, to make it more secure and faster. I might restore some dynamic features in a new way later, but the majority of the settings never need to change really in a pure strict form of cfml especially when a lot can be done in application.cfc already. Code and documentation for configuration is better and easier to share to other systems.

I rewrote the cfml lucee debugging log to be part of my application instead of part of the Lucee project, so that I can see debugging logs.

In summary, I’ve removed about 45% of Lucee’s startup time, and about 1 second from tomcat startup on windows.

Here’s a newer version of my Lucee startup debug output:

3 ms for CFMLServlet init start
0 ms for CFMLServlet init Servlet init end
1 ms for CFMLServlet init CFML factory instance end
9 ms for CFMLEngineFactory getInstance begin
5 ms for CFMLEngineFactory initEngine begin
4 ms for CFMLEngineFactory after patch check
0 ms for CFMLEngineFactory initEngine before loadBundles
5 ms for BundleLoader loadBundles begin
2 ms for CFMLEngineFactory getFelix begin
1 ms for CFMLEngineFactory getFelix before mkdirs
1 ms for CFMLEngineFactory before new Felix
95 ms for CFMLServlet service start
2 ms for CFMLEngineFactory after new Felix (occurs outside Lucee)
192 ms for CFMLEngineFactory after start Felix
6 ms for BundleLoader loadBundles after loadAvailableBundles
13 ms for BundleLoader loadBundles after adding all bundles
0 ms for BundleUtil.start begin
1 ms for Bundle.start before bundle load wait
0 ms for Bundle.start bundles loaded
0 ms for CFMLEngineFactory initEngine after loadBundles
45 ms for CFMLEngineImpl begin
0 ms for CFMLEngineImpl after getFelix
25 ms for CFMLEngineImpl after getSeverContextConfigDirectory
34 ms for CFMLEngineImpl after InfoImpl
0 ms for CFMLEngineImpl getConfigServerImpl begin
2 ms for XMLConfigServerFactory begin
0 ms for XMLConfigServerFactory after doNew
0 ms for XMLConfigServerFactory before 4 threads
66 ms for XMLConfigServerFactory after loading 4 threads
29 ms for XMLConfigServerFactory after load threads part 2
0 ms for XMLConfigServerFactory end
0 ms for CFMLEngineImpl getConfigServerImpl after newInstance for the xml configServer
4 ms for CFMLEngineImpl getConfigServerImpl begin
0 ms for CFMLEngineFactory initEngine end
0 ms for CFMLServlet addServletConfig start
6 ms for CFMLServlet loadJSPFactory start
0 ms for CFMLEngineImpl getConfigServerImpl begin
1 ms for CFMLEngineImpl getConfigDirectory end
8 ms for CFMLFactoryImpl created
0 ms for XMLConfigWebFactory web begin
39 ms for XMLConfigWebFactory web end
0 ms for CFMLServlet loadJSPFactory end
0 ms for CFMLServlet init context(s) end
0 ms for CFMLEngineFactory getInstance(config) end
0 ms for CFMLEngineFactory readInitParam end
0 ms for CFMLEngineFactory initEngine end
0 ms for CFMLServlet addServletConfig start
0 ms for CFMLEngineFactory addServletConfig end
47 ms for CFMLEngineImpl getConfigServerImpl begin
0 ms for CFMLEngineImpl getCFMLFactory end
0 ms for CFMLEngineImpl _service begin
46 ms for PageContextImpl created
53 ms for PageContextImpl initialize start
9 ms for PageContextImpl initialized
0 ms for CFMLEngineImpl _service after create PageContextImpl
0 ms for CFMLEngineImpl _service before Request.exe
1 ms for PageContextImpl execute begin
843ms total time to restart lucee and run CFML code againtest0


#8

Sounds great, are you going to file PRs for some of these improvements?


#9

Zac, I labeled most of the commits related to startup performance with a startup prefix. That’s about all I’m willing to do to organize it since I don’t think they’d appreciate all of this as a pull request.

They’d do it another way because they want to maintain a different feature set in the existing style.

I want to simplify everything and blur the lines between my application and lucee.


#10

interesting