NPE after upgrade to Lucee 5.2.9.31

Markus_Wollny · September 18, 2018, 2:23pm

We are running four Lucee servers behind a load balancer. They were originally deployed with 5.2.7.63, and the upgrade to 5.2.8.50 went smoothly. Now I wanted to upgrade to 5.2.9.31 and was very much looking forward to finally getting the fix for LDEV-1617.

After the upgrade of the first server, we were bombarded with Null Pointer Exceptions though - and those didn’t just hit the freshly upgraded server, but the other servers, too, via the Memcache layer we use.

We employ the Lucee Memcache extension and use CacheGet, CachePut etc. to cache objects from the database. The cache connection is configured in the Lucee admin to use two memcached instances. After taking the freshly upgraded box out of load balancing, we managed to get rid of the lingering NPEs by clearing the cache for the affected keys whose content had been generated on the upgraded box.

I have of course tested the upgrade on a separate VM which has an identical configuration to our live webservers. However, I couldn’t recreate this situation on that machine. It seems to happen only under some server load or some other special condition I haven’t reproduced in my simplified test case. Most of the NPEs don’t even provide any stack trace in the logs, but those which did mostly stumbled over a cfloop on a query object that had just been fetched from cache via a UDF.

We’re running Tomcat 8.5.14-1+deb9u3 on Oracle Java build 25.181-b13.

The line where the NPE happens is

<cfloop query="arguments.qryMyData">

and the NPE is thrown in lucee.runtime.type.QueryImpl.getCurrentrow:

Message string java.lang.NullPointerException
StackTrace string lucee.runtime.exp.NativeException: java.lang.NullPointerException
 at lucee.runtime.type.QueryImpl.getCurrentrow(QueryImpl.java:1079)
 at my.path.to.mycomponent_cfc$cf.udfCall1(/my/path/to/mycomponent.cfc:1234)

As I still haven’t found out how exactly this can be reproduced without causing havoc to our live sites, I unfortunately do not have more info than this - I have rolled back the upgrade and our applications are running fine again. I suspect that this has something to do with the changes made for LDEV-1945.

Kind regards

Markus

webonix · September 18, 2018, 10:17pm

The issue may be running different versions of Lucee in your cluster.

We had a similar issue and if I recall correctly, this was the issue.

Markus_Wollny · September 19, 2018, 7:33am

You are correct in your assumption - I deployed a separate memcached instance, upgraded the server and before joining it back into the cluster I changed the cache configuration to use the separate instance - no NPEs so far, albeit an increased DB load, as I just doubled the probability for a cache miss. So there must be some binary incompatibility in the serialization of objects between the previous and the current Lucee version.

This is not quite optimal - in order to deploy even a supposedly minor upgrade, I’ll have to switch each instance to a separate cache until all the servers are upgraded, then clear the production cache and switch them back. If anybody is careless enough to simply update an instance without adjusting the cache, all their apps will suffer from NPEs because incompatible serialized objects poison the shared cache. The only remedy is then to clear the cache completely, and facing a cold cache may not be a good idea for a production system.

It would be desirable to have a serializer that would prevent such incompatibilities in caching between minor versions, but at the very least there should be a warning in the release announcement advising people to clear their caches after the upgrade and make sure that there is no mix of (even minor) versions in a cluster sharing a common cache.

Kind regards

Markus

Jan_Verschueren · September 21, 2018, 12:50pm

Experienced the same problem when iterating over a query using query.next() after updating to 5.2.9.31.

I don’t run different versions of Lucee in a cluster. Had to downgrade Lucee to get rid of the NPE.

Markus_Wollny · September 21, 2018, 1:09pm

@Jan_Verschueren: That sounds like a different beast altogether - could you file a ticket with a test case on how to reproduce that problem in the issue tracker? https://luceeserver.atlassian.net/

I have in fact gone down the route I described, i.e. isolation of cache instances during the upgrade, and we haven’t had any NPE since, but I don’t think we are using query.next() anywhere in our code.

Kind regards

Markus

joe.gooch · September 24, 2018, 11:55am

My guess is the default serializer just uses built-in Java serialization - which means you’re going to be affected by any change in the class files, the JVM major version etc… as you’re caching the binary form of the object for anything that doesn’t provide a smart serializer.

You could use a more portable serializer if it works (see this issue)

https://github.com/lucee/extension-memcached/issues/1

The tradeoff being that every push/pull from the cache now has a performance hit to translate to and from something like JSON. I haven’t checked the query object in Java itself to see what the serialization looks like, but the built-in binary serializer is probably the most performant, and really you only have the possibility of having a problem after restarting the server with JVM/Lucee changes… so…

A better option might be to make a unique key prefix based on various environmental keys - maybe JVM version + OS version + Lucee version, make an MD5 hash, and prefix all your cache keys with that… that way you transparently have different cached objects per version. (It’d be even cooler if the extension could accommodate that)
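The key-prefix idea can be sketched in Java roughly as follows. This is only an illustration under assumptions: `CacheKeyPrefix` and its methods are hypothetical names, not part of Lucee or the memcached extension, and the Lucee version is passed in as a plain string because how it is obtained is left open here.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of Lucee or the memcached extension.
final class CacheKeyPrefix {

    // Builds a short hex prefix from the environment values that affect binary compatibility.
    static String of(String luceeVersion, String jvmVersion, String osVersion) {
        String fingerprint = luceeVersion + "|" + jvmVersion + "|" + osVersion;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(fingerprint.getBytes(StandardCharsets.UTF_8));
            // Zero-pad to 32 hex characters, then keep the first 8 so keys stay short.
            String hex = String.format("%032x", new BigInteger(1, digest));
            return hex.substring(0, 8);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Prepends the prefix to an application-level cache key.
    static String key(String prefix, String appKey) {
        return prefix + ":" + appKey;
    }

    public static void main(String[] args) {
        String jvm = System.getProperty("java.version");
        String os = System.getProperty("os.version");
        // Two Lucee versions on the same machine end up with different prefixes.
        System.out.println(key(of("5.2.8.50", jvm, os), "qryMyData"));
        System.out.println(key(of("5.2.9.31", jvm, os), "qryMyData"));
    }
}
```

With a prefix like this, an upgraded node simply misses on entries written by older nodes instead of deserializing them, at the cost of a lower hit rate while versions are mixed.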

Just tossing out ideas…

joe.gooch · September 24, 2018, 1:21pm

This is almost certainly a side effect of the fix:
https://luceeserver.atlassian.net/browse/LDEV-1934

What was an ArrayInt is now a HashMapPro. Regardless, that variable stores local instance state (based on context IDs, which change on server restart), so it should not be serialized. (Otherwise, the position in the iteration is shared across instances where it doesn’t make sense to do so.) It needs the transient keyword, or custom writeObject/readObject methods on the java.io.Serializable class.
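In Java serialization, the keyword that keeps a field out of the serialized stream is `transient`. A minimal sketch of the idea follows; `DemoQuery` is a hypothetical stand-in written for this illustration and is not Lucee's QueryImpl, whose actual fields are not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a query-like object; this is not Lucee's QueryImpl.
final class DemoQuery implements Serializable {
    private static final long serialVersionUID = 1L;

    // The actual data: this is what should travel through the cache.
    final List<String> rows = new ArrayList<>();

    // Cursor positions keyed by a context id: local state, so excluded from the stream.
    private transient Map<Integer, Integer> currentRowByContext = new HashMap<>();

    void setCurrentRow(int contextId, int row) {
        currentRowByContext.put(contextId, row);
    }

    int getCurrentRow(int contextId) {
        return currentRowByContext.getOrDefault(contextId, 0);
    }

    // Transient fields come back as null, so rebuild the map after the default read.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        currentRowByContext = new HashMap<>();
    }

    // Serializes and deserializes in memory, as a cache put followed by a get would.
    static DemoQuery roundTrip(DemoQuery q) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(q);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (DemoQuery) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the `readObject` hook the map would be null after deserialization, and the first call that touches it would throw a NullPointerException.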

If you’re using a Query as a DTO, and serializing it, there are probably other things that shouldn’t be serialized - SQL, execTime, template, cacheType, name…

The serialize() function only serializes the data. (Which is smart) So you could wrap your cache calls with a serialize() before pushing into memcache extension, and a deserialize after.

Is the fix wrong? No, it was correct. And even if now the query object implemented java.io.Serializable, the NEXT version of Lucee would now have binary-incompatible query objects…

You could file a bug with Lucee, to have this sorta fixed - but binary serialization is always going to have issues, especially with Query objects, because the JDBC driver determines some of the datatypes… Consider Microsoft’s DateTimeOffset… since java.sql.Timestamp doesn’t save a timezone, Microsoft has their own class which exists in the JDBC driver that implements and extends java.sql.Timestamp. When serializing the query it’s going to follow what MS has defined for serialization behavior (assuming they HAVE defined serialization behavior) and if the class changes in the future, deserialization may fail. Or if you stop using the MSSQL driver and switch to another driver, it won’t be able to recreate the class. I guess my point is there’s only so much Lucee, or the memcache extension, can control.

Best bet for binary serialization is to use primitive types and portable types. You’d be better off serializing a Structure[colName] of Array(datarows), than a query object, and while building your portable object, you could do things like pick portable serialization choices for non-standard datasets.
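As a rough Java sketch of that "struct of arrays" shape: the class and method names below are hypothetical, and the choice to store dates as epoch milliseconds and to stringify unknown driver types is just one possible policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical converter: column name -> list of values, built from plain JDK types only.
final class PortableColumns {

    // Turns row-oriented data into a column-oriented map, preserving column order.
    static Map<String, List<Object>> fromRows(List<String> columns, List<Map<String, Object>> rows) {
        Map<String, List<Object>> result = new LinkedHashMap<>();
        for (String column : columns) {
            result.put(column, new ArrayList<>());
        }
        for (Map<String, Object> row : rows) {
            for (String column : columns) {
                result.get(column).add(toPortable(row.get(column)));
            }
        }
        return result;
    }

    // Maps values onto types that do not depend on a particular JDBC driver.
    static Object toPortable(Object value) {
        if (value == null || value instanceof String
                || value instanceof Number || value instanceof Boolean) {
            return value;
        }
        if (value instanceof Date) {
            // java.sql.Timestamp is a java.util.Date too; keep only epoch milliseconds.
            return ((Date) value).getTime();
        }
        // Assumed fallback for driver-specific classes: keep their string form.
        return value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1L);
        row.put("created", new Date(0L));
        System.out.println(fromRows(Arrays.asList("id", "created"), Collections.singletonList(row)));
    }
}
```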

Probably not the answer you want, but… it is what it is.

So your choices seem to be:

1. Issue a JIRA for Lucee to implement serialization in the Query object safely, realizing you may still have issues depending on the datatype you use.
2. You could register an issue w/ the memcached extension… Issue 1 there indicates that JSON doesn’t work in the memcached extension and, assuming I have the correct GitHub repo, it was never implemented. There would need to be a Transcoder defined in the extension to serialize as something other than binary. Note that without additional annotations or custom serializers, even GSON, Jackson or other JSON-based serializers will serialize the data structures they shouldn’t, so this isn’t really a solution without also fixing option 1, and it has a performance impact.
3. You could restructure your code (key by key) to use more portable DTOs than “Query”. The benefit of NOT using a query object at all, and using a safe Struct of Arrays for instance, is you can do that translation once, modify your code to use your new object, and not pay the penalty of constant serialization/deserialization.
4. You could wrap your memcached extension calls in your own translation layer, implementing key prefixes to keep the versioned data separate, or serializing yourself before put and deserializing after get (a binary rep of Strings is always going to be safe, but incurs a performance penalty).
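The translation-layer choice could look roughly like the Java sketch below. Everything in it is hypothetical: `VersionedCache` is a made-up name, the in-memory `Map` merely stands in for the shared memcached store, and the encode/decode functions represent whatever string format is chosen.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical translation layer; the Map stands in for the shared memcached store.
final class VersionedCache<T> {
    private final Map<String, String> store;
    private final String prefix;
    private final Function<T, String> encode;
    private final Function<String, T> decode;

    VersionedCache(Map<String, String> store, String prefix,
                   Function<T, String> encode, Function<String, T> decode) {
        this.store = store;
        this.prefix = prefix;
        this.encode = encode;
        this.decode = decode;
    }

    // Serialize to a string before the put, under a version-specific key.
    void put(String key, T value) {
        store.put(prefix + ":" + key, encode.apply(value));
    }

    // Deserialize after the get; entries written under another prefix are simply misses.
    Optional<T> get(String key) {
        return Optional.ofNullable(store.get(prefix + ":" + key)).map(decode);
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        VersionedCache<Integer> oldNode = new VersionedCache<Integer>(
                shared, "5.2.8.50", i -> Integer.toString(i), s -> Integer.valueOf(s));
        VersionedCache<Integer> newNode = new VersionedCache<Integer>(
                shared, "5.2.9.31", i -> Integer.toString(i), s -> Integer.valueOf(s));
        oldNode.put("rowCount", 42);
        System.out.println(oldNode.get("rowCount").isPresent());
        System.out.println(newNode.get("rowCount").isPresent());
    }
}
```

Because only strings reach the store, an upgraded node never deserializes another version's objects; the price is an encode/decode on every cache access.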

Cache invalidation is one of the hard problems of computer science, after all.

Markus_Wollny · September 24, 2018, 2:14pm

I guess we’ll do 3) eventually, i.e. when writing new code or refactoring legacy code, we’ll see to not using queries in DTO-contexts and convert to some more stable data structure.

1, 2 and 4 come with too much of a general performance impact for what it’s worth IMO, so I don’t think any of them would be a good idea overall, even though they seem cleaner and the serialization caveats for queries are somewhat unintuitive.

For future updates I’ll be careful to check for NPEs and quickly switch to separating the cache instances during the upgrade, if the need arises.

Thank you for your input!

Kind regards

Markus

Jan_Verschueren · November 13, 2018, 2:38pm

Thanks @joe.gooch.

Option 3 seems the most interesting and straightforward, but this means that this is yet another case where refactoring is required for working code coming from Adobe ColdFusion to Lucee. If Lucee wants to be able to use cached query objects the same way as in ACF, then there is no way around it and it has to be fixed in the Lucee JIRA, right?

And even if this is something that won’t be fixed, it should at least result in a readable Lucee exception when trying to serialize a query object. Currently Lucee >=5.2.9.31 seems incompatible with ACF.

joe.gooch · November 13, 2018, 10:49pm

Well, ultimately Option 1 should happen for Lucee - those internal tracking areas SHOULD be marked transient - while making that change will break currently cached objects, it’ll be better for the future.

I wouldn’t use binary serialization/deserialization w/ ACF either - you have no real guarantee the internal structure won’t change in the future, and the JDBC datatypes issue is still there… My guess is you’ll have similar issues with Adobe whether they’ll admit that or not. I mean, we’re talking about caching a query in an external cache that persists across app server restarts - none of this is an issue with memory caching within the same instance. So I’m not sure which ACF feature you’re referring to - can you elaborate?