I am working on a legacy app running on ColdFusion and I am using Lucee to try to run it
locally and understand it.
The app uses a Japanese encoding (Windows-31J / Shift_JIS) for HTML and form POST data.
Windows-31J is an extension of ASCII in which the extended characters are 2 bytes long.
The first byte is always greater than 0x80, but the second byte can fall within the range
of normal ASCII.
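To make the byte layout concrete, here is a small sketch that prints the Windows-31J bytes of the two characters used later in this thread. The class name SjisBytes and the helper describe() are made up for illustration, and it assumes the JDK at hand ships the Windows-31J charset.

```java
import java.nio.charset.Charset;

public class SjisBytes {
    // Assumption: the running JDK provides the Windows-31J charset.
    static final Charset SJIS = Charset.forName("Windows-31J");

    // Render each encoded byte as two hex digits and flag bytes that
    // are also printable ASCII characters.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(SJIS)) {
            int v = b & 0xff;
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", v));
            if (v >= 0x20 && v < 0x7f) {
                sb.append("('").append((char) v).append("')");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\u77e2")); // 矢 : both bytes are non-ASCII
        System.out.println(describe("\u5b50")); // 子 : second byte is ASCII 'q'
    }
}
```

The second character is the interesting one: its trailing byte is 0x71, which is the ASCII letter q.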
To escape such characters, the browser (checked with Edge and Firefox) uses
the “%” escape notation only for those bytes that need escaping. This seems
to cause a problem with Lucee.
When both bytes are escaped, such as “%96%ee”, the sequence is decoded correctly.
But when the second byte is not escaped, such as “%8eq”, which could equally have been
encoded as “%8e%71”, the sequence is converted to a bogus character.
I searched through the Lucee source, and I think URLDecoder.decode() is doing the conversion.
It seems that the function assumes that when % is used, then the remaining bytes that belong
to the extended character are all escaped with % as well.
I think this is a bug. Should I open an issue tracker entry?
thanks.
Hideo.
Using docker image lucee/lucee:5.3.9.133-tomcat9.0-jdk11-openjdk
OS: the one that comes with the docker image
Java Version: jdk11
Tomcat Version: 9.0
Lucee Version: 5.3.9.133
Sorry about the old version I am using. It takes a little effort to replace it.
However I looked at the source of the 6.0 branch and I think the code in that branch (still) has this problem.
Shown below is a standalone class that demonstrates the proper decoding.
(I couldn’t upload as an attachment since my account here is new.)
Note that SJIS (Shift_JIS) and Windows-31J are nearly identical; Windows-31J is Microsoft's extended variant of Shift_JIS.
import java.io.UnsupportedEncodingException;

public class Codec {
    public static byte[] utf8_to_sjis_bytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("Windows-31J");
    }

    public static String sjis_bytes_to_utf8(byte b[]) throws UnsupportedEncodingException {
        return new String(b, "Windows-31J");
    }

    public static String escape_like_a_browser_does(byte[] bytes) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            // escape only the non-ascii bytes.
            if ((bytes[i] & 0x80) != 0) {
                sb.append("%" + Integer.toHexString(bytes[i] & 0xff));
            } else {
                sb.append((char) bytes[i]);
            }
        }
        return sb.toString();
    }

    // Compare this from URLDecoder.decode() in Lucee
    public static byte[] decode_escaped_str(String s) {
        byte[] buf = new byte[s.length()];
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '%') {
                int code = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                buf[pos++] = (byte) code;
                i += 2;
            } else {
                buf[pos++] = (byte) s.charAt(i);
            }
        }
        byte[] result = new byte[pos];
        for (int i = 0; i < pos; i++) {
            result[i] = buf[i];
        }
        return result;
    }

    public static void demonstrate(String s1) throws UnsupportedEncodingException {
        byte s1_sjis_bytes[] = utf8_to_sjis_bytes(s1);
        String escaped_s1 = escape_like_a_browser_does(s1_sjis_bytes);
        byte[] decoded_s1_bytes = decode_escaped_str(escaped_s1);
        String decoded_s1 = sjis_bytes_to_utf8(decoded_s1_bytes);
        System.out.println(s1 + " -> " + escaped_s1 + " -> " + decoded_s1);
    }

    public static void main(String args[]) {
        try {
            demonstrate("矢"); // 0x96,0xee in Shift_JIS
            demonstrate("子"); // 0x8e,0x71 in Shift_JIS
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
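I made the following fix to URLDecoder.java on the 6.0 branch and confirmed that it works.
The decoder has to collect the bytes of the whole string first, and only then pass them to the String
constructor with the encoding specified. Otherwise a multi-byte character whose second byte is unescaped gets split across two conversions.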
public static String decode(String s, String enc, boolean force) throws UnsupportedEncodingException {
    if (!force && !ReqRspUtil.needDecoding(s)) return s;
    boolean needToChange = false;
    int numChars = s.length();
    byte bytes[] = new byte[numChars];
    int pos = 0;
    int i = 0;
    while (i < numChars) {
        char c = s.charAt(i);
        switch (c) {
        case '+':
            bytes[pos++] = (byte) ' ';
            i++;
            needToChange = true;
            break;
        case '%':
            try {
                while (((i + 2) < numChars) && (c == '%')) {
                    needToChange = true;
                    bytes[pos++] = (byte) Integer.parseInt(s.substring(i + 1, i + 3), 16);
                    i += 3;
                    if (i < numChars) c = s.charAt(i);
                }
                if ((i < numChars) && (c == '%')) {
                    needToChange = true;
                    bytes[pos++] = (byte) c;
                    i++;
                    continue;
                }
            }
            catch (NumberFormatException e) {
                needToChange = true;
                bytes[pos++] = (byte) s.charAt(i);
                bytes[pos++] = (byte) s.charAt(i + 1);
                bytes[pos++] = (byte) s.charAt(i + 2);
                i += 3;
            }
            needToChange = true;
            break;
        default:
            bytes[pos++] = (byte) c;
            i++;
            break;
        }
    }
    return (needToChange ? new String(bytes, 0, pos, enc) : s);
}
Here is a sample form page.
Save it from a Unicode-capable text editor, specifying Windows-31J as the encoding.
Store it as webapps/ROOT/index.cfm
On the current 6.0 branch, when you submit the form, the string marked as ‘bad’ is corrupted.
<cfcontent type="text/html; charset=Windows-31J">
<cfset setEncoding("Form", "Windows-31J")>
<html>
<head><title>Form encoding test</title></head>
<body>
  <h1>Form encoding test</h1>
  On the lucee admin app, set Settings/Language/Compiler/Template charset to windows-31j<br>
  Store this file as webapps/ROOT/index.cfm using windows-31j as an encoding.
  <cfparam name="good" default="矢">
  <cfparam name="bad" default="子">
  <form action="index.cfm" method="post">
    <cfoutput>
      <label for="good">good : </label>
      <input type="text" name="good" value="#good#"><br>
      <label for="bad">bad : </label>
      <input type="text" name="bad" value="#bad#"><br>
      <input type="submit" name="button" value="送信">
    </cfoutput>
  </form>
  <cfif isdefined("form.good")>
    <cfoutput>
      <p>good : #form.good#</p>
      <p>bad : #form.bad#</p>
      <p>button label(bad): #form.button#</p>
    </cfoutput>
  </cfif>
</body>
</html>
I would appreciate it if this got fixed on 6.0 soon, but there are (surprisingly) a lot of PRs in progress.
I could explain more, or go ahead and file an issue on Jira, or submit a PR.
Hi.
I dug a little deeper into this and found that the problem is more far-reaching than I thought.
In the Lucee code, there is the URLDecoder class as well as the URLDecode() function.
Fixing URLDecoder fixes the issue for both HTTP POST payloads and query parameters on the URL. However, to also cover the CFML URLDecode() function, which can be called from CFML code, URLDecode() has to be fixed as well.
URLDecoder implements its own decoding code, but URLDecode() hands the job to java.net.URLDecoder, and that code has the same problem. It is not just Lucee.
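The java.net.URLDecoder behavior can be checked outside Lucee with a few lines. This is only a sketch: the class name JdkDecoderCheck is made up, it uses the decode(String, Charset) overload (Java 10 and later), and it assumes the JDK ships the Windows-31J charset.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class JdkDecoderCheck {
    // Assumption: the running JDK provides the Windows-31J charset.
    static final Charset SJIS = Charset.forName("Windows-31J");

    // Thin wrapper around the JDK decoder, nothing Lucee-specific.
    static String decode(String escaped) {
        return URLDecoder.decode(escaped, SJIS);
    }

    public static void main(String[] args) {
        String expected = "\u5b50"; // 子, bytes 0x8e 0x71 in Windows-31J
        // Fully escaped form of the character.
        System.out.println("%8e%71 matches: " + decode("%8e%71").equals(expected));
        // Partially escaped form, as browsers send it.
        System.out.println("%8eq   matches: " + decode("%8eq").equals(expected));
    }
}
```

On the JDKs I am aware of, the decoder converts each run of consecutive %XX escapes to characters on its own, so the lone 0x8e is converted without its trailing byte and the second check reports a mismatch.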
In the wider Java community, this issue has been overcome by the URLCodec class in Apache Commons Codec.
So, based on my findings, my suggestion is to replace the implementation of both the URLDecoder class and the URLDecode() function inside Lucee with the URLCodec class.
Does that sound OK? I am slightly concerned about breaking someone’s
application that relied on the behavior of java.net.URLDecoder in some
obscure way.
Hi. I ran the modification through the test suite, and it failed.
I found that the test case for the URLDecode() function requires malformed input to be passed through silently.
test/functions/URLDecode.cfc
assertEquals("%&/", "#URLDecode('%&/')#");
This line has been there since the file was first created in the git repo.
To satisfy this test case, the current URLDecode() implementation first tries java.net.URLDecoder, which throws an exception on such input. The string then gets a second chance: it is passed to Lucee’s URLDecoder class, which is permissive for such input.
To keep this test passing, the custom implementation of the URLDecoder class has to stay.
I don’t think this particular test case was added to confirm that a required feature works. My guess is that it was added merely to check that the fallback logic for malformed input works as the author of URLDecoder intended.
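The two-stage structure described above can be sketched as follows. This is not Lucee's code: the class LenientDecode and its methods are hypothetical, the strict stage is java.net.URLDecoder, and the permissive stage is a simplified stand-in that assumes ASCII-only input, as in a form payload.

```java
import java.io.ByteArrayOutputStream;
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class LenientDecode {
    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Permissive pass: decode well-formed %XX escapes, copy everything else
    // through unchanged, and convert all bytes to characters in one step.
    static String permissive(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                out.write(' ');
            } else if (c == '%' && i + 2 < s.length()
                    && hexValue(s.charAt(i + 1)) >= 0
                    && hexValue(s.charAt(i + 2)) >= 0) {
                out.write((hexValue(s.charAt(i + 1)) << 4) | hexValue(s.charAt(i + 2)));
                i += 2;
            } else {
                out.write(c); // literal character, including a malformed '%'
            }
        }
        return new String(out.toByteArray(), cs);
    }

    // Strict decoder first; fall back to the permissive pass on malformed input.
    static String decode(String s, Charset cs) {
        try {
            return URLDecoder.decode(s, cs);
        } catch (IllegalArgumentException e) {
            return permissive(s, cs);
        }
    }
}
```

With this shape, an input such as "%&/" fails the strict stage and comes back unchanged from the permissive one. Note that the strict stage here still mishandles partially escaped multi-byte sequences, since it raises no exception for them; only the permissive pass collects all bytes before converting.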
Since I don’t have Adobe ColdFusion at hand, I can’t confirm the behavior of ACF.
How can I test java classes that are not directly exposed as CF functions?
The URLDecoder class (lucee.commons.net.URLDecoder) that I want to fix is used from Lucee when form post data is converted and passed to the CFML application.
I don’t see any test code, at least in the Lucee repo, doing such tests.
Is there a place where such tests could go?
Hi again. Thanks for the test code snippet. I see you can call java code that way… What I was looking for is a way to post form data to Lucee and make sure it gets passed on to the app in the correct encoding. Or something equivalent of testing that. I don’t think there is a quick solution to that.
As for ACF, I noticed that I can call ACF functions from ColdFusion Fiddle (cffiddle.org) which appears here and there in ACF’s documentation.
So I checked the behavior of the URLDecode() function over there.
I found that ACF has no trouble decoding partially encoded multi-byte sequences, which is good. It is also just as permissive with erroneous input as Lucee is: it won’t raise exceptions on ill-formed input such as “%#” or just “%”.
So fixing the partially encoded string problem while maintaining the permissiveness seems to close an existing compatibility gap without the risk of breaking someone’s (ACF-compatible) code.
I will post a PR as follows:
For the URLDecoder class, replace it with a hand-written version.
For the URLDecode() function, keep the current code structure, which first calls the common library and, when that causes an error, gives the input a second try with the URLDecoder class. I will replace the current call to java.net.URLDecoder with a call to org.apache.commons.codec.net.URLCodec.