Non-utf-8 form data decoded incorrectly

hideo67 · July 20, 2023, 7:20am

Hello.

I am working on a legacy app running on ColdFusion and I am using Lucee to try to run it
locally and understand it.

The app is using a Japanese encoding (Windows-31J / Shift_JIS) for html and form post data.
Windows-31J is an extension to ASCII. The extended characters are 2 bytes long.
The first byte is always greater than 0x80, but the second byte can be in the range
of normal ASCII.
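
A minimal sketch of that byte layout, assuming a JDK that ships the Windows-31J charset (the class name SjisBytes and the helper hex are made up for illustration):

```java
import java.nio.charset.Charset;

public class SjisBytes {
    // Returns the Windows-31J bytes of s as space-separated lowercase hex.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName("Windows-31J"))) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u77e2")); // 矢: both bytes are above 0x80
        System.out.println(hex("\u5b50")); // 子: second byte is 0x71, which is ASCII 'q'
    }
}
```

The second character is the interesting one: its trailing byte is indistinguishable from a plain ASCII letter once it is written into a form body.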

To escape such characters, the browser (checked with edge and firefox) uses
the “%” escape notation for only those bytes that need escaping. This seems
to cause a problem with Lucee.

When both bytes are escaped, such as “%96%ee”, the sequence is correctly decoded.
But when the second byte is not escaped, such as “%8eq”, which could equally have been
encoded as “%8e%71”, the sequence is converted to a bogus character.

I searched through the Lucee source, and I think URLDecoder.decode() is doing the conversion.
It seems that the function assumes that when % is used, then the remaining bytes that belong
to the extended character are all escaped with % as well.
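
The suspected behaviour can be sketched like this. It is not Lucee's actual code: decodePerRun and decodeWhole are hypothetical names, and the input is assumed to be well formed. The first method converts bytes to text as soon as a run of %XX escapes ends, the second collects every byte and converts once at the end.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class RunDecoder {
    static final Charset SJIS = Charset.forName("Windows-31J");

    // Converts each run of %XX bytes on its own, so a trailing unescaped byte is cut off.
    static String decodePerRun(String s) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                run.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.append(new String(run.toByteArray(), SJIS)); // the run ends here
                run.reset();
                out.append(c);
            }
        }
        return out.append(new String(run.toByteArray(), SJIS)).toString();
    }

    // Collects escaped and unescaped bytes alike, then converts the whole buffer once.
    static String decodeWhole(String s) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                all.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                all.write(c);
            }
        }
        return new String(all.toByteArray(), SJIS);
    }

    public static void main(String[] args) {
        System.out.println(decodePerRun("%8eq").equals("\u5b50")); // false
        System.out.println(decodeWhole("%8eq").equals("\u5b50"));  // true
    }
}
```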

I think this is a bug. Should I open an issue tracker entry?

thanks.
Hideo.

Don’t forget to tell us about your stack!

Using docker image lucee/lucee:5.3.9.133-tomcat9.0-jdk11-openjdk

OS: the one that comes with the docker image
Java Version: jdk11
Tomcat Version: 9.0
Lucee Version: 5.3.9.133

andreas · July 20, 2023, 8:20am

Can you show us some short code example that does that escaping?

Zackster · July 20, 2023, 8:52am

always please check with the latest version first, 5.3.9 is rather old

hideo67 · July 20, 2023, 9:11am

Hi.

Sorry about the old version I am using. It takes a little effort to replace it.
However I looked at the source of the 6.0 branch and I think the code in that branch (still) has this problem.

Shown below is a standalone class that demonstrates the proper decoding.
(I couldn’t upload as an attachment since my account here is new.)

Note SJIS and Windows-31J are roughly equal.

import java.io.UnsupportedEncodingException;

public class Codec {

    public static byte[] utf8_to_sjis_bytes(String s) throws UnsupportedEncodingException {
        return s.getBytes("Windows-31J");
    }

    public static String sjis_bytes_to_utf8(byte b[]) throws UnsupportedEncodingException {
        return new String(b, "Windows-31J");
    }

    public static String escape_like_a_browser_does(byte[] bytes) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < bytes.length; i++) {
            // escape only the non-ascii bytes.
            if ((bytes[i] & 0x80) != 0) {
                sb.append("%" + Integer.toHexString(bytes[i] & 0xff));
            } else {
                sb.append((char) bytes[i]);
            }
        }
        return sb.toString();
    }

    // Compare this with URLDecoder.decode() in Lucee
    public static byte[] decode_escaped_str(String s) {
        byte[] buf = new byte[s.length()];
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '%') {
                int code = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                buf[pos++] = (byte) code;
                i += 2;
            } else {
                buf[pos++] = (byte) s.charAt(i);
            }
        }
        byte[] result = new byte[pos];
        for (int i = 0; i < pos; i++) {
            result[i] = buf[i];
        }
        return result;
    }

    public static void demonstrate(String s1) throws UnsupportedEncodingException {
        byte s1_sjis_bytes[] = utf8_to_sjis_bytes(s1);
        String escaped_s1 = escape_like_a_browser_does(s1_sjis_bytes);
        byte[] decoded_s1_bytes = decode_escaped_str(escaped_s1);
        String decoded_s1 = sjis_bytes_to_utf8(decoded_s1_bytes);
        System.out.println(s1 + " -> " + escaped_s1 + " -> " + decoded_s1);
    }

    public static void main(String args[]) {
        try {
            demonstrate("矢"); // 0x96,0xee in Shift_JIS
            demonstrate("子"); // 0x8e,0x71 in Shift_JIS
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

When it is run, here is the console output

矢 -> %96%ee -> 矢
子 -> %8eq -> 子

hideo67 · July 20, 2023, 9:13am

In the code I implied that Java strings are UTF-8, but I think they are actually UTF-16. Sorry about that.

hideo67 · July 21, 2023, 12:02am

I made a fix to URLDecoder.java as follows on the 6.0 branch and confirmed it works.
You have to go through the whole string and collect all the bytes before feeding them
to the String constructor with the encoding specified.

	public static String decode(String s, String enc, boolean force) throws UnsupportedEncodingException {
		if (!force && !ReqRspUtil.needDecoding(s)) return s;

		boolean needToChange = false;
		int numChars = s.length();
		byte bytes[] = new byte[numChars];
		int pos = 0;
		int i = 0;

		while (i < numChars) {
			char c = s.charAt(i);
			switch (c) {
			case '+':
				bytes[pos++] = (byte) ' ';
				i++;
				needToChange = true;
				break;
			case '%':

				try {
					while (((i + 2) < numChars) && (c == '%')) {
						needToChange = true;
						bytes[pos++] = (byte) Integer.parseInt(s.substring(i + 1, i + 3), 16);
						i += 3;
						if (i < numChars) c = s.charAt(i);
					}

					if ((i < numChars) && (c == '%')) {
						needToChange = true;
						bytes[pos++] = (byte) c;
						i++;
						continue;
					}
				}
				catch (NumberFormatException e) {
					needToChange = true;
					bytes[pos++] = (byte) s.charAt(i);
					bytes[pos++] = (byte) s.charAt(i + 1);
					bytes[pos++] = (byte) s.charAt(i + 2);
					i += 3;
				}
				needToChange = true;
				break;
			default:
				bytes[pos++] = (byte) c;
				i++;
				break;
			}
		}

		return (needToChange ? new String(bytes, 0, pos, enc) : s);
	}

hideo67 · July 21, 2023, 12:16am

Here is a sample form page.
Save it from a Unicode-capable text editor, specifying windows-31j as the encoding.
Store it as webapps/ROOT/index.cfm

On current 6.0 when you submit the form, the string marked as ‘bad’ will be corrupted.

<cfcontent type="text/html; charset=Windows-31J">
<cfset setEncoding("Form", "Windows-31J")>

<html>
<head><title>Form encoding test</title></head>
<body>
<h1>Form encoding test</h1>
On the lucee admin app, set Settings/Language/Compiler/Template charset to windows-31j<br>
Store this file as webapps/ROOT/index.cfm using windows-31j as an encoding.

<cfparam name="good" default="矢">
<cfparam name="bad" default="子">

<form action="index.cfm" method="post">
<cfoutput>
<label for="good">good : </label>
<input type="text" name="good" value="#good#"><br>

<label for="bad">bad : </label>
<input type="text" name="bad" value="#bad#"><br>

<input type="submit" name="button" value="送信">
</cfoutput>
</form>

<cfif isdefined("form.good")>
<cfoutput>
<p>good : #form.good#</p>
<p>bad : #form.bad#</p>
<p>button label(bad): #form.button#</p>
</cfoutput>
</cfif>

</body>
</html>

hideo67 · July 23, 2023, 12:01am

Hi.

I would appreciate it if this gets fixed on 6.0 soon, but there are (surprisingly) a lot of PRs ongoing.
I could explain more, or move on and submit an issue on Jira, or submit a PR.

Which is preferred?

Zackster · July 23, 2023, 7:24am

A PR with tests would be great, so please go ahead and create an issue in jira and then submit a PR linked to the issue

hideo67 · July 24, 2023, 11:54am

ok. I’ll try.

hideo67 · September 2, 2023, 3:18am

Hi.
I dug a little deeper into this and found out that the problem is more far-reaching.

In lucee code, there is the URLDecoder class as well as the URLDecode() function.

By fixing URLDecoder, I can fix the issue on both HTTP POST payloads and query parameters on the URL. However, to also cover the CFML URLDecode function, which can be called from CFML code, you have to fix URLDecode() also.

URLDecoder implements its own decoding code, but URLDecode() hands the job to java.net.URLDecoder. And that code has the same problem. It is not just Lucee.
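
A short way to see this with the JDK alone, assuming Java 10 or later for the decode(String, Charset) overload (the class name JdkDecoderDemo is made up for illustration):

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class JdkDecoderDemo {
    // Decodes a form-encoded string as Windows-31J using the JDK's own decoder.
    static String decode(String s) {
        return URLDecoder.decode(s, Charset.forName("Windows-31J"));
    }

    public static void main(String[] args) {
        // Fully escaped: both bytes are converted together.
        System.out.println(decode("%8e%71").equals("\u5b50")); // true
        // Partially escaped: the lead byte is converted without its trailing 'q'.
        System.out.println(decode("%8eq").equals("\u5b50"));   // false
    }
}
```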

In the wider java community this issue has been overcome by the URLCodec class in Apache commons codec.

So based on my findings, my suggestion would be to replace the implementation inside Lucee of both the URLDecoder class and the URLDecode() function with the URLCodec class.

Does that sound OK? I am slightly concerned about breaking someone’s
application that relied on the behavior of java.net.URLDecoder in some
obscure way.

Suggestions welcome.

Zackster · September 5, 2023, 6:55pm

I’m all for adopting mature upstream libraries, Lucee/Railo is so old that a lot of these didn’t exist back in the day.

As you rightly point out, we are beholden to existing behaviour and ACF quirks, but it’s worth trying

That’s why we have test cases

I’d say go for it and see how it goes!

hideo67 · September 7, 2023, 8:50am

OK. I will try.

hideo67 · September 7, 2023, 1:02pm

Hi. I ran the modification through the test suite, and it failed.

I found out that the test case for the URLDecode() function requires that a malformed input is bypassed silently.

test/functions/URLDecode.cfc

	assertEquals("%&/", "#URLDecode('%&/')#");

This line exists from the creation of the file in the git repo.

To satisfy this test case, the current URLDecode() implementation first gives java.net.URLDecoder a try. It will throw an exception on such input. Then to give the string a second chance it is passed to Lucee’s URLDecoder() class, which is permissive for such input.
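
The two-stage structure described above can be sketched roughly like this. It is not Lucee's code: LenientDecode and permissive are hypothetical names, UTF-8 is assumed as the encoding, and the fallback simply leaves anything that is not a valid %XX escape untouched.

```java
import java.io.ByteArrayOutputStream;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class LenientDecode {
    // Value of a hex digit byte, or -1 if the byte is not a hex digit.
    static int hexVal(byte b) {
        return Character.digit((char) (b & 0xff), 16);
    }

    // Hypothetical permissive fallback: malformed escapes pass through unchanged.
    static String permissive(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            if (in[i] == '%' && i + 2 < in.length && hexVal(in[i + 1]) >= 0 && hexVal(in[i + 2]) >= 0) {
                buf.write(hexVal(in[i + 1]) * 16 + hexVal(in[i + 2]));
                i += 3;
            } else if (in[i] == '+') {
                buf.write(' ');
                i++;
            } else {
                buf.write(in[i]);
                i++;
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // First try the strict JDK decoder, then give malformed input a second chance.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return permissive(s);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("%&/"));   // %&/
        System.out.println(decode("a%20b")); // a b
    }
}
```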

To keep this test passing, you have to keep the custom implementation of the URLDecoder class.

I don’t think that this particular test case was added to confirm that a required feature works. I guess that it was added merely to check that the fallback logic for malformed input works as the author of the URLDecoder class intended.

Since I don’t have Adobe ColdFusion at hand, I can’t confirm the behavior of ACF.

Is it OK to remove that line from the test case?

hideo67 · September 7, 2023, 1:52pm

I have another question.

How can I test java classes that are not directly exposed as CF functions?

The URLDecoder class (lucee.commons.net.URLDecoder) that I want to fix is used from Lucee when form post data is converted and passed to the CFML application.

I don’t see any test code, at least in the Lucee repo, doing such tests.

Is there a place for such kind of tests to be placed?

Zackster · September 8, 2023, 9:33am

@cfmitrah thoughts on the ACF question?

While we try to avoid calling java in our test cases, sometimes we need to.

You can do the following

github.com: lucee/Lucee/blob/6.0/test/tickets/LDEV0118.cfc#L24

          <cfscript>
          component extends="org.lucee.cfml.test.LuceeTestCase"	{
          
          
          
          	str=chr(49)&chr(57)&chr(49)&chr(56)&chr(49)&chr(52)&chr(48)&chr(124)&chr(69)&chr(110)&chr(103)&chr(105)&chr(110)&chr(101)&chr(101)&chr(114)&chr(105)&chr(110)&chr(103)&chr(44)&chr(32)&chr(66)&chr(101)&chr(100)&chr(114)&chr(111)&chr(99)&chr(107)&chr(124)&chr(51)&chr(49)&chr(51)&chr(57)&chr(51)&chr(46)&chr(55)&chr(53)&chr(124)&chr(57)&chr(124)&chr(51)&chr(124)&chr(51)&chr(53)&chr(53)&chr(49)&chr(48)&chr(48)&chr(124)&chr(52)&chr(55)&chr(55)&chr(46)&chr(53)&chr(48)&chr(124)&chr(49)&chr(50)&chr(53)&chr(57)&chr(46)&chr(55)&chr(53)&chr(124)&chr(52)&chr(46)&chr(48)&chr(48)&chr(13)&chr(10)&chr(49)&chr(57)&chr(52)&chr(53)&chr(49)&chr(53)&chr(52)&chr(124)&chr(77)&chr(117)&chr(108)&chr(104)&chr(111)&chr(108)&chr(108)&chr(97)&chr(110)&chr(100)&chr(44)&chr(32)&chr(71)&chr(97)&chr(98)&chr(114)&chr(105)&chr(101)&chr(108)&chr(108)&chr(97)&chr(124)&chr(51)&chr(48)&chr(53)&chr(50)&chr(53)&chr(46)&chr(48)&chr(48)&chr(124)&chr(49)&chr(50)&chr(124)&chr(51)&chr(124)&chr(49)&chr(52)&chr(51)&chr(51)&chr(54)&chr(52)&chr(48)&chr(124)&chr(50)&chr(52)&chr(54)&chr(46)&chr(48)&chr(48)&chr(124)&chr(49)&chr(50)&chr(50)&chr(53)&chr(46)&chr(48)&chr(48)&chr(124)&chr(52)&chr(46)&chr(48)&chr(48)&chr(13)&chr(10)&chr(50)&chr(48)&chr(48)&chr(51)&chr(54)&chr(48)&chr(52)&chr(124)&chr(82)&chr(101)&chr(101)&chr(115)&chr(44)&chr(32)&chr(75)&chr(97)&chr(116)&chr(104)&chr(114)&chr(121)&chr(110)&chr(124)&chr(50)&chr(50)&chr(51)&chr(57)&chr(50)&chr(46)&chr(50)&chr(50)&chr(124)&chr(49)&chr(50)&chr(124)&chr(51)&chr(124)&chr(54)&chr(53)&chr(55)&chr(50)&chr(57)&chr(55)&chr(124)&chr(49)&chr(48)&chr(52)&chr(46)&chr(49)&chr(53)&chr(124)&chr(49)&chr(48)&chr(49)&chr(50)&chr(46)&chr(49)&chr(53)&chr(124)&chr(52)&chr(46)&chr(53)&chr(48)&chr(13)&chr(10)&chr(49)&chr(48)&chr(48)&chr(51)&chr(52)&chr(56)&chr(55)&chr(124)&chr(83)&chr(117)&chr(44)&chr(32)&chr(89)&chr(117)&chr(110)&chr(32)&chr(67)&chr(104)&chr(101)&chr(110)&chr(103)&chr(124)&chr(50)&chr(48)&chr(55)&chr(57)&chr(48)&chr(46)&chr(48)&chr(48)&chr(124)&chr(56)&chr
(124)&chr(52)&chr(124)&chr(54)&chr(56)&chr(50)&chr(49)&chr(124)&chr(50)&chr(48)&chr(56)&chr(57)&chr(46)&chr(48)&chr(48)&chr(124)&chr(50)&chr(48)&chr(56)&chr(57)&chr(46)&chr(48)&chr(48)&chr(124)&chr(49)&chr(48)&chr(46)&chr(48)&chr(48)&chr(13)&chr(10)&chr(51)&chr(52)&chr(55)&chr(56)&chr(54)&chr(57)&chr(124)&chr(68)&chr(105)&chr(120)&chr(111)&chr(110)&chr(44)&chr(32)&chr(68)&chr(111)&chr(114)&chr(105)&chr(115)&chr(32)&chr(67)&chr(124)&chr(49)&chr(57)&chr(50)&chr(53)&chr(56)&chr(46)&chr(53)&chr(48)&chr(124)&chr(49)&chr(51)&chr(124)&chr(50)&chr(124)&chr(50)&chr(57)&chr(54)&chr(56)&chr(56)&chr(50)&chr(124)&chr(55)&chr(55)&chr(52)&chr(46)&chr(51)&chr(52)&chr(124)&chr(55)&chr(55)&chr(52)&chr(46)&chr(51)&chr(52)&chr(124)&chr(52)&chr(46)&chr(48)&chr(48)&chr(13)&chr(10);
          	CSVParser=createObject("java","lucee.runtime.text.csv.CSVParser");
          
          	public void function testWhiteSpaceAtTheEnd(){
          		var qry=CSVParser.toQuery(str, '|', '"', nullValue(), false );
          		assertEquals(5,qry.recordcount());
          		assertEquals(9,qry.columnCount());
          	}
          } 
          </cfscript>

hideo67 · September 8, 2023, 11:32am

Hi again. Thanks for the test code snippet. I see you can call java code that way… What I was looking for is a way to post form data to Lucee and make sure it gets passed on to the app in the correct encoding, or some equivalent way of testing that. I don’t think there is a quick solution to that.

As for ACF, I noticed that I can call ACF functions from ColdFusion Fiddle (cffiddle.org) which appears here and there in ACF’s documentation.
So I checked the behavior of the URLDecode() function over there.

I found that ACF doesn’t have issues decoding partially encoded multi-byte sequences, which is good. And it is also just as permissive for erroneous input as Lucee is. It won’t raise exceptions on ill-formed input such as “%#” or just “%”.

So fixing the partially encoded string problem, while maintaining the permissiveness seems to fix an existing compatibility gap without the risk of breaking someone’s (ACF compatible) code.

I will post a PR as follows:

For the URLDecoder class, replace it with a new hand-written version.
For the URLDecode() function, keep the current code structure, which first calls the common library and, when that throws an error, gives the string a second try with the URLDecoder class. I will replace the current call to java.net.URLDecoder with a call to org.apache.commons.codec.net.URLCodec.

Zackster · September 8, 2023, 11:35am

sounds great, please file the patch against 6.0

Zackster · September 11, 2023, 9:06am

thanks for the patch, as with the other patch, I have scheduled this for 6.1