Sanitize All Submitted Forms By Default? #XSS

I’m planning to programmatically apply Canonicalize() and SanitizeHTML() to all submitted Form values by default via OnRequest, with perhaps a future exclusion list for specific templates if there happens to be some reason why user input should ever be trusted. :stuck_out_tongue:

Would it be feasible for Lucee to do this out of the box with an option to disable (opt out) similar to Script-protect in Admin - Settings - Request?

Seems like that would make the entire platform more secure against Cross-Site Scripting, especially for new users, and save time for devs not having to add code in Application.cfc.

And that extra security “patch” by default would be great for marketing vs competing app servers? :wink:

I highly recommend against doing this. It’s bad form and sloppy IMO. Values should be encoded at the time of use based on the medium they’re being injected into. Here is a perfect example of this sort of preemptive encoding gone wrong which I found at my local auto parts store:

You can see some clever developer decided to just encode everything going into the DB directly, which led to incorrect output on their point of sale devices which don’t use HTML! There are many types of encoding (for URLs, HTML, JavaScript, XML, XML attributes, etc.) and each one has different rules. Don’t ruin your data when it comes in by guessing about how you’ll use it later.

And secondly, who are you to say what is valid data? If I legally changed my name to be Brad <br> Wood and all legal documents pertaining to my name contained exactly that text, then when I type my name into your form fields, it’s not your job to guess what about that string is valid. It’s your job to store exactly what I input, and when you go to output it, you encode it properly based on the location it’s being output.

<h1>#encodeForHTML( customer_name )#</h1>


theURL = "https://example.com/?name=#encodeForURL( customer_name )#"; // example.com is a placeholder host


<script language="javascript">
  var customer_name = '#encodeForJavascript( customer_name )#';
  alert( customer_name );
</script>

You want to keep the user’s data intact and only encode it when necessary based on the output medium.

Canonicalizing is another place where data can actually be lost when you don’t expect it to. Take the following string for example


Now, let’s say you need to include it in a URL and still maintain all of its data. URL encoding it will correctly give you this:


which can be successfully decoded back to the original string. But if we encode it with canonicalization enabled, we get this!


which, when decoded again, gives you only


which isn’t at all what the original input was! So by blindly canonicalizing our data, we can actually lose important data.
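
The example strings from this post did not survive here, so the following is only a stand-in sketch of the same effect, not the original example. It assumes Lucee's ESAPI functions are available and that the optional second argument of encodeForURL() turns canonicalization on; the sample value is hypothetical.

```cfml
<cfscript>
// Hypothetical user input: a sentence that legitimately contains the text "%20".
original = "Use %20 for a space";

// Plain URL encoding: the percent sign itself gets encoded, so nothing is lost.
encoded   = encodeForURL( original );
roundTrip = urlDecode( encoded );

// Second argument true: the value is canonicalized (decoded) before it is encoded.
encodedCanon   = encodeForURL( original, true );
roundTripCanon = urlDecode( encodedCanon );

writeDump( {
    original       : original,
    roundTrip      : roundTrip,
    roundTripCanon : roundTripCanon
} );
</cfscript>
```

If canonicalization behaves as described above, roundTrip should match the original while roundTripCanon should come back with the literal "%20" already turned into a real space, which is exactly the kind of silent data loss being warned about.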

And finally, depending on your application, you may be expecting HTML meta characters to be submitted if you have any sort of CMS or comment system that allows users to submit HTML markup. CF’s “script protect” feature already gives people fits in this case by replacing some tags in their CMSs with invalidtag. The correct solution here is to use a library like AntiSamy to clean out specific unwanted markup only on the form fields where it makes sense.

If you’re looking for some sort of global protection to help lazy devs who forget about encoding, I would recommend looking into the encodeFor attribute of the CFOutput tag.

Then all variable interpolation inside of that tag will automatically be encoded

<cfoutput encodeFor="html">#customer_name#</cfoutput>

You can even set the default value for this attribute for your entire app in your Application.cfc with something like this:


Just keep in mind there doesn’t appear to be a method in Lucee to override the attribute back to the default for certain tags.


Unfortunately, the last suggestion I made does not appear to work in Lucee. I have entered two tickets:

Thanks for the thoughtful reply, and I mostly agree now that you’ve made me aware of those use cases, though I’m also considering how CFML is primarily intended for HTML output, especially when used by less experienced developers, and how my idea for an exclusion list and opt out could take care of the rest. The exclusion list could end up being more work than individually sanitizing input, but I’m still intrigued by the idea of more XSS protection out of the box.

However, another thought I had after posting is that it’s probably best to identify hack attempts and prevent them from being stored in the database (and tarpit the ip and/or disable malicious user’s account), instead of storing the sanitized data.

Now I’m also wondering what the use cases are for Canonicalize(). :thinking:

And for that matter, since I just tested SanitizeHTML('&') and found that it converts to &amp;, and & is a legit character that might pop up in any string, when would we ever want to use SanitizeHTML()? Is ColdFusion superior with its Antisamy GetSafeHTML()? Or is SanitizeHTML() also AntiSamy?

Funny that you mention the encodeFor attribute because I was already in the process of applying it to most of my cfoutput tags and using the various EncodeFor functions where appropriate. For similar reasons as what you outlined, I realized encoding output for the entire app would break some things such as links. We could just say that Lucee’s inability to change the default value of the output tag is a feature. :shushing_face:

Found answers to my questions re: AntiSamy:

Via TryCF I found that GetSafeHTML() does not convert & to &amp;, BUT it did remove a totally legit <img> tag and presumably other safe HTML, which would annoy many a non-malicious user of CKEditor and the like for blogs, email newsletters, etc.

So, what now? We just have to trust certain users? Especially if they’re paying customers? :grinning:

& vs #GetSafeHTML('& xss test: <b>bold</b> <img src=""> <script>alert("TEST")</script>')#
<p>Checking whether trycf allows images: <img src=""> YEP!</p>

I would recommend looking into Fuseguard. It’s a product made by @pfreitag and it has all sorts of heuristics to try and detect malicious activity and then do anything from log it to blocking the request.

Canonicalization is most generally used in the security sense in conjunction with a validation of some sort. Let’s say I have a form and I don’t allow people to post the word “fart”, so I build a regex to search for that word. So a malicious user posts the text f&#97;rt through my form, where &#97; is the HTML encoded version of the letter a. Now my regex doesn’t match the word fart even though the website may be rendering it. The solution is to canonicalize the text first, then process my validation. When you pass f&#97;rt through the canonicalize() BIF, you get fart.
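
A minimal sketch of that canonicalize-then-validate order, assuming Lucee's canonicalize( input, restrictMultiple, restrictMixed ) signature. The variable names and the one-word block list are made up for illustration.

```cfml
<cfscript>
// Hypothetical submitted value; "##" is how a literal pound sign is written in a CFML string.
submitted = "f&##97;rt";

// Hypothetical banned-word pattern used only for this illustration.
bannedPattern = "fart";

// Checking the raw value misses the word because one letter is entity-encoded.
rawHit = reFindNoCase( bannedPattern, submitted ) > 0;

// Canonicalize a copy for the validation step only.
// Arguments: input, restrictMultiple, restrictMixed.
canonical    = canonicalize( submitted, false, false );
canonicalHit = reFindNoCase( bannedPattern, canonical ) > 0;

writeDump( { rawHit : rawHit, canonicalHit : canonicalHit } );
</cfscript>
```

Note that the canonicalized copy is only used to make the decision; the value that gets stored can still be the untouched original, in line with the earlier advice.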

I recall a pretty common SQL injection attack back around 2008 where the injected payload encoded the SQL to obfuscate it, then passed it into the eval() SQL function. A search for malicious SQL keywords wouldn’t match the encoded payload unless it was decoded first.

That BIF uses AntiSamy under the covers and has a VERY specific use case. It’s for when you have a form where you fully expect users to enter HTML, but ONLY a specific subset of HTML that you deem safe. This was common on bulletin boards where you could use bold tags or header tags. It was also the case on MySpace, the site of the first mainstream XSS attack, by a hacker named Samy, which is where the name comes from! AntiSamy should not be used on any type of input other than HTML.

Yes I found that yesterday, but it’s $48/mth and currently my two income-producing businesses are barely making a meager profit. Also I prefer open-source, hence Lucee. :smiling_face_with_three_hearts:

Then what’s the best built-in Lucee (or free extension?) solution to identify XSS attacks in form inputs that are intended to be plain text?

Derp. I don’t remember how I came to that conclusion, and after double checking I now see that it does NOT convert to &amp;, so never mind!

And now to possibly answer my own question again, I actually don’t see why SanitizeHTML() isn’t exactly what’s needed for removing malicious code from every form input type, not just HTML.

I’m reminded how for ColdFusion, Charlie Arehart recommended GetSafeHTML() for that very purpose, and SanitizeHTML() does pretty much the same?

And for really plain text, could also apply Raymond Camden’s StripHTML().

As for attempting to police users, I’ll just give them the benefit of the doubt that most of them are simply wanting to embed a legit script and display a warning that scripts are not allowed.

Attempting to use Canonicalize() was a big mistake and caused problems with data submitted to our web applications. Regarding use of SanitizeHTML, I’d highly recommend running it through some unit tests so you can be sure that it’s protecting you the way you think it should and doesn’t invalidate good data.
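
As a rough starting point for such tests, here is a sketch that feeds known-good plain-text values through sanitizeHTML() and reports any that come back altered. The sample strings are hypothetical, and the loop could just as well live inside whatever test framework is already in use.

```cfml
<cfscript>
// Hypothetical samples of legitimate plain-text input the application must accept.
knownGood = [
    "x < y",
    "Tom & Jerry",
    "Brad <br> Wood",
    "5 > 3 and 2 < 4"
];

changed = [];
for ( sample in knownGood ) {
    result = sanitizeHTML( sample );
    // compare() is case-sensitive; 0 means the two strings are identical.
    if ( compare( sample, result ) != 0 ) {
        arrayAppend( changed, { input : sample, output : result } );
    }
}

// Anything listed here is good data that the sanitizer altered.
writeDump( changed );
</cfscript>
```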

Have you encountered any attempts to bypass filters using high ASCII, multibyte UTF-8 characters or IDN characters? IDN characters render similarly to ASCII, but aren’t easily filterable and will fall back to default ASCII when used in the browser. (Note: PHP has had an idn_to_ascii function for at least 9 years now.)

In the past, I’ve had success identifying potential abuse with some CFML functions that I’ve written which leverage existing java libraries. Here are a few things I’ve tested:

  • isCyrillic (UDF that uses java.util.regex.Pattern to detect characters in the Unicode range U+0400 to U+04FF)
  • Junidecode (converts all UTF-8 to ASCII7)
  • Jsoup (performs sanitization + fixes intentionally invalid HTML)
  • Filter zero-width characters used to obfuscate visual content (e.g. &zwnj;)
  • Normalize gmail addresses (remove post-+ text & eliminate . from username)
  • Detect use of popular URL shorteners (Additionally connect to host to verify that YOURLS is not being used.)
  • Detect disposable email addresses. (Also block email w/sanitized version of “noreply” in username)
  • Verify origin IP using IP-API (Treat w/suspicion if website w/local audience is visited from a proxy, hosting or non-US IP.)
  • Parse URL, FORM, COOKIE & HEADER scopes for log4j exploits. (Our cloud-based WAF provider does this, but we’ve detected some that they’ve missed.)
  • Sanitize emojis using emoji-java. (ColdFusion doesn’t like filenames w/emojis. Improperly configured database tables may not store them properly. Values exported via Excel could cause issues when imported into 3rdparty systems.)
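
To make two of the items above concrete, here is a sketch of what a Cyrillic check and a Gmail normalizer might look like. The function bodies are assumptions written for illustration, not the poster's actual code, and normalizeGmail is a hypothetical name.

```cfml
<cfscript>
// Returns true if the input contains any character in U+0400 to U+04FF.
// The backslash-u escapes are interpreted by the Java regex engine, not by CFML.
function isCyrillic( required string input ) {
    var pattern = createObject( "java", "java.util.regex.Pattern" )
        .compile( "[\u0400-\u04FF]" );
    return pattern.matcher( javaCast( "string", arguments.input ) ).find();
}

// Reduces a Gmail address to a canonical form: lower case, no "+tag", no dots.
// Addresses on other domains are only trimmed and lower-cased.
function normalizeGmail( required string email ) {
    var address = lCase( trim( arguments.email ) );
    if ( listLen( address, "@" ) != 2 ) {
        return address;
    }
    var localPart = listFirst( address, "@" );
    var domain    = listLast( address, "@" );
    if ( domain != "gmail.com" ) {
        return address;
    }
    // Drop everything from the first "+" onward, then remove dots.
    localPart = listFirst( localPart, "+" );
    localPart = replace( localPart, ".", "", "all" );
    return localPart & "@" & domain;
}

writeDump( isCyrillic( "plain ascii text" ) );
writeDump( normalizeGmail( "Jane.Doe+promo@gmail.com" ) ); // janedoe@gmail.com
</cfscript>
```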

I’d highly recommend checking out jsoup. It’s very customizable and can be configured to use different safeList filters to allow necessary HTML tags & attributes. You can define custom rules or use built in ones: none, simpleText, basic, basicWithImages & relaxed. (ColdFusion’s GetSafeHTML() function does not include these options.) We additionally use jsoup to post-process ColdFusion HTML generation to ensure we’re outputting 100% valid HTML while also injecting attributes necessary for WCAG compliance.
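
For anyone who wants to try this from CFML, here is a minimal sketch. It assumes a jsoup jar (1.14.2 or later, where the class is named Safelist rather than Whitelist) is already on Lucee's classpath; the sample markup is hypothetical.

```cfml
<cfscript>
// Both classes are only used through their static methods.
jsoup    = createObject( "java", "org.jsoup.Jsoup" );
safelist = createObject( "java", "org.jsoup.safety.Safelist" );

// Hypothetical rich-text submission.
dirty = '<p>Hello <b>world</b> <img src="https://example.com/photo.jpg"> <script>alert("TEST")</script></p>';

// basic() allows simple text formatting and links, but no images.
cleanBasic = jsoup.clean( dirty, safelist.basic() );

// basicWithImages() additionally allows img tags with http/https sources.
cleanWithImages = jsoup.clean( dirty, safelist.basicWithImages() );

writeDump( { basic : cleanBasic, basicWithImages : cleanWithImages } );
</cfscript>
```

Comparing the two results for the same input is a quick way to see which preset fits a given form field before writing any custom rules.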

Are you saying that in your experience, SanitizeHTML alone is insufficient for protecting against the sufficient majority of XSS attacks? I realize the definition of “sufficient” can vary a great deal, but if I end up spending so many hours on security that I probably don’t need when I could be allocating that time to projects that generate income, then the malicious attackers have won by other means.

So far I have detected zero attempts to bypass filters of any kind. I am merely responding to some reports from a “researcher” via claiming that one of my sites is vulnerable to XSS, which I don’t doubt, but if it’s so complicated to properly secure a website, I’d expect 99.999999% of websites are vulnerable. And yet where are all the news reports that 99.999999% of websites have been hacked?

To clarify my question, when I said “SanitizeHTML alone” I meant specifically for cleaning the form input, but of course I’m also (and more importantly) encoding output of all data that originated from user input (except for special cases such as html-enabled blog or email newsletter content from trusted users). There are many other solutions in my todo list as recommended by OWASP and others such as the Content-Security-Policy header with script-nonce.
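
For the Content-Security-Policy item, a rough sketch of a per-request nonce in CFML could look like the following. The variable name cspNonce is made up, and using generateSecretKey() as the source of randomness is an assumption; a real policy would normally carry more directives than script-src.

```cfml
<cfscript>
// One unpredictable value per request; generateSecretKey() returns a Base64 string.
cspNonce = generateSecretKey( "AES", 128 );

// Only script tags carrying this nonce are allowed to execute.
cfheader(
    name  = "Content-Security-Policy",
    value = "script-src 'nonce-#cspNonce#'"
);
</cfscript>

<cfoutput>
<script nonce="#encodeForHTMLAttribute( cspNonce )#">
  // trusted inline script goes here
</script>
</cfoutput>
```

With a header like this in place, an injected script tag without the matching nonce should be refused by browsers that support CSP, which makes it a backstop for the cases where output encoding was forgotten.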

On a side note, I found this compelling article about the definition of “valid” input:

“Insufficient for protecting”? It was invalidating good data that was being posted. If you are expecting HTML, use it. If you aren’t expecting HTML (and HTML is passed), sanitize it. But I don’t recommend blindly sanitizing every string that is passed in a form post as sanitizing non-HTML can munge valid data.

Are your websites search-engine friendly and can they be fully indexed by spiders? Do they process data (ie, email)? Do they import data from CSV/Excel? Do they use an API? Do they have search features? Are there filters to prevent obscenities in web chat/message boards? Do they process credit cards? Do they have contact forms? Does the website have a large-ish registered membership base? If files are uploaded, do you rename or sanitize filenames? I develop, manage and secure a large assortment of CFML-based web applications and have encountered numerous abuses that attempt to bypass filters by providing visually similar and/or fallback characters.

I still personally recommend using JSOUP over SanitizeHTML based on overall feature set. (I’m working on a CFC library with many built-in features, but haven’t had time to finish it yet.)

As Brad noted, the FuseGuard Trial would also be beneficial for you to evaluate since you stated that website is susceptible to XSS.

My original point was to NOT do that. As in, you’re looking at this whole thing backwards. Who cares what your users give you? It’s just your job to store it, exactly as they input it, and then output the same. All you need to do is encode it properly when you output it. So if their legal name is brad <br> wood then that’s exactly what you store and exactly what you output on your web page. As soon as you start guessing what’s good and bad, you wind up with false positives that remove a valid part of someone’s data because it happened to contain an HTML meta character.

Right, and I quickly dropped my idea for that after you described the issues.

Also, I figured out why I got conflicting results re: SanitizeHTML() converting & to &amp;. The first time, I was looking at the page’s source code, whereas the second time I was looking at the browser’s console, which I now realize renders &amp; as &.

I suppose that shouldn’t be too difficult for me since I’ve already been using jsoup for years, parsing links in email newsletters and replacing user urls with my own tracking urls.

But how does jsoup compare to OWASP Java HTML Sanitizer?

The specific job for this topic is to prevent XSS attacks. While encoding output is the most important solution, I still have no interest in storing obviously malicious code in my database. I don’t want to munge legit text such as x < y, but why would I ever want to insert <script src=""> into my database when it’s supposed to be non-html content? And even when the input is intended to be html/css/js that’s later rendered and executed, why not store in its sanitized form?

What if, for example, some day I open up an API allowing access to my data? I have no control over whether devs using my API are encoding their output. Sure it can be argued that’s not my problem, but you know how non-tech people think. They would still blame the data source for containing malicious code.

Also, since I am human and therefore imperfect, what if one day I just happen to forget to encode or sanitize some specific output? Preventing malicious code from being stored in the first place seems like a reasonable extra layer of protection.

As always, I am open to being proven wrong. :stuck_out_tongue:

I get it, but determining exactly what is malicious or not is not very clear cut. Sure, it seems obvious when we sit here and look at examples, but the problem I usually run into with any sort of heuristics-based checks is they get false positives and remove valid things, or miss stuff.

The short answer to your question is that there is no isObviouslyMalicious() function built into Lucee. You may find various homemade ones out there, but I’m sure none of them will catch everything, and all of them will sometimes remove valid data. The only enterprise-class libraries you’ll find that are battle tested and pretty bulletproof are the ESAPI encoders and AntiSamy, but neither of those has a “does this seem safe” check that I’m aware of. They simply encode text or remove HTML based on your output needs.

FWIW, I don’t know if this was covered above, but this is basically what Lucee’s script protect does. It looks for any of the following in the string


and replaces them with


In fact, if you wanted to just tap into this, you can actually call Lucee’s internal method to do this like so:

scriptProtect = createObject( 'java', '' )

cleaned = scriptProtect.translate( 'foo<script>bar' )

which would return the text


I give this sort of approach a low confidence level. There are places where we know it won’t work and places where it can be bypassed pretty easily, but it’s a starting point as far as what’s in the engine.

I’m not 100% sure how they compare. I’ve looked at the documentation and they both seem fairly similar with different presets and the ability to set custom rules.

I found some unit test cases here that have some invalid/vulnerable HTML you can try:

That’s also what I was thinking of. If a user enters a name with the content Andy<script src=""> to try XSS and you’re outputting it using #encodeForHTML( name )# then your application will just show Name: Andy<script src="">. I agree that this looks awfully ugly, but it’s not a threat, or is it somehow a threat?

Additionally, just as I’m adding that particular example JavaScript snippet here in this Discourse forum thread, it’s totally valid content to be added to the forum’s database. Why would I want to always blindly sanitize or strip that code?

However, identifying malicious code in a string is very tough. When I can’t use encodeForHTML(), I wouldn’t rely totally on any library. I’d rather provide a whitelist of pre-built code snippets that the user can select and that get added programmatically to their content.

It’s a threat at least in a business sense, meaning that I might lose customers if they see that in a context where it’s not appropriate and they question why such content is allowed. Just the thought that the site might be hacked is enough for many people to run away.

In a forum such as this where sharing examples of bad code is totally appropriate, it can easily be identified by the user and allowed by the app exactly as we do here, by putting it in a code block.

As for how to identify bad code, well if the sanitized form is very much different than the original, that’s enough to raise a big red flag.