Can anyone point me to the syntax rules that cfsearch uses in Lucee/Lucene, or to examples of sanitizing the criteria input field?
For example, searching for “–” generates an error, and I am sure there are more cases like it.
I am running version 5.3.6.61, and I think I found what I was looking for. I modified a UDF from cflib.org that cleans text input to make it Solr compatible. It is based on the Escaping Special Characters section of the Apache Lucene - Query Parser Syntax page.
The UDF requires another one, uCaseWordsForSolr, also found at cflib.org.
This seems to do what I wanted, and it prevents input of invalid search characters.
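The uCaseWordsForSolr helper is not reproduced in this post; the real one is on cflib.org. If you only want to try the function below before fetching it, a rough stand-in might look like this. It is an assumption based on the comment in the code (uppercase the boolean operators, lowercase the rest) and is not the cflib.org implementation:

```cfml
<cfscript>
// Stand-in sketch only, NOT the cflib.org version of uCaseWordsForSolr.
// Assumes the helper just needs to uppercase and/or/not and lowercase everything else.
string function uCaseWordsForSolr(required string input){
    // Lowercase first, then split on spaces (empty items are dropped)
    var words = listToArray(lCase(arguments.input), " ");
    var operators = "and,or,not";
    for (var i = 1; i <= arrayLen(words); i++) {
        // Whole-word match against the operator list
        if (listFind(operators, words[i])) {
            words[i] = uCase(words[i]);
        }
    }
    return arrayToList(words, " ");
}
</cfscript>
```

With that (or the real helper) in place, the modified solrClean is: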
<cfscript>
/**
 * Like VerityClean, massages text input to make it Solr compatible.
 * v1.0 by Sami Hoda
 * v2.0 by Daria Norris to deal with wildcard characters used as the first letter of the search
 * v2.1 by Paul Alkema - updated list of characters to escape
 * v2.2 by Adam Cameron - Merge Paul's & Daria's versions of the function, improve some regexes, fix logic error with input argument (was both required and had a default), converted wholly to script
 * v2.3 by Dennis Powers - modified cleantext regex and reBadChars regex for Lucene input
 *
 * @param input String to run against (Required)
 * @return Returns a string.
 * @author Sami Hoda (sami@bytestopshere.com)
 * @version 2.2, October 2, 2012
 * @version 2.3, July 26, 2020
 */
string function solrClean(required string input){
    var cleanText = trim(arguments.input);
    // List of bad characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
    // http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping Special Characters
    // v2.2 pattern, kept for reference:
    // var reBadChars = "\+|-|&&|\|\||!|\(|\)|{|}|\[|\]|\^|\""|\~|\*|\?|\:|\\";
    // v2.3 pattern: as above, but + - and * are no longer stripped here
    var reBadChars = "\&&|\|\||!|\(|\)|{|}|\[|\]|\^|\""|\~|\?|\:|\\";
    // Replace commas with " or " (uCaseWordsForSolr is expected to uppercase it to OR below)
    cleanText = replace(cleanText, "," , " or " , "all");
    // Strip bad characters
    cleanText = reReplace(cleanText, reBadChars, " ", "all");
    // Collapse runs of whitespace into a single space
    cleanText = reReplace(cleanText, "\s+", " ", "all");
    // Strip any run of special or wildcard characters at the start of the string
    cleanText = reReplace(cleanText, "(^[\+|-|&&|\|\||!|\(|\)|{|}|\[|\]|\^|\""|\~|\*|\?|\:|\\\-\*\?]{1,})", "");
    // uCaseWords: and=AND, etc.; lowercase the rest. If a keyword is mixed case, Solr treats it as case-sensitive!
    cleanText = uCaseWordsForSolr(cleanText);
    return trim(cleanText);
}
</cfscript>
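For anyone wiring this up, calling it in front of cfsearch could look roughly like the following. The names form.criteria, myCollection, and results are placeholders I made up for illustration, so swap in your own:

```cfml
<!--- Hypothetical names: form.criteria, myCollection and results are placeholders --->
<cfparam name="form.criteria" default="">

<!--- Sanitize the raw user input before it reaches the query parser --->
<cfset cleanCriteria = solrClean(form.criteria)>

<cfif len(cleanCriteria)>
    <cfsearch collection="myCollection" criteria="#cleanCriteria#" name="results">
    <cfoutput>#results.recordCount# result(s)</cfoutput>
<cfelse>
    <!--- Nothing searchable left after cleaning --->
    <cfoutput>Please enter a search term.</cfoutput>
</cfif>
```

The len() check is there because input made up only of stripped characters comes back from solrClean as an empty string, and it seems safer not to pass an empty criteria on to cfsearch.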