Syntactic rules for cfsearch

dennis · July 28, 2020, 4:43pm

Can anyone point me to the syntactic rules used for cfsearch in Lucee/Lucene, or to examples of sanitizing the criteria input field?

For example, searching for “–” generates an error, and I am sure there are more inputs like it.

cfmitrah · July 29, 2020, 6:49am

@dennis, can you share some details? Which Lucee version are you running?

dennis · July 29, 2020, 2:25pm

I am running version 5.3.6.61, and I think I found what I was looking for. I modified a UDF found on cflib.org that cleans text input to make it Solr compatible. My changes are based on the information found at Apache Lucene - Query Parser Syntax, Special Characters.
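The Lucene page mentioned above also describes escaping a special character with a backslash instead of removing it. A minimal sketch of that alternative follows, using the character list from that page. `escapeLuceneChars` is a hypothetical helper, not part of the UDF in this post, and whether cfsearch accepts escaped criteria would need testing.

```cfml
<cfscript>
// Hypothetical helper (an assumption, not part of the cflib.org UDF):
// put a backslash in front of each Lucene special character instead of stripping it.
string function escapeLuceneChars(required string input){
    var backslash = chr(92);
    // Characters listed on the Lucene query syntax page: + - & | ! ( ) { } [ ] ^ " ~ * ? : \
    var specials = '+-&|!(){}[]^"~*?:' & backslash;
    var out = "";
    for (var i = 1; i <= len(arguments.input); i++){
        var ch = mid(arguments.input, i, 1);
        // find() returns a position greater than 0 when ch is one of the special characters
        if (find(ch, specials)){
            out &= backslash;
        }
        out &= ch;
    }
    return out;
}

writeOutput(escapeLuceneChars("C++ (beta)")); // C\+\+ \(beta\)
</cfscript>
```

Escaping keeps the user's characters searchable as literals, while stripping (the approach taken in the UDF in this post) simply discards them.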

The UDF requires another one, uCaseWordsForSolr, also found at cflib.org.

This seems to do what I wanted now and prevents input of invalid search characters.

<cfscript>
/**
 * Like VerityClean, massages text input to make it Solr compatible.
 * v1.0 by Sami Hoda
 * v2.0 by Daria Norris to deal with wildcard characters used as the first letter of the search
 * v2.1 by Paul Alkema - updated list of characters to escape
 * v2.2 by Adam Cameron - Merge Paul's & Daria's versions of the function, improve some regexes, fix logic error with input argument (was both required and had a default), converted wholly to script
 * v2.3 by Dennis Powers - modified cleantext regex and reBadChars regex for Lucene input
 *
 * @param input      String to run against (Required)
 * @return Returns a string. 
 * @author Sami Hoda (sami@bytestopshere.com) 
 * @version 2.2, October 2, 2012
 * @version 2.3, July 26, 2020
 */
string function solrClean(required string input){
    var cleanText = trim(arguments.input);
    // List of bad characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
    // http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Escaping Special Characters
    // Previous pattern, kept for reference:
    // var reBadChars = "\+|-|&&|\|\||!|\(|\)|{|}|\[|\]|\^|\""|\~|\*|\?|\:|\\";
    // Current pattern: same list, except that + - and * are no longer stripped
    var reBadChars = "\&&|\|\||!|\(|\)|{|}|\[|\]|\^|\""|\~|\?|\:|\\";
    
    // Replace comma with OR
    cleanText = replace(cleanText, "," , " or " , "all");

    // Strip bad characters
    cleanText = reReplace(cleanText, reBadChars, " ", "all");

    // Clean up sequences of space characters
    cleanText = reReplace(cleanText, "\s+", " ", "all");

    // Strip any run of special characters (wildcards included) from the start of the string.
    // Same character set as before, written as a plain character class.
    cleanText = reReplace(cleanText, "^[\+\-\|&!\(\)\{\}\[\]\^""~\*\?:\\]+", "");

    // uCaseWordsForSolr upper-cases operators (and -> AND, etc.) and lower-cases the rest; a mixed-case keyword is treated as case-sensitive by Solr
    cleanText = uCaseWordsForSolr(cleanText);
    return trim(cleanText);
}
</cfscript>
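
solrClean calls uCaseWordsForSolr, which is not reproduced in this thread. For trying the function without the cflib.org helper, a stand-in might look like the sketch below. It is an assumption based only on the comment in solrClean (operators upper-cased, everything else lower-cased), and the operator list AND, OR, NOT is a guess, so treat it as illustrative and not as the cflib.org code.

```cfml
<cfscript>
// Hypothetical stand-in, NOT the cflib.org implementation:
// upper-case the boolean operators and lower-case every other word.
string function uCaseWordsForSolr(required string input){
    var operators = "AND,OR,NOT"; // assumed operator list
    // Split on spaces; listToArray skips empty elements by default
    var words = listToArray(arguments.input, " ");
    for (var i = 1; i <= arrayLen(words); i++){
        if (listFindNoCase(operators, words[i])){
            words[i] = uCase(words[i]);
        } else {
            words[i] = lCase(words[i]);
        }
    }
    return arrayToList(words, " ");
}

writeOutput(uCaseWordsForSolr("Lucee or Lucene And cfsearch")); // lucee OR lucene AND cfsearch
</cfscript>
```

With a helper like this defined alongside it, the result of solrClean can be passed straight to the criteria attribute of cfsearch.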