"Simple" ColdFusion text cleaner/encoder that selectively allows some HTML elements.
This tool is English ASCII only.
This was written for ColdFusion 2018/2021. It has not been tested for earlier versions or for compatibility with Lucee.
When the user enters data, we need to "clean" it to prevent problems ranging from XSS attacks to simply broken page formatting.
encodeForHTML() and canonicalize() both do a great job of cleaning user-entered text, but sometimes you want to allow some formatting options for the end user.
CleanText came initially from a need to allow the user to preserve line breaks in their entered text. It was expanded to allow bold and italics and later lists.
It was eventually expanded to support most of the basic editing options provided by the bare-bones CKEditor install. CKEditor includes a downloadable component that will do the cleaning on the client side, but we wanted to minimize the download and processing load on the client and to allow the cleaning to be done on the server side.
We experimented with Markdown, but it proved too confusing for our end users to enter and was a problem for our reporting engine (the engine understood HTML, but did not have a Markdown interpreter option).
Eventually we will probably implement a full AntiSamy suite, but for now this simplified tool does the work just fine.
The text_util.cfm package has two main components: stripWord() and cleanText().
MS Word loves replacing basic text with fancy Unicode characters. stripWord() replaces many of these special characters with their basic text equivalents and then removes all remaining non-ASCII characters from the string.
stripWord() is used mostly by the cleanText() function, but is provided as a separate call if needed for other uses (we have been known to call encodeForHTML( stripWord( TEXT_FIELD ) )).
stripWord() specifically replaces the following codes:
Unicode 8220 - #chr(8220)# - left double quote with "
Unicode 8221 - #chr(8221)# - right double quote with "
Unicode 8216 - #chr(8216)# - left single quote with '
Unicode 8217 - #chr(8217)# - right single quote with '
Unicode 8211 - #chr(8211)# - en dash with -
Unicode 8212 - #chr(8212)# - em dash with -
Unicode 8226 - #chr(8226)# - bullet with *
Unicode 8230 - #chr(8230)# - ellipsis with ...
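The replacements listed above can be sketched roughly as follows. This is a minimal illustration, not the actual stripWord() source: the function name stripWordSketch and its internal variable names are hypothetical, and the real function may replace additional characters or strip non-ASCII text differently.

```cfml
<cfscript>
// Hypothetical sketch only; the real stripWord() in text_util.cfm may differ.
function stripWordSketch(required string text) {
    var result = arguments.text;

    // Decimal code points of common MS Word characters and their plain-text stand-ins.
    var replacements = [
        { code: 8220, plain: '"' },   // left double quote
        { code: 8221, plain: '"' },   // right double quote
        { code: 8216, plain: "'" },   // left single quote
        { code: 8217, plain: "'" },   // right single quote
        { code: 8211, plain: "-" },   // en dash
        { code: 8212, plain: "-" },   // em dash
        { code: 8226, plain: "*" },   // bullet
        { code: 8230, plain: "..." }  // ellipsis
    ];
    for (var r in replacements) {
        result = replace(result, chr(r.code), r.plain, "all");
    }

    // Keep only 7-bit ASCII characters; everything else is dropped.
    var ascii = "";
    for (var i = 1; i <= len(result); i++) {
        var c = mid(result, i, 1);
        if (asc(c) <= 127) {
            ascii &= c;
        }
    }
    return ascii;
}
</cfscript>
```

Doing the replacements before the non-ASCII strip matters: if the strip ran first, the curly quotes and dashes would simply vanish instead of being converted.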
Cleans and formats a string for display on the page.
cleanText() first runs stripWord() to remove MS Word characters.
It then optionally trims the string to maxLength characters. This is a blind trim and it will cut off text.
It then runs the CF function encodeForHTML(string) to replace HTML and other special characters with their escaped values.
After the encodeForHTML() call, the string will contain only screen-ready clean text and escaped special characters.
At this point, we go back through the string and replace some of the escaped HTML tags with real HTML to give the user some formatting options.
Replace escaped p, strong, em, u, s, sup, sub, blockquote, ol, ul, and li with their HTML equivalents. It will do minimal checks to ensure that the tags are balanced, but it is not perfect.
If links_ok is set, replace escaped links.
If the string was trimmed earlier, append an ellipsis (…) to the end.
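The steps above can be sketched as a short pipeline. This is a simplified illustration under several assumptions, not the actual cleanText() source: the name cleanTextSketch is hypothetical, a maxLength of 0 is assumed to mean "no trimming", encodeForHTML() is assumed to emit &lt;, &gt; and &#x2f; for <, > and /, and the appended &hellip; entity is a guess at the trailing marker. The tag-balancing checks and the links_ok handling are left out entirely.

```cfml
<cfscript>
// Hypothetical sketch only; assumes stripWord() from text_util.cfm is already included.
function cleanTextSketch(required string text, numeric maxLength = 0) {
    // 1. Replace MS Word characters and drop non-ASCII text.
    var result = stripWord(arguments.text);

    // 2. Optional blind trim (assumption: 0 means "do not trim").
    var trimmed = false;
    if (arguments.maxLength > 0 && len(result) > arguments.maxLength) {
        result = left(result, arguments.maxLength);
        trimmed = true;
    }

    // 3. Escape everything.
    result = encodeForHTML(result);

    // 4. Turn the escaped form of each allowed tag back into real HTML.
    //    Assumes the escaped forms are &lt;tag&gt; and &lt;&#x2f;tag&gt;.
    var allowed = "p,strong,em,u,s,sup,sub,blockquote,ol,ul,li";
    for (var tag in listToArray(allowed)) {
        result = replaceNoCase(result, "&lt;#tag#&gt;", "<#tag#>", "all");
        result = replaceNoCase(result, "&lt;&##x2f;#tag#&gt;", "</#tag#>", "all");
    }

    // 5. Mark trimmed text (assumption: an ellipsis entity).
    if (trimmed) {
        result &= "&hellip;";
    }
    return result;
}
</cfscript>
```

Because only exact escaped tags with no attributes are restored, anything like an escaped script tag or an onclick attribute stays escaped and is displayed as literal text.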
cleanText() will work for two common types of links:
- <a href="URL">text</a> - the link must start with the string a href= (that's what the search keys on).
- http://bareURL - a bare URL (starting with http or https) will be converted to <a href="http://bareURL">bareURL</a>
Just do a <cfinclude template="text_util.cfm" /> in your page and then call cleanText() on any value you want cleaned. Use it in place of encodeForHTML() or canonicalize().
<cfinclude template="./text_util.cfm" />
<cfoutput>
#cleanText(VARIABLE_WITH_POTENTIALLY_BAD_TEXT)#
</cfoutput>
cleanText() (or a close variant of cleanText()) has been in use on our intranet-based page for over 10 years with no reported problems. It has also been used on a closed-access, publicly facing Internet page.
Well, I would love to be able to detect web and email addresses and make them clickable, but that is proving to be more trouble than I care for at this time.