Before everyone tells me that I should not do sanitization on the client side (I really do intend to do this on the client, although it could work in server-side JavaScript as well), let me clarify what I'm trying to do.
I would like something similar to Google Caja or HTMLPurifier, but for JavaScript: a whitelist-based security approach that processes HTML and CSS (received first as a string, of course, not already inserted into the DOM, which would be unsafe) and then selectively filters out unsafe tags or attributes, either ignoring them, optionally including them as escaped text, or otherwise reporting them to the application for further processing, ideally in context. It would be great if it could also reduce any JavaScript to a safe subset, as Google Caja does, but I know that is asking a lot.
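To make the idea concrete, here is a minimal sketch of the kind of whitelist filter I have in mind. It is not an existing library: the `ALLOWED` table, `isSafeUrl`, `sanitize`, and the `onRejected` callback are all hypothetical names, and the input is a simplified `{ tag, attrs, children }` tree standing in for a parsed DOM rather than real DOM nodes.

```javascript
// Hypothetical whitelist: tag name -> attribute names permitted on that tag.
const ALLOWED = {
  p: [],
  em: [],
  a: ['href', 'title'],
};

// Assumption for this sketch: only absolute http(s) and mailto links are safe.
// Relative URLs are rejected too, which a real filter might want to relax.
function isSafeUrl(value) {
  return /^(https?:|mailto:)/i.test(value.trim());
}

// node is either a string (text) or { tag, attrs, children }.
// Returns an array holding the sanitized node, or an empty array if it was dropped.
// onRejected is told about everything that gets filtered out, so the
// application can decide what to do with it (log it, show it escaped, etc.).
function sanitize(node, onRejected) {
  if (typeof node === 'string') return [node];

  const tag = node.tag.toLowerCase();
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, tag)) {
    onRejected({ kind: 'element', tag });
    return []; // drop the element together with its subtree
  }

  const allowedAttrs = ALLOWED[tag];
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const lname = name.toLowerCase();
    const ok = allowedAttrs.includes(lname) &&
               (lname !== 'href' || isSafeUrl(value));
    if (ok) attrs[lname] = value;
    else onRejected({ kind: 'attribute', tag, name: lname });
  }

  const children = (node.children || []).flatMap(
    (child) => sanitize(child, onRejected)
  );
  return [{ tag, attrs, children }];
}
```

Dropping a rejected element with its whole subtree is the most conservative choice; keeping its text content as escaped text would be the alternative I mentioned above.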
My use case involves untrusted XML/XHTML data obtained through JSONP (data fetched from MediaWiki before wiki processing, thereby allowing raw but untrusted XML/HTML input). The application lets the user run queries and transformations on this data (XQuery, jQuery, XSLT, etc.), uses HTML5 for offline use, IndexedDB storage, etc., and then lets the results be previewed on the same page where the user viewed the input source and built or imported their queries.
The user can produce whatever output they want, so I won't be sanitizing what they do; if they want to inject JavaScript into the page, all the power to them. But I do want to protect users who want confidence that they can add code which safely copies targeted elements from the untrusted input, without letting any unsafe input come along with it.
This is definitely doable, but I'm wondering if there are any libraries that already do this.
And if I am stuck implementing this myself (though I'm curious in either case), I would like evidence on whether using innerHTML or DOM creation/appending before insertion into the document is safe in every way. For example, can events be accidentally triggered if I first run DOMParser, or rely on the browser's HTML parsing by using innerHTML to add raw HTML to a non-inserted div? I believe it should be safe, but I'm not sure whether DOM manipulation events could somehow fire before insertion and be exploited.
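To be concrete, these are the two ways of building the DOM that I am asking about. This is only a sketch: the function names are made up, and the extra `ParserCtor` and `doc` parameters are an assumption added so the functions can be exercised outside a browser; in a page they default to the browser's own `DOMParser` and `document`.

```javascript
// Approach 1: let DOMParser build a separate document from the string.
// ParserCtor defaults to the browser's DOMParser.
function parseWithDomParser(html, ParserCtor = globalThis.DOMParser) {
  const parsed = new ParserCtor().parseFromString(html, 'text/html');
  return parsed.body; // root to traverse and filter afterwards
}

// Approach 2: assign the string to innerHTML of a div that is created
// from the page's document but never attached to it.
function parseWithDetachedDiv(html, doc = globalThis.document) {
  const div = doc.createElement('div');
  div.innerHTML = html;
  return div; // root to traverse and filter afterwards
}
```

In both cases the untrusted string is handed to the browser's parser before any filtering has happened, which is exactly the step I am unsure about.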
Of course, the constructed DOM would need to be sanitized after that point, but I just want to verify that I can safely build the DOM object itself for easier traversal, and then worry about filtering out unwanted elements, attributes, and attribute values.
Thanks!