First, keep in mind that the HTML you receive when pasting from Word (or any other HTML source) will vary wildly depending on the source. Even different versions of Word will give you radically different input. If you develop code that works perfectly on content from the version of MS Word that you have, it may not work at all for a different version of MS Word.
Also, some sources will paste content that looks like HTML but is actually garbage. When you paste HTML content into a rich text area in your browser, your browser has nothing to do with how that HTML is generated. Do not expect it to be valid by any stretch of your imagination. In addition, your browser will further alter the HTML as it is inserted into the DOM of your rich text area.
Because the potential inputs vary so much, and because the acceptable outputs are difficult to define, it is hard to design a proper filter for this sort of thing. Further, you cannot control how future versions of MS Word will handle their HTML content, so your code will be difficult to future-proof.
However, take heart! If all the world's problems were easy ones, it would be a pretty boring place. There are some potential solutions. It is possible to keep the good parts of the HTML and discard the bad parts.
It looks like your HTML-based RTE works like most HTML editors out there. Specifically, it has an iframe, and on the document inside the iframe, it has set designMode to "on".
You want to trap the paste event when it occurs in the <body> element of the document inside that iframe. I am being very specific here, because it has to be: do not trap it on the iframe; do not trap it on the iframe's window; do not trap it on the iframe's document. Trap it on the <body> element of the document inside the iframe. Very important.
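    var iframe = your.rich.text.editor.getIframe(),
        body = iframe.contentWindow.document.body;
    // Attach to the <body> itself (very old IE needs attachEvent instead).
    body.addEventListener('paste', handlePaste, false);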
Notice that in my code example, a function called handlePaste was attached. We'll get to that. The paste event is funny: some browsers fire it before the paste, and some fire it after. You'll want to normalize that, so that you are always dealing with the pasted content after the paste. To do this, use a timeout.
function handlePaste() { window.setTimeout(filterHTML, 50); }
So, 50 milliseconds after a paste event, the filterHTML function will be called. This is the meat of the job: you need to filter the HTML and remove any undesirable styles or elements. You have a lot to worry about here!
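If several paste events arrive in quick succession, each one schedules its own filter run. Here is a minimal sketch of one way to coalesce them. The helper name makePasteHandler and its timers argument are hypothetical (they are not part of the code above); the timer functions are passed in only so the idea can be tried outside a browser.

```javascript
// Hypothetical helper: returns a paste handler that runs `filter` once,
// `delay` milliseconds after the most recent paste event.
// `timers` is optional and defaults to the global setTimeout/clearTimeout.
function makePasteHandler(filter, delay, timers) {
    var pending = null;
    timers = timers || {
        set: function (fn, ms) { return setTimeout(fn, ms); },
        clear: function (id) { clearTimeout(id); }
    };

    return function handlePaste() {
        // A newer paste supersedes a filter run that has not fired yet.
        if (pending !== null) {
            timers.clear(pending);
        }
        pending = timers.set(function () {
            pending = null;
            filter();
        }, delay);
    };
}
```

With this in place, the handler could be created as var handlePaste = makePasteHandler(filterHTML, 50); and attached exactly as before.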
I have personally seen MS Word paste in these elements:
- meta
- link
- style
- o:p (a paragraph in a different namespace)
- shapetype
- shape
- Comments, for example <!-- comment -->
- font
- And, of course, the MsoNormal class.
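One way to keep a list like this maintainable is a lookup table from node name to action. This is only a sketch: the names NODE_ACTIONS and actionFor are hypothetical, and the table covers just the elements and comments listed above (class names need separate handling).

```javascript
// Hypothetical lookup table: lowercased nodeName -> what a filter should do.
var NODE_ACTIONS = {
    'meta': 'remove',
    'link': 'remove',
    'style': 'remove',
    'o:p': 'remove',
    'shapetype': 'remove',
    'shape': 'remove',
    '#comment': 'remove',
    'font': 'unwrap'   // drop the element itself, keep its children
};

// Returns 'remove', 'unwrap', or 'keep' for a given DOM nodeName.
function actionFor(nodeName) {
    var key = String(nodeName).toLowerCase();
    return Object.prototype.hasOwnProperty.call(NODE_ACTIONS, key)
        ? NODE_ACTIONS[key]
        : 'keep';
}
```

Adding another unwanted element then means adding one line to the table rather than another case to a switch statement.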
The filterHTML function should remove all of those items when appropriate. You may also wish to remove other items as you deem necessary. Here is an example filterHTML that removes the items listed above.
    // Your favorite JavaScript library probably has these utility functions.
    // Feel free to use them. I'm including them here so this example will
    // be library-agnostic.
    function collectionToArray(col) {
        var x, output = [];
        for (x = 0; x < col.length; x += 1) {
            output[x] = col[x];
        }
        return output;
    }

    // Another utility function probably covered by your favorite library.
    function trimString(s) {
        return s.replace(/^\s\s*/, '').replace(/\s\s*$/, '');
    }

    function filterHTML() {
        var iframe = your.rich.text.editor.getIframe(),
            win = iframe.contentWindow,
            doc = win.document,
            // The lookahead leaves the trailing separator in place, so
            // neighboring class names are not glued together on removal.
            invalidClass = /(?:^|\s)msonormal(?=\s|$)/gi,
            cursor,
            nodes = [];

        // This is a depth-first, pre-order search of the document body.
        // While searching, we want to remove invalid elements and comments.
        // We also want to remove invalid classNames.
        // We also want to remove font elements, but preserve their contents.
        nodes = collectionToArray(doc.body.childNodes);
        while (nodes.length) {
            cursor = nodes.shift();
            switch (cursor.nodeName.toLowerCase()) {

            // Remove these invalid elements.
            case 'meta':
            case 'link':
            case 'style':
            case 'o:p':
            case 'shapetype':
            case 'shape':
            case '#comment':
                cursor.parentNode.removeChild(cursor);
                break;

            // Remove font elements but preserve their contents.
            case 'font':
                // Make sure we scan these child nodes too!
                nodes.unshift.apply(
                    nodes,
                    collectionToArray(cursor.childNodes)
                );
                // Move the children out, last to first, so order is kept.
                while (cursor.lastChild) {
                    if (cursor.nextSibling) {
                        cursor.parentNode.insertBefore(
                            cursor.lastChild,
                            cursor.nextSibling
                        );
                    } else {
                        cursor.parentNode.appendChild(cursor.lastChild);
                    }
                }
                // The font element is now empty; remove it.
                cursor.parentNode.removeChild(cursor);
                break;

            default:
                if (cursor.nodeType === 1) {
                    // Remove all inline styles
                    cursor.removeAttribute('style');
                    // OR: remove a specific inline style instead
                    // cursor.style.fontFamily = '';

                    // Remove invalid class names.
                    invalidClass.lastIndex = 0;
                    if (
                        cursor.className &&
                        invalidClass.test(cursor.className)
                    ) {
                        cursor.className = trimString(
                            cursor.className.replace(invalidClass, '')
                        );
                        if (cursor.className === '') {
                            cursor.removeAttribute('class');
                        }
                    }

                    // Also scan child nodes of this node.
                    nodes.unshift.apply(
                        nodes,
                        collectionToArray(cursor.childNodes)
                    );
                }
            }
        }
    }
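The class-name cleanup in filterHTML can also be done without a regular expression, by splitting the className string on whitespace. The helper below is a sketch under that assumption; the name stripClassNames is hypothetical, and it relies on the ES5 array methods map and filter, which very old browsers lack.

```javascript
// Hypothetical helper: remove unwanted class names from a className string.
// Comparison is case-insensitive, like the /i flag used in filterHTML.
function stripClassNames(className, unwanted) {
    var blocked = unwanted.map(function (name) {
        return name.toLowerCase();
    });
    return className
        .split(/\s+/)
        .filter(function (name) {
            return name !== '' && blocked.indexOf(name.toLowerCase()) === -1;
        })
        .join(' ');
}
```

Inside the default branch, this could be used as cursor.className = stripClassNames(cursor.className, ['MsoNormal']); followed by the same check that removes an empty class attribute.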
You included some sample HTML that you wish to filter, but you did not include a sample of the output you would like to see. If you update your question to show what you want your sample to look like after filtering, I will try to adjust the filterHTML function to match. For the time being, please consider this function as a starting point for devising your own filters.
Note that this code makes no attempt to distinguish pasted content from content that existed prior to the paste. It does not need to; the things that it removes are considered invalid wherever they appear.
An alternative solution would be to filter out these styles and contents using regular expressions against the innerHTML of the document body. I have gone down that route, and I advise against it in favor of the solution I present here. The HTML you receive on paste varies so much that regex-based parsing will quickly run into serious problems.
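To illustrate the kind of trouble meant here, consider a naive regex that strips the MsoNormal class from markup. The function name and the three sample strings are made up for illustration, not taken from real Word output.

```javascript
// A naive attempt: strip the MsoNormal class with a regular expression.
function naiveStripMsoNormal(html) {
    return html.replace(/ class="MsoNormal"/g, '');
}

var quoted = '<p class="MsoNormal">one</p>';
var unquoted = '<p class=MsoNormal>two</p>';
var combined = '<p class="MsoNormal lead">three</p>';

naiveStripMsoNormal(quoted);    // '<p>one</p>'
naiveStripMsoNormal(unquoted);  // unchanged: the attribute has no quotes
naiveStripMsoNormal(combined);  // unchanged: a second class name follows
```

Each missed variant needs another pattern, whereas walking the DOM sees the same className property regardless of how the markup was written.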
Edit:
I think I see now: you are trying to remove the style attributes themselves, right? If so, you can do this inside the filterHTML function by including this line:
cursor.removeAttribute('style');
Or you can target individual inline styles for deletion as follows:
cursor.style.fontFamily = '';
I updated the filterHTML function to show where these lines will go.
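If you target individual styles, the element can be left with an empty style attribute once every property has been cleared. The sketch below combines the two approaches; clearInlineStyles is a hypothetical helper name, and it only assumes the element exposes style, getAttribute and removeAttribute as DOM elements do.

```javascript
// Hypothetical helper: clear the given inline style properties on an element,
// then drop the style attribute entirely if nothing is left in it.
// `properties` uses the camelCase names of the element's style object.
function clearInlineStyles(element, properties) {
    var i;
    for (i = 0; i < properties.length; i += 1) {
        element.style[properties[i]] = '';
    }
    if (!element.getAttribute('style')) {
        element.removeAttribute('style');
    }
}
```

For example, clearInlineStyles(cursor, ['fontFamily', 'fontSize']) would strip the font settings while leaving other inline styles, such as alignment, untouched.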
Good luck and happy coding!