I have a requirement to escape a blacklist of HTML tags in a string before displaying it on a web page. The reason for being selective is to preserve formatting (bold, italics, fonts, etc.) while neutralizing tags that would "break" the page (scripts, meta, etc.).
After thinking about this for a while, I came up with two approaches:
- RegEx - as almost everyone will tell you, using RegEx to parse HTML is a bad idea
- HtmlAgilityPack
I figured my best (and really only) option was to load the string into HtmlAgilityPack and loop recursively through the child nodes. For each node, I would check whether it is on the specified blacklist. If it is, I would escape the node's opening tag (and closing tag, if one exists), then process its InnerHtml. If it is not on the list, I would output the node as is, while still processing its InnerHtml.
So, given the following (very simple) source
The quick <b style='padding: 0 25em;'>brown</b> fox <b>jumped <i>over</i> the <meta http-equiv='refresh' /> moon</b>.
I need the following output
The quick <b style='padding: 0 25em;'>brown</b> fox <b>jumped <i>over</i> the &lt;meta http-equiv='refresh' /&gt; moon</b>.
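For illustration, the recursive walk described above might be sketched roughly as follows. This is a minimal sketch rather than a vetted sanitizer: the class TagEscaperSketch, the method EscapeBlacklistedTags, and the helpers WriteNode and BuildStartTag are hypothetical names, not part of HtmlAgilityPack. It assumes that rebuilding start tags by hand is acceptable (attribute values come back in double quotes), and it does not filter attributes such as onclick.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class; only HtmlDocument, HtmlNode and HtmlAttribute come from HtmlAgilityPack.
public static class TagEscaperSketch
{
    // Void elements never get a closing tag.
    private static readonly HashSet<string> VoidElements = new HashSet<string>(
        new[] { "area", "base", "br", "col", "embed", "hr", "img", "input",
                "link", "meta", "source", "track", "wbr" },
        StringComparer.OrdinalIgnoreCase);

    public static string EscapeBlacklistedTags(string value, ICollection<string> tags)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(value);

        var blacklist = new HashSet<string>(tags, StringComparer.OrdinalIgnoreCase);
        var output = new StringBuilder();
        foreach (HtmlNode child in doc.DocumentNode.ChildNodes)
            WriteNode(child, blacklist, output, false);
        return output.ToString();
    }

    private static void WriteNode(HtmlNode node, HashSet<string> blacklist,
                                  StringBuilder output, bool parentEscaped)
    {
        if (node.NodeType != HtmlNodeType.Element)
        {
            // Text and comments pass through unchanged, except directly under an
            // escaped element (e.g. script bodies), where they are encoded too.
            string raw = node.OuterHtml;
            output.Append(parentEscaped ? WebUtility.HtmlEncode(raw) : raw);
            return;
        }

        bool escape = blacklist.Contains(node.Name);
        string open = BuildStartTag(node);
        output.Append(escape ? WebUtility.HtmlEncode(open) : open);

        if (VoidElements.Contains(node.Name))
            return; // no children, no closing tag

        // Children are always processed, whether or not this node was escaped.
        foreach (HtmlNode child in node.ChildNodes)
            WriteNode(child, blacklist, output, escape);

        string close = "</" + node.Name + ">";
        output.Append(escape ? WebUtility.HtmlEncode(close) : close);
    }

    // Assumption: normalizing attribute quoting to double quotes is acceptable.
    private static string BuildStartTag(HtmlNode node)
    {
        var tag = new StringBuilder("<").Append(node.Name);
        foreach (HtmlAttribute attr in node.Attributes)
        {
            string attrValue = (attr.Value ?? string.Empty).Replace("\"", "&quot;");
            tag.Append(' ').Append(attr.Name).Append("=\"").Append(attrValue).Append('"');
        }
        return tag.Append('>').ToString();
    }
}
```

Calling TagEscaperSketch.EscapeBlacklistedTags(source, new[] { "meta", "script" }) on the sample above should leave the b and i tags in place and emit the meta tag as encoded text. Building the output in a StringBuilder avoids having to modify InnerHtml on the parsed nodes.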
After a lot of research, I ran into several problems, questions, and roadblocks.
- Is HtmlAgilityPack the best library for this requirement?
- With .Descendants(), the <i>over</i> node appears both in the InnerHtml of the b node and as a node of its own.
- InnerHtml: […] (Name, Id, Attributes, etc.) […]
Here is what I have so far:
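// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
// Incomplete attempt: nothing is written to an output yet, and nothing is returned.
public string EscapeHtmlTags(string value, ICollection<string> tags) {
    var doc = new HtmlDocument();
    doc.LoadHtml(value);
    if (tags.Contains(doc.DocumentNode.Name, StringComparer.CurrentCultureIgnoreCase)) {
        // Blacklisted: the opening and closing tags should be escaped here
        EscapeHtmlTags(doc.DocumentNode.InnerHtml, tags);
    }
    else {
        // Allowed: the node should be output as is
        EscapeHtmlTags(doc.DocumentNode.InnerHtml, tags);
    }
}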
, , , NodeTypes - , , StringBuilder .. .
?