HTML Purifier: removing an element conditionally based on its attributes

Question

According to the HTML Purifier smoketest, malformed URIs are sometimes discarded entirely, leaving behind an anchor tag without any attributes, e.g.

<a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a>

...and sometimes the URI is stripped down to just the protocol, e.g.

<a href="http://1113982867/">XSS</a> becomes <a href="http:/">XSS</a>

While that is harmless in itself, it's a bit ugly. Instead of trying to strip these leftovers out with regular expressions, I was hoping to use HTML Purifier's own library capabilities / injectors / plug-ins / whathaveyou.

Reference point: attribute handling

Removing an attribute conditionally in HTML Purifier is very easy. The library offers the HTMLPurifier_AttrTransform class with its confiscateAttr() method for this.

While I don't personally use confiscateAttr(), I do use an HTMLPurifier_AttrTransform, as described in this thread, to add target="_blank" to all anchors.

    // more configuration stuff up here
    $htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
    $anchor  = $htmlDef->addBlankElement('a');
    $anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_Target();
    // purify down here

HTMLPurifier_AttrTransform_Target is of course a very simple class.

    class HTMLPurifier_AttrTransform_Target extends HTMLPurifier_AttrTransform
    {
        public function transform($attr, $config, $context)
        {
            // I could call $this->confiscateAttr() here to throw away an
            // undesired attribute
            $attr['target'] = '_blank';
            return $attr;
        }
    }

This part works like a charm, naturally.

Handling elements

Perhaps I'm not squinting hard enough at HTMLPurifier_TagTransform, or I'm looking in the wrong place(s), or I'm simply not understanding it, but I can't figure out a way to conditionally remove elements.

Say, something along these lines (a hypothetical API):

    // more configuration stuff up here
    $htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
    $anchor  = $htmlDef->addElementHandler('a');
    $anchor->elem_transform_post[] = new HTMLPurifier_ElementTransform_Cull();
    // add target as per 'point of reference' here
    // purify down here

Here the Cull class would extend something that offers a confiscateElement() method, or something comparable, in which I could check for a missing href attribute or an href attribute whose value is http:/.

HTMLPurifier_Filter

I understand that I could create a filter, but the examples (Youtube.php and ExtractStyleBlocks.php) suggest I'd be using regular expressions in it, which I would really rather avoid if at all possible. I am hoping for an on-board or quasi-on-board solution that makes use of HTML Purifier's excellent parsing capabilities.

Returning null in a child class of HTMLPurifier_AttrTransform unfortunately doesn't cut it.

Does anyone have any clever ideas, or am I stuck with regexes? :)

+4

html php html-parsing htmlpurifier

pinkgothic Apr 14 '10 at 15:20

3 answers

The fact that you can't remove elements with a TagTransform appears to be an implementation detail. The classic mechanism for removing nodes (a higher-level concept than just tags) is to use an Injector.

In any case, at least part of the functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty.
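For readers who have not used that directive, here is a minimal sketch of switching it on. It assumes HTML Purifier 4.x is reachable on the include path via HTMLPurifier.auto.php, and the input string is made up for illustration. As far as the directive's documentation goes, it targets elements that have no content, so an anchor that lost its href but still wraps text (such as <a>XSS</a>) would presumably not be affected by this alone.

```php
<?php
// Assumption: HTML Purifier 4.x is installed and on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Ask HTML Purifier to drop elements that are left with no content.
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);

// Illustrative input only: the empty anchor is a candidate for removal.
$clean = $purifier->purify('<p>Some text <a></a> and more.</p>');
echo $clean, PHP_EOL;
```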

+2

Edward Z. Yang Apr 14 '10 at 15:34

For perusal, this is my current solution. It works, but it bypasses HTML Purifier entirely.

    /**
     * Removes <a></a> and <a href="http:/"></a> tags from the purified
     * HTML.
     * @todo solve this with an injector?
     * @param string $purified The purified HTML
     * @return string The purified HTML, sans pointless anchors.
     */
    private function anchorCull($purified)
    {
        if (empty($purified)) return '';

        // re-parse HTML
        $domTree = new DOMDocument();
        $domTree->loadHTML($purified);

        // find all anchors (even good ones)
        $anchors = $domTree->getElementsByTagName('a');

        // collect bad anchors (destroying them in this loop breaks the DOM)
        $destroyNodes = array();
        for ($i = 0; ($i < $anchors->length); $i++) {
            $anchor = $anchors->item($i);
            $href   = $anchor->attributes->getNamedItem('href');
            // <a></a>
            if (is_null($href)) {
                $destroyNodes[] = $anchor;
            // <a href="http:/"></a>
            } else if ($href->nodeValue == 'http:/') {
                $destroyNodes[] = $anchor;
            }
        }

        // destroy the collected nodes
        foreach ($destroyNodes as $node) {
            // preserve content: childNodes is a live list, so move the
            // first child repeatedly instead of iterating by index
            // (an index loop would skip every other child)
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
            // actually destroy the node
            $node->parentNode->removeChild($node);
        }

        // strip the HTML string out of the DOM structure
        $html  = $domTree->saveHTML();
        $begin = strpos($html, '<body>') + strlen('<body>');
        $end   = strpos($html, '</body>');

        return substr($html, $begin, $end - $begin);
    }

I'd still much rather have a proper HTML Purifier solution for this, so, as a heads-up, I won't be self-accepting this answer. But in case no better answer turns up, it might at least help those with similar issues. :)

0

pinkgothic Apr 15 '10 at 17:05

pinkgothic · Accepted Answer · 2010-04-19T08:46:49+0000

Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a fairly simple solution:

    // a bit of context
    $htmlDef = $this->configuration->getHTMLDefinition(true);
    $anchor  = $htmlDef->addBlankElement('a');

    // HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
    // all anchor tags (see first post for class detail)
    $anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();

    // this is the magic! We're making 'href' a required attribute (note the
    // asterisk) - now HTML Purifier removes <a></a>, as well as
    // <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
    // is through with it!
    $htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());

It works, it works, bahahahaHAHAHAHA...! *manic laughter, gurgling noises, keels over with a smile on the face*
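The HTMLPurifier_AttrTransform_RemoveLoneHttp class itself is not reproduced in this thread. The following is only a guess at what it might look like, modelled on the HTMLPurifier_AttrTransform_Target class from the question and on the confiscateAttr() method mentioned there. It assumes that the only value to strip is exactly http:/ and that HTML Purifier is reachable on the include path; the original class may well differ.

```php
<?php
// Assumption: HTML Purifier is installed and on the include path.
require_once 'HTMLPurifier.auto.php';

/**
 * Hypothetical reconstruction: drops an href attribute whose value has
 * been reduced to the lone protocol "http:/".
 */
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context)
    {
        // Leave anchors alone unless href is exactly the lone protocol.
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            // confiscateAttr() unsets the key and returns its old value.
            $this->confiscateAttr($attr, 'href');
        }
        return $attr;
    }
}
```

With href marked as required, an anchor that leaves this transform without an href is then a candidate for removal by HTML Purifier itself, which is the effect described in the answer above.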
