Using regex to remove HTML tags
I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
[Edit] There may be several links in the text.
All HTML tags must be removed, and the href value of each <a> tag must be appended in parentheses after the link text, as shown above.
What would be an effective way to solve this with regex? Any code snippet would be great.
DOM:
$dom = new DOMDocument;
$dom->loadHTML($html); // $html holds the markup ($text in the question)
$xpath = new DOMXPath($dom);
// Replace every link that has an href with "link text (href)".
foreach ($xpath->query('//a[@href]') as $node) {
    $textNode = new DOMText(sprintf('%s (%s)',
        $node->nodeValue, $node->getAttribute('href')));
    $node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
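The same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
// getElementsByTagName() returns a live list, so copy it to an array
// before replacing nodes; otherwise some links can be skipped.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node) {
    if ($node->hasAttribute('href')) {
        $textNode = new DOMText(sprintf('%s (%s)',
            $node->nodeValue, $node->getAttribute('href')));
        $node->parentNode->replaceChild($textNode, $node);
    }
}
echo strip_tags($dom->saveHTML());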
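All you need to do is load the HTML into a DomDocument. Then use XPath, which is to XML roughly what SQL is to a database, to find every link that has an href. Replace each such node with a text node built from the node's inner content (innerHTML) and its href value. See the DOM API and XPath documentation for details.
If you still prefer regex, keep in mind that it is far less robust than a parser for this kind of task.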
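Since the question asks for regex specifically, here is a minimal sketch of that route. It assumes double-quoted href values and no nested <a> elements; the function name stripTagsKeepLinks and the pattern are illustrative assumptions, not part of any answer here. Anything beyond such simple markup is better handled by a parser.

```php
<?php
// Hypothetical helper; the name and the pattern are illustrative assumptions.
function stripTagsKeepLinks(string $text): string
{
    // Rewrite <a ... href="URL" ...>label</a> as "label (URL)".
    // Assumes double-quoted href values and no nested <a> elements.
    $pattern = '~<a\s(?:[^>]*?\s)?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>~is';
    $text = preg_replace($pattern, '$2 ($1)', $text);

    // Remove every remaining tag, such as <i> or an <a> without href.
    return strip_tags($text);
}

$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
echo stripTagsKeepLinks($text), "\n";
// prints: We had fun. Look at this photo (http://example.com) of Joe
```

Because the quantifier inside the link is lazy, several links in one string are rewritten independently.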
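If you would rather not use regex, plain string functions will also work:
The <i> tags are the easy part:
$text = str_replace('<i>', '', $text);
$text = str_replace('</i>', '', $text);
(In PHP, the replace function is str_replace.)
The <a> tag is harder. You have to find where the opening tag starts, at <a, and where it ends, at >. Then remove the </a>.
Something like this:
$start = strpos($text, '<a');           // start of the opening tag
$end = strpos($text, '>', $start);      // end of the opening tag
$text = substr($text, 0, $start) . substr($text, $end + 1); // cut the opening tag out
$text = str_replace('</a>', '', $text); // drop the closing tag
(This is only a rough outline: it handles a single link and discards the href, so it still has to be extended to produce the "text (url)" form.)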
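The string-function idea above can be extended to keep the href and to handle several links. The following is a sketch under stated assumptions: href values are double-quoted, opening tags look like `<a ...>`, and the helper name replaceLinksByScanning is hypothetical.

```php
<?php
// Hypothetical helper; the name is an illustrative assumption.
function replaceLinksByScanning(string $text): string
{
    $offset = 0;
    // stripos() finds the next opening anchor tag, case-insensitively.
    while (($start = stripos($text, '<a ', $offset)) !== false) {
        $tagEnd = strpos($text, '>', $start);
        $close  = stripos($text, '</a>', $start);
        if ($tagEnd === false || $close === false || $close < $tagEnd) {
            break; // Malformed markup: leave the rest to strip_tags().
        }
        $openTag = substr($text, $start, $tagEnd - $start + 1);
        $label   = substr($text, $tagEnd + 1, $close - $tagEnd - 1);

        // Pull the href value out of the opening tag (double quotes assumed).
        $href = '';
        $hrefPos = stripos($openTag, 'href="');
        if ($hrefPos !== false) {
            $valueStart = $hrefPos + strlen('href="');
            $valueEnd   = strpos($openTag, '"', $valueStart);
            if ($valueEnd !== false) {
                $href = substr($openTag, $valueStart, $valueEnd - $valueStart);
            }
        }

        // Splice "label (href)" in place of the whole <a>...</a> element.
        $replacement = $href === '' ? $label : "$label ($href)";
        $text = substr($text, 0, $start) . $replacement
              . substr($text, $close + strlen('</a>'));
        $offset = $start + strlen($replacement);
    }
    // Remove whatever tags are left, such as <i>.
    return strip_tags($text);
}

echo replaceLinksByScanning(
    'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'
), "\n";
```

The amount of bookkeeping needed even for this simple case is a fair argument for the parser-based answers.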
It is also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
# replace each link with "text (href)"; there may be several
foreach ($html->find('a') as $a) {
    $a->outertext = "{$a->innertext} ({$a->href})";
}
echo strip_tags($html);
This produces the output you want for the test case.