PHP: strip_tags - remove only certain tags (and their contents)?

Question

I use the strip_tags() function, but I need to remove some tags (and all of their contents).

For example:

 <div> <p class="test"> Test A </p> <span> Test B </span> <div> Test C </div> </div>

Let's say I need to get rid of the P and SPAN tags and keep only:

 <div> <div> Test C </div> </div>

strip_tags expects the tags you want to keep as the second parameter.

In this particular example, I could use strip_tags($html, "<div>"); but the HTML I am cleaning and the tags that need to be removed are always different.
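
To make the allow-list behaviour concrete, here is a minimal sketch that runs strip_tags() on the sample above. The variable names are only illustrative. Note that strip_tags() drops the markup of tags that are not listed but leaves the text inside them in place.

```php
<?php
$html = '<div> <p class="test"> Test A </p> <span> Test B </span> <div> Test C </div> </div>';

// The second argument lists the tags to keep; every other tag is stripped.
$result = strip_tags($html, '<div>');

// The <p> and <span> markup is gone, but "Test A" and "Test B" are still present.
echo $result, "\n";
```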

I have been looking for a function that fits my needs, but could not find anything useful.

Any ideas?

+6

php web-scraping strip-tags

Dylan Jun 23 '12 at 0:56

2 answers

You say you use Simple HTML DOM (good, this is the right way to parse HTML). When I need to remove a tag and its contents, I do this:

 $rows = $html->find("span");
 foreach ($rows as $row) {
     $row->outertext = "";
 }
 $html->load($html->save());

The last line is required because the DOM gets confused after the changes are made, so the whole document has to be serialized to a string and parsed again for the changes to stick (IMO, a bug in Simple HTML DOM).

The Simple HTML DOM approach is safer and more robust than regex.
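
If Simple HTML DOM is not available, the same idea can be sketched with PHP's bundled DOM extension instead. This swaps in a different library from the one used above, and removeTagsWithContents() is a hypothetical helper name. It assumes the input is a UTF-8 HTML fragment that does not contain its own body element.

```php
<?php
// Sketch using PHP's bundled DOM extension (not Simple HTML DOM).
// removeTagsWithContents() is a hypothetical helper name.
function removeTagsWithContents(string $html, array $tags): string
{
    // Collect parser warnings internally instead of emitting them on sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: the input is a UTF-8 fragment, so wrap it in a minimal document.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tags as $tag) {
        // getElementsByTagName() returns a live list, so copy it before removing nodes.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node); // removes the element and its contents
            }
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<div> <p class="test"> Test A </p> <span> Test B </span> <div> Test C </div> </div>';
echo removeTagsWithContents($html, array('p', 'span')), "\n";
```

Because the elements are removed from a parsed tree, nested tags and attributes are handled by the parser rather than by pattern matching. The whitespace that sat between the removed elements stays in the output.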

+1

Cubiclesoft Jun 27 '12 at 15:25

nickb · Accepted Answer · 2012-06-23T01:04:49+0000

Use regex. Something like this should work:

 $tags = array('p', 'span');
 // \1 is a backreference to the tag name captured by the first group
 $text = preg_replace('#<(' . implode('|', $tags) . ')>.*?<\/\1>#s', '', $text);

A demo shows the listed tags, along with their contents, being replaced with nothing.

Note that you may need to tweak it more to compensate for spaces in tags or other unknowns that your example does not demonstrate.

This regular expression matches the tags with or without attributes:

 '#<(' . implode( '|', $tags) . ')(?:[^>]+)?>.*?<\/\1>#s'
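
As a rough sketch of how the pattern above could be wrapped up for reuse, the function below builds the expression from a list of tag names and applies it to the question's sample. stripTagsWithContents() is a hypothetical name. The \b after the tag name, the i flag, and the preg_quote() call are additions that are not part of the answer: \b keeps 'p' from also matching tags such as pre, and i makes the match case-insensitive.

```php
<?php
// Hypothetical helper: remove the given tags together with their contents.
function stripTagsWithContents(string $html, array $tags): string
{
    // Escape each tag name so it is treated literally inside the pattern.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags));

    // <(p|span) + optional attributes + lazy body + matching closing tag (\1).
    $pattern = '#<(' . $names . ')\b[^>]*>.*?</\1>#is';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure; fall back to the input unchanged.
    return $result === null ? $html : $result;
}

$html = '<div> <p class="test"> Test A </p> <span> Test B </span> <div> Test C </div> </div>';
echo stripTagsWithContents($html, array('p', 'span')), "\n";
```

The lazy .*? stops at the first matching closing tag, so a tag nested inside another tag of the same name (a div inside a div, say) is not handled correctly; that case needs a real parser.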
