You can use this:
    $pattern = <<<'LOD'
    ~
    # definitions :
    (?(DEFINE)
        (?<tagBL> pre | code | textarea | style | script )
        (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
        (?<tags> < [^>]* > )
        (?<cdata> <!\[CDATA .*? ]]> )
        (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
    )

    # pattern :
    \g<exclusionList> (*SKIP) (*FAIL) | \s+
    ~xsi
    LOD;

    $html = preg_replace($pattern, ' ', $html);
Please note that this is a general approach. You can easily adapt it to a specific case by adding things to or removing things from the exclusion list. If you need other kinds of replacements, you can also adapt it using capture groups and preg_replace_callback().
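As an illustration of the preg_replace_callback() idea, here is a minimal sketch. It assumes a reduced exclusion list (only pre blocks and tags), and the rule it applies (keep one newline when the whitespace run contained a newline, otherwise one space) is a hypothetical example of mine, not part of the pattern above.

```php
<?php
// Reduced exclusion list for the demo: <pre>...</pre> blocks and any tag.
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<pre>  <pre \b .*? </pre> )
    (?<tags> < [^>]* > )
)
(?: \g<pre> | \g<tags> ) (*SKIP) (*FAIL)
|
\s+
~xsi
LOD;

$html = "<p>a   b\n\n  c</p><pre>x   y</pre>";

// The callback receives each whitespace run found outside the exclusions.
$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Hypothetical rule: keep a line break if the run had one.
        return strpos($m[0], "\n") !== false ? "\n" : ' ';
    },
    $html
);

var_dump($result);
```

The content of the pre block is left untouched, while the runs inside the paragraph are collapsed by the callback.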
Another remark: an HTML tag stays open until it is closed. If the closing tag does not exist, all the content after the opening tag, up to the end of the string, belongs to that tag. To handle this, you can change </ \g{-1} > to (?: </ (?:\g{-1} | head | body | html) > | $ ) in the definition of tagContent, for example, or write more complex rules.
EDIT:
Some information you can find in the PHP manual:
The nowdoc syntax is an alternative syntax for defining strings.
It can be very useful for writing a more readable multiline string without changing its layout, and without having to wonder whether quotes need to be escaped.
The nowdoc syntax behaves like single quotes, i.e. variables are not interpreted, and neither are escape sequences such as \t or \n. If you want the same behavior as double quotes, use the heredoc syntax.
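A small sketch of the difference between nowdoc and heredoc (the variable names and the TXT label are only chosen for the demo):

```php
<?php
$name = 'world';

// nowdoc: quoted label, behaves like a single-quoted string
$nowdoc = <<<'TXT'
Hello $name\t!
TXT;

// heredoc: bare label, behaves like a double-quoted string
$heredoc = <<<TXT
Hello $name\t!
TXT;

var_dump($nowdoc);   // the text is kept literally, including $name and \t
var_dump($heredoc);  // $name is interpolated and \t becomes a tab character
```

This is why nowdoc is convenient for a regex: the backslashes and dollar signs of the pattern reach the regex engine unchanged.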
You can find more information about the pattern features at http://pcre.org/pcre.txt . The main points are explained below.
First: pattern delimiter
Most of the time, people write their patterns with the / delimiter: /Gnagnagna/ , /blablabla/ixUums , etc.
But when they write a pattern with thousands or millions of slashes, they prefer to escape each of them one by one rather than choose a different delimiter! With PHP, you can use any delimiter you like, as long as it is not an alphanumeric character, a backslash, or whitespace. I chose ~ instead of / for three reasons:
- If I choose ~ , I don't need to escape slashes, since there is no ambiguity between the delimiter and a literal character.
- In several months on this site, I have never seen anyone ask for a pattern with a tilde inside.
- I am sure that if someone does ask for a pattern with a tilde one day, it will mean I have had a close encounter of the third kind.
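To make the first reason concrete, here is the same pattern written with both delimiters. The URL and the variable names are made up for the example.

```php
<?php
$html = '<a href="http://example.com/a/b">link</a>';

// With / as the delimiter, every literal slash must be escaped.
$withSlashDelimiter = '/<a href="http:\/\/[^"\/]+\/([^"]*)">/';

// With ~ as the delimiter, slashes can be written as they are.
$withTildeDelimiter = '~<a href="http://[^"/]+/([^"]*)">~';

preg_match($withSlashDelimiter, $html, $m1);
preg_match($withTildeDelimiter, $html, $m2);

var_dump($m1[1], $m2[1]);  // both capture the path after the host
```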
Second: how to make a long pattern more readable?
PCRE (Perl Compatible Regular Expressions, the regex engine used by PHP) has ways to make a pattern more readable. They are exactly the same as in ordinary code:
- You can ignore spaces
- You can add comments.
- You can define subpatterns
For the first two, it is simple: you only need to add the x modifier (this is why you find an x at the end of the pattern). The x modifier enables verbose mode, in which whitespace in the pattern is ignored and you can add comments like this: # comment, running to the end of the line.
About subpatterns: you can use named groups. For example, instead of writing ~([0-9]+)~ to match a number and capture it in group 1, you can write ~(?<number>[0-9]+)~ . You can then refer to the captured content with \g{number} , or to the subpattern itself with \g<number> , anywhere in the pattern. Examples:
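Here is a quick sketch of verbose mode, using a date pattern that I made up for the demo. Note the last check: since whitespace is ignored in this mode, a literal space has to be escaped.

```php
<?php
// Compact form.
$compact = '~^(\d{4})-(\d{2})-(\d{2})$~';

// Same pattern in verbose mode, with layout and comments.
$verbose = <<<'LOD'
~
^
(\d{4}) -   # year
(\d{2}) -   # month
(\d{2})     # day
$
~x
LOD;

$compactResult = preg_match($compact, '2024-01-31');
$verboseResult = preg_match($verbose, '2024-01-31');

// In x mode an unescaped space is ignored, an escaped one is literal.
$ignoredSpace = preg_match('~a b~x', 'ab');
$literalSpace = preg_match('~a\ b~x', 'a b');

var_dump($compactResult, $verboseResult, $ignoredSpace, $literalSpace);
```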
~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~
will match 45ab67cd
~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~
will match 45ab45cd but not 45ab67cd
In these two examples, the named subpatterns are part of the main pattern and match something at the beginning of the string. But with the syntax (?(DEFINE)...) , you can define them outside the main pattern, because nothing you write between these parentheses is matched by itself.
~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~
does not match 45ab67cd, because everything inside the DEFINE part is ignored during matching, but:
~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~
does.
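If you want to check these claims yourself, the four patterns can be run as follows, using [a-z] for the letters. The variable names are only there for the demo.

```php
<?php
// \g<name> re-runs the subpattern; \g{name} re-matches the captured text.
$subroutine    = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~';
$backreference = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~';

// Groups inside (?(DEFINE)...) are only definitions; they match nothing by themselves.
$defineShort = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~';
$defineLong  = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~';

$results = [
    'subroutine 45ab67cd'    => preg_match($subroutine, '45ab67cd'),
    'backreference 45ab45cd' => preg_match($backreference, '45ab45cd'),
    'backreference 45ab67cd' => preg_match($backreference, '45ab67cd'),
    'defineShort 45ab67cd'   => preg_match($defineShort, '45ab67cd'),
    'defineShort 45ab'       => preg_match($defineShort, '45ab'),
    'defineLong 45ab67cd'    => preg_match($defineLong, '45ab67cd'),
];

var_dump($results);
```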
Third: relative backreferences
When you use a capture group in a pattern, you can use a reference to the captured content, for example:
    $str = 'cats meow because cats are bad.';
    $pattern = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';
    var_dump(preg_match($pattern, $str));
This code prints int(1), since the pattern matches the string. In the pattern, \1 refers to the content ( cats ) of the first capture group. Instead of writing \1 you can write \g{1} , which refers to the first capture group in the same way.
Now, if you want to refer to the content of the most recent group but don't want to deal with its number (or name), you can use a relative reference by writing \g{-1} (i.e. the first group to the left of the reference).
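A sketch of why the relative form is handy: it keeps working when a group is added in front, while a numbered reference silently points to another group. The patterns and strings are invented for the demo.

```php
<?php
// \g{-1} refers to the nearest group opened to its left: (b|i).
$relative = '~<(b|i)>.*?</\g{-1}>~';
$sameTag  = preg_match($relative, '<b>bold</b>');
$otherTag = preg_match($relative, '<b>bold</i>');

// Adding a group in front: \g{-1} still means (b|i) ...
$relativeShifted = preg_match('~(\w+): <(b|i)>.*?</\g{-1}>~', 'note: <i>x</i>');

// ... but \1 now means (\w+), so the closing tag would have to be </note>.
$numberedShifted = preg_match('~(\w+): <(b|i)>.*?</\1>~', 'note: <i>x</i>');

var_dump($sameTag, $otherTag, $relativeShifted, $numberedShifted);
```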
Fourth: xsi modifiers
The general behavior of a pattern can be changed with modifiers. Here I used three modifiers:
    x  # verbose mode
    i  # makes the pattern case insensitive (i.e. '~CaT~i' will match "cat")
    s  # singleline mode: by default the . doesn't match a newline; with the s modifier it does
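The effect of the i and s modifiers can be checked with a few one-liners (the test strings are only examples):

```php
<?php
$caseSensitive   = preg_match('~CaT~',  'cat');   // case must match exactly
$caseInsensitive = preg_match('~CaT~i', 'cat');   // i: case is ignored

$dotDefault    = preg_match('~a.b~',  "a\nb");    // . stops at a newline
$dotSingleline = preg_match('~a.b~s', "a\nb");    // s: . also matches a newline

var_dump($caseSensitive, $caseInsensitive, $dotDefault, $dotSingleline);
```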
Last: backtracking control verbs
Backtracking control verbs are an experimental feature taken from the Perl regex engine (the feature is marked as experimental in Perl too, but if nobody uses it, that will never change).
What is backtracking?
If I try to match "aaaaab" with ~a+ab~ , the regex engine, since + is a greedy quantifier, first consumes all the a's (five of them). After that only b remains, which does not match the subpattern ab . The only way out for the regex engine is to give back one a , after which ab can be matched. This is the default behavior of a regex engine.
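You can observe this with a capture group around a+ . As a contrast, the sketch also uses a possessive quantifier ( a++ ), which is not used elsewhere in this answer: it forbids giving characters back, so the match fails.

```php
<?php
// Greedy a+ first takes five "a", then gives one back so that "ab" can match.
preg_match('~(a+)ab~', 'aaaaab', $m);
var_dump($m[0]);  // the whole match
var_dump($m[1]);  // what a+ kept after backtracking: four "a"

// Possessive a++ never gives anything back, so "ab" can no longer match.
$possessive = preg_match('~a++ab~', 'aaaaab');
var_dump($possessive);
```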
More on backtracking here and here .
Backtracking control verbs are tools that force the regex engine to behave the way you want for a subpattern.
Here I used two verbs: (*SKIP) and (*FAIL)
(*FAIL) is the easiest. The subpattern is forced to fail immediately.
(*SKIP) : when the subpattern fails after this verb, the regex engine is not allowed to backtrack into the characters matched before the verb, and the next attempt starts at the position reached by the verb. So this content cannot be reused by another alternative of the pattern.
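To see what (*SKIP) adds to (*FAIL), here is a reduced sketch of the same trick. The task (replace numbers, except inside double quotes) is made up for the demo.

```php
<?php
$str = 'a1 "b2" c3';

// Quoted parts are matched, then skipped and rejected; only the other digits remain.
$withSkip = preg_replace('~"[^"]*"(*SKIP)(*FAIL)|\d+~', '#', $str);

// Without (*SKIP), the engine retries inside the quoted part after the failure,
// so the 2 between the quotes is found by the second alternative.
$withoutSkip = preg_replace('~"[^"]*"(*FAIL)|\d+~', '#', $str);

var_dump($withSkip, $withoutSkip);
```

In the main pattern, the exclusion list plays the role of the quoted part and \s+ plays the role of \d+ .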
I understand that all this is not always easy, but I hope that, step by step, one day, all these things will be clear to you.