Regexp: match everything except each <(pre|code|textarea)>(.*?)</\1> in an HTML document

It's a challenge!

As the title says, I would like to match everything except the contents of the <pre>, <code> and <textarea> tags in an HTML document (you can try it on the sample text below).

My goal is to compress HTML by removing \n, \t, \r and doing other cleanup, except where that whitespace is strictly necessary, as in a textarea.

As I work in PHP, I also thought about extracting the contents of those tags, processing the rest in PHP, and re-inserting them afterwards. But I am very curious how to do this with a regexp!
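For comparison, the PHP-side approach mentioned above could look roughly like this. It is only a sketch: the function name minify_html_keeping_blocks and the NUL-delimited placeholders are assumptions made for the example, and it assumes the protected tags are properly closed and not nested inside a tag of the same name.

```php
<?php
// Hypothetical helper: stash protected blocks, collapse whitespace, restore.
function minify_html_keeping_blocks(string $html): string
{
    $stash = [];

    // 1. Replace each <pre>/<code>/<textarea> block with a placeholder.
    $html = preg_replace_callback(
        '~<(pre|code|textarea)\b.*?</\1>~si',
        function (array $m) use (&$stash): string {
            $key = "\x00" . count($stash) . "\x00"; // NUL is not matched by \s
            $stash[$key] = $m[0];
            return $key;
        },
        $html
    );

    // 2. Collapse every whitespace run in what is left.
    $html = preg_replace('~\s+~', ' ', $html);

    // 3. Put the untouched blocks back.
    return strtr($html, $stash);
}

$result = minify_html_keeping_blocks("<p>a \n b</p>\n<pre> x \n y </pre>");
var_dump($result);
```

The whitespace inside the <pre> block survives because it is not part of the string while step 2 runs.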

I tried the expression ((?=.?)((?!<pre>).)) with the msg flags in a great online editor, http://regex101.com/, but this is not quite what I want.

Any help would be greatly appreciated!

  Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna <span> aliquam </span> erat volutpat.  Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.

 <pre> Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolisore.  Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.
 Typi non habent claritatem insitam;  est usus legentis in iis qui facit eorum claritatem. </pre>

 Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius.
 Claritas est etiam processus dynamicus, qui sequitur mutationem consuetudium lectorum.
 <pre> Mirum est notare quam littera gothica, quam nunc putamus parum claram, anteposuerit litterarum formas humanitatis per seacula quarta decima et quinta decima. </pre>
 Eodem modo typi, qui nunc nobis videntur parum clari, fiant sollemnes in futurum. 
3 answers

You can use this:

 $pattern = <<<'LOD'
 ~
 # definitions :
 (?(DEFINE)
     (?<tagBL> pre | code | textarea | style | script )
     (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
     (?<tags> < [^>]* > )
     (?<cdata> <!\[CDATA .*? ]]> )
     (?<exclusionList> \g<tagContent> | \g<cdata> | \g<tags>)
 )

 # pattern :
 \g<exclusionList> (*SKIP) (*FAIL) | \s+
 ~xsi
 LOD;

 $html = preg_replace($pattern, ' ', $html);

Please note that this is a general approach; you can easily adapt it to a specific case by adding or removing items in the exclusion list. If you need other kinds of replacement, you can also adapt it using capture groups and preg_replace_callback() .
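As an illustration of the preg_replace_callback() variant, here is a minimal sketch. It assumes a cut-down exclusion list (only pre, code, textarea and ordinary tags; cdata, style and script are left out for brevity), and the sample HTML is made up for the example.

```php
<?php
// Reduced version of the pattern: protected blocks and tags are skipped,
// only whitespace runs outside of them are handed to the callback.
$pattern = '~
    (?(DEFINE)
        (?<tagBL> pre | code | textarea )
        (?<tagContent> < (\g<tagBL>) \b .*? </ \g{-1} > )
        (?<tags> < [^>]* > )
    )
    (?: \g<tagContent> | \g<tags> ) (*SKIP) (*FAIL)
    |
    \s+
~xsi';

$html = "<p>a \n\t b</p>\n<pre> keep \n this </pre>\n<p>c</p>";

// The callback receives each whitespace run in $m[0]. Here it always
// returns one space, but it could decide differently per match.
$min = preg_replace_callback($pattern, function (array $m): string {
    return ' ';
}, $html);

var_dump($min);
```

The whitespace inside <pre> is never seen by the callback, since the whole block is consumed and discarded by the first alternative.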

Another note: an HTML tag stays open until its closing tag. If the closing tag does not exist, all the content after the opening tag belongs to that tag, up to the end of the string. To handle this, you can change </ \g{-1} > to (?: </ (?:\g{-1}| head | body | html) > | $) in the definition of the tag content, for example, or write more complex rules.

EDIT:

Some information you can find in the PHP manual:

The nowdoc syntax is an alternative syntax for defining strings.
It can be very useful for writing a more readable multiline string without changing its layout and without worrying about whether quotes need escaping.
The nowdoc syntax behaves like single quotes, i.e. variables are not interpolated and neither are escape sequences such as \t or \n . If you want the same behavior as double quotes, use the heredoc syntax.
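A small sketch of the difference, with a made-up variable $name and label TXT chosen only for the example:

```php
<?php
$name = 'world';

// Nowdoc (quoted label): behaves like a single-quoted string,
// so $name and \t are kept literally.
$nowdoc = <<<'TXT'
Hello $name\t!
TXT;

// Heredoc (bare label): behaves like a double-quoted string,
// so $name is interpolated and \t becomes a tab.
$heredoc = <<<TXT
Hello $name\t!
TXT;

var_dump($nowdoc === 'Hello $name\t!');   // bool(true)
var_dump($heredoc === "Hello world\t!");  // bool(true)
```

This is why nowdoc is convenient for regex patterns: backslashes and dollar signs reach the regex engine unchanged.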

You can find some information at http://pcre.org/pcre.txt :

First: pattern delimiter

In most cases, people write their patterns using the / delimiter: /Gnagnagna/ , /blablabla/ixUums , etc.
But when they write a pattern with thousands or millions of slashes, they prefer to escape each of those slashes one by one rather than select a different delimiter! With PHP, you can choose whatever delimiter you want as long as it is not an alphanumeric character. I chose ~ instead of / for three reasons:

  • If I choose ~ , I don't need to escape slashes, because there is no ambiguity between the delimiter and a literal character.
  • In all my months on this site, I have never seen anyone ask for a pattern with a tilde inside.
  • I am sure that if someone does ask for a pattern with a tilde one day, it will mean I have had a close encounter of the third kind.
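To make the first point concrete, here is a short sketch with an invented URL; both patterns describe the same thing, only the delimiter differs.

```php
<?php
$url = 'http://example.com/a/b';

// With / as the delimiter, every literal slash has to be escaped.
$withSlash = preg_match('/^http:\/\/[^\/]+\/a\/b$/', $url);

// With ~ as the delimiter, slashes are written as they are.
$withTilde = preg_match('~^http://[^/]+/a/b$~', $url);

var_dump($withSlash, $withTilde); // int(1) int(1)
```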

Second: how to make a long template more readable?

PCRE (Perl Compatible Regular Expressions, the regex engine used by PHP) has ways to make the code more readable. They are exactly the ones you use in ordinary code:

  • You can ignore spaces.
  • You can add comments.
  • You can define subpatterns.

For the first two, this is simple: you only need to add the x modifier (this is why you find x at the end). The x modifier enables verbose mode, in which spaces are ignored and you can add comments like # comment at the end of a line.
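A minimal sketch of verbose mode, using a date pattern invented for the example. Both variables hold the same pattern; the second is only laid out differently.

```php
<?php
// Compact form.
$compact = '~^\d{4}-\d{2}-\d{2}$~';

// Same pattern with the x modifier: whitespace is ignored
// and everything after # up to the end of the line is a comment.
$verbose = '~
    ^
    \d{4}   # year
    -
    \d{2}   # month
    -
    \d{2}   # day
    $
~x';

var_dump(preg_match($compact, '2014-03-07')); // int(1)
var_dump(preg_match($verbose, '2014-03-07')); // int(1)
```

Keep in mind that in verbose mode a literal space in the pattern has to be escaped or written as \s or [ ], otherwise it is ignored like any other layout whitespace.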

About subpatterns: you can use named groups. For example, instead of writing ~([0-9]+)~ to match a number and capture it in group 1, you can write ~(?<number>[0-9]+)~ . With this named subpattern you can then refer to the captured content using \g{number} , or reuse the subpattern itself with \g<number> , anywhere in the pattern. Examples:

 ~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~ 

will match 45ab67cd

 ~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~ 

will match 45ab45cd but not 45ab67cd

In these two examples, the named subpatterns are part of the main pattern and match the beginning of the string. But with the (?(DEFINE)...) syntax you can define them outside the main pattern, because nothing written between these parentheses is matched.

 ~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~ 

does not match 45ab67cd because everything inside the DEFINE part is ignored for matching, but:

 ~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~ 

does.
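The examples above can be checked directly in PHP. The variable names below are made up for this sketch; the patterns are the ones discussed, written with the [a-z] range.

```php
<?php
// \g<num> re-runs the subpattern, so any digits are accepted again.
$subroutine = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g<num>\g<letter>$~';

// \g{num} is a backreference: the same digits must appear again.
$backref = '~^(?<num>[0-9]+)(?<letter>[a-z]+)\g{num}\g<letter>$~';

// Subpatterns declared in DEFINE match nothing by themselves.
$defineShort = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>$~';
$defineLong  = '~(?(DEFINE)(?<num>[0-9]+)(?<letter>[a-z]+))^\g<num>\g<letter>\g<num>\g<letter>$~';

var_dump(preg_match($subroutine, '45ab67cd'));  // int(1)
var_dump(preg_match($backref, '45ab45cd'));     // int(1)
var_dump(preg_match($backref, '45ab67cd'));     // int(0)
var_dump(preg_match($defineShort, '45ab67cd')); // int(0)
var_dump(preg_match($defineLong, '45ab67cd'));  // int(1)
```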

Third: relative backreferences

When you use a capture group in a pattern, you can use a reference to the captured content, for example:

 $str = 'cats meow because cats are bad.';
 $pattern = '~^(\w+) \w+ \w+ \1 \w+ \w+\.$~';
 var_dump(preg_match($pattern, $str));

This code prints int(1), since the pattern matches the string. In the pattern, \1 refers to the content ( cats ) of the first capture group. Instead of \1 you can write \g{1} , which refers to the first capture group in the same way.

Now, if you want to refer to the content of the most recent group without needing its number (or name), you can use a relative reference by writing \g{-1} (i.e. the first group to the left).
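A short sketch of a relative reference in isolation. The pattern and test strings are invented for the example; the idea is the same as in the tagContent definition, where the closing tag must repeat whatever the preceding group captured.

```php
<?php
// \g{-1} refers to the nearest capture group opened to its left,
// here (pre|code), whatever its absolute number is.
$pattern = '~<(pre|code)>.*?</\g{-1}>~s';

$found = preg_match($pattern, 'x <code>a</code> y', $m);
var_dump($found, $m[0]); // int(1) and "<code>a</code>"

// The closing tag has to be the same as the opening one.
var_dump(preg_match($pattern, '<pre>a</code>')); // int(0)
```

The advantage is that the pattern keeps working if more groups are added in front of it, since no absolute group number is hard-coded.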

Fourth: xsi modifiers

The general behavior of the pattern can be changed with modifiers. Here I used three modifiers:

 x # for verbose mode
 i # makes the pattern case insensitive (i.e. '~CaT~i' will match "cat")
 s # (singleline mode): by default the . doesn't match newline; with the s modifier it does.
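The effect of i and s can be seen on a small made-up string. This is only a sketch to show each modifier separately:

```php
<?php
$text = "<PRE>line 1\nline 2</PRE>";

// No modifier: "pre" does not match "PRE".
$plain = preg_match('~<pre>.*</pre>~', $text);

// i only: the tag names match, but the dot stops at the newline.
$iOnly = preg_match('~<pre>.*</pre>~i', $text);

// s and i: the dot also matches the newline, so the whole block matches.
$both = preg_match('~<pre>.*</pre>~si', $text);

var_dump($plain, $iOnly, $both); // int(0) int(0) int(1)
```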

Last: backtracking control verbs

Backtracking control verbs are an experimental feature taken from the Perl regex engine (they are marked experimental in Perl too, but that will not change if nobody uses them).

What is backtracking?

If I try to match "aaaaab" with ~a+ab~ , the regex engine, since + is a greedy quantifier, first takes all the a's (five of them). Only b then remains, which does not match the subpattern ab . The only way out for the regex engine is to give back one a , after which ab can be matched. This is the default behavior of the regex engine.
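One way to observe this from PHP is to compare the greedy quantifier with its possessive form a++ , which is not used in the answer above and is brought in here only because it forbids giving characters back. A minimal sketch:

```php
<?php
// Greedy: a+ takes five a's, then gives one back so that "ab" can match.
$greedy = preg_match('~a+ab~', 'aaaaab');

// Possessive: a++ never gives anything back, so "ab" can never match.
$possessive = preg_match('~a++ab~', 'aaaaab');

var_dump($greedy, $possessive); // int(1) int(0)
```

The only difference between the two patterns is whether the engine is allowed to backtrack into what a+ has already consumed.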

More on backtracking here and here .

Backtracking control verbs are tools that make the regex engine backtrack the way you want for a given subpattern.

Here I used two verbs: (*SKIP) and (*FAIL)

(*FAIL) is the easiest: the subpattern is forced to fail immediately.

(*SKIP) : when the subpattern fails after this verb, the regex engine is not allowed to backtrack into the characters matched before the verb, and that content cannot be reused by another alternative of the pattern.
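Here is the same skip-then-fail idea on a much smaller problem, as a sketch: replace numbers, except those inside double quotes. The sample string and the # replacement are made up for the example.

```php
<?php
$text = 'id 12 "room 34" and 56';

// A quoted string is matched first, then thrown away by (*SKIP)(*FAIL):
// the engine restarts after the closing quote, so 34 is never examined.
$withSkip = preg_replace('~"[^"]*"(*SKIP)(*FAIL)|\d+~', '#', $text);

// Without (*SKIP) the engine retries from the next character,
// walks into the quoted string and replaces 34 as well.
$withoutSkip = preg_replace('~"[^"]*"(*FAIL)|\d+~', '#', $text);

var_dump($withSkip);    // string 'id # "room 34" and #'
var_dump($withoutSkip); // string 'id # "room #" and #'
```

In the HTML pattern the quoted string is replaced by the exclusion list and \d+ by \s+ , but the mechanism is identical.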


I understand that all this is not always easy, but I hope that, step by step, one day, all these things will be clear to you.


If you want to parse HTML, I would suggest using PHP's DOMXPath or similar, as it is designed and specialized for this task. You can find Chrome extensions for testing your queries.
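A rough sketch of what this could look like for the whitespace problem in the question, assuming the dom extension is available. The sample HTML and the XPath query are illustrative only, and the libxml HTML parser may normalize real-world markup in ways that need checking.

```php
<?php
$html = "<div><p>a \n\t b</p><pre> keep \n this </pre></div>";

$doc = new DOMDocument();
// These flags ask libxml not to add <html>/<body> wrappers or a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Every text node that is not inside pre, code or textarea.
$query = '//text()[not(ancestor::pre or ancestor::code or ancestor::textarea)]';

foreach ($xpath->query($query) as $node) {
    // Collapse whitespace runs in that text node only.
    $node->nodeValue = preg_replace('~\s+~', ' ', $node->nodeValue);
}

$out = $doc->saveHTML();
echo $out;
```

Here the decision about what to protect is made on the parsed tree, so unclosed or nested tags are handled by the parser rather than by the pattern.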

Also read this answer, it is hilarious: "You can't parse [X]HTML with regex", which explains why HTML cannot be parsed using regex. It has been upvoted over 4400 times.

edit: That said, you may need to parse only fragments or invalid HTML, in which case I would go with the “simple” regex approach, as Steve R suggested.


Assuming you want to capture what is between the tags:

 regex = "<((?!(?:pre|code|textarea)\b)\w+)>([^<]+)</\1>" 

(?!...) is a negative lookahead
\w+ matches the tag name once the lookahead has ruled out pre, code and textarea
([^<]+) captures 1 or more characters that are not <
\1 refers to the first capture group (the tag name)

This relies on the assumption that < is not a valid character between the tags, which implies that the tags are not nested. If these restrictions do not hold, you will not be able to parse the HTML with a regular expression; see the obligatory post, which everyone links to for good reason.


Source: https://habr.com/ru/post/959463/