Clear spaces from HTML with RegEx

Is it possible to strip insignificant whitespace from HTML with a regular expression?

For instance:

<p><b>foo</b> <i>bar</i></p> <p>foo</p> <p>bar</p> 

Here the space between the closing </b> and the opening <i> is significant (although it could be written as &nbsp;), whereas the spaces between the <p> elements are the ones I want to remove, since they should not carry any semantic meaning.

Perhaps this would be better solved with DOM traversal?

+4
2 answers

Something like HTML Tidy is likely to do this better than re-creating all of the potentially complex rules yourself (for example, that the first space in your example is significant but the others are not).

Otherwise, I agree: DOM traversal will be much better than regular expressions, especially if your HTML is already valid XHTML and can easily be traversed as XML.
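As a rough illustration of the DOM-traversal idea, here is a minimal Python sketch using the standard library's xml.dom.minidom. It assumes the input is well-formed XHTML with a single root element, and the BLOCK_TAGS set and the function names are made up for this example (a real list of block-level elements would be longer). It drops whitespace-only text nodes that sit next to a block-level element and leaves whitespace between inline elements alone.

```python
from xml.dom import minidom

# Assumption: a deliberately short, illustrative list of block-level elements.
BLOCK_TAGS = {"body", "div", "p", "ul", "ol", "li", "table", "tr", "td"}

# Only ASCII whitespace counts as removable; a non-breaking space is kept.
HTML_WHITESPACE = " \t\r\n\f"


def is_block(node):
    """True if node is an element whose tag is in BLOCK_TAGS."""
    return (
        node is not None
        and node.nodeType == node.ELEMENT_NODE
        and node.tagName in BLOCK_TAGS
    )


def strip_block_whitespace(node):
    """Recursively remove whitespace-only text nodes adjacent to block elements."""
    for child in list(node.childNodes):  # copy, since we mutate while iterating
        if child.nodeType == child.TEXT_NODE:
            only_whitespace = child.data.strip(HTML_WHITESPACE) == ""
            next_to_block = is_block(child.previousSibling) or is_block(child.nextSibling)
            if only_whitespace and next_to_block:
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_block_whitespace(child)


def clean(xhtml):
    """Parse well-formed XHTML, strip the whitespace, and serialize it again."""
    doc = minidom.parseString(xhtml)
    strip_block_whitespace(doc.documentElement)
    return doc.documentElement.toxml()


if __name__ == "__main__":
    print(clean("<div><p><b>foo</b> <i>bar</i></p> <p>foo</p> <p>bar</p></div>"))
```

The space between </b> and <i> survives because neither neighbour is in BLOCK_TAGS, which is exactly the kind of rule that is hard to express in a single regular expression.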

+5

First, the obligatory quote ;) "Asking regular expressions to parse arbitrary HTML is like asking Paris Hilton to write an operating system."

Now, back to business. You can try regular expressions aimed at specific tags (although I doubt this is a sound approach):

 sed -e 's/<p>\ </<p></g' 

This removes a single space between an opening <p> and the tag that follows it, i.e. <p>(space)<(any_tag).
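Note that the sed pattern only matches a space right after an opening <p>, so it leaves the spaces between </p> and the next <p> from the question untouched. A variant aimed at that case might look like the following Python sketch. The pattern and the function name are illustrative assumptions, and it handles only this one narrow case (whitespace between a closing </p> and an opening <p>), not HTML in general.

```python
import re

# Whitespace between a closing </p> and an opening <p ...>.
# The lookahead requires ">" or whitespace after "<p" so that tags such as
# <pre> are not mistaken for paragraphs.
BETWEEN_PARAGRAPHS = re.compile(r"(</p>)[ \t\r\n\f]+(?=<p[\s>])")


def strip_between_paragraphs(html):
    """Remove whitespace runs that sit between </p> and the next <p>."""
    return BETWEEN_PARAGRAPHS.sub(r"\1", html)


if __name__ == "__main__":
    sample = "<p><b>foo</b> <i>bar</i></p> <p>foo</p> <p>bar</p>"
    print(strip_between_paragraphs(sample))
```

Every additional pair of tags needs its own rule with this approach, which is why the DOM route scales better.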

Otherwise, I also agree that traversing the DOM is the way to go.

0

Source: https://habr.com/ru/post/1387752/

