Match unclosed HTML tags using regular expressions and PHP

I use PHP and regex to search for unclosed HTML tags in a string:

This is my string:

$s="<div><h2>Hello world<h2><p>It 7Am where I live<p><div>"; 

You can see that none of the tags are closed here.

I want to find all unclosed tags, but the problem is that my regex also matches opening tags.

Here is my regex for now

 /<[^>]+>/i 

And this is my preg_match_all() call

 preg_match_all("/<[^>]+>/i", $s, $v);
 print_r($v);

What do I need to change in my regex to match only unclosed tags?

  <h2> <p> <div> 
2 answers

You may not be aware of this, but DOMDocument can help you fix the HTML.

    $html = "<div><h2>Hello world<h2><p>It 7Am where I live<p><div>";
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<root>' . $html . '</root>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[not(node())]') as $node) {
        $node->parentNode->removeChild($node);
    }
    echo substr($dom->saveHTML(), 6, -8);

See the IDEONE demo

Result: <div><h2>Hello world</h2><p>It 7Am where I live</p></div>

Note that the XPath-based cleanup is necessary because the DOM contains empty <h2></h2>, <p></p> and <div></div> elements after the HTML is loaded.

The <root> element is wrapped around the input so that the fragment has a single root element. Afterwards it is stripped from the output with substr.

The LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are required so that no DTD and no implied <html> or <body> elements are added to the DOM.
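
As a variation on the snippet above, here is a minimal sketch that wraps the same steps in a function and serializes the children of the wrapper element instead of cutting the string with fixed substr offsets. The function name repairHtmlFragment is hypothetical and not part of the original answer; the calls are standard DOM extension methods. One caveat: the //*[not(node())] query also matches void elements such as <br> or <img>, so they would be removed as well.

```php
<?php
// Hypothetical helper (not from the original answer): lets DOMDocument
// repair a fragment and returns the markup inside the <root> wrapper.
function repairHtmlFragment(string $html): string
{
    // Silence parser warnings caused by the broken markup and the unknown <root> tag.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML(
        '<root>' . $html . '</root>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop the empty elements the parser created while closing tags.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[not(node())]') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of <root>, so no offsets are needed.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo repairHtmlFragment("<div><h2>Hello world<h2><p>It 7Am where I live<p><div>");
```

The exact repaired markup depends on the installed libxml2 version, so treat the output as something to verify locally.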


Finding unmatched tags seems too complicated for a regular expression. You basically need to push each opening tag you see onto a stack and then pop it off the stack when you see the matching closing tag.

I recommend using a library that performs HTML validation. See the following questions:
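
The bookkeeping described above can be sketched in PHP with an array used as a stack. This is only an illustration under simplifying assumptions: the function name findUnclosedTags and the list of void elements are my own, a regex is used only to tokenize the tags, and comments, scripts, and attribute values containing ">" are not handled. Tags are reported in the order they are detected as unclosed, which is not necessarily document order.

```php
<?php
// Hypothetical helper: returns the names of opening tags that never
// receive a matching closing tag.
function findUnclosedTags(string $html): array
{
    // Void elements never take a closing tag, so they are not tracked.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];     // currently open tags, innermost last
    $unclosed = [];  // tags known to be left open

    // Group 1: optional slash, group 2: tag name, group 3: the rest up to ">".
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b([^>]*)>~i', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);

        if ($tag[1] === '') {
            // Opening tag: push it unless it is void or self-closing.
            $selfClosing = substr($tag[3], -1) === '/';
            if (!$selfClosing && !in_array($name, $void, true)) {
                $stack[] = $name;
            }
            continue;
        }

        // Closing tag: look for the nearest open tag with the same name.
        $i = count($stack) - 1;
        while ($i >= 0 && $stack[$i] !== $name) {
            $i--;
        }
        if ($i < 0) {
            continue; // stray closing tag, nothing to pop
        }

        // Everything opened after the matched tag was never closed.
        $popped = array_splice($stack, $i);
        array_shift($popped); // discard the matched tag itself
        $unclosed = array_merge($unclosed, $popped);
    }

    // Whatever is still on the stack at the end is unclosed too.
    return array_merge($unclosed, $stack);
}

$s = "<div><h2>Hello world<h2><p>It 7Am where I live<p><div>";
// Every tag in $s is an opening tag, so all six are reported.
print_r(findUnclosedTags($s));
```

Unlike a single regex, this keeps track of nesting, so in a string such as "<div><p>one<p>two</p></div>" only the first <p> is reported.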

Remove unmatched HTML tags in a string

How to find the unclosed div tag

PHP get all closed HTML tags in a string


Source: https://habr.com/ru/post/1236740/

