RegEx to retrieve all HTML tag attributes, including embedded JavaScript

I found this useful regex for matching the attributes of HTML tags:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

It works great, but it is missing one key thing that I need. Some attributes are event handlers that have embedded JavaScript code in them, as follows:

onclick="doSomething(this, 'foo', 'bar');return false;"

Or:

onclick='doSomething(this, "foo", "bar");return false;'

I can't figure out how to change the original expression so that it ignores the quotes inside the JS (single or double) when they are nested inside the set of quotes that delimits the attribute value.

I MUST add that this is not being used to parse an entire HTML document. It is used on an argument of an old array-to-select function, which I have updated. One of that function's arguments is a string that can append additional HTML attributes to the form element.

I wrote an improved function and deprecated the old one ... but in case there is still a call to the old function somewhere in the code, I need to parse that argument into the new array format. Example:

// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");

The new version accepts an array of attr => value pairs to create the additional attributes.

create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));

So, for an OLD call, I need to parse $append_att into that array, using a regex to find the HTML attributes.


A simpler pattern handles each quoting style as its own alternative:

(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
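As a minimal sketch of how this pattern could feed the new array format, the snippet below wraps it in a helper. The name parse_attribute_string is hypothetical (it does not appear in the original code), and stripping the outer quotes is an assumption about what the new function expects.

```php
<?php
// Hypothetical helper (not from the original post): converts an old-style
// attribute string into an attr => value array.
function parse_attribute_string(string $append_att): array
{
    // Double-quoted value, single-quoted value, or bare word.
    $pattern = '/(\w+)=("[^<>"]*"|\'[^<>\']*\'|\w+)/';
    $attributes = [];
    if (preg_match_all($pattern, $append_att, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $value = $match[2];
            $first = $value[0];
            // Strip only the delimiting quotes; nested quotes stay intact.
            if (($first === '"' || $first === "'") && substr($value, -1) === $first) {
                $value = substr($value, 1, -1);
            }
            $attributes[$match[1]] = $value;
        }
    }
    return $attributes;
}

$old = "onchange=\"something(this, '444');\" style='width:250px;'";
print_r(parse_attribute_string($old));
```

Note that this pattern rejects values containing < or >, so JavaScript such as a comparison inside the handler would not match, and \w+ does not cover attribute names with hyphens.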

Here is a regex for HTML tags, built from the syntax rules at:

http://www.w3.org/TR/html-markup/syntax.html

// valid tag names
$tagname = '[0-9a-zA-Z]+';
// valid attribute names
$attr = "[^\s\\x00\"'>/=\pC]+";
// valid unquoted attribute values
$uqval = "[^\s\"'=><`]*";
// valid single-quoted attribute values
$sqval = "[^'\\x00\pC]*";
// valid double-quoted attribute values
$dqval = "[^\"\\x00\pC]*";
// valid attribute-value pairs
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)"; 

// start tags + all attr formats
$patt[] = "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>";

// end tags
$patt[] = "</(?'endtags'$tagname)\s*>";

// full regex pcre pattern
$patt = implode("|", $patt);
// search and match
preg_match_all("#$patt#imuUs",$data,$matches);



It would be even better to use backreferences; in PHP, the regular expression would be as follows:

([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["'])(.*?)\\2

Where \\2 is a backreference to the group (["'])

This regular expression also matches attribute names that contain _, - and :, which are allowed according to the W3C, but it does not match attributes whose values are not in quotation marks.
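A short sketch of the backreference pattern in use, assuming the goal is an attr => value array; the sample input string and variable names are made up for illustration.

```php
<?php
// In a single-quoted PHP string, \\2 becomes the regex backreference \2,
// which must match the same quote character captured by group 2.
$pattern = '/([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["\'])(.*?)\\2/';

// Illustrative input: one double-quoted, one single-quoted, one unquoted value.
$input = 'data-id="42" onclick=\'doSomething(this, "foo", "bar");return false;\' size=10';

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    // $m[1] is the attribute name, $m[3] the value without its outer quotes.
    $attributes[$m[1]] = $m[3];
}

print_r($attributes);
```

Here size=10 is skipped because its value is unquoted. As written, the pattern also requires attribute names of at least two characters, since the first character class is followed by a + quantifier.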


Source: https://habr.com/ru/post/1735895/

