Regular expression for spaces between HTML attributes

How can I detect a missing space between attributes? Example:

 1. <div style="margin:37px;"/></div>
 2. <span title=''style="margin:37px;" /></span>
 3. <span title="" style="margin:37px;" /></span>
 4. <a title="u" hghghgh title="j" >
 5. <a title=""gg ff>

Lines 1, 3 and 4 are correct; lines 2 and 5 are incorrect. How can I identify the incorrect ones?

I tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it does not work.

+5
4 answers

You should not use regex to parse HTML, except as a learning exercise.
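For comparison, the specific check asked about (a quoted value immediately followed by another character) can be done without a regex by scanning the tag character by character. The sketch below is not an HTML parser and is not part of the original answer; the function name hasMissingSpace is made up for illustration, and it assumes each input starts with a single opening tag whose quotes are properly closed.

```javascript
// Hypothetical helper, not from the original answer: reports whether a quoted
// attribute value is directly followed by something other than whitespace, "/" or ">".
function hasMissingSpace(tag) {
  var allowedAfterValue = ' \t\r\n/>';
  var quote = null; // quote character of the value being scanned, or null outside a value
  for (var i = 0; i < tag.length; i++) {
    var ch = tag.charAt(i);
    if (quote !== null) {
      if (ch === quote) {
        quote = null; // the value ends here
        var next = tag.charAt(i + 1); // '' at the end of the string
        if (next !== '' && allowedAfterValue.indexOf(next) === -1) {
          return true;
        }
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch; // a value starts here
    }
  }
  return false;
}

// The five examples from the question.
var scanSamples = [
  '<div style="margin:37px;"/></div>',
  '<span title=\'\'style="margin:37px;" /></span>',
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>'
];
var scanResults = scanSamples.map(hasMissingSpace);
console.log(scanResults); // [ false, true, false, false, true ]
```

Only examples 2 and 5 are flagged. For real documents an actual HTML parser is still the safer choice.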


http://regexr.com/3cge1

 <\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?> 

This regex also matches tags that have no attributes. It works for self-closing tags and for attributes without a value.


  • <\w+ matches the opening < followed by one or more \w characters (the tag name).

  • (\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must begin with whitespace. It consists of two parts:

    • \s+[\w-]+ the attribute name after the required whitespace
    • (=(['"]?)[^"']*\3)? the optional attribute value; \3 requires the closing quote to be the same as the opening one
  • \s*/?> optional whitespace and an optional / , followed by the closing > .


Here is the test for the strings:

 var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

 // true means no valid tag was found, i.e. the string is incorrect
 !'<div style="margin:37px;"/></div>'.match(re);               // false
 !'<span title=\'\'style="margin:37px;" /></span>'.match(re);  // true
 !'<span title="" style="margin:37px;" /></span>'.match(re);   // false
 !'<a title="u" hghghgh title="j" >'.match(re);                // false
 !'<a title=""gg ff>'.match(re);                               // true

Show all invalid tags:

 var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';

 // Matches anything that looks like an opening tag
 var tagRegex = /<\w+[^>]*\/?>/g;
 // Matches only well-formed opening tags
 var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

 html.match(tagRegex).forEach(function(m) {
   if (!m.match(validRegex)) {
     console.log('Incorrect', m);
   }
 });

Will output

 Incorrect <span title=''style="margin:37px;"/>
 Incorrect <a title=""gg ff/>
 Incorrect <a title="u" hghghgh title="j"example>
 Incorrect <a title=""gg ff>

Update (from the comments): a variant that handles double-quoted, single-quoted and unquoted values separately:

 <\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?> 
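A quick way to try the updated pattern against the five examples from the question is sketched below. The helper name isValidTag, the ^ and $ anchors, and the last two sample tags are additions for illustration, not part of the original answer; the anchors simply force the pattern to cover the whole tag.

```javascript
// Updated pattern from the answer, anchored so that it must cover the whole tag.
// The anchors and the helper name are assumptions added for this sketch.
var updatedRegex = /^<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*\/?>$/;

function isValidTag(tag) {
  return updatedRegex.test(tag);
}

var updatedSamples = [
  '<div style="margin:37px;"/>',
  '<span title=\'\'style="margin:37px;" />',
  '<span title="" style="margin:37px;" />',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>',
  '<input type=text disabled>',   // unquoted value (extra sample)
  '<a title="it\'s" >'            // single quote inside a double-quoted value (extra sample)
];

var updatedResults = updatedSamples.map(isValidTag);
updatedSamples.forEach(function (tag, i) {
  console.log(updatedResults[i] ? 'valid  ' : 'invalid', tag);
});
```

Examples 2 and 5 are reported as invalid and the rest as valid. The last sample is accepted here but would be rejected by the first pattern, because [^"']* does not allow any quote character inside a value.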
+3

Try this regex; I think it will work:

 <\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*> 

< - opening bracket

\w* - zero or more word characters (the tag name)

[^=]*= - everything up to and including the first '='

["'][\w;:]*["'] - the attribute value in single or double quotes; the value may be empty and may contain only word characters, ';' and ':'

[\s/]+ - one or more whitespace characters or '/'

[^>]* - any characters up to the closing '>'
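To see what this pattern accepts, it can be run against the five examples from the question. The sketch below is an illustration only: the variable names and the two extra inputs are assumptions, and a match is read as "the tag is accepted".

```javascript
// Pattern from the answer above, written as a JavaScript literal.
var firstAttrRegex = /<\w*[^=]*=["'][\w;:]*["'][\s\/]+[^>]*>/;

// The five examples from the question.
var questionTags = [
  '<div style="margin:37px;"/></div>',
  '<span title=\'\'style="margin:37px;" /></span>',
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>'
];
var questionResults = questionTags.map(function (tag) {
  return firstAttrRegex.test(tag);
});

// Extra inputs, not from the question.
var gapAfterSecondAttr = firstAttrRegex.test('<a title="u" x="1"y="2" >');
var noTrailingSpace = firstAttrRegex.test('<a title="u">');

console.log(questionResults);     // [ true, false, true, true, false ]
console.log(gapAfterSecondAttr);  // true
console.log(noTrailingSpace);     // false
```

The pattern separates the five examples correctly, but it only inspects what follows the first attribute value: a missing space later in the tag is accepted, and a valid tag whose first value is directly followed by '>' is rejected.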

+1

This pattern finds the incorrect lines 2 and 5 from your question:

 >>> import re
 >>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
 >>> html = """
 ... <div style="margin:37px;"/></div>
 ... <span title=''style="margin:37px;" /></span>
 ... <span title="" style="margin:37px;" /></span>
 ... <a title="u" hghghgh title="j" >
 ... <a title=""gg ff>
 ... """
 >>> bad = re.findall(p, html)
 >>> print('\n'.join(bad))
 <span title=''style="margin:37px;" /></span>
 <a title=""gg ff>

The expression explained:

 p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - opening bracket

[a-z]+\s - one or more lowercase letters followed by a whitespace character

[a-z]+= - one or more lowercase letters followed by an equals sign

[\'\"] - a single or double quote

[\w;:]* - a word character (a-zA-Z0-9_), a colon or a semicolon, zero or more times

[\"\'] - again a single or double quote

[\w]+ - a word character one or more times (this catches the missing space you wanted to find)

.* - anything, zero or more times (consumes the rest of the line)

+1

I'm not sure about this, as I'm not very proficient with regex, but it seems to work well:

Js fiddle

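 <([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))? 

As written, <([a-z]+) works for most tags, but for web components and <ng-*> elements a broader class such as \w+ is better (note that \w does not match a hyphen, so [\w-]+ may be needed for hyphenated names).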

---------------

Output:

 <div style="margin:37px;">div</div>                  correct
 <span title=" style="margin:37px;" />span1</span>    incorrect
 <span title="" style="margin:37px;" />span2</span>   correct
 <a title="u" title="j">link</a>                      correct
 <a title=""href="" alt="" required>test</a>          incorrect
 <img src="" data-abc="" required>                    correct
 <input type=""style="" />                            incorrect
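The remark about tag names can be checked with a short sketch. It keeps only the opening-tag part of the pattern (the trailing part that matches text and a closing tag is dropped), anchors it at the start, and compares \w+ with [\w-]+ as the tag name. The [\w-]+ variant, the variable names and the sample tags are assumptions of this sketch, not part of the answer.

```javascript
// Opening-tag part of the pattern above with two alternative tag-name classes.
// Both variants and the sample tags are assumptions made for this sketch.
var wordTag = /^<(\w+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>/;
var hyphenTag = /^<([\w-]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>/;

var tagChecks = [
  wordTag.test('<ng-view class="main">'),               // \w stops at the hyphen
  hyphenTag.test('<ng-view class="main">'),             // hyphenated name accepted
  hyphenTag.test('<my-element a=""b="">'),              // missing space still rejected
  hyphenTag.test('<img src="" data-abc="" required>')   // plain tag still accepted
];
console.log(tagChecks); // [ false, true, false, true ]
```

With [\w-]+ the hyphenated tag is accepted while the missing-space check keeps working as before.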
+1

Source: https://habr.com/ru/post/1239507/

