Regular expression for spaces between HTML attributes
How can I detect a missing space between attributes? Example:
<div style="margin:37px;"/></div>
<span title=''style="margin:37px;" /></span>
<span title="" style="margin:37px;" /></span>
<a title="u" hghghgh title="j" >
<a title=""gg ff>
Correct: 1, 3, 4. Incorrect: 2, 5. How can I identify the incorrect ones?
I tried with this:
<(.*?=(['"]).*?\2)([\S].*)|(^/)>
But it does not work.
You should not use regex to parse HTML, except for learning purposes.
http://regexr.com/3cge1
<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>
This regex matches even if the tag has no attributes at all. It also works for self-closing tags and for attributes that have no value.
<\w+ matches the opening < followed by \w characters (the tag name).
(\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must begin with whitespace. It consists of two parts:
\s+[\w-]+ is the attribute name after the required whitespace;
(=(['"]?)[^"']*\3)? is the optional attribute value, where \3 requires the closing quote to match the opening one.
\s*/?> matches optional whitespace and an optional /, followed by the closing >.
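The breakdown above can be mirrored in code by assembling the pattern from its named parts. This is only a sketch: the variable names are made up for illustration, and the ^ and $ anchors are an addition so that each sample has to be one complete tag.

```javascript
// Pieces of the pattern, named after the explanation above.
const tagOpen = '<\\w+';                    // "<" followed by the tag name
const attrName = '\\s+[\\w-]+';             // required whitespace, then the attribute name
const attrValue = '(=([\'"]?)[^"\']*\\3)?'; // optional value; \3 repeats the opening quote
const tagClose = '\\s*/?>';                 // optional whitespace, optional "/", then ">"

// The attribute wrapper is group 1, so the quote capture ends up as group 3.
const validTag = new RegExp(
  '^' + tagOpen + '(' + attrName + attrValue + ')*' + tagClose + '$'
);

// Opening tags taken from the question.
const samples = [
  '<div style="margin:37px;"/>',
  "<span title=''style=\"margin:37px;\" />",
  '<span title="" style="margin:37px;" />',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>',
];

const results = samples.map(function (tag) { return validTag.test(tag); });
console.log(results); // [ true, false, true, true, false ]
```

Keeping the parts separate makes it easier to change one of them, for example the set of characters allowed in an attribute name, without touching the rest.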
Here is the test for the strings:
var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
// match() returns null when no valid tag is found, so ! yields true for invalid input
!'<div style="margin:37px;"/></div>'.match(re);              // false
!'<span title=\'\'style="margin:37px;" /></span>'.match(re); // true
!'<span title="" style="margin:37px;" /></span>'.match(re);  // false
!'<a title="u" hghghgh title="j" >'.match(re);               // false
!'<a title=""gg ff>'.match(re);                              // true
Show all invalid tags:
var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g; // every opening or self-closing tag
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
html.match(tagRegex).forEach(function(m) {
  if (!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});
Will output:
Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>
Update from the comments:
<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
Try this regex, I think it will work:
<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>
< - the opening bracket
\w* - zero or more word characters (the tag name)
[^=]*= - matches every character up to and including the first '='
["'][\w;:]*["'] - covers two cases: 1. a single-quoted value with an optional string 2. a double-quoted value with an optional string
[\s/]+ - matches whitespace or '/' one or more times
[^>]* - matches every character up to the closing '>'
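To see how this pattern behaves, here is a minimal sketch that runs it against the five tags from the question. The names firstAttrOk, samples and matched are illustrative, and the sixth sample is not from the question; it is an assumption added to show that the pattern only looks at what follows the first attribute value.

```javascript
// Pattern from the answer above. Only the text right after the first
// quoted attribute value is checked for whitespace or "/".
const firstAttrOk = /<\w*[^=]*=["'][\w;:]*["'][\s\/]+[^>]*>/;

const samples = [
  '<div style="margin:37px;"/></div>',
  "<span title=''style=\"margin:37px;\" /></span>",
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>',
  // Extra sample (not from the question): the space is missing after the
  // second attribute, which this pattern does not inspect.
  '<a title="" href=""x>',
];

const matched = samples.map(function (s) { return firstAttrOk.test(s); });
samples.forEach(function (s, i) {
  console.log(matched[i] ? 'match   ' : 'no match', s);
});
// matched is [ true, false, true, true, false, true ]
```

So a match means "the first attribute is followed by a separator", and the two incorrect tags from the question are the ones that do not match. A missing space later in the tag still slips through, as the last sample shows.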
I came up with this pattern, which finds the incorrect lines 2 and 5 from your question:
>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
>>> html = """
... <div style="margin:37px;"/></div>
... <span title=''style="margin:37px;" /></span>
... <span title="" style="margin:37px;" /></span>
... <a title="u" hghghgh title="j" >
... <a title=""gg ff>
... """
>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg ff>
The expression explained:
p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
< - the opening bracket
[a-z]+\s - one or more lowercase letters followed by a whitespace character
[a-z]+= - one or more lowercase letters followed by an equals sign
[\'\"] - a single or double quote
[\w;:]* - a word character (a-zA-Z0-9_), a colon, or a semicolon, zero or more times
[\"\'] - again a single or double quote
[\w]+ - a word character one or more times (this catches the missing space you wanted to find)
.* - anything, zero or more times (consumes the rest of the line)
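For comparison with the JavaScript snippets elsewhere in this thread, the same idea can be tried in JavaScript. This is a sketch of a port, not the answerer's code: it assumes the character classes are meant to be [a-z], as the explanation says, and the names badTag, html and bad are chosen here for illustration.

```javascript
// JavaScript port of the Python pattern: a quoted value immediately
// followed by a word character means the separating space is missing.
const badTag = /<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*/g;

const html = [
  '<div style="margin:37px;"/></div>',
  "<span title=''style=\"margin:37px;\" /></span>",
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh title="j" >',
  '<a title=""gg ff>',
].join('\n');

// "." does not cross line breaks, so each match stops at the end of its line.
// match() returns null when nothing is found, hence the fallback to [].
const bad = html.match(badTag) || [];
console.log(bad.join('\n'));
```

Like the Python original, this matches the bad tags directly instead of validating the good ones, and it only looks at the first attribute of each tag.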
I'm not sure about this, since I'm not that proficient with regex, but it seems to work well:
<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?
For now <([a-z]+) will work in most cases, but for web components and <ng-* tags something like [\w-]+ would be better, since \w alone does not match the hyphen.
---------------
Output:
<div style="margin:37px;">div</div> correct
<span title=" style="margin:37px;" />span1</span> incorrect
<span title="" style="margin:37px;" />span2</span> correct
<a title="u" title="j">link</a> correct
<a title=""href="" alt="" required>test</a> incorrect
<img src="" data-abc="" required> correct
<input type=""style="" /> incorrect
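The classification above can be reproduced with a short script. This is a sketch under a few assumptions: it uses only the opening-tag portion of the pattern, anchored at the start of each sample, it takes the character classes to be [a-z], and the helper name isValidOpeningTag is hypothetical.

```javascript
// Opening-tag part of the pattern, anchored so that the tag at the very
// start of each sample has to match.
const openingTag = /^<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>/;

// Hypothetical helper: true when the sample starts with a well-spaced tag.
function isValidOpeningTag(sample) {
  return openingTag.test(sample);
}

const samples = [
  '<div style="margin:37px;">div</div>',
  '<span title=" style="margin:37px;" />span1</span>',
  '<span title="" style="margin:37px;" />span2</span>',
  '<a title="u" title="j">link</a>',
  '<a title=""href="" alt="" required>test</a>',
  '<img src="" data-abc="" required>',
  '<input type=""style="" />',
];

const labels = samples.map(function (s) {
  return isValidOpeningTag(s) ? 'correct' : 'incorrect';
});
samples.forEach(function (s, i) { console.log(s, labels[i]); });
```

Because this pattern accepts only double-quoted values, a tag that uses single quotes would be reported as incorrect even when its spacing is fine.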