Regular expression for spaces between html attributes

Question

How can I detect a missing space between attributes? Example:

 1. <div style="margin:37px;"/></div>
 2. <span title=''style="margin:37px;" /></span>
 3. <span title="" style="margin:37px;" /></span>
 4. <a title="u" hghghgh title="j" >
 5. <a title=""gg ff>

Correct: 1, 3, 4. Incorrect: 2, 5. How can I identify the incorrect ones?

I tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it does not work.

html regex

wroe12 Dec 30 '15 at 18:35

4 answers

sina · Answer 1 · 2015-12-30T19:48:50+0000

You should not use regex to parse HTML, except for learning purposes.

http://regexr.com/3cge1

 <\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>

This regex also matches tags that have no attributes at all. It works for self-closing tags and for attributes that have no value.

<\w+ matches the opening < followed by the tag name (word characters).
(\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must begin with whitespace. It consists of two parts:
- \s+[\w-]+ the attribute name, after the required whitespace
- (=(['"]?)[^"']*\3)? an optional attribute value; \3 requires the closing quote to be the same as the opening one
\s*/?> optional whitespace and an optional /, followed by the closing >.
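
The breakdown above can also be written directly into the pattern. The following is a minimal Python sketch, not part of the original answer: it ports the same regex to Python's re module with re.VERBOSE so each piece carries its own comment. The names VALID_TAG and is_valid_opening_tag are made up for illustration, and fullmatch is used on the assumption that each string holds exactly one opening tag.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so every part is commented.
VALID_TAG = re.compile(r"""
    <\w+                # opening < and the tag name
    (                   # group 1: one attribute, repeated zero or more times
        \s+[\w-]+       #   required whitespace, then the attribute name
        (               #   group 2: the optional value
            =(['"]?)    #     equals sign and an optional quote (group 3)
            [^"']*      #     value text containing no quote characters
            \3          #     whatever group 3 matched: the same quote, or nothing
        )?
    )*
    \s*/?>              # optional whitespace, optional slash, closing >
""", re.VERBOSE)


def is_valid_opening_tag(tag):
    """Return True if the whole string is a single well-formed opening tag."""
    return VALID_TAG.fullmatch(tag) is not None


# The opening tags of the five examples from the question.
samples = [
    '<div style="margin:37px;"/>',
    '<span title=\'\'style="margin:37px;" />',
    '<span title="" style="margin:37px;" />',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]
for number, tag in enumerate(samples, start=1):
    print(number, is_valid_opening_tag(tag), tag)
```

Tags 2 and 5 are the ones rejected, because in both a name follows a closing quote without whitespace in between.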

Here is the test for the strings:

 var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
 !'<div style="margin:37px;"/></div>'.match(re);                 // false
 !'<span title=\'\'style="margin:37px;" /></span>'.match(re);    // true
 !'<span title="" style="margin:37px;" /></span>'.match(re);     // false
 !'<a title="u" hghghgh title="j" >'.match(re);                  // false
 !'<a title=""gg ff>'.match(re);                                 // true

Show all invalid tags:

 var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
 // Grab every opening tag, well-formed or not
 var tagRegex = /<\w+[^>]*\/?>/g;
 // Matches only tags whose attributes are properly separated
 var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;
 html.match(tagRegex).forEach(function(m) {
   if (!m.match(validRegex)) {
     console.log('Incorrect', m);
   }
 });

Will output

 Incorrect <span title=''style="margin:37px;"/>
 Incorrect <a title=""gg ff/>
 Incorrect <a title="u" hghghgh title="j"example>
 Incorrect <a title=""gg ff>

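Update, following the comments (handles double-quoted, single-quoted, and unquoted values as separate alternatives):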

 <\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
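
To see what the separate alternatives change, here is a small Python sketch (an illustration, not part of the original answer) that runs the first pattern and the updated one side by side. The names ORIGINAL and UPDATED and the sample tags are chosen for this example only.

```python
import re

# First pattern from this answer: one character class covers both quote types.
ORIGINAL = re.compile(r"""<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>""")

# Updated pattern: separate branches for "...", '...' and unquoted values.
UPDATED = re.compile(r"""<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>""")

samples = [
    '''<a title="it's">''',     # apostrophe inside a double-quoted value
    '''<input type=text>''',    # unquoted value
    '''<a title=""gg ff>''',    # missing space after the value
]
for tag in samples:
    print(tag,
          "original:", bool(ORIGINAL.fullmatch(tag)),
          "updated:", bool(UPDATED.fullmatch(tag)))
```

The first pattern rejects `<a title="it's">` because `[^"']*` excludes both quote characters inside a value, while the updated pattern accepts it. Both still reject the tag with the missing space.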

Stark · Answer 2 · 2015-12-30T19:40:02+0000

Try this regex; I think it will work:

 <\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>

< - opening bracket

\w* - zero or more word characters (the tag name)

[^=]*= - matches everything up to and including the first '='

["'][\w;:]*["'] - matches the attribute value in two cases: 1. single-quoted, 2. double-quoted; the text between the quotes is optional

[\s/]+ - matches whitespace or '/' one or more times

[^>]* - matches all characters up to the closing '>'
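
As a rough check of this pattern, the Python sketch below (an illustration, not the answerer's code) runs it against the five lines from the question, one line at a time. The names PATTERN, SAMPLES, and matching_lines are invented for the example. Note that this pattern matches the well-formed tags rather than the broken ones, and it only inspects the first attribute.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"""<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>""")

# The five lines from the question, in order.
SAMPLES = [
    '<div style="margin:37px;"/></div>',
    '<span title=\'\'style="margin:37px;" /></span>',
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
]


def matching_lines(lines):
    """Return the 1-based numbers of the lines the pattern matches."""
    return [n for n, line in enumerate(lines, start=1) if PATTERN.search(line)]


print(matching_lines(SAMPLES))

# Caveat: only the first attribute is checked, so a missing space
# later in the tag goes unnoticed.
print(bool(PATTERN.search('<a title="u" x="1"y="2">')))
```

Lines 1, 3, and 4 match, so the lines that do not match (2 and 5) are the incorrect ones. The last check shows the limitation: a missing space after the second attribute is still accepted.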

Totem · Answer 3 · 2015-12-30T19:41:09+0000

I came up with this pattern, which finds the incorrect lines 2 and 5 from your question:

 >>> import re
 >>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'
 >>> html = """
 ... <div style="margin:37px;"/></div>
 ... <span title=''style="margin:37px;" /></span>
 ... <span title="" style="margin:37px;" /></span>
 ... <a title="u" hghghgh title="j" >
 ... <a title=""gg ff>
 ... """
 >>> bad = re.findall(p, html)
 >>> print('\n'.join(bad))
 <span title=''style="margin:37px;" /></span>
 <a title=""gg ff>

The expression explained:

 p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - opening bracket

[a-z]+\s - 1 or more lowercase letters followed by a whitespace character

[a-z]+= - 1 or more lowercase letters followed by an equals sign

[\'\"] - single or double quote

[\w;:]* - matches a word character (a-zA-Z0-9_), a colon, or a semicolon, 0 or more times

[\"\'] - again single or double quote

[\w]+ - matches a word character one or more times (this captures the missing space that you wanted to find)

.* - match anything 0 or more times (gets the rest of the line)
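
For use on a longer file it can help to report line numbers instead of raw matches. The sketch below is one possible way to do that, assuming the character classes in the pattern are meant to be [a-z] and that each tag sits on its own line. The names BAD_ATTR, HTML, and find_bad_lines are hypothetical.

```python
import re

# The pattern from this answer; [a-z] is assumed for the letter classes.
BAD_ATTR = re.compile(r"""<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*""")

# The five lines from the question, one tag per line.
HTML = "\n".join([
    '<div style="margin:37px;"/></div>',
    '<span title=\'\'style="margin:37px;" /></span>',
    '<span title="" style="margin:37px;" /></span>',
    '<a title="u" hghghgh title="j" >',
    '<a title=""gg ff>',
])


def find_bad_lines(text):
    """Return (line_number, line) pairs for every line the pattern flags."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if BAD_ATTR.search(line)
    ]


for number, line in find_bad_lines(HTML):
    print(number, line)
```

Like the pattern itself, this only looks at the first attribute of each tag, and only at values made of word characters, colons, and semicolons.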

Mi-creativity · Answer 4 · 2015-12-30T21:49:57+0000

I'm not sure about this, as I'm not very proficient with regex, but it seems to work well:

Js fiddle

 <([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/$1>))?

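Currently, <([a-z]+) will work in most cases, but for web components and <ng-* tags a broader class for the tag name would be better. Note that \w+ alone does not match the hyphen, so something like [\w-]+ is needed.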
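
The effect of the tag-name class can be checked with a short Python sketch. This is an illustration under assumptions: it uses only the opening-tag part of the pattern above, the tag <my-element> is a made-up custom element, and the helper name name_class_matches is hypothetical.

```python
import re

# Attribute and closing part of the pattern above (opening tag only).
ATTRS = r"""(\s+[a-z\-]+(="[^"]*")?)*\s*/?>"""


def name_class_matches(name_class, tag):
    """Build the opening-tag pattern with the given tag-name class and test it."""
    pattern = re.compile("<(" + name_class + ")" + ATTRS)
    return pattern.match(tag) is not None


tag = '<my-element class="x">'
for name_class in (r"[a-z]+", r"\w+", r"[\w-]+"):
    print(name_class, name_class_matches(name_class, tag))
```

Only the class that includes the hyphen matches the hyphenated tag name; the other two stop at "my" and then fail on the "-".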

---------------

Output:

 <div style="margin:37px;">div</div>   correct
 <span title=" style="margin:37px;" />span1</span>   incorrect
 <span title="" style="margin:37px;" />span2</span>   correct
 <a title="u" title="j">link</a>   correct
 <a title=""href="" alt="" required>test</a>   incorrect
 <img src="" data-abc="" required>   correct
 <input type=""style="" />   incorrect
