Regular expression for syntax highlighting attributes in HTML tag

Question

I am working on regular expressions for syntax highlighting in a Sublime / TextMate language file. This requires me to begin on a non-self-closing HTML tag and end on the corresponding closing tag:

begin: (<)([a-zA-Z0-9:.]+)[^/>]*(>)
end: (</)(\2)([^>]*>)

So far, so good: I can capture the tag name, and the end pattern matches it, so I can apply the appropriate patterns to the area between the tags.

 jsx-tag-area:
   begin: (<)([a-zA-Z0-9:.]+)[^/>]*>
   beginCaptures:
     '1': {name: punctuation.definition.tag.begin.jsx}
     '2': {name: entity.name.tag.jsx}
   end: (</)(\2)([^>]*>)
   endCaptures:
     '1': {name: punctuation.definition.tag.begin.jsx}
     '2': {name: entity.name.tag.jsx}
     '3': {name: punctuation.definition.tag.end.jsx}
   name: jsx.tag-area.jsx
   patterns:
   - {include: '#jsx'}
   - {include: '#jsx-evaluated-code'}

Now I also want to capture zero or more HTML attributes in the opening tag so that I can highlight them.

So, if the tag is <div attr="Something" data-attr="test" data-foo>

It would match on attr, data-attr and data-foo, as well as < and div

Something like (this is very crude):

(<)([a-zA-Z0-9:.]+)(?:\s(?:([0-9a-zA-Z_-]*=?))\s?)*)[^/>]*(>)

It doesn't need to be perfect, since this is just for syntax highlighting, but I am having a hard time figuring out how to get multiple capture groups within the tag, whether I should use look-arounds, etc., or whether this is even possible with a single expression.

Edit: more about a specific case / question - https://github.com/reactjs/sublime-react/issues/18

regex sublimetext2 syntax-highlighting react-jsx

tgriesser Aug 4 '14 at 14:30

4 answers

Oscar Hermosilla · Answer 1 · 2014-09-10T09:28:50+0000

I found a possible solution.

It is not ideal because, as @skamazin said in the comments, a single regex cannot capture an arbitrary number of attributes: you have to repeat the attribute-matching pattern once for every attribute you want to allow, which caps the number of attributes you can capture.

The regex is pretty scary, but it may work for your purpose. It may be possible to simplify it a bit, or you may need to adjust some things.

For just one attribute it would be:

 (<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))

Demo

To get additional attributes you need to add this as many times as you want:

 (?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))?

So, for example, if you want to allow a maximum of 3 attributes, your regular expression will look like this:

 (<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?

Demo

Tell me if it suits you and if you need more information.
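
Since the pattern only grows by repeating one fragment, it can be generated instead of typed by hand. The following is a minimal JavaScript sketch, not part of the original answer: `buildTagRegex` and `maxAttrs` are hypothetical names, the two fragments are copied from the three-attribute expression above, and it assumes a regex engine with lookbehind support (for example a recent Node.js). Whether the Sublime / TextMate engine accepts the same expression is not verified here.

```javascript
// Hypothetical helper: builds the "up to N attributes" regex by repetition.
// The first attribute is mandatory and must have a value; the others are optional.
function buildTagRegex(maxAttrs) {
  const tagStart = '(<)([a-zA-Z0-9:.]+)';
  const firstAttr = '(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))';
  const extraAttr = '(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?';
  // One capture group is added per allowed attribute.
  return new RegExp(tagStart + firstAttr + extraAttr.repeat(Math.max(0, maxAttrs - 1)));
}

const match = buildTagRegex(3).exec('<div attr="Something" data-attr="test" data-foo>');
// Groups 1..5 hold: '<', 'div', 'attr', 'data-attr', 'data-foo'
console.log(match.slice(1));
```

Attributes beyond `maxAttrs` are simply not captured, and values containing spaces will break the match, which is the limitation described above.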

funkwurm · Answer 2 · 2014-09-10T10:34:05+0000

I am not familiar with sublimetext or react-jsx, but this sounds to me like a case of "Regex is your tool, not your solution."

A solution using regex as a tool for this would be something like this JsFiddle (note that the regex is a bit obfuscated because of HTML entities like &gt; for > etc.)

Code that performs the actual replacement:

 blabla.replace(
   /(&lt;!--(?:[^-]|-(?!-&gt;))*--&gt;)|(&lt;(?:(?!&gt;).)+&gt;)|(\{[^\}]+\})/g,
   function(m, c, t, a) {
     if (c != undefined) return '<span class="comment">' + c + '</span>';
     if (t != undefined) return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
     if (a != undefined) return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
   }
 );

So here I first capture the separate types of groups (comments, tags and curly-brace blocks), following this general pattern adapted to the use case of HTML with curly-brace blocks. These captures are passed to a function that determines which type of capture we are dealing with, and then we replace the subgroups inside that capture with their own .replace() calls.

There really is no other reliable way to do this. I can't say how this translates to your environment, but maybe it will help.

mechalynx · Answer 3 · 2014-09-11T13:43:31+0000

Regex alone does not seem to be good enough, but since you are working with Sublime's scripting here, there is a way to simplify both the code and the process. Keep in mind that I am a vim user and not familiar with Sublime's internals. Also, I usually work with JavaScript regular expressions, not PCRE (which seems to be the flavor Sublime uses, or the closest to it).

The idea is this:

use regex to get the tag, attributes (as a string) and tag content
use capture groups for further processing and matching if necessary

In this case, I made this regex:

<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?

It starts by finding the opening of the tag and creates a capture group for the tag name. If it finds a space, it matches the bulk of the attributes (inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposely match any character up to a closing quote, so that the group picks up the bulk of the attributes, which we can process later). It then matches any text between the tags and finally matches the closing tag.

Creates 4 capture groups:

tag name
attribute string
tag content
closing tag

As you can see in this demo, if there is no closing tag we do not get a capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. This could be a problem in general (since we cannot assume that a given captured feature always lands in the same group), but it is not one here: in the ambiguous case where we get neither attributes nor content, so that the second capture group is empty, we can simply take that to mean there are no attributes, and the absence of a third group speaks for itself. If there is nothing to parse, nothing can be parsed incorrectly.

Now, to parse the attributes, we can simply use

([a-z]+=\"[^\"]*?\")

Demo here. This reliably gives us the attributes. If Sublime's scripting lets you get this far, it will certainly let you process them further if necessary. You can, of course, always use something like this:

(([a-z]+)=\"([^\"]*?)\")

which will provide capture groups for the attribute as a whole and its name and value separately.

Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send the content to whatever highlighter you want (or just highlight it as plain text in whatever way is convenient for you).
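
The two passes described above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: `parseTag`, `TAG_RE` and `ATTR_RE` are hypothetical names, the regexes are the ones from this answer written as JavaScript literals, and nothing here shows how the same steps would be wired into a Sublime syntax definition.

```javascript
// Pass 1: tag name, raw attribute string, tag content, optional closing tag.
const TAG_RE = /<([a-z]+) ?([a-z]+=".*?" ?)?>([.\n\sa-z]*)(<\/\1>)?/;
// Pass 2: each attribute as a whole, plus its name and value separately.
const ATTR_RE = /(([a-z]+)="([^"]*?)")/g;

function parseTag(source) {
  const m = TAG_RE.exec(source);
  if (m === null) return null;
  // Unmatched optional groups are undefined in JavaScript,
  // so a missing attribute string is treated as "no attributes".
  const [, name, attrString = '', content, closing] = m;
  const attributes = [];
  for (const a of attrString.matchAll(ATTR_RE)) {
    attributes.push({ name: a[2], value: a[3] });
  }
  return { name, attributes, content, hasClosingTag: closing !== undefined };
}

console.log(parseTag('<div class="box" id="main">hello world</div>'));
```

Note that the attribute-name class `[a-z]+` does not allow hyphens, so names such as data-attr from the question would need a wider character class.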

MustafaG · Answer 4 · 2016-08-08T18:42:33+0000

Your own regex was very helpful in answering your question.

This seems to work well for me:

/ (?: & ; | </) ([A-Za-Z0-9 :.] +) (: \ S (: ([0-9a-Za-Z _-] =) ??) \ s ) [^ / ">] * (?:?> | /">) / g

Regex delimiters (the slashes) are usually required at the beginning and end of the expression. In addition, the "g" at the end means global, so it also matches repeated occurrences.

A good tool that I use to figure out what I'm doing wrong with my regex is: http://regexr.com/

Hope this helps!
