RegEx: match a specific string that is not inside the HTML tag

Question

<tag value='botafogo'> botafogo is the best </tag>

I need to match only the botafogo in the text (the one followed by "is the best"), not the botafogo in the value attribute.

My program automatically annotates terms in plain text, so that

 botafogo is the best

becomes

 <team attr='best'>botafogo</team> is the best

But when I then run a "replace all" on the word "best", I have a big problem:

 <team attr='<adjective>best</adjective>'>botafogo</team> is the <adjective>best</adjective>

P.S.: I am using Java.

html regex

celsowm Mar 03 '10 at 2:38

5 answers

polygenelubricants · Answer 1 · 2010-03-03T02:40:11+0000

The best way to achieve this is to NOT use a regular expression and to use a proper HTML parser instead. HTML is not a regular language, and doing this with a regular expression will be tedious, hard to maintain, and will most likely still contain bugs.

HTML parsers, on the other hand, are well suited to the job. Many of them are mature and reliable; they take care of all the little details for you and make your life easier.
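
Since the question is about Java, here is a minimal sketch of the parser approach. It assumes the markup is well-formed enough to be read as XML, so it can use the JDK's built-in DOM parser; real-world HTML often is not, and would need a dedicated HTML parser instead. The class and method names (TextNodeSearch, textNodesContaining) are made up for illustration. Attribute values are not text nodes, so value='botafogo' is never considered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes the input is well-formed XML-like markup.
final class TextNodeSearch {

    // Returns the content of every text node that contains the term.
    static List<String> textNodesContaining(String markup, String term) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            List<String> hits = new ArrayList<>();
            collect(doc.getDocumentElement(), term, hits);
            return hits;
        } catch (Exception e) {
            // Malformed markup or parser setup failure.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Depth-first walk; only text nodes are inspected, attributes are skipped.
    private static void collect(Node node, String term, List<String> hits) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (node.getNodeValue().contains(term)) {
                hits.add(node.getNodeValue());
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), term, hits);
        }
    }

    public static void main(String[] args) {
        String markup = "<tag value='botafogo'> botafogo is the best </tag>";
        System.out.println(textNodesContaining(markup, "botafogo"));
    }
}
```

Once the matching text nodes are known, the annotation can be applied to those nodes only, which leaves every attribute untouched.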

YOU · Answer 2 · 2010-03-03T02:40:41+0000

Have you considered using DOM functions instead of regular expression functions?

 document.getElementsByTagName('tag')[0].innerHTML.match('botafogo')

Matchu · Answer 3 · 2010-03-03T02:42:27+0000

An HTML parser is best; parse the document and then loop through the text content. (See the other answers.)

If you're in PHP, you can take a quick shortcut by running strip_tags() on the content to remove the HTML first. Whether that works depends on what you are doing: if you are performing a replacement, stripping the tags is not an option, but if you are only matching, content that is not part of the match can be removed without problems.

ghostdog74 · Answer 4 · 2010-03-03T03:26:11+0000

@OP, in your favorite language, split on </tag>, then do another split on >. E.g. in Python:

 >>> s="<tag value='botafogo'> botafogo is the best </tag>"
 >>> for item in s.split("</tag>"):
 ...     if "<tag" in item:
 ...         print(item.split(">")[-1])
 ...
  botafogo is the best

No need for a regular expression.
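
For the asker's Java, the same two splits might look like the sketch below. The class and method names (SplitOnTags, textAfterOpeningTag) are invented for this example. Note that String.split takes a regular expression in Java, although neither delimiter here contains regex metacharacters; the limit of -1 keeps trailing empty strings so the behavior mirrors Python's split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical port of the Python snippet above.
final class SplitOnTags {

    // For each chunk ending at "</tag>", keep the text after the last ">".
    static List<String> textAfterOpeningTag(String s) {
        List<String> out = new ArrayList<>();
        for (String item : s.split("</tag>", -1)) {
            if (item.contains("<tag")) {
                String[] parts = item.split(">", -1);
                out.add(parts[parts.length - 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "<tag value='botafogo'> botafogo is the best </tag>";
        System.out.println(textAfterOpeningTag(s));
    }
}
```

Like the original, this only handles one known tag name and breaks down as soon as tags are nested or the text itself contains a > character.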

Serge · Answer 5 · 2012-01-16T07:12:49+0000

I was just looking for a solution to the same problem and came up with one that seems to do the job.

A negative lookahead is the key. To make sure the match is not inside a tag, look ahead and check that no closing angle bracket (>) appears before the next opening one (<). Suppose we want to find the word "needle":

 #needle(?![^<]+>)#i

My use case is in PHP, and it looks something like this:

 function filter_highlighter($content) {
     $patterns = array(
         '#needle(?![^<]+>)#i',
         '#<b>Need</b>le#',
         '#<strong>Need</strong>le#'
     );
     $replacement = '<span class="highlighted">Need</span>le';
     $content = preg_replace($patterns, $replacement, $content);
     return $content;
 }

So far this is working.
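
Because the question asks about Java, here is a rough equivalent of the lookahead idea using java.util.regex, applied to the asker's "best" example. The class and method names (OutsideTagReplace, wrapOutsideTags) are hypothetical. One deliberate difference from the pattern above: the sketch uses [^<]* rather than [^<]+, so that a term immediately followed by > (such as a tag name) is also skipped. It still assumes that > never appears as a literal character in the text and that attribute values contain no <.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the negative-lookahead approach.
final class OutsideTagReplace {

    // Wraps every occurrence of term in <tagName>...</tagName>,
    // except occurrences that sit inside a tag such as <team attr='best'>.
    static String wrapOutsideTags(String html, String term, String tagName) {
        // (?![^<]*>) fails when a '>' is reached before any '<',
        // which means the match started inside a tag.
        Pattern p = Pattern.compile(
                Pattern.quote(term) + "(?![^<]*>)",
                Pattern.CASE_INSENSITIVE);
        // $0 re-inserts the matched text with its original casing.
        String replacement = Matcher.quoteReplacement("<" + tagName + ">")
                + "$0"
                + Matcher.quoteReplacement("</" + tagName + ">");
        return p.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String annotated = "<team attr='best'>botafogo</team> is the best";
        System.out.println(wrapOutsideTags(annotated, "best", "adjective"));
    }
}
```

With the annotated string from the question, the attribute value is left alone and only the trailing "best" in the text is wrapped.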
