Detect email addresses in text using a regular expression

I want to detect email addresses in plain text so that I can wrap each one in an anchor tag with a mailto: link. I have a regex for this, but it also matches addresses that are already wrapped in an anchor tag or that appear in an anchor's mailto: href attribute.

My regex is:

([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)

But it detects 3 matches in the following text example:

 ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com

I want only email@email.com to match the regular expression.

0
3 answers

This is very similar to my previous answer to your other question. Try this:

 (?<!(?:href=['"]mailto:|<a[^>]*>))(\b[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)

The only thing that really differs is the \b word boundary before the start of the email address.

See a similar expression here in Regexr. It is not exactly the same, because Regexr does not support alternation or unbounded-length patterns in a lookbehind.
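As a rough check of the lookbehind approach, here is a minimal sketch in C#, written as top-level statements (this assumes .NET 6 or later). The variable names are illustrative, and the sample string is assumed to have no spaces around the addresses: the lookbehind only rejects an address that directly follows `mailto:` or the `>` of the opening `<a ...>` tag.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text, assumed to have no whitespace around the addresses.
string sample = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";

// The negative lookbehind rejects a match that starts directly after
// href='mailto: (or href="mailto:) or directly after an opening <a ...> tag.
// \b keeps the engine from starting a match one character into a rejected address.
string pattern =
    @"(?<!(?:href=['""]mailto:|<a[^>]*>))" +
    @"(\b[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)";

string[] bareEmails = Regex.Matches(sample, pattern, RegexOptions.IgnoreCase)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

foreach (string email in bareEmails)
{
    Console.WriteLine(email);
}
```

With this sample, the only address that passes the lookbehind is email@email.com. If the markup contains whitespace after `mailto:` or after the opening tag, the lookbehind no longer applies and those addresses would match again.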

+2

It's best to leave the HTML parsing to a tool built for it (e.g. HtmlAgilityPack) and use the regex only to update the text nodes:

    using System.Text.RegularExpressions;

    string sContent = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";
    string sRegex = @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)";
    Regex Regx = new Regex(sRegex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sContent);

    // Select only the text nodes that are not inside an <a> element.
    var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");

    // SelectNodes returns null when no node matches.
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // $0 is the whole match, i.e. the detected address.
            node.InnerHtml = Regx.Replace(node.InnerHtml, @"<a href=""mailto:$0"">$0</a>");
        }
    }

    string fixedContent = doc.DocumentNode.OuterHtml;

I noticed that you posted the same question on other forums, but did not accept an answer on any of them.

+1

Just insert \s+ right after the opening parenthesis, for example:

 (\s+[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)

This way, you will only match emails that are preceded by whitespace, ignoring those that come right after mailto: or the closing > of a tag.
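A minimal C# sketch of this variant, again as top-level statements (assuming .NET 6 or later) with illustrative variable names and a sample string assumed to have no spaces around the addresses inside the markup. Because `\s+` is part of the match, the matched value starts with the whitespace, so the sketch trims it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample text, assumed to have no whitespace after mailto: or after the opening tag.
string sample = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";

// \s+ at the start requires at least one whitespace character before the address.
string pattern =
    @"(\s+[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)";

string[] spacedEmails = Regex.Matches(sample, pattern, RegexOptions.IgnoreCase)
    .Cast<Match>()
    .Select(m => m.Value.Trim()) // the raw match includes the leading whitespace
    .ToArray();

foreach (string email in spacedEmails)
{
    Console.WriteLine(email);
}
```

Two consequences of requiring whitespace: an address at the very start of the text is not matched, and an address inside the markup that happens to follow a space is matched.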

-1

Source: https://habr.com/ru/post/905813/

