Regex string error when using plain text urls

Question

I need working regex code in C# that detects plain-text URLs (http / https / ftp / ftps) in a string and makes them clickable by wrapping each one in an anchor tag with the same URL. I have already written a regex pattern; the code is below.

However, if the input string already contains a clickable URL, the code below wraps another anchor tag around it. For example, the existing substring "<a href='ftp://www.abc.com'>ftp://www.abc.com</a>" in sContent gets a second anchor tag when the code below is executed. Is there any way to fix this?

    string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

    Regex regx = new Regex("(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

    MatchCollection matches = regx.Matches(sContent);

    foreach (Match match in matches)
    {
        sContent = sContent.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
    }

In addition, I want the regex code to make email addresses clickable with a mailto: link. I can do that myself, but the double-tagging problem described above will appear there too.

+6

c# url regex .net

Computer user Jan 12 '12 at 10:33

4 answers

Try this:

 Regex regx = new Regex("(?<!(?:href='|>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

It should work for your example.

(?<!(?:href='|>)) is a negative lookbehind. It means the pattern matches only if it is not preceded by "href='" or ">".

See the lookaround section at regular-expressions.info, and especially the zero-width negative lookbehind assertion on MSDN.

See something similar on Regexr. I had to remove the alternation from the lookbehind there, but .NET should be able to handle it.

Update

To make sure that cases like "<p>ftp://www.def.com</p>" are handled correctly, I improved the regular expression:

 Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

Now the lookbehind (?<!(?:href='|<a[^>]*>)) checks that the URL is not preceded by "href='" or by a tag starting with "<a".

The output for the test string

 ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com

with this expression is

 ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p><a href='ftp://www.def.com'>ftp://www.def.com</a></p> abbbbb <a href='http://www.ghi.com'>http://www.ghi.com</a>
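
A minimal sketch for comparing the two lookbehinds on the test string above. The class and member names (LookbehindDemo, UrlBody, CountMatches) are made up for illustration; the patterns themselves are copied from this answer. The first version also rejects the URL inside <p>...</p> because it is preceded by ">", while the second version only rejects URLs preceded by "href='" or an opening <a ...> tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // URL part shared by both versions, copied from the patterns above.
    private const string UrlBody =
        "(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

    // First attempt: reject anything preceded by href=' or by any '>'.
    public const string FirstLookbehind = "(?<!(?:href='|>))";

    // Updated attempt: reject only href=' or an opening <a ...> tag.
    public const string SecondLookbehind = "(?<!(?:href='|<a[^>]*>))";

    // Hypothetical helper: counts the URLs a given lookbehind lets through.
    public static int CountMatches(string lookbehind, string input)
    {
        return Regex.Matches(input, lookbehind + UrlBody, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string test = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com";

        Console.WriteLine(CountMatches(FirstLookbehind, test));  // 1: only http://www.ghi.com
        Console.WriteLine(CountMatches(SecondLookbehind, test)); // 2: ftp://www.def.com and http://www.ghi.com
    }
}
```

The variable-length lookbehind <a[^>]*> works here because the .NET regex engine allows quantifiers inside lookbehind.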

+5

stema Jan 12 '12 at 10:42

I know I'm late to this party, but there are several problems with the regex that the existing answers don't address. First, and most annoying, there's that forest of backslashes. If you use C#'s verbatim strings, you don't have to do all that double escaping. And most of the backslashes aren't needed in the first place anyway.

Second, there's this bit: ([\\w+?\\.\\w+])+ . The square brackets form a character class, and everything inside them is treated either as a literal character or as a shorthand class like \w . So the whole group matches a single word character, plus sign, question mark or dot, repeated. But getting rid of the square brackets isn't enough to make it work. I suspect this is what you were trying for: \w+(?:\.\w+)+ .
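
A small sketch of the difference between the bracketed form and the grouped form. The class and member names are hypothetical, and both patterns are anchored with ^ and $ only so that the comparison is easy to see; the anchors are not part of the original regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Bracketed form from the question: a single character class, repeated.
    public const string ClassPattern = @"^([\w+?\.\w+])+$";

    // Grouped form: a word, followed by one or more ".word" parts.
    public const string GroupPattern = @"^\w+(?:\.\w+)+$";

    public static bool IsMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        foreach (string host in new[] { "www.abc.com", "...", "+?+?" })
        {
            // The class form accepts all three inputs;
            // the grouped form accepts only "www.abc.com".
            Console.WriteLine("{0}: class={1}, group={2}",
                host, IsMatch(ClassPattern, host), IsMatch(GroupPattern, host));
        }
    }
}
```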

Third, the quantifiers at the end of the regex - ]*)? - don't work well together. The * can already match zero or more characters, so there is no point in making the enclosing group optional as well. In addition, that kind of arrangement can lead to severe performance degradation. See this page for more details.

There are other, minor issues, but I will not go into them right now. Here's a new and improved regex:

 @"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))"

The negative lookahead - (?!(?>[^<>]*)(>|</a>)) - is what prevents matches inside tags or in the content of an anchor element. It's still very crude, though. There are several places, for example inside <script> elements, where you don't want it to match but it does. But trying to cover all the possibilities would lead to a mile-long regex.
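
A minimal sketch of how such a lookahead-based pattern could be applied to the test string from the question with Regex.Replace. The class and method names are hypothetical. The inner part of the lookahead is written as an atomic group, (?>[^<>]*), which is the form the .NET regex engine accepts (it has no possessive quantifiers), and the replacement pattern "$0" stands for the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadLinkifier
{
    // (?n) turns on ExplicitCapture, so the plain groups do not capture.
    // The lookahead rejects a URL when the next '<' or '>' after it
    // is a '>' (inside a tag) or starts "</a>" (inside anchor content).
    private const string Pattern =
        @"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))";

    public static string Linkify(string content)
    {
        // $0 is the full text of each match.
        return Regex.Replace(content, Pattern, "<a href='$0'>$0</a>");
    }

    public static void Main()
    {
        string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

        // The existing anchor is left alone; the two bare URLs are wrapped.
        Console.WriteLine(Linkify(sContent));
    }
}
```

Because Regex.Replace substitutes each match at its own position, the URL that is already linked is not touched even though the same text appears again later in the string.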

+1

Alan moore Jan 18 '12 at 20:30

Check out "Detect emails in text using regular expressions" and "Regex Replace URLs, ignore images and existing links". Just swap in the regular expression for links; it will only ever replace a link in the text content, never one inside a tag.

http://html-agility-pack.net/?z=codeplex

Something like:

    string textToBeLinkified = "... your text here ...";
    const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
    Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(textToBeLinkified);

    // Select only the text nodes that are not already inside an anchor.
    // SelectNodes returns null when nothing matches, so check before looping.
    var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
        }
    }

    string linkifiedText = doc.DocumentNode.OuterHtml;

0

jessehouwing Feb 22 '12 at 10:36

Kev ritchie · Accepted Answer · 2012-01-12T12:12:00+0000

I noticed in your example test string that if a link is duplicated, e.g. ftp://www.abc.com appears in the string both as plain text and already linked, then the result is that the existing link gets linked a second time. The regular expression you already have, or the one @stema provided, will work, but you need to change the way you replace the matches in the sContent variable.

The following code example should provide you with what you want:

    string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

    Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

    MatchCollection matches = regx.Matches(sContent);

    // Walk the matches from last to first so the indexes of earlier matches stay valid.
    for (int i = matches.Count - 1; i >= 0; i--)
    {
        string newURL = "<a href='" + matches[i].Value + "'>" + matches[i].Value + "</a>";

        sContent = sContent.Remove(matches[i].Index, matches[i].Length).Insert(matches[i].Index, newURL);
    }
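
To see why the replacement has to be done by position rather than by text, here is a small comparison sketch. It is not part of the answer above: the class and method names are hypothetical, and LinkifyByPosition uses Regex.Replace with a MatchEvaluator as an assumed alternative to the reverse loop, since it also replaces only the span of each match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Lookbehind pattern copied from the code above.
    private static readonly Regex UrlRegex = new Regex(
        "(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?",
        RegexOptions.IgnoreCase);

    private static string Wrap(string url)
    {
        return "<a href='" + url + "'>" + url + "</a>";
    }

    // String.Replace rewrites every occurrence of the matched text,
    // including the copies that already sit inside an anchor.
    public static string LinkifyByText(string content)
    {
        foreach (Match m in UrlRegex.Matches(content))
        {
            content = content.Replace(m.Value, Wrap(m.Value));
        }
        return content;
    }

    // Regex.Replace rewrites only the span of each match.
    public static string LinkifyByPosition(string content)
    {
        return UrlRegex.Replace(content, m => Wrap(m.Value));
    }

    public static void Main()
    {
        string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

        Console.WriteLine(LinkifyByText(sContent));     // existing anchor gets mangled
        Console.WriteLine(LinkifyByPosition(sContent)); // existing anchor is left intact
    }
}
```

With the question's test string, the regex finds only the two bare URLs in both methods; the difference is entirely in how the replacement is applied.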
