Regular expression for parsing hyperlinks and descriptions

C#: what is a regex for parsing hyperlinks and their descriptions?

Please consider case insensitivity, white space, and the use of single quotes (instead of double quotes) around the HREF attribute value.

Please also consider hyperlinks that have other tags inside the <a> tags, for example <b> and <i>.

+3
6 answers

As long as there are no nested tags (and no line breaks), the following pattern works well:

 <a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a> 
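A minimal sketch of how that pattern could be used from C#, assuming the case-insensitivity and white-space requirements from the question are covered with `RegexOptions.IgnoreCase` and optional `\s*` around the `=`. The class name `SimpleLinkParser` and the `Parse` method are hypothetical names for illustration, and `RegexOptions.Singleline` is an added assumption so that `.` also crosses line breaks:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleLinkParser
{
    // Same pattern as above, plus optional white space around '='.
    // IgnoreCase handles <A HREF=...>; Singleline lets '.' match newlines.
    private static readonly Regex LinkPattern = new Regex(
        @"<a\s+href\s*=\s*(?:""([^""]+)""|'([^']+)').*?>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (url, text) pairs for every link found in the input.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            // Group 1 is the double-quoted URL, group 2 the single-quoted one;
            // only one of them takes part in any given match.
            string url = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            links.Add(new KeyValuePair<string, string>(url, m.Groups[3].Value));
        }
        return links;
    }
}
```

Note that the pattern still expects `href` to be the first attribute after `<a`, so a link such as `<a class="x" href="...">` would be skipped by this sketch.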

Once nested tags enter the picture, regular expressions become unsuitable for parsing. However, you can still use them by relying on the more advanced features of modern regex engines (depending on which engine you use). For instance, .NET regular expressions support balancing groups, which work like a stack; I found this:

 (?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>) 

Source: http://weblogs.asp.net/scottcate/archive/2004/12/13/281955.aspx
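As a rough illustration of the balancing-group idea, the sketch below wraps the pattern above in a small helper. The names `NestedLinkParser` and `FirstLink` are hypothetical, and `RegexOptions.IgnoreCase` is an assumption added to meet the case-insensitivity requirement from the question. Each inner `<a ...>` pushes a capture onto `DEPTH`, each `</a>` pops one, and `(?(DEPTH)(?!))` rejects the match if any opening tag is left unclosed:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedLinkParser
{
    // The pattern from the answer above, split across lines for readability.
    private static readonly Regex Balanced = new Regex(
        @"(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)" +
        @"(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)" +
        @"(?(DEPTH)(?!))(?:</a>)",
        RegexOptions.IgnoreCase);

    // Returns the first match; read Groups["url"] and Groups["name"] from it.
    public static Match FirstLink(string html)
    {
        return Balanced.Match(html);
    }
}
```

For an input such as `<a href="outer.html">x <a href="inner.html">y</a> z</a>`, the `url` group should hold the outer URL and the `name` group the whole inner content, including the nested link. Because `.` does not match newlines without `RegexOptions.Singleline`, this sketch only handles links that sit on a single line.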

+6

See this example from fooobar.com/questions/7978 / ...

Using the HTML Agility Pack, you can parse HTML and retrieve data based on HTML semantics instead of relying on a fragile regular expression.

+3

I found this, but apparently these guys had problems with it.

Edit: (It works!)
I have now done my own testing and found that it works. I don't know C#, so I can't give you a C# answer, but I do know PHP, and here is the matches array I got back from it:

 <a href="pages/index.php" title="the title">Text</a>

 array(3) {
   [0]=> string(52) "<a href="pages/index.php" title="the title">Text</a>"
   [1]=> string(15) "pages/index.php"
   [2]=> string(4) "Text"
 }
+1

I have a regex that handles most cases, although I believe it matches HTML in multi-line comments.

It is written using .NET syntax, but should be easily translatable.

+1

Just going to drop this snippet here now that I have it working. This is a less greedy version of a pattern suggested earlier. The original would not work if the input contained several hyperlinks. The code below lets you loop through all the hyperlinks:

 // Requires: using System.Text.RegularExpressions;
 static Regex rHref = new Regex(
     @"<a.*?href=[""'](?<url>[^""']+)[""'].*?>(?<keywords>[^<]+)</a>",
     RegexOptions.IgnoreCase | RegexOptions.Compiled);

 public void ParseHyperlinks(string html)
 {
     // Matches() returns every non-overlapping hyperlink in the input.
     MatchCollection mcHref = rHref.Matches(html);
     foreach (Match m in mcHref)
         // AddKeywordLink is the poster's own helper (not shown here).
         AddKeywordLink(m.Groups["keywords"].Value, m.Groups["url"].Value);
 }
0

Here is a regex that will match balanced tags.

 (?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)

0

Source: https://habr.com/ru/post/899742/
