RegEx to get href and src from HTML content?

I am trying to extract the href and src links from an HTML string. Following this post, I was able to get the image part (the src values). Can anyone help me set up a regex that also includes the href URLs in the collection?

 // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
 public List<string> GetLinksFromHtml(string content)
 {
     // Matches <img ... src=...> and captures the src value in group 1.
     string regex = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
     var matches = Regex.Matches(content, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
     var links = new List<string>();
     foreach (Match item in matches)
     {
         string link = item.Groups[1].Value;
         links.Add(link);
     }
     return links;
 }
5 answers

Okey-dokey! No "extra library", just "quick and easy". Here you go:

 <(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:"(?<URL>(?:\\"|[^"])*)"|'(?<URL>(?:\\'|[^'])*)') 

or as a C# string:

 @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')" 

This captures the tag name (a or img) in the Tag_Name group, the URL type (href or src) in the URL_Type group, and the URL itself in the URL group (I know, I got a bit creative with the group names).

It handles either kind of quote (" or '). Although any quotes inside the URL should already be encoded, it also skips over backslash-escaped quotes (\' and \") instead of treating them as the end of the value.

It does not skip unclosed tags (that is, malformed HTML). It finds the opening of one of the tags, such as <a or <img, then ignores everything except a greater-than sign (>) until it reaches the matching URL attribute (href for a tags, src for img tags), and then matches its contents. After that it stops and does not worry about the rest of the tag!
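To tie this back to the method in the question, here is a minimal sketch that plugs the pattern above into `Regex.Matches` and reads the `URL` group. The class name `LinkExtractor` is hypothetical, and the choice of `RegexOptions.IgnoreCase` is an assumption carried over from the question's code; the pattern itself is unchanged.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class LinkExtractor
{
    // The pattern from this answer, as a C# verbatim string.
    // Unnamed group 1, "(a)", drives the conditional (?(1)href|src).
    private static readonly Regex TagUrl = new Regex(
        @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')",
        RegexOptions.IgnoreCase);

    // Same signature as the question's method: returns href and src values
    // in document order.
    public static List<string> GetLinksFromHtml(string content)
    {
        var links = new List<string>();
        foreach (Match m in TagUrl.Matches(content))
        {
            // Both alternatives (double- and single-quoted) share the
            // group name "URL", so one lookup covers either quote style.
            links.Add(m.Groups["URL"].Value);
        }
        return links;
    }
}
```

Calling `LinkExtractor.GetLinksFromHtml(html)` on a string containing `<a id="nav" href="/questions">` and `<img alt='x' src='/pic.png'>` should collect `/questions` and `/pic.png`. If you also need to know which tag each URL came from, `m.Groups["Tag_Name"].Value` and `m.Groups["URL_Type"].Value` are available on the same match.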

Let me know if you want me to break it down for you. Here is a selection of the matches it produced on this very page:

 <Match>                                   'Tag'  'URL_Type'  'URL'
 ----------------------------------------  -----  ----------  -----------------------------
 <a href="http://meta.stackoverflow.com"   a      href        http://meta.stackoverflow.com
 <a href="/about"                          a      href        /about
 <a href="/faq"                            a      href        /faq
 <a href="/"                               a      href        /
 <a id="nav-questions" href="/questions"   a      href        /questions
 ...
 <img src="/posts/8066248/ivc/d499"        img    src         /posts/8066248/ivc/d499

A total of 140 tags were found (I assume additional posts will increase that slightly).

+8
source

I just sketched out this quick regular expression; it is tested and working. Tell me if it suits your needs. (The link URL and the image are captured in the named groups url and img, so they are easy to retrieve.)

 <a(.*?)href="(?P<url>.*?)"(.*?)><img(.*)src="(?P<img>.*?)"(.*?)></a> 

You can also capture images that are not wrapped in a link by making the <a> and </a> tags optional with the ? quantifier, as follows:

 (<a(.*?)href="(?P<url>.*?)"(.*?)>)?(<img(.*)src="(?P<img>.*?)"(.*?)>)(</a>)? 
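Since the question is about C#, here is a hedged sketch of how the optional-link idea might look in .NET. Two things differ from the pattern above, and both are my adaptations rather than part of the answer: .NET spells named groups `(?<name>...)` instead of `(?P<name>...)`, and the `.*` parts are replaced with `[^>]*` and `[^""]*` so that a match cannot run across tag boundaries. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class LinkedImageExtractor
{
    // Optional <a ... href="..."> followed by a required <img ... src="...">.
    // Only double-quoted attribute values are handled, as in the answer.
    private static readonly Regex LinkedImage = new Regex(
        @"(?:<a\b[^>]*?\bhref=""(?<url>[^""]*)""[^>]*>)?\s*<img\b[^>]*?\bsrc=""(?<img>[^""]*)""[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns one (Url, Img) pair per image; Url is empty when the image
    // is not directly preceded by an opening <a> tag.
    public static List<(string Url, string Img)> Find(string html)
    {
        var results = new List<(string Url, string Img)>();
        foreach (Match m in LinkedImage.Matches(html))
        {
            // An unmatched group reports an empty string as its Value.
            results.Add((m.Groups["url"].Value, m.Groups["img"].Value));
        }
        return results;
    }
}
```

For input such as `<a href="/page"><img src="/a.png"></a> <img src="/b.png">`, this should give one pair with both values filled in and a second pair whose `Url` is empty. Check `m.Groups["url"].Success` instead if you need to tell "no link" apart from an empty href.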

Shay

0
source

So monstrous! After all, parsing HTML with regular expressions is evil.

  <img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?href\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?> 
0
source

The code below gets every link in the HTML. Once you have the matches, you can pull the individual parts out of each link:

 string html = "123<a href=\"http://www.codeios.com/home.php\">123123</a>789";
 Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");
 foreach (Match match in r.Matches(html))
 {
     string url = match.Groups["href"].Value;
     string text = match.Groups["value"].Value;
     Response.Write(url + text);
 }
0
source

Links and images can appear in several places.

 Link: the href attribute
 (?<AttributeName>(?:href))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
 For C# (inside a verbatim string):
 (?<AttributeName>(?:href))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

 Image, direct source: the src and background attributes
 (?<AttributeName>(?:src|background))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
 For C# (inside a verbatim string):
 (?<AttributeName>(?:src|background))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

 Image, indirect source: background:url() inside a style attribute
 background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)
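As an illustration of the indirect case, here is a minimal sketch that applies the `background:url()` pattern from this answer. One detail worth noting is that the pattern captures everything up to the closing parenthesis, so any quotes around the URL end up inside `AttributeValue`; the sketch trims them. The class name, method name, and the trimming step are my assumptions, not part of the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class CssBackgroundExtractor
{
    // The answer's pattern for background:url(...), unchanged.
    private static readonly Regex BackgroundUrl = new Regex(
        @"background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)",
        RegexOptions.IgnoreCase);

    public static List<string> GetBackgroundUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in BackgroundUrl.Matches(html))
        {
            // Strip surrounding whitespace, then optional ' or " around the URL.
            string raw = m.Groups["AttributeValue"].Value;
            urls.Add(raw.Trim().Trim('\'', '"'));
        }
        return urls;
    }
}
```

Given `style="background: url('/img/bg.png')"` and `style='background:url(/img/cell.gif)'`, this should return `/img/bg.png` and `/img/cell.gif`. Because the pattern requires the literal text `background` directly before the colon, a `background-image: url(...)` declaration would not be matched as written.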

-1
source

Source: https://habr.com/ru/post/1380395/

