RegEx to get href and src from HTML content?

Question

I am trying to extract href and src links from an HTML string. Following this post, I was able to get the image part working. Can anyone help me set up a regex that also includes the href URL in the collection?

    public List<string> GetLinksFromHtml(string content)
    {
        string regex = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
        var matches = Regex.Matches(content, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var links = new List<string>();

        foreach (Match item in matches)
        {
            string link = item.Groups[1].Value;
            links.Add(link);
        }

        return links;
    }

+4

html c# regex html-parsing

TruMan1 Nov 09 '11 at 14:10

5 answers

I just sketched out this regular expression quickly, but it is tested and working; tell me if it suits your needs. (The URL and the image are captured in named groups, so they are easy to retrieve. The groups use the (?P<name>...) syntax; in .NET, write (?<name>...) instead.)

 <a(.*?)href="(?P<url>.*?)"(.*?)><img(.*)src="(?P<img>.*?)"(.*?)></a>

You can also capture images that are not wrapped in a link by making the <a> and </a> tags optional with the ? quantifier, as follows:

 (<a(.*?)href="(?P<url>.*?)"(.*?)>)?(<img(.*)src="(?P<img>.*?)"(.*?)>)(</a>)?

Shay

0

Shai mishali Nov 09 '11 at 14:27

Something monstrous like this! (Parsing HTML with regular expressions is evil.)

  <img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?href\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>

0

Vitaly slobodin Nov 09 '11 at 14:27

The code below extracts every link from the HTML; once you have the matches, you can read the individual parts of each link:

    string html = "123<a href=\"http://www.codeios.com/home.php\">123123</a>789";
    Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

    foreach (Match match in r.Matches(html))
    {
        string url = match.Groups["href"].Value;
        string text = match.Groups["value"].Value;
        Response.Write(url + text);
    }

0

Wilson wu Jun 17 '13 at 10:06

There are several places where you can find the link and image.

 - Link
   - href
     (?<AttributeName>(?:href))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
     For C#:
     (?<AttributeName>(?:href))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

check here

 - Image
   - Image_DirectSource
     - src
     - background
     (?<AttributeName>(?:src|background))\s*=\s*["'](?<AttributeValue>(?:[^"'])*)
     For C#:
     (?<AttributeName>(?:src|background))\s*=\s*[""'](?<AttributeValue>(?:[^""'])*)

check here

   - Image_IndirectSource
     - style
       - background:url()
     background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)

check here
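As a rough illustration of the indirect-source case, here is a minimal C# sketch that applies the background:url() pattern above. The class name StyleUrlExtractor and the method name GetBackgroundUrls are hypothetical, and stripping surrounding quotes and whitespace from the captured value is an added assumption, since the pattern itself captures everything up to the closing parenthesis.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class StyleUrlExtractor
{
    // Same pattern as above: captures everything between "url(" and the next ")".
    private static readonly Regex BackgroundUrl = new Regex(
        @"background\s*:\s*url\s*\(\s*(?<AttributeValue>(?:[^)])*)",
        RegexOptions.IgnoreCase);

    public static List<string> GetBackgroundUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match match in BackgroundUrl.Matches(html))
        {
            // Assumption: drop surrounding whitespace and optional quotes,
            // because the pattern keeps them in the captured value.
            string raw = match.Groups["AttributeValue"].Value;
            urls.Add(raw.Trim().Trim('\'', '"'));
        }
        return urls;
    }
}
```

Calling StyleUrlExtractor.GetBackgroundUrls on a string such as <div style="background:url('/img/bg.png')"> should yield the single entry /img/bg.png.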

-1

Frank myat thu Jul 01 '14 at 9:06

Code jockey · Accepted Answer · 2011-11-09T15:48:44+0000

Okey-dokey! No extra library, and quick and easy, here you go:

 <(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:"(?<URL>(?:\\"|[^"])*)"|'(?<URL>(?:\\'|[^'])*)')

or as a C# string:

 @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')"

This captures the tag name (a or img) in the Tag_Name group, the URL type (href or src) in the URL_Type group, and the URL in the URL group (I know, I got a bit creative with the group names).

It handles either type of quotation mark (" or '), and although any quote inside a URL should already be encoded as an entity, it will skip over escaped quotes (\' and \").

It does not skip unclosed tags (that is, malformed HTML). It finds the opening of one of the tags, such as <a or <img, then skips over everything except a greater-than sign (>) until it finds the matching URL attribute (href for a tags, src for img tags), and then matches the contents. After that it bails out and does not worry about the rest of the tag!

Let me know if you want me to break it down for you, but here is a sample of the matches it found on this very page:

    <Match>                                   'Tag'  'URL_Type'  'URL'
    ----------------------------------------  -----  ----------  -----------------------------
    <a href="http://meta.stackoverflow.com"   a      href        http://meta.stackoverflow.com
    <a href="/about"                          a      href        /about
    <a href="/faq"                            a      href        /faq
    <a href="/"                               a      href        /
    <a id="nav-questions" href="/questions"   a      href        /questions
    ...
    <img src="/posts/8066248/ivc/d499"        img    src         /posts/8066248/ivc/d499

A total of 140 tags were found (I assume additional posts will increase that slightly).
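For reference, here is a minimal sketch of how the expression above might be dropped into the GetLinksFromHtml method from the question. The wrapper class name LinkExtractor is hypothetical, and returning only the URL group (rather than the tag name or URL type) is an assumption about what the caller wants.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only the pattern comes from the answer above.
public static class LinkExtractor
{
    // The unnamed group (a) is group 1, so (?(1)href|src) looks for href
    // after <a and for src after <img.
    private static readonly Regex UrlPattern = new Regex(
        @"<(?<Tag_Name>(a)|img)\b[^>]*?\b(?<URL_Type>(?(1)href|src))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> GetLinksFromHtml(string content)
    {
        var links = new List<string>();
        foreach (Match match in UrlPattern.Matches(content))
        {
            // Both quoting alternatives write to the same named group,
            // which .NET permits, so one lookup covers " and ' URLs.
            links.Add(match.Groups["URL"].Value);
        }
        return links;
    }
}
```

For an input such as <a href="/about">About</a><img src='/logo.png' alt="x">, this should return /about followed by /logo.png. The Tag_Name and URL_Type groups remain available on each Match if the caller needs to tell links and images apart.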
