Use curl to parse the XML, get the image url and load it

Question

I want to write a shell script to get an image from an rss feed. Right now I have:

curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g'

I use this to capture the first occurrence of the image URL in the file. Now I want to store this URL in a variable so that I can call cURL again to download the image. Any help appreciated! (Tips on how best to strip everything but the URL from the string are also welcome. This is the string:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />

Perhaps there is some better regex for removing everything except the url than my solution.) Thanks in advance!

+3

shell perl curl download

tzippy Aug 2 '10 at 20:10

5 answers

DVK · Answer 1 · 2010-08-02T20:17:32+0000

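Parsing HTML/XML with regular expressions is generally a bad idea. Use a proper parser instead.

Since you tagged the question Perl, here are Perl solutions using an XML or an HTML parser, depending on which you need: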

HTML

curl http://BOGUS.com |& perl -e '{use HTML::TokeParser; 
    $parser = HTML::TokeParser->new(\*STDIN); 
    $img = $parser->get_tag('img') ; 
    print "$img->[1]->{src}\n"; 
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
    $twig=XML::Twig->new(twig_handlers =>{img => sub { 
       print $_[1]->att("src")."\n"; exit 0;}}); 
    open(my $fh, "-");
    $twig->parse($fh);
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

ghostdog74 · Answer 2 · 2010-08-03T00:47:47+0000

You can also use wget instead of curl:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
  gsub(/.*<img src=\"/,"")
  gsub(/\".[^>]*>/,"")
  print
}'  |  xargs -I{} wget "{}"

meder omuraliev · Answer 3 · 2010-08-02T20:18:13+0000

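Use a DOM parser and get the img elements with getElementsByTagName. Regex/string matching is unreliable for this.

Python, for example, has DOM parsers.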
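
As a rough illustration of the DOM approach, here is a minimal sketch using Python's standard xml.dom.minidom. It assumes the input is well-formed XML; the sample markup is the string from the question, and the helper name first_img_src is made up for this example.

```python
from xml.dom.minidom import parseString

# Sample markup copied from the question; a real feed would come from curl's output.
SAMPLE = ('<img src="http://www.nichtlustig.de/comics/full/100802.jpg" '
          'alt="" width="400" height="400" />')


def first_img_src(xml_text):
    """Return the src attribute of the first <img> element, or None if there is none."""
    doc = parseString(xml_text)  # raises an error if the input is not well-formed XML
    images = doc.getElementsByTagName("img")
    return images[0].getAttribute("src") if images else None


print(first_img_src(SAMPLE))
```

minidom rejects malformed markup, so for sloppy real-world HTML a more tolerant parser is probably the safer choice.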

tzippy · Answer 4 · 2010-08-02T20:19:17+0000

#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"

Jesse dhillon · Answer 5 · 2010-08-02T20:47:44+0000

Python:

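import sys

# BeautifulSoup 4 is published as the bs4 package
from bs4 import BeautifulSoup

# Read the HTML from stdin and print the src of the first <img> tag
soup = BeautifulSoup(sys.stdin.read(), "html.parser")
print(soup.find_all("img")[0]["src"])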

Usage:

$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`

This works like a charm and does not force you to hunt for a magical regular expression that will parse arbitrary HTML (hint: there is no such expression, especially with a greedy matcher such as sed).
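
If installing a third-party package is not an option, roughly the same idea can be sketched with Python's standard library alone. The class and function names below (FirstImgParser, first_img_url) are made up for this example, and it assumes the src may be relative, in which case urljoin resolves it against the page URL instead of concatenating strings by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstImgParser(HTMLParser):
    """Remembers the src attribute of the first <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_url(markup, base_url):
    """Return the absolute URL of the first image in markup, or None."""
    parser = FirstImgParser()
    parser.feed(markup)
    parser.close()
    if parser.src is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(base_url, parser.src)
```

The returned URL could then be handed to curl -O, as in the shell script from the question.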
