Use curl to parse the XML, get the image url and load it

I want to write a shell script to get an image from an rss feed. Right now I have:

curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g'

This captures the first occurrence of the image URL in the feed. Now I want to store that URL in a variable and pass it to cURL again to download the image. Any help is appreciated! (Tips on a cleaner way to strip everything but the URL from the string are also welcome. This is the string:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />

Perhaps there is some better regex for removing everything except the url than my solution.) Thanks in advance!

5 answers

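Regular expressions are a poor tool for HTML/XML; a real parser is more reliable.

With Perl, you can use a Perl XML or HTML parser, like this: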

HTML

curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser; 
    $parser = HTML::TokeParser->new(\*STDIN); 
    $img = $parser->get_tag('img') ; 
    print "$img->[1]->{src}\n"; 
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
    $twig=XML::Twig->new(twig_handlers =>{img => sub { 
       print $_[1]->att("src")."\n"; exit 0;}}); 
    open(my $fh, "-");
    $twig->parse($fh);
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

You can use wget instead of curl, for example:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
  gsub(/.*<img src=\"/,"")
  gsub(/\".[^>]*>/,"")
  print
}'  |  xargs -I{} wget "{}"

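Parse the document into a DOM and fetch the img elements with getElementsByTagName, then read the src attribute of the one you need.

Python, for example, has DOM libraries for this.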
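As a rough illustration of the DOM approach, here is a minimal sketch using Python's standard `xml.dom.minidom`. The function name `first_img_src` and the `sample` fragment are made up for this example, and it assumes the `img` tags appear as real XML elements. Many RSS feeds instead embed HTML as escaped text or CDATA inside `description`, in which case that text would have to be parsed separately.

```python
from xml.dom.minidom import parseString


def first_img_src(xml_text):
    """Return the src attribute of the first <img> element, or None if there is none."""
    doc = parseString(xml_text)
    images = doc.getElementsByTagName("img")
    if not images:
        return None
    return images[0].getAttribute("src")


# Hypothetical feed fragment built around the <img> tag from the question.
sample = (
    "<item><description>"
    '<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />'
    "</description></item>"
)

print(first_img_src(sample))
```

The printed URL could then be handed to curl or wget in the same way as in the shell answers.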

#!/bin/sh
URL=$(curl -s http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"



Python:

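# Requires the beautifulsoup4 package (bs4)
import sys
from bs4 import BeautifulSoup

# Read the HTML from stdin and print the src of the first <img> tag
soup = BeautifulSoup(sys.stdin.read(), "html.parser")
print(soup.find_all("img")[0]["src"])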

Usage:

$ curl -O "http://www.google.com/$(curl -s http://www.google.com | python get_img_src.py)"

This works like a charm and does not force you to hunt for a magical regular expression that will parse arbitrary HTML (hint: there is no such expression, especially if you have a greedy matcher such as sed).
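If installing a third-party package is not an option, the same idea can be sketched with only the standard library. The names `FirstImgParser` and `first_img_url` are invented for this example; it assumes the page is HTML that `html.parser` can tokenize, and it uses `urljoin` to resolve a relative `src` against the page URL instead of concatenating strings by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstImgParser(HTMLParser):
    """Remember the src attribute of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_url(html_text, base_url):
    """Return the absolute URL of the first image, or None if no <img> is found."""
    parser = FirstImgParser()
    parser.feed(html_text)
    if parser.src is None:
        return None
    return urljoin(base_url, parser.src)
```

Calling `first_img_url(page_text, page_url)` on text fetched with curl yields a URL that can be passed straight to `curl -O`.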


Source: https://habr.com/ru/post/1757725/

