How can I extract the HREF value from an HTML link?
My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and the output is as follows:
Output 1: yahoo.com.jp
Output 2: ><HR>
I am trying to get my Perl script to automatically extract the string inside <A HREF="">.
As I am very new to regex, I want to ask whether my regex is bad. If so, can someone suggest how to improve it?
Secondly, I don't know why my second output is "><HR>". I thought output 2 would be skipped, since that line does not contain HREF=". Obviously, I am very wrong.
Thanks for the help.
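For reference, here is a minimal sketch of the behavior the question expects, where a line without an HREF is skipped. It assumes the file is processed line by line. The @lines and @hrefs names and the inline sample data are illustrative, not taken from the original script. The two points it shows: test whether the match succeeded before using a capture (Perl leaves the capture variables holding their old values when a match fails), and use [^"]* instead of a greedy .* so the capture stops at the first closing quote.

```perl
use strict;
use warnings;

# Sample input: the two lines from the question (illustrative inline data).
my @lines = (
    '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>',
    '</PRE><HR>',
);

my @hrefs;
for my $line (@lines) {
    # [^"]* stops at the first closing quote, unlike a greedy .*;
    # /i also accepts "href" and "Href".
    if ( $line =~ /HREF="([^"]*)"/i ) {
        push @hrefs, $1;    # $1 is only used after a successful match
    }
    # Lines without an HREF fall through and are skipped.
}

print "$_\n" for @hrefs;
```

This is only a sketch for simple, well-behaved input like the directory listing above. A regex like this can break on single-quoted or unquoted attribute values.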
4 answers
Use an HTML parser, for example HTML::TreeBuilder::XPath.
It lets you query the HTML with XPath.
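use strict;
use warnings;
use HTML::TreeBuilder::XPath;
# Parse the HTML file into a tree.
my $tree = HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
# Collect the href of every <a> that is a direct child of <div class="noprint">.
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href' );
print "The links are: ", join( ',', @hrefs ), "\n";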
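The file path and XPath expression above come from a different page. Adapted to the two lines in the question, a sketch might look like the following. It assumes the HTML is passed in as a string through new_from_content rather than read from a file, and it uses the broader expression //a/@href, which is my assumption about what the asker wants (every link on the page).

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# The two lines from the question, used here as inline sample data.
my $html = <<'HTML';
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Tag and attribute names are normalized to lowercase by the parser,
# so //a/@href is expected to match <A HREF="...">.
my @hrefs = $tree->findvalues('//a/@href');

print "$_\n" for @hrefs;

$tree->delete;    # release the parse tree
```

The second line contains no link, so it contributes nothing to @hrefs and there is no stray output to filter out.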