How can I extract the HREF value from an HTML link?

Question

My text file contains 2 lines:

<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>

In my Perl script, I have:

my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";

and my output is as follows:

Output 1: yahoo.com.jp

Output 2: ><HR>

I am trying to get my Perl script to automatically extract the string inside <A HREF="">.

As I am very new to regex, I want to ask whether my regex is bad. If so, can someone suggest how to improve it?

Secondly, I don't know why my second output is "><HR>". I expected the second line to be skipped, since it does not contain HREF=". Obviously, I am very wrong.

Thanks for the help.

+3

html regex perl

freshWoWer May 29 '09 at 16:01

4 answers

HTML is hard to parse reliably with regular expressions. Use a real parser instead, such as HTML::Parser.
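A minimal sketch of the parser approach, assuming HTML::Parser is installed from CPAN and using the anchor line from the question as input. The `@hrefs` array and the inline handler are illustrative names, not part of the original answer.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample input: the anchor line from the question.
my $html = '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>';

my @hrefs;

# The start handler runs for every opening tag. The argspec
# "tagname, attr" passes the lowercased tag name and a hash ref
# of the tag's attributes (keys lowercased by default).
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @hrefs, $attr->{href}
                if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of input so buffered text is flushed

print "$_\n" for @hrefs;
```

Because the parser normalizes tag and attribute names, the same handler picks up `HREF`, `href`, or `Href` without any extra flags.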

+8

Michael Carman May 29 '09 at 16:14

Another option: HTML::TreeBuilder::XPath

It lets you query the HTML with XPath.

use HTML::TreeBuilder::XPath;

my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";

0

Axeman May 29 '09 at 21:34

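HTML (like XML) is awkward to handle with regexes. As Gumbo suggests, [^"]* is a safer choice than .* because it cannot match past the closing quote. For example: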

/HREF="([^"]*)"[^>]*>/i
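A short sketch of applying this pattern line by line, assuming the two lines from the question as input (the `@lines` and `@found` names are illustrative). It uses `$1` only when the match succeeds, because after a failed match Perl's capture variables still hold the values from the last successful match, which is one way stale output can appear for a line with no HREF.

```perl
use strict;
use warnings;

# The two lines from the question's text file.
my @lines = (
    '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>',
    '</PRE><HR>',
);

my @found;
for my $line (@lines) {
    # Guard the use of $1 with the match result: a failed match
    # leaves the capture variables unchanged.
    if ( $line =~ /HREF="([^"]*)"[^>]*>/i ) {
        push @found, $1;
        print "$1\n";
    }
}
```

With this guard, the second line produces no output at all.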

-1

Stephan May 29 '09 at 16:33

Chris Simmons · Accepted Answer · 2009-05-29T16:38:18+0000

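Avoid .* here: it is "greedy" and matches as much as it possibly can. A slightly better choice is .*?, the non-greedy form, which matches as little as it can. Better still is [^"]*, which matches any run of characters other than a double quote, so the capture cannot run past the closing quote of the attribute.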
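The difference between these three patterns shows up on a line that contains more than one quoted attribute. The sketch below uses a made-up input line with two links (not from the question) and captures the HREF value with each pattern in turn.

```perl
use strict;
use warnings;

# Hypothetical line with two anchors and an extra attribute.
my $line = '<A HREF="yahoo.com.jp/" TITLE="Yahoo">yahoo</A> <A HREF="example.org/">example</A>';

# Greedy: .* runs to the end of the line, then backtracks only as
# far as the LAST double quote.
my ($greedy) = $line =~ /HREF="(.*)"/;

# Non-greedy: .*? stops at the first double quote that lets the
# rest of the pattern match.
my ($lazy) = $line =~ /HREF="(.*?)"/;

# Negated class: [^"]* can never cross a double quote at all.
my ($negated) = $line =~ /HREF="([^"]*)"/;

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "negated: $negated\n";
```

The greedy capture swallows everything from the first HREF value through the start of the last quote on the line, while the other two stop at `yahoo.com.jp/`.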

, - , -, HTML, - . , Perl 5.10 ( ), .
