How can I extract the HREF value from an HTML link?

My text file contains 2 lines:

<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>

In my Perl script, I have:

my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";

and my output is as follows:

Output 1: yahoo.com.jp

Output 2: ><HR>

I am trying to get my Perl script to automatically extract the string inside <A HREF="">.

As I am very new to regex, I would like to know whether my regex is bad. If so, can someone suggest how to improve it?

Secondly, I don't know why my second output is "><HR>". I expected the second line to be skipped, since it does not contain HREF=". Obviously, I am wrong.

Thanks for the help.

+3
4 answers

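First, .* is greedy: it matches as much as it can, so it can run well past the quote you meant it to stop at. You can make it non-greedy with .*?, which matches as little as possible. Better still, use [^"]*, which matches any run of characters other than a double quote, so it cannot run past the closing quote.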
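The difference between a greedy .*, a non-greedy .*? and a negated character class [^"]* is easiest to see on a tag with more than one quoted attribute. A minimal sketch; the input string and variable names are made up for illustration and are not from the question:

```perl
use strict;
use warnings;

# Illustrative input: a second quoted attribute follows HREF.
my $tag = '<A HREF="a.html" TITLE="x">a</A>';

# Greedy: .* grabs everything, then backs off to the LAST double quote.
my ($greedy) = $tag =~ /HREF="(.*)"/;

# Non-greedy: .*? stops at the first double quote that lets the match succeed.
my ($lazy) = $tag =~ /HREF="(.*?)"/;

# Negated class: [^"]* can never cross a double quote at all.
my ($negated) = $tag =~ /HREF="([^"]*)"/;

print "greedy:  $greedy\n";    # a.html" TITLE="x
print "lazy:    $lazy\n";      # a.html
print "negated: $negated\n";   # a.html
```

On the sample line from the question, all three forms happen to give the same result, because the HREF attribute is the last quoted attribute in the tag; the greedy form only goes wrong once more quotes follow.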

, - , -, HTML, - . , Perl 5.10 ( ), .

+8

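Parsing HTML with regular expressions is fragile. If you need anything more than a quick one-off, use a real parser such as HTML::Parser.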
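For comparison, here is a rough sketch of the HTML::Parser route, fed with the two lines from the question. It assumes HTML::Parser is installed from CPAN; the @hrefs array and the inline handler are illustrative choices, not taken from the answer:

```perl
use strict;
use warnings;
use HTML::Parser;

# The two lines from the question's text file.
my $html = <<'HTML';
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
HTML

my @hrefs;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; by default tag and attribute
    # names arrive lowercased, so <A HREF=...> is seen as a/href.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            push @hrefs, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of document

print "$_\n" for @hrefs;    # yahoo.com.jp/
```

The second input line contains no A tag, so the handler never records anything for it.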

+8

Another option: HTML::TreeBuilder::XPath

It lets you use XPath to query the HTML.

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Build a tree from a saved HTML file, then pull every matching href attribute.
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
0

HTML (and XML) is hard to parse reliably with regular expressions. As Gumbo said, [^"]* is the safer way to stop at the closing quote. Something like this:

/HREF="([^"]*)"[^>]*>/i
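Whichever pattern is used, the capture variables should only be read after a successful match, because a failed match leaves $1 and $2 holding whatever an earlier successful match put there. A small sketch applying the pattern above to the two lines from the question; the @lines and @found arrays are illustrative stand-ins for reading the text file:

```perl
use strict;
use warnings;

# Stand-in for the two lines of the question's text file.
my @lines = (
    '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>',
    '</PRE><HR>',
);

my @found;
for my $line (@lines) {
    # Use $1 only inside the branch where the match succeeded;
    # lines without an HREF attribute are skipped.
    if ( $line =~ /HREF="([^"]*)"[^>]*>/i ) {
        push @found, $1;
    }
}

print "$_\n" for @found;    # yahoo.com.jp/
```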


-1

Source: https://habr.com/ru/post/1709397/

