Regex URL not working

Using Perl, I am trying to parse a bunch of XML files, find any form of URL in the XML, and print it. My regex doesn't seem to work: it never returns a match. What am I missing?

sub findURL{
local($inputLine, $outText);
$inputLine = $_[1];
 while (length($inputLine) > 0)
 {
 if ($inputLine =~ /^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$/ )

 {
 $outText .= $&;
 $inputLine = $';
 }
 else
 {
  $inputLine = "";
  $outText .= "";
 }
 }
 return $outText;
}
+3
5 answers

Use Regexp::Common:

use Regexp::Common qw /URI/;

while (<>) {
    /$RE{URI}{HTTP}/       and  print "Contains an HTTP URI.\n";
}
+12

Your code is wrong in several different ways, not least that it runs a regex over raw XML instead of using an XML parser. That said, this will pull out the URLs:

#!/usr/bin/perl

use strict;
use warnings;

use Regexp::Common qw/URI/;

sub find_urls {
    my $text = shift;
    # Capture only the whole match; {-keep} would add a capture group per URI part.
    return $text =~ /($RE{URI})/g;
}

my $xml = do { local $/; <DATA> };

for my $url (find_urls($xml)) {
    print "$url\n";
}

__DATA__
<root>
    this is some text
    and a URL: http://foo.com/foo.html
    this isn't a URL http:notgrabbed.com
    <img src="http://example.com/img.jpg" />
    <!-- oops, shouldn't grab this one: ftp://bar.com/donotgrab -->
</root>
+8

Use URI::Find and URI::Find::Schemeless, both available on CPAN.

#! /usr/bin/perl

use warnings;
use strict;

use URI::Find;
use URI::Find::Schemeless;

my $xml = join "" => <DATA>;
URI::Find            ->new(sub { print "$_[1]\n" })->find(\$xml);
URI::Find::Schemeless->new(sub { print "$_[1]\n" })->find(\$xml);

__DATA__
<foo>
  <bar>http://stackoverflow.com/</bar>
  <baz>www.perl.com</baz>
</foo>

Output:

http://stackoverflow.com/
www.perl.com
+2

The nested brackets in your character classes do not do what you expect. Take the first one on its own:

use strict;
use warnings;
use re 'debug';

my $re = qr/[[a-zA-Z0-9]\-\.]/;

This prints (thanks to use re 'debug'):

Compiling REx "[[a-zA-Z0-9]\-\.]"
Final program:
   1: ANYOF[0-9A-[a-z][] (12)
  12: EXACT <-.]> (14)
  14: END (0)
anchored "-.]" at 1 (checking anchored) stclass ANYOF[0-9A-[a-z][] minlen 4 

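In other words, the pattern is one character class followed by the literal string '-.]'. The class ends at the first ']', and everything after it is matched literally.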

Character classes do not nest. What you want is one flat class:

[a-zA-Z0-9.-]

Or, if you also want to match Unicode letters and digits:

[\p{IsAlnum}.-]
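  • Strictly speaking, the first ']' closes the class unless it is the first character in it, where it is taken literally. So '[[' does not open a nested class; it opens a class whose first member is a literal '['.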
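
To see the difference in practice, here is a minimal sketch that anchors both classes and tries them on a few strings. The sub name which_match and the sample strings are made up for illustration; only the two classes come from the thread.

```perl
use strict;
use warnings;

# The class from the question: '[[a-zA-Z0-9]' is the whole class,
# and the '\-\.]' after it is three literal characters.
my $nested = qr/^[[a-zA-Z0-9]\-\.]$/;

# One flat class; the trailing '-' is literal.
my $flat = qr/^[a-zA-Z0-9.-]$/;

# Hypothetical helper: names the patterns that match the string.
sub which_match {
    my ($s) = @_;
    return join ',', grep { defined } (
        ($s =~ $nested ? 'nested' : undef),
        ($s =~ $flat   ? 'flat'   : undef),
    );
}

for my $s ('a', '-', 'a-.]', '[-.]') {
    printf "%-6s => %s\n", "'$s'", which_match($s) || 'none';
}
```

The single characters 'a' and '-' match only the flat class, while 'a-.]' and '[-.]' match only the nested one, which is why the original pattern never matches a host name.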
0

Two more things, apart from the regex:

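  • Declare your variables with my rather than local.
  • $inputLine = $_[1] reads the second argument passed to the sub. If you call it with a single argument, the text is in $_[0] and nothing URL-like ever reaches $inputLine. Is that what you intended?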
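
A small sketch of how @_ indexing behaves, assuming the sub is called with one argument. The sub names first_arg and second_arg are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a lexical copy of the first argument.
sub first_arg {
    my ($line) = @_;      # same value as $_[0]
    return $line;
}

# Mirrors the question's indexing: reads the second argument.
sub second_arg {
    my $line = $_[1];     # undef when only one argument is passed
    return $line;
}

my $text = 'see http://foo.com/foo.html';
print defined(first_arg($text))  ? "first_arg got the text\n"  : "first_arg got undef\n";
print defined(second_arg($text)) ? "second_arg got the text\n" : "second_arg got undef\n";
```

With a single argument, second_arg returns undef, so a loop guarded by length($inputLine) > 0 would never run.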

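Edit:

As for the regex: [[a-zA-Z0-9]\-\.] should be [-a-zA-Z0-9.] (the hyphen goes first, so it is literal and needs no escaping). The whole expression becomes:

/^(((http|https|ftp):\/\/)?([-a-zA-Z0-9.])+(\.)([a-zA-Z0-9]){2,4}([-a-zA-Z0-9+=%&_.~?\/]*))*$/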
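
Note that the pattern is still anchored with ^ and $, so it matches only when the entire string is URL-like, not when a URL sits inside a longer line. The sketch below contrasts it with an unanchored variant that requires a scheme. That variant, the name $scan and the sub find_urls are assumptions for illustration, not part of the answer.

```perl
use strict;
use warnings;

# The corrected pattern, anchored with ^ and $.
my $anchored = qr/^(((http|https|ftp):\/\/)?([-a-zA-Z0-9.])+(\.)([a-zA-Z0-9]){2,4}([-a-zA-Z0-9+=%&_.~?\/]*))*$/;

# Assumed variant for scanning running text: no anchors, scheme required.
my $scan = qr/(?:https?|ftp):\/\/[-a-zA-Z0-9.]+\.[a-zA-Z0-9]{2,4}[-a-zA-Z0-9+=%&_.~?\/]*/;

# Hypothetical helper: returns every match of $scan in the text.
sub find_urls {
    my ($text) = @_;
    return $text =~ /($scan)/g;   # list context yields all matches
}

my $line = 'and a URL: http://foo.com/foo.html';
print "anchored matches the bare URL\n"  if 'http://foo.com/foo.html' =~ $anchored;
print "anchored matches the whole line\n" if $line =~ $anchored;
print "found: $_\n" for find_urls($line);
```

Only the first and third messages are printed: the anchored pattern accepts the bare URL but rejects the line containing it, while the scanning variant extracts the URL from the line.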

RFC3986 Appendix B provides the best regular expression of course.
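
For reference, here is a sketch of the Appendix B expression in Perl. Keep in mind that it is meant for splitting a well-formed URI reference into components and matches almost any string, so it is a poor fit for finding URLs inside text. The sub name split_uri and the hash keys are chosen for this illustration; the sample URI is the one used in the RFC.

```perl
use strict;
use warnings;

# Regular expression from RFC 3986, Appendix B.
my $rfc3986 = qr{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

# Hypothetical helper: returns the five components as a key/value list.
sub split_uri {
    my ($uri) = @_;
    return unless $uri =~ $rfc3986;
    return (
        scheme    => $2,
        authority => $4,
        path      => $5,
        query     => $7,   # undef when the URI has no '?'
        fragment  => $9,   # undef when the URI has no '#'
    );
}

my %part = split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related');
for my $name (qw(scheme authority path query fragment)) {
    printf "%-9s = %s\n", $name, $part{$name} // '(undefined)';
}
```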

0

Source: https://habr.com/ru/post/1762459/

