How to get started with web page scraping using Perl?

I am interested in learning Perl. I use books to learn it and the CPAN website for reference.

I would like to build a web text search application in Perl to apply everything I have learned.

Please suggest some good options to get me started.

(This is not homework. I want to do something in Perl that will help me use its basic features.)

+3
5 answers

If the web pages you want to scrape require JavaScript to function correctly, you will need more than WWW::Mechanize can provide. You may even have to resort to controlling a specific browser through Perl (for example, using Win32::IE::Mechanize or WWW::Mechanize::Firefox).
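For pages that do not depend on JavaScript, plain WWW::Mechanize is usually enough. Here is a minimal sketch of fetching a page and listing its links. The sample page, its title, and the link targets are made up for illustration, and the page is written to a temporary file so the sketch runs without network access.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical stand-in page, written to disk so no network is needed.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><head><title>Recent uploads</title></head><body>
<a href="/dist/Foo">Foo</a>
<a href="/dist/Bar">Bar</a>
</body></html>
HTML
close $fh;

# autocheck => 1 makes failed requests die instead of failing silently.
my $mech = WWW::Mechanize->new(autocheck => 1);

# For a real site, pass an http(s) URL string here instead of the file URI.
$mech->get(URI::file->new_abs($path));

# title() returns the contents of the <title> element of the fetched page.
my $title = $mech->title;

# links() returns WWW::Mechanize::Link objects; url() is the href as written.
my @hrefs = map { $_->url } $mech->links;

print "$title\n";
print "$_\n" for @hrefs;
```

If the links you expect are missing from the output on a real site, that is a good hint the page builds its content with JavaScript and you need one of the browser-driving modules instead.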

I have not tried it, but there is also WWW::Scripter with WWW::Scripter::Plugin::JavaScript.

+10

In Perl, the usual module for web scraping is WWW::Mechanize. To parse the HTML you retrieve, a module such as HTML::TreeBuilder helps.
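As a rough illustration of the HTML::TreeBuilder side, here is a minimal sketch that parses a string of HTML and pulls out every link. The sample markup and the href values are hypothetical; in practice the string would be the content of a page you fetched.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page; normally this is the fetched HTML.
my $html = <<'HTML';
<html><body>
  <ul id="modules">
    <li><a href="/dist/WWW-Mechanize">WWW::Mechanize</a></li>
    <li><a href="/dist/HTML-Tree">HTML::TreeBuilder</a></li>
  </ul>
</body></html>
HTML

# Build a parse tree from the string.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every matching element in document order.
my @links = map { [ $_->as_text, $_->attr('href') ] }
            $tree->look_down(_tag => 'a');

print "$_->[0] => $_->[1]\n" for @links;

# Free the tree explicitly once you are done with it.
$tree->delete;
```

look_down also accepts attribute criteria (for example `id => 'modules'`), so you can narrow the search to one part of the page before collecting links.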

If you need more than that, look at WWW::Scripter. It is a subclass of WWW::Mechanize with support for JavaScript and AJAX, built on HTML::DOM.

+8

WWW::Mechanize [...]

Scrappy is another option:


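    use strict;
    use warnings;
    use Scrappy;

    my $spidy = Scrappy->new;

    # Crawl the page and run the callback for each element matching the CSS selector
    $spidy->crawl('http://search.cpan.org/recent', {
        '#cpansearch li a' => sub {
            print shift->text, "\n";
        }
    });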

Scrappy uses Web::Scraper [...]

If the data you want is in HTML tables, HTML::TableExtract is worth a look:


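    use strict;
    use warnings;
    use HTML::TableExtract;

    my $html_string = do { local $/; <STDIN> };   # assumption: HTML arrives on STDIN

    # Match only the table whose header row contains these column names
    my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
    $te->parse($html_string) or die "Didn't find table";
    foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
    }

+8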

Perl Web-Scraper. [...]

+2

You can also take a look at my new Perl wrapper on top of Java HtmlUnit. It is very easy to use. Check out the quick tutorial here:

http://code.google.com/p/spidey/wiki/QuickTutorial

By tomorrow I will publish detailed installation instructions and the first release. Unlike Mechanize, you get some JavaScript support, and it is much faster and requires less memory than screen scraping.

+1

Source: https://habr.com/ru/post/1789196/

