• Parsing returns only the text, not the HTML tags

    The snippet of HTML that I want to parse is as follows:

    <ul class="authors">
        <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
            <a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,
        </li>
        <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
            <a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">José L. Rodrigo</a>
        </li>
    </ul>

    I want to extract the whole <a> elements, but when I parse the page with WWW::Mechanize::TreeBuilder, all I get is the authors' names. So:

    Content I expect:

    <a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,
    
    <a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">José L. Rodrigo</a>
    

    Content I get:

    Charles L. Fefferman,
    José L. Rodrigo
    

    Here is the code responsible for parsing this:

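    use feature 'say';
    use WWW::Mechanize;
    use WWW::Mechanize::TreeBuilder;

    my $mech = WWW::Mechanize->new();
    WWW::Mechanize::TreeBuilder->meta->apply($mech);
    $mech->get($addressdio);    # $addressdio (the page URL) is defined elsewhere
    
    my @authors = $mech->look_down('class', 'author');
    
    print "Authors: <br />";
    foreach ( @authors ) {
        say $_->as_text(), "<br />";    # as_text() returns only the text content
    }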
    

    I suspect this is caused by as_text(), and also that when the HTML is sent through CGI, it is not treated as plain text.

    1 answer

    I solved it, but in a completely different way, using HTML::TagParser:

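    use feature 'say';
    use HTML::TagParser;

    my $html = HTML::TagParser->new("overwrite.xml");
    my @li = $html->getElementsByAttribute('class','author');
    
    foreach my $li (@li) {
        # Named $anchor rather than $a, because $a is a special variable used by sort
        my $anchor = $li->firstChild();
        my $link = $anchor->getAttribute('href');
        say $li->innerText;    # text of the <li>, i.e. the author name
    
        say $link;             # value of the href attribute
    }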
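
For comparison, the approach from the question can probably be kept as well: look_down returns HTML::Element objects, and their as_HTML() method serializes the element with its tags, whereas as_text() returns only the text. Below is a minimal sketch under a few assumptions. It parses an inline copy of the snippet with HTML::TreeBuilder instead of fetching a page, the names @links and $anchor are made up for illustration, and the "é" is written as an entity so that the script stays ASCII-only. If the markup should be shown literally in a CGI page, it also has to be escaped, which encode_entities from HTML::Entities does.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
use HTML::Entities qw(encode_entities);

# Sample markup adapted from the question (assumption: in the real script
# the tree comes from the page fetched by WWW::Mechanize).
my $html = <<'HTML';
<ul class="authors">
    <li class="author" itemprop="author">
        <a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,
    </li>
    <li class="author" itemprop="author">
        <a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">Jos&eacute; L. Rodrigo</a>
    </li>
</ul>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the markup of every <a> inside an element whose class is "author".
my @links;
for my $li ($tree->look_down(class => 'author')) {
    for my $anchor ($li->look_down(_tag => 'a')) {
        # as_HTML() serializes the element itself, tags included.
        push @links, $anchor->as_HTML();
    }
}

for my $link (@links) {
    say $link;                    # raw markup, rendered as a link by a browser
    say encode_entities($link);   # escaped markup, shown as text by a browser
}

$tree->delete;    # free the parse tree
```

The first say prints the element as HTML; the second prints the same string with <, > and & escaped, which is what a browser needs in order to display the tags instead of rendering them.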
    

    Source: https://habr.com/ru/post/1598320/

