Perl XPath statement with conditional - is this possible?

This question has been rephrased. I use the CPAN Perl modules WWW::Mechanize to navigate a website and HTML::TreeBuilder::XPath to capture content, and I use Xacobeo to test my XPath code against HTML/XML. The goal is to call this Perl script from a PHP-based website and load the scraped content into a database. Therefore, content that is "missing" still has to be accounted for.

Below is a tested, shortened code sample depicting my task. Note:

  • The page is dynamically populated and shows various ITEMS for different stores; each store has a different number of Products. Each product listing may or may not have a table underneath it.
  • The captured data must end up in arrays, and any detail list (if one exists) must stay associated with its Product listing.

The XML sample below changes from store to store (as described above), but for brevity I show only one "type" of output. I understand that all the data can be captured into one array and then parsed with regular expressions before loading it into the database. I am looking for better XPath knowledge to simplify this (and future) solutions.

    <!DOCTYPE XHTML>
    <table id="8jd9c_ITEMS">
      <tr><th style="color:red">The Products we have in stock!</th></tr>
      <tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
      <tr><td>
        <!--Table may or may not exist -->
        <table>
          <tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
          <tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
          <tr></tr>
        </table>
      </td></tr>
      <tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
      <tr><td>
        <!--Table may or may not exist -->
        <table>
          <tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
          <tr><td style="color:blue;text-indent:10px">Celery</td></tr>
          <tr></tr>
        </table>
      </td></tr>
      <tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
      <!--In this case, the table does not exist -->
    </table>

The XPath statement:

 '//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()' 

will find:

    We have nuts!
    We have veggies!
    We have booze!

And the XPath statement:

 '//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()' 

will find:

    Almonds
    Cashews
    Carrots
    Celery

The two XPath statements can be combined:

 '//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()' 

To find:

    We have nuts!
    Almonds
    Cashews
    We have veggies!
    Carrots
    Celery
    We have booze!
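For readers who want to try the combined query outside Xacobeo, here is a minimal sketch. It assumes XML::LibXML rather than the HTML::TreeBuilder::XPath module named in the question, works on a trimmed copy of the sample with the doctype left out, and uses the td/table path from the second statement. The union typically comes back as one flat list in document order, which is why the link between a product and its list is lost.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample markup (doctype omitted so it parses as plain XML).
my $xml = <<'XML';
<table id="8jd9c_ITEMS">
  <tr><th>The Products we have in stock!</th></tr>
  <tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
  <tr><td><table>
    <tr><td>Almonds</td></tr>
    <tr><td>Cashews</td></tr>
  </table></td></tr>
  <tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Shared prefix of both statements; single quotes keep @id from interpolating.
my $base  = '//table[contains(@id, "ITEMS")]/tr[position() > 1]';
my $query = "$base/td/span/text() | $base/td/table/tr/td/text()";

# One flat list of strings: product names and list entries mixed together.
my @flat = map { $_->textContent } $doc->findnodes($query);
print "$_\n" for @flat;
```

Nothing in the flat list says which entries belong to which product, so the grouping has to be rebuilt afterwards or captured with per-row queries instead.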

Again, the array above can be parsed (in the real code) with regex to re-establish the link between each product and its list. But can the arrays be built using XPath in a way that preserves this association?

For example (pseudocode; this does not work):

 '//table[contains(@id, "ITEMS")]/tr[position()>1]/td/span/text() | if exists('//table[contains(@id, "ITEMS")]/tr[position() >1]/table)) then ("NoTable") else ("TableRef") | Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() >1]/table/tr/td/text()')' 

Perl cannot build multidimensional arrays in the traditional sense (see perldoc perlref). But hopefully a solution similar to the above could create something like:

    @ITEMS[0] => We have nuts!
    @ITEMS[1] => nutsREF          <-- say, the last word of the span value + REF
    @ITEMS[2] => We have veggies!
    @ITEMS[3] => veggiesREF       <-- say, the last word of the span value + REF
    @ITEMS[4] => We have booze!
    @ITEMS[5] => NoTable          <-- value accounts for the missing info

    @nutsREF[0] => Almonds
    @nutsREF[1] => Cashews

    @veggiesREF[0] => Carrots
    @veggiesREF[1] => Celery

The products are known in the real code, so my @veggiesREF and my @nutsREF can be declared ahead of the XPath query.

I understand that XPath if/then/else functionality is part of XPath 2.0. I am on an Ubuntu system and working locally, but I am still not sure whether my apache2 server is using it or version 1.0. How can I check this?

Finally, if you can show how to call a Perl script from a PHP submit form, and how to pass a Perl array back to the calling PHP function, that will go a long way toward earning the bounty. :)

Thanks!

CONCLUSION:

The comments directly under this post were directed at the initial post, which was too vague. The subsequent re-post (and bounty) was answered by ikegami with a very creative approach that solved the pseudo-problem, but it was hard for me to understand and reuse in my real application, which required reuse across different HTML pages. Around the 18th comment in our dialogue, I finally grasped the meaning and use of ($cat), the Perl syntax he used and which I could not find documented. For new readers, understanding this syntax makes it possible to understand (and adapt) his clever solution. His post certainly meets the basic requirements sought in the OP, but it does not use HTML::TreeBuilder::XPath to do so.

jpalecek uses HTML::TreeBuilder::XPath, but does not put the captured data into arrays to pass back to the PHP function and load into the database.

I learned from both respondents and hope this post will help others who, like me, are not familiar with Perl. Any final contributions would be highly appreciated.

+4
2 answers

If I had to guess, your question is: "How do I get the following from the provided input?"

    my $categorized_items = {
        'We have nuts!'    => [ 'Almonds', 'Cashews' ],
        'We have veggies!' => [ 'Carrots', 'Celery' ],
        'We have booze!'   => [ ],
    };

If so, here is how I would do it:

    use Data::Dumper qw( Dumper );
    use XML::LibXML  qw( );

    my $root = XML::LibXML->load_xml(IO => \*DATA)->documentElement;

    my %cat_items;
    for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
        # Each row holding a span is a category heading.
        my ($cat) = map $_->textContent(), $cat_tr->findnodes('td/span');
        # Its items, if any, sit in a table in the very next row.
        my @items = map $_->textContent(), $cat_tr->findnodes('following-sibling::tr[position()=1]/td/table/tr/td');
        $cat_items{$cat} = \@items;
    }

    print(Dumper(\%cat_items));

    __DATA__
    ...xml...
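The question also asks how to hand the result back to PHP. One common approach, offered here as an assumption rather than something this answer does, is to have the Perl script print JSON on standard output; the PHP side could then run the script (for example with shell_exec) and decode the output with json_decode. In the sketch below the hash is hard-coded to mirror the structure above, and JSON::PP is used because it ships with Perl 5.14 and later.

```perl
use strict;
use warnings;
use JSON::PP;

# Hard-coded stand-in for the %cat_items hash built by the scraper.
my %cat_items = (
    'We have nuts!'    => [ 'Almonds', 'Cashews' ],
    'We have veggies!' => [ 'Carrots', 'Celery' ],
    'We have booze!'   => [],
);

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode(\%cat_items);
print "$json\n";
```

An empty array reference is encoded as an empty JSON array, so a product without a table still appears in the output instead of silently disappearing.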

PS - You have invalid HTML.

  • A TABLE element cannot be placed directly inside a TR element; the TD element is missing.
  • A TR element cannot be empty. It must have at least one TH or TD element.
+5
  • How to make sure something exists before running a query. For instance, if //p[@class='red'] exists, return //table:

     /.[//p[@class='red']]//table 
  • x[3 and 4 and 5]: 3 and 4 and 5 is a Boolean expression that evaluates to true, so it selects every x. For the 3rd, 4th and 5th, you want

     x[position() >= 3 and position() <= 5] 
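The two points above can be checked with a short script. This is a sketch under assumptions: it uses XML::LibXML and a made-up document with one p element and six x elements, and the existence guard is written as a predicate on the selected step, which is a different spelling from the expression shown in the first bullet.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document: one <p class="red"> and six <x> elements.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<root>
  <p class="red">marker</p>
  <x>1</x><x>2</x><x>3</x><x>4</x><x>5</x><x>6</x>
</root>
XML

# Existence guard: x elements are selected only if some p[@class="red"] exists.
my @guarded = $doc->findnodes('//x[//p[@class="red"]]');

# Boolean predicate: 3, 4 and 5 are non-zero numbers, so it is true for every x.
my @all = $doc->findnodes('//x[3 and 4 and 5]');

# Positional range: only the 3rd through 5th x.
my @range = map { $_->textContent }
    $doc->findnodes('//x[position() >= 3 and position() <= 5]');

printf "guarded=%d all=%d range=%s\n",
    scalar @guarded, scalar @all, join(',', @range);
```

Changing the class in the guard to a value that does not occur in the document makes the first query return nothing, while the other two are unaffected.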

Answer to the edited question:

Why don't you use XML::XPathEngine with multiple queries?

    use HTML::TreeBuilder::XPath;
    use XML::XPathEngine;

    my $xp   = XML::XPathEngine->new;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse(something);

Then you can query:

    # td[span] selects cells that contain a span (td[@span] would test an attribute).
    my $shops = $xp->findnodes('//table[contains(@id, "ITEMS")]/tr[position() > 1]/td[span]', $tree);
    for ($shops->get_nodelist) {
        # Both queries below are relative to the current cell, passed as context.
        print "Name of shop is " . $xp->findvalue('span/text()', $_) . "\n";
        print "The shop sells:\n"
            . join("\n", $xp->findvalues('parent::*/following-sibling::tr[1][not(td/span)]/td/table/tr/td', $_))
            . "\n";
    }

This does the same as @ikegami's answer (XML::XPathEngine is what HTML::TreeBuilder::XPath uses). BTW, if a store can have more than one row of products after it, this would need to be updated.
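The same idea can also collect the results into arrays instead of printing them. The sketch below is an assumption about how that might look, not code from either answer: it calls findnodes, findvalue and findvalues directly on the HTML::TreeBuilder::XPath tree and its elements, runs on a trimmed copy of the sample, and uses a hypothetical %items_for hash keyed by product heading.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the sample markup from the question.
my $html = <<'HTML';
<table id="8jd9c_ITEMS">
<tr><th>The Products we have in stock!</th></tr>
<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td><table>
<tr><td>Almonds</td></tr>
<tr><td>Cashews</td></tr>
</table></td></tr>
<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# %items_for is a hypothetical name: heading text => array ref of list entries.
my %items_for;
for my $row ($tree->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
    my $name  = $row->findvalue('td/span');
    # Look only at the next row, and only if it is not another heading row.
    my @items = $row->findvalues('following-sibling::tr[1][not(td/span)]/td/table/tr/td');
    $items_for{$name} = \@items;
}
$tree->delete;

for my $name (sort keys %items_for) {
    print "$name: ", join(', ', @{ $items_for{$name} }), "\n";
}
```

A heading with no table after it ends up with an empty array, which covers the "missing" case the question asks about.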

+2

Source: https://habr.com/ru/post/1393851/

