This question has been rephrased. I use the CPAN Perl modules WWW::Mechanize to navigate a website, HTML::TreeBuilder::XPath to capture content, and xacobeo to test my XPath code against HTML/XML. The goal is to call this Perl script from a PHP-based website and load the scraped content into a database. Therefore, content that is "missing" must still be accounted for.
Below is a tested, shortened code sample depicting my task. Note:
- This page is dynamically populated and contains various ITEMS displayed for different stores; each store will have a different number of Products, and these product lists may or may not have a table underneath.
- The captured data must be in arrays, and any detail list (if one exists) must be associated with its Product listing.
The example XML below changes from store to store (as described above), but for brevity I show only one "type" of output. I understand that all the data can be written into one array and a regular expression then used to pick the content apart in order to load it into the database. I am looking for better XPath knowledge to simplify this (and future) solution(s).
<!DOCTYPE XHTML>
<table id="8jd9c_ITEMS">
  <tr><th style="color:red">The Products we have in stock!</th></tr>
  <tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
  <tr><td>
    <table>
      <tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
      <tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
      <tr></tr>
    </table>
  </td></tr>
  <tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
  <tr><td>
    <table>
      <tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
      <tr><td style="color:blue;text-indent:10px">Celery</td></tr>
      <tr></tr>
    </table>
  </td></tr>
  <tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
The XPath expression:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()'
will find:
We have nuts! We have veggies! We have booze!
And the XPath expression:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'
will find:
Almonds Cashews Carrots Celery
The two XPath expressions can be combined:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'
To find:
We have nuts! Almonds Cashews We have veggies! Carrots Celery We have booze!
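For reference, here is a minimal sketch of how these expressions might be run with HTML::TreeBuilder::XPath. The markup is a trimmed copy of the sample above, and the sketch selects the elements and calls as_text on them instead of selecting text() nodes; that substitution is an assumption made for simplicity, not part of the tested code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the sample markup (assumption: the real page has the same shape).
my $html = <<'HTML';
<table id="8jd9c_ITEMS">
  <tr><th>The Products we have in stock!</th></tr>
  <tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
  <tr><td><table>
    <tr><td>Almonds</td></tr>
    <tr><td>Cashews</td></tr>
  </table></td></tr>
  <tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Shared prefix: every cell in every row after the header row.
my $base = '//table[contains(@id, "ITEMS")]/tr[position() > 1]/td';

# findnodes() returns the matching elements; as_text() gives their text content.
my @headings = map { $_->as_text } $tree->findnodes("$base/span");
my @details  = map { $_->as_text } $tree->findnodes("$base/table/tr/td");
my @combined = map { $_->as_text }
    $tree->findnodes("$base/span | $base/table/tr/td");

print "$_\n" for @combined;

$tree->delete;    # release the parse tree
```

Note that the combined list is flat: nothing in it records which heading a detail line belongs to.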
Again, the resulting array could be picked apart (in the real code) with a regex to split it back into per-product lists. But can the array be built using XPath in a way that preserves this association?
For example (pseudo-speaking, this does not work):
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | if exists(//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table) then ("TableRef") else ("NoTable") | Save this result into @TableRef (//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text())'
It is not possible to build multidimensional arrays (in the traditional sense) in Perl; see perldoc perlref. But hopefully a solution similar to the above can create something like:
@ITEMS[0] => We have nuts!
@ITEMS[1] => nutsREF          <-- say, the last word of the span value + REF
@ITEMS[2] => We have veggies!
@ITEMS[3] => veggiesREF       <-- say, the last word of the span value + REF
@ITEMS[4] => We have booze!
@ITEMS[5] => NoTable          <-- value accounts for the missing info

@nutsREF[0] => Almonds
@nutsREF[1] => Cashews

@veggiesREF[0] => Carrots
@veggiesREF[1] => Celery
The products are known in the real code, so my @veggiesREF and my @nutsREF can be declared ahead of the XPath query.
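One possible way to keep that association, sketched under assumptions: select each heading row, then use a relative XPath with the following-sibling axis to fetch the detail cells of the row right after it. The function name collect_items, the [ heading, \@details ] pair layout, and the 'NoTable' placeholder are hypothetical choices for illustration; the nested structure is built with array references, as perlref describes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns a list of [ heading, \@details ] pairs,
# one per product heading, in page order.
sub collect_items {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @items;
    # Rows whose cell holds a <span> are the product headings.
    for my $row ( $tree->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]') ) {
        my ($span) = $row->findnodes('./td/span');

        # Detail cells of the row immediately after the heading row,
        # if that row contains a nested table; otherwise an empty list.
        my @details = map { $_->as_text }
            $row->findnodes('./following-sibling::tr[1]/td/table/tr/td');

        # 'NoTable' marks a product with no detail list (assumed placeholder).
        push @items, [ $span->as_text, @details ? \@details : ['NoTable'] ];
    }

    $tree->delete;
    return @items;
}

# Usage with a tiny, made-up page in the same shape as the sample.
my @items = collect_items(<<'HTML');
<table id="8jd9c_ITEMS">
  <tr><th>The Products we have in stock!</th></tr>
  <tr><td><span>We have nuts!</span></td></tr>
  <tr><td><table><tr><td>Almonds</td></tr><tr><td>Cashews</td></tr></table></td></tr>
  <tr><td><span>We have booze!</span></td></tr>
</table>
HTML

for my $item (@items) {
    my ( $heading, $details ) = @$item;
    print "$heading: @$details\n";
}
```

Because each pair carries its own detail list, no regex is needed afterwards to work out which details belong to which product.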
I understand that if/then/else functionality arrived in XPath 2.0. I am on an Ubuntu system and working locally, but I still can't tell whether my apache2 server is using XPath 2.0 or version 1.0. How can I check this?
Finally, if you can show how to call a Perl script from a PHP submit form, and how to pass a Perl array back to the calling PHP function, that will go a long way toward earning the bounty. :)
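One common way to hand a Perl data structure to another language is to print it as JSON on standard output. The sketch below shows only the Perl side, using the core JSON::PP module; the @items data is a hard-coded stand-in for real scraper output, and the key names heading and details are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP;    # ships with Perl since 5.14

# Stand-in for the scraper result: [ heading, \@details ] pairs (assumed layout).
my @items = (
    [ 'We have nuts!',    [ 'Almonds', 'Cashews' ] ],
    [ 'We have veggies!', [ 'Carrots', 'Celery' ] ],
    [ 'We have booze!',   ['NoTable'] ],
);

# Convert each pair into a hash so the receiving side gets named fields.
my @payload = map { +{ heading => $_->[0], details => $_->[1] } } @items;

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode( \@payload );

print $json, "\n";
```

On the PHP side, the output of a script like this could presumably be captured with shell_exec() and turned into a PHP array with json_decode($output, true); the script path and the exact invocation depend on the server setup and are not shown here.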
Thanks!
IN CLOSING:
The comments directly under this post were directed at the initial post, which was too vague. The subsequent repost (and bounty) was answered by ikegs with a very creative approach that solved the pseudo-problem, but it was difficult for me to understand and reuse in my real application, which has to work across different HTML pages. Around the 18th comment in our dialog, I finally worked out the meaning and use of ($cat), the undocumented Perl syntax that he used. For new readers, understanding this syntax makes it possible to follow (and adapt) his clever solution to the problem. His post certainly meets the basic requirements sought in the OP, but does not use HTML::TreeBuilder::XPath to do so.
jpalecek uses HTML::TreeBuilder::XPath, but does not put the captured data into arrays to pass back to the PHP function and load into the database.
I learned from both respondents and hope this post will help others who are not familiar with Perl, like me. Any final contributions would be highly appreciated.