Extracting the text contents of table cells, row by row, using XPath?

I have HTML along the following lines. I would like to extract the contents of the individual table cells; however, I found that some cells contain nested divs, and there may be other oddities that I'm not aware of:

    <p align="center"> <img src="some_image.gif" alt="Some Title"> </p>
    <TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
    <TR>
        <TD colspan=4 ALIGN=center><b>Title</b></TD>
    </TR>
    <TR>
        <TD ALIGN=center>Title</TD>
        <TD ALIGN=center>date</TD>
        <TD ALIGN=center>value</TD>
        <TD ALIGN=center>value</TD>
    </TR><TR>
        <TD ALIGN=center>Title2</TD>
        <TD ALIGN=center></TD>
        <TD ALIGN=center><div class=redtext>----</div></TD>
        <TD>&nbsp;</TD>
    </TR><TR>
        <TD ALIGN=center>Title3</TD>
        <TD ALIGN=center><div class=yellowtext>value</div></TD>
        <TD ALIGN=center><div class=redtext>value</div></TD>
        <TD ALIGN=center>value<SUP>6</SUP></TD>
    </TR><TR>
        <TD ALIGN=center>Title4</TD>
        <TD ALIGN=center><div class=bluetext>value</div></TD>
        <TD ALIGN=center><div class=redtext>value</div></TD>
        <TD>&nbsp;</TD>
    </TR></TABLE>
    <blockquote> <p class="textstyle"> Text. </p> </blockquote>

My first impulse was to extract ALL the text of the elements and simply slice it up programmatically. I would watch for Title1, Title2, etc., to know when a row starts, and if the value found is "----", which is of no interest, just skip that row and move on. However, I figured that the best way is probably to handle this with XPath directly.

How can this be solved with XPath so that I get the text content of each cell, descending into a div if one exists? Or is there a more idiomatic XPath way to approach this?

Obviously, I'm looking for the most flexible solution, one that will not break if other unexpected elements appear, however unlikely that is.

+2
3 answers

The provided text is not a well-formed XML document, so XPath cannot be applied to it as it stands.

If you correct it and wrap it into a well-formed XML document, as shown below, the following expression may be useful:

 /*/TABLE//TD//text() 

or even:

 //TABLE//TD//text() 

Here is a well-formed XML document built from the provided HTML:

    <html>
        <p align="center"> <img src="some_image.gif" alt="Some Title"/> </p>
        <TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
            <TR>
                <TD colspan="4" ALIGN="center"> <b>Title</b> </TD>
            </TR>
            <TR>
                <TD ALIGN="center">Title</TD>
                <TD ALIGN="center">date</TD>
                <TD ALIGN="center">value</TD>
                <TD ALIGN="center">value</TD>
            </TR>
            <TR>
                <TD ALIGN="center">Title2</TD>
                <TD ALIGN="center"></TD>
                <TD ALIGN="center"> <div class="redtext">----</div> </TD>
                <TD>&#xA0;</TD>
            </TR>
            <TR>
                <TD ALIGN="center">Title3</TD>
                <TD ALIGN="center"> <div class="yellowtext">value</div> </TD>
                <TD ALIGN="center"> <div class="redtext">value</div> </TD>
                <TD ALIGN="center">value <SUP>6</SUP> </TD>
            </TR>
            <TR>
                <TD ALIGN="center">Title4</TD>
                <TD ALIGN="center"> <div class="bluetext">value</div> </TD>
                <TD ALIGN="center"> <div class="redtext">value</div> </TD>
                <TD>&#xA0;</TD>
            </TR>
        </TABLE>
        <blockquote> <p class="textstyle"> Text. </p> </blockquote>
    </html>
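As a rough illustration, the expression can be evaluated with lxml (an assumption here; any XPath 1.0 engine should behave similarly). The `xml` string below is a shortened, hypothetical stand-in for the well-formed document above, not the full document. XPath is case-sensitive, so `TABLE` and `TD` have to match the upper-case element names.

```python
from lxml import etree

# Shortened, hypothetical stand-in for the well-formed document above.
xml = (
    '<html><TABLE>'
    '<TR><TD colspan="4"><b>Title</b></TD></TR>'
    '<TR><TD>Title3</TD><TD><div class="redtext">value</div></TD>'
    '<TD>value<SUP>6</SUP></TD></TR>'
    '</TABLE></html>'
)

root = etree.fromstring(xml)

# Every text node below any TD, at any depth, returned as one flat list.
texts = root.xpath('//TABLE//TD//text()')
print(texts)
```

The result is a single flat list, so row and cell boundaries are lost, and text that a cell splits across child elements (such as `value` and the `6` inside `SUP`) arrives as separate items.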
+3

So maybe you don't want to go into divs, but here is my solution using lxml, which I highly recommend:

    import re
    from io import StringIO

    from lxml import etree


    def getTable(html, table_xpath, rows_xpath, cells_xpath):
        """Get a table on a webpage"""
        parser = etree.HTMLParser()

        # Build document tree and get table
        root = etree.parse(StringIO(html), parser)
        table = root.find(table_xpath)
        if table is None:
            print('No table.')
            return []
        rows = table.findall(rows_xpath)

        document = []

        def cleanText(text):
            """Clean up text by removing line breaks and tabs."""
            return re.sub(r'[\r\n\t]+', '', str(text).strip())

        # Iterate over the table rows and collect text from each cell.
        for r in rows:
            cells = r.findall(cells_xpath)
            rowdata = []
            for c in cells:
                text = ''
                # itertext() also yields text nested inside child elements.
                it = c.itertext()
                for i in it:
                    text += cleanText(i) + ' '
                rowdata.append(text)
            document.append(rowdata)
        return document


    html = """
    <html><head><title></title></head><body>
    <p align="center"> <img src="some_image.gif" alt="Some Title"> </p>
    <TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
    <TR>
        <TD colspan=4 ALIGN=center><b>Title</b></TD>
    </TR>
    <TR>
        <TD ALIGN=center>Title</TD>
        <TD ALIGN=center>date</TD>
        <TD ALIGN=center>value</TD>
        <TD ALIGN=center>value</TD>
    </TR><TR>
        <TD ALIGN=center>Title2</TD>
        <TD ALIGN=center></TD>
        <TD ALIGN=center><div class=redtext>----</div></TD>
        <TD>&nbsp;</TD>
    </TR><TR>
        <TD ALIGN=center>Title3</TD>
        <TD ALIGN=center><div class=yellowtext>value</div></TD>
        <TD ALIGN=center><div class=redtext>value</div></TD>
        <TD ALIGN=center>value<SUP>6</SUP></TD>
    </TR><TR>
        <TD ALIGN=center>Title4</TD>
        <TD ALIGN=center><div class=bluetext>value</div></TD>
        <TD ALIGN=center><div class=redtext>value</div></TD>
        <TD>&nbsp;</TD>
    </TR></TABLE>
    </body>
    </html>
    """

    # The HTML parser lower-cases tag and attribute names.
    tp = ".//table[@width='500']"
    rt = "tr"
    cp = "td[@align='center']"

    doc = getTable(html, tp, rt, cp)
    print(repr(doc))
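The same row-by-row collection can probably be expressed with XPath alone. The following is a minimal sketch, assuming lxml's HTML parser is acceptable for the input; `table_rows` and `sample` are hypothetical names introduced for illustration. It relies on the XPath function `normalize-space(.)`, which takes the whole string value of a cell, including text inside nested elements such as `div` or `SUP`.

```python
from lxml import etree


def table_rows(html, table_xpath=".//table"):
    """Return the first matching table as a list of rows of cell strings."""
    root = etree.fromstring(html, etree.HTMLParser())
    tables = root.xpath(table_xpath)
    if not tables:
        return []
    rows = []
    # Accept rows directly under the table or under a tbody, if one exists.
    for tr in tables[0].xpath("./tr | ./tbody/tr"):
        # normalize-space(.) collapses whitespace in the cell's full text.
        cells = [td.xpath("normalize-space(.)") for td in tr.xpath("./td")]
        # XPath whitespace does not include U+00A0 (&nbsp;), so handle it here.
        rows.append([c.replace("\u00a0", " ").strip() for c in cells])
    return rows


# Hypothetical sample: a shortened version of the table from the question.
sample = """
<TABLE WIDTH=500>
<TR><TD colspan=4 ALIGN=center><b>Title</b></TD></TR>
<TR><TD ALIGN=center>Title2</TD><TD ALIGN=center></TD>
    <TD ALIGN=center><div class=redtext>----</div></TD><TD>&nbsp;</TD></TR>
<TR><TD ALIGN=center>Title3</TD><TD ALIGN=center><div class=yellowtext>value</div></TD>
    <TD ALIGN=center><div class=redtext>value</div></TD>
    <TD ALIGN=center>value<SUP>6</SUP></TD></TR>
</TABLE>
"""

rows = table_rows(sample)
for row in rows:
    print(row)
```

Unlike the `td[@align='center']` filter, this keeps every `td` in a row, so empty and `&nbsp;` cells show up as empty strings and the column positions stay aligned.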
0

I believe your program will run into many problems as the input changes: what if the "title" changes, or there is a typo?

There is no way to write a rigid solution for scraping someone else's site, since they can change everything completely. It is usually better to write tolerant, flexible code that at least tries to verify that its output is sane. In this case, it is best to iterate over the results of "//table/tr", then process the td elements inside that loop:

    import lxml.etree

    tree = lxml.etree.fromstring(
        "<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")

    # Join every text node under an element into a single string.
    stringify = lambda x: "".join(x.xpath(".//text()"))

    for x in tree.xpath("//table/tr"):
        print("New row")
        for y in x.xpath("td"):
            print(stringify(y))

Output:

    New row
    test
    New row
    test2

The following code, however, will get the flat list you asked for:

    print(list(map(stringify, tree.xpath("//table/tr/td"))))

Output:

 ['test', 'test2'] 

This finds all text nodes, at any depth, under each td that is a direct child of a tr, which in turn is a direct child of the table.
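One way to check that the output is sane, sketched here under assumptions, is to compare the number of cells in each row against what is expected and set aside anything that does not fit. The function name `rows_with_width` and the parameter `expected_cells` are hypothetical, and the sample table is made up for illustration.

```python
import lxml.etree


def rows_with_width(tree, expected_cells):
    """Split rows into (good, rejected) by their number of td cells."""
    good, rejected = [], []
    for tr in tree.xpath("//table/tr"):
        # Join all text nodes of each cell, as stringify does above.
        cells = ["".join(td.xpath(".//text()")).strip() for td in tr.xpath("td")]
        if len(cells) == expected_cells:
            good.append(cells)
        else:
            rejected.append(cells)
    return good, rejected


# Hypothetical sample: a heading row spanning two columns, then two data rows.
tree = lxml.etree.fromstring(
    "<table>"
    "<tr><td colspan='2'>Heading</td></tr>"
    "<tr><td>Title1</td><td><div>value</div></td></tr>"
    "<tr><td>Title2</td><td>----</td></tr>"
    "</table>"
)

good, rejected = rows_with_width(tree, expected_cells=2)
print(good)
print(rejected)
```

Rows that land in `rejected` can then be logged or inspected rather than silently misread, which should make layout changes on the scraped site easier to notice.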

(A naive query for all text() nodes will produce some funny results when run on HTML that contains "<td>Foo <b>bar</b></td>" or similar, because the text of a single cell is split across several nodes.)
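A small sketch of that effect, using a made-up cell: `.//text()` returns one item per uninterrupted run of text, while the XPath function `string(.)` returns the cell's text as a single value.

```python
import lxml.etree

# Hypothetical cell whose text is interrupted by an inline element.
cell = lxml.etree.fromstring("<td>Foo <b>bar</b> baz</td>")

# One text node per uninterrupted run of text: the <b> element splits the cell.
pieces = cell.xpath(".//text()")

# string(.) concatenates all descendant text nodes into the cell's full text.
whole = cell.xpath("string(.)")

print(pieces)
print(whole)
```

This is why the examples above join the text nodes per cell instead of consuming the flat list of text() results directly.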

0

Source: https://habr.com/ru/post/1397371/

