BeautifulSoup: remove empty tables while keeping partially or fully filled tables

I have an old website, originally created in MS FrontPage, that I am trying to clean up for preservation. I wrote a BeautifulSoup script that does most of the work. The only thing left is to remove empty tables, that is, tables with no text content or data in any of their td tags.

The problem is that what I have tried so far removes a table if even one of its td tags contains no data, regardless of whether the others do. As a result, it removes every table in the document, including the ones I want to keep.

    tags = soup.findAll('table', text=None, recursive=True)
    [tag.extract() for tag in tags]

Any suggestions for deleting tables in which none of the td tags contain any data? (I don't care if they contain img or empty anchor tags if there is no text).

1 answer

Use the .text property. It returns all the text content inside an element, recursively, so a table whose .text is empty contains no text in any of its cells.

Example:

    from BeautifulSoup import BeautifulSoup as BS

    html = """
    <table id="empty">
      <tr><td></td></tr>
    </table>
    <table id="with_text">
      <tr><td>hey!</td></tr>
    </table>
    <table id="with_text_in_one_row">
      <tr><td></td></tr>
      <tr><td>hey!</td></tr>
    </table>
    <table id="no_text_but_img">
      <tr><td><img></td></tr>
    </table>
    <table id="no_text_but_a">
      <tr><td><a></a></td></tr>
    </table>
    <table id="text_in_a">
      <tr><td><a>hey!</a></td></tr>
    </table>
    """

    soup = BS(html)
    for table in soup.findAll("table", text=None, recursive=True):
        if table.text:
            print table["id"]

Outputs:

    with_text
    with_text_in_one_row
    text_in_a
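
The example above only prints the tables that have text. To actually remove the others, the same check can be inverted. Below is a minimal sketch of that idea, assuming Python 3 and BeautifulSoup 4 (the bs4 package) rather than the older BeautifulSoup 3 import used above. The helper name remove_textless_tables and the sample HTML are made up for illustration. It uses get_text(strip=True) instead of a bare .text because, as far as I know, .text in bs4 keeps the whitespace between tags, so a table that looks empty could still produce a non-empty string.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical sample markup, modelled on the tables in the answer above.
HTML = """
<table id="empty"><tr><td></td></tr></table>
<table id="whitespace_only"><tr><td>   </td></tr></table>
<table id="with_text_in_one_row">
  <tr><td></td></tr>
  <tr><td>hey!</td></tr>
</table>
<table id="no_text_but_img"><tr><td><img></td></tr></table>
<table id="text_in_a"><tr><td><a>hey!</a></td></tr></table>
"""


def remove_textless_tables(soup):
    """Detach every <table> that has no non-whitespace text; return their ids."""
    removed = []
    for table in soup.find_all("table"):
        # get_text(strip=True) strips each text fragment and skips empty ones,
        # so whitespace between tags does not count as content.
        if not table.get_text(strip=True):
            removed.append(table.get("id"))
            table.extract()  # same removal call the question uses
    return removed


soup = BeautifulSoup(HTML, "html.parser")
removed_ids = remove_textless_tables(soup)
remaining_ids = [table["id"] for table in soup.find_all("table")]

print(removed_ids)    # ids of the tables that were removed
print(remaining_ids)  # ids of the tables still in the document
```

Tables that contain only an img or an empty anchor are removed too, which matches what the question asks for. A table with text in at least one cell is kept.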

Source: https://habr.com/ru/post/1394078/

