How to get "subsoups" and merge / join them?

I have an HTML document that I need to process, and I use BeautifulSoup for this. I would now like to pull a few "subsoups" out of this document and combine them into one soup, so that I can later pass it to a function that expects a soup object.

In case that is unclear, here is an example:

    from bs4 import BeautifulSoup

    my_document = """
    <html>
      <body>
        <h1>Some Heading</h1>
        <div id="first">
          <p>A paragraph.</p>
          <a href="another_doc.html">A link</a>
          <p>A paragraph.</p>
        </div>
        <div id="second">
          <p>A paragraph.</p>
          <p>A paragraph.</p>
        </div>
        <div id="third">
          <p>A paragraph.</p>
          <a href="another_doc.html">A link</a>
          <a href="yet_another_doc.html">A link</a>
        </div>
        <p id="loner">A paragraph.</p>
      </body>
    </html>
    """

    soup = BeautifulSoup(my_document)

    # find the needed parts
    first = soup.find("div", {"id": "first"})
    third = soup.find("div", {"id": "third"})
    loner = soup.find("p", {"id": "loner"})
    subsoups = [first, third, loner]

    # create a new (sub)soup
    resulting_soup = do_some_magic(subsoups)

    # use it in a function that expects a soup object and calls its methods
    function_expecting_a_soup(resulting_soup)

The goal is for resulting_soup to be (or behave like) a soup with the following contents:

    <div id="first">
      <p>A paragraph.</p>
      <a href="another_doc.html">A link</a>
      <p>A paragraph.</p>
    </div>
    <div id="third">
      <p>A paragraph.</p>
      <a href="another_doc.html">A link</a>
      <a href="yet_another_doc.html">A link</a>
    </div>
    <p id="loner">A paragraph.</p>

Is there a convenient way to do this? If there is a better way to get "subselects" than find(), I can use that instead. Thanks.

Update

Wondercricket recommended a solution that joins the string representations of the found tags and parses the result again into a new BeautifulSoup object. While this does solve the problem, the re-parsing may take longer than I would like, especially when I want to keep most of a document and there are many documents to process. find() returns a bs4.element.Tag. Isn't there a way to combine several Tag objects into one soup without converting them to strings and parsing those strings again?

2 answers

SoupStrainer does exactly what you are asking for, and as a bonus you get a performance boost, because only the parts you want are parsed instead of the complete document tree:

    from bs4 import BeautifulSoup, SoupStrainer

    parse_only = SoupStrainer(id=["first", "third", "loner"])
    soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

Now the soup object will contain only the necessary elements:

    <div id="first">
     <p>
      A paragraph.
     </p>
     <a href="another_doc.html">
      A link
     </a>
     <p>
      A paragraph.
     </p>
    </div>
    <div id="third">
     <p>
      A paragraph.
     </p>
     <a href="another_doc.html">
      A link
     </a>
     <a href="yet_another_doc.html">
      A link
     </a>
    </div>
    <p id="loner">
     A paragraph.
    </p>
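As a quick check that the strained soup behaves like an ordinary soup, here is a small sketch that reuses a shortened version of the document from the question and calls the usual search methods on the result. The variable names (top_level_ids, links, skipped) are made up for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened version of the question's document.
my_document = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

# Only elements whose id is in the list (and their children) are parsed.
parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

# The result supports the normal soup API.
top_level_ids = [tag.get("id") for tag in soup.find_all(True, recursive=False)]
links = [a["href"] for a in soup.find_all("a")]
skipped = soup.find(id="second")  # this div was never parsed

print(top_level_ids, links, skipped)
```

Because the unwanted parts are never built into the tree, a function that receives this soup cannot accidentally see them.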

Can I also specify tags, not only ids? For example, what if I want to keep all paragraphs with class="someclass", but not divs with the same class?

In this case, you can write a search function that combines several criteria and pass it to SoupStrainer:

    from bs4 import BeautifulSoup, SoupStrainer

    my_document = """
    <html>
      <body>
        <h1>Some Heading</h1>
        <div id="first">
          <p>A paragraph.</p>
          <a href="another_doc.html">A link</a>
          <p>A paragraph.</p>
        </div>
        <div id="second">
          <p>A paragraph.</p>
          <p>A paragraph.</p>
        </div>
        <div id="third">
          <p>A paragraph.</p>
          <a href="another_doc.html">A link</a>
          <a href="yet_another_doc.html">A link</a>
        </div>
        <p id="loner">A paragraph.</p>
        <p class="myclass">test</p>
      </body>
    </html>
    """

    def search(tag, attrs):
        # keep <p> elements that have the class "myclass"
        if tag == "p" and "myclass" in attrs.get("class", []):
            return tag
        # keep any element whose id is one of the wanted ones
        if attrs.get("id") in ["first", "third", "loner"]:
            return tag

    parse_only = SoupStrainer(search)
    soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
    print(soup.prettify())
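If the full tree has already been parsed anyway, the same kind of combined criteria can also be written as a CSS selector. This is only a minimal sketch, assuming a bs4 version whose select() is backed by the soupsieve package (4.7 or later); the sample markup and variable names are made up for illustration. Note that select() returns a list of tags rather than a soup.

```python
from bs4 import BeautifulSoup

# Made-up sample: a <div> and a <p> share the class "myclass".
html = """
<div id="first"><p>A paragraph.</p></div>
<div class="myclass">a div with the class</div>
<p class="myclass">test</p>
<p id="loner">A paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# "p.myclass" matches only paragraphs with that class, so the div is left out;
# the comma combines it with the id-based criteria.
matches = soup.select("p.myclass, #first, #loner")
matched = [(tag.name, tag.get("id")) for tag in matches]

print(matched)
```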

You can use find_all() (spelled findAll() in older code) and pass it the ids of the elements you want.

    import bs4

    # naming the parser explicitly avoids the "no parser was specified" warning
    soup = bs4.BeautifulSoup(my_document, 'html.parser')

    # EDIT: no regex is needed, you can pass in a list of ids
    sub = soup.find_all(attrs={'id': ['first', 'third', 'loner']})

    # EDIT: with 'html.parser', BeautifulSoup does not automatically
    # add html and body tags around the fragment
    sub = bs4.BeautifulSoup('\n\n'.join(str(s) for s in sub), 'html.parser')

    print(sub)

Output:

    <div id="first">
      <p>A paragraph.</p>
      <a href="another_doc.html">A link</a>
      <p>A paragraph.</p>
    </div>

    <div id="third">
      <p>A paragraph.</p>
      <a href="another_doc.html">A link</a>
      <a href="yet_another_doc.html">A link</a>
    </div>

    <p id="loner">A paragraph.</p>
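Regarding the update in the question (combining Tag objects without going through strings), one possible approach is to append copies of the tags to an empty soup. The sketch below assumes bs4 4.4.0 or later, where copy.copy() on a Tag produces a detached copy including its contents; merge_tags is a hypothetical helper name, not part of bs4. Whether this is actually faster than re-parsing is something to measure on real documents.

```python
import copy

from bs4 import BeautifulSoup


def merge_tags(tags):
    """Hypothetical helper: build a new soup that holds copies of the given tags."""
    merged = BeautifulSoup("", "html.parser")  # empty soup to collect into
    for tag in tags:
        # Copy first: appending the original tag would move it out of its own tree.
        merged.append(copy.copy(tag))
    return merged


# Shortened version of the question's document.
html = """
<div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = soup.find_all(attrs={"id": ["first", "third", "loner"]})
resulting_soup = merge_tags(wanted)

print(resulting_soup)
```

The original soup stays intact because only copies are attached to the new one; if the original is no longer needed, the copy could be dropped and the tags appended directly.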

Source: https://habr.com/ru/post/1239494/

