SoupStrainer does exactly what you are asking for. As a bonus, you get a performance boost, because only the elements you ask for are parsed rather than the complete document tree:
    from bs4 import BeautifulSoup, SoupStrainer

    parse_only = SoupStrainer(id=["first", "third", "loner"])
    soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
Now the soup object will contain only the necessary elements:
    <div id="first">
     <p>
      A paragraph.
     </p>
     <a href="another_doc.html">
      A link
     </a>
     <p>
      A paragraph.
     </p>
    </div>
    <div id="third">
     <p>
      A paragraph.
     </p>
     <a href="another_doc.html">
      A link
     </a>
     <a href="yet_another_doc.html">
      A link
     </a>
    </div>
    <p id="loner">
     A paragraph.
    </p>
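To check which elements actually survive the filter, a minimal self-contained sketch along these lines may help. The `sample_html` string, the `wanted_ids` name, and the shortened list of ids are made up for illustration and are not part of the original question:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample input, shortened for illustration.
sample_html = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

# Keep only top-level matches whose id is in the list.
wanted_ids = SoupStrainer(id=["first", "loner"])
filtered = BeautifulSoup(sample_html, "html.parser", parse_only=wanted_ids)

# Collect the id of every element in the filtered soup that has one.
kept_ids = [tag["id"] for tag in filtered.find_all(id=True)]
print(kept_ids)
```

The children of a matched element (here the paragraph inside `div#first`) are kept along with it, while `div#second` and everything inside it are skipped.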
Can I also filter by tag, not only by id? For example, what if I want to keep all paragraphs with class="someclass", but not divs with the same class?
In that case, you can write a search function that combines several criteria and pass it to SoupStrainer:
    from bs4 import BeautifulSoup, SoupStrainer

    my_document = """
    <html>
    <body>
    <h1>Some Heading</h1>
    <div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
    </div>
    <div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
    </div>
    <div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
    </div>
    <p id="loner">A paragraph.</p>
    <p class="myclass">test</p>
    </body>
    </html>
    """


    def search(tag, attrs):
        # tag is the element name, attrs is a dict of its attributes.
        # Keep paragraphs that carry the class "myclass" ...
        if tag == "p" and "myclass" in attrs.get("class", []):
            return tag
        # ... and any element whose id is one of the wanted ones.
        if attrs.get("id") in ["first", "third", "loner"]:
            return tag
        # Falling through returns None, so everything else is skipped.


    parse_only = SoupStrainer(search)
    soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
    print(soup.prettify())
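If the only requirement is "paragraphs with a given class, but not divs", a custom function may not be needed: SoupStrainer accepts a tag name together with attribute filters, much like `find_all`. The following is a rough sketch under that assumption; `sample_html` and `only_classed_paragraphs` are made-up names, and it has only been reasoned through for elements with a single class value:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical input: a div and a paragraph share the same class.
sample_html = """
<div class="myclass"><span>inside a div</span></div>
<p class="myclass">wanted paragraph</p>
<p>plain paragraph</p>
"""

# Match on tag name AND class at the same time.
only_classed_paragraphs = SoupStrainer("p", class_="myclass")
filtered = BeautifulSoup(
    sample_html, "html.parser", parse_only=only_classed_paragraphs
)

# List every element that made it into the filtered soup.
kept = [(tag.name, tag.get_text()) for tag in filtered.find_all(True)]
print(kept)
```

Unlike the search function, this form cannot express "or" conditions across different attributes, so the function is still the better fit when the class filter has to be combined with the id list.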