BeautifulSoup `find_all` generator

Is there a way to turn find_all into a generator that is more memory efficient? For instance:

Given:

    soup = BeautifulSoup(content, "html.parser")
    return soup.find_all('item')

I would like to use instead:

    soup = BeautifulSoup(content, "html.parser")
    while True:
        yield soup.next_item_generator()

(assuming proper handling of the final StopIteration exception)

There are several generators built in, but none for the next search result; find returns only the first element. With thousands of elements, find_all uses a lot of memory. For 5792 elements, I see a spike of just over 1 GB of RAM.

I am well aware that there are more efficient parsers, like lxml, that can do this. Suppose there are business restrictions that prevent me from using anything else.

How can I turn find_all into a generator to iterate in a more memory-efficient way?

3 answers

There is no "find" generator in BeautifulSoup, as far as I know, but we can combine the use of SoupStrainer and the .children generator.

Suppose we have this HTML sample:

    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>

from which we need to get the text of all item nodes.

We can use SoupStrainer to parse only the item tags, and then iterate over the .children generator and get the texts:

    from bs4 import BeautifulSoup, SoupStrainer

    data = """
    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>"""

    parse_only = SoupStrainer('item')
    soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)

    for item in soup.children:
        print(item.get_text())

Prints:

    Item 1
    Item 2
    Item 3
    Item 4
    Item 5

In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, such as .children. You can also use one of these generators directly and manually filter tags by name or other criteria inside the generator body, for example:

    def generate_items(soup):
        for tag in soup.descendants:
            if tag.name == "item":
                yield tag.get_text()

.descendants generates descendants recursively, while .children considers only the direct children of a node.
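To make the difference concrete, here is a minimal sketch (the nested HTML snippet is made up for illustration):

    from bs4 import BeautifulSoup

    html = "<div><p>outer <b>inner</b></p></div>"
    soup = BeautifulSoup(html, "html.parser")
    div = soup.div

    # .children sees only the direct child <p>
    print([t.name for t in div.children if t.name])     # ['p']

    # .descendants recurses into <p> and also yields <b>
    print([t.name for t in div.descendants if t.name])  # ['p', 'b']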


The easiest way is to use find_next:

    soup = BeautifulSoup(content, "html.parser")

    def find_iter(tagname):
        tag = soup.find(tagname)
        while tag is not None:
            yield tag
            tag = tag.find_next(tagname)
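A usage sketch, assuming content contains the <item> sample from the first answer:

    for tag in find_iter('item'):
        print(tag.get_text())
    # Item 1
    # Item 2
    # ...

Note that the whole tree is still parsed into memory up front; what this saves is materializing the full find_all result list, since each find_next call walks forward from the current tag and yields one element at a time.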

From the documentation:

I've given the generators PEP 8-compliant names, and transformed them into properties:

    childGenerator()           -> children
    nextGenerator()            -> next_elements
    nextSiblingGenerator()     -> next_siblings
    previousGenerator()        -> previous_elements
    previousSiblingGenerator() -> previous_siblings
    recursiveChildGenerator()  -> descendants
    parentGenerator()          -> parents

There is a chapter in the documentation called Generators that you can read.
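For instance, a minimal sketch using the .next_siblings property (the sample markup is made up for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div><item>A</item><item>B</item><item>C</item></div>",
                         "html.parser")

    first = soup.find("item")
    # .next_siblings (formerly nextSiblingGenerator()) lazily walks the
    # remaining siblings one element at a time
    for sibling in first.next_siblings:
        if sibling.name == "item":
            print(sibling.get_text())
    # B
    # C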

SoupStrainer only parses part of the HTML, which can save memory, but it merely excludes the irrelevant tags; if your HTML consists mostly of the tags you want, it will lead to the same memory problem.

