BeautifulSoup `find_all` generator

Is there a way to turn find_all into a generator that is more memory efficient? For instance:

Given:

    soup = BeautifulSoup(content, "html.parser")
    return soup.find_all('item')

I would like to use instead:

    soup = BeautifulSoup(content, "html.parser")
    while True:
        yield soup.next_item_generator()

(assuming proper handling of the final StopIteration exception)

There are several generators built in, but none for the next search result; find returns only the first element. With thousands of elements, find_all uses a lot of memory. For 5792 elements, I see a spike of just over 1 GB of RAM.

I am well aware that there are more efficient parsers, like lxml, that can do this. Suppose there are business restrictions that prevent me from using anything else.

How can I turn find_all into a generator to iterate in a more memory-efficient way?

3 answers

There is no "find" generator in BeautifulSoup, as far as I know, but we can combine the use of SoupStrainer and the .children generator.

Suppose we have this HTML sample:

    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>

from which we need to get the text of all item nodes.

We can use SoupStrainer to parse only the item tags, and then iterate over the .children generator and get the texts:

    from bs4 import BeautifulSoup, SoupStrainer

    data = """
    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>"""

    parse_only = SoupStrainer('item')
    soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)

    for item in soup.children:
        print(item.get_text())

Prints:

    Item 1
    Item 2
    Item 3
    Item 4
    Item 5

In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, such as .children. You can also use one of these generators directly and manually filter tags by name or other criteria inside the generator body, for example:

    def generate_items(soup):
        for tag in soup.descendants:
            if tag.name == "item":
                yield tag.get_text()

.descendants generates descendants recursively, while .children considers only the direct children of a node.
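To make the difference concrete, here is a minimal sketch (the nested HTML snippet is made up for illustration):

    from bs4 import BeautifulSoup

    html = "<div><p>outer <b>inner</b></p></div>"
    soup = BeautifulSoup(html, "html.parser")
    div = soup.div

    # .children sees only the direct child <p>
    print([t.name for t in div.children if t.name])     # ['p']

    # .descendants recurses into <p> and also yields <b>
    print([t.name for t in div.descendants if t.name])  # ['p', 'b']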


The easiest way is to use find_next:

    soup = BeautifulSoup(content, "html.parser")

    def find_iter(tagname):
        tag = soup.find(tagname)
        while tag is not None:
            yield tag
            tag = tag.find_next(tagname)
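A usage sketch, assuming content contains the <item> sample from the first answer:

    for tag in find_iter('item'):
        print(tag.get_text())
    # Item 1
    # Item 2
    # ...

Note that the whole tree is still parsed into memory up front; what this saves is materializing the full find_all result list, since each find_next call walks forward from the current tag and yields one element at a time.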

From the documentation:

I've given the generators PEP 8-compliant names, and transformed them into properties:

    childGenerator()           -> children
    nextGenerator()            -> next_elements
    nextSiblingGenerator()     -> next_siblings
    previousGenerator()        -> previous_elements
    previousSiblingGenerator() -> previous_siblings
    recursiveChildGenerator()  -> descendants
    parentGenerator()          -> parents

There is a chapter in the documentation called Generators that you can read.
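For instance, a minimal sketch using the .next_siblings property (the sample markup is made up for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div><item>A</item><item>B</item><item>C</item></div>",
                         "html.parser")

    first = soup.find("item")
    # .next_siblings (formerly nextSiblingGenerator()) lazily walks the
    # remaining siblings one element at a time
    for sibling in first.next_siblings:
        if sibling.name == "item":
            print(sibling.get_text())
    # B
    # C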

SoupStrainer only parses part of the HTML, which can save memory, but it merely excludes the irrelevant tags; if your HTML consists mostly of the tags you want, it will lead to the same memory problem.

