How to get BeautifulSoup to parse the contents of textarea tags as HTML?

Question

Prior to version 3.0.5, BeautifulSoup treated <textarea> content as HTML. Now it treats it as text. The document I'm processing has HTML inside the textarea tags, and I'm trying to process it.

I tried:

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents)
        textarea.replaceWith(contents.html(text=True))

But I get errors. I can't find this in the documentation, and alternative parsers don't help. Does anyone know how I can parse textarea contents as HTML?

Edit:

HTML example:

    <textarea class="ks-lazyload-custom">
      <div class="product-view product-view-rug">
        Foobar Womble
        <div class="product-view-head">
          <img src="tps/i1/fo-25.gif" />
        </div>
      </div>
    </textarea>

Error:

    File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
      '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
    TypeError: expected string or buffer

I am looking for a way to take an element, extract the contents, parse them with BeautifulSoup, collapse it into text and then replace the contents of the original element (or replace the whole element) with that text.

As for the real world versus the specifications, that is not particularly relevant here. The data needs to be parsed, and I'm looking for a way to do that.

+4

python html-parsing beautifulsoup

brofield Apr 19 '10 at 5:49

2 answers

I'm now using the following code, which mostly works. Your mileage may vary.

    def _extractText(self, data, encoding):
        if self.isDebug:
            self._output("_extractText")

        soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)

        comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
        [comment.extract() for comment in comments]
        [script.extract() for script in soup.findAll('script')]
        [css.extract() for css in soup.findAll('style')]

        for textarea in soup.findAll('textarea'):
            textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')

        text = unicode('')
        for line in soup.findAll(text=True):
            line = line.replace('&nbsp;', ' ').strip()
            if line == '':
                continue
            if line.startswith('doctype'):
                continue
            if line.startswith('DOCTYPE'):
                continue
            text = text + line + '\n'
        return text
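
For readers without BeautifulSoup 3 at hand, the text-collapsing step can be sketched with Python 3's standard `html.parser` module instead. This is a swapped-in technique, not the code above: it assumes the textarea's inner markup is already available as a string (for example from `renderContents()`), and the names `TextCollector` and `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes; tags, comments, scripts and styles are dropped."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        # &nbsp; arrives as U+00A0 because character references are converted.
        text = data.replace("\xa0", " ").strip()
        if text:
            self.lines.append(text)


def html_to_text(fragment):
    """Parse an HTML fragment and return its text, one text node per line."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()
    return "\n".join(parser.lines)


# Inner markup of the textarea from the question, assumed already extracted.
inner = ('<div class="product-view product-view-rug"> Foobar Womble '
         '<div class="product-view-head"> <img src="tps/i1/fo-25.gif" /> '
         '</div> </div>')
print(html_to_text(inner))
```

The resulting string could then be written back into the tree, as the `textarea.string = ...` assignment does above.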

0

brofield Apr 19 '10 at 8:01

Justin Peel · Accepted Answer · 2010-04-19T17:45:30+0000

This seems to work quite well (if I understood correctly what you wanted):

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents()
        textarea.replaceWith(contents)
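
The `[0]` index is likely what makes the difference. `textarea.contents` is a list of child nodes, and the traceback in the question ends in a regex `.match(xml_data)` call, which needs a string. Below is a minimal Python 3 sketch of that failure mode using only the `re` module. The pattern is copied from the traceback; `declared_encoding` and the sample `contents` list are hypothetical stand-ins, not BeautifulSoup's actual code.

```python
import re

# Pattern copied from the traceback in the question.
ENCODING_DECL = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')


def declared_encoding(markup):
    """Return the encoding named in an XML declaration, or None if absent."""
    match = ENCODING_DECL.match(markup)
    return match.group(1) if match else None


# Stand-in for textarea.contents: a list of child nodes, not a string.
contents = ['<div class="product-view"> Foobar Womble </div>']

try:
    declared_encoding(contents)  # passing the whole list, as in the question
except TypeError as exc:
    print("list input fails:", type(exc).__name__)

# Passing the first child, as in the answer, gives the regex a string.
print(declared_encoding(contents[0]))
```

Note that `contents[0]` only covers the first child node, so the approach presumably relies on the textarea holding a single text node.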
