Remove HTML tags in AppEngine Python Env (Ruby Sanitize equivalent)

Question

I am looking for a Python module that removes HTML tags but keeps the text content. I tried BeautifulSoup, but could not figure out how to do this with it. The other Python modules I found all seem to depend on libraries that don't work on AppEngine.

The following sample code from the Ruby sanitize library shows the behavior I would like to get in Python:

    require 'rubygems'
    require 'sanitize'

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    Sanitize.clean(html) # => 'foo'

Thanks for your suggestions.

-e

+1

python google-app-engine html-sanitizing

Ecognium Mar 10 '10 at 6:44

5 answers

If you do not want to use a separate library, you can import the standard Django utilities. For instance:

    from django.utils.html import strip_tags

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    stripped = strip_tags(html)
    print stripped  # prints: foo

The same functionality is already available in Django templates as a filter, so you do not need anything else there. For example:

 {{ unsafehtml|striptags }}

Btw, this is one of the fastest ways.
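For readers who cannot import Django, the basic idea of tag stripping can be approximated with a single regular expression. The sketch below is only an illustration and is not presented as Django's implementation; `strip_tags_regex` and `_TAG_RE` are hypothetical names. A pattern like this will mishandle a `>` inside an attribute value, and it leaves the contents of comments, scripts, and styles in place, so it is only reasonable for simple, trusted markup.

```python
import re

# Matches anything that looks like a tag: '<', then any run of non-'>' characters, then '>'.
_TAG_RE = re.compile(r'<[^>]*?>')


def strip_tags_regex(value):
    """Remove tag-like substrings from value (hypothetical helper, not a Django API)."""
    return _TAG_RE.sub('', value)


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
stripped = strip_tags_regex(html)
print(stripped)  # foo
```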

+4

Mikhail Kashkin Mar 10 '10 at 16:42

Using lxml:

    from lxml.html import fromstring

    htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    mySearchTree = fromstring(htmlstring)
    for item in mySearchTree.cssselect('a'):
        print item.text

+1

bigredbob Mar 10 '10 at 6:59

    #!/usr/bin/python
    from xml.dom.minidom import parseString

    def getText(el):
        ret = ''
        for child in el.childNodes:
            if child.nodeType == 3:
                ret += child.nodeValue
            else:
                ret += getText(child)
        return ret

    html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
    dom = parseString('<root>' + html + '</root>')
    print getText(dom.documentElement)

This prints:

this is a link and some bold text followed by an image

+1

Amarghosh Mar 10 '10 at 7:00

A late answer, but you can use jinja2.Markup:

http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags

    from jinja2 import Markup

    Markup("<div>About</div>").striptags()
    # u'About'

+1

Lauro oliveira Dec 02 '13 at 13:29

Alex martelli · Accepted Answer · 2010-03-10T06:59:51+0000

    >>> import BeautifulSoup
    >>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    >>> bs = BeautifulSoup.BeautifulSoup(html)
    >>> bs.findAll(text=True)
    [u'foo']

This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
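The same collect-then-join pattern can be sketched without BeautifulSoup, using only the standard library's HTMLParser. This is a minimal illustration under stated assumptions, not a replacement for a real sanitizer: `TextCollector` and `text_fragments` are hypothetical names, and the parser keeps every text node, including the contents of script and style elements.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TextCollector(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fragments = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.fragments.append(data)


def text_fragments(html):
    """Return the list of text fragments found in html."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
fragments = text_fragments(html)  # a list of strings, one per text node
text = ''.join(fragments)         # joined into a single string
print(text)  # foo
```

Joining with an empty string preserves the whitespace that was already in the markup; joining with ' ' instead would add a space at every tag boundary, which may or may not be what you want.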
