Remove HTML tags in AppEngine Python Env (Ruby Sanitize equivalent)

I am looking for a Python module that strips HTML tags but keeps the text content. I tried BeautifulSoup, but could not figure out how to do this with it. The other Python modules I found all seem to depend on libraries that don't work on App Engine.

Here is sample code using the Ruby sanitize library, showing the result I would like to get in Python:

    require 'rubygems'
    require 'sanitize'

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    Sanitize.clean(html) # => 'foo'

Thanks for your suggestions.

-e

+1
5 answers
    >>> import BeautifulSoup
    >>> html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    >>> bs = BeautifulSoup.BeautifulSoup(html)
    >>> bs.findAll(text=True)
    [u'foo']

This gives you a list of (Unicode) strings. If you want a single string, use ''.join(thatlist).
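
If BeautifulSoup is not an option, the same idea (collect only the text nodes, then join them) can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: it assumes Python 3, where the parser lives in html.parser (on Python 2 the module is named HTMLParser), and TextCollector and collect_text are names made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: remembers text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.chunks.append(data)


def collect_text(html):
    """Return the text content of an HTML fragment as one string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Same step as ''.join(thatlist) on the findAll(text=True) result.
    return "".join(parser.chunks)


html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
text = collect_text(html)
print(text)
```

Like findAll(text=True), this keeps every text node, so the contents of script or style elements would come through as well if the input has any.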

+5

If you do not want to add a separate library, you can import Django's standard utilities. For instance:

    from django.utils.html import strip_tags

    html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    stripped = strip_tags(html)
    print(stripped)  # prints: foo

It is also available in Django templates as a built-in filter, so you do not need anything else. For example:

 {{ unsafehtml|striptags }} 

Btw, this is one of the fastest ways.

+4

Using lxml:

    from lxml.html import fromstring

    htmlstring = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
    mySearchTree = fromstring(htmlstring)

    # Print the text of every <a> element
    for item in mySearchTree.cssselect('a'):
        print(item.text)
+1
    #!/usr/bin/python
    from xml.dom.minidom import parseString

    def getText(el):
        # Recursively concatenate the text nodes below el
        ret = ''
        for child in el.childNodes:
            if child.nodeType == 3:  # 3 == Node.TEXT_NODE
                ret += child.nodeValue
            else:
                ret += getText(child)
        return ret

    html = '<b>this is <a href="http://foo.com/">a link </a> and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
    # Wrap the fragment in a single root element so it parses as XML
    dom = parseString('<root>' + html + '</root>')
    print(getText(dom.documentElement))

This prints (apart from a few doubled spaces where tags were removed):

this is a link and some bold text followed by an image
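
One caveat with this approach: minidom is an XML parser, so it only works when the HTML fragment is also well-formed XML. Something like an unclosed <br> raises an ExpatError. Here is a hedged sketch of how that could be handled, assuming Python 3; strip_tags_xml and get_text are hypothetical names, and returning None on failure is just one possible convention.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def get_text(el):
    """Recursively concatenate the text nodes below el."""
    parts = []
    for child in el.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.nodeValue)
        else:
            parts.append(get_text(child))
    return "".join(parts)


def strip_tags_xml(fragment):
    """Return the text of a fragment, or None if it is not well-formed XML."""
    try:
        # Wrap in a single root element, as in the answer above.
        dom = parseString("<root>" + fragment + "</root>")
    except ExpatError:
        return None
    return get_text(dom.documentElement)


well_formed = strip_tags_xml("<b>bold</b> text")
malformed = strip_tags_xml("line one<br>line two")  # <br> is never closed
print(well_formed)
print(malformed)
```

For real-world HTML, which is often not valid XML, a forgiving HTML parser is usually the safer choice.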

+1

Late, but:

You can use jinja2.Markup():

http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags

    >>> from jinja2 import Markup
    >>> Markup("<div>About</div>").striptags()
    u'About'
+1

Source: https://habr.com/ru/post/1303676/

