You need to unescape the HTML entities and URL-unquote the text.
The standard library modules HTMLParser and urllib2 help with these tasks.
    import HTMLParser, urllib2

    markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
    <em>ada@graphics.maestro.com</em>
    <em>mel@graphics.maestro.com</em>'''

    # URL-unquote first, then unescape the HTML entities
    result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
    for line in result.split("\n"):
        print(line)

Result:

    <a href="mailto:lad at maestro dot com">
    <em>ada@graphics.maestro.com</em>
    <em>mel@graphics.maestro.com</em>
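On Python 3 the same two steps live in different modules. Here is a minimal sketch, assuming Python 3.4+, where `html.unescape` and `urllib.parse.unquote` take the place of the calls above. The helper name `decode_markup` and the entity-encoded sample string are illustrative assumptions, not part of the original page:

```python
import html
from urllib.parse import unquote


def decode_markup(text):
    """URL-unquote the text, then unescape its HTML entities."""
    return html.unescape(unquote(text))


# Hypothetical entity-encoded input, for illustration only.
markup = (
    '&lt;a href=&quot;mailto:lad%20at%20maestro%20dot%20com&quot;&gt;\n'
    '&lt;em&gt;ada@graphics.maestro.com&lt;/em&gt;'
)

result = decode_markup(markup)
for line in result.split("\n"):
    print(line)
```

The order of the two calls mirrors the Python 2 snippet: percent-escapes are resolved first, entities second.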
Edit:
If your pages can contain non-ASCII characters, you need to take care of decoding on input and encoding on output.
The downloaded sample file uses the cp1252 character set, so try decoding it to Unicode:
    import codecs

    # decode cp1252 bytes to unicode on input
    with codecs.open(filename, encoding="cp1252") as fin:
        decoded = fin.read()

    result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))

    # encode unicode back to cp1252 bytes on output
    with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
        fou.write(result)
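For Python 3, the built-in `open` accepts an `encoding` argument, so `codecs` is not needed. This is a rough sketch of the same decode, convert, encode round trip. The function name `convert_file` and its parameters are made up for the example, and cp1252 is only the default because that is what the sample file uses:

```python
import html
from urllib.parse import unquote


def convert_file(src_path, dst_path, encoding="cp1252"):
    """Read src_path, unquote and unescape it, write the result to dst_path."""
    # Decode bytes to text on input.
    with open(src_path, encoding=encoding) as fin:
        decoded = fin.read()

    result = html.unescape(unquote(decoded))

    # Encode text back to bytes on output.
    with open(dst_path, "w", encoding=encoding) as fou:
        fou.write(result)
    return result
```

Note that an entity may unescape to a character cp1252 cannot represent, in which case the write raises `UnicodeEncodeError`. Passing `errors="xmlcharrefreplace"` to the second `open` is one way to deal with that.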
Edit2:
If you don't care about non-ASCII characters, you can simplify things a bit by dropping them:
    with open(filename) as fin:
        # 'ignore' silently drops every byte that is not ASCII
        decoded = fin.read().decode('ascii', 'ignore')
    ...
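The Python 3 counterpart of this shortcut puts the error handling on `open` itself. A small sketch, where `read_ascii_only` is a hypothetical helper name:

```python
def read_ascii_only(path):
    """Return the file's text with every non-ASCII byte silently dropped."""
    with open(path, encoding="ascii", errors="ignore") as fin:
        return fin.read()
```

Keep in mind that this loses data: a word like "café" stored as cp1252 comes back as "caf".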