Decoding encoded characters in Python strings

Question

I have a list of html pages that may contain certain encoded characters. Some examples are given below:

<a href="mailto:lad%20at%20maestro%20dot%20com"> <em>ada&#x40;graphics.maestro.com</em> <em>mel&#x40;graphics.maestro.com</em>

I would like to decode (or unescape; I'm not sure of the correct terminology) these lines to:

  <a href="mailto:lad at maestro dot com"> <em> ada@graphics.maestro.com </em> <em> mel@graphics.maestro.com </em>

Please note: the HTML pages are in string format. Also, I do NOT want to use any external library such as BeautifulSoup or lxml; only Python's standard library is OK.

Edit -

The solution below is not perfect. Unescaping with HTMLParser after unquoting with urllib2 raises the following error in some cases:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

+6

python character-encoding decode encode

Dexter Mar 25 '12 at 0:43

1 answer

bernie · Accepted Answer · 2012-03-25T00:58:47+0000

You need to unescape the HTML entities and URL-unquote the string.
The standard library modules HTMLParser and urllib2 help with these tasks.

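 # Python 2: HTMLParser and urllib2 are standard library modules
 import HTMLParser, urllib2

 markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
 <em>ada&#x40;graphics.maestro.com</em>
 <em>mel&#x40;graphics.maestro.com</em>'''

 # First URL-unquote (%20 -> space), then unescape HTML entities (&#x40; -> @)
 result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
 for line in result.split("\n"):
     print(line)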

Result:

 <a href="mailto:lad at maestro dot com">
 <em>ada@graphics.maestro.com</em>
 <em>mel@graphics.maestro.com</em>
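For readers on Python 3, where the HTMLParser and urllib2 modules no longer exist under those names, the same two steps can be sketched with html.unescape and urllib.parse.unquote. This is a minimal sketch that assumes Python 3.4 or later; the helper name decode_markup is made up for illustration and is not part of the answer above.

```python
from html import unescape
from urllib.parse import unquote

# Shortened version of the sample markup from the question
markup = '<a href="mailto:lad%20at%20maestro%20dot%20com"> <em>ada&#x40;graphics.maestro.com</em>'


def decode_markup(text):
    """URL-unquote percent escapes, then unescape HTML entities.

    decode_markup is a hypothetical helper name used only in this sketch.
    """
    return unescape(unquote(text))


print(decode_markup(markup))
```

Note that this unquotes the whole page, as the answer does, so any literal percent sequence elsewhere in the text would be decoded too.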

Edit:
If your pages may contain non-ASCII characters, you need to take care of decoding on input and encoding on output.
The sample file uses the cp1252 character set, so try decoding it from cp1252 to Unicode:

 import codecs, HTMLParser, urllib2

 # Decode the cp1252 bytes to Unicode while reading
 with codecs.open(filename, encoding="cp1252") as fin:
     decoded = fin.read()

 result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))

 # Encode back to cp1252 while writing
 with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
     fou.write(result)
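The decode-on-input, encode-on-output idea can also be shown without files. The sketch below assumes Python 3 and uses a made-up byte string containing 0x93 and 0x94, which are the curly double quotes in cp1252; 0x94 is the byte named in the error message from the question.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical cp1252-encoded input; \x93 and \x94 are curly double quotes
raw = b'<em>\x93ada&#x40;maestro.com\x94</em>'

# Input boundary: bytes -> str, using the file's real encoding
text = raw.decode("cp1252")

# Work on Unicode text only
result = unescape(unquote(text))

# Output boundary: str -> bytes, in the same encoding
encoded = result.encode("cp1252")

print(repr(result))
print(encoded)
```

Decoding the same bytes with the ascii codec instead fails with a UnicodeDecodeError, because 0x93 and 0x94 are outside the 0-127 range.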

Edit2:
If you don't care about the non-ASCII characters, you can simplify things a bit by dropping them:

 with open(filename) as fin:
     decoded = fin.read().decode('ascii', 'ignore')
 ...
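To make the effect of the 'ignore' error handler concrete, here is a small sketch on an in-memory byte string (the sample bytes are invented). It behaves the same way on Python 2 str objects and Python 3 bytes: every byte that is not valid ASCII is silently dropped.

```python
# Hypothetical input with three non-ASCII bytes (\xe9, \x94, \x94)
raw = b'caf\xe9 \x94quoted\x94 text'

# 'ignore' discards undecodable bytes instead of raising UnicodeDecodeError
decoded = raw.decode("ascii", "ignore")

print(decoded)
```

This is lossy: the dropped characters cannot be recovered later, so it is only appropriate when they really do not matter.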
