HTMLParser.HTMLParser().unescape() does not work

Question

I would like to convert HTML entities back to their human-readable form, for example '&pound;' to '£', '&deg;' to '°', etc.

I read some posts on this subject

Convert html source content to readable format using Python 2.x

Decode HTML entities in Python string?

Convert XML / HTML entities to Unicode string in Python

and based on them I decided to use the undocumented unescape() function, but it does not work for me...

My sample code looks like this:

 import HTMLParser

 htmlParser = HTMLParser.HTMLParser()
 decoded = htmlParser.unescape('&copy; 2013')
 print decoded

When I run this Python script, the output is still:

 &copy; 2013

instead of

 © 2013

I am using Python 2.x on Windows 7 in a Cygwin console. I googled but did not find any similar problems. Can anyone help me with this?

python html unicode

Dq Jul 19 '13 at 16:48

2 answers

Drmeers · Answer 1 · 2014-04-04T10:26:26+0000

Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.

Python 2.5:

 >>> import HTMLParser
 >>> HTMLParser.HTMLParser().unescape('&copy;')
 '&copy;'

Python 2.6 / 2.7:

 >>> import HTMLParser
 >>> HTMLParser.HTMLParser().unescape('&copy;')
 u'\xa9'

See 2.5 implementation and 2.6 implementation / 2.7 implementation
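If the same code also has to run on newer interpreters, one option is to prefer the documented html.unescape function (added in Python 3.4) and fall back to the undocumented method only on Python 2.6 / 2.7. This is just a minimal sketch, not the only way to do it; the wrapper name unescape_html is made up for illustration.

```python
try:
    # Python 3.4+: documented function in the standard library.
    from html import unescape as _unescape
except ImportError:
    # Python 2.6 / 2.7: fall back to the undocumented method from the question.
    from HTMLParser import HTMLParser
    _unescape = HTMLParser().unescape


def unescape_html(text):
    """Return text with HTML entities replaced by the characters they name."""
    return _unescape(text)


decoded = unescape_html('&copy; 2013')
print(repr(decoded))
```

Note that this does not help on Python 2.5, where the method exists but leaves named entities such as &copy; untouched.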

epiphone · Answer 2 · 2013-07-19T17:15:50+0000

This site contains some solutions, here is one of them:

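 # -*- coding: utf-8 -*-
 from xml.sax.saxutils import escape, unescape

 # Character -> entity; extend with whatever entities you need.
 html_escape_table = {
     '"': "&quot;",
     "'": "&apos;",
     "©": "&copy;",
     # etc...
 }
 # Inverted table for unescaping (dict comprehensions need Python 2.7+).
 html_unescape_table = {v: k for k, v in html_escape_table.items()}

 def html_unescape(text):
     return unescape(text, html_unescape_table)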

Not the prettiest thing, because you have to list every escaped character manually.

EDIT:

How about this?

 import htmllib

 def unescape(s):
     p = htmllib.HTMLParser(None)
     p.save_bgn()
     p.feed(s)
     return p.save_end()
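To avoid listing every entity by hand, the standard library's entity table can do the lookup. The sketch below is one possible approach, not something taken from the linked site: it uses name2codepoint (in htmlentitydefs on Python 2, html.entities on Python 3) together with a regular expression, and it should not depend on how HTMLParser.unescape behaves in a given version. The names unescape_entities and _ENTITY_RE are hypothetical.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2
except ImportError:
    from html.entities import name2codepoint    # Python 3

try:
    unichr
except NameError:
    unichr = chr  # Python 3 has no unichr; chr already returns text

# Matches named (&copy;), decimal (&#169;) and hexadecimal (&#xA9;) references.
_ENTITY_RE = re.compile(r'&(#[xX]?[0-9a-fA-F]+|\w+);')


def unescape_entities(text):
    """Replace known HTML entities in text; leave unknown ones unchanged."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ('#x', '#X'):
                return unichr(int(body[2:], 16))
            if body.startswith('#'):
                return unichr(int(body[1:]))
            return unichr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid number: keep the original text.
            return match.group(0)
    return _ENTITY_RE.sub(replace, text)


print(repr(unescape_entities('&copy; 2013')))
```

On Python 2 it is safest to pass a unicode string, since the replacements are unicode characters.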
