HTMLParser.HTMLParser().unescape() does not work

I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.

I read some posts on this subject

Convert html source content to readable format using Python 2.x

Decode HTML entities in Python string?

Convert XML/HTML entities into Unicode string in Python

and, based on them, I decided to use the undocumented unescape() function, but it does not work for me...

My sample code looks like this:

    import HTMLParser
    htmlParser = HTMLParser.HTMLParser()
    decoded = htmlParser.unescape('&copy; 2013')
    print decoded

When I run this Python script, the output is still:

    &copy; 2013

instead of

 © 2013 

I am using Python 2.x, running on Windows 7 in a Cygwin console. I googled and did not find any similar problems. Can anyone help me with this?

2 answers

Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.

Python 2.5:

    >>> import HTMLParser
    >>> HTMLParser.HTMLParser().unescape('&copy;')
    '&copy;'

Python 2.6 / 2.7:

    >>> import HTMLParser
    >>> HTMLParser.HTMLParser().unescape('&copy;')
    u'\xa9'

See 2.5 implementation and 2.6 implementation / 2.7 implementation
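If the script may run on different interpreters, a small import shim is one way to pick whichever unescape function is available. This is only a sketch: it assumes Python 3.4+ (where `html.unescape` exists) or Python 2.6/2.7 (where `HTMLParser.HTMLParser().unescape` handles named entities as shown above); it does not help on 2.5.

```python
try:
    # Python 3.4+: documented module-level function.
    from html import unescape
except ImportError:
    # Python 2.6 / 2.7: undocumented method on the parser instance.
    import HTMLParser
    unescape = HTMLParser.HTMLParser().unescape

decoded = unescape('&copy; 2013')
print(decoded)
```

Note that the result is a Unicode string, so printing it on a console whose encoding cannot represent the character may still fail or show a substitute.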


This site contains some solutions, here is one of them:

    # -*- coding: utf-8 -*-
    from xml.sax.saxutils import escape, unescape

    html_escape_table = {
        '"': "&quot;",
        "'": "&apos;",
        "©": "&copy;"
        # etc...
    }
    # Invert the table so that entities map back to characters.
    html_unescape_table = {v: k for k, v in html_escape_table.items()}

    def html_unescape(text):
        return unescape(text, html_unescape_table)

Not the prettiest thing, because you have to list each escaped character by hand.
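One way to avoid maintaining the table by hand is to reuse the entity table that ships with the standard library. The following is a rough sketch rather than a tested replacement: the function name `html_unescape` and the regular expression are my own choices, and it assumes `name2codepoint` from `html.entities` (Python 3) or `htmlentitydefs` (Python 2). Unknown entities are left untouched.

```python
import re

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2
    chr = unichr  # so that chr() returns a unicode character on Python 2

# Matches &name;, &#123; and &#x7b; style references.
_entity_re = re.compile(r'&(#x[0-9a-fA-F]+|#[0-9]+|\w+);')

def html_unescape(text):
    def replace(match):
        ref = match.group(1)
        try:
            if ref.startswith('#x'):
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith('#'):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)              # leave unknown references as-is
    return _entity_re.sub(replace, text)

print(html_unescape('&pound;5 at 20&deg;'))
```

On Python 2 the input should ideally be a unicode string, since the replacements are unicode characters.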

EDIT:

How about this?

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

Source: https://habr.com/ru/post/1492347/

