Convert character references to numeric characters to a Unicode string

Is there a standard, preferably Pythonic, way to convert notation &#xxxx; to the correct unicode string?

For instance,

 מפגשי 

Must be converted to:

 מפגשי 

This can be done - quite easily - using string manipulations, but I wonder if there is a standard library for this.

+6
source share
1 answer

Use HTMLParser.HTMLParser() :

 >>> from HTMLParser import HTMLParser >>> h = HTMLParser() >>> s = "מפגשי" >>> print h.unescape(s) מפגשי 

This is part of the standard standard library .


However, if you are using Python 3, you need to import from html.parser :

 >>> from html.parser import HTMLParser >>> h = HTMLParser() >>> s = 'מפגשי' >>> print(h.unescape(s)) מפגשי 
+9
source

Source: https://habr.com/ru/post/946924/


All Articles