How to convert characters like "&#58;" to ":" in Python?

Possible duplicate:
Convert XML/HTML Entities into Unicode String in Python

HTML sources have tons of character references like "& # 58;" or "& # 46;" (I put spaces around the # here; otherwise they would be rendered as ":" or "."). My question is: how do you convert them to the characters they stand for in Python? Is there a built-in method or something else?

Hope someone can help me. Thanks

+3
2 answers

I'm not sure whether there is a built-in library for this, but here is a quick and dirty way to do it with a regex:

>>> import re
>>> re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1), 10)), "&#58; or &#46;")
u': or .'
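
The snippet above is Python 2 (unichr no longer exists in Python 3). A minimal sketch of the same idea for Python 3, assuming only decimal references need decoding; the function name decode_numeric_entities is made up for illustration:

```python
import re

def decode_numeric_entities(text):
    # Replace each decimal reference such as &#58; with its character.
    # chr() is the Python 3 counterpart of Python 2's unichr().
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

print(decode_numeric_entities("&#58; or &#46;"))  # prints ": or ."
```

Hexadecimal and named references are left untouched by this pattern.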
+5

Here is a more complete solution (Python 2.x). It handles decimal, hexadecimal, and named entities, using htmlentitydefs for the names.

import re
from htmlentitydefs import name2codepoint

# Group 1: decimal (&#58;), group 2: hex (&#x3A;), group 3: named (&amp;).
# Names may contain digits after the first letter (e.g. &frac12;).
EntityPattern = re.compile(r'&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z\d]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        dec, hexa, name = match.groups()
        if dec:
            return unichr(int(dec, 10))
        if hexa:
            return unichr(int(hexa, 16))
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown entity name: leave the text unchanged.
        return match.group(0)

    # Decode byte strings first so the result is always unicode.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)
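
For readers on Python 3.4 or later, the standard library's html.unescape covers the same three cases as the pattern above (decimal, hexadecimal, and named references), so a hand-written regex may not be needed. A small sketch; the sample string is made up for illustration:

```python
import html

# html.unescape handles decimal, hexadecimal, and named references.
sample = "&#58; &#x2E; &amp; &lt;b&gt;"
decoded = html.unescape(sample)
print(decoded)  # prints ": . & <b>"
```

This answers the "is there a built-in method" part of the question for Python 3; it does not apply to the Python 2.x code above.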
+2

Source: https://habr.com/ru/post/1793026/

