Decoding ampersand hash strings (&#124&#120&#97), etc.

The solutions in the other answers do not work when I try them: the same strings come out unchanged.

I am trying to clean up web pages using Python 2.7. I have a webpage loaded, and it has some characters in the form &#120, where 120 seems to represent the ASCII code. I tried using HTMLParser() and decode(), but nothing works. Please note that the text on the webpage consists only of characters in this format. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please help me decode these strings using Python. I read other answers, but the solutions do not seem to work for me.

+5
3 answers

Depending on what you are doing, you may want to convert this data into valid HTML character references so that you can parse it in context with a proper HTML parser.

However, it is easy enough to extract strings of numbers and convert them to equivalent ASCII characters. For instance,

s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])

Output

Blasterjaxx 

The if u condition skips the initial empty string that we get because s starts with the separator '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
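
The split approach above assumes the string contains nothing but references. If they are embedded in ordinary text, a regular expression may be a safer fit. This is only a rough sketch, written for Python 3 (on Python 2.7, unichr would take the place of chr); the name decode_numeric_refs is made up for illustration:

```python
import re

# Matches "&#" followed by decimal digits, with an optional trailing ";".
NUMERIC_REF = re.compile(r'&#([0-9]+);?')

def decode_numeric_refs(text):
    """Replace decimal character references, leaving all other text as-is."""
    def replace(match):
        code = int(match.group(1))
        try:
            return chr(code)  # Python 2.7 equivalent: unichr(code)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: keep the original text
    return NUMERIC_REF.sub(replace, text)

s = 'Artist: &#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32(live)'
print(decode_numeric_refs(s))
```

Text that does not look like a reference passes through untouched, so this can be run over a whole page fragment rather than a pre-extracted string.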
+4

The correct format for a character reference is &#nnnn; so the trailing ; is missing from your example. You can add the ; and then use HTMLParser.unescape():

from HTMLParser import HTMLParser
import re
x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)

This gives the following output:

&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx 
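
One caveat: the substitution above appends a ; to every reference, so a reference that already ends in ; would end up with two. If the page mixes both forms, a negative lookahead can restrict the fix to the references that need it. This is a sketch under that assumption, shown with Python 3's html.unescape; add_missing_semicolons is a hypothetical helper name, and the same re.sub call should work unchanged on Python 2.7:

```python
import html  # Python 3.4+; on Python 2.7, HTMLParser().unescape plays this role
import re

def add_missing_semicolons(text):
    """Append ';' only to decimal character references that lack one."""
    # (?![0-9;]) rejects a match that stops before another digit or a ';',
    # so already-terminated references are left alone.
    return re.sub(r'(&#[0-9]+)(?![0-9;])', r'\1;', text)

mixed = '&#66;&#108&#97;&#115'
fixed = add_missing_semicolons(mixed)
print(fixed)                 # &#66;&#108;&#97;&#115;
print(html.unescape(fixed))  # Blas
```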
+5

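In Python 3, html.unescape() from the standard html module decodes these references even without the trailing semicolon: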

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html
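
For what it is worth, html.unescape also seems to cope when the references sit inside ordinary text and are mixed with named ones. A small sketch, where the sample string is invented for illustration:

```python
import html

# Invented sample: semicolon-less numeric references followed by a named one.
page_fragment = 'Now playing: &#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32&amp; friends'
cleaned = html.unescape(page_fragment)
print(cleaned)
```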

0

Source: https://habr.com/ru/post/1648495/

