Decoding ampersand hash strings (&#124&#120&#97, etc.)

Question

The solutions in the other answers do not work for me: when I try those methods, the input strings come out unchanged.

I am trying to clean up web pages using Python 2.7. I have a webpage loaded, and it contains some characters in the form &#120, where 120 seems to be the ASCII code. I tried using HTMLParser() and decode(), but nothing works. Please note that the page contains only characters in exactly this format, with no trailing semicolon. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please help me decode these strings using Python. I read other answers, but the solutions do not seem to work for me.

+5

python html decode

Ivankovich Jul 20 '16 at 11:21

3 answers

The correct format for a numeric character reference is &#nnnn; so your example is missing the trailing ; after each number. You can add the ; and then use HTMLParser.unescape():

from HTMLParser import HTMLParser
import re

x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' after each numeric character reference
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)

This gives this result:

&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx
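
If some references in the input might already end in ; the substitution above would add a second one. As a minimal sketch (the helper name add_missing_semicolons is hypothetical, and the import fallback is an assumption made so the same code runs on Python 2 and Python 3), a negative lookahead can restrict the fix to references that are actually unterminated:

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape


def add_missing_semicolons(text):
    # Hypothetical helper: append ';' only where a decimal reference
    # is not already followed by ';' (or by more digits).
    return re.sub(r'(&#[0-9]+)(?![0-9;])', r'\1;', text)


fixed = add_missing_semicolons('&#66&#108;&#97')
decoded = unescape(fixed)
```

Here the middle reference already has its semicolon, so only the first and last ones are changed before unescaping.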

+5

Fabich 20 . '16 12:30

In Python 3, the html module handles this directly, even without the trailing semicolons:

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html

0

frnhr May 04 '19 at 18:23

PM 2Ring · Accepted Answer · 2016-07-20T13:11:32+0000

Depending on what you are doing, you may want to convert this data into valid HTML character references so that you can parse it in context with a proper HTML parser.

However, it is easy enough to extract the number strings and convert them to the equivalent ASCII characters. For instance,

s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Split on '&#', drop the empty first piece, and convert each number to its character
print ''.join([chr(int(u)) for u in s.split('&#') if u])

Output

Blasterjaxx

The if u test skips the initial empty string that we get because s starts with the split string '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
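
The split-based approach assumes the string contains nothing but references. If the references are mixed with ordinary text, or some of them do carry a trailing semicolon, a regular expression with a replacement callback may be more forgiving. The following is only a sketch under those assumptions: the function name decode_numeric_refs is hypothetical, and it targets Python 3, where chr() accepts any Unicode code point (on Python 2, unichr() would be the rough equivalent).

```python
import re


def decode_numeric_refs(text):
    # Hypothetical helper: replace each decimal reference such as
    # '&#66' or '&#66;' with its character and leave other text alone.
    # Note: chr() raises ValueError for numbers above 0x10FFFF.
    return re.sub(r'&#([0-9]+);?',
                  lambda m: chr(int(m.group(1))),
                  text)


result = decode_numeric_refs('Artist: &#66&#108&#97&#115&#116;!')
```

Only the decimal form is handled here; hexadecimal (&#x42;) and named (&amp;) references are left untouched, so html.unescape remains the more complete option when it is available.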
