Convert HTML source content to a readable format using Python 2.x

Question

Python 2.7

I have a program that extracts video titles from the source code of a web page, but the titles are encoded with some kind of HTML escaping.

This is what I have tried so far:

    >>> import urllib2
    >>> urllib2.unquote('&pound;')
    '&pound;'

So this did not work ... Then I tried:

    >>> import HTMLParser
    >>> h = HTMLParser.HTMLParser()
    >>> h.unescape('&pound;')
    u'\xa3'

As you can see, that does not work either, nor does any combination of the two.

I managed to find out that '&pound;' is an HTML character entity name, but I could not find out what '\xa3' is.

Does anyone know how to convert HTML content into a readable format in Python?

+1

python html

RussW Jul 28 '12 at 20:11

4 answers

Why is this not working?

    In [1]: s = u'\xa3'

    In [2]: s
    Out[2]: u'\xa3'

    In [3]: print s
    £

When it comes to unescaping HTML entities, I have always used http://effbot.org/zone/re-sub.htm#unescape-html.
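The linked recipe is not reproduced here. As a rough illustration of the same idea (a regular-expression substitution with a callback), here is a minimal sketch. The function name `unescape_entities` is made up for this example, and the import fallback is only there so the snippet runs on both Python 2.x and 3.x:

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
    unichr = chr                                # Python 3 has no unichr()
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2.x

def unescape_entities(text):
    """Replace named and numeric HTML entities in a unicode string.

    Hypothetical helper written for this example only.
    """
    def fixup(match):
        body = match.group(1)
        try:
            if body.startswith(('#x', '#X')):
                return unichr(int(body[2:], 16))   # hexadecimal: &#xa3;
            if body.startswith('#'):
                return unichr(int(body[1:]))       # decimal: &#163;
            return unichr(name2codepoint[body])    # named: &pound;
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                  # unknown entity: leave as-is
    return re.sub(r'&(#?\w+);', fixup, text)

title = unescape_entities(u'&pound;5 &amp; &#163;6')
# title == u'\xa35 & \xa36'
```

Entities the table does not know about are left untouched rather than raising, which is usually what you want when cleaning up scraped titles.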

+1

dav1d Jul 28 '12 at 21:13

Video titles use HTML entities to encode special characters such as ampersands and pound signs.

\xa3 is the Python Unicode escape for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escape. If you print the string, you will see that it represents the pound sign:

    >>> import HTMLParser
    >>> h = HTMLParser.HTMLParser()
    >>> h.unescape('&pound;')
    u'\xa3'
    >>> print h.unescape('&pound;')
    £
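To check this without relying on how the terminal renders the character, a small sketch like the following inspects the string directly. The variable name `pound` is just for this example, and the snippet is written so that it runs on both Python 2.7 and 3.x:

```python
import unicodedata

pound = u'\xa3'                   # the value the interpreter echoed back

print(len(pound))                 # 1: a single character, not four
print(unicodedata.name(pound))    # POUND SIGN
print(pound == u'\u00a3')         # True: \xa3 and \u00a3 are the same code point
```

So the escape shown in the interactive session is only a display form; the string already holds the decoded character.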

+1

Josh Rosen Jul 28 '12 at 21:15

lxml, BeautifulSoup, or PyQuery do the job pretty well. Or a combination of them ;)

0

starenka Jul 29 '12 at 20:53

Francis Avila · Accepted Answer · 2012-07-28T21:28:35+0000

&pound; is the HTML character entity for POUND SIGN, which is the Unicode character U+00A3. You can see this if you print it:

    >>> print u'\xa3'
    £

When you use unescape(), you convert the character entity into its native Unicode character, which is what u'\xa3' means: a single Unicode character, U+00A3.

If you want to encode this in a different format (for example, UTF-8), you would do so with the string's encode method:

    >>> u'\xa3'.encode('utf-8')
    '\xc2\xa3'

You get a two-byte string representing the single character POUND SIGN.

I suspect you are a bit unclear on how string encodings work. You need to convert your string from bytes to unicode (see this answer for one way to do this with urllib2), then unescape the HTML, then (possibly) convert the unicode into whatever output encoding you need.
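The three steps can be sketched as follows. This is only an illustration: the helper name `readable_title` and both encoding defaults are assumptions made for this example (in practice the source encoding should come from the page's Content-Type header or meta tag), and the import fallback is there so the snippet runs on both Python 2.x and 3.x:

```python
try:
    from html import unescape                  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser          # Python 2.x
    unescape = HTMLParser().unescape

def readable_title(raw_bytes, source_encoding='utf-8', output_encoding='utf-8'):
    """Hypothetical helper: bytes in, unescaped and re-encoded bytes out."""
    text = raw_bytes.decode(source_encoding)   # 1. bytes -> unicode
    text = unescape(text)                      # 2. '&pound;' -> u'\xa3'
    return text.encode(output_encoding)        # 3. unicode -> output bytes

result = readable_title(b'Tickets from &pound;5 &amp; up')
# result == b'Tickets from \xc2\xa35 & up'
```

If the text is only going to be printed or processed further inside the program, step 3 can be skipped and the unicode string used directly.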
