Convert HTML source content to a readable format using Python 2.x

Python 2.7

I have a program that retrieves video titles from the source code of a web page, but the titles are encoded in some HTML format.

This is what I have tried so far:

    >>> import urllib2
    >>> urllib2.unquote('&pound;')
    '&pound;'

So this did not work ... Then I tried:

    >>> import HTMLParser
    >>> h = HTMLParser.HTMLParser()
    >>> h.unescape('&pound;')
    u'\xa3'

As you can see, this does not work either, and neither does any combination of the two.

I managed to figure out that '&pound;' is an HTML character entity name. What '\xa3' is, I could not find out.

Does anyone know how to convert HTML content to a readable format in Python?

4 answers

&pound; is the HTML character entity for POUND SIGN, which is the Unicode character U+00A3. You can see this if you print it:

    >>> print u'\xa3'
    £

When you call unescape(), you convert the character entity to its native Unicode character, which is what u'\xa3' means: the single Unicode character U+00A3.
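To make that concrete, here is a small sketch that inspects the value from the question. The helper name `describe` is made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
import unicodedata


def describe(ch):
    """Return (length, code point as U+XXXX, Unicode name) for a one-character string."""
    return len(ch), 'U+%04X' % ord(ch), unicodedata.name(ch)


s = u'\xa3'  # the value unescape() returned in the question
print(describe(s))  # (1, 'U+00A3', 'POUND SIGN')
```

So the "strange" u'\xa3' is already the decoded result: one character, not an escape sequence left in the data.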

If you want to encode this in a particular format (for example, UTF-8), use the string's encode method:

    >>> u'\xa3'.encode('utf-8')
    '\xc2\xa3'

You get a two-byte string representing the single character POUND SIGN.

I suspect you are a bit unclear on how string encodings work. You need to convert the string from bytes to unicode (see this answer for one way to do this with urllib2), then unescape the HTML, then (possibly) encode the unicode into whatever output encoding you want.
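The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `readable_title`, its parameters, and the sample bytes are hypothetical, the page is assumed to be UTF-8, and the Python 3 fallback import is added only so the sketch also runs outside Python 2.7.

```python
try:
    # Python 2.7, as used in the question
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape
except ImportError:
    # Python 3 fallback (assumption, not part of the original question)
    from html import unescape


def readable_title(raw_bytes, source_encoding='utf-8', output_encoding='utf-8'):
    """Decode bytes to unicode, unescape HTML entities, then re-encode."""
    text = raw_bytes.decode(source_encoding)  # step 1: bytes -> unicode
    text = unescape(text)                     # step 2: '&pound;' -> u'\xa3'
    return text.encode(output_encoding)       # step 3: unicode -> bytes


raw = b'Only &pound;5 &amp; free delivery'  # hypothetical page fragment
print(repr(readable_title(raw)))
```

If you only need to print the title or process it further as text, you can stop after step 2 and keep working with the unicode value.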


Why is this not working?

    In [1]: s = u'\xa3'

    In [2]: s
    Out[2]: u'\xa3'

    In [3]: print s
    £

When it comes to unescaping HTML entities, I have always used: http://effbot.org/zone/re-sub.htm#unescape-html .
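For readers who cannot reach that page, here is a rough sketch of a regex-based unescape in the same spirit. It is not the code from that link, the name `unescape_entities` is invented here, and the Python 3 fallbacks are an assumption added so it runs on either version.

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2.7
except ImportError:
    from html.entities import name2codepoint   # Python 3 fallback

try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns unicode text

# Matches &#xA3; (hex), &#163; (decimal) and &pound; (named) references.
_ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def unescape_entities(text):
    """Replace HTML character references in a unicode string."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ('#x', '#X'):
                return unichr(int(body[2:], 16))
            if body.startswith('#'):
                return unichr(int(body[1:]))
            return unichr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            return match.group(0)  # leave unknown or invalid references as-is
    return _ENTITY_RE.sub(replace, text)


print(unescape_entities(u'&pound;5 &amp; &#163;5') == u'\xa35 & \xa35')  # True
```

Unknown names such as &bogus; are left untouched rather than raising, which is usually what you want when scraping pages you do not control.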


The video title strings use HTML entities to encode special characters such as ampersands and pound signs.

\xa3 is the Python Unicode escape for the pound sign (£). In your example, Python displays the __repr__() of the Unicode string, so you see the escape sequence. If you print the string, you will see that it represents the pound sign:

    >>> import HTMLParser
    >>> h = HTMLParser.HTMLParser()
    >>> h.unescape('&pound;')
    u'\xa3'
    >>> print h.unescape('&pound;')
    £

lxml, BeautifulSoup, or PyQuery do the job pretty well. Or a combination of them ;)


Source: https://habr.com/ru/post/1492357/

