Decoding HTML Entities Using Python

Question

The following Python code uses BeautifulStoneSoup to fetch information from the LibraryThing API about Tolkien's "The Children of Húrin".

    import urllib2
    from BeautifulSoup import BeautifulStoneSoup

    URL = ("http://www.librarything.com/services/rest/1.0/"
           "?method=librarything.ck.getwork&id=1907912"
           "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

    soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                              convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
    title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
    print title_field.find('fact').string

Unfortunately, instead of "Húrin" it prints "HÃºrin". This is obviously an encoding problem, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

python encoding unicode utf-8 beautifulsoup

Daniel Watkins Mar 09 '09 at 10:47

2 answers

The web page may be lying about its encoding. The output looks like UTF-8 bytes that were read as Latin-1. If you end up with a byte string (str), you need to decode it as UTF-8. If you have a unicode object instead, you need to encode it as Latin-1 first and then decode the result as UTF-8.
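To make the two cases concrete, here is a small sketch of how the garbled text arises and how each case is undone. It is written for Python 3, where bytes and text are separate types (`bytes` and `str`), whereas the question's code is Python 2 (`str` and `unicode`). The variable names are only illustrative.

```python
# Python 3 sketch (assumption: the question's code is Python 2, where the
# same roles are played by `str` for bytes and `unicode` for text).
correct = "H\u00farin"                 # "Húrin"

# How the damage happens: UTF-8 bytes are read as if they were Latin-1.
raw = correct.encode("utf-8")          # b'H\xc3\xbarin'
garbled = raw.decode("latin-1")        # 'HÃºrin'

# Case 1: you hold the raw bytes, so decode them as UTF-8.
from_bytes = raw.decode("utf-8")

# Case 2: you hold already-garbled text, so go back to bytes via Latin-1,
# then decode those bytes as UTF-8.
from_text = garbled.encode("latin-1").decode("utf-8")

print(garbled, from_bytes, from_text)
```

Latin-1 maps every byte value to exactly one code point, which is why the round trip in case 2 recovers the original bytes without loss.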

Ignacio Vazquez-Abrams Mar 09 '09 at 10:53

sth · Accepted Answer · 2009-03-09T23:05:28+0000

In the source of the web page, the title already looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side, before it is even converted to XML ...

If this is a general problem that affects all the books and you need to work around it, this seems to work:

    unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
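If only some titles are affected, applying that round trip blindly would raise an error on text that is already correct. A slightly more defensive version, sketched here in Python 3 and not part of the original answer, falls back to the input when the round trip fails. The name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up, leaving other text unchanged.

    Hypothetical helper: it assumes the only damage to repair is UTF-8
    bytes that were decoded as Latin-1.
    """
    try:
        # Recover the original bytes, then decode them properly.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8: treat the text as already correct.
        return text


print(fix_mojibake("The Children of H\u00c3\u00barin"))  # garbled input
print(fix_mojibake("The Children of H\u00farin"))        # already correct
```

This is a heuristic: a correct string that happens to look like mis-decoded UTF-8 would also be rewritten, so it is safest to apply it only to fields known to come from the broken source.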
