Decoding HTML Entities Using Python

The following Python code uses BeautifulStoneSoup to fetch information from the LibraryThing API for Tolkien's "The Children of Húrin".

    import urllib2
    from BeautifulSoup import BeautifulStoneSoup

    URL = ("http://www.librarything.com/services/rest/1.0/"
           "?method=librarything.ck.getwork&id=1907912"
           "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

    soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                              convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
    title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
    print title_field.find('fact').string

Unfortunately, instead of “Húrin” it prints “HÃºrin”. This is obviously an encoding problem, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

2 answers

In the source of the web page, it looks like this: The Children of H&Atilde;&ordm;rin. So the encoding is already broken somewhere on their side, before it is even converted to XML...

If this is a common problem with all books, and you need to get around it, this works:

 unicode(title_field.find('fact').string).encode("latin1").decode("utf-8") 
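To see why that one-liner works, here is a minimal sketch of the round trip. It is written in Python 3 syntax (the question uses Python 2), and it assumes the page source spells the title with the entities `&Atilde;` and `&ordm;`, which is what the two UTF-8 bytes of "ú" (0xC3 0xBA) look like when each byte is treated as a Latin-1 character. The variable names are only illustrative.

```python
import html

# Assumed form of the title in the page source: each UTF-8 byte of "ú"
# has been written out as a separate Latin-1 character entity.
raw = "The Children of H&Atilde;&ordm;rin"

# Decoding the entities gives the mojibake characters U+00C3 and U+00BA.
mojibake = html.unescape(raw)

# Latin-1 maps code points 0-255 straight back to single bytes,
# which recovers the original byte sequence 0xC3 0xBA.
as_bytes = mojibake.encode("latin-1")

# Those bytes are valid UTF-8 for the intended text.
repaired = as_bytes.decode("utf-8")

assert repaired == "The Children of H\u00farin"
```

The entity decoding itself is not at fault here: it faithfully returns the two characters the server sent. The repair only works because Latin-1 is a lossless byte-for-character mapping.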

The web page may be lying about its encoding. The result looks like UTF-8. If you end up with a str, you will need to decode it as UTF-8. If you have a unicode instead, you will need to encode it as Latin-1 first and then decode it as UTF-8.
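The two cases above could be folded into one helper, roughly like this. This is a sketch in Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`; `repair_title` is a hypothetical name, not part of BeautifulSoup or any library.

```python
def repair_title(value):
    """Return text, undoing UTF-8 that was misread as Latin-1.

    Hypothetical helper: bytes are decoded as UTF-8 directly; text is
    turned back into bytes via Latin-1 first, then decoded as UTF-8.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8")
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind, so leave the text untouched.
        return value


# Mojibake text, written with escapes: U+00C3 U+00BA instead of U+00FA.
fixed = repair_title("The Children of H\u00c3\u00barin")
```

Text that is already correct usually fails the UTF-8 decode step and comes back unchanged, but a string that legitimately contains a sequence such as "Ã©" would be altered, so it is safer to apply this only to fields known to be affected.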


Source: https://habr.com/ru/post/920093/

